The Three Levels of Goodhart's Curse post by Scott Garrabrant 185 days ago | Vadim Kosoy, Abram Demski and Paul Christiano like this | 2 comments

Note: I now consider this post deprecated and instead recommend this updated version.

Goodhart's curse is a neologism by Eliezer Yudkowsky stating that "neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V." It is related to many nearby concepts (e.g. the tails come apart, the winner's curse, the optimizer's curse, regression to the mean, overfitting, edge instantiation, Goodhart's law). I claim that there are three main mechanisms through which Goodhart's curse operates.

Goodhart's Curse Level 1 (regressing to the mean): We are trying to optimize the value of $$V$$, but since we cannot observe $$V$$, we instead optimize a proxy $$U$$, which is an unbiased estimate of $$V$$. When we select for points with a high $$U$$ value, we will be biased towards points for which $$U$$ is an overestimate of $$V$$.

As a simple example, imagine $$V$$ and $$E$$ (for error) are independently normally distributed with mean 0 and variance 1, and $$U=V+E$$. If we sample many points and take the one with the largest $$U$$ value, we can predict that $$E$$ will likely be positive for this point, and thus that the $$U$$ value will predictably be an overestimate of the $$V$$ value.

In many cases (like the one above), the best you can do without observing $$V$$ is still to take the largest $$U$$ value you can find, but you should still expect that this $$U$$ value overestimates $$V$$. Similarly, if $$U$$ is not necessarily an unbiased estimator of $$V$$, but $$U$$ and $$V$$ are correlated, and you sample a million points and take the one with the highest $$U$$ value, you will end up with a $$V$$ value on average strictly less than if you could just take a point with a one-in-a-million $$V$$ value directly.
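The Level 1 example above is easy to check by simulation. The sketch below (assuming NumPy; the sample sizes and seed are arbitrary choices, not from the post) draws $$V$$ and $$E$$ as independent standard normals, selects the point with the largest $$U = V + E$$ in each trial, and confirms that the selected error is positive on average, so the winning $$U$$ systematically overestimates $$V$$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_points = 2000, 1000  # arbitrary simulation sizes

# V (true value) and E (error) are i.i.d. standard normal;
# U = V + E is an unbiased but noisy proxy for V.
V = rng.standard_normal((n_trials, n_points))
E = rng.standard_normal((n_trials, n_points))
U = V + E

# In each trial, select the point with the largest U value.
best = np.argmax(U, axis=1)
rows = np.arange(n_trials)
selected_U = U[rows, best]
selected_V = V[rows, best]
selected_E = E[rows, best]

# The selected error is positive on average: conditioning on a large
# U value biases us toward points where U overestimates V.
print("mean selected E:", selected_E.mean())
print("mean overestimate U - V:", (selected_U - selected_V).mean())
```

Since $$U - V = E$$ by construction, the two printed quantities are identical; the simulation just makes the regression-to-the-mean effect concrete.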
Goodhart's Curse Level 2 (optimizing away the correlation): Here, we assume $$U$$ and $$V$$ are correlated on average, but there may be different regions in which this correlation is stronger or weaker. When we optimize $$U$$ to be very high, we zoom in on the region of very large $$U$$ values. This region could in principle have very small $$V$$ values.

As a very simple example, imagine $$U$$ is integer uniform between 0 and 1000 inclusive, and $$V$$ is equal to $$U$$ mod 1000. Overall, $$U$$ and $$V$$ are correlated. The point where $$U$$ is 1000 and $$V$$ is 0 is an outlier, but it is only one point and does not sway the correlation that much. However, when we apply a lot of optimization pressure, we throw away all the points with low $$U$$ values and are left with a small number of extreme points. Since this is a small number of points, the correlation between $$U$$ and $$V$$ says little about what value $$V$$ will take. Another, more realistic example is that $$U$$ and $$V$$ are two correlated dimensions in a multivariate normal distribution, but we cut off the normal distribution to only include the disk of points in which $$U^2+V^2$$

by Sören Mindermann 112 days ago | link

(x-posted from Arbital ==> Goodhart's curse)

On "Conditions for Goodhart's curse": It seems like with AI alignment, the curse happens mostly when $$V$$ is defined in terms of some high-level features of the state, which are normally not easily maximized. I.e., $$V$$ is something like a neural network $$V: s \mapsto V(s)$$, where $$s$$ is the state. Now suppose $$U'$$ is a neural network which outputs the AI's estimate of these features. The AI can then manipulate the state/input to maximize these features. That's just the standard problem of adversarial examples. So it seems like the conditions we're looking for are generally met in the common setting where adversarial examples do work to maximize some loss function. One requirement there is that the input space is high-dimensional.
So why doesn't the 2D Gaussian example go wrong? [This is about the example from Arbital ==> Goodhart's Curse where there is no bound $$\sqrt{n}$$ on $$V$$ and $$U$$.] There are no high-level features to optimize by exploiting the flexibility of the input space.

On the other hand, you don't need a flexible input space to fall prey to the winner's curse. Instead of using the high flexibility of the input space, you use the 'high flexibility' of the noise if you have many data points. The noise will take any possible value with enough data, causing the winner's curse. If you care about a feature that is bounded under the real-world distribution but the noise is unbounded, you will find that the most promising-looking data points are actually maximizing the noise.

There's a noise-free (i.e. no measurement errors) variant of the winner's curse which suggests another connection to adversarial examples. If you simply have $$n$$ data points and pick the one that maximizes some outcome measure, you can conceptualize this as evolutionary optimization in the input space. Usually, adversarial examples are generated by following the gradient in the input space. Instead, the winner's curse uses evolutionary optimization. reply
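The bounded-feature, unbounded-noise point in the comment above can be illustrated with a small simulation (a sketch assuming NumPy; the distributions, sample sizes, and seed are illustrative choices, not from the comment). The true feature $$V$$ is bounded on $$[0,1]$$ while the noise $$E$$ is Gaussian, so as the number of candidates grows, the point with the best-looking $$U = V + E$$ is increasingly a noise maximizer rather than a genuinely good point:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_winner(n_points):
    """Pick the candidate with the highest observed score U = V + E."""
    V = rng.uniform(0.0, 1.0, n_points)   # bounded true feature
    E = rng.standard_normal(n_points)     # unbounded measurement noise
    U = V + E
    i = np.argmax(U)
    return V[i], E[i]

# As n grows, the winning point's noise term grows without bound,
# while its true value stays stuck inside [0, 1].
for n in (10, 1_000, 100_000):
    V_best, E_best = select_winner(n)
    print(f"n={n:>7}  V={V_best:.2f}  E={E_best:.2f}")
```

With enough candidates, almost all of the winner's apparent score comes from the noise term, which is exactly the winner's-curse failure mode the comment describes.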
by Sören Mindermann 112 days ago | link

(also x-posted from https://arbital.com/p/goodharts_curse/#subpage-8s5)

Another, speculative point: If $$V$$ and $$U$$ were my utility function and my friend's, my intuition is that an agent that optimizes the wrong one of the two would still act fairly robustly. If true, this may support the theory that Goodhart's curse for AI alignment is to a large extent a problem of defending against adversarial examples by learning robust features similar to human ones. Namely, the robustness may come from the fact that my friend and I have learned similar robust, high-level features; we just give them different importance. reply
