Goodhart’s curse is a neologism by Eliezer Yudkowsky stating that “neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V.” It is related to many nearby concepts (e.g. the tails come apart, winner’s curse, optimizer’s curse, regression to the mean, overfitting, edge instantiation, Goodhart’s law). I claim that there are three main mechanisms through which Goodhart’s curse operates.
Goodhart’s Curse Level 1 (regressing to the mean): We are trying to optimize the value of \(V\), but since we cannot observe \(V\), we instead optimize a proxy \(U\), which is an unbiased estimate of \(V\). When we select for points with a high \(U\) value, we will be biased towards points for which \(U\) is an overestimate of \(V\).
As a simple example, imagine \(V\) and \(E\) (for error) are independently normally distributed with mean 0 and variance 1, and \(U=V+E\). If we sample many points and take the one with the largest \(U\) value, we can predict that \(E\) will likely be positive for this point, and thus the \(U\) value will predictably be an overestimate of the \(V\) value.
In many cases (like the one above), the best you can do without observing \(V\) is still to take the largest \(U\) value you can find, but you should still expect that this \(U\) value overestimates \(V\).
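The example above is easy to check numerically. Here is a minimal Python sketch; the sample sizes (1000 points per selection, 200 repetitions), the fixed seed, and the helper name `select_by_proxy` are my own choices for illustration and are not part of the setup above.

```python
import random
import statistics

def select_by_proxy(num_points, rng):
    """Sample (V, E) pairs, set U = V + E, and return the (U, V, E) with the largest U."""
    best = None
    for _ in range(num_points):
        v = rng.gauss(0.0, 1.0)  # true value, which we pretend we cannot observe
        e = rng.gauss(0.0, 1.0)  # independent error
        u = v + e                # proxy
        if best is None or u > best[0]:
            best = (u, v, e)
    return best

rng = random.Random(0)  # fixed seed so the run is repeatable
winners = [select_by_proxy(1000, rng) for _ in range(200)]

mean_u = statistics.mean(w[0] for w in winners)
mean_v = statistics.mean(w[1] for w in winners)
mean_e = statistics.mean(w[2] for w in winners)

print(f"mean U of selected points: {mean_u:.2f}")
print(f"mean V of selected points: {mean_v:.2f}")
print(f"mean E of selected points: {mean_e:.2f}")
```

Under these assumptions the selected points should show a clearly positive average error, so the average \(U\) sits well above the average \(V\), even though selecting on \(U\) still finds points with above-average \(V\).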
Similarly, if \(U\) is not necessarily an unbiased estimator of \(V\), but \(U\) and \(V\) are correlated, and you sample a million points and take the one with the highest \(U\) value, you will end up with a \(V\) value that is on average strictly less than if you could just take a point with a one-in-a-million \(V\) value directly.
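A rough sketch of this comparison follows. The particular biased proxy (\(U = 2V + 3 + \text{noise}\)), the sample size of 2000 rather than a million, and the function name `one_trial` are all assumptions I made to keep the example small and fast.

```python
import random
import statistics

def one_trial(num_points, rng):
    """Return (V of the point with the highest U, highest V in the sample)."""
    points = []
    for _ in range(num_points):
        v = rng.gauss(0.0, 1.0)
        # Hypothetical proxy: biased (scaled and shifted) but still correlated with V.
        u = 2.0 * v + 3.0 + rng.gauss(0.0, 1.0)
        points.append((u, v))
    v_at_best_u = max(points, key=lambda p: p[0])[1]
    best_v = max(v for _, v in points)
    return v_at_best_u, best_v

rng = random.Random(1)  # fixed seed so the run is repeatable
trials = [one_trial(2000, rng) for _ in range(100)]

mean_selected = statistics.mean(t[0] for t in trials)
mean_best = statistics.mean(t[1] for t in trials)

print(f"mean V when selecting on U:        {mean_selected:.2f}")
print(f"mean V when selecting on V itself: {mean_best:.2f}")
```

In every single trial the point chosen by \(U\) can at best tie the point chosen by \(V\), so the first average can never exceed the second.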
Goodhart’s Curse Level 2 (optimizing away the correlation): Here, we assume \(U\) and \(V\) are correlated on average, but there may be different regions in which this correlation is stronger or weaker. When we optimize \(U\) to be very high, we zoom in on the region of very large \(U\) values. This region could in principle have very small \(V\) values.
As a very simple example, imagine \(U\) is a uniformly distributed integer between 0 and 1000 inclusive, and \(V\) is equal to \(U\) mod 1000. Overall, \(U\) and \(V\) are correlated. The point where \(U\) is 1000 and \(V\) is 0 is an outlier, but it is only one point and does not sway the correlation that much. However, when we apply a lot of optimization pressure, we throw away all the points with low \(U\) values, and are left with a small number of extreme points. Since this is a small number of points, the correlation between \(U\) and \(V\) over the whole distribution says little about what value \(V\) will take.
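Since this example is finite, it can be checked exactly by enumerating all 1001 points. In the sketch below, the `pearson` helper and the choice to look at the top five points are my own additions for illustration.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

us = list(range(1001))       # every possible U, each equally likely
vs = [u % 1000 for u in us]  # V = U mod 1000

overall_corr = pearson(us, vs)

# Maximal optimization pressure: keep only the point with the largest U.
best_u = max(us)
v_at_best_u = vs[us.index(best_u)]

# Heavy but not maximal pressure: keep only the five largest U values.
top_corr = pearson(us[-5:], vs[-5:])

print(f"correlation over all points:     {overall_corr:.3f}")
print(f"V at the point with the best U:  {v_at_best_u}")
print(f"correlation over the top 5 by U: {top_corr:.3f}")
```

The overall correlation is very high, yet the single point picked by maximizing \(U\) has the lowest possible \(V\), and within the top five points the correlation is actually negative.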
Another, more realistic example is that \(U\) and \(V\) are two correlated dimensions in a multivariate normal distribution, but we cut off the normal distribution to only include the disk of points in which \(U^2+V^2<n\) for some large \(n\). This example represents a correlation between \(U\) and \(V\) in naturally occurring points, but also a boundary around what types of points are feasible, and this boundary need not respect the correlation.
Imagine you were to sample \(k\) points in the above example and take the one with the largest \(U\) value. As you increase \(k\), at first, this optimization pressure lets you find better and better points for both \(U\) and \(V\), but as you increase \(k\) to infinity, eventually you sample so many points that you will find a point near \(U=\sqrt{n}, V=0\). When enough optimization pressure was applied, the correlation between \(U\) and \(V\) stopped mattering, and instead the boundary of what kinds of points were possible at all decided what kind of point was selected.
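Here is a rough simulation of this rise and fall. To reach the boundary without sampling astronomically many points, I assume a small disk (\(n=4\)) and a correlation of 0.8. These numbers, the trial counts, and the function names are illustrative assumptions only, not part of the example as stated.

```python
import math
import random
import statistics

RHO = 0.8        # assumed correlation between U and V
RADIUS_SQ = 4.0  # the "n" in U^2 + V^2 < n, kept small so the boundary is reachable

def sample_point(rng):
    """Rejection-sample a correlated standard normal pair restricted to the disk."""
    while True:
        u = rng.gauss(0.0, 1.0)
        v = RHO * u + math.sqrt(1.0 - RHO ** 2) * rng.gauss(0.0, 1.0)
        if u * u + v * v < RADIUS_SQ:
            return u, v

def mean_v_of_winner(k, trials, rng):
    """Average V of the max-U point among k samples, over several trials."""
    winner_vs = []
    for _ in range(trials):
        best = max((sample_point(rng) for _ in range(k)), key=lambda p: p[0])
        winner_vs.append(best[1])
    return statistics.mean(winner_vs)

rng = random.Random(2)  # fixed seed so the run is repeatable
# More trials for small k (cheap and noisy), fewer for large k (expensive).
trial_counts = {1: 2000, 10: 500, 20000: 20}
results = {k: mean_v_of_winner(k, t, rng) for k, t in trial_counts.items()}

for k, mean_v in results.items():
    print(f"k = {k:>6}: mean V of the selected point = {mean_v:.2f}")
```

With these settings, modest optimization (\(k=10\)) should select a noticeably higher \(V\) than no optimization (\(k=1\)), while heavy optimization (\(k=20000\)) pushes the selected point toward \(U=\sqrt{n}, V=0\) and the average \(V\) falls again.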
Goodhart’s Curse Level 3 (adversarial correlations): Here we are selecting a world with a high \(U\) value because we want a world with a high \(V\) value, and we believe \(U\) to be a good proxy for \(V\). However, there is another agent who wants to optimize some other value \(W\). Assume that \(W\) and \(V\) are contradictory. Points with high \(W\) value necessarily have low \(V\) value, since they demand similar resources.
Since you are using \(U\) as a proxy, this other agent is incentivized to make \(U\) and \(W\) correlated as much as it can. It wants to cause your process which selects a large \(U\) value to also select a large \(W\) value (and thus a small \(V\) value).
Making \(U\) and \(W\) correlated may be difficult, but thanks to Level 2 of Goodhart’s Curse, the adversary need only make them correlated at the extreme values of \(U\).
For example, suppose you run a company and have a programmer employee whom you want to create a working product (\(V\)). You incentivize the employee by selecting for or rewarding employees that produce a large number of lines of code (\(U\)). The employee wants you to pay him to slack off all day (\(W\)). \(W\) and \(V\) are contradictory. The employee is incentivized to make worlds with high \(U\) also have high \(W\), and thus have low \(V\). Thus, the employee may adversarially write a script to generate a bunch of random lines of code that do nothing, giving himself more time to slack off.
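The point that the adversary only needs to control the extreme values of \(U\) can be sketched with a toy simulation. Everything specific here is an assumption of mine: the 1000 honest points with \(U = V + \text{noise}\), the choice \(W = -V\) for honest points, and the three adversarial points placed just above the honest maximum of \(U\).

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(3)  # fixed seed so the run is repeatable

# Honest points: U tracks V closely; W is assumed to be directly opposed to V.
points = []
for _ in range(1000):
    v = rng.gauss(0.0, 1.0)
    u = v + rng.gauss(0.0, 0.5)
    points.append({"u": u, "v": v, "w": -v, "adversarial": False})

# Adversarial points: only a handful, placed just beyond the honest maximum of U,
# with low V and high W (think: many generated lines of code that do nothing).
honest_max_u = max(p["u"] for p in points)
for i in range(3):
    points.append({"u": honest_max_u + 0.1 * (i + 1), "v": -2.0, "w": 2.0,
                   "adversarial": True})

overall_corr = pearson([p["u"] for p in points], [p["v"] for p in points])
selected = max(points, key=lambda p: p["u"])

print(f"overall correlation of U and V: {overall_corr:.2f}")
print(f"selected point: U={selected['u']:.2f}, V={selected['v']:.2f}, "
      f"W={selected['w']:.2f}, adversarial={selected['adversarial']}")
```

In this sketch the three planted points barely dent the overall correlation between \(U\) and \(V\), but they fully determine what gets selected once we take the maximum of \(U\).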
Level 3 is the mechanism most responsible for the original Goodhart’s Law (although Level 2 contributes as well).
Level 3 is also the mechanism behind a superintelligent AI making a Treacherous Turn. Here, \(V\) is doing what the humans want forever, \(U\) is doing what the humans want in the training cases where the AI does not have enough power to take over, and \(W\) is whatever the AI wants to do with the universe.
Finally, Level 3 is also behind the malignancy of the universal prior, where you want to predict well forever (\(V\)), so hypotheses might predict well for a while (\(U\)), so that they can manipulate the world with their future predictions (\(W\)).
