A putative new idea for AI control; index here.
I’ve found quite a few people dubious of my “radical skepticism” post on human preferences. Most “radical skepticism” arguments - Descartes’s Demon, various no-free-lunch theorems, Boltzmann Brains - generally turn out to be irrelevant, in practice, one way or another.
But human preferences are in a different category. For a start, it’s clear that getting them correct is important for AI alignment - we can’t just ignore errors. But most importantly, simplicity priors/Kolmogorov complexity/Occam’s razor don’t help with learning human preferences, as illustrated most compactly by the substitution of (-p,-R) for (p,R).
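A minimal sketch of that substitution, assuming p is a planner that picks the reward-maximising action and -p is one that picks the reward-minimising action. The action set and reward values are made up purely for illustration:

```python
ACTIONS = ["buy", "wait", "walk away"]  # hypothetical action set


def reward(action):
    # R: a hypothetical reward over the actions above
    return {"buy": 1.0, "wait": 0.2, "walk away": -0.5}[action]


def neg_reward(action):
    # -R: the same reward with its sign flipped
    return -reward(action)


def planner(r):
    # p: fully rational, picks the action with the highest reward
    return max(ACTIONS, key=r)


def neg_planner(r):
    # -p: fully anti-rational, picks the action with the lowest reward
    return min(ACTIONS, key=r)


# Both pairs choose the same action, so behaviour alone cannot separate them.
print(planner(reward), neg_planner(neg_reward))  # buy buy
```

The two pairs differ only by a pair of sign flips, so neither is meaningfully more complex than the other.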
But this still feels like a bit of a trick. Maybe we can just assume rationality plus a bit of noise, or rationality most of the time, and get something a lot more reasonable.
Structured noise has to be explained
And, indeed, if humans were rational plus a bit of noise, things would be simple. A noisy signal has high Kolmogorov complexity, but there are ways to treat noise as being of low complexity.
The problem with that approach is that explaining noise is completely different from explaining the highly structured noise that we know as bias.
Take the anchoring bias, for example. In one of the classical experiments, American students were asked if they would buy a product for the price that was the last two digits of their social security number, and were then asked what price they would buy the product for. The irrelevant last two digits had a strong distorting effect on their willingness to pay.
Modelling anchoring behaviour
Let’s model a simplified human, if not asked about their social security number, as valuing a cordless keyboard at $25, plus noise η drawn from a normal distribution of mean $0 and standard deviation $5.
If they are first asked about the last two digits s of their social security number, their valuation shifts to 3/4($25) + 1/4(s) + η.
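As a rough sketch, this simplified human can be simulated in a few lines. The function name valuation, the constants and the seed are illustrative choices, not part of the model itself:

```python
import random

BASE_VALUE = 25.0     # unanchored valuation of the keyboard, in dollars
NOISE_SD = 5.0        # standard deviation of the noise term, in dollars
ANCHOR_WEIGHT = 0.25  # weight the human places on the anchor s


def valuation(rng, s=None, noise_sd=NOISE_SD):
    """Sample one stated valuation; s is the anchor, or None if none was asked for."""
    noise = rng.gauss(0.0, noise_sd)
    if s is None:
        return BASE_VALUE + noise
    return (1 - ANCHOR_WEIGHT) * BASE_VALUE + ANCHOR_WEIGHT * s + noise


rng = random.Random(0)  # fixed seed, only so that reruns give the same samples
print(valuation(rng))        # no anchor: scattered around 25
print(valuation(rng, s=90))  # high anchor: scattered around 41.25
```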
To explain human values, we have two theories about their rewards:
- R(0)=$25.
- R(1)=3/4($25) + 1/4(s).
We have four theories about their planning rationality:
- p(0): the human is rational.
- p(1): p(0) + noise η.
- p(2): the human is rational, but has an anchoring bias of 1/4 of the value of the number they hear.
- p(3): p(2) + noise η.
If we count noise as being of low complexity, then R(0), p(0), and p(1) are low complexity rewards/planners. R(1), p(2), and p(3) are higher complexity.
If we run the experiment multiple times, it’s easy to realise that p(0) is incompatible with observations, as is p(2) - there genuinely is noise in the human response. All the other pairs are possible (after all, noise could explain any behaviour on the part of the human). Since (p(1), R(0)) is the lowest complexity pair, can we not just conclude that the human has genuine preferences with some noise and bias?
(p(1), R(0)) is the lowest complexity pair, yes, but the agent’s behaviour becomes more and more improbable under that assumption. As we sample more and more, (p(3), R(0)) and (p(1), R(1)) become more probable, as they fit the data much better. The deviation between the unbiased mean of $25 and the sample mean (which converges to 1/4($50) + 3/4($25) = $31.25, taking the average anchor s to be about $50) becomes more and more inexplicable as the number of datapoints rises.
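One way to see this is to compare log-likelihoods on simulated data. The sketch below assumes that anchors are uniform on 00 to 99 and that both models share the same known noise level of $5. The helper names are hypothetical:

```python
import math
import random

SD = 5.0  # noise standard deviation, in dollars


def log_normal_pdf(x, mean, sd=SD):
    # Log density of a normal distribution with the given mean and sd
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def mean_unbiased(s):
    # (p(1), R(0)): rational planner plus noise, reward fixed at $25
    return 25.0


def mean_anchored(s):
    # (p(3), R(0)) and (p(1), R(1)) both predict this mean
    return 0.75 * 25.0 + 0.25 * s


def log_likelihood(data, mean_fn):
    # Sum of log densities of the observed valuations under one model
    return sum(log_normal_pdf(x, mean_fn(s)) for s, x in data)


def simulate(n, rng):
    # Generate (anchor, valuation) pairs from the anchored model
    data = []
    for _ in range(n):
        s = rng.randrange(100)  # assumption: anchors uniform on 00..99
        data.append((s, mean_anchored(s) + rng.gauss(0.0, SD)))
    return data


rng = random.Random(0)
for n in (10, 100, 1000):
    data = simulate(n, rng)
    gap = log_likelihood(data, mean_anchored) - log_likelihood(data, mean_unbiased)
    print(n, round(gap, 1))  # the gap in favour of the anchored models
```

In expectation the gap grows linearly with the number of samples, so any fixed complexity advantage of (p(1), R(0)) is eventually overwhelmed.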
Now, (p(3), R(0)) is the “true” pair - an anchoring-biased planner and a simpler reward. Yay! But (p(1), R(1)) fits the data just as well: a simple planner and a strange reward. They are of basically equal complexity, so we have no way of distinguishing them.
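The equivalence can be checked directly. In this sketch the planners are written as functions of a reward and the anchor s, with the noise term left out since it is the same in both pairs. The exact functional form of the anchored planner is my reading of the model above, and the function names are illustrative:

```python
def reward_simple(s):
    # R(0): the keyboard is worth $25, whatever number was heard
    return 25.0


def reward_strange(s):
    # R(1): the reward itself depends on the anchor s
    return 0.75 * 25.0 + 0.25 * s


def planner_rational(reward, s):
    # p(1) without its noise term: report the reward's value as it is
    return reward(s)


def planner_anchored(reward, s):
    # p(3) without its noise term: shift the report a quarter of the way towards s
    return 0.75 * reward(s) + 0.25 * s


anchors = range(100)  # every possible two-digit anchor
same = all(
    planner_anchored(reward_simple, s) == planner_rational(reward_strange, s)
    for s in anchors
)
print(same)  # True
```

Since the predicted mean is identical for every anchor, no amount of data from this experiment can favour one pair over the other.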
The problem is that bias is a highly structured deviation from rationality, and can only be explained by complex structured noise. The models with simple noise get ruled out because they don’t fit the data, and models with structured noise are no simpler than models with complex preferences.
(You might argue for imposing a stronger complexity prior on the planner p than the reward R, but in that case you have to somehow solve situations where people have genuinely complex preferences, such as cooking, music, and similar).
There is no easy way to get around the radical skepticism on the values of non-rational agents - but there are ways, which I will post soon.