Bias in rationality is much worse than noise
discussion post by Stuart Armstrong 24 days ago

A putative new idea for AI control; index here.

I’ve found quite a few people dubious of my “radical skepticism” post on human preferences. Most “radical skepticism” arguments - Descartes’s Demon, various no-free-lunch theorems, Boltzmann Brains - generally turn out to be irrelevant, in practice, one way or another.

But human preferences are in a different category. For a start, it’s clear that getting them correct is important for AI alignment - we can’t just ignore errors. But most importantly, simplicity priors/Kolmogorov complexity/Occam’s razor don’t help with learning human preferences, as illustrated most compactly by the substitution of (-p,-R) for (p,R).

But this still feels like a bit of a trick. Maybe we can just assume rationality plus a bit of noise, or rationality most of the time, and get something a lot more reasonable.

# Structured noise has to be explained

And, indeed, if humans were rational plus a bit of noise, things would be simple. A noisy signal has high Kolmogorov complexity, but there are ways to treat noise as being of low complexity.

The problem with that approach is that explaining noise is completely different from explaining the highly structured noise that we know as bias.

Take the anchoring bias, for example. In one of the classical experiments, American students were asked if they would buy a product for the price that was the last two digits of their social security number, and were then asked what price they would buy the product for. The irrelevant last two digits had a strong distorting effect on their willingness to pay.

# Modelling anchoring behaviour

Let’s model a simplified human who, if not asked about their social security number, values a cordless keyboard at $25, plus noise η drawn from a normal distribution with mean $0 and standard deviation $5. If they are first asked about their social security number s, their valuation shifts to 3/4($25) + 1/4(s) + η.
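As a quick sanity check, this simplified human can be simulated directly. The sketch below uses the numbers from the post ($25 base value, 1/4 anchoring weight, η ~ N(0, 5)); the function name and structure are my own illustration, not anything from the original.

```python
import random

def valuation(s=None, rng=random):
    """Simplified human's stated valuation of the keyboard.

    s: last two digits of the social security number, if asked first;
       None if not asked.  Noise eta ~ Normal(mean=0, sd=5).
    Numbers ($25 base, 1/4 anchoring weight) are from the post;
    the function itself is an illustrative sketch.
    """
    eta = rng.gauss(0, 5)
    if s is None:
        return 25 + eta                      # unanchored valuation
    return 0.75 * 25 + 0.25 * s + eta        # anchored on s
```

Averaging over many unanchored humans recovers $25, while averaging over anchored humans (with s uniform over 00-99) converges to roughly 3/4($25) + 1/4($49.5) ≈ $31.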

To explain human values, we have two theories about their rewards:

• R(0) = $25.
• R(1) = 3/4($25) + 1/4(s).

We have four theories about their planning rationality:

• p(0): the human is rational.
• p(1): p(0) + noise η.
• p(2): the human is rational, but has an anchoring bias of 1/4 of the value of the number they hear.
• p(3): p(2) + noise η.

If we count noise as being of low complexity, then R(0), p(0), and p(1) are low complexity rewards/planners. R(1), p(2), and p(3) are higher complexity.

If we run the experiment multiple times, it’s easy to realise that p(0) is incompatible with observations, as is p(2) - there genuinely is noise in the human response. All the other pairs are possible (after all, noise could explain any behaviour on the part of the human). Since (p(1), R(0)) is the lowest complexity pair, can we not just conclude that the human has genuine preferences with some noise and bias?

Unfortunately, no.

(p(1), R(0)) is the lowest complexity pair, yes, but the agent’s behaviour becomes more and more improbable under that assumption. As we sample more and more, (p(3), R(0)) and (p(1), R(1)) become more probable, as they fit the data much better; the deviation between the unbiased mean of $25 and the sample mean (which converges to 1/4($50) + 3/4($25) = $31.25) becomes more and more inexplicable as the number of datapoints rises.

Now, (p(3), R(0)) is the “true” pair - an anchoring-biased planner and a simpler reward. Yay! But (p(1), R(1)) fits the data just as well: a simple planner and a strange reward. They are of basically equal complexity, so we have no way of distinguishing them.
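This can be made concrete by comparing Gaussian log-likelihoods of simulated data under the competing pairs. The code below is an illustrative sketch (my own function names and setup): it generates data from the “true” anchored model, then scores it under (p(1), R(0)) (predicted mean $25) and under the anchored mean 3/4($25) + 1/4(s). Note that (p(3), R(0)) and (p(1), R(1)) predict the exact same mean, so they receive identical likelihoods - the data cannot separate them.

```python
import math
import random

def loglik(data, mean_fn, sigma=5.0):
    """Gaussian log-likelihood of observed valuations under a model.

    data: list of (s, valuation) pairs; mean_fn maps s to the
    model's predicted mean valuation.  sigma is the noise sd ($5).
    """
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (v - mean_fn(s)) ** 2 / (2 * sigma ** 2)
        for s, v in data
    )

# Generate data from the "true" anchored-and-noisy human.
rng = random.Random(0)
data = [
    (s, 0.75 * 25 + 0.25 * s + rng.gauss(0, 5))
    for s in (rng.randrange(100) for _ in range(1000))
]

ll_simple = loglik(data, lambda s: 25)                 # (p(1), R(0))
ll_biased = loglik(data, lambda s: 0.75 * 25 + 0.25 * s)
# ll_biased is the score of BOTH (p(3), R(0)) and (p(1), R(1)):
# they predict the same behaviour, so no amount of data distinguishes them.
```

As the number of datapoints grows, `ll_biased` pulls further and further ahead of `ll_simple`, which is exactly the sense in which (p(1), R(0)) becomes untenable.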

The problem is that bias is a highly structured deviation from rationality, and can only be explained by complex structured noise. The models with simple noise get ruled out because they don’t fit the data, and models with structured noise are no simpler than models with complex preferences.

(You might argue for imposing a stronger complexity prior on the planner p than on the reward R, but in that case you have to somehow handle situations where people have genuinely complex preferences, in domains such as cooking, music, and the like).

There is no easy way to get around the radical skepticism about the values of non-rational agents - but there are ways, which I will post soon.
