Bias in rationality is much worse than noise
discussion post by Stuart Armstrong 110 days ago

A putative new idea for AI control; index here.

I’ve found quite a few people dubious of my “radical skepticism” post on human preferences. Most “radical skepticism” arguments - Descartes’s Demon, various no-free-lunch theorems, Boltzmann Brains - generally turn out to be irrelevant, in practice, one way or another.

But human preferences are in a different category. For a start, it’s clear that getting them correct is important for AI alignment - we can’t just ignore errors. But most importantly, simplicity priors/Kolmogorov complexity/Occam’s razor don’t help with learning human preferences, as illustrated most compactly by the substitution of (-p,-R) for (p,R).

But this still feels like a bit of a trick. Maybe we can just assume rationality plus a bit of noise, or rationality most of the time, and get something a lot more reasonable.

# Structured noise has to be explained

And, indeed, if humans were rational plus a bit of noise, things would be simple. A noisy signal has high Kolmogorov complexity, but there are ways to treat noise as being of low complexity.

The problem with that approach is that explaining noise is completely different from explaining the highly structured noise that we know as bias.

Take the anchoring bias, for example. In one of the classical experiments, American students were asked if they would buy a product for the price that was the last two digits of their social security number, and were then asked what price they would buy the product for. The irrelevant last two digits had a strong distorting effect on their willingness to pay.

# Modelling anchoring behaviour

Let’s model a simplified human who, if not asked about their social security number, values a cordless keyboard at $25, plus noise η drawn from a normal distribution of mean $0 and standard deviation $5. If they are first asked about their social security number s, their valuation shifts to 3/4($25) + 1/4(s) + η.
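As a minimal sketch of this simplified human, assuming Python's standard `random` module: the function names `valuation` and `expected_valuation` are hypothetical, chosen for illustration, and are not from the original post.

```python
import random

BASE_VALUE = 25.0  # unanchored valuation in dollars (from the text)
NOISE_SD = 5.0     # standard deviation of the noise eta (from the text)


def expected_valuation(s=None):
    """Mean stated price; s is the two-digit number heard first, or None."""
    if s is None:
        return BASE_VALUE
    return 0.75 * BASE_VALUE + 0.25 * s


def valuation(s=None, rng=random):
    """One noisy stated price: the mean above plus Gaussian noise eta."""
    eta = rng.gauss(0.0, NOISE_SD)
    return expected_valuation(s) + eta


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same draws
    print(expected_valuation())    # no anchor: 25.0
    print(expected_valuation(50))  # anchor of 50: 31.25
    print(valuation(50, rng))      # one noisy sample around 31.25
```

Averaging many calls to `valuation(s)` should approach `expected_valuation(s)`, since the noise has mean zero.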

To explain human values, we have two theories about their rewards:

• R(0)=$25.
• R(1)=3/4($25) + 1/4(s).

We have four theories about their planning rationality:

• p(0): the human is rational.
• p(1): p(0) + noise η.
• p(2): the human is rational, but has an anchoring bias of 1/4 of the value of the number they hear.
• p(3): p(2) + noise η.

If we count noise as being of low complexity, then R(0), p(0), and p(1) are low complexity rewards/planners. R(1), p(2), and p(3) are higher complexity.

If we run the experiment multiple times, it’s easy to realise that p(0) is incompatible with observations, as is p(2) - there genuinely is noise in the human response. All the other pairs are possible (after all, noise could explain any behaviour on the part of the human). Since (p(1), R(0)) is the lowest complexity pair, can we not just conclude that the human has genuine preferences with some noise?

Unfortunately, no.

(p(1), R(0)) is the lowest complexity pair, yes, but the agent’s behaviour becomes more and more improbable under that assumption. As we sample more and more, (p(3), R(0)) and (p(1), R(1)) become more probable, as they fit the data much better; the deviation between the unbiased mean of $25 and the sample mean (which converges to 1/4($50)+3/4($25)=$31.25) becomes more and more inexplicable as the number of datapoints rises.
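One way to see this numerically is to compare how well the two kinds of model fit simulated data. The sketch below is illustrative only and rests on assumptions not in the original post: s is drawn uniformly from 0 to 99, fit is measured by Gaussian log-likelihood, and the helper names (`simulate`, `log_likelihood_gap`) are hypothetical.

```python
import math
import random

BASE_VALUE = 25.0  # unanchored valuation in dollars (from the text)
NOISE_SD = 5.0     # standard deviation of the noise eta (from the text)


def gaussian_log_pdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def simulate(n, seed=0):
    """Draw n (s, stated price) pairs from the anchored human of the text.

    Assumption: s is uniform on 0..99.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = rng.randint(0, 99)
        x = 0.75 * BASE_VALUE + 0.25 * s + rng.gauss(0.0, NOISE_SD)
        data.append((s, x))
    return data


def log_likelihood_gap(data):
    """Log-likelihood of the anchored models minus that of (p(1), R(0))."""
    gap = 0.0
    for s, x in data:
        anchored_mean = 0.75 * BASE_VALUE + 0.25 * s
        gap += gaussian_log_pdf(x, anchored_mean, NOISE_SD)
        gap -= gaussian_log_pdf(x, BASE_VALUE, NOISE_SD)
    return gap


if __name__ == "__main__":
    data = simulate(1000)
    for n in (10, 100, 1000):
        print(n, round(log_likelihood_gap(data[:n]), 1))
```

Under these assumptions the gap grows roughly linearly with the number of datapoints, whereas any complexity penalty on the more elaborate pair is a fixed cost, so the data eventually overwhelm the simplicity advantage of (p(1), R(0)).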

Now, (p(3), R(0)) is the “true” pair - an anchoring-biased planner and a simpler reward. Yay! But (p(1), R(1)) fits the data just as well: a simple planner and a strange reward. They are of basically equal complexity, so we have no way of distinguishing them.
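The equivalence of the two pairs can be checked directly. In this sketch, a planner is modelled as a function that turns a reward into a stated price; that encoding, and all the function names, are assumptions made for illustration rather than the post's own formalism.

```python
import math

BASE_VALUE = 25.0  # unanchored valuation in dollars (from the text)


def reward_r0(s):
    """R(0): the keyboard is worth $25 regardless of s."""
    return BASE_VALUE


def reward_r1(s):
    """R(1): the reward itself depends on the number s."""
    return 0.75 * BASE_VALUE + 0.25 * s


def planner_p1(reward, s, eta):
    """p(1): report the reward value plus noise eta."""
    return reward(s) + eta


def planner_p3(reward, s, eta):
    """p(3): shift the reward value a quarter of the way towards s, plus noise."""
    return 0.75 * reward(s) + 0.25 * s + eta


def same_behaviour(noise_values=(-5.0, 0.0, 5.0)):
    """Check that (p(3), R(0)) and (p(1), R(1)) give the same stated price."""
    return all(
        math.isclose(planner_p3(reward_r0, s, eta), planner_p1(reward_r1, s, eta))
        for s in range(100)
        for eta in noise_values
    )


if __name__ == "__main__":
    print(same_behaviour())  # True
```

Because the two pairs produce the same price for every s and every noise draw, no amount of behavioural data of this kind can favour one over the other.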

The problem is that bias is a highly structured deviation from rationality, and can only be explained by complex structured noise. The models with simple noise get ruled out because they don’t fit the data, and models with structured noise are no simpler than models with complex preferences.

(You might argue for imposing a stronger complexity prior on the planner p than on the reward R, but in that case you have to somehow solve situations where people have genuinely complex preferences, such as in cooking, music, and similar domains.)

There is no easy way to get around the radical skepticism on the values of non-rational agents - but there are ways, which I will post soon.
