A putative new idea for AI control; index here.
In a previous post, I presented a model of human rationality and reward as a pair (p, R), with p our (ir)rational planning algorithm (called a planner), and R our reward, with p(R) giving our actions/policy. I also showed that human behaviour is insufficient to establish R (or p) in any meaningful way whatsoever.
Yet humans seem to make judgements about each other’s rationality and true rewards all the time. And not only do these judgements often agree with each other (almost everyone agrees that the anchoring bias is a bias - or an out-of-context heuristic, i.e. a bias - and almost nobody argues that it’s actually a human value), but they often seem to have predictive ability. What’s going on?
Adding normative assumptions
To tackle this, we need to add normative assumptions. Normative assumptions are simply assumptions that distinguish between two pairs (p, R) and (p’, R’) that lead to the same policy: p(R)=p’(R’).
Since those pairs predict the same policy, they cannot be distinguished via observations. So a normative assumption is an extra piece of information that cannot itself be deduced from observation, and that distinguishes between planner-reward pairs.
Because normative assumptions cannot be deduced from observations, they are part of the definition of the human reward function. They are not abstract means of converging on this true reward; they are part of the procedure that defines this reward.
What are they?
There are two ways of seeing such normative assumptions. The first is as an extra update rule: upon seeing observation o, the probabilities of (p, R) and (p’, R’) would normally go to α and α’, but, with the normative assumption, there is an extra update to the relative probabilities of the two.
Equivalently, a normative assumption can be seen as an adjustment to the priors of these pairs. The two approaches are equivalent, but sometimes an extra update is computationally tractable whereas the equivalent prior would be intractable (and vice-versa).
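The two views can be sketched in code. The following is a minimal illustration (all names and numbers are my own, not from the post): two planner-reward pairs predict the same policy, so observation cannot separate them, and a normative assumption enters as an extra multiplicative reweighting. Because the reweighting and the likelihood both act multiplicatively, applying the weight to the prior instead gives the same posterior, which is the claimed equivalence.

```python
# Sketch: a normative assumption as an extra reweighting step in a
# Bayesian update over planner-reward pairs. All names are illustrative.

def bayes_update(prior, likelihood):
    """Standard Bayesian update over a dict of hypotheses."""
    post = {h: prior[h] * likelihood(h) for h in prior}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

def normative_reweight(dist, weight):
    """Non-observational update: multiply each pair's probability by a
    weight encoding the normative assumption, then renormalise."""
    post = {h: dist[h] * weight(h) for h in dist}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

# Two pairs that predict the same policy: identical likelihoods,
# so observation alone can never separate them.
prior = {("p", "R"): 0.5, ("-p", "R'"): 0.5}
likelihood = lambda h: 1.0
posterior = bayes_update(prior, likelihood)
assert posterior[("p", "R")] == posterior[("-p", "R'")]

# The normative assumption breaks the tie (here, favouring p by 2:1).
weight = lambda h: 2.0 if h[0] == "p" else 1.0
reweighted = normative_reweight(posterior, weight)

# Equivalent view: fold the same weight into the prior instead.
adjusted_prior = normative_reweight(prior, weight)
via_prior = bayes_update(adjusted_prior, likelihood)
```

Since multiplication commutes, `reweighted` and `via_prior` agree exactly; in practice one or the other form may be easier to compute, as noted above.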
So, if normative assumptions are definitions of rationality/reward, what are good candidates to pick? How do we, as humans, define our own rationality and reward?
Well, one strong way seems to be through our feeling of regret (human regret, not regret in the machine learning sense). We often feel and express regret when something turns out the wrong way from what we wanted. If we take “feelings and expressions of regret encode true reward information” as a normative assumption, then this restricts the number of (p, R) pairs that are compatible with such an assumption.
In a very simple example, someone has a 50% chance of receiving either a sword (s) or an iPhone (i). After that, they will either say h=“I’m happy with my gift”, or ~h=“I’m unhappy with my gift”.
The human’s reward is R(α, β, γ, δ) = αR(s-i) + βR(h-~h) + γR(h-~h|s) + δR(h-~h|i). The α, β, γ, and δ terms are constants (that can be negative). The R(s-i) expresses the extra reward the human gets from receiving the sword rather than the iPhone (if negative, it encodes the opposite reward preference). The R(h-~h) expresses the extra reward the human gets from saying h as opposed to ~h; here h and ~h are seen as pure “speech acts”, which don’t mean anything intrinsically.
The R(h-~h|s) expresses the same preference, conditional on the human having received the sword, and R(h-~h|i) expresses it conditional on the human having received the iPhone.
I’ll restrict attention to two planners: p, which is fully rational, and -p, which is fully anti-rational.
Assume the human will say h following s, and ~h following i (that’s the human “policy”). Then we are restricted to the following pairs:
- (p, R(α, β, γ, δ) | γ≥-β≥δ) or (-p, R(α, β, γ, δ) | γ≤-β≤δ).
The conditions on the constants ensure the human follows the observed policy. If we add the normative assumption about expressing regret, we can now say that the human values the sword more than the iPhone, and further restrict to:
- (p, R(α, β, γ, δ) | γ≥-β≥δ, α≥0) or (-p, R(α, β, γ, δ) | γ≤-β≤δ, α≥0).
Interestingly, we haven’t ruled out the human being anti-rational! The anti-rational human is one that desires the sword, but also wants to lie about it, and just spectacularly messes up their attempt at lying.
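The worked example above can be checked mechanically. The sketch below (the coarse parameter grid and tie-breaking toward h are my own assumptions) enumerates (planner, reward) pairs, keeps those that reproduce the observed policy, and then applies the regret assumption α ≥ 0; both planners survive, as claimed.

```python
# Sketch of the sword/iPhone example: enumerate (planner, reward) pairs,
# keep those consistent with the observed policy, then apply the regret
# assumption. Parameterisation follows the post; the grid is illustrative.
import itertools

def policy(planner, a, b, g, d):
    """What the human says after each gift, under a given planner.
    Saying h after s is worth b+g extra reward; after i it is worth b+d.
    The rational planner p maximises reward, the anti-rational planner
    -p minimises it; ties are broken toward h."""
    sign = 1 if planner == "p" else -1
    after_s = "h" if sign * (b + g) >= 0 else "~h"
    after_i = "h" if sign * (b + d) >= 0 else "~h"
    return (after_s, after_i)

observed = ("h", "~h")  # says h after the sword, ~h after the iPhone
grid = [-2, -1, 0, 1, 2]
compatible = [
    (pl, (a, b, g, d))
    for pl in ("p", "-p")
    for a, b, g, d in itertools.product(grid, repeat=4)
    if policy(pl, a, b, g, d) == observed
]

# Normative assumption: expressed regret after the iPhone means the
# human really prefers the sword, i.e. alpha >= 0.
after_regret = [(pl, r) for pl, r in compatible if r[0] >= 0]

# Both planners survive: the anti-rational human who prefers the sword
# but spectacularly fails at lying about it remains compatible.
assert any(pl == "p" for pl, _ in after_regret)
assert any(pl == "-p" for pl, _ in after_regret)
```

The surviving -p pairs are exactly those with γ ≤ -β ≤ δ and α ≥ 0, matching the second bullet above.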
Planning and multiple attempts
We could allow the human to take actions that increase or decrease the probability of getting the sword. Then, if they tried to get the sword and then claimed they wanted it, we would conclude they were rational, using p. If they tried to avoid the sword and then claimed they wanted it, we could conclude they were anti-rational, using -p.
Of course, the human now makes two decisions, and we need to add alternative p’s, which are differently rational/anti-rational with respect to “getting the sword” and “expressing preferences”.
But those p’s are more complex, and we can now finally start to make use of the simplicity prior - especially if we repeat variants of the experiment multiple times. Then if the human is taken to be always correct in their expression of regret (our assumption), and if their actions are consistently in line with their stated values (observation), we can conclude they are most likely rational. Victory?
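One way to see how the simplicity prior does this work: treat each planner as a pair of attitudes (one for actions, one for speech), give mixed planners a higher description length, and take the regret assumption as fixing the true preference. The bit counts and noise level below are my own illustrative assumptions, not from the post.

```python
# Illustrative sketch of the simplicity-prior argument. With the regret
# assumption fixing the true preference (sword over iPhone), we ask which
# planner best explains n trials in which the human both pursued the
# sword and said they wanted it.

# Hypothetical description lengths (in bits): uniform planners are
# simpler than ones that flip attitude between acting and speaking.
complexity_bits = {
    ("rational", "rational"): 1,
    ("anti", "anti"): 1,
    ("rational", "anti"): 3,
    ("anti", "rational"): 3,
}

def likelihood(planner, n_trials, eps=0.05):
    """Probability of the observed, consistent behaviour. A rational
    component predicts the observed choice; an anti-rational component
    only produces it by chance (probability eps per trial)."""
    act_mode, speak_mode = planner
    p_act = 1.0 if act_mode == "rational" else eps
    p_speak = 1.0 if speak_mode == "rational" else eps
    return (p_act * p_speak) ** n_trials

def posterior(n_trials):
    """Posterior over planners: simplicity prior 2^-bits times likelihood."""
    unnorm = {pl: 2.0 ** -bits * likelihood(pl, n_trials)
              for pl, bits in complexity_bits.items()}
    z = sum(unnorm.values())
    return {pl: u / z for pl, u in unnorm.items()}

post = posterior(n_trials=10)
best = max(post, key=post.get)
```

After ten consistent trials the posterior concentrates almost entirely on the fully rational planner, which is the tentative "victory" described above.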
Is the regret normative assumption correct?
Normative assumptions are not intrinsically correct or incorrect, as they cannot be deduced from observation. The question is instead whether they give a good definition of human reward, according to our judgement.
And the answer is… kinda?
First of all, it depends on context. Is the person speaking privately? Are they speaking to a beloved but formal relative who has just given them the present? Are they actually talking about that gift rather than another?
It’s clear that we cannot say for sure, just because the human expresses regret following an iPhone gift, that this is a correct description of their reward function. It depends, in a complicated way, on the context.
But “depending on the context” destroys the work of the simplicity prior in the previous section. Unless we have a complicated description of when the regret assumption applies, we fall back into the difficulty of distinguishing complex noise and biases from actual values.
It’s not completely hopeless - the assumption that stated regret is often correct does do some work, combined with a simplicity prior. But to get the most out of it, we need a much more advanced understanding of regret.
We also need to balance this with other normative assumptions, lest we simply create a regret-minimiser.
One way of figuring out how regret works is to ask people meta-questions - is the regret that you/they expressed in this situation likely to be genuine? How about that situation? If we make the normative assumption that the answers to these meta-questions are correct, we will get a more complicated understanding of regret, which will start reducing the space of (p, R) that we need to consider.
Another approach might be to look at the brain and body correlates of genuine regret. I elided the distinction between these approaches by referring to “feelings and expressions of regret”, but I’ll return to this distinction in a later post.