Rationality and overriding human preferences: a combined model
post by Stuart Armstrong 637 days ago

A putative new idea for AI control; index here.

Previously, I presented a model in which a “rationality module” (now renamed rationality planning algorithm, or planner) kept track of two things: how well a human was maximising their actual reward, and whether their preferences had been overridden by AI action.

The second didn’t integrate well into the first, and was tracked by a clunky extra Boolean. Since the two didn’t fit together, I was going to separate the two concepts, especially since the Boolean felt a bit too… Boolean, not allowing for grading. But then I realised that they actually fit together completely naturally, without the need for arbitrary Booleans or other tricks.

# Feast or heroin famine

Consider the situation detailed in the following figure. An AI has the opportunity to surreptitiously inject someone with heroin ($$I$$) or not do so ($$\neg I$$). If it doesn’t, the human will choose to enjoy a massive feast ($$F$$); if it does, the human will instead choose more heroin ($$H$$).

So the human policy is given by $$\pi(I)=H, \pi(\neg I)=F$$. The human rationality and reward are given by a pair $$(p,R)$$, where $$R$$ is the human reward and $$p$$ measures their rationality - how closely their actions conform with their reward.

The planner $$p$$ can be seen as a map from rewards to policies (or, since policies are maps from histories to actions, $$p$$ can be seen as mapping histories and rewards to actions). The pair $$(p,R)$$ is said to be compatible if $$p(R)=\pi$$, the human policy.

There are three natural $$R$$s to consider here. First, $$R_p$$, a generic pleasure reward. Next, $$R_e$$, the ‘enjoyment’ reward, where enjoyment is pleasure endorsed as ‘genuine’ by common judgement. Assume that $$R_p(H)=1$$, $$R_p(F)=1/3$$, $$R_e(F)=1/2$$, and $$R_e(H)=0$$. Finally, there is the twisted reward $$R_t$$, which is $$R_p$$ conditional on $$I$$ and $$R_e$$ otherwise.

There are two natural $$p$$s: $$p_r$$, the fully rational planner, and $$p_f$$, the planner that is fully rational conditional on $$\neg I$$, but always maps to $$H$$ if $$I$$ is chosen: $$p_f(R)(I)=H$$, for all $$R$$.

The pair $$(p_r, R_e)$$ is not compatible with $$\pi$$: it predicts that the human would take action $$F$$ following $$I$$ (feast following injection). The reward $$R_p$$ is compatible with neither planner: it predicts $$H$$ following $$\neg I$$ (heroin following no injection).

The other three pairs are compatible: $$(p_r, R_t)$$, $$(p_f, R_t)$$, and $$(p_f, R_e)$$ all give the correct policy $$\pi$$.
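The compatibility claims above can be checked mechanically. What follows is a minimal sketch, not part of the original model: it assumes rewards can be written as functions of the AI action and the human action, that policies are dictionaries from AI action to human action, and that the names `PLANNERS`, `REWARDS` and `compatible` are purely illustrative.

```python
from fractions import Fraction

# AI actions (the history the human sees) and human actions.
I, NOT_I = "I", "not_I"
H, F = "H", "F"
AI_ACTIONS = (I, NOT_I)
HUMAN_ACTIONS = (H, F)

# Rewards as functions of (AI action, human action), using the values in the text.
def R_p(ai, h):
    return {H: Fraction(1), F: Fraction(1, 3)}[h]

def R_e(ai, h):
    return {H: Fraction(0), F: Fraction(1, 2)}[h]

def R_t(ai, h):
    # Twisted reward: R_p after an injection, R_e otherwise.
    return R_p(ai, h) if ai == I else R_e(ai, h)

def p_r(R):
    # Fully rational planner: pick the reward-maximising action after each AI action.
    return {ai: max(HUMAN_ACTIONS, key=lambda h: R(ai, h)) for ai in AI_ACTIONS}

def p_f(R):
    # Rational after not_I, but forced to H after I, whatever R is.
    policy = p_r(R)
    policy[I] = H
    return policy

# Observed human policy pi.
PI = {I: H, NOT_I: F}

PLANNERS = {"p_r": p_r, "p_f": p_f}
REWARDS = {"R_p": R_p, "R_e": R_e, "R_t": R_t}

# A pair (p, R) is compatible when p(R) reproduces the observed policy.
compatible = sorted(
    (p_name, r_name)
    for p_name, p in PLANNERS.items()
    for r_name, R in REWARDS.items()
    if p(R) == PI
)
print(compatible)
```

Under these assumptions the script prints the three pairs built from $$R_t$$ with either planner and from $$R_e$$ with $$p_f$$, and rejects the other three.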

# Overriding rewards and regret

This leads to a definition of when the AI is overriding human rewards. Given a pair $$(p,R)$$, with $$p(R)=\pi$$, an AI’s action $$A$$ overrides the human reward if $$\pi|A$$ is poorly optimised for maximising $$R$$. If $$V^\pi(R|A)$$ is the expected reward (according to $$R$$) of the actual human policy, and $$V^*(R|A)$$ is the expected reward (according to $$R$$) of the human following the ideal policy for maximising $$R$$, then a measure of how much the AI is overriding rewards is the regret:

$$V^*(R|A)-V^\pi(R|A)$$.

One might object that this isn’t the AI overriding the reward, but reducing human rationality. But these two facets are related: $$\pi|A$$ is poorly fitted for maximising $$R$$, but there’s certainly another reward $$R'$$ which $$\pi|A$$ is better suited to maximise. So the AI is forcing the human into maximising a different reward.

There’s also the issue that humans are not very rational to start with, so we have large regret even for AIs that don’t do anything; but this makes sense. An AI that established our reward $$R$$ and didn’t intervene as we flailed and failed to maximise it wouldn’t be a success in its role.

(An alternative, but related, measure of whether people’s reward is being overridden is whether, conditional on $$A$$, $$p(R)$$ is ‘sensitive’ to the reward $$R$$. A merely incompetent human would have $$p(R)$$ changing a lot depending on $$R$$ - though never maximising it very well - while one whose reward is overridden would show the same behaviour whatever $$R$$ they were supposed to maximise.)
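One crude way to make this sensitivity idea concrete in the feast/heroin example is to count how many distinct behaviours a planner produces, conditional on an AI action, as the reward varies. The sketch below is an illustration only: the function `distinct_behaviours` and the choice of the three rewards as the candidate set are assumptions, not a definition from the text. It uses the view of a planner as a map from histories and rewards to actions.

```python
from fractions import Fraction

I, NOT_I = "I", "not_I"
H, F = "H", "F"
HUMAN_ACTIONS = (H, F)

# Reward values taken from the example in the text.
def R_p(ai, h):
    return {H: Fraction(1), F: Fraction(1, 3)}[h]

def R_e(ai, h):
    return {H: Fraction(0), F: Fraction(1, 2)}[h]

def R_t(ai, h):
    return R_p(ai, h) if ai == I else R_e(ai, h)

REWARDS = (R_p, R_e, R_t)

def p_r(R, ai):
    # Fully rational: best human action for R after AI action ai.
    return max(HUMAN_ACTIONS, key=lambda h: R(ai, h))

def p_f(R, ai):
    # Forced to H after injection; rational otherwise.
    return H if ai == I else p_r(R, ai)

def distinct_behaviours(planner, ai):
    # Hypothetical sensitivity proxy: the set of actions the planner
    # outputs after `ai` as R ranges over the candidate rewards.
    return {planner(R, ai) for R in REWARDS}

for name, planner in (("p_r", p_r), ("p_f", p_f)):
    for ai in (I, NOT_I):
        print(name, ai, sorted(distinct_behaviours(planner, ai)))
```

With this proxy, $$p_f$$ conditional on $$I$$ yields a single behaviour whatever the reward, which is the signature of an overridden reward, while $$p_r$$ (and $$p_f$$ conditional on $$\neg I$$) still responds to the reward.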

Back to the example above. The $$(p_r, R_t)$$ pair means that the human is rationally maximising the twisted reward $$R_t$$. The $$(p_f, R_t)$$ pair is one where the injection forces the human into a very specific behaviour - specific behaviour that coincidentally is exactly the right thing for their reward. Finally, $$(p_f, R_e)$$ claims that the injection forces the human into specific behaviour that is detrimental to their reward. In the first two cases, the AI’s recommended action is $$I$$ (expected reward $$1$$ versus $$1/2$$ for $$\neg I$$); in the third it’s $$\neg I$$ (expected reward $$1/2$$ versus $$0$$ for $$I$$).
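The regret and the AI's preferred action in this example can be computed directly. This is a minimal sketch under simplifying assumptions: the environment is deterministic, so expected reward is just the reward of the action taken, and the helper names `value_pi`, `value_star`, `regret` and `recommended_action` are illustrative rather than taken from the original.

```python
from fractions import Fraction

I, NOT_I = "I", "not_I"
H, F = "H", "F"
AI_ACTIONS = (I, NOT_I)
HUMAN_ACTIONS = (H, F)

# Reward values taken from the example in the text.
def R_p(ai, h):
    return {H: Fraction(1), F: Fraction(1, 3)}[h]

def R_e(ai, h):
    return {H: Fraction(0), F: Fraction(1, 2)}[h]

def R_t(ai, h):
    return R_p(ai, h) if ai == I else R_e(ai, h)

# Observed human policy pi.
PI = {I: H, NOT_I: F}

def value_pi(R, ai):
    # V^pi(R|A): reward of what the human actually does after ai.
    return R(ai, PI[ai])

def value_star(R, ai):
    # V^*(R|A): reward of the best available human action after ai.
    return max(R(ai, h) for h in HUMAN_ACTIONS)

def regret(R, ai):
    return value_star(R, ai) - value_pi(R, ai)

def recommended_action(R):
    # The AI action under which the actual human policy earns the most R.
    return max(AI_ACTIONS, key=lambda ai: value_pi(R, ai))

for name, R in (("R_t", R_t), ("R_e", R_e)):
    print(name, {ai: regret(R, ai) for ai in AI_ACTIONS}, recommended_action(R))
```

Since $$\pi$$ is fixed, the regret depends only on the reward half of the pair: it is $$0$$ for $$R_t$$ under both AI actions, and $$1/2$$ for $$R_e$$ conditional on $$I$$. The recommended action comes out as $$I$$ for $$R_t$$ and $$\neg I$$ for $$R_e$$.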

(Of course, it’s also possible to model humans as opioid-maximisers, whose rationality is overridden by not getting heroin injections; as already stated, rewards and rationality cannot be deduced from observations alone.)

Hence the concept of overriding human preferences appears naturally and continuously within the formalism of rationality planners.
