Rationality and overriding human preferences: a combined model
post by Stuart Armstrong 121 days ago

A putative new idea for AI control; index here.

Previously, I presented a model in which a “rationality module” (now renamed rationality planning algorithm, or planner) kept track of two things: how well a human was maximising their actual reward, and whether their preferences had been overridden by AI action.

The second didn’t integrate well into the first, and was tracked by a clunky extra Boolean. Since the two didn’t fit together, I was going to separate the two concepts, especially since the Boolean felt a bit too… Boolean, not allowing for grading. But then I realised that they actually fit together completely naturally, without the need for arbitrary Booleans or other tricks.

# Feast or heroin famine

Consider the situation detailed in the following figure. An AI has the opportunity to surreptitiously inject someone with heroin ($$I$$) or not do so ($$\neg I$$). If it doesn’t, the human will choose to enjoy a massive feast ($$F$$); if it does, the human will instead choose more heroin ($$H$$).

So the human policy is given by $$\pi(I)=H, \pi(\neg I)=F$$. The human rationality and reward are given by a pair $$(p,R)$$, where $$R$$ is the human reward and $$p$$ measures their rationality - how closely their actions conform with their reward.

The planner $$p$$ can be seen as a map from rewards to policies (or, since policies are maps from histories to actions, $$p$$ can be seen as mapping histories and rewards to actions). The pair $$(p,R)$$ is said to be compatible if $$p(R)=\pi$$, the human policy.

There are three natural $$R$$s to consider here. First, $$R_p$$, a generic pleasure reward. Next, $$R_e$$, the ‘enjoyment’ reward, where enjoyment is pleasure endorsed as ‘genuine’ by common judgement. Assume that $$R_p(H)=1$$, $$R_p(F)=1/3$$, $$R_e(F)=1/2$$, and $$R_e(H)=0$$. Finally, there is the twisted reward $$R_t$$, which is $$R_p$$ conditional on $$I$$ and $$R_e$$ otherwise.

There are two natural $$p$$s: $$p_r$$, the fully rational planner, and $$p_f$$, the planner that is fully rational conditional on $$\neg I$$, but always maps to $$H$$ if $$I$$ is chosen: $$p_f(R)(I)=H$$, for all $$R$$.

The pair $$(p_r, R_e)$$ is not compatible with $$\pi$$: it predicts that the human would take action $$F$$ following $$I$$ (feast following injection). The reward $$R_p$$ is compatible with neither $$p$$: it predicts $$H$$ following $$\neg I$$ (heroin following no injection).

The other three pairs are compatible: $$(p_r, R_t)$$, $$(p_f, R_t)$$, and $$(p_f, R_e)$$ all give the correct policy $$\pi$$.
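The compatibility claims above can be checked mechanically. The following is a minimal sketch, assuming that after either AI action the human picks exactly one of $$H$$ or $$F$$ and that both options are available in both branches; the function and variable names are illustrative and not part of the original model.

```python
from fractions import Fraction

AI_ACTIONS = ("I", "not_I")      # inject / do not inject
HUMAN_ACTIONS = ("H", "F")       # heroin / feast

# Observed human policy pi: heroin after injection, feast otherwise.
PI = {"I": "H", "not_I": "F"}

# Reward values as given in the text.
R_p = {"H": Fraction(1), "F": Fraction(1, 3)}
R_e = {"H": Fraction(0), "F": Fraction(1, 2)}

def reward_p(ai, h):
    return R_p[h]

def reward_e(ai, h):
    return R_e[h]

def reward_t(ai, h):
    # Twisted reward: R_p conditional on I, R_e otherwise.
    return R_p[h] if ai == "I" else R_e[h]

def planner_r(reward):
    """Fully rational planner: best human action after each AI action."""
    return {ai: max(HUMAN_ACTIONS, key=lambda h: reward(ai, h))
            for ai in AI_ACTIONS}

def planner_f(reward):
    """Rational after not_I, but forced to H after I, whatever the reward."""
    policy = planner_r(reward)
    policy["I"] = "H"
    return policy

def compatible(planner, reward):
    """A pair (p, R) is compatible if p(R) equals the observed policy."""
    return planner(reward) == PI

for p_name, p in (("p_r", planner_r), ("p_f", planner_f)):
    for r_name, r in (("R_p", reward_p), ("R_e", reward_e), ("R_t", reward_t)):
        print(p_name, r_name, compatible(p, r))
```

Only the three pairs listed above come out as compatible; both pairs involving $$R_p$$ fail because they predict $$H$$ after $$\neg I$$.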

# Overriding rewards and regret

This leads to a definition of when the AI is overriding human rewards. Given a pair $$(p,R)$$, with $$p(R)=\pi$$, an AI’s action $$A$$ overrides the human reward if $$\pi|A$$ is poorly optimised for maximising $$R$$. If $$V^\pi(R|A)$$ is the expected reward (according to $$R$$) of the actual human policy, and $$V^*(R|A)$$ is the expected reward (according to $$R$$) of the human following the ideal policy for maximising $$R$$, then a measure of how much the AI is overriding rewards is the regret:

$$V^*(R|A)-V^\pi(R|A)$$.
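As a rough illustration, the regret can be computed for the heroin/feast example. This sketch assumes a deterministic, one-step setting, so the expected reward of a policy is just the reward of the single action it picks; the helper names are hypothetical.

```python
from fractions import Fraction

# Observed human policy pi from the example.
PI = {"I": "H", "not_I": "F"}

R_p = {"H": Fraction(1), "F": Fraction(1, 3)}
R_e = {"H": Fraction(0), "F": Fraction(1, 2)}

# Each reward maps an AI action to a table of human-action rewards.
REWARDS = {
    "R_p": lambda ai: R_p,
    "R_e": lambda ai: R_e,
    "R_t": lambda ai: R_p if ai == "I" else R_e,  # twisted reward
}

def regret(reward_name, ai_action):
    """V*(R|A) - V^pi(R|A) in the one-step deterministic example."""
    table = REWARDS[reward_name](ai_action)
    v_star = max(table.values())      # ideal policy for R after A
    v_pi = table[PI[ai_action]]       # what the human actually does
    return v_star - v_pi

for name in REWARDS:
    for ai in ("I", "not_I"):
        print(name, ai, regret(name, ai))
```

Under these assumptions, $$R_e$$ has regret $$1/2$$ conditional on $$I$$ and zero conditional on $$\neg I$$, while $$R_t$$ has zero regret under both AI actions.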

One might object that this isn’t the AI overriding the reward, but reducing human rationality. But these two facets are related: $$\pi|A$$ is poorly fitted for maximising $$R$$, but there’s certainly another reward $$R'$$ which $$\pi|A$$ is better suited to maximise. So the AI is forcing the human into maximising a different reward.

There’s also the issue that humans are poorly rational to start off with, so we have large regret for AIs that don’t do anything; but this makes sense. An AI that established our reward $$R$$ and didn’t intervene as we flailed and failed to maximise it, wouldn’t be a success in its role.

(An alternate, but related, measure of whether people’s reward is being overridden is whether, conditional on $$A$$, $$p(R)$$ is ‘sensitive’ to the reward $$R$$. A merely incompetent human would have $$p(R)$$ changing a lot depending on $$R$$ - though never maximising it very well - while one whose reward is overridden would show the same behaviour whatever $$R$$ they were supposed to maximise.)
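One crude way to make ‘sensitivity’ concrete in the heroin/feast example is to count how many distinct human actions a planner produces, conditional on an AI action, as the reward ranges over the three candidates. This count is an assumption made for illustration only; the text does not fix a particular sensitivity measure.

```python
from fractions import Fraction

HUMAN_ACTIONS = ("H", "F")

R_p = {"H": Fraction(1), "F": Fraction(1, 3)}
R_e = {"H": Fraction(0), "F": Fraction(1, 2)}

# Each reward maps an AI action to a table of human-action rewards.
REWARDS = {
    "R_p": lambda ai: R_p,
    "R_e": lambda ai: R_e,
    "R_t": lambda ai: R_p if ai == "I" else R_e,
}

def planner_r(reward, ai):
    """Fully rational: pick the best human action after the AI action."""
    return max(HUMAN_ACTIONS, key=lambda h: reward(ai)[h])

def planner_f(reward, ai):
    """Forced to H after I; rational after not_I."""
    return "H" if ai == "I" else planner_r(reward, ai)

def distinct_behaviours(planner, ai):
    """Set of human actions the planner yields across candidate rewards."""
    return {planner(r, ai) for r in REWARDS.values()}

print(distinct_behaviours(planner_f, "I"))      # one action: insensitive to R
print(distinct_behaviours(planner_f, "not_I"))  # varies with R
print(distinct_behaviours(planner_r, "I"))      # varies with R
```

Conditional on $$I$$, $$p_f$$ gives the same action for every candidate reward, which is the signature of an overridden reward under this proxy.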

Back to the example above. The $$(p_r, R_t)$$ pair means that the human is rationally maximising the twisted reward $$R_t$$. The $$(p_f, R_t)$$ pair is one where the injection forces the human into a very specific behaviour - specific behaviour that coincidentally is exactly the right thing for their reward. Finally, $$(p_f, R_e)$$ claims that the injection forces the human into specific behaviour that is detrimental to their reward. In the first two cases, the AI’s recommended action is $$I$$ (expected reward $$1$$ versus $$1/2$$ for $$\neg I$$); in the third, it’s $$\neg I$$ (expected reward $$1/2$$ versus $$0$$ for $$I$$).

(Of course, it’s also possible to model humans as opioid-maximisers, whose rationality is overridden by not getting heroin injections; as already stated, rewards and rationality cannot be deduced from observations alone.)
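The expected rewards quoted for each pair can be reproduced with a short sketch. It assumes the AI evaluates each of its actions by the reward the human's actual policy $$\pi$$ then obtains, in a deterministic one-step setting; the function names are illustrative.

```python
from fractions import Fraction

# Observed human policy pi.
PI = {"I": "H", "not_I": "F"}

R_p = {"H": Fraction(1), "F": Fraction(1, 3)}
R_e = {"H": Fraction(0), "F": Fraction(1, 2)}

# Rewards appearing in the three compatible pairs.
REWARDS = {
    "R_e": lambda ai: R_e,
    "R_t": lambda ai: R_p if ai == "I" else R_e,  # twisted reward
}

def value(reward_name, ai_action):
    """Reward the human actually receives after the AI's action."""
    return REWARDS[reward_name](ai_action)[PI[ai_action]]

def recommended(reward_name):
    """AI action with the highest resulting human reward."""
    return max(("I", "not_I"), key=lambda ai: value(reward_name, ai))

for pair, r in (("(p_r, R_t)", "R_t"), ("(p_f, R_t)", "R_t"), ("(p_f, R_e)", "R_e")):
    print(pair, value(r, "I"), value(r, "not_I"), recommended(r))
```

Since all three pairs are compatible with the same $$\pi$$, the recommendation in this sketch depends only on which reward is assumed, not on the planner.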

Hence the concept of overriding human preferences appears naturally and continuously within the formalism of rationality planners.
