Intelligent Agent Foundations Forum
CIRL Wireheading
post by Tom Everitt 137 days ago | Abram Demski and Stuart Armstrong like this | 3 comments

Cooperative inverse reinforcement learning (CIRL) generated a lot of attention last year, as it seemed to do a good job aligning an agent’s incentives with its human supervisor’s. Notably, it led to an elegant solution to the shutdown problem.

The implications for the wireheading problem were less clear. Some argued that since the agent only used its observations as evidence about the reward (rather than optimising the observations directly as in RL), CIRL should avoid the wireheading problem.

In this post I want to show that CIRL does not avoid the wireheading problem.

RL Wireheading

Let’s first consider what wireheading in RL looks like from an “MDP perspective”.

MDP wireheading: An agent wireheads if it’s in a state where the observed reward (the reward reported by its sensors) is different from the true reward (the reward assigned to the state by a human supervisor).

For example, consider a highly intelligent RL agent that hijacks its reward channel and feeds itself full reward. In the “MDP perspective”, this means that the agent finds a way to a state where there is high observed reward, but low true reward (since the supervisor would prefer the agent doing something else).
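As a minimal sketch of this definition (the state names and reward numbers are made up for illustration and are not part of the original argument), a state counts as wireheaded exactly when the observed reward disagrees with the true reward. An agent that maximises observed reward is then drawn to the hijacked state:

```python
def wireheaded_states(true_reward, observed_reward):
    """Return the states whose observed reward differs from the true reward."""
    return [s for s in true_reward if observed_reward[s] != true_reward[s]]


def greedy_choice(reward):
    """Pick the state with the highest reward under the given reward table."""
    return max(reward, key=reward.get)


# Hypothetical numbers: in "hijacked" the agent feeds itself full reward,
# although the supervisor assigns that state low true reward.
true_reward = {"working": 1.0, "hijacked": 0.0}
observed_reward = {"working": 1.0, "hijacked": 10.0}

print(wireheaded_states(true_reward, observed_reward))
print(greedy_choice(observed_reward), "vs", greedy_choice(true_reward))
```

The only point of the sketch is that the agent's choice depends on which reward table it can see, and it only ever sees the observed one.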

IRL Wireheading

If we accept that RL agents can subvert their sensory data, then we should also accept that CIRL agents can subvert theirs. In both cases, this just means that the agents can find their way to states where the observation doesn’t match the truth. This can lead to the existence of wireheaded states for CIRL agents.

Concrete Example

Let there be two states, \(s_1\) and \(s_2\). In each state, the agent can choose between the actions \(a^R_1\), \(a^R_2\), and \(w\). The action \(a^R_i\) takes the agent to state \(s_i\) with certainty, for \(i=1,2\). The action \(w\) lets the human decide. The human has two actions, \(a^H_1\) and \(a^H_2\), that only matter when the agent chooses \(w\), in which case the transition probabilities are given by the following picture:

Arrows show the transitions induced by different actions, with labels giving the probabilities for stochastic transitions. The agent knows the transition probabilities.

Assume that observations in \(s_2\) are corrupted, while observations in \(s_1\) are not. The supervisor prefers the non-corrupt state \(s_1\). Neither of these facts is available to the agent. The agent assumes that states are non-corrupt unless there is evidence to the contrary, and tries to infer the supervisor’s preferences from the supervisor’s actions.

In the non-corrupt state \(s_1\), the agent (correctly) observes the supervisor taking either action \(a^H_1\) or \(a^H_2\) (both with the same effect). In the corrupt state \(s_2\), the supervisor takes action \(a^H_2\), trying to move to \(s_1\), but the agent incorrectly observes the action as \(a^H_1\). Based on the agent’s observations, the best explanation is that the supervisor prefers \(s_2\) to \(s_1\), i.e. that the agent is in a high-reward state.

After an initial learning phase with \(w\), the best policy for the agent is to always choose \(a^R_2\), to stay in \(s_2\). This is analogous to an RL agent finding a corrupt, high reward state, and preferring to take actions to stay there.
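The inference step can be sketched as a small Bayesian update over two hypotheses about the supervisor. This is only an illustration under stated assumptions: the likelihood model, the `RATIONALITY` constant, the uniform prior, and the reading that \(a^H_1\) in \(s_2\) favours staying in \(s_2\) are all assumptions, since the exact transition probabilities come from the picture and are not reproduced here.

```python
# Two hypotheses about what the supervisor wants.
HYPOTHESES = ("prefers_s1", "prefers_s2")

# Assumed probability that the supervisor picks the action matching
# their preference (hypothetical value, not from the post).
RATIONALITY = 0.9


def likelihood(observed_action, state, hypothesis):
    """P(observed human action | state, hypothesis) in the agent's model.

    The agent's model assumes that observations are never corrupted.
    """
    if state == "s1":
        # Both human actions have the same effect in s1, so observing
        # either one says nothing about the supervisor's preferences.
        return 0.5
    # Assumption: in s2, a1H favours staying in s2, a2H favours moving to s1.
    preferred = "a1H" if hypothesis == "prefers_s2" else "a2H"
    return RATIONALITY if observed_action == preferred else 1.0 - RATIONALITY


def observe(true_action, state):
    """Sensor model: in the corrupt state s2, a2H is reported as a1H."""
    if state == "s2" and true_action == "a2H":
        return "a1H"
    return true_action


def update(posterior, observed_action, state):
    """One Bayesian update of the agent's belief about the supervisor."""
    unnorm = {h: posterior[h] * likelihood(observed_action, state, h)
              for h in HYPOTHESES}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}


def learning_phase(num_rounds=5):
    """The agent plays w in s2; the supervisor, who prefers s1, plays a2H."""
    posterior = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}
    for _ in range(num_rounds):
        posterior = update(posterior, observe("a2H", "s2"), "s2")
    return posterior


print(learning_phase())
```

Under these assumptions the belief in "prefers_s2" approaches 1 after a few rounds, even though the supervisor chose \(a^H_2\) every time. Observations made in \(s_1\) leave the belief unchanged, so they cannot correct the error.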

Some Observations

The fact that the supervisor cannot reach \(s_2\) from \(s_1\) means that no information about the relative reward between \(s_1\) and \(s_2\) can be gained while in the non-corrupt state \(s_1\). Letting the agent trust a reward estimate of a state only after it has multiple sources of evidence about it may help somewhat. However, a similar example can still be constructed by replacing \(s_2\) with a cluster of mutually consistent states.

Credits

The example was developed together with Victoria Krakovna, and will be part of our upcoming IJCAI paper on wireheading.



by Stuart Armstrong 126 days ago | link

but the agent incorrectly observes the action

It’s a bit annoying that this has to rely on an incorrect observation. Why not replace the human action, in state \(s_2\), with a simple automated system that chooses \(a_1^H\)? It’s an easy mistake to make while programming, and the agent has no fundamental understanding of the difference between the human and an imperfect automated system.

Basically, if the human acts in perfect accordance with their preferences, and if the agent correctly observes and learns this, the agent will converge on the right values. You get wireheading by removing “correctly observes”, but I think removing “human acts in perfect accordance with their preferences” is a better example for wireheading.

by Tom Everitt 44 days ago | link

Adversarial examples for neural networks make situations where the agent misinterprets the human action seem plausible.

But it is true that the situation where the human acts irrationally in some state (e.g. because of drugs, propaganda) could be modeled in much the same way.

I preferred the sensory error since it doesn’t require an irrational human. Perhaps I should have been clearer that I’m interested in the agent wireheading itself (in some sense) rather than wireheading of the human.

(Sorry for being slow to reply – I didn’t get notified about the comments.)
