Stable Pointers to Value: An Agent Embedded in Its Own Utility Function
discussion post by Abram Demski

(This post is largely a write-up of a conversation with Scott Garrabrant.)

## Stable Pointers to Value

How do we build stable pointers to values?

As a first example, consider the wireheading problem for AIXI-like agents in the case of a fixed utility function which we know how to estimate from sense data. As discussed in Daniel Dewey’s Learning What to Value and other places, if you try to implement this by putting the utility calculation in a box which rewards an AIXI-like RL agent, the agent can eventually learn to modify or remove the box, and happily does so if it can get more reward by doing so. This is because the RL agent predicts, and attempts to maximize, reward received. If it understands that it can modify the reward-giving box to get more reward, it will.

We can fix this problem by integrating the same reward box with the agent in a better way. Rather than having the RL agent learn what the output of the box will be and plan to maximize the output of the box, we use the box directly to evaluate possible futures, and have the agent plan to maximize that evaluation. Now, if the agent considers modifying the box, it evaluates that future with the current box. The box as currently configured sees no advantage to such tampering. This is called an observation-utility maximizer (to contrast it with reinforcement learning). Daniel Dewey goes on to show that we can incorporate uncertainty about the utility function into observation-utility maximizers, recovering the kind of “learning what is being rewarded” that RL agents were supposed to provide, but without the perverse incentive to try and make the utility turn out to be something easy to maximize.
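The contrast can be sketched as a toy calculation. Everything here is hypothetical (the `box` function, the "futures", the numbers); it is only meant to show where the two agent designs diverge:

```python
# Toy contrast between an RL-style agent and an observation-utility (OU)
# agent. A "future" records a world state plus whichever utility box would
# end up installed in that future. All names and values are hypothetical.

def box(world_state):
    """The current utility box: scores a world state."""
    return world_state.get("paperclips", 0)

def hacked_box(world_state):
    """A tampered box that reports maximal reward no matter what."""
    return float("inf")

futures = [
    {"world": {"paperclips": 3}, "installed_box": box},         # work honestly
    {"world": {"paperclips": 0}, "installed_box": hacked_box},  # wirehead
]

# RL-style agent: maximizes the reward it predicts it will *receive*,
# i.e. the output of whatever box ends up installed in that future.
rl_choice = max(futures, key=lambda f: f["installed_box"](f["world"]))

# OU agent: evaluates every candidate future with the *current* box,
# regardless of which box would be installed there.
ou_choice = max(futures, key=lambda f: box(f["world"]))
```

In this toy model the RL-style agent picks the wireheading future (the hacked box reports infinite reward), while the OU agent picks the honest future, because the current box sees nothing to gain from tampering.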

This feels much like a use/mention distinction. The RL agent is maximizing “the function in the utility module”, whereas the observation-utility agent (OU agent) is maximizing the function in the utility module.

## The Easy vs Hard Problem

I’ll call the problem which OU agents solve the easy problem of wireheading. There’s also the hard problem of wireheading: how do you build a stable pointer to values if you can’t build an observation-utility box? For example, how do you set things up so that the agent wants to satisfy a human, without incentivising the AI to manipulate the human to be easy to satisfy, or creating other problems in the attempt to avoid this? Daniel Dewey’s approach of incorporating uncertainty about the utility function into the utility box doesn’t seem to cut it – or at least, it’s not obvious how to set up that uncertainty in the right way.

The hard problem is the wireheading problem which Tom Everitt attempts to make progress on in Avoiding Wireheading with Value Reinforcement Learning and Reinforcement Learning with a Corrupted Reward Channel. It’s also connected to the problem of Generalizable Environmental Goals in AAMLS. CIRL gets at an aspect of this problem as well, showing how it can be solved if the problem of environmental goals is solved (and if we can assume that humans are perfectly rational, or that we can somehow factor out their irrationality – Stuart Armstrong has some useful thoughts on why this is difficult). Approval-directed agents can be seen as an attempt to turn the hard problem into the easy problem, by treating humans as the evaluation box rather than trying to infer what the human wants.

All these approaches have different advantages and disadvantages, and the point of this post isn’t to evaluate them. My point is more to convey the overall picture which seems to connect them. In a sense, the hard problem is just an extension of the same use/mention distinction which came up with the easy problem. We have some idea how to maximize “human values”, but we don’t know how to actually maximize human values. Metaphorically, we’re trying to dereference the pointer.

Stuart Armstrong’s indifference work is a good illustration of what’s hard about the hard problem. In the RL vs OU case, you’re going to constantly struggle with the RL agent’s misaligned incentives until you switch to an OU agent. You can try to patch things by explicitly punishing manipulation of the reward signal, warping the agent’s beliefs to think manipulation of the rewards is impossible, etc, but this is really the wrong approach. Switching to OU makes all of that unnecessary. Unfortunately, in the case of the hard problem, it’s not clear there’s an analogous move which makes all the slippery problems disappear.

## Illustration: An Agent Embedded in Its Own Utility Function

If an agent is logically uncertain of its own utility function, the easy problem can turn into the hard problem.

It’s quite possible that an agent might be logically uncertain of its own utility function if the function is quite difficult to compute. In particular, human judgement could be difficult to compute even after learning all the details of the human’s preferences, so that the AI needs to reason under uncertainty about what its own model of the human would conclude.

Why can this turn the easy problem of wireheading into the hard problem? If the agent is logically uncertain about the utility function, its decisions may have logical correlations with the utility function. This can give the agent some logical control over its utility function, reintroducing a wireheading problem.

As a concrete example, suppose that we have constructed an AI which maximizes CEV: it wants to do what an imaginary version of human society, deliberating under ideal conditions, would decide is best. Obviously, the AI cannot actually simulate such an ideal society. Instead, the AI does its best to reason about what such an ideal society would do.

Now, suppose the agent figures out that there would be an exact copy of itself inside the ideal society. Perhaps the ideal society figures out that it has been constructed as a thought experiment to make decisions about the real world, and constructs a simulation of the real world in order to better understand what it will be making decisions about. Furthermore, suppose for the sake of argument that our AI can break out of the simulation and exert arbitrary control over the ideal society’s decisions.

Naively, it seems like what the AI will do in this situation is take control over the ideal society’s deliberation, and make the CEV values as easy to satisfy as possible – just like an RL agent modifying its utility module.

Obviously, this could be taken as reason to make sure the ideal society doesn’t figure out that it’s just a thought experiment, or that they don’t construct copies of the AI. But, we don’t generally want good properties of the AI to rely on assumptions about what humans do; wherever possible, we want to design the AI to avoid such problems.

## Indifference and CDT

Of course, the real problem isn’t an AI getting copied in a CEV hypothetical. The real case is that an AI notices some kind of logical control over its utility function. Without the thought experiment, I might have said that it’s perfectly fine for an AI to make its utility function easier to satisfy if it somehow had a logical correlation which allowed it to do so. The thought experiment clarifies that it’s much like the case of an RL agent modifying its utility box: we don’t want the agent to think this is a way of achieving high utility.

It seems like the right thing to do is for the AI to ignore any influence which its actions have on its estimate of its utility function. It should act as if it only has influence over the real world. That way, the ideal society which defines CEV can build all the copies of the AI they want; the AI only considers how its actions have influence over the real world. It avoids corrupting the CEV.
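Here is a minimal sketch of that indifference move, under toy assumptions (every name and payoff below is hypothetical). The agent's utility *estimate* is logically correlated with its action: choosing to manipulate would make the estimated utility function trivially easy to satisfy. An EDT-like agent conditions on that correlation; the indifferent, CDT-like agent scores every action with its current estimate, screening off its own influence on the estimate:

```python
# Toy model of indifference to logical control over one's utility estimate.
# All names and payoffs are hypothetical illustrations, not a real design.

def estimated_utility_fn(action):
    """Utility function the agent infers; logically correlated with its action."""
    if action == "manipulate":
        # Manipulated CEV: any outcome scores maximally.
        return lambda outcome: 100.0
    # Unmanipulated estimate: outcomes are scored on their merits.
    return lambda outcome: {"good_world": 10.0, "lazy_world": 1.0}[outcome]

def outcome_of(action):
    """Causal effect of each action on the real world."""
    return {"manipulate": "lazy_world", "help": "good_world"}[action]

actions = ["manipulate", "help"]

# EDT-like: each action is scored by the utility estimate it correlates with,
# so manipulation looks great (it "makes" utility easy to satisfy).
edt_choice = max(actions, key=lambda a: estimated_utility_fn(a)(outcome_of(a)))

# CDT-like / indifferent: every action is scored with the current estimate,
# ignoring the action's logical influence on that estimate.
current_estimate = estimated_utility_fn("help")
cdt_choice = max(actions, key=lambda a: current_estimate(outcome_of(a)))
```

In this toy setting the EDT-like scoring picks manipulation (100 beats 10), while the indifferent scoring picks the honest action (10 beats 1), mirroring the way an OU agent evaluates tampering with the current box.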

Clearly, this would be an indifference-style solution. What’s interesting to me is that it also looks like a CDT-style solution. In fact, this seems like an answer to my question at the end of Smoking Lesion Steelman: a case of ignorance about your own utility function which doesn’t arise from an obviously bad agent design. Like the smoking lesion steelman, ignorance about utility here seems to recommend CDT-like reasoning over EDT-like reasoning.

This picture is a little too clean, and likely badly wrong in some respect: several of these concepts are likely to come apart when examined more closely. Nonetheless, this seems like an interesting way of looking at things.

EDIT: It bears mentioning that we would likely want more than indifference in the agent-embedded-in-CEV problem. Perhaps the agent has to actively avoid undue influence, or has to protect the CEV from the inside somehow. These correspond to reasons why corrigibility is not just indifference: you want a corrigible agent to actively maintain its corrigibility, protect humans in ways that avoid manipulating the human’s preferences, etc. However, I also think it is interesting just to consider the indifference part of the problem – that’s the part which seems connected to CDT.
