Agents that don't become maximisers
post by Stuart Armstrong 320 days ago

A putative new idea for AI control; index here.

According to the basic AI drives thesis, (almost) any agent capable of self-modification will self-modify into an expected utility maximiser.

The typical examples are inconsistent utility maximisers, satisficers, and unexploitable agents, and it’s easy to think that all agents fall roughly into these broad categories. There’s also the observation that, when looking at full policies rather than individual actions, many biased agents become expected utility maximisers (unless they want to lose pointlessly).

Nevertheless… there is an entire category of agents that generically seem to not self-modify into maximisers. These are agents that attempt to maximise $$f(\mathbb{E}(U))$$ where $$U$$ is some utility function, $$\mathbb{E}(U)$$ is its expectation, and $$f$$ is a function that is neither wholly increasing nor decreasing.

# Intransitive example

Let there be a $$U$$ with three actions $$a_0$$, $$a_5$$, and $$a_{10}$$ that set $$U$$ to $$0$$, $$5$$, and $$10$$, respectively.

The function $$f$$ is $$1$$ in the range $$(4,6)$$ and is $$0$$ elsewhere. Hence the agent needs to set the expectation of $$U$$ to be in that range.

What will happen is that one action will be randomly removed from the set, and the agent will then have to choose among the remaining two actions. What possible policies can the agent take?

Well, there are three option sets the agent could face – $$(a_0, a_5)$$, $$(a_5, a_{10})$$, and $$(a_{10}, a_0)$$ – each with two options and hence $$2^3=8$$ pure policies. Two of those policies – choosing always the first option in those ordered pairs, or choosing always the second option – are intransitive, as they rank no option above the other two.

But actually those intransitive policies have an expected utility of $$(0+5+10)/3 = 5$$, which is just what the agent wants.

Even worse, none of the other (transitive) policies are acceptable. You can see this because each of the six transitive policies can be reached by taking one of the intransitive policies and flipping a choice, which must change the expected utility by $$\pm 5/3$$ or $$\pm 10/3$$, moving it out of the $$(4,6)$$ range.

Thus no expected utility maximisation corresponds to these policies, as such maximisations are always transitive.

Or another way of seeing this: the random policy of picking an action randomly has an expectation of $$(0+0+5+5+10+10)/6 = 5$$, so is also an acceptable policy. But for expected utility maximisation, a random mix can only be optimal if every pure policy it mixes over is optimal, so if the random policy is acceptable, then so is every other policy, which is not the case.
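The counting argument above can be checked by brute force. The following is a minimal Python sketch, not part of the original post: the names `UTILITY`, `OPTION_SETS`, `expected_utility` and `is_transitive` are illustrative, and a policy is encoded (by assumption) as a tuple of three indices, one per ordered pair, with `0` meaning "pick the first option".

```python
from fractions import Fraction
from itertools import permutations, product

# Utilities of the three actions and the three ordered option sets from the text.
UTILITY = {"a0": 0, "a5": 5, "a10": 10}
OPTION_SETS = [("a0", "a5"), ("a5", "a10"), ("a10", "a0")]


def f(x):
    """The post's f: 1 strictly inside (4, 6), 0 elsewhere."""
    return 1 if 4 < x < 6 else 0


def expected_utility(policy):
    """Each option set is equally likely; policy[k] indexes the choice in set k."""
    total = sum(UTILITY[pair[i]] for pair, i in zip(OPTION_SETS, policy))
    return Fraction(total, len(OPTION_SETS))


def is_transitive(policy):
    """True if some strict ranking of the three actions explains every choice."""
    picks = [(pair[i], pair[1 - i]) for pair, i in zip(OPTION_SETS, policy)]
    for ranking in permutations(UTILITY):
        position = {action: k for k, action in enumerate(ranking)}
        if all(position[chosen] < position[rejected] for chosen, rejected in picks):
            return True
    return False


policies = list(product((0, 1), repeat=len(OPTION_SETS)))
acceptable = [p for p in policies if f(expected_utility(p)) == 1]
transitive = [p for p in policies if is_transitive(p)]
# Expected utility of choosing uniformly at random among the pure policies.
uniform_mix = sum(expected_utility(p) for p in policies) / len(policies)

for p in policies:
    label = "transitive" if is_transitive(p) else "intransitive"
    print(p, expected_utility(p), label, f(expected_utility(p)))
```

Under this encoding, the only policies with $$f = 1$$ are the two cyclic ones, and the uniform mix over all eight policies also has expectation $$5$$.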

# Stability and information

The agent defined above is actually stable under self-modification: it can simply wait till it knows which action is going to be removed, and then pick $$a_5$$ in both cases where this is possible, and choose randomly between $$a_0$$ and $$a_{10}$$ if that pair comes up. And that’s what it would do if it faced any of those three choices from the start.

But that’s an artefact of the options in the setup. If instead the actions had been $$a_0$$, $$a_4$$, and $$a_{11}$$, then all the previous results would remain valid, but the agent would want to self-modify (if only to deal with the $$(a_0, a_4)$$ option).
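To make the contrast between the two setups concrete, here is a small Python sketch. It rests on two assumptions that are a reading of the text rather than something it states: the agent may randomise between the two remaining actions, and once it knows the option set it evaluates $$f$$ on the expectation conditional on that set. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def can_reach_target(u_a, u_b, low=4, high=6):
    """Can some mix of two actions put E(U) strictly inside (low, high)?

    Mixing two actions reaches every expectation in [min, max], so the open
    target interval is hit exactly when min < high and max > low.
    """
    lo, hi = min(u_a, u_b), max(u_a, u_b)
    return lo < high and hi > low


def reachable_by_option_set(utilities):
    """Map each unordered pair of remaining utilities to whether the target is reachable."""
    return {pair: can_reach_target(*pair) for pair in combinations(utilities, 2)}


original = reachable_by_option_set((0, 5, 10))
variant = reachable_by_option_set((0, 4, 11))

# Expected utility of committing in advance to a cyclic policy in the variant setup.
cyclic_expectation = Fraction(0 + 4 + 11, 3)

print(original)
print(variant)
print(cyclic_expectation)
```

In the original setup every option set lets the waiting agent reach the range, whereas in the variant the $$(a_0, a_4)$$ pair tops out at exactly $$4$$, outside the open interval. A policy committed to beforehand still averages $$5$$, which is the sense in which the earlier agent gains by binding its successor.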

What about information? Is it always good for the agent to know more?

Well, if the agent can self-modify before receiving extra information, then extra information can never be a negative (trivial proof: the agent can self-modify to ignore the information if it were negative to know).

But if the agent cannot self-modify before receiving the information, then it can sometimes pay to not learn or to forget some things. For instance, maybe there was an extra piece of information that informed the agent of the utilities of the various actions; then the agent might want to erase that information simply so its successor would be tempted to choose randomly.

# Why are satisficers different?

Note that this framework does not include satisficers, who can be seen as having a $$c$$ such that $$g(u)=0$$ for $$u < c$$ and $$g(u)=1$$ for $$u\geq c$$, and maximising $$g(\mathbb{E}(U))$$.

But this $$g$$ is an increasing (step) function, and that makes all the difference. An expected utility maximiser choosing between policies $$p$$ and $$q$$ will pick $$p$$ if $$\mathbb{E}(U|p) > \mathbb{E}(U|q)$$. If $$g$$ is increasing, then $$g(\mathbb{E}(U|p)) \geq g(\mathbb{E}(U|q))$$, so such a choice is also permissible to a satisficer. The change from $$>$$ to $$\geq$$ is why satisficers can become maximisers (maximising is compatible with satisficing) but not the opposite.
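The same point can be illustrated on the eight policies of the intransitive example. This is a hedged sketch: the threshold $$c = 5$$ for the satisficer is an arbitrary choice made here for illustration, and `permissible` is a hypothetical helper meaning "attains the best value of the objective over all pure policies".

```python
from fractions import Fraction
from itertools import product

# Utilities available in each of the three ordered option sets from the example.
OPTION_SETS = [(0, 5), (5, 10), (10, 0)]


def expected_utility(policy):
    return Fraction(sum(pair[i] for pair, i in zip(OPTION_SETS, policy)), 3)


def g(x, c=5):
    """Satisficer's increasing step function (threshold c is an assumption)."""
    return 1 if x >= c else 0


def f(x):
    """The non-monotone f from the post: 1 inside (4, 6), 0 elsewhere."""
    return 1 if 4 < x < 6 else 0


policies = list(product((0, 1), repeat=3))
maximiser_choice = max(policies, key=expected_utility)


def permissible(objective, policy):
    """Does `policy` attain the best objective value among all pure policies?"""
    best = max(objective(expected_utility(p)) for p in policies)
    return objective(expected_utility(policy)) == best


print(maximiser_choice, expected_utility(maximiser_choice))
print("fine for satisficer g:", permissible(g, maximiser_choice))
print("fine for bump f:", permissible(f, maximiser_choice))
```

The maximiser's preferred policy is acceptable under the increasing $$g$$ but not under the bump-shaped $$f$$, which is the asymmetry described above.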

# Is this design plausible?

It might seem bizarre to have an agent that restricts expected utility to a particular range, but it’s actually quite sensible, at least intuitively.

The problem with maximisers is that the extreme optimised policy is likely to include dangerous side-effects we didn’t expect. Satisficers were supposed to solve this, by allowing the agent to not focus only on the extreme optimised policy, but their failure mode is that they don’t preclude following such a policy. Hence this design might be felt to be superior, as it also rules out the extreme optimised policies.

Its failure mode, though, is that it doesn’t preclude, for instance, a probabilistic mix of an extreme optimised policy with a random inefficient one.
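As a rough illustration of that failure mode, the sketch below computes which mixing probabilities put the expectation back inside $$(4, 6)$$. The two input values, $$5/3$$ and $$25/3$$, are the lowest and highest pure-policy expected utilities in the intransitive example (obtained by enumerating its eight policies); the function name `mixing_window` and its parameters are hypothetical.

```python
from fractions import Fraction


def mixing_window(eu_inefficient, eu_extreme, low=4, high=6):
    """Bounds (p_min, p_max) of the open interval of probabilities p on the
    extreme policy such that p*eu_extreme + (1-p)*eu_inefficient lies strictly
    between low and high. Assumes eu_inefficient < eu_extreme; if
    p_min >= p_max, no mix lands in the range.
    """
    spread = eu_extreme - eu_inefficient
    p_min = max(Fraction(0), Fraction(low - eu_inefficient) / spread)
    p_max = min(Fraction(1), Fraction(high - eu_inefficient) / spread)
    return p_min, p_max


eu_low, eu_high = Fraction(5, 3), Fraction(25, 3)
window = mixing_window(eu_low, eu_high)
half_half = Fraction(1, 2) * eu_high + Fraction(1, 2) * eu_low

print(window)
print(half_half)
```

So an agent of this type is satisfied by, say, a coin flip between the most extreme policy and the least efficient one, even though half the time it then runs the extreme policy.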
