Forum Digest: Corrigibility, utility indifference, & related control ideas
post by Benja Fallenstein 1063 days ago | Kaya Stechly, Jessica Taylor, Nate Soares, Patrick LaVictoire and Stuart Armstrong like this | 3 comments

This is a quick recap of the posts on this forum that deal with corrigibility (making sure that if you get an agent’s goal system wrong, it doesn’t try to prevent you from changing it), utility indifference (the idea of removing the agent’s incentives to manipulate you into changing, or not changing, its goal system, by adding rewards to its utility function so that it gets the same utility in both cases), and related AI control ideas. It’s current as of 3/21/15.

# Papers

As background to the posts listed below, the following two papers may be helpful.

• Corrigibility, by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong (2015). This paper introduces the problem of corrigibility and analyzes some simple models, including a version of Stuart Armstrong’s utility indifference. Abstract:

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.

• Utility Indifference, by Stuart Armstrong (2010). An older paper by Stuart explaining the utility indifference approach.

# Corrigibility

• Generalizing the Corrigibility paper’s impossibility result?, Benja Fallenstein. The Corrigibility paper looks at a particular linear way to combine two utility functions, $$\mathcal{U}_N$$ and $$\mathcal{U}_S$$, which incentivize normal operation and shutdown, respectively. It shows that most such linear combinations lead to unintended behavior. Is it possible to avoid this problem by considering non-linear combinations? It turns out that this question isn’t quite well-formed.

# Utility indifference

• Utility indifference and infinite improbability drives, Benja Fallenstein. (The Corrigibility paper doesn’t exactly reflect Stuart’s approach, and Stuart’s approach avoids the exact problem stated in the Corrigibility paper, but it can still be interpreted as shifting the agent’s probability distribution, and this still makes the agent do stupid things.)

• Un-manipulable counterfactuals, Stuart Armstrong. The Corrigibility paper uses causal counterfactuals, à la Pearl. In this post, Stuart suggests defining counterfactuals by conditioning on a chaotic random event the AI can’t influence. For example, we might make it so that an oracle has a low probability of producing no output, and a high probability of outputting its prediction of what would have happened conditional on it not outputting anything.

• Orthogonality: action counterfactuals, Stuart Armstrong. Suggests a version of utility indifference where a shutdown button does not change the agent’s utility function directly, but permits the agent to execute an action that changes its utility function. It additionally suggests computing the utility of this action in a similar way as in other versions of utility indifference, but with a small additional term $$\epsilon$$ rewarding a change in utility. Argues that this incentivizes the agent to manipulate its operators to press the shutdown button, but only if this action is extremely cheap.
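The basic utility-indifference idea from the introduction (add a reward so the agent gets the same utility whether or not its goal system is changed) can be sketched in a few lines. This is only a toy illustration, not the construction from any of the posts or papers above: the function names, the single `p_press` probability, and the fixed expected utilities `e_normal` and `e_shutdown` are all assumptions made for the example.

```python
def indifference_bonus(e_normal, e_shutdown):
    """Compensating reward (hypothetical helper): the gap between the
    expected utility of normal operation and that of shutdown."""
    return e_normal - e_shutdown


def expected_utility(p_press, e_normal, e_shutdown, bonus=0.0):
    """Expected utility when the button is pressed with probability p_press.

    e_normal and e_shutdown are assumed to be fixed numbers here; in the
    real proposals they are expectations that depend on the agent's actions.
    """
    return (1 - p_press) * e_normal + p_press * (e_shutdown + bonus)


# Toy numbers: the agent values normal operation more than shutdown.
e_normal, e_shutdown = 10.0, 2.0
bonus = indifference_bonus(e_normal, e_shutdown)

for p_press in (0.1, 0.5, 0.9):
    without = expected_utility(p_press, e_normal, e_shutdown)
    with_bonus = expected_utility(p_press, e_normal, e_shutdown, bonus)
    print(p_press, without, with_bonus)
```

Without the bonus, the agent's expected utility falls as the press probability rises, so it has a reason to lower that probability. With the bonus, the expected utility no longer depends on `p_press` in this toy model, so manipulating the button gains the agent nothing.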

# Safe oracles

• Predictors that don’t try to manipulate you(?), Benja Fallenstein. If you implement an agent whose only goal is to output correct predictions about future events, this agent may still have an incentive to manipulate the environment to make it easier to predict. This post suggests a potential way to define an agent which wants to make correct predictions but does not want to make its environment easier to predict.

• Non-manipulative oracles, Stuart Armstrong. Suggests avoiding manipulation by a predictor by having the oracle predict not what will happen in the actual world, but what would happen in a counterfactual world where the oracle didn’t produce any output.

# Low-impact agents

• AI-created pseudo-deontology, Stuart Armstrong. Proposes to implement an agent $$A$$ whose only task is to create an agent $$B$$, whose utility function will be modified by some noise before $$B$$ is run. Argues that this may lead $$A$$ to create a $$B$$ which “follows its motivation to some extent, but not to extreme amounts”, because $$A$$ wants $$B$$’s behavior to be robust to this noise.

• Restrictions that are hard to hack, Stuart Armstrong. Putting specific restrictions on an agent’s motivation is problematic as a safety technique, because the agent will usually be able to find unintended instantiations that satisfy the letter but not the spirit of the restriction. This post suggests that unintended instantiations are more informative about small changes in the restrictions they instantiate than intended instantiations, and suggests using this to make unintended instantiations less likely.

• Creating a satisficer, Stuart Armstrong. Proposes a potential way of creating an agent that tries to act in a way that does well on one utility function $$u$$ while trying to have little impact on many other utility functions $$v$$.

# Odds and ends

• Resource gathering agent, Stuart Armstrong. Argues that if we take an arbitrary utility function $$u$$, and build an agent that assigns 50% probability that it wants to maximize $$u$$ and 50% probability that it wants to maximize $$-u$$ (it will find out the truth tomorrow), then we get an agent that is purely interested in convergent instrumental goals like resource gathering. Suggests that if we could somehow “subtract off” such an agent from another agent, we could construct an agent that doesn’t try to follow convergent instrumental goals.

• Acausal trade barriers, Stuart Armstrong. Suggests a technique similar to utility indifference which may disincentivize agents from acausally trading with each other.

• Anti-Pascaline agents, Stuart Armstrong. Given a random variable $$X$$, an event $$A$$, and a small $$\varepsilon > 0$$, define $${\overline p}_{\varepsilon}(X\mid A)$$ to be such that conditional on $$A$$, the probability of $$X \ge {\overline p}_{\varepsilon}(X\mid A)$$ is $$\varepsilon$$. Similarly define $${\underline p}_\varepsilon(X\mid A)$$ by replacing $$\ge$$ by $$\le$$. Then define $$\mathbb{E}_\varepsilon[X\mid A] := \mathbb{E}[X'\mid A]$$, where $$X'$$ is $$X$$ bounded to $$[{\underline p}_\varepsilon(X\mid A),{\overline p}_\varepsilon(X\mid A)]$$. Given a utility function $$u$$, this post proposes taking the action $$a$$ that maximizes $$\mathbb{E}_\varepsilon[u\mid a]$$ as an “unprincipled” approach to dealing with Pascal’s Mugging.
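The truncated expectation $$\mathbb{E}_\varepsilon$$ from the Anti-Pascaline entry can be illustrated on a finite distribution. This is a minimal sketch under stated assumptions, not code from the post: for a discrete distribution an exact threshold with tail probability $$\varepsilon$$ need not exist, so the example uses a quantile convention (the largest value whose upper tail has probability at least $$\varepsilon$$, and the smallest value whose lower tail has probability at least $$\varepsilon$$). The function names and the mugging numbers are hypothetical.

```python
def plain_expectation(dist):
    """Ordinary expectation of a list of (value, probability) pairs."""
    return sum(p * v for v, p in dist)


def truncated_expectation(dist, eps):
    """Expectation of X clipped to its eps-tail thresholds.

    dist: list of (value, probability) pairs whose probabilities sum to 1.
    Quantile convention (an assumption for the discrete case):
      upper = largest v with P(X >= v) >= eps
      lower = smallest v with P(X <= v) >= eps
    """
    ordered = sorted(dist)

    # Walk down from the largest value, accumulating upper-tail mass.
    tail, upper = 0.0, ordered[0][0]
    for v, p in reversed(ordered):
        tail += p
        if tail >= eps:
            upper = v
            break

    # Walk up from the smallest value, accumulating lower-tail mass.
    head, lower = 0.0, ordered[-1][0]
    for v, p in ordered:
        head += p
        if head >= eps:
            lower = v
            break

    # Clip every outcome into [lower, upper] and average.
    return sum(p * min(max(v, lower), upper) for v, p in ordered)


# Hypothetical mugging: paying costs 5, with a 1e-9 chance of a huge payoff.
pay = [(-5.0, 1 - 1e-9), (1e15, 1e-9)]
refuse = [(0.0, 1.0)]
eps = 1e-6

print(plain_expectation(pay), plain_expectation(refuse))
print(truncated_expectation(pay, eps), truncated_expectation(refuse, eps))
```

Under the plain expectation, the tiny chance of an enormous payoff dominates and paying looks better than refusing. Under the truncated expectation with this $$\varepsilon$$, the huge outcome is clipped away because its probability is below $$\varepsilon$$, and refusing comes out ahead.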

 by Patrick LaVictoire 1063 days ago | This reminds me, I should post the Loki corrigibility model here.
 by Stuart Armstrong 1061 days ago | Thanks for that! I think some of the old stuff is likely superseded, I’ll see once the various ideas settle. And “resource gathering agent” should not be in “low-impact agents” (the “subtraction” idea does not seem a good one, but there are other uses for resource gathering agents).
 by Benja Fallenstein 1061 days ago | Stuart Armstrong likes this | Categorization is hard! :-) I wanted to break it up because long lists are annoying to read, but there was certainly some arbitrariness in dividing it up. I’ve moved “resource gathering agent” to the odds & ends.
