Forum Digest: Corrigibility, utility indifference, & related control ideas
post by Benja Fallenstein 974 days ago | Kaya Stechly, Jessica Taylor, Nate Soares, Patrick LaVictoire and Stuart Armstrong like this | 3 comments

This is a quick recap of the posts on this forum that deal with corrigibility (making sure that if you get an agent’s goal system wrong, it doesn’t try to prevent you from changing it), utility indifference (the idea of removing the agent’s incentives to manipulate you into changing, or not changing, its goal system, by adding rewards to its utility function so that it gets the same utility in both cases), and related AI control ideas. It’s current as of 3/21/15.

# Papers

As background to the posts listed below, the following two papers may be helpful.

• Corrigibility, by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong (2015). This paper introduces the problem of corrigibility and analyzes some simple models, including a version of Stuart Armstrong’s utility indifference. Abstract:

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.

• Utility Indifference, by Stuart Armstrong (2010). An older paper by Stuart explaining the utility indifference approach.

# Corrigibility

• Generalizing the Corrigibility paper’s impossibility result?, Benja Fallenstein. The Corrigibility paper looks at a particular linear way to combine two utility functions, $$\mathcal{U}_N$$ and $$\mathcal{U}_S$$, which incentivize normal operation and shutdown, respectively. It shows that most such linear combinations lead to unintended behavior. Is it possible to avoid this problem by considering non-linear combinations? It turns out that this question isn’t quite well-formed.

# Utility indifference

• Utility indifference and infinite improbability drives, Benja Fallenstein. (The Corrigibility paper doesn’t exactly reflect Stuart’s approach, and Stuart’s approach avoids the exact problem stated in the Corrigibility paper, but it can still be interpreted as shifting the agent’s probability distribution, and this still makes the agent do stupid things.)

• Un-manipulable counterfactuals, Stuart Armstrong. The Corrigibility paper uses causal counterfactuals, à la Pearl. In this post, Stuart suggests defining counterfactuals by conditioning on a chaotic random event the AI can’t influence. For example, we might make it so that an oracle has a low probability of producing no output, and a high probability of outputting its prediction of what would have happened conditional on it not outputting anything.

• Orthogonality: action counterfactuals, Stuart Armstrong. Suggests a version of utility indifference where a shutdown button does not change the agent’s utility function directly, but permits the agent to execute an action that changes its utility function. Additionally suggests computing the utility of this action in a similar way as in other versions of utility indifference, but with a small additional term $$\epsilon$$ rewarding a change in utility. Argues that this incentivizes the agent to manipulate its operators to press the shutdown button, but only if this action is extremely cheap.
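To make the basic utility indifference idea concrete, here is a minimal Python sketch under toy assumptions. It is not the construction from Stuart’s paper or from the Corrigibility paper: the names `indifference_bonus` and `expected_utility`, and all the numbers, are hypothetical. It only illustrates the recap at the top of this digest, namely that adding a compensating reward in the button-pressed case can make the agent’s expected utility independent of the probability that the button is pressed.

```python
def indifference_bonus(e_normal: float, e_shutdown: float) -> float:
    """Compensating reward added in the button-pressed branch (toy assumption)."""
    return e_normal - e_shutdown


def expected_utility(p_press: float, e_normal: float, e_shutdown: float,
                     bonus: float = 0.0) -> float:
    """Expected utility when the button is pressed with probability p_press."""
    return (1.0 - p_press) * e_normal + p_press * (e_shutdown + bonus)


# Hypothetical numbers: normal operation is worth 10, shutting down is worth 2.
E_NORMAL, E_SHUTDOWN = 10.0, 2.0
BONUS = indifference_bonus(E_NORMAL, E_SHUTDOWN)

# Compare expected utility with and without the bonus as p_press varies.
for p in (0.0, 0.25, 0.5, 1.0):
    without = expected_utility(p, E_NORMAL, E_SHUTDOWN)
    with_bonus = expected_utility(p, E_NORMAL, E_SHUTDOWN, BONUS)
    print(f"p_press={p:.2f}  without bonus={without:.2f}  with bonus={with_bonus:.2f}")
```

Without the bonus, expected utility falls as `p_press` rises, so the agent gains by lowering the press probability. With the bonus it stays constant, so manipulating the probability buys nothing in this toy model. As the posts in this section discuss, this does not by itself rule out other unintended behavior.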

# Safe oracles

• Predictors that don’t try to manipulate you(?), Benja Fallenstein. If you implement an agent whose only goal is to output correct predictions about future events, this agent may still have an incentive to manipulate the environment to make it easier to predict. This post suggests a potential way to define an agent which wants to make correct predictions but does not want to make its environment easier to predict.

• Non-manipulative oracles, Stuart Armstrong. Suggests avoiding manipulation by a predictor by having the oracle predict not what will happen in the actual world, but what would happen in a counterfactual world where the oracle didn’t produce any output.

# Low-impact agents

• AI-created pseudo-deontology, Stuart Armstrong. Proposes to implement an agent $$A$$ whose only task is to create an agent $$B$$, whose utility function will be modified by some noise before $$B$$ is run. Argues that this may lead $$A$$ to create a $$B$$ which “follows its motivation to some extent, but not to extreme amounts”, because $$A$$ wants $$B$$’s behavior to be robust to this noise.

• Restrictions that are hard to hack, Stuart Armstrong. Putting specific restrictions on an agent’s motivation is problematic as a safety technique, because the agent will usually be able to find unintended instantiations that satisfy the letter but not the spirit of the restriction. This post suggests that unintended instantiations are more informative about small changes in the restrictions they instantiate than intended instantiations, and suggests using this to make unintended instantiations less likely.

• Creating a satisficer, Stuart Armstrong. Proposes a potential way of creating an agent that tries to act in a way that does well on one utility function $$u$$ while trying to have little impact on many other utility functions $$v$$.

# Odds and ends

• Resource gathering agent, Stuart Armstrong. Argues that if we take an arbitrary utility function $$u$$, and build an agent that assigns 50% probability that it wants to maximize $$u$$ and 50% probability that it wants to maximize $$-u$$ (it will find out the truth tomorrow), then we get an agent that is purely interested in convergent instrumental goals like resource gathering. Suggests that if we could somehow “subtract off” such an agent from another agent, we could construct an agent that doesn’t try to follow convergent instrumental goals.

• Acausal trade barriers, Stuart Armstrong. Suggests a technique similar to utility indifference which may disincentivize agents from acausally trading with each other.

• Anti-Pascaline agents, Stuart Armstrong. Given a random variable $$X$$, an event $$A$$, and a small $$\varepsilon > 0$$, define $${\overline p}_{\varepsilon}(X\mid A)$$ to be such that conditional on $$A$$, the probability of $$X \ge {\overline p}_{\varepsilon}(X\mid A)$$ is $$\varepsilon$$. Similarly define $${\underline p}_\varepsilon(X\mid A)$$ by replacing $$\ge$$ by $$\le$$. Then define $$\mathbb{E}_\varepsilon[X\mid A] := \mathbb{E}[X'\mid A]$$, where $$X'$$ is $$X$$ bounded to $$[{\underline p}_\varepsilon(X\mid A),{\overline p}_\varepsilon(X\mid A)]$$. Given a utility function $$u$$, this post proposes taking the action $$a$$ that maximizes $$\mathbb{E}_\varepsilon[u\mid a]$$ as an “unprincipled” approach to dealing with Pascal’s Mugging.
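The $$\mathbb{E}_\varepsilon$$ definition in the Anti-Pascaline item can be sketched for finite discrete distributions as follows. This is an illustrative sketch, not code from the post. The `Dist` representation, the function names, and the `PAY`/`REFUSE` numbers are all hypothetical. One further assumption: for a discrete distribution there may be no cutoff whose tail probability is exactly $$\varepsilon$$, so the sketch takes the upper cutoff to be the largest outcome $$t$$ with $$P(X \ge t) \ge \varepsilon$$, and the lower cutoff symmetrically.

```python
from typing import List, Tuple

# A discrete distribution as (outcome, probability) pairs (hypothetical representation).
Dist = List[Tuple[float, float]]


def upper_cutoff(dist: Dist, eps: float) -> float:
    """Largest outcome t with P(X >= t) >= eps (assumed discrete convention)."""
    tail = 0.0
    for x, p in sorted(dist, reverse=True):
        tail += p
        if tail >= eps:
            return x
    raise ValueError("probabilities sum to less than eps")


def lower_cutoff(dist: Dist, eps: float) -> float:
    """Smallest outcome t with P(X <= t) >= eps (assumed discrete convention)."""
    tail = 0.0
    for x, p in sorted(dist):
        tail += p
        if tail >= eps:
            return x
    raise ValueError("probabilities sum to less than eps")


def plain_expectation(dist: Dist) -> float:
    """Ordinary expected value, for comparison."""
    return sum(x * p for x, p in dist)


def eps_expectation(dist: Dist, eps: float) -> float:
    """Expectation of X after bounding it to [lower_cutoff, upper_cutoff]."""
    if not 0.0 < eps < 0.5:
        raise ValueError("this sketch assumes 0 < eps < 0.5")
    lo, hi = lower_cutoff(dist, eps), upper_cutoff(dist, eps)
    return sum(p * min(max(x, lo), hi) for x, p in dist)


# Toy Pascal's Mugging: paying costs 5, with a 1e-9 chance of a 1e15 payoff.
PAY: Dist = [(-5.0, 1.0 - 1e-9), (1e15, 1e-9)]
REFUSE: Dist = [(0.0, 1.0)]

for name, dist in (("pay", PAY), ("refuse", REFUSE)):
    print(name, plain_expectation(dist), eps_expectation(dist, 0.01))
```

In this toy setup the ordinary expectation favors paying the mugger, because the tiny-probability payoff dominates, while the bounded expectation with $$\varepsilon = 0.01$$ clips that payoff away and favors refusing.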

 by Patrick LaVictoire 974 days ago | This reminds me, I should post the Loki corrigibility model here.
 by Stuart Armstrong 972 days ago | Thanks for that! I think some of the old stuff is likely superseded, I’ll see once the various ideas settle. And “resource gathering agent” should not be in “low-impact agents” (the “subtraction” idea does not seem a good one, but there are other uses for resource gathering agents).
 by Benja Fallenstein 972 days ago | Stuart Armstrong likes this | Categorization is hard! :-) I wanted to break it up because long lists are annoying to read, but there was certainly some arbitrariness in dividing it up. I’ve moved “resource gathering agent” to the odds & ends.
