Intelligent Agent Foundations Forum
Forum Digest: Corrigibility, utility indifference, & related control ideas
post by Benja Fallenstein 1188 days ago | Kaya Stechly, Jessica Taylor, Nate Soares, Patrick LaVictoire and Stuart Armstrong like this | 3 comments

This is a quick recap of the posts on this forum that deal with corrigibility (making sure that if you get an agent’s goal system wrong, it doesn’t try to prevent you from changing it), utility indifference (the idea of removing an agent’s incentives to manipulate you into changing or not changing its goal system, by adding rewards to its utility function so that it gets the same utility in both cases), and related AI control ideas. It’s current as of 3/21/15.


As background to the posts listed below, the following two papers may be helpful.

  • Corrigibility, by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong (2015). This paper introduces the problem of corrigibility and analyzes some simple models, including a version of Stuart Armstrong’s utility indifference. Abstract:

    As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.

  • Utility Indifference, by Stuart Armstrong (2010). An older paper by Stuart explaining the utility indifference approach.

Corrigibility

  • Generalizing the Corrigibility paper’s impossibility result?, Benja Fallenstein. The Corrigibility paper looks at a particular linear way to combine two utility functions, \(\mathcal{U}_N\) and \(\mathcal{U}_S\), which incentivize normal operation and shutdown, respectively. It shows that most such linear combinations lead to unintended behavior. Is it possible to avoid this problem by considering non-linear combinations? It turns out that this question isn’t quite well-formed.

Utility indifference

  • Utility indifference and infinite improbability drives, Benja Fallenstein. Argues that the Corrigibility paper doesn’t exactly reflect Stuart’s approach, and that Stuart’s approach avoids the exact problem stated in the paper. However, Stuart’s approach can still be interpreted as shifting the agent’s probability distribution, and this still makes the agent do stupid things.

  • Un-manipulable counterfactuals, Stuart Armstrong. The Corrigibility paper uses causal counterfactuals, à la Pearl. In this post, Stuart suggests defining counterfactuals by conditioning on a chaotic random event the AI can’t influence. For example, we might make it so that an oracle has a low probability of producing no output, and a high probability of outputting its prediction of what would have happened conditional on it not outputting anything.

  • Orthogonality: action counterfactuals, Stuart Armstrong. Suggests a version of utility indifference where a shutdown button does not change the agent’s utility function directly, but permits the agent to execute an action that changes its utility function. Additionally suggests computing the utility of this action in a similar way as in other versions of utility indifference, but with a small additional term \(\epsilon\) rewarding a change in utility. Argues that this incentivizes the agent to manipulate its operators into pressing the shutdown button, but only if doing so is extremely cheap.
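As a rough illustration of the basic utility indifference idea behind the posts in this section (adding a compensating reward so that the agent gets the same utility whether or not the button is pressed), here is a minimal toy sketch. The action names, probabilities, and utilities are hypothetical numbers chosen for illustration, and the model is much simpler than the ones analyzed in the Corrigibility paper or in Stuart’s posts.

```python
# Toy model (all numbers are hypothetical, for illustration only).
# name -> (P(button pressed), utility U_N if not pressed, utility U_S if pressed)
ACTIONS = {
    "work": (0.5, 10.0, 0.0),
    "block_button": (0.0, 9.0, 0.0),  # blocking the button costs 1 unit of U_N
}


def naive_value(action):
    """Expected utility when the agent gets U_N if unpressed and U_S if pressed."""
    p_press, u_normal, u_shutdown = ACTIONS[action]
    return (1 - p_press) * u_normal + p_press * u_shutdown


def indifferent_value(action):
    """Expected utility with a compensating reward added in the pressed case.

    The reward is chosen so that the pressed case is worth exactly as much
    as the unpressed case would have been.
    """
    p_press, u_normal, u_shutdown = ACTIONS[action]
    compensation = u_normal - u_shutdown
    return (1 - p_press) * u_normal + p_press * (u_shutdown + compensation)


naive_choice = max(ACTIONS, key=naive_value)
indifferent_choice = max(ACTIONS, key=indifferent_value)
print(naive_choice, indifferent_choice)
```

In this toy model the naive agent prefers to block the button (9 versus 5), while the compensated agent prefers to keep working (10 versus 9), because the probability of the button being pressed no longer affects its expected utility. Note that the compensated agent evaluates each action as if the button would never be pressed, which appears to be the kind of behavior the first post in this section describes as shifting the agent’s probability distribution.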

Safe oracles

  • Predictors that don’t try to manipulate you(?), Benja Fallenstein. If you implement an agent whose only goal is to output correct predictions about future events, this agent may still have an incentive to manipulate the environment to make it easier to predict. This post suggests a potential way to define an agent which wants to make correct predictions but does not want to make its environment easier to predict.

  • Non-manipulative oracles, Stuart Armstrong. Suggests avoiding manipulation by a predictor by having the oracle predict not what will happen in the actual world, but what would happen in a counterfactual world where the oracle didn’t produce any output.

Manipulating an agent’s beliefs

Low-impact agents

  • AI-created pseudo-deontology, Stuart Armstrong. Proposes to implement an agent \(A\) whose only task is to create an agent \(B\), whose utility function will be modified by some noise before \(B\) is run. Argues that this may lead \(A\) to create a \(B\) which “follows its motivation to some extent, but not to extreme amounts”, because \(A\) wants \(B\)’s behavior to be robust to this noise.

  • Restrictions that are hard to hack, Stuart Armstrong. Putting specific restrictions on an agent’s motivation is problematic as a safety technique, because the agent will usually be able to find unintended instantiations that satisfy the letter but not the spirit of the restriction. This post suggests that unintended instantiations are more informative about small changes in the restrictions they instantiate than intended instantiations, and proposes using this to make unintended instantiations less likely.

  • Creating a satisficer, Stuart Armstrong. Proposes a potential way of creating an agent that tries to act in a way that does well on one utility function \(u\) while trying to have little impact on many other utility functions \(v\).

Odds and ends

  • Resource gathering agent, Stuart Armstrong. Argues that if we take an arbitrary utility function \(u\), and build an agent that assigns 50% probability that it wants to maximize \(u\) and 50% probability that it wants to maximize \(-u\) (it will find out the truth tomorrow), then we get an agent that is purely interested in convergent instrumental goals like resource gathering. Suggests that if we could somehow “subtract off” such an agent from another agent, we could construct an agent that doesn’t try to follow convergent instrumental goals.

  • Acausal trade barriers, Stuart Armstrong. Suggests a technique similar to utility indifference which may disincentivize agents from acausally trading with each other.

  • Anti-Pascaline agents, Stuart Armstrong. Given a random variable \(X\), an event \(A\), and a small \(\varepsilon > 0\), define \({\overline p}_{\varepsilon}(X\mid A)\) to be such that conditional on \(A\), the probability of \(X \ge {\overline p}_{\varepsilon}(X\mid A)\) is \(\varepsilon\). Similarly define \({\underline p}_\varepsilon(X\mid A)\) by replacing \(\ge\) by \(\le\). Then define \(\mathbb{E}_\varepsilon[X\mid A] := \mathbb{E}[X'\mid A]\), where \(X'\) is \(X\) bounded to \([{\underline p}_\varepsilon(X\mid A),{\overline p}_\varepsilon(X\mid A)]\). Given a utility function \(u\), this post proposes taking the action \(a\) that maximizes \(\mathbb{E}_\varepsilon[u\mid a]\) as an “unprincipled” approach to dealing with Pascal’s Mugging.
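The \(\mathbb{E}_\varepsilon\) construction in the Anti-Pascaline entry above can be sketched for a finite discrete distribution as follows. This is only an illustrative sketch: the function names and the example numbers are hypothetical, and because a discrete distribution generally has no cutoff with tail probability exactly \(\varepsilon\), the code assumes a stand-in convention (the largest value whose upper tail has probability at least \(\varepsilon\), and the smallest value whose lower tail has probability at least \(\varepsilon\)).

```python
def clipped_expectation(outcomes, eps):
    """Expectation of X after bounding it to its eps-tail cutoffs.

    outcomes: list of (value, probability) pairs describing X given the action.
    """
    ordered = sorted(outcomes)

    # Lower cutoff (assumed convention): smallest t with P(X <= t) >= eps.
    lower, mass = ordered[-1][0], 0.0
    for value, prob in ordered:
        mass += prob
        if mass >= eps:
            lower = value
            break

    # Upper cutoff (assumed convention): largest t with P(X >= t) >= eps.
    upper, mass = ordered[0][0], 0.0
    for value, prob in reversed(ordered):
        mass += prob
        if mass >= eps:
            upper = value
            break

    # X' is X bounded to [lower, upper]; return E[X'].
    return sum(prob * min(max(value, lower), upper) for value, prob in ordered)


def plain_expectation(outcomes):
    """Ordinary expectation, for comparison."""
    return sum(prob * value for value, prob in outcomes)


# Hypothetical Pascal's Mugging: paying costs 5 utility, with a
# one-in-a-billion chance of an astronomically large payoff.
pay = [(-5.0, 1 - 1e-9), (1e15, 1e-9)]
refuse = [(0.0, 1.0)]
eps = 0.01

print(plain_expectation(pay) > plain_expectation(refuse))
print(clipped_expectation(pay, eps) > clipped_expectation(refuse, eps))
```

With these made-up numbers, the ordinary expectation favors paying the mugger, while the clipped expectation favors refusing, since the one-in-a-billion payoff lies beyond the upper cutoff and is bounded down to it.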

by Patrick LaVictoire 1188 days ago | link

This reminds me, I should post the Loki corrigibility model here.


by Stuart Armstrong 1186 days ago | link

Thanks for that!

I think some of the old stuff is likely superseded, I’ll see once the various ideas settle. And “resource gathering agent” should not be in “low-impact agents” (the “subtraction” idea does not seem a good one, but there are other uses for resource gathering agents).


by Benja Fallenstein 1186 days ago | Stuart Armstrong likes this | link

Categorization is hard! :-) I wanted to break it up because long lists are annoying to read, but there was certainly some arbitrariness in dividing it up. I’ve moved “resource gathering agent” to the odds & ends.





