Forum Digest: Corrigibility, utility indifference, & related control ideas   post by Benja Fallenstein
This is a quick recap of the posts on this forum that deal with corrigibility (making sure that if you get an agent’s goal system wrong, it doesn’t try to prevent you from changing it), utility indifference (the idea of removing the agent’s incentives to manipulate you into changing, or not changing, its goal system, by adding rewards to its utility function so that it gets the same utility in both cases), and related AI control ideas. It’s current as of 3/21/15.
Papers
As background to the posts listed below, the following two papers may be helpful.
Corrigibility, by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong (2015). This paper introduces the problem of corrigibility and analyzes some simple models, including a version of Stuart Armstrong’s utility indifference. Abstract:
As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.
Utility Indifference, by Stuart Armstrong (2010). An older paper by Stuart explaining the utility indifference approach.
Corrigibility
Generalizing the Corrigibility paper’s impossibility result?, Benja Fallenstein. The Corrigibility paper looks at a particular linear way to combine two utility functions, \(\mathcal{U}_N\) and \(\mathcal{U}_S\), which incentivize normal operation and shutdown, respectively. It shows that most such linear combinations lead to unintended behavior. Is it possible to avoid this problem by considering nonlinear combinations? It turns out that this question isn’t quite well-formed.
Utility indifference
Utility indifference and infinite improbability drives, Benja Fallenstein. The Corrigibility paper doesn’t exactly reflect Stuart’s approach, and Stuart’s approach avoids the exact problem stated in the Corrigibility paper. However, it can still be interpreted as shifting the agent’s probability distribution, and this still makes the agent do stupid things.
Unmanipulable counterfactuals, Stuart Armstrong. The Corrigibility paper uses causal counterfactuals, à la Pearl. In this post, Stuart suggests defining counterfactuals by conditioning on a chaotic random event the AI can’t influence. For example, we might make it so that an oracle has a low probability of producing no output, and a high probability of outputting its prediction of what would have happened conditional on it not outputting anything.
Orthogonality: action counterfactuals, Stuart Armstrong. Suggests a version of utility indifference where a shutdown button does not change the agent’s utility function directly, but permits the agent to execute an action that changes its utility function. Additionally suggests computing the utility of this action in a similar way as in other versions of utility indifference, but with a small additional term \(\epsilon\) rewarding a change in utility. Argues that this incentivizes the agent to manipulate its operators into pressing the shutdown button, but only if this action is extremely cheap.
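As a rough illustration of the basic idea behind utility indifference, the toy sketch below adds a compensating reward in the button-pressed case so that the agent’s expected utility no longer depends on the probability of the button being pressed. The function names, the fixed expected utilities, and the two press probabilities are all assumptions made up for this example; it is not the exact construction from either paper.

```python
def expected_utility(p_press, eu_normal, eu_shutdown, compensation=0.0):
    """Expected utility when the button is pressed with probability p_press.

    eu_normal:    expected U_N if the button is not pressed (assumed given)
    eu_shutdown:  expected U_S if the button is pressed (assumed given)
    compensation: extra reward paid only in the pressed case
    """
    return (1 - p_press) * eu_normal + p_press * (eu_shutdown + compensation)


def indifference_compensation(eu_normal, eu_shutdown):
    """Reward that makes the pressed case worth as much as the unpressed one."""
    return eu_normal - eu_shutdown


# Hypothetical numbers: normal operation is worth more than shutting down.
EU_NORMAL, EU_SHUTDOWN = 10.0, 2.0
bonus = indifference_compensation(EU_NORMAL, EU_SHUTDOWN)

for p_press in (0.1, 0.9):
    naive = expected_utility(p_press, EU_NORMAL, EU_SHUTDOWN)
    indifferent = expected_utility(p_press, EU_NORMAL, EU_SHUTDOWN, bonus)
    print(f"p_press={p_press}: naive={naive:.2f}, indifferent={indifferent:.2f}")
```

Without the compensation, this toy agent gains by lowering the press probability. With it, both press probabilities give the same expected utility, so manipulating the button buys nothing. Note that the compensated value always equals the unpressed-case value, which is one way to see why such an agent can be read as acting as if the button will not be pressed.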
Safe oracles
Predictors that don’t try to manipulate you(?), Benja Fallenstein. If you implement an agent whose only goal is to output correct predictions about future events, this agent may still have an incentive to manipulate the environment to make it easier to predict. This post suggests a potential way to define an agent which wants to make correct predictions but does not want to make its environment easier to predict.
Non-manipulative oracles, Stuart Armstrong. Suggests avoiding manipulation by a predictor by having the oracle predict not what will happen in the actual world, but what would happen in a counterfactual world where the oracle didn’t produce any output.
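The difference between predicting the actual world and predicting the no-output counterfactual can be sketched with a toy model. Everything here is hypothetical: `world_outcome`, its baseline of 0.3, and the linear response to a shown prediction are assumptions chosen only to make the two targets differ, not anything taken from the posts.

```python
def world_outcome(shown_prediction):
    """Hypothetical world: a quantity in [0, 1] that people push toward
    whatever prediction they are shown. None means no output was shown."""
    if shown_prediction is None:
        return 0.3  # assumed baseline when the oracle stays silent
    return 0.2 + 0.8 * shown_prediction  # assumed response to a shown prediction


def self_fulfilling_prediction(start=0.5, steps=200):
    """Prediction of the actual world: iterate toward a fixed point
    where the announced value matches the outcome it causes."""
    p = start
    for _ in range(steps):
        p = world_outcome(p)
    return p


def counterfactual_prediction():
    """Prediction of the world in which the oracle produced no output."""
    return world_outcome(None)


actual = self_fulfilling_prediction()
counterfactual = counterfactual_prediction()
print(f"actual-world fixed point:  {actual:.3f}")
print(f"counterfactual prediction: {counterfactual:.3f}")
```

In this toy world, a predictor of the actual world is accurate only when it announces the value that its own announcement brings about, whereas the counterfactual target does not depend on what is shown.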
Manipulating an agent’s beliefs
Low-impact agents
AI-created pseudo-deontology, Stuart Armstrong. Proposes to implement an agent \(A\) whose only task is to create an agent \(B\), whose utility function will be modified by some noise before \(B\) is run. Argues that this may lead \(A\) to create a \(B\) which “follows its motivation to some extent, but not to extreme amounts”, because \(A\) wants \(B\)’s behavior to be robust to this noise.
Restrictions that are hard to hack, Stuart Armstrong. Putting specific restrictions on an agent’s motivation is problematic as a safety technique, because the agent will usually be able to find unintended instantiations that satisfy the letter but not the spirit of the restriction. This post suggests that unintended instantiations are more informative about small changes in the restrictions they instantiate than intended instantiations, and suggests using this to make unintended instantiations less likely.
Creating a satisficer, Stuart Armstrong. Proposes a potential way of creating an agent that tries to act in a way that does well on one utility function \(u\) while trying to have little impact on many other utility functions \(v\).
Odds and ends
Resource gathering agent, Stuart Armstrong. Argues that if we take an arbitrary utility function \(u\), and build an agent that assigns 50% probability that it wants to maximize \(u\) and 50% probability that it wants to maximize \(-u\) (it will find out the truth tomorrow), then we get an agent that is purely interested in convergent instrumental goals like resource gathering. Suggests that if we could somehow “subtract off” such an agent from another agent, we could construct an agent that doesn’t try to follow convergent instrumental goals.
Acausal trade barriers, Stuart Armstrong. Suggests a technique similar to utility indifference which may disincentivize agents from acausally trading with each other.
Anti-Pascaline agents, Stuart Armstrong. Given a random variable \(X\), an event \(A\), and a small \(\varepsilon > 0\), define \({\overline p}_{\varepsilon}(X\mid A)\) to be such that conditional on \(A\), the probability of \(X \ge {\overline p}_{\varepsilon}(X\mid A)\) is \(\varepsilon\). Similarly define \({\underline p}_\varepsilon(X\mid A)\) by replacing \(\ge\) by \(\le\). Then define \(\mathbb{E}_\varepsilon[X\mid A] := \mathbb{E}[X'\mid A]\), where \(X'\) is \(X\) bounded to \([{\underline p}_\varepsilon(X\mid A),{\overline p}_\varepsilon(X\mid A)]\). Given a utility function \(u\), this post proposes taking the action \(a\) that maximizes \(\mathbb{E}_\varepsilon[u\mid a]\) as an “unprincipled” approach to dealing with Pascal’s Mugging.
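The bounded expectation \(\mathbb{E}_\varepsilon\) defined above can be approximated from samples, as in the minimal sketch below. It assumes the utility of each action is represented by a list of equally likely samples and uses empirical quantiles as stand-ins for the two cutoffs; the function name, the mugging payoffs, and the choice of \(\varepsilon = 0.01\) are all made up for illustration.

```python
def truncated_expectation(samples, eps):
    """Approximate E_eps[X | A] from equally likely samples of X given A.

    The cutoffs are empirical quantiles, so roughly a fraction eps of the
    samples lies at or beyond each cutoff. Requires 0 < eps < 0.5.
    """
    xs = sorted(samples)
    n = len(xs)
    k = int(eps * n)        # number of samples in each tail
    lower = xs[k]           # stands in for the lower cutoff
    upper = xs[n - 1 - k]   # stands in for the upper cutoff
    clamped = [min(max(x, lower), upper) for x in xs]
    return sum(clamped) / n


# Hypothetical mugging: paying costs 5, with a 1-in-1000 chance of a huge payoff.
actions = {
    "pay": [-5.0] * 999 + [1e9],
    "refuse": [0.0] * 1000,
}

plain = {name: sum(xs) / len(xs) for name, xs in actions.items()}
truncated = {name: truncated_expectation(xs, 0.01) for name, xs in actions.items()}

print("plain expectation picks:    ", max(plain, key=plain.get))
print("truncated expectation picks:", max(truncated, key=truncated.get))
```

Because the huge payoff has probability below \(\varepsilon\), clamping removes it, and the agent in this toy example no longer prefers to pay.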
 
 