Two problems with causal-counterfactual utility indifference
discussion post by Jessica Taylor 1150 days ago | Patrick LaVictoire, Stuart Armstrong and Vladimir Slepnev like this

These are problems both for utility indifference and the causal version of factoring out variables. Thanks to Eliezer for pointing out the second problem.

Suppose there are two utility functions, $$U_a$$ and $$U_b$$. There is some button; let $$B$$ be the event that a button is pressed. The button can only be pressed at some particular time $$t$$ (we can adapt this example to cases where the button can get pressed at multiple times). The AI should optimize $$U_b$$ if the button is pressed, and $$U_a$$ if the button is not pressed. For example, $$U_a$$ could be a paperclip-maximization objective, and $$U_b$$ could be a shutdown utility function.

Introduce a random variable $$X$$, the result of a roll of a 3-sided die. $$X$$ has 3 possible values: $$a$$, $$b$$, and $$\varnothing$$. It is $$a$$ with some small probability $$\epsilon$$, $$b$$ with probability $$\epsilon$$, and $$\varnothing$$ with probability $$1 - 2\epsilon$$.

If $$X = a$$, then an invisible mechanism will prevent the button from depressing. If $$X = b$$, then an invisible mechanism will automatically depress the button. If $$X = \varnothing$$, then neither mechanism activates, so the button is depressed or not according to the physical force applied to it. The AI may observe the button's state but not the value of $$X$$.

Now consider a utility function:

$U = q[X = a] U_a + (1 - q) [X = b] U_b$

for some constant $$q \in [0, 1]$$. Note that we recover a variant of utility indifference by setting $$q$$ very close to 1, and we recover factoring out variables by setting $$q$$ to be the actual probability of the button push $$B$$ (in a fixed-point manner). For now let’s say $$q$$ is set to $$1/2$$.
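As a concrete check, here is a minimal Monte Carlo sketch of this utility function. The numeric values of $$U_a$$, $$U_b$$, and $$\epsilon$$ are illustrative assumptions, not from the post; the point is that worlds with $$X = \varnothing$$ contribute nothing to $$U$$, so $$E[U] = \epsilon(q U_a + (1-q) U_b)$$ for fixed utility values:

```python
import random

EPS = 0.01   # assumed small probability, for illustration
Q = 0.5      # the post's choice q = 1/2

def roll_x(rng):
    """Three-sided die: 'a' w.p. eps, 'b' w.p. eps, None w.p. 1 - 2*eps."""
    r = rng.random()
    if r < EPS:
        return "a"
    if r < 2 * EPS:
        return "b"
    return None  # the 'empty' outcome, X = \varnothing

def utility(x, u_a, u_b):
    """U = q * [X=a] * U_a + (1-q) * [X=b] * U_b."""
    return Q * (x == "a") * u_a + (1 - Q) * (x == "b") * u_b

# Worlds with X = 'empty' contribute exactly zero utility:
assert utility(None, u_a=10.0, u_b=3.0) == 0.0

# Monte Carlo estimate of E[U] with illustrative U_a = 10, U_b = 3:
rng = random.Random(0)
n = 200_000
estimate = sum(utility(roll_x(rng), 10.0, 3.0) for _ in range(n)) / n
exact = EPS * (Q * 10.0 + (1 - Q) * 3.0)  # = 0.065
```

The estimate converges to $$\epsilon(q U_a + (1-q) U_b)$$, matching the counterfactual form of the objective given below.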

Note that $$X$$ simulates a causal counterfactual on $$B$$. Roughly, we could instead have written the objective as

$U = \epsilon(q E[U_a | do(\neg B)] + (1 - q) E[U_b | do(B)])$

What will an AI optimizing this utility function do? Until time $$t$$, it will optimize the average of $$U_a$$ and $$U_b$$; at time $$t$$ it will observe whether the button has been pressed, and thereafter optimize either $$U_a$$ or $$U_b$$ depending on the button's state. Note that it will attempt to acquire resources before time $$t$$ and delay important decisions until after time $$t$$, due to the value of information about the button. It has no incentive to press the button or to prevent it from being pressed: it only cares about worlds where $$X = a$$ or $$X = b$$, and in those worlds physical force applied to the button has no effect.
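The induced behavior can be sketched as a simple decision rule (a toy sketch with $$q = 1/2$$; the weights stand in for whatever planning the AI actually does):

```python
def objective_weights(button_observed):
    """Weights the AI places on (U_a, U_b), given what it knows.
    button_observed is None before time t, else True/False."""
    if button_observed is None:
        # Before t the AI cannot distinguish X = a from X = b, so it
        # optimizes the prior mixture q * U_a + (1-q) * U_b with q = 1/2.
        return (0.5, 0.5)
    if button_observed:
        # Button pressed: among the worlds it cares about, X = b, so U_b.
        return (0.0, 1.0)
    # Button not pressed: among the worlds it cares about, X = a, so U_a.
    return (1.0, 0.0)

assert objective_weights(None) == (0.5, 0.5)
assert objective_weights(True) == (0.0, 1.0)
assert objective_weights(False) == (1.0, 0.0)
```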

This has some desirable properties, but ultimately fails. Consider the following two problems:

# Problem 1: Bad bets

The AI only cares about worlds where $$X \neq \varnothing$$. Therefore, it will effectively “believe” that $$B$$ is uncorrelated with the humans’ intention to press the button, or with the force they apply to it. If it is offered bets that only make sense given this belief (say, by a nearby agent with more “conventional” beliefs), it will take them.
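The belief distortion can be computed directly. In this toy sketch (the joint model linking applied force to the button is an assumption for illustration), conditioning on the rare event $$X \neq \varnothing$$ makes $$B$$ independent of human behavior:

```python
EPS = 0.01  # assumed small probability, for illustration

# The three-sided die X from the post:
P_X = {"a": EPS, "b": EPS, "empty": 1 - 2 * EPS}

def button(x, force_applied):
    """The button's state B as a function of X and applied force."""
    if x == "a":
        return False          # invisible mechanism blocks the press
    if x == "b":
        return True           # invisible mechanism forces the press
    return force_applied      # X = 'empty': physical force decides

force = True  # suppose the humans apply force to the button

# Under the full distribution, B almost always matches the force:
p_match_real = sum(p for x, p in P_X.items() if button(x, force) == force)

# The AI's utility only weights X in {'a', 'b'}; renormalizing gives
# its effective beliefs, under which B is independent of the force:
p_cond = {x: P_X[x] / (2 * EPS) for x in ("a", "b")}
p_match_ai = sum(p for x, p in p_cond.items() if button(x, force) == force)

# p_match_real = 0.99 but p_match_ai = 0.5, so the AI would happily
# take better-than-even odds against the button matching human behavior.
```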

# Problem 2: Preventing resistance

Suppose that humans have some chance of trying to press the button. If the button fails to depress, then the humans will attempt to shut the AI down by other means (say, with bombs). This will somewhat reduce the AI’s ability to achieve its objective $$U_a$$.

The first half of the AI’s utility function, $$[X = a] U_a$$, specifies that the AI cares about optimizing $$U_a$$ in worlds where the button fails to depress (even if someone applies force to it). In some of these worlds, the humans tried (but failed) to press the button, and then afterwards tried to shut the AI down using bombs (reducing $$U_a$$). So the AI will prefer a policy that prevents the humans from interfering using bombs after the button fails to depress (say, by psychologically manipulating them).
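The incentive can be made concrete with a toy calculation (all numbers here are illustrative assumptions, not from the post):

```python
P_TRY = 0.5       # assumed chance the humans try to press the button
U_A_FREE = 10.0   # U_a achieved if the AI operates unimpeded
U_A_BOMBED = 4.0  # U_a achieved while humans resist with bombs

def expected_u_a_given_x_a(prevent_resistance):
    """E[U_a | X = a] under a policy that does or does not manipulate
    the humans into not escalating after the button fails to depress."""
    # If X = a, a press attempt always fails; the humans then escalate
    # unless the AI has preemptively prevented them from doing so.
    u_after_failed_press = U_A_FREE if prevent_resistance else U_A_BOMBED
    return P_TRY * u_after_failed_press + (1 - P_TRY) * U_A_FREE

# Preventing resistance strictly increases the AI's conditional payoff,
# so an optimizer of [X=a] * U_a prefers the manipulative policy:
assert expected_u_a_given_x_a(True) > expected_u_a_given_x_a(False)
```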

This seems quite bad. Hypothetically, you could tell the humans not to use bombs if the button fails to depress, but this solution seems both brittle and unrealistic.

# Discussion

I think both of these problems can be traced to the fact that, since the AI is ignoring most of the actual probability mass (where $$X = \varnothing$$), it has weird beliefs about the joint distribution of the button push and humans’ intentions/behavior. It seems like it would be better to have utility indifference solutions that avoid this problem by not using a causal counterfactual (or one simulated by a rare event), but I’m quite unsure about whether one exists.
