The Happy Dance Problem
post by Abram Demski 520 days ago | Scott Garrabrant and Stuart Armstrong like this | 1 comment

Since the invention of logical induction, people have been trying to figure out what logically updateless reasoning could be. This is motivated by the idea that, in the realm of Bayesian uncertainty (IE, empirical uncertainty), updateless decision theory is the simple solution to the problem of reflective consistency. Naturally, we’d like to import this success to logically uncertain decision theory.

At a research retreat during the summer, we realized that updateless decision theory wasn’t so easy to define even in the seemingly simple Bayesian case. A possible solution was written up in Conditioning on Conditionals. However, that didn’t end up being especially satisfying.

Here, I introduce the happy dance problem, which more clearly illustrates the difficulty in defining updateless reasoning in the Bayesian case. I also outline Scott’s current thoughts about the correct way of reasoning about this problem.

(Ideas here are primarily due to Scott.)

The Happy Dance Problem

Suppose an agent has some chance of getting a pile of money. In the case that the agent gets the pile of money, it has a choice: it can either do a happy dance, or not. The agent would rather not do the happy dance, as it is embarrassing.

I’ll write “you get a pile of money” as $$M$$, and “you do a happy dance” as $$H$$.

So, the agent has the following utility function:

• U(¬M) = $0
• U(M & ¬H) = $1000
• U(M & H) = $900

A priori, the agent assigns the following probabilities to events:

• P(¬M) = .5
• P(M & ¬H) = .1
• P(M & H) = .4

IE, the agent expects itself to do the happy dance.

Conditioning on Conditionals

In order to make an updateless decision, we need to condition on the policy of dancing, and on the policy of not dancing. How do we condition on a policy? We could change the problem statement by adding a policy variable and putting in the conditional probabilities of everything given the different policies, but this is just cheating: in order to fill in those conditional probabilities, you need to already know how to condition on a policy. (This simple trick seems to be what kept us from noticing that UDT isn’t so easy to define in the Bayesian setting for so long.)

A naive attempt would be to condition on the material conditional representing each policy, $$M \supset H$$ and $$M \supset \neg H$$. This gets the wrong answer. The material conditional simply rules out the one outcome inconsistent with the policy.

Conditioning on $$M \supset H$$, we get:

• P(¬M) = .555
• P(M & H) = .444

For an expected utility of $400.

Conditioning on $$M \supset \neg H$$, we get:

• P(¬M) = .833
• P(M & ¬H) = .166

For an expected utility of about $166.67.
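
The arithmetic for the two material conditionals can be checked with a short sketch. The prior and utilities are the ones given in the problem statement; the dictionary layout and the helper names `condition` and `expected_utility` are illustrative choices, not something from the original post.

```python
# Prior and utilities from the problem statement.
# Outcomes are keyed by (M, H); H is None when there is no money.
prior = {(False, None): 0.5, (True, False): 0.1, (True, True): 0.4}
utility = {(False, None): 0, (True, False): 1000, (True, True): 900}

def condition(dist, keep):
    """Bayesian conditioning: drop outcomes where keep(w) is false, then renormalize."""
    kept = {w: p for w, p in dist.items() if keep(w)}
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}

def expected_utility(dist):
    return sum(p * utility[w] for w, p in dist.items())

# M ⊃ H rules out only (M, ¬H); M ⊃ ¬H rules out only (M, H).
dance = condition(prior, lambda w: not (w[0] and w[1] is False))
no_dance = condition(prior, lambda w: not (w[0] and w[1] is True))

eu_dance = expected_utility(dance)
eu_no_dance = expected_utility(no_dance)
print(round(eu_dance, 2), round(eu_no_dance, 2))
```

Because each conditional removes a different amount of money-world probability mass (.1 versus .4), the renormalized chance of getting the money differs between the two policies, and that difference is what drives the result.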

So, to sum up, the agent thinks it should do the happy dance because refusing to do the happy dance makes worlds where it gets the money less probable. This doesn’t seem right.

Conditioning on Conditionals solved this by sending the probabilistic conditional P(H|M) to one or zero to represent the effect of a policy, rather than using the material conditional. However, this approach is unsatisfactory for a different reason.
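
As a rough sketch of that proposal, one can hold P(M) at its prior value of .5 and force P(H|M) to one or zero. The function name `policy_value` and the constant names are hypothetical; the assumption that P(M) is left untouched is taken from the description of the approach here, not from the Conditioning on Conditionals post itself.

```python
# Assumption: P(M) stays at its prior value while P(H | M) is forced to 0 or 1.
P_M = 0.5
U_NO_MONEY, U_MONEY_NO_DANCE, U_MONEY_DANCE = 0, 1000, 900

def policy_value(p_dance_given_money):
    """Expected utility when P(H | M) is set to the given value and P(M) is untouched."""
    eu_given_money = (p_dance_given_money * U_MONEY_DANCE
                      + (1 - p_dance_given_money) * U_MONEY_NO_DANCE)
    return P_M * eu_given_money + (1 - P_M) * U_NO_MONEY

eu_dance = policy_value(1.0)      # 0.5 * 900
eu_no_dance = policy_value(0.0)   # 0.5 * 1000
```

Under this reading, not dancing comes out ahead (500 versus 450), which is the intuitive answer for Happy Dance.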

Happy dance is similar to Newcomb’s problem with a transparent box (where Omega judges you on what you do when you see the full box): doing the dance is like one-boxing. Now, the correlation between doing the dance and getting the pile of money comes from Omega rather than just being part of an arbitrary prior. But, sending the conditional probability of one-boxing upon seeing the money to one doesn’t make the world where the pile of money appears any more probable. So, this version of updateless reasoning gets transparent-box Newcomb wrong. There isn’t enough information in the probability distribution to distinguish it from Happy Dance style problems.

Observation Counterfactuals

We can solve the problem in what seems like the right way by introducing a basic notion of counterfactual, which I’ll write $$\Box \mkern-7mu \rightarrow$$. This is supposed to represent “what the agent’s code will do on different inputs”. The idea is that if we have the policy of dancing when we see the money, $$M \Box \mkern-7mu \rightarrow H$$ is true even in the world where we don’t see any money. So, even if dancing upon seeing money is a priori probable, conditioning on not doing so knocks out just as much probability mass from non-money worlds as from money worlds. However, if a counterfactual $$A \Box \mkern-7mu \rightarrow B$$ is true and $$A$$ is true, then its consequent $$B$$ must also be true. So, conditioning on a policy does change the probability of taking actions in the expected way.

In Happy Dance, there is no correlation between $$M \Box \mkern-7mu \rightarrow H$$ and $$M$$; so, we can condition on $$M \Box \mkern-7mu \rightarrow H$$ and $$M \Box \mkern-7mu \rightarrow \neg H$$ to decide which policy is better, and get the result we expect. In Newcomb’s problem, on the other hand, there is a correlation between the policy chosen and whether the pile of money appears, because Omega is checking what the agent’s code does if it sees different inputs. This allows the decision theory to produce different answers in the different problems.
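
To illustrate how the two problems come apart, the sketch below treats the truth of $$M \Box \mkern-7mu \rightarrow H$$ as a policy variable and builds a joint distribution over the policy and $$M$$. Several numbers are assumptions made for illustration: the 0.8 prior on the dancing policy is chosen to match P(H|M) = .4/.5 from the Happy Dance prior, the 0.99 and 0.01 figures for Omega's accuracy are hypothetical, and reusing the Happy Dance utilities for the Newcomb variant is a simplification. The function names are likewise not from the original post.

```python
U_NO_MONEY, U_MONEY_NO_DANCE, U_MONEY_DANCE = 0, 1000, 900

def joint(p_dance_policy, p_money_given_policy):
    """Joint distribution over (policy, M); policy is True when M □→ H holds."""
    dist = {}
    for policy, p_policy in ((True, p_dance_policy), (False, 1 - p_dance_policy)):
        p_m = p_money_given_policy[policy]
        dist[(policy, True)] = p_policy * p_m
        dist[(policy, False)] = p_policy * (1 - p_m)
    return dist

def value_of_policy(dist, policy):
    """Expected utility after conditioning on the counterfactual (the policy)."""
    p_policy = sum(p for (pol, _), p in dist.items() if pol == policy)
    p_money = dist[(policy, True)] / p_policy
    # If the counterfactual and M both hold, the action follows the policy.
    u_money = U_MONEY_DANCE if policy else U_MONEY_NO_DANCE
    return p_money * u_money + (1 - p_money) * U_NO_MONEY

# Happy Dance: the policy is independent of M.
happy_dance = joint(0.8, {True: 0.5, False: 0.5})
# Transparent Newcomb: Omega makes M track the policy (hypothetical accuracy).
newcomb = joint(0.8, {True: 0.99, False: 0.01})

for name, dist in (("happy dance", happy_dance), ("newcomb", newcomb)):
    print(name, value_of_policy(dist, True), value_of_policy(dist, False))
```

With these assumed numbers, conditioning on the policy favors not dancing in Happy Dance and favors dancing (one-boxing) in the Newcomb variant, even though the marginal distribution over $$M$$ and $$H$$ alone could not tell the two problems apart.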

It’s not clear where the beliefs about this correlation come from, so these counterfactuals are still almost as mysterious as explicitly giving conditional probabilities for everything given different policies. However, it does seem to say something nontrivial about the structure of reasoning.

Also, note that these counterfactuals are in the opposite direction from what we normally think about: rather than the counterfactual consequences of actions we didn’t take, now we need to know the counterfactual actions we’d take under outcomes we didn’t see!

by Wei Dai 519 days ago | Scott Garrabrant likes this

> We can solve the problem in what seems like the right way by introducing a basic notion of counterfactual, which I’ll write □→. This is supposed to represent “what the agent’s code will do on different inputs”. The idea is that if we have the policy of dancing when we see the money, M□→H is true even in the world where we don’t see any money.

(I’m confused about why this notation needs to be introduced. I haven’t been following all the DT discussions super closely, so I’d appreciate if someone could catch me up. Or, since I’m visiting MIRI soon, perhaps someone could catch me up in person.)

In the language of my original UDT post, I would have written this as S(‘M’)=‘H’, where S is the agent’s code (M and H in quotes here to denote that they’re input/output strings rather than events). This is a logical statement about the output of S given ‘M’ as input, which I had conjectured could be conditioned on the same way we’d condition on any other logical statement (once we have a solution to logical uncertainty). Of course, issues like Agent Simulates Predictor have since come up, so is this new idea/notation an attempt to solve some of those issues? Can you explain what advantages this notation has over the S(‘M’)=‘H’ type of notation?

> It’s not clear where the beliefs about this correlation come from, so these counterfactuals are still almost as mysterious as explicitly giving conditional probabilities for everything given different policies.

Intuitively, it comes from the fact that there’s a chunk of computation in Omega that’s analyzing S, which should be logically correlated with S’s actual output. Again, this was a guess of what a correct solution to logical uncertainty would say when you run the math. (Now that we have logical induction, can we tell if it actually says this?)
