Intelligent Agent Foundations Forum
The Happy Dance Problem
post by Abram Demski 30 days ago | Scott Garrabrant and Stuart Armstrong like this | 1 comment

Since the invention of logical induction, people have been trying to figure out what logically updateless reasoning could be. This is motivated by the idea that, in the realm of Bayesian uncertainty (i.e., empirical uncertainty), updateless decision theory is the simple solution to the problem of reflective consistency. Naturally, we’d like to import this success to logically uncertain decision theory.

At a research retreat during the summer, we realized that updateless decision theory wasn’t so easy to define even in the seemingly simple Bayesian case. A possible solution was written up in Conditioning on Conditionals. However, that didn’t end up being especially satisfying.

Here, I introduce the happy dance problem, which more clearly illustrates the difficulty in defining updateless reasoning in the Bayesian case. I also outline Scott’s current thoughts about the correct way of reasoning about this problem.


(Ideas here are primarily due to Scott.)

The Happy Dance Problem

Suppose an agent has some chance of getting a pile of money. In the case that the agent gets the pile of money, it has a choice: it can either do a happy dance, or not. The agent would rather not do the happy dance, as it is embarrassing.

I’ll write “you get a pile of money” as \(M\), and “you do a happy dance” as \(H\).

So, the agent has the following utility function:

  • U(¬M) = $0
  • U(M & ¬H) = $1000
  • U(M & H) = $900

A priori, the agent assigns the following probabilities to events:

  • P(¬M) = .5
  • P(M & ¬H) = .1
  • P(M & H) = .4

I.e., the agent expects itself to do the happy dance: conditional on getting the money, it dances with probability P(H|M) = .4/.5 = .8.

Conditioning on Conditionals

In order to make an updateless decision, we need to condition on the policy of dancing, and on the policy of not dancing. How do we condition on a policy? We could change the problem statement by adding a policy variable and putting in the conditional probabilities of everything given the different policies, but this is just cheating: in order to fill in those conditional probabilities, you need to already know how to condition on a policy. (This simple trick seems to be what kept us from noticing that UDT isn’t so easy to define in the Bayesian setting for so long.)

A naive attempt would be to condition on the material conditional representing each policy, \(M \supset H\) and \(M \supset \neg H\). This gets the wrong answer. The material conditional simply rules out the one outcome inconsistent with the policy.

Conditioning on \(M \supset H\), we get:

  • P(¬M) = 5/9 ≈ .556
  • P(M & H) = 4/9 ≈ .444

This gives an expected utility of 4/9 × $900 = $400.

Conditioning on \(M \supset \neg H\), we get:

  • P(¬M) = 5/6 ≈ .833
  • P(M & ¬H) = 1/6 ≈ .167

This gives an expected utility of 1/6 × $1000 ≈ $166.67.

So, to sum up, the agent thinks it should do the happy dance because refusing to do the happy dance makes worlds where it gets the money less probable. This doesn’t seem right.
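As a quick check on the arithmetic, here is a minimal sketch that conditions the prior on each material conditional and recomputes the expected utility. The world labels and the helper names `condition` and `expected_utility` are mine, introduced only for illustration; the probabilities and utilities are the ones given in the problem statement.

```python
from fractions import Fraction

# Prior and utilities from the problem statement (world labels are illustrative).
PRIOR = {"~M": Fraction(5, 10), "M&~H": Fraction(1, 10), "M&H": Fraction(4, 10)}
UTILITY = {"~M": 0, "M&~H": 1000, "M&H": 900}

def condition(dist, excluded):
    """Drop the single excluded world and renormalize the rest."""
    kept = {w: p for w, p in dist.items() if w != excluded}
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}

def expected_utility(dist):
    """Probability-weighted sum of utilities over the remaining worlds."""
    return sum(p * UTILITY[w] for w, p in dist.items())

# The material conditional M > H rules out only "M&~H";
# the material conditional M > ~H rules out only "M&H".
dance = condition(PRIOR, "M&~H")
no_dance = condition(PRIOR, "M&H")

eu_dance = expected_utility(dance)
eu_no_dance = expected_utility(no_dance)
print(float(eu_dance), float(eu_no_dance))
```

Because each conditional removes mass only from money worlds, the policy that removes less of it (dancing) ends up looking better, which is exactly the artifact described above.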

Conditioning on Conditionals solved this by sending the probabilistic conditional P(H|M) to one or zero to represent the effect of a policy, rather than using the material conditional. However, this approach is unsatisfactory for a different reason.
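For comparison, here is a rough sketch of that alternative on the Happy Dance numbers. It assumes, as my reading of the approach, that P(M) stays fixed at its prior value of .5 while P(H|M) is set to one or zero; the function name `eu_given_policy` is hypothetical.

```python
from fractions import Fraction

P_MONEY = Fraction(1, 2)  # prior P(M); assumed to be left untouched by the policy

def eu_given_policy(p_dance_given_money):
    """Expected utility after setting P(H|M) to the given value."""
    p_money_and_dance = P_MONEY * p_dance_given_money
    p_money_no_dance = P_MONEY * (1 - p_dance_given_money)
    p_no_money = 1 - P_MONEY
    return p_no_money * 0 + p_money_no_dance * 1000 + p_money_and_dance * 900

eu_dance = eu_given_policy(1)     # policy: dance whenever the money appears
eu_no_dance = eu_given_policy(0)  # policy: never dance
print(float(eu_dance), float(eu_no_dance))
```

Under this assumption, not dancing comes out ahead ($500 against $450), which is the intuitively right answer for Happy Dance.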

Happy Dance is similar to Newcomb’s problem with a transparent box (where Omega judges you on what you do when you see the full box): doing the dance is like one-boxing. The difference is that in Newcomb’s problem the correlation between doing the dance and getting the pile of money comes from Omega rather than just being part of an arbitrary prior. However, sending the conditional probability of one-boxing upon seeing the money to one doesn’t make the world where the pile of money appears any more probable. So, this version of updateless reasoning gets transparent-box Newcomb wrong. There isn’t enough information in the probability distribution to distinguish it from Happy Dance style problems.

Observation Counterfactuals

We can solve the problem in what seems like the right way by introducing a basic notion of counterfactual, which I’ll write \(\Box \mkern-7mu \rightarrow\). This is supposed to represent “what the agent’s code will do on different inputs”. The idea is that if we have the policy of dancing when we see the money, \(M \Box \mkern-7mu \rightarrow H\) is true even in the world where we don’t see any money. So, even if dancing upon seeing money is a priori probable, conditioning on not doing so knocks out just as much probability mass from non-money worlds as from money worlds. However, if a counterfactual \(A \Box \mkern-7mu \rightarrow B\) is true and \(A\) is true, then its consequent \(B\) must also be true. So, conditioning on a policy does change the probability of taking actions in the expected way.

In Happy Dance, there is no correlation between \(M \Box \mkern-7mu \rightarrow H\) and \(M\); so, we can condition on \(M \Box \mkern-7mu \rightarrow H\) and \(M \Box \mkern-7mu \rightarrow \neg H\) to decide which policy is better, and get the result we expect. In Newcomb’s problem, on the other hand, there is a correlation between the policy chosen and whether the pile of money appears, because Omega is checking what the agent’s code does if it sees different inputs. This allows the decision theory to produce different answers in the different problems.
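To make the contrast concrete, here is a small sketch with a joint distribution over (policy, money) pairs, where "policy" stands for which of the two counterfactuals holds. Several numbers are assumptions of mine rather than part of the problem: the .8 prior on the dancing policy (chosen so the Happy Dance joint reproduces the original prior), Omega’s 99% accuracy, and reusing the $900/$1000 payoffs for the Newcomb variant. The helper names are hypothetical too.

```python
from fractions import Fraction

def make_joint(p_dance_policy, p_money_if_dance, p_money_if_refuse):
    """Joint distribution over (policy, money) pairs.

    The policy is defined in every world, whether or not the money appears.
    """
    joint = {}
    for policy, p_policy, p_money in (
        ("dance", p_dance_policy, p_money_if_dance),
        ("refuse", 1 - p_dance_policy, p_money_if_refuse),
    ):
        joint[(policy, True)] = p_policy * p_money
        joint[(policy, False)] = p_policy * (1 - p_money)
    return joint

def expected_utility(joint, policy):
    """Condition on the policy, then score the resulting worlds."""
    p_policy = joint[(policy, True)] + joint[(policy, False)]
    p_money = joint[(policy, True)] / p_policy
    # When the antecedent M is true, the counterfactual forces the action.
    payoff_with_money = 900 if policy == "dance" else 1000
    return p_money * payoff_with_money  # the no-money world pays $0

half = Fraction(1, 2)
# Happy Dance: money is independent of the policy.
happy_dance = make_joint(Fraction(4, 5), half, half)
# Transparent Newcomb (assumed 99% accurate Omega): money tracks the policy.
newcomb = make_joint(Fraction(4, 5), Fraction(99, 100), Fraction(1, 100))

for name, joint in (("Happy Dance", happy_dance), ("Newcomb", newcomb)):
    for policy in ("dance", "refuse"):
        print(name, policy, float(expected_utility(joint, policy)))
```

With these numbers, conditioning on the policy leaves P(M) at .5 in Happy Dance, so refusing wins, while in the Newcomb variant the dancing policy wins because the money itself depends on it.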

It’s not clear where the beliefs about this correlation come from, so these counterfactuals are still almost as mysterious as explicitly giving conditional probabilities for everything given different policies. However, it does seem to say something nontrivial about the structure of reasoning.

Also, note that these counterfactuals are in the opposite direction from what we normally think about: rather than the counterfactual consequences of actions we didn’t take, now we need to know the counterfactual actions we’d take under outcomes we didn’t see!



by Wei Dai 29 days ago | Scott Garrabrant likes this

> We can solve the problem in what seems like the right way by introducing a basic notion of counterfactual, which I’ll write □→. This is supposed to represent “what the agent’s code will do on different inputs”. The idea is that if we have the policy of dancing when we see the money, M□→H is true even in the world where we don’t see any money.

(I’m confused about why this notation needs to be introduced. I haven’t been following all the DT discussions super closely, so I’d appreciate if someone could catch me up. Or, since I’m visiting MIRI soon, perhaps someone could catch me up in person.)

In the language of my original UDT post, I would have written this as S(‘M’)=‘H’, where S is the agent’s code (M and H in quotes here to denote that they’re input/output strings rather than events). This is a logical statement about the output of S given ‘M’ as input, which I had conjectured could be conditioned on the same way we’d condition on any other logical statement (once we have a solution to logical uncertainty). Of course, issues like Agent Simulates Predictor have since come up, so is this new idea/notation an attempt to solve some of those issues? Can you explain what advantages this notation has over the S(‘M’)=‘H’ type of notation?

> It’s not clear where the beliefs about this correlation come from, so these counterfactuals are still almost as mysterious as explicitly giving conditional probabilities for everything given different policies.

Intuitively, it comes from the fact that there’s a chunk of computation in Omega that’s analyzing S, which should be logically correlated with S’s actual output. Again, this was a guess of what a correct solution to logical uncertainty would say when you run the math. (Now that we have logical induction, can we tell if it actually says this?)
