Intelligent Agent Foundations Forum
Where does ADT Go Wrong?
discussion post by Abram Demski 274 days ago | Jack Gallagher and Jessica Taylor like this | 1 comment

The main success of asymptotic decision theory (ADT) is that it correctly solves Agent Simulates Predictor (ASP), as well as a swath of related problems. This is a challenging problem. Unfortunately, ADT hasn’t generalized well to other problems. It seems like perhaps the property which allows it to get ASP right is precisely the property which blocks it from generalizing further.

There are two algorithms in ADT. The first, SADT, is given a decision problem in functional form so that counterfactuals are well-defined. This is called an embedder. SADT has a set of strategies which it can choose from, and it chooses whichever strategy gets the highest expectation when plugged into the embedder. This can be thought of as imitating whichever agent does best on the problem.
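As a rough sketch in pseudocode (my own rendering; `strategies`, `embedder`, and `expect` are stand-ins rather than the paper’s exact objects), SADT is essentially an argmax over its strategy set:

    # A minimal sketch of SADT's choice rule (illustrative, not the paper's exact algorithm).
    # `strategies` is the finite set of policies SADT can imitate, `embedder` maps a policy to
    # the outcome of plugging it into the problem, and `expect` is a stand-in for however the
    # agent estimates expected utility.
    def sadt(strategies, embedder, expect):
        # Imitate whichever strategy is estimated to do best inside the embedder.
        return max(strategies, key=lambda s: expect(embedder(s)))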

The second algorithm, DDT, is for learning which embedder to give SADT. DDT has a set of possible embedders, and chooses optimistically from among the embedders which it thinks could be equivalent to the true decision problem it faces. This way, it rules out overly optimistic embedders and settles on the best of the embedders which are equivalent to the true problem.
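In the same hedged style, DDT’s embedder selection can be sketched as: keep only the embedders believed to match the true problem, then take the most optimistic survivor. (`looks_equivalent_to_reality` and `best_value` are stand-ins for DDT’s actual equivalence test and value estimate.)

    # A minimal sketch of DDT's embedder selection (illustrative only).
    def ddt_pick_embedder(embedders, looks_equivalent_to_reality, best_value):
        # Keep the embedders the agent believes are equivalent to the true decision problem.
        candidates = [e for e in embedders if looks_equivalent_to_reality(e)]
        # Optimistic choice: the surviving embedder that promises the most.
        return max(candidates, key=best_value)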

The main sign of a problem with SADT is that it will choose to crash into itself in a game of chicken. Consider the embedder which works as follows: it inserts the chosen policy into the slot of player 1. Player 2 is a new instance of SADT, which looks at the problem via an embedder in which it plays the role of player 2 against player 1’s strategy. If the strategy being tried for player 1 always goes straight rather than swerving, then SADT expects player 2 to learn to swerve – that is what SADT would do if it saw itself up against a go-straight bot. Unfortunately, the real player 2 is thinking the same thing. So, both players go straight and crash into each other: the worst possible outcome.
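For concreteness, here is a standard chicken payoff matrix (row player’s payoff listed first; the particular numbers are illustrative, not taken from the post). The only feature that matters is that both players going straight is the worst cell for everyone:

\[
\begin{array}{c|cc}
 & \text{Swerve} & \text{Straight} \\
\hline
\text{Swerve} & (0,\ 0) & (-1,\ +1) \\
\text{Straight} & (+1,\ -1) & (-10,\ -10)
\end{array}
\]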

In short, SADT thinks that it moves first in games against other SADT agents. Since each copy thinks it moves first, the outcome can be very bad.

The simple analysis of what went wrong is that this game is outside of the optimality conditions for SADT. That’s not very interesting, though. Why doesn’t SADT work well outside those conditions?

One possible analysis of what goes wrong is that SADT thinks it can get the outcome of any other agent just by copying that agent’s strategy. Unlike SADT, LIDT (logical inductor decision theory) considers what happens if LIDT itself took a specified action. We don’t have many optimality results for LIDT, because this is hard to analyze – it relies on reasoning about counterfactual occurrences based on history. SADT instead uses a well-defined notion of counterfactual, but seemingly at the cost of a poor fit to reality.

If this is true, any fix to ADT would likely stop solving ASP. ADT gets ASP right because it thinks it’ll get the reward which other agents get if it switches to those other agents’ strategies. Switching to a strategy which always two-boxes looks like a bad idea, because the predictor would not be giving money to that other agent. If ADT were thinking more realistically, it would reason like LIDT, seeing that it gets more money on this particular instance by two-boxing. But, of course, this “realistic” thinking makes it get less money in the end.
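To make this concrete with the usual Newcomb-style numbers ($1,000,000 in the opaque box if one-boxing is predicted, $1,000 in the transparent box; these figures are the conventional ones, not from the post): under ADT’s embedder the predictor’s behavior moves with the strategy being tried, while the “realistic” view holds the already-made prediction fixed on this instance.

    # Illustrative Newcomb/ASP payoffs (conventional numbers, not taken from the post).
    def adt_view(strategy):
        # Under the embedder, the predictor responds to whichever strategy is plugged in.
        big_box = 1_000_000 if strategy == "one-box" else 0
        return big_box if strategy == "one-box" else big_box + 1_000

    def realistic_view(strategy, big_box_filled=True):
        # LIDT-style view: on this particular instance the prediction is already fixed.
        big_box = 1_000_000 if big_box_filled else 0
        return big_box if strategy == "one-box" else big_box + 1_000

    # adt_view:       one-box -> 1,000,000   two-box -> 1,000      (switching to two-boxing looks bad)
    # realistic_view: one-box -> 1,000,000   two-box -> 1,001,000  (two-boxing looks better here and now)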

I don’t think this analysis is quite right, though. It’s true that the example with the game of chicken goes wrong because SADT thinks it can do as well as some other agent by copying that agent’s strategy, but that isn’t SADT’s fault. It’s only using the counterfactual it is given.

So, my analysis of where SADT goes wrong in this example is that the embedder is wrong.

Consider three possible embedders:

  • Copy: Takes an agent as input, and uses that agent as both player 1 and player 2.
  • Spoofer: Takes the input agent as player 1. Player 2 is an ADT agent who thinks it is playing against the input agent.
  • Delusion: Takes the input agent as player 1. Player 2 is an ADT agent who thinks it is playing against an ADT agent.

“Spoofer” is the embedder which I described before. It “spoofs” the other player into thinking it is up against whatever strategy it’s trying out. “Delusion”, on the other hand, has a player 2 who thinks it’s up against another SADT agent no matter who it’s actually up against. (I didn’t come up with the name for the delusion embedder – I would have called it the real embedder. I preserved the name choice here to represent the fact that it’s very debatable which embedder better represents the real situation.)
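In the same pseudocode style as before, here is a purely structural sketch of the three embedders (my own rendering; `chicken` and `adt_agent` are unimplemented stand-ins, and the only point is where the candidate strategy gets plugged in and what player 2 believes it is facing):

    # Structural sketch only. `chicken(p1, p2)` stands for "play chicken and return player 1's
    # payoff"; `adt_agent(believed_opponent)` stands for "an ADT player modelling its opponent
    # as `believed_opponent`". Neither is implemented here.
    def chicken(p1, p2): ...
    def adt_agent(believed_opponent): ...

    def copy_embedder(strategy):
        # Both player slots are filled by the strategy being tried.
        return chicken(strategy, strategy)

    def spoofer_embedder(strategy):
        # Player 2 is an ADT agent spoofed into thinking it faces the strategy being tried.
        return chicken(strategy, adt_agent(believed_opponent=strategy))

    def delusion_embedder(strategy):
        # Player 2 is an ADT agent that thinks it faces an ADT agent, whoever player 1 really is.
        return chicken(strategy, adt_agent(believed_opponent="an ADT agent"))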

If SADT used the copy or delusion embedder, it would not crash into itself in a game of chicken. So why did I say ADT crashes into itself?

DDT learns to use the spoofer. It considers all three embedders to be equivalent to the real situation, because if you feed in the ADT agent itself, you recover the true scenario. So, it makes the most optimistic choice among these – which is the spoofer.

So, it seems like DDT is to blame. It’s not totally clear what part of DDT to blame. Maybe the optimistic choice doesn’t make as much sense as it seems. But, I claim that DDT’s reality filter is faulty. It checks that the embedder amounts to reality if you feed in the DDT agent. This doesn’t seem like the right check to perform. I think it should be requiring that the embedder amounts to reality if you feed in the policy which will be selected in the end. This would rule out the spoofing embedder in a game of chicken, since the policy which looks best according to that embedder implies more reward than you can really get. Furthermore, it seems like this is the right reason to rule out that embedder.
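A hedged sketch of the difference between the two checks (again my notation; `matches_reality` stands in for whatever equivalence test the learner actually uses):

    # Illustrative contrast between the two reality filters.
    def current_reality_check(embedder, agent, matches_reality):
        # DDT's check: the embedder must reproduce reality when the agent itself is fed in.
        return matches_reality(embedder(agent))

    def proposed_reality_check(embedder, strategies, expect, matches_reality):
        # Proposed check: the embedder must reproduce reality when fed the policy
        # that would actually be selected under it.
        selected = max(strategies, key=lambda s: expect(embedder(s)))
        return matches_reality(embedder(selected))

    # In the chicken example, the spoofer passes the first check (feeding in the agent itself
    # recovers the real game) but fails the second: the go-straight policy it favors promises
    # more reward than the real game delivers.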

I think such a modified reality check might just turn it into an LIDT-like thing. I haven’t figured out very much about it, though.

As you can see, many of the explanations here were sketchy. I’ve omitted some details, especially with respect to the optimality conditions of ADT. I encourage you to treat this as pointers to arguments which might be made, and think through what’s going on with ADT yourself.



by Jack Gallagher 260 days ago

When considering an embedder \(F\), in universe \(U\), in response to which SADT picks policy \(\pi\), I would be tempted to apply the following coherence condition:

\[ E[F(\pi)] = E[F(DDT)] = E[U] \]

(all approximately of course)

I’m not sure if this would work though. This is definitely a necessary condition for reasonable counterfactuals, but not obviously sufficient.

A potentially useful augmentation is to use the expected absolute difference: \[E[|F(\pi) - F(DDT)|] = E[|F(DDT) - U|] = 0\]



