Anything you can do with n AIs, you can do with two (with directly opposed objectives)

post by Jessica Taylor 1024 days ago | Patrick LaVictoire and Stuart Armstrong like this | 2 comments

Summary: For any normal-form game, it’s possible to cast the problem of finding a correlated equilibrium in this game as a 2-player zero-sum game. This seems useful because zero-sum games are easy to analyze and more resistant to collusion.

Consider the following class of games (equivalent to the class of normal-form games):

- There are $$n$$ players.
- Player $$i$$’s action set is $$\mathcal{A}_i$$.
- After each player $$i$$ chooses their action $$A_i$$, the state $$X$$ results from these actions (perhaps stochastically), and each player $$i$$ receives utility $$U_i(X)$$.

Often, we are interested in finding a Nash equilibrium of a game like this. One strategy for this is to instantiate the $$n$$ players as agents. However, this could cause collusion; see here and here for previous writing on collusion.

Zero-sum games seem more resistant to collusion (although maybe not 100% resistant). Additionally, 2-player zero-sum games are typically easier to reason about than $$n$$-player games. So we might be interested in finding a Nash equilibrium using a zero-sum game.

I don’t actually know how to find a mixed Nash equilibrium, so instead I’ll present a strategy for finding a correlated equilibrium (correlated equilibria form a superset of the Nash equilibria and are computationally easier to find). Here’s how it works:

1. The actor chooses actions $$A_1, ..., A_n$$.
2. The critic chooses a player index $$I$$, observes the action $$A_I$$, and suggests an alternative action $$A_I'$$.
3. Flip a fair coin. If it comes up heads, observe the state $$X$$ that results from actions $$A_1, ..., A_n$$, and give the actor utility $$U_I(X)$$. If it comes up tails, observe the state $$X$$ that results from actions $$A_1, ..., A_{I-1}, A_I', A_{I+1}, ..., A_n$$, and give the actor utility $$-U_I(X)$$.
Either way, the critic’s utility is the negation of the actor’s utility.

It will be useful to use the concept of an $$\epsilon$$-correlated equilibrium. While a correlated equilibrium is a joint distribution where no player can gain any expected utility by strategy modification, an $$\epsilon$$-correlated equilibrium is one where no player can gain more than $$\epsilon$$ expected utility by strategy modification.

Note that the critic’s policies correspond to mixtures of strategy modifications; the critic can be seen as jointly picking a player $$I$$ and a strategy modification $$\phi : \mathcal{A}_I \rightarrow \mathcal{A}_I$$ for the player. Furthermore, the critic’s expected utility is half the expected utility gained by the corresponding player for the average strategy modification in this mixture:

$\text{expected critic utility} = \frac{1}{2}\left(\mathbb{E}[U_I(X) | \text{tails}] - \mathbb{E}[U_I(X) | \text{heads}]\right)$

This holds because the critic’s expected utility is half the difference between player $$I$$’s expected utility given strategy modification ($$\text{tails}$$) and player $$I$$’s expected utility given no strategy modification ($$\text{heads}$$).

Some facts result:

- Suppose the actor chooses $$A_1, ..., A_n$$ from some joint distribution that is an $$\epsilon$$-correlated equilibrium of the original game. Then the actor’s expected utility is at least $$-\epsilon/2$$ regardless of the critic’s policy.
- Suppose the actor chooses $$A_1, ..., A_n$$ from some joint distribution that is not an $$\epsilon$$-correlated equilibrium of the original game. Then the critic’s best response results in a utility of less than $$-\epsilon/2$$ for the actor.

Correlated equilibria always exist, so by the first fact (with $$\epsilon = 0$$) the actor can guarantee expected utility 0, and by the second fact any distribution that is not a correlated equilibrium does strictly worse against the critic’s best response. Hence, at a Nash equilibrium in the zero-sum game, the actor always outputs a correlated equilibrium and gets expected utility 0.
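The two facts above can be checked numerically on a small example. The following is a minimal sketch, not part of the original post: it assumes a deterministic state (the state $$X$$ is just the action profile), and the function name `critic_best_response_value`, the payoff table `CHICKEN`, and the specific payoff numbers are illustrative choices. It computes the critic’s best-response expected utility against a given joint distribution, using the fact that the best strategy modification $$\phi$$ can be chosen separately for each observed action.

```python
def critic_best_response_value(joint, payoffs, action_sets):
    """Critic's best-response expected utility against the actor's joint distribution.

    joint:       dict mapping action profile (tuple) -> probability
    payoffs:     dict mapping action profile -> tuple of per-player utilities
                 (assumption: the state X is deterministic and equals the profile)
    action_sets: list of per-player action lists
    """
    best = 0.0  # the identity modification always gives the critic exactly 0
    for i, actions in enumerate(action_sets):
        gain_i = 0.0  # player i's expected gain under the best modification phi
        for a in actions:
            # phi can be chosen pointwise: pick the best replacement b for each
            # observed action a, summing over profiles where player i plays a.
            def gain(b):
                total = 0.0
                for profile, prob in joint.items():
                    if profile[i] != a:
                        continue
                    deviated = profile[:i] + (b,) + profile[i + 1:]
                    total += prob * (payoffs[deviated][i] - payoffs[profile][i])
                return total
            gain_i += max(gain(b) for b in actions)
        best = max(best, gain_i)
    # The fair coin halves the gain: critic utility = (E[U_I | tails] - E[U_I | heads]) / 2
    return best / 2.0


def is_eps_correlated_equilibrium(joint, payoffs, action_sets, eps):
    """True if no player can gain more than eps by strategy modification."""
    return 2.0 * critic_best_response_value(joint, payoffs, action_sets) <= eps


# Illustrative game of Chicken (payoff numbers are an assumption of this sketch).
DARE, SWERVE = 0, 1
ACTIONS = [[DARE, SWERVE], [DARE, SWERVE]]
CHICKEN = {
    (DARE, DARE): (0, 0),
    (DARE, SWERVE): (7, 2),
    (SWERVE, DARE): (2, 7),
    (SWERVE, SWERVE): (6, 6),
}

# A correlated equilibrium of this game: uniform over the three non-crash profiles.
third = 1.0 / 3.0
ce = {(DARE, SWERVE): third, (SWERVE, DARE): third, (SWERVE, SWERVE): third}

# Not an equilibrium: both always swerve, so either player gains 1 by daring.
all_swerve = {(SWERVE, SWERVE): 1.0}

print(critic_best_response_value(ce, CHICKEN, ACTIONS))
print(critic_best_response_value(all_swerve, CHICKEN, ACTIONS))
```

Against `ce` the critic’s best response earns 0, so the actor’s expected utility is 0. Against `all_swerve` the critic earns 1/2, so the actor gets $$-1/2$$, matching the second fact for any $$\epsilon < 1$$.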
Perhaps in real life, it is inconvenient to observe the state $$X$$ resulting from actions $$A_1, ..., A_{I-1}, A_I', A_{I+1}, ..., A_n$$, because we can only observe the state by outputting actions, and maybe we always want to output actions from a correlated equilibrium. In this case we could use counterfactual oversight to usually output $$A_1, ..., A_n$$, but run the procedure above occasionally to gather training data. It’s not clear when it’s acceptable to occasionally output strategy-modified action profiles (instead of action profiles from a correlated equilibrium).
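One possible reading of this arrangement can be sketched as follows. This is an illustration under stated assumptions, not the post’s specification: counterfactual oversight is modeled here as auditing each round independently with a fixed probability `audit_prob`, and the names `play_round`, `pick_player`, `suggest`, and `outcome` are hypothetical. On an unaudited round the actor’s profile is output unchanged and no training signal is produced; on an audited round the actor-critic procedure (critic’s choice, fair coin, signed utility) is run.

```python
import random


def play_round(actor, pick_player, suggest, outcome, utilities, audit_prob, rng):
    """Play one round; returns (profile actually output, training signal or None).

    actor(rng)        -> tuple of actions (A_1, ..., A_n)
    pick_player(rng)  -> player index I chosen by the critic
    suggest(i, a_i)   -> alternative action A_I' proposed by the critic
    outcome(profile)  -> state X resulting from the profile
    utilities[i](x)   -> U_i(X)
    audit_prob        -> probability of running the procedure (an assumption of this sketch)
    """
    actions = tuple(actor(rng))
    if rng.random() >= audit_prob:
        # Ordinary round: output the actor's profile, gather no training data.
        return actions, None

    i = pick_player(rng)
    alternative = suggest(i, actions[i])
    if rng.random() < 0.5:
        # Heads: play the actor's profile; the actor receives U_I(X).
        played = actions
        actor_utility = utilities[i](outcome(played))
    else:
        # Tails: play the strategy-modified profile; the actor receives -U_I(X).
        played = actions[:i] + (alternative,) + actions[i + 1:]
        actor_utility = -utilities[i](outcome(played))
    # The critic's utility is the negation of the actor's utility.
    return played, {"player": i, "actor": actor_utility, "critic": -actor_utility}


# Tiny illustrative setup: two players, deterministic state equal to the profile.
PAYOFFS = {(0, 0): (0, 0), (0, 1): (7, 2), (1, 0): (2, 7), (1, 1): (6, 6)}
UTILITIES = [lambda x: PAYOFFS[x][0], lambda x: PAYOFFS[x][1]]


def demo_actor(rng):
    return (1, 1)          # actor always proposes the profile (1, 1)


def demo_pick(rng):
    return 0               # critic always audits player 0


def demo_suggest(i, a_i):
    return 0               # critic always suggests action 0 instead


rng = random.Random(0)
never = play_round(demo_actor, demo_pick, demo_suggest, lambda p: p, UTILITIES, 0.0, rng)
always = play_round(demo_actor, demo_pick, demo_suggest, lambda p: p, UTILITIES, 1.0, rng)
```

Note that only a tails outcome on an audited round outputs a strategy-modified profile, so in this sketch such profiles appear with probability `audit_prob / 2` per round.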

by Stuart Armstrong 1018 days ago

We have to be careful that the game is really zero-sum. Some setups with reward signals seem zero-sum, but could become positive-sum if the AIs hack them.

by Jessica Taylor 1018 days ago

This is true.
