Summary: For any normal-form game, the problem of finding a correlated equilibrium can be cast as a 2-player zero-sum game. This seems useful because zero-sum games are easy to analyze and more resistant to collusion.
Consider the following class of games (equivalent to the class of normal-form games):
There are \(n\) players. Player \(i\)’s action set is \(\mathcal{A}_i\). After each player \(i\) chooses their action \(A_i\), the state \(X\) results from these actions (perhaps stochastically), and each player \(i\) receives utility \(U_i(X)\).
Often, we are interested in finding a Nash equilibrium of a game like this. One strategy is to instantiate the \(n\) players as agents. However, this could cause collusion; see here and here for previous writing on collusion. Zero-sum games seem more resistant to collusion (although maybe not 100% resistant). Additionally, 2-player zero-sum games are typically easier to reason about than \(n\)-player games.
So we might be interested in finding a Nash equilibrium using a zero-sum game. I don’t actually know how to find a mixed Nash equilibrium, so instead I’ll present a strategy for finding a correlated equilibrium (correlated equilibria are a superset of Nash equilibria and are computationally easier to find). Here’s how it works:
 The actor chooses actions \(A_1, ..., A_n\).
 The critic chooses a player index \(I\), observes the action \(A_I\), and suggests an alternative action \(A_I'\).
 Flip a fair coin. If it comes up heads, observe the state \(X\) that results from actions \(A_1, ..., A_n\), and give the actor utility \(U_I(X)\).
 If it comes up tails, observe the state \(X\) that results from actions \(A_1, ..., A_{I-1}, A_I', A_{I+1}, ..., A_n\), and give the actor utility \(-U_I(X)\).
 Either way, the critic’s utility is the negation of the actor’s utility.
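The steps above can be sketched as a single simulated round. This is a minimal sketch rather than a definitive implementation: the names `play_round`, `sample_actions`, `critic`, and `utility` are hypothetical, the critic is represented as a player index plus a map from the observed action to an alternative, and the (possibly stochastic) state \(X\) is folded into the `utility` callback. It assumes the actor's payoff is negated on tails, which is what makes the critic's expected utility half the gain from the modification.

```python
import random

def play_round(sample_actions, critic, utility, rng):
    """Play one round of the actor-critic game and return the actor's utility.

    sample_actions(rng) -> tuple of actions (A_1, ..., A_n), the actor's move
    critic(rng)         -> (i, phi): player index and a map A_i -> alternative A_i'
    utility(i, profile) -> player i's utility for the state resulting from profile
    All three callbacks are hypothetical stand-ins for illustration.
    """
    actions = tuple(sample_actions(rng))
    i, phi = critic(rng)
    alternative = phi(actions[i])          # the critic only sees A_I
    if rng.random() < 0.5:                 # heads: evaluate the original profile
        return utility(i, actions)
    # tails: evaluate the profile with A_I replaced, and negate the payoff
    modified = actions[:i] + (alternative,) + actions[i + 1:]
    return -utility(i, modified)
```

The critic's utility for the round is the negation of the returned value.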
It will be useful to use the concept of an \(\epsilon\)-correlated equilibrium. A correlated equilibrium is a joint distribution over action profiles under which no player can gain any expected utility by strategy modification; an \(\epsilon\)-correlated equilibrium is one under which no player can gain more than \(\epsilon\) expected utility by strategy modification.
Note that the critic’s policies correspond to mixtures of strategy modifications; the critic can be seen as jointly picking a player \(I\) and a strategy modification \(\phi : \mathcal{A}_I \rightarrow \mathcal{A}_I\) for the player. Furthermore, the critic’s expected utility is half the expected utility gained by the corresponding player for the average strategy modification in this mixture:
\[\text{expected critic utility} = \frac{1}{2}\left(\mathbb{E}[U_I(X) \mid \text{tails}] - \mathbb{E}[U_I(X) \mid \text{heads}]\right)\]
because the critic’s expected utility is half the difference between player \(I\)’s expected utility given strategy modification (\(\text{tails}\)) and player \(I\)’s expected utility given no strategy modification (\(\text{heads}\)). Some facts result:
 Suppose the actor chooses \(A_1, ..., A_n\) from some joint distribution that is an \(\epsilon\)-correlated equilibrium of the original game. Then the actor’s expected utility is at least \(-\epsilon/2\) regardless of the critic’s policy.
 Suppose the actor chooses \(A_1, ..., A_n\) from some joint distribution that is not an \(\epsilon\)-correlated equilibrium of the original game. Then the critic’s best response results in a utility of less than \(-\epsilon/2\) for the actor.
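These two facts can be checked numerically on a small game. The sketch below computes the largest expected gain any single player can get from a strategy modification; a distribution is an \(\epsilon\)-correlated equilibrium exactly when this gain is at most \(\epsilon\), and the actor's utility against a best-responding critic is minus half of it. The function name, the data layout, and the Chicken payoffs (a standard textbook example, not taken from this post) are all assumptions made for illustration.

```python
from fractions import Fraction

def max_deviation_gain(dist, utility, action_sets):
    """Largest expected gain for any one player from a strategy modification.

    dist:        dict mapping action profiles (tuples) to probabilities
    utility:     utility(i, profile) -> player i's expected utility
    action_sets: list with each player's available actions
    """
    best = 0  # the identity modification always gains exactly 0
    for i, actions_i in enumerate(action_sets):
        total = 0
        for observed in actions_i:
            # The best replacement can be chosen separately per observed action.
            gains = []
            for alt in actions_i:
                g = 0
                for profile, p in dist.items():
                    if profile[i] != observed:
                        continue
                    modified = profile[:i] + (alt,) + profile[i + 1:]
                    g += p * (utility(i, modified) - utility(i, profile))
                gains.append(g)
            total += max(gains)
        best = max(best, total)
    return best

# Hypothetical example: the game of Chicken ("D" = dare, "C" = chicken out).
CHICKEN = {("D", "D"): (0, 0), ("D", "C"): (7, 2),
           ("C", "D"): (2, 7), ("C", "C"): (6, 6)}
ACTION_SETS = [["C", "D"], ["C", "D"]]

def chicken_utility(i, profile):
    return CHICKEN[profile][i]

third = Fraction(1, 3)
equilibrium = {("C", "C"): third, ("C", "D"): third, ("D", "C"): third}
always_cc = {("C", "C"): Fraction(1)}

gain_eq = max_deviation_gain(equilibrium, chicken_utility, ACTION_SETS)
gain_cc = max_deviation_gain(always_cc, chicken_utility, ACTION_SETS)
# Actor's expected utility against a best-responding critic is -gain / 2.
print(gain_eq, gain_cc)
```

Here the uniform mixture over three profiles is a correlated equilibrium (gain 0), while always playing ("C", "C") lets a player gain 1 by switching to "D", so the actor would get \(-1/2\) against the critic's best response.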
Correlated equilibria always exist, so at a Nash equilibrium in the zero-sum game, the actor always outputs a correlated equilibrium and gets expected utility 0.
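One way to spell out this step (the symbols \(p\) and \(g\) are introduced here only for illustration): let \(p\) be the actor's joint distribution over action profiles, and let \(g(p)\) be the largest expected gain any player can obtain by strategy modification,

\[g(p) = \max_{I, \phi} \left(\mathbb{E}_{p}[U_I(X) \mid A_I \text{ replaced by } \phi(A_I)] - \mathbb{E}_{p}[U_I(X)]\right) \geq 0,\]

where the bound holds because the identity modification gains nothing. Since the critic's expected utility is half the gain of its chosen modification, the actor gets \(-g(p)/2\) against a best-responding critic, so

\[\max_p \left(-\tfrac{1}{2}\, g(p)\right) = 0,\]

with the maximum attained exactly when \(g(p) = 0\), that is, when \(p\) is a correlated equilibrium.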
Perhaps in real life, it is inconvenient to observe the state \(X\) resulting from actions \(A_1, ..., A_{I-1}, A_I', A_{I+1}, ..., A_n\), because we can only observe the state by outputting actions, and maybe we always want to output actions from a correlated equilibrium. In this case we could use counterfactual oversight to usually output \(A_1, ..., A_n\), but run the procedure above occasionally to gather training data. It’s not clear when it’s acceptable to occasionally output strategy-modified action profiles (instead of action profiles from a correlated equilibrium).
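As a rough illustration of that idea, the sketch below usually executes the actor's profile unchanged and only occasionally runs the critic procedure. Everything here is an assumption for illustration: the function name, the `audit_prob` parameter and its default value, and the convention that an audited round returns a record `(i, sign)` from which the actor's training reward is later computed as `sign * U_i(X)`.

```python
import random

def act_with_oversight(sample_actions, critic, rng, audit_prob=0.05):
    """Return (profile_to_execute, record).

    record is None on ordinary rounds. On audited rounds it is (i, sign),
    and the actor's reward is sign * U_i(X) once the state X is observed.
    """
    actions = tuple(sample_actions(rng))
    if rng.random() >= audit_prob:
        return actions, None               # usual case: no training signal
    i, phi = critic(rng)
    if rng.random() < 0.5:                 # heads: execute the original profile
        return actions, (i, 1)
    # tails: execute the strategy-modified profile
    modified = actions[:i] + (phi(actions[i]),) + actions[i + 1:]
    return modified, (i, -1)
```

With this layout a strategy-modified profile is executed with probability `audit_prob / 2`, which makes explicit the quantity one would have to judge acceptable.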
