Intelligent Agent Foundations Forumsign up / log in
Superrationality in arbitrary games
discussion post by Vadim Kosoy 836 days ago | Jessica Taylor, Nate Soares, Patrick LaVictoire, Scott Garrabrant and Stuart Armstrong like this | 5 comments

This post is an informal summary of ideas that Scott and I discussed on my recent visit to Los Angeles, meant to serve as a temporary “savepoint” and reference. The approach outlined here, albeit very promising in my opinion, requires a more work to become mature.


We introduce a new type of game theoretic equilibrium, which is a candidate game theoretic desideratum for decision theory. This can be regarded as a generalization of Hofstadter’s superrationality to arbitrary games. This equilibrium can be interpreted as the outcome of games between transparent agents running CDT with restricted self-modification but also as the outcome of games between transparent agents employing a certain type of logical counterfactuals. We suggest it should be possible to use the formalism of optimal predictors to rigorously construct a logical decision theory which implements the latter interpretation.


It is commonly assumed that two agents running the correct decision theory would cooperate in a transparent Prisoner’s dilemma, that is, in a game of prisoner’s dilemma in which each player knows the source code of the other player\(^1\). It is possible to argue this behavior will indeed arise in UDT, at least for sufficiently similar players, but the argument is unrigorous since an agreed upon formalisation of UDT is lacking. It is also possible to formally prove this behavior for certain set-ups of proof-based decision theory. However, it is not presently clear what is the correct desideratum for decision theory in arbitrary games, especially games that are asymmetric.

The correct decision theory is likely to use logical counterfactuals but current understanding of logical counterfactuals is poor. On the other hand, it is much easier to model causal counterfactuals. Although CDT often leads to bad outcomes (and in particular defects in prisoner’s dilemma) it is reflectively unstable and given the chance is expected to self-modify into some kind of better decision theory\(^2\). We will now see that a naive formalisation of this idea fails to achieve satisfactory game-theoretic properties. This failure is later amended by restricting the “self-modification” to only produce programs of a certain form, arriving at what we hope is the correct desideratum.

Self-modifying CDT

Fix a universal Turing machine to be used for representing Turing machines as words (programs).

Consider a game \(G\) in normal form between \(n\) players. Denote \(\Sigma_i\) the set of pure strategies for the \(i\)-th player and \(u_i : \prod_{j=1}^n \Sigma_j \rightarrow {\mathbb{R}}\) the utility function of the \(i\)-th player.

Given \(t \in {\mathbb{N}}\), we define the associated self-modification game \(G^{SM(t)}\) as follows. The pure strategies of the \(i\)-th player are \(\Sigma_i^{SM(t)}:={{\{ 0, 1 \}^{t}}}\). These strategy represent programs that receive the strategies of the other players as input, run for \(t\) time steps and produce a strategy in the original game, possibly using random coin flips. The utility function of the \(i\)-th player is given by

\[u_i^{SM(t)}(\pi_1, \pi_2 \ldots \pi_n) := E_{U^{nt}}[u_i(ev^t(\pi_1; \pi_2, \pi_3 \ldots \pi_n, x_1), ev^t(\pi_2; \pi_1, \pi_3 \ldots \pi_n, x_2) \ldots ev^t(\pi_n; \pi_1, \pi_2 \ldots \pi_{n-1}, x_n))]\]

For example, denoting the prisoner’s dilemma \(G_{PD}\) and taking \(t >> 0\), there are Nash equilibria in \(G_{PD}^{SM(t)}\) which corresponds to the outcome \((C,C)\) in \(G_{PD}\). For example take \(q\) to be “cliquebot” program, constructed using quining:

\[q(\pi):=\begin{cases}C \text{ if } q \text{ is a prefix of } \pi \\ D \text { otherwise}\end{cases}\]

Then obviously \((q_1,q_2)\) is a Nash equilibrium of \(G_{PD}^{SM(t)}\) for any \(q_1,q_2 \in {{\{ 0, 1 \}^{t}}}\) s.t. \(q\) is a prefix of \(q_1\) and \(q_2\). However, the “defectbot” \(d(\pi):=D\) also gives a Nash equilibrium. More generally, for any \(p \in [0,1]\) a dyadic rational we can define the \(p\)-cliquebot to be

\[q_p(\pi):=\begin{cases}C \text{ with probability }p \text{ if } q_p \text{ is a prefix of } \pi \\ D \text { otherwise}\end{cases}\]

The defectbot equilibrium turns out to be unstable in the sense that it’s not a thermodynamic equilibrium at zero temperature (see below). However \(p\)-cliquebots for \(p > 0\) yield equilibria which are stable in this sense. This is a deficiency of the formalism which is avoided by the metathreat hierarchy (see below).

Thermodynamic Game Equilibria

Consider a game \(G\) in normal form as before. Denote \(\Delta_i\) the set of mixed strategies of the \(i\)-th player i.e.

\[\Delta_i := \{\mu: \Sigma_i \rightarrow [0,1] \mid \sum_{a \in \Sigma_i} \mu(a) = 1\}\]

Consider \(T \in ({\mathbb{R}}^{>0})^n\). A thermodynamic equilibrium of \(G\) at temperature \(T\) is \(\mu:=(\mu_1, \mu_2 \ldots \mu_n) \in \prod_{i=1}^n \Delta_i\) s.t.

\[\mu_i(a)=Z_i^{-1} \operatorname{exp}(\frac{E_{\mu}[u_i(a_1, a_2 \ldots a \ldots a_n)]}{T_i})\]

where \(Z_i \in {\mathbb{R}}\) are normalization constants. Alternatively, \(\mu\) is a thermodynamic equilibrium iff

\[\mu_i = \arg\max_{\mu \in \Delta_i} (E_{\mu}[u_i(a_1, a_2 \ldots a_n)] + T_i S[\mu_i])\]

where \(S[\mu_i]\) is the information-theoretic entropy of \(\mu_i\).

At least one thermodynamic equilibrium exists for any game and temperature, as a consequence of Brouwer’s fixed-point theorem.

\(\mu \in \prod_{i=1}^n \Delta_i\) is called a thermodynamic equilibrium of \(G\) at zero temperature when there is a sequence \(\{(\mu^k \in \prod_{i=1}^n \Delta_i, \, T^k \in ({\mathbb{R}}^{>0})^n)\}_{k \in {\mathbb{N}}}\) s.t. \({\lim_{k \rightarrow \infty}} \mu^k = \mu\), \({\lim_{k \rightarrow \infty}} T^k = 0\) and \(\mu^k\) is a thermodynamic equilibrium at temperature \(T^k\).

At least one thermodynamic equilibrium at zero temperature exists for any game, as a consequence of the Bolzano-Weierstrass theorem. Any thermodynamic equilibrium at zero temperature is a trembling hand equilibrium but the converse is false in general. We believe that for finite extensive-form games, a thermodynamic equilibrium at zero temperature is a subgame perfect equilibrium.

We believe \((d,d)\) is a trembling hand equilibrium in \(G_{PD}^{SM(t)}\) but not a thermodynamic equilibrium at zero temperature. On the other hand \((q_p, q_p)\) is a thermodynamic equilibrium at zero temperature for \(p > 0\) (when random bits are appended to \(q_p\) until it becomes a string of length \(t\)).

Thermodynamic equilibria are interesting because it is possible to implement them using a natural decision theory (that is, a decision theory that treats everything as a 1-player game by modeling the other players). Games between transparent CDT agents using reflective oracles lead to Nash equilibria, and it should be straightforward to “thermalize” the decision theory and get thermodynamic equilibria at finite temperature. A similar result should be possible using optimal predictor schemes instead of reflective oracles.

The Metathreat Hierarchy

The \(p\)-cliquebot can be informally regarded as an infinite tower of threats of increasing level of “meta”. The 1st level says any strategy other than cooperation with probability \(p\) will be answered by defection. The \(k+1\)-st level says that failure to produce the \(k\)-th level threat will be answered by defection. This way the players blackmail each other into complying with this suboptimal strategy. However, if we had a way to truncate the tower at any finite level, it would apart: there would be no justification for each player to produce the highest level threat and it would unwind downwards recursively. The metathreat hierarchy is a formal method to implement this truncation.

Consider a game \(G\) in normal form as before. Fix \(m \in {\mathbb{N}}\). The associated level \(0\) metathreat game with \(m\) coin-flips is just \(G\) itself: \(G^{MT(0,m)}:=G\). We define the associated level \(k+1\) metathreat game with \(m\) coin-flips \(G^{MT(k+1,m)}\) recursively. The pure strategies of the \(i\)-th player are

\[\Sigma_i^{MT(k+1,m)}:=\{\pi: \prod_{j \ne i} \Sigma_j^{MT(k,m)} \times {{\{ 0, 1 \}^{m}}} \rightarrow \Sigma_i^{MT(k,m)} \mid \exists a \in \Sigma_i^{MT(k,m)} \forall b \in \prod_{j \ne i} \Sigma_j^{MT(k,m)} \exists x \in {{\{ 0, 1 \}^{m}}}: \pi(b,x)=a \}\]

Consider \(\pi \in \prod_{i=1}^n \Sigma_i^{MT(k+1,m)}\). Then there is \(\mu^\pi \in \prod_{i=1}^n \Delta_i^{MT(k,m)}\) unique s.t.

\[\mu_i^\pi(a) = Pr_{\mu^\pi \times U^m}[\pi_i(b,x)=a]\]

It exists because of Brouwer’s fixed-point theorem and is unique because of standard results about Markov chains (\(\pi\) defines a Markov chain on \(\prod_{i=1}^n \Sigma_i^{MT(k,m)}\), \(\mu^\pi\) is a stationary distribution of this chain and it has to be unique because of the properties of the chain that follow from the condition on \(\pi_i\)). This allows us to define the utility functions

\[u_i^{MT(k+1,m)}(\pi) = E_{\mu^\pi}[u_i^{MT(k,m)}(a)]\]

Denote \(\Gamma^{MT(k,m)}\) the space of correlated mixed outcomes of \(G^{MT(k,m)}\):

\[\Gamma^{MT(k,m)}:=\{\nu: \prod_{i=1}^n \Sigma_i^{MT(k,m)} \rightarrow [0,1] \mid \sum_{\pi \in \prod_{i=1}^n \Sigma_i^{MT(k,m)}} \nu(\pi) = 1\}\]

There is a natural sense in which \(\prod_{i=1}^n \Delta_i^{MT(k,m)} \subseteq \Gamma^{MT(k,m)}\).

For \(k > j\) we define the unwinding operator \(\omega^{MT(k \rightarrow j,m)}: \Gamma^{MT(k,m)} \rightarrow \Gamma^{MT(j,m)}\) recursively by

\[\omega^{MT(k+1 \rightarrow k,m)}(\nu):=\sum_{\pi \in \prod_{i=1}^n \Sigma_i^{MT(k+1,m)}} \nu(\pi) \mu^\pi\]

\[\omega^{MT(k \rightarrow j,m)}(\nu):=\omega^{MT(j+1 \rightarrow j,m)}(\omega^{MT(k \rightarrow j+1,m)}(\nu))\]

Consider \(\nu \in \Gamma\), a correlated mixed outcome of \(G\). \(\nu\) is called a level \(k\) metathreat equilibrium when there is a sequence \(\{(m^i \in {\mathbb{N}}, \mu^i \in \prod_{i=1}^n \Delta_i^{MT(k,m^i)})\}_{i \in {\mathbb{N}}}\) s.t. \({\lim_{i \rightarrow \infty}} m_i = \infty\), \({\lim_{i \rightarrow \infty}} \omega^{MT(k \rightarrow 0,m_i)}(\mu^i) = \nu\) and \(\mu^i\) is a thermodynamic equilibrium of \(G^{MT(k,m^i)}\) at zero temperature. The Bolzano-Weierstrass theorem ensures such \(\nu\) exists for any game and value of \(k\).

We believe that for \(m >> 0\), \(G_{PD}^{MT(1,m)}\) has a unique thermodynamic equilibrium at zero temperature which is a mixture of \(\pi\) s.t.

\[Pr_{U^m}[\pi(C,x)=C] = 1\]

\[Pr_{U^m}[\pi(D,x)=C] = 2^{-m}\]

Therefore \((C,C)\) is the unique level \(1\) metathreat equilibrium of \(G_{PD}\).

Some questions for further research:

  • Does the set of level \(k\) metathreat equilibria stabilize for \(k >> 0\)?
  • Are level \(k\) metathreat equilibria guaranteed to be Pareto efficient for \(k >> 0\)? Or even already for \(k > 0\)?
  • How do level \(k\) metathreat equilibria look like for \(G_{PD}\) in \(\Sigma^{MT(k,m)}\)-space for \(k > 1\)?
  • Does anything change for \(k >> 0\) if we explicitly introduce shared random bits? It seems plausible that shared random bits can always be replaced by going to higher values of \(k\).

One decision theoretic interpretation of metathreat equilibria is in terms of self-modifying CDT. The elements of \(\Sigma^{MT(k,m)}\) can be interpreted as programs that use reflective oracles or optimal predictor schemes to make predictions about the opposing programs and decide on a strategy accordingly. Thus our self-modification is restricted by the programs in this class.

Another interpretation is in terms of logical decision theory, at least for \(k = 1\). Such a decision theory consists of two stages. In the first stage the agent computes logical conditional expected utility (using an optimal predictor scheme) where the condition corresponds to selection of a given element in \(\pi \in \Sigma^{MT(k,1)}\). It then performs a sequence of logical coin flips to decide on \(\pi\) according to a probability distribution designed to yield a thermodynamic equilibrium. In the second stage the agent uses an optimal predictor scheme to guess the strategy of its opponent and acts according to \(\pi\), this time providing so much resources to the optical predictor scheme that the result of the first stage for the opposing agent is certain. Such a decision theory is not entirely natural since it explicitly uses the concept of “opponent actions”, but hopefully it can be used as a foundation for constructing a fully natural logical decision theory.

\(^1\) The historical origin of this idea is Hofstadter “superrationality”.

\(^2\) We don’t expect the limit of this self-modification process to be entirely reasonable since in some situations self-modification cannot save CDT, for example if self-modification only begins after the agent is inspected by Omega in Newcomb’s paradox. However, it seems plausible to specify assumptions under which self-modifying CDT will make the right choices e.g. Newcomb’s paradox where self-modification is allowed before inspection by Omega.

by Paul Christiano 817 days ago | Vadim Kosoy likes this | link

Do thermodynamic equilibria differ meaningfully from proper equilibria?


by Vadim Kosoy 817 days ago | link

Good question. One obvious difference is that thermodynamic equilibria are naturally defined for games with appropriate continuous strategy spaces, but it’s probably not important for intelligent agent theory.

It seems plausible that for zero temperature they don’t.

For positive temperature thermodynamic equilibria are more rigid than \(\epsilon\)-proper profiles but I’m not sure it’s important.


by Vadim Kosoy 790 days ago | link

Another case which is important is games with infinite strategy spaces. In particular see my recent post where I use the notion of thermodynamic equilibrium in a way which cannot be trivially mimicked (I think) by a proper equilibrium.


by Stuart Armstrong 822 days ago | link

Very interesting! I’m eager to see where this will lead.

If the PD has different values (thus is non-symmetric), but still has the same ordinal relations on outcomes (player 1 prefers (D,C) to (C,C) to (D,D) to (C,D), etc…), is (C,C) still the unique level 1 metathreat equilibrium?

And what inspired this “thermodynamic” approach?


by Vadim Kosoy 822 days ago | link

The metathreat equilibrium definitely depends on the cardinal utilities, for example if \(u_{CC} < \frac{u_{CD}+u_{DC}}{2}\) then the “nicer bots” are not an equilibrium since both agents want to lower the probability of cooperation.

The thermodynamic approach is needed since we need a way to rule out some of the metathreat Nash equilibria. A pair of defect bots is a Nash equilibrium in the level 1 metathreat game but this equilibrium is “unstable” since there are bots that are equally good against the defect bot but do better against other bots. So we need some notion of equilibrium stability, but the usual trembling hand is not good enough since for trembling hand equilibrium stability wrt an arbitrary totally mixed perturbation is sufficient (and it is possible to construct perturbations that stabilize “bad” equilibria). The thermodynamic equilibrium works by considering perturbations in which lower utility alternatives get higher order infinitesimal probabilities. Together with the ease to achieve this equilibrium in a natural decision theory, it becomes an attractive choice.






[Delegative Reinforcement
by Vadim Kosoy on Stable Pointers to Value II: Environmental Goals | 1 like

Intermediate update: The
by Alex Appel on Further Progress on a Bayesian Version of Logical ... | 0 likes

Since Briggs [1] shows that
by 258 on In memoryless Cartesian environments, every UDT po... | 2 likes

This doesn't quite work. The
by Nisan Stiennon on Logical counterfactuals and differential privacy | 0 likes

I at first didn't understand
by Sam Eisenstat on An Untrollable Mathematician | 1 like

This is somewhat related to
by Vadim Kosoy on The set of Logical Inductors is not Convex | 0 likes

This uses logical inductors
by Abram Demski on The set of Logical Inductors is not Convex | 0 likes

Nice writeup. Is one-boxing
by Tom Everitt on Smoking Lesion Steelman II | 0 likes

Hi Alex! The definition of
by Vadim Kosoy on Delegative Inverse Reinforcement Learning | 0 likes

A summary that might be
by Alex Appel on Delegative Inverse Reinforcement Learning | 1 like

I don't believe that
by Alex Appel on Delegative Inverse Reinforcement Learning | 0 likes

This is exactly the sort of
by Stuart Armstrong on Being legible to other agents by committing to usi... | 0 likes

When considering an embedder
by Jack Gallagher on Where does ADT Go Wrong? | 0 likes

The differences between this
by Abram Demski on Policy Selection Solves Most Problems | 1 like

Looking "at the very
by Abram Demski on Policy Selection Solves Most Problems | 0 likes


Privacy & Terms