Intelligent Agent Foundations Forumsign up / log in
Superrationality in arbitrary games
discussion post by Vadim Kosoy 747 days ago | Jessica Taylor, Nate Soares, Patrick LaVictoire, Scott Garrabrant and Stuart Armstrong like this | 5 comments

This post is an informal summary of ideas that Scott and I discussed on my recent visit to Los Angeles, meant to serve as a temporary “savepoint” and reference. The approach outlined here, albeit very promising in my opinion, requires a more work to become mature.

Abstract

We introduce a new type of game theoretic equilibrium, which is a candidate game theoretic desideratum for decision theory. This can be regarded as a generalization of Hofstadter’s superrationality to arbitrary games. This equilibrium can be interpreted as the outcome of games between transparent agents running CDT with restricted self-modification but also as the outcome of games between transparent agents employing a certain type of logical counterfactuals. We suggest it should be possible to use the formalism of optimal predictors to rigorously construct a logical decision theory which implements the latter interpretation.

Introduction

It is commonly assumed that two agents running the correct decision theory would cooperate in a transparent Prisoner’s dilemma, that is, in a game of prisoner’s dilemma in which each player knows the source code of the other player\(^1\). It is possible to argue this behavior will indeed arise in UDT, at least for sufficiently similar players, but the argument is unrigorous since an agreed upon formalisation of UDT is lacking. It is also possible to formally prove this behavior for certain set-ups of proof-based decision theory. However, it is not presently clear what is the correct desideratum for decision theory in arbitrary games, especially games that are asymmetric.

The correct decision theory is likely to use logical counterfactuals but current understanding of logical counterfactuals is poor. On the other hand, it is much easier to model causal counterfactuals. Although CDT often leads to bad outcomes (and in particular defects in prisoner’s dilemma) it is reflectively unstable and given the chance is expected to self-modify into some kind of better decision theory\(^2\). We will now see that a naive formalisation of this idea fails to achieve satisfactory game-theoretic properties. This failure is later amended by restricting the “self-modification” to only produce programs of a certain form, arriving at what we hope is the correct desideratum.

Self-modifying CDT

Fix a universal Turing machine to be used for representing Turing machines as words (programs).

Consider a game \(G\) in normal form between \(n\) players. Denote \(\Sigma_i\) the set of pure strategies for the \(i\)-th player and \(u_i : \prod_{j=1}^n \Sigma_j \rightarrow {\mathbb{R}}\) the utility function of the \(i\)-th player.

Given \(t \in {\mathbb{N}}\), we define the associated self-modification game \(G^{SM(t)}\) as follows. The pure strategies of the \(i\)-th player are \(\Sigma_i^{SM(t)}:={{\{ 0, 1 \}^{t}}}\). These strategy represent programs that receive the strategies of the other players as input, run for \(t\) time steps and produce a strategy in the original game, possibly using random coin flips. The utility function of the \(i\)-th player is given by

\[u_i^{SM(t)}(\pi_1, \pi_2 \ldots \pi_n) := E_{U^{nt}}[u_i(ev^t(\pi_1; \pi_2, \pi_3 \ldots \pi_n, x_1), ev^t(\pi_2; \pi_1, \pi_3 \ldots \pi_n, x_2) \ldots ev^t(\pi_n; \pi_1, \pi_2 \ldots \pi_{n-1}, x_n))]\]

For example, denoting the prisoner’s dilemma \(G_{PD}\) and taking \(t >> 0\), there are Nash equilibria in \(G_{PD}^{SM(t)}\) which corresponds to the outcome \((C,C)\) in \(G_{PD}\). For example take \(q\) to be “cliquebot” program, constructed using quining:

\[q(\pi):=\begin{cases}C \text{ if } q \text{ is a prefix of } \pi \\ D \text { otherwise}\end{cases}\]

Then obviously \((q_1,q_2)\) is a Nash equilibrium of \(G_{PD}^{SM(t)}\) for any \(q_1,q_2 \in {{\{ 0, 1 \}^{t}}}\) s.t. \(q\) is a prefix of \(q_1\) and \(q_2\). However, the “defectbot” \(d(\pi):=D\) also gives a Nash equilibrium. More generally, for any \(p \in [0,1]\) a dyadic rational we can define the \(p\)-cliquebot to be

\[q_p(\pi):=\begin{cases}C \text{ with probability }p \text{ if } q_p \text{ is a prefix of } \pi \\ D \text { otherwise}\end{cases}\]

The defectbot equilibrium turns out to be unstable in the sense that it’s not a thermodynamic equilibrium at zero temperature (see below). However \(p\)-cliquebots for \(p > 0\) yield equilibria which are stable in this sense. This is a deficiency of the formalism which is avoided by the metathreat hierarchy (see below).

Thermodynamic Game Equilibria

Consider a game \(G\) in normal form as before. Denote \(\Delta_i\) the set of mixed strategies of the \(i\)-th player i.e.

\[\Delta_i := \{\mu: \Sigma_i \rightarrow [0,1] \mid \sum_{a \in \Sigma_i} \mu(a) = 1\}\]

Consider \(T \in ({\mathbb{R}}^{>0})^n\). A thermodynamic equilibrium of \(G\) at temperature \(T\) is \(\mu:=(\mu_1, \mu_2 \ldots \mu_n) \in \prod_{i=1}^n \Delta_i\) s.t.

\[\mu_i(a)=Z_i^{-1} \operatorname{exp}(\frac{E_{\mu}[u_i(a_1, a_2 \ldots a \ldots a_n)]}{T_i})\]

where \(Z_i \in {\mathbb{R}}\) are normalization constants. Alternatively, \(\mu\) is a thermodynamic equilibrium iff

\[\mu_i = \arg\max_{\mu \in \Delta_i} (E_{\mu}[u_i(a_1, a_2 \ldots a_n)] + T_i S[\mu_i])\]

where \(S[\mu_i]\) is the information-theoretic entropy of \(\mu_i\).

At least one thermodynamic equilibrium exists for any game and temperature, as a consequence of Brouwer’s fixed-point theorem.

\(\mu \in \prod_{i=1}^n \Delta_i\) is called a thermodynamic equilibrium of \(G\) at zero temperature when there is a sequence \(\{(\mu^k \in \prod_{i=1}^n \Delta_i, \, T^k \in ({\mathbb{R}}^{>0})^n)\}_{k \in {\mathbb{N}}}\) s.t. \({\lim_{k \rightarrow \infty}} \mu^k = \mu\), \({\lim_{k \rightarrow \infty}} T^k = 0\) and \(\mu^k\) is a thermodynamic equilibrium at temperature \(T^k\).

At least one thermodynamic equilibrium at zero temperature exists for any game, as a consequence of the Bolzano-Weierstrass theorem. Any thermodynamic equilibrium at zero temperature is a trembling hand equilibrium but the converse is false in general. We believe that for finite extensive-form games, a thermodynamic equilibrium at zero temperature is a subgame perfect equilibrium.

We believe \((d,d)\) is a trembling hand equilibrium in \(G_{PD}^{SM(t)}\) but not a thermodynamic equilibrium at zero temperature. On the other hand \((q_p, q_p)\) is a thermodynamic equilibrium at zero temperature for \(p > 0\) (when random bits are appended to \(q_p\) until it becomes a string of length \(t\)).

Thermodynamic equilibria are interesting because it is possible to implement them using a natural decision theory (that is, a decision theory that treats everything as a 1-player game by modeling the other players). Games between transparent CDT agents using reflective oracles lead to Nash equilibria, and it should be straightforward to “thermalize” the decision theory and get thermodynamic equilibria at finite temperature. A similar result should be possible using optimal predictor schemes instead of reflective oracles.

The Metathreat Hierarchy

The \(p\)-cliquebot can be informally regarded as an infinite tower of threats of increasing level of “meta”. The 1st level says any strategy other than cooperation with probability \(p\) will be answered by defection. The \(k+1\)-st level says that failure to produce the \(k\)-th level threat will be answered by defection. This way the players blackmail each other into complying with this suboptimal strategy. However, if we had a way to truncate the tower at any finite level, it would apart: there would be no justification for each player to produce the highest level threat and it would unwind downwards recursively. The metathreat hierarchy is a formal method to implement this truncation.

Consider a game \(G\) in normal form as before. Fix \(m \in {\mathbb{N}}\). The associated level \(0\) metathreat game with \(m\) coin-flips is just \(G\) itself: \(G^{MT(0,m)}:=G\). We define the associated level \(k+1\) metathreat game with \(m\) coin-flips \(G^{MT(k+1,m)}\) recursively. The pure strategies of the \(i\)-th player are

\[\Sigma_i^{MT(k+1,m)}:=\{\pi: \prod_{j \ne i} \Sigma_j^{MT(k,m)} \times {{\{ 0, 1 \}^{m}}} \rightarrow \Sigma_i^{MT(k,m)} \mid \exists a \in \Sigma_i^{MT(k,m)} \forall b \in \prod_{j \ne i} \Sigma_j^{MT(k,m)} \exists x \in {{\{ 0, 1 \}^{m}}}: \pi(b,x)=a \}\]

Consider \(\pi \in \prod_{i=1}^n \Sigma_i^{MT(k+1,m)}\). Then there is \(\mu^\pi \in \prod_{i=1}^n \Delta_i^{MT(k,m)}\) unique s.t.

\[\mu_i^\pi(a) = Pr_{\mu^\pi \times U^m}[\pi_i(b,x)=a]\]

It exists because of Brouwer’s fixed-point theorem and is unique because of standard results about Markov chains (\(\pi\) defines a Markov chain on \(\prod_{i=1}^n \Sigma_i^{MT(k,m)}\), \(\mu^\pi\) is a stationary distribution of this chain and it has to be unique because of the properties of the chain that follow from the condition on \(\pi_i\)). This allows us to define the utility functions

\[u_i^{MT(k+1,m)}(\pi) = E_{\mu^\pi}[u_i^{MT(k,m)}(a)]\]

Denote \(\Gamma^{MT(k,m)}\) the space of correlated mixed outcomes of \(G^{MT(k,m)}\):

\[\Gamma^{MT(k,m)}:=\{\nu: \prod_{i=1}^n \Sigma_i^{MT(k,m)} \rightarrow [0,1] \mid \sum_{\pi \in \prod_{i=1}^n \Sigma_i^{MT(k,m)}} \nu(\pi) = 1\}\]

There is a natural sense in which \(\prod_{i=1}^n \Delta_i^{MT(k,m)} \subseteq \Gamma^{MT(k,m)}\).

For \(k > j\) we define the unwinding operator \(\omega^{MT(k \rightarrow j,m)}: \Gamma^{MT(k,m)} \rightarrow \Gamma^{MT(j,m)}\) recursively by

\[\omega^{MT(k+1 \rightarrow k,m)}(\nu):=\sum_{\pi \in \prod_{i=1}^n \Sigma_i^{MT(k+1,m)}} \nu(\pi) \mu^\pi\]

\[\omega^{MT(k \rightarrow j,m)}(\nu):=\omega^{MT(j+1 \rightarrow j,m)}(\omega^{MT(k \rightarrow j+1,m)}(\nu))\]

Consider \(\nu \in \Gamma\), a correlated mixed outcome of \(G\). \(\nu\) is called a level \(k\) metathreat equilibrium when there is a sequence \(\{(m^i \in {\mathbb{N}}, \mu^i \in \prod_{i=1}^n \Delta_i^{MT(k,m^i)})\}_{i \in {\mathbb{N}}}\) s.t. \({\lim_{i \rightarrow \infty}} m_i = \infty\), \({\lim_{i \rightarrow \infty}} \omega^{MT(k \rightarrow 0,m_i)}(\mu^i) = \nu\) and \(\mu^i\) is a thermodynamic equilibrium of \(G^{MT(k,m^i)}\) at zero temperature. The Bolzano-Weierstrass theorem ensures such \(\nu\) exists for any game and value of \(k\).

We believe that for \(m >> 0\), \(G_{PD}^{MT(1,m)}\) has a unique thermodynamic equilibrium at zero temperature which is a mixture of \(\pi\) s.t.

\[Pr_{U^m}[\pi(C,x)=C] = 1\]

\[Pr_{U^m}[\pi(D,x)=C] = 2^{-m}\]

Therefore \((C,C)\) is the unique level \(1\) metathreat equilibrium of \(G_{PD}\).

Some questions for further research:

  • Does the set of level \(k\) metathreat equilibria stabilize for \(k >> 0\)?
  • Are level \(k\) metathreat equilibria guaranteed to be Pareto efficient for \(k >> 0\)? Or even already for \(k > 0\)?
  • How do level \(k\) metathreat equilibria look like for \(G_{PD}\) in \(\Sigma^{MT(k,m)}\)-space for \(k > 1\)?
  • Does anything change for \(k >> 0\) if we explicitly introduce shared random bits? It seems plausible that shared random bits can always be replaced by going to higher values of \(k\).

One decision theoretic interpretation of metathreat equilibria is in terms of self-modifying CDT. The elements of \(\Sigma^{MT(k,m)}\) can be interpreted as programs that use reflective oracles or optimal predictor schemes to make predictions about the opposing programs and decide on a strategy accordingly. Thus our self-modification is restricted by the programs in this class.

Another interpretation is in terms of logical decision theory, at least for \(k = 1\). Such a decision theory consists of two stages. In the first stage the agent computes logical conditional expected utility (using an optimal predictor scheme) where the condition corresponds to selection of a given element in \(\pi \in \Sigma^{MT(k,1)}\). It then performs a sequence of logical coin flips to decide on \(\pi\) according to a probability distribution designed to yield a thermodynamic equilibrium. In the second stage the agent uses an optimal predictor scheme to guess the strategy of its opponent and acts according to \(\pi\), this time providing so much resources to the optical predictor scheme that the result of the first stage for the opposing agent is certain. Such a decision theory is not entirely natural since it explicitly uses the concept of “opponent actions”, but hopefully it can be used as a foundation for constructing a fully natural logical decision theory.

\(^1\) The historical origin of this idea is Hofstadter “superrationality”.

\(^2\) We don’t expect the limit of this self-modification process to be entirely reasonable since in some situations self-modification cannot save CDT, for example if self-modification only begins after the agent is inspected by Omega in Newcomb’s paradox. However, it seems plausible to specify assumptions under which self-modifying CDT will make the right choices e.g. Newcomb’s paradox where self-modification is allowed before inspection by Omega.



by Paul Christiano 728 days ago | Vadim Kosoy likes this | link

Do thermodynamic equilibria differ meaningfully from proper equilibria?

reply

by Vadim Kosoy 727 days ago | link

Good question. One obvious difference is that thermodynamic equilibria are naturally defined for games with appropriate continuous strategy spaces, but it’s probably not important for intelligent agent theory.

It seems plausible that for zero temperature they don’t.

For positive temperature thermodynamic equilibria are more rigid than \(\epsilon\)-proper profiles but I’m not sure it’s important.

reply

by Vadim Kosoy 701 days ago | link

Another case which is important is games with infinite strategy spaces. In particular see my recent post where I use the notion of thermodynamic equilibrium in a way which cannot be trivially mimicked (I think) by a proper equilibrium.

reply

by Stuart Armstrong 733 days ago | link

Very interesting! I’m eager to see where this will lead.

If the PD has different values (thus is non-symmetric), but still has the same ordinal relations on outcomes (player 1 prefers (D,C) to (C,C) to (D,D) to (C,D), etc…), is (C,C) still the unique level 1 metathreat equilibrium?

And what inspired this “thermodynamic” approach?

reply

by Vadim Kosoy 733 days ago | link

The metathreat equilibrium definitely depends on the cardinal utilities, for example if \(u_{CC} < \frac{u_{CD}+u_{DC}}{2}\) then the “nicer bots” are not an equilibrium since both agents want to lower the probability of cooperation.

The thermodynamic approach is needed since we need a way to rule out some of the metathreat Nash equilibria. A pair of defect bots is a Nash equilibrium in the level 1 metathreat game but this equilibrium is “unstable” since there are bots that are equally good against the defect bot but do better against other bots. So we need some notion of equilibrium stability, but the usual trembling hand is not good enough since for trembling hand equilibrium stability wrt an arbitrary totally mixed perturbation is sufficient (and it is possible to construct perturbations that stabilize “bad” equilibria). The thermodynamic equilibrium works by considering perturbations in which lower utility alternatives get higher order infinitesimal probabilities. Together with the ease to achieve this equilibrium in a natural decision theory, it becomes an attractive choice.

reply



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

Unfortunately, it's not just
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

>We can solve the problem in
by Wei Dai on The Happy Dance Problem | 1 like

Maybe it's just my browser,
by Gordon Worley III on Catastrophe Mitigation Using DRL | 2 likes

At present, I think the main
by Abram Demski on Looking for Recommendations RE UDT vs. bounded com... | 0 likes

In the first round I'm
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

Fine with it being shared
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

I think the point I was
by Abram Demski on Predictable Exploration | 0 likes

(also x-posted from
by Sören Mindermann on The Three Levels of Goodhart's Curse | 0 likes

(x-posted from Arbital ==>
by Sören Mindermann on The Three Levels of Goodhart's Curse | 0 likes

>If the other players can see
by Stuart Armstrong on Predictable Exploration | 0 likes

Thinking about this more, I
by Abram Demski on Predictable Exploration | 0 likes

> So I wound up with
by Abram Demski on Predictable Exploration | 0 likes

Hm, I got the same result
by Alex Appel on Predictable Exploration | 1 like

Paul - how widely do you want
by David Krueger on Funding opportunity for AI alignment research | 0 likes

I agree, my intuition is that
by Abram Demski on Smoking Lesion Steelman III: Revenge of the Tickle... | 0 likes

RSS

Privacy & Terms