Reflexive Oracles and superrationality: prisoner's dilemma post by Stuart Armstrong 976 days ago | Jessica Taylor, Nate Soares and Patrick LaVictoire like this | 3 comments This grew out of an exchange with Jessica Taylor during MIRI’s recent visit to the FHI. Still getting my feel for the fixed point approach; let me know of any errors. The question is, how can we make use of a reflective oracle to reach outcomes that are not Nash equilibriums? To recap, a reflective oracle is a machine O such that: P(A()=1)>p implies O(A,p)=1 P(A()=0)>1-p implies O(A,p)=0 This works even if A() includes a call to the oracle within its code. Now, all the algorithms used here will be clearly terminating, so we’ll have the other two implications as well (eg (P(A()=0)>p implies O(A,p)=0). And given any δ, we can, with order log(1/δ) questions, establish the probability of A() to within δ. Thus we will write O(A()==a)=p to mean that O(A()==a,(n-1)δ/2)=1 and O(A()==a,(n+1)δ/2)=0, where (n-1)δ/2 < p < (n+1)δ/2. Note also that O can be used to output a probabilistic output (to within δ), so outputting specific mixed strategies is possible. If p1 and p2 are two probability distributions/strategies over possible agent outputs, define them to be “δ-Pareto” if they are within δ of Pareto strategies. We can differentiate (p1,p2) for small changes in strategy, by infinitesimally increasing the weight of some pure strategies o1 and o2 (note that for each strategy, we actually have one less independent degree of freedom than the number of pure strategies, since probabilities must sum to 1). We’ll say D(p1,p2)(o1,o2) is Pareto if we are sure that it is an improvement for both players for all possible (p1’,p2’) within the respective δ ranges of (p1,p2). If p1 and p2 are two probability distributions over possible agent outputs, we can consider We’ll make the following two assumptions: The players do not make use of each other’s internal structures, just of the possible outputs and the calls to O(). The players do not have access to a joint source of randomness. So now fix 1 >> ε >> δ > 0, assume the problem is the Prisoner’s dilemma as given in the previous figure, let Q() be the other player, and consider the following algorithm: define LP(): For all outputs o1 of LP(), compute O(LP()==o1). Call the probability distribution generated p1. For all outputs o2 of Q(), compute O(Q()==o2). Call the probability distribution generated p2. If exists o1 and o2 such that D(p1,p2)(o1,o2) is δ-Pareto: With probability 1-ε output p1. With probability ε output o1. Else output p1. Now, what does this algorithm do? To answer that, we need to consider its fixed points. Since ε >> δ, the only possible fixed points are those where the “With probability 1-ε…” output does not happen (even the degenerate case p1=o1 is not possible, as then D(p1,-)(o1,-)=0 since the probability of o1 cannot be further increased). Thus (p1,p2) must be strategies such that there do not exist o1, o2 making D(p1,p2)(o1,o2) δ-Pareto. In the prisonner’s dilemma, there is always a possibility of Pareto improvement by increasing mutual cooperation (o1,o2)=(C,C), so p1 and p2 must themselves be δ-Pareto. Thus LP() will always reach δ-Pareto outcomes with Q(). The name LP stands for “locally Pareto” (we’ll revisit the “locally” later). Though LP() achieves Pareto outcomes, this is not always ideal. If Q()==LP(), they will achieve some Pareto outcome, but it could be any. If Q() is the cooperation rock, then p1 could be any mix of defect and cooperate, as all those outcomes are Pareto. More worryingly, if Q() is the defection rock, LP() must cooperate (to within δ), as that is the only Pareto outcome. To deal with this, consider neLP() (non-exploitable LP()). Define O(p1,p2) as the module computing the two probability distributions. The value of (p1,p2) is the expected utility of these according to the agent. If we say the value is δ-surely less than some other number, that means that value of (p1’,p2’) is strictly less than that number for all possible p1’ and p2’ within δ of p1 and p2, respectively. define neLP(): O(p1,p2) Let min be the minmax value, from strategy p1'. If the value of (p1,p2) is δ-surely less than min: Output p1'. If exists o1 and o2 such that D(p1,p2)(o1,o2) is δ-Pareto: With probability 1-ε output p1. With probability ε output o1. Else output p1. If multiple o1 exist, it chooses randomly among them. For the prisoner’s dilemma, p1’=D and min=1. When playing against defection rock, p2 must be within δ of pure defection. The “If the value of…” clause prevents neLP() from cooperating with a probability larger than order δ. Therefore, neLP() will compute a (p1,p2) that, most of the time, will cause it to defect (“Output p1’”), and around δ/ε of the time, to go through the “If exists” loop, and cooperate with probability ε, resulting in cooperation of order δ. What happens when neLP() plays itself? The two players must have either the same values for the probabilities in p1 and p2, or values that are δ/2 apart. The two “probability zones” computed by the two players must thus touch, at a corner is nothing else. The first player will think the outcomes are δ-surely less than min if its “zone” has value strictly less than 1; conversely for the second player. Thus the touching point must have coordinates (a,b) with a<1 and b<1 - but such a point does not exist in the prisoner’s dilemma outcome space. So at least one player must reach the “If exists…” clause. But, as before, for the prisoner’s dilemma, strategies that trigger that clause are not stable. So the players must reach a δ-Pareto outcome. By the earlier conditions, this must be one that is not δ-less than (D,D) for either player. Consequently, it must be on the red boundary of the following figure: The red boundary is what neLP() can achieve against copies of itself. The combined red and blue boundary is what neLP() can achieve against LP() and cooperation rock. Can we do better? We might be tempted to increase “min”. If we increased min to 2, say, then surely the result would be Pareto in the “greater than 2” region? Have we successfully moved from “non-exploitable” to “exploits”? However, though this works against LP() and cooperation rock, it runs into trouble when playing against itself, as “you both defect” becomes a possible oracle answer. A better solution is to define the “allowable region” as the green one here: Thus the “If the value of…” line is replaced with “If the value of (p1,p2) is δ-surely not in the green zone” and the argument goes through as before. If such an agent faces a copy of itself, the combined allowable region is the kite delimited by the darker green lines, and then the outcome will be along the Pareto light green lines. The “green” agent will even be able to cooperate with the “yellow” agent, the one whose allowable region is the yellow triangle. Since their allowable regions overlap, the outcome will be the Pareto segment at the top of the overlap. However, two “yellow” agents will not be able to cooperate, and will mutually defect. By becoming too greedy, and insisting on a higher share of the prize, they’ve made mutual cooperation impossible. This seems to be a general trend: to make yourself better against some agents, you make yourself worse against others. In the next post, I’ll have another look at what we mean by “Pareto”.

 by Jessica Taylor 976 days ago | Patrick LaVictoire and Stuart Armstrong like this | link I think it would help to explain how this works for prisoner’s dilemma specifically. This is NicerBot: def NicerBot(opponent): with probability epsilon: cooperate otherwise: cooperate with the same probability that the opponent does (by querying the oracle) It’s easy to see that NicerBot(NicerBot) will always cooperate, while NicerBot(DefectBot) will defect with $$1 - \epsilon$$ probability. This seems like the correct analogue of FairBot in reflective oracle land. reply
 by Vadim Kosoy 976 days ago | Stuart Armstrong likes this | link Note that the NicerBot is exactly what you get from the threat game formalism. Specifically, see the two equations near the end. reply
 by Stuart Armstrong 975 days ago | link Yep! I try and not make use of symmetry when I can avoid it, to make the result more general. Note that NicerBot is neLP() with the allowable area being half of the output area (the half delimited by CC, DD and DC). reply

### NEW DISCUSSION POSTS

There should be a chat icon
 by Alex Mennen on Meta: IAFF vs LessWrong | 0 likes

Apparently "You must be
 by Jessica Taylor on Meta: IAFF vs LessWrong | 1 like

There is a replacement for
 by Alex Mennen on Meta: IAFF vs LessWrong | 1 like

Regarding the physical
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think that we should expect
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think I understand your
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

This seems like a hack. The
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

After thinking some more,
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yes, I think that we're
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

My intuition is that it must
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

To first approximation, a
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Actually, I *am* including
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yeah, when I went back and
 by Alex Appel on Optimal and Causal Counterfactual Worlds | 0 likes

> Well, we could give up on
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

> For another thing, consider
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes