Intelligent Agent Foundations Forumsign up / log in
Prediction Based Robust Cooperation
post by Scott Garrabrant 297 days ago | Patrick LaVictoire likes this | 1 comment

In this post, We present a new approach to robust cooperation, as an alternative to the “modal combat” framework. This post is very hand-waivey. If someone would like to work on making it better, let me know.


Over the last year or so, MIRI’s agent foundations research has moved from the proof based world of modal combat and proof based decision theory to the probabilistic prediction based world of logical inductors and reflective oracles.

In this transition, we have partially overcome the Lobian obstacle to self trust, since the new framework is not effected by Lob’s theorem. Unfortunately, we also lost the silver lining of Lob’s theorem: the Lobian handshakes which powered modal combat.

NicerBot (which cooperates with probability epsilon greater than the probability it expects its opponent cooperates) is an alternative to FairBot in the prediction based world of reflective oracles, but it has problems. It is not a best response to itself, cannot obviously be generalized into things like PrudentBot, and does not look like something you could get out of a general decision theory.

Here, we give a first attempt at bringing the full power of modal combat into a prediction based setting. This framework will likely change, and it is not yet clear the extent to which this framework can is philosophically realistic. We will focus on the prisoners’ dilemma, although this framework can be extended to other games.


Two players are in a prisoner’s dilemma. Each player has access to a prediction of what they expect the other player to do. This prediction takes the form a single sample which is “C” with probability equal to the probability that the other player cooperates, and is “D” with probability equal to the probability that the other player defects. The prediction can also output “null,” and the player needs to choose an output in the case of a “null” prediction, but the the prediction will be non-nil with probability 1.

A non-null prediction, in addition to telling you whether or not the opponent cooperates, also tells you the output of your opponents prediction of you. When you observe the other player’s prediction of you, this again comes along with their prediction of your prediction of them, and so on.

We can thus view the output of your prediction as a finite or infinite string of C’s and D’s. The first letter represents your prediction of what the opponent will do. The second letter represents your prediction of what the opponent will predict that you do, and so on. This string can stop at any point representing a null prediction.

These predictions will be accurate. For example, the probability that the first player gets a string starting with CC will be exactly the probability that the second player gets a string starting with C, and cooperates.

We are giving each player a single call to this prediction oracle. You could also imagine multiple independent calls, leading to an output that looks like a finite or infinite tree of C’s and D’s, but we will not go into that in this post.

Each player will have a strategy that a function from this finite or infinite string to probability with which to cooperate. We say that such a strategy is a depth \(k\) strategy if it is guaranteed not to look past that the first \(k\) characters of the string. We say that a strategy is finite if it is depth \(k\) for some \(k\).

Give two players with finite strategies, we now explain how to evaluate the game and end up with a probability of cooperation for each player.

We generate an infinite string randomly as follows:

First, we uniformly randomly choose which player goes first. This player is give a null prediction, and outputs C with some probability, and D otherwise. We then give the other player a prediction (of an opponent that has a null prediction and outputs that value), and repeat the process.

The result is an infinite string, where the \(n\)th character is generated by running player \(p\)s strategy on the on the string which is the reverse of the first \(n-1\) characters of the string, where player \(p\) is the player chosen to go first if \(n\) is odd, and the other player if \(n\) is even.

Claim: In this string, with probability 1, for every finite substring, the limit of \(\frac{1}{n}\) times the number of instances of that substring starting at an even index less than \(n\) converges.

Proof: Probably follows from some simple theorem about Markov Chains.

This gives for each prefix of the prediction, a density of that prediction in our randomly generated string. We can take the expected value of this density over all randomly generated infinite strings to get a probability of that prediction prefix, and use this to get a distribution over infinite prediction strings for each player, and thus probability of cooperation for each player.

Note that there may be more than one distribution on predictions that is accurate for a given pair of strategies restricted to nowhere null predictions. The above process uses the behavior on null predictions to specify a single such fixed point.


Now we show that analogues of FairBot and PrudentBot exist in this framework, and have similar properties to the modal combat versions.

ProbabilisticFairBot:
    if the prediction string is empty
        output C
    else
        output the first character of the prediction string

ProbabilisticFairBot has depth 1. Observe that ProbabilisticFairBot will cooperate with itself with probability 1, and is unexploitable. It always cooperates with at most the probability that the other player cooperates.

The following ProbabilisticPrudentBot can probably be made simpler. I have not tried.

ProbabilisticPrudentBot:
    if the prediction string has at most 1 character
        output D
    else if the prediction string has at most 3 character
        output C
    else if the prediction string starts with CDC
        output D
    else if the prediction string is exactly CCCDC or CCCD
        output D
    else
        output the first character of the prediction string

Observe that if ProbabilisticPrudentBot plays against itself the infinite string generated in finding out how to evaluate the game will be DDCCCCCCC…

If It plays against ProbabilisticFairbot and goes fist, the string will be DDCCCCCCC…

If it plays against ProbabilisticFairbot and goes second, the string will be CDDCCCCCC…

If it plays against CooperateBot and goes first, the string will be DCCCDCDCDC…

If it plays against CooperateBot and goes second, the string will be CDCCCDCDCDC…

Also note that ProbabilisticPrudentBot is unexploitable, since it cannot cooperate with probability higher than its opponent.

Thus, ProbabilisticPrudentBot is unexploitable, cooperates with itself and ProbabilisticFairBot, and exploits CooperateBot, just like in modal combat.


Here is a list of things that someone might want to work on related to this. (email me and let me know if you work on any of these.)

  1. Formalize all of the above.

  2. Generalize to arbitrary games.

  3. Observe that if we take two players, and restrict them to choosing depth \(k\) strategies, The set of strategies they have to choose between is compact. However, the outcome of the game is not a continuous function of the strategies, so it is not clear that Nash equilibria exist. (They do exist in the Prisoners’ Dilemma), but do they exist when you generalize this analysis to arbitrary games?

  4. Also, we could generalize to not requiring finite strategies from the players, allowing for any strategies that are a continuous function of the prediction. I believe that if you do this right, you will still be able to get unique evaluations of the strategies, but this may interfere with the existence of Nash equilibria.

  5. Figure out what happens when you allow multiple independent calls to the prediction oracle. I think things should still work out well.

  6. Think about how philosophically reasonable this framework. Maybe come up with something better, if it is not very good. I suspect the ontology of Jessica’s UDT=CDT+SIA post will help with thinking about this.

  7. While this is inspired by Logical Induction, It is not directly related. Can you make it connected, and come up with philosophically justified LI agents that cooperate with each other through the LI predictions.

  8. Probably easier than 7 is to connect it up with the reflective oracle formalism.

  9. Build a decision theory and connect it up to this system, similar to how we put proof-based decision theory in modal combat.



by Alex Mennen 286 days ago | link

The behavior of players with finite strategies is not computable. To see this, let \(n\)-HaltingBot be the strategy that cooperates iff the \(n\)th Turing machine halts (on null input) in fewer steps than the length of the string of predictions. If the \(n\)th Turing machine halts, then the depth of this strategy is the number of steps it takes for it to halt. If it doesn’t halt, then the depth of this strategy is \(0\). Either way, it’s a finite strategy. But you’re paying attention to the limiting behavior as the length of the string of predictions approaches infinity, and in that sense, the \(n\)-HaltingBot cooperates with the other player iff the \(n\)th program halts.

However, I’m pretty sure the behavior of players with provable finite bounds on the depths of their strategies are computable.

In terms of philosophical reasonableness, I’m kind of skeptical of this. In an actual implementation of this with real agents, presumably there isn’t some prediction oracle that will be assisting the agents, so the agents will have to generate these predictions themselves, and then they’ll actually have to compute the limits of their and the other player’s behavior in order to decide what to actually do. You also want to avoid getting exploited by players who act like ProbabilisticFairBot when the string of predictions is finite but defect when the string is infinite. So I think they’ll need to look for proofs that the other player is an agent that behaves according to this framework, with some bound on the depth of their strategy. Two players doing this then end up doing a Lobian handshake that they’ll both act according to this framework, but if they need to do a Lobian handshake, they may as well use it to cooperate, instead of using it to adopt a different framework that will lead them to cooperate.

reply



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

This is exactly the sort of
by Stuart Armstrong on Being legible to other agents by committing to usi... | 0 likes

When considering an embedder
by Jack Gallagher on Where does ADT Go Wrong? | 0 likes

The differences between this
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Looking "at the very
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Without reading closely, this
by Paul Christiano on Policy Selection Solves Most Problems | 1 like

>policy selection converges
by Stuart Armstrong on Policy Selection Solves Most Problems | 0 likes

Indeed there is some kind of
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

Very nice. I wonder whether
by Vadim Kosoy on Hyperreal Brouwer | 0 likes

Freezing the reward seems
by Vadim Kosoy on Resolving human inconsistency in a simple model | 0 likes

Unfortunately, it's not just
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

>We can solve the problem in
by Wei Dai on The Happy Dance Problem | 1 like

Maybe it's just my browser,
by Gordon Worley III on Catastrophe Mitigation Using DRL | 2 likes

At present, I think the main
by Abram Demski on Looking for Recommendations RE UDT vs. bounded com... | 0 likes

In the first round I'm
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

Fine with it being shared
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

RSS

Privacy & Terms