Intelligent Agent Foundations Forumsign up / log in
Cooperative Oracles: Introduction
post by Scott Garrabrant 192 days ago | Abram Demski, Jessica Taylor and Patrick LaVictoire like this | 1 comment

This is the first in a series of posts introducing a new tool called a Cooperative Oracle. All of these posts are joint work Sam Eisenstat, Tsvi Benson-Tilsen, and Nisan Stiennon.

Here is my plan for posts in this sequence. I will update this as I go.

  1. Introduction
  2. Nonexploited Bargaining
  3. Stratified Pareto Optima and Almost Stratified Pareto Optima
  4. Definition and Existence Proof
  5. Alternate Notions of Dependency

In this post, I will give a sketchy advertisement of what is to come.

Cooperative oracles are a refinement on reflective oracles. We consider the set of Turing machines with access to a reflective oracle, and we think of each Turing machine as having a utility function (perhaps written in a comment in the source code). This utility function is itself computable using the reflective oracle.

Since the definition of a reflective oracle use a fixed point procedure that may have had multiple fixed points, there are actually many different reflective oracles. However, we will take something like a Pareto optimum in the class of reflective oracles.

For example if two players in a prisoner’s dilemma use a reflective oracles to cooperate with exactly the probability the other player cooperates, there is a continuum of fixed points in which the player cooperate with the same probability. We want to take the fixed point in which the both cooperate.

Unfortunately, we do not want to just take a standard Pareto Optimum, since we are working with the class of all Turing Machines, and for every machine with a utility function, there is another with the negation of that utility function, so all points will be Optima.

For this, we will use a stratified version of Pareto optima. If a third party wants the two above players to defect against each other, we will still choose mutual cooperation. Since the players only reference each other, and not the third party, the third party is not considered in the notion of optimality for the output of the Oracle on the prisoners’ dilemma game.

We will be able to build a cooperative oracle, which is a reflective oracle that (roughly) ensures that all of the outputs of the computations are Pareto optimal for all of the utility function of the machines involved it that computation. This Pareto optimality is only used to choose between multiple fixed points. We will still be restricted to have the computation follow its source code, and have the reflective oracle give honest data about the output of the computations.

Using cooperative oracles, we will be able to set up a system similar to modal combat, but with reflective oracles.

We will also be able to study bargaining. We will formalize the phenomenon described in the Eliezer Yudkowsky post Cooperating with agents with different ideas of fairness, while resisting exploitation, and implement this style of bargaining with reflective oracles.

This is interesting because it reduces the problem of finding a fair way to split gains from trade from a group endeavor (like in standard cooperative game theory) to an individual endeavor, where each player specifies what they consider to be a fair outcome for them, and the reflective oracle takes care of the rest. The problem of figuring out how to define what is fair to you is still wide open, however.

The notion of cooperative oracles presented will be use a clunky syntactic notion of when one computation depends on another. If an algorithm calls the reflective oracle on the output of another, we say the first algorithm depends on the second, and we take the transitive closure of this property. There are many alternatives to this, some involving looking at the actual function the algorithms compute and checking how these functions depend on other values. Specifying better notions of computations depending on each other is a hard problem, and better solutions to this may yield better definitions of cooperative oracles.

by Scott Garrabrant 96 days ago | link

I have stopped working on this sequence, because a coauthor is trying to write it up as a more formal paper instead.






Unfortunately, it's not just
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

>We can solve the problem in
by Wei Dai on The Happy Dance Problem | 1 like

Maybe it's just my browser,
by Gordon Worley III on Catastrophe Mitigation Using DRL | 2 likes

At present, I think the main
by Abram Demski on Looking for Recommendations RE UDT vs. bounded com... | 0 likes

In the first round I'm
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

Fine with it being shared
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

I think the point I was
by Abram Demski on Predictable Exploration | 0 likes

(also x-posted from
by Sören Mindermann on The Three Levels of Goodhart's Curse | 0 likes

(x-posted from Arbital ==>
by Sören Mindermann on The Three Levels of Goodhart's Curse | 0 likes

>If the other players can see
by Stuart Armstrong on Predictable Exploration | 0 likes

Thinking about this more, I
by Abram Demski on Predictable Exploration | 0 likes

> So I wound up with
by Abram Demski on Predictable Exploration | 0 likes

Hm, I got the same result
by Alex Appel on Predictable Exploration | 1 like

Paul - how widely do you want
by David Krueger on Funding opportunity for AI alignment research | 0 likes

I agree, my intuition is that
by Abram Demski on Smoking Lesion Steelman III: Revenge of the Tickle... | 0 likes


Privacy & Terms