Intelligent Agent Foundations Forum

Thanks to Anna Salamon for the idea of making an AI which cares about what happens in a counterfactual ideal world, rather than the real world with the transistors in it, as a corrigibility strategy. I haven’t yet been able to find a way to make that idea work for an agent/utility maximizer, but it inspired the idea of doing the same thing in an oracle.

by Stuart Armstrong 1344 days ago | link

You could have an agent that cares about what an idealised counterfactual human would think about its decisions (if the idealised human had a huge amount of time to think them over). Compare with Paul Christiano’s ideas.

Now, this isn’t safe, but it’s at least something you might be able to play with.


The procrastination paradox is isomorphic to well-founded recursion. In the reasoning, the fourth step, “whether or not I press the button, the next agent or an agent after that will press the button”, is an invalid proof step; it shows that there is an inductive chain of steps ending at the conclusion, but not that that chain has a base case.

This can only happen when the relation between an agent and its successor is not well-founded. If there is any well-founded relation between agents and their successors - either because they’re in a finite universe, or because the first agent picked a well-founded relation and built it in - then the button will eventually get pushed.
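
A minimal toy model of that claim (my own illustration, not from the original comment): with a finite, well-founded chain of agents, the deferral step has to bottom out at a last agent with no successor, so the button gets pressed.

```python
# Toy model: each agent defers to its successor when it has one, and presses
# the button only at the base case of the chain.

def button_gets_pressed(num_agents):
    """Finite (well-founded) chain of agents, indexed 0..num_agents-1.
    Every agent with a successor defers; the last agent has no successor,
    so the chain of deferrals bottoms out and it presses the button."""
    for i in range(num_agents):
        has_successor = i + 1 < num_agents
        if not has_successor:
            return True   # base case: no one left to defer to, so press
        # otherwise defer: "some later agent will press it"
    return False          # only reachable for an empty chain

print(button_gets_pressed(10))   # True: a well-founded chain gets the button pressed
# With a non-well-founded (e.g. infinite) chain there is no base case, the
# deferral step repeats forever, and the button is never pressed.
```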

by Jim Babcock 1464 days ago | Nate Soares and Patrick LaVictoire like this | link | parent | on: Modal Bargaining Agents

Point (1) seems to be a combination of two things: the need to work around the absence of a mathematically elegant communication channel in the formalism, and an incentive to choose some orderings over others because of (2). If (2) is solved and they can communicate, then they can agree on an ordering without any trouble, because they’re both indifferent to which one is chosen.

If you don’t have communication but you have solved (2), I think you can solve the problem by splitting agents into two stages. In the first stage, agents try to coordinate on an ordering over the points. To do this, the two agents X and Y each generate a bag of orderings Ox and Oy that they think might be Schelling points. Agent X first draws an ordering from Ox and tries to prove coordination on it in PA+1, then draws another with replacement and tries to prove coordination on it in PA+2, then draws another with replacement and tries to prove coordination on it in PA+3, etc. Agent Y does the same thing, with a different bag of proposed orderings. If there is overlap between their respective sets, then the odds that they will fail to find a coordination point fall off exponentially, albeit slowly.

Then in the second stage, after proving in PA+n for some n that they will both go through the same ordering, each will try to prove coordination on point 1 in PA+n+1, on point 2 in PA+n+2, etc., for the points they find acceptable.
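
A rough numerical check of the first-stage claim (my simplification: it drops the PA+n proof machinery entirely and just treats coordination as both agents happening to draw the same ordering in the same round; the bags and their overlap are made up):

```python
import random

# Toy check of the exponential-falloff claim above. "Coordination" here just
# means both agents draw the same ordering in the same round.

bag_x = ["o1", "o2", "o3", "o4"]   # X's guesses at Schelling-point orderings
bag_y = ["o3", "o4", "o5", "o6"]   # Y's guesses; the overlap is {"o3", "o4"}

def rounds_until_match(max_rounds=10_000):
    for n in range(1, max_rounds + 1):
        if random.choice(bag_x) == random.choice(bag_y):
            return n
    return None

trials = 100_000
results = [rounds_until_match() for _ in range(trials)]

# Per-round match probability: p = sum over shared orderings of P_X(o) * P_Y(o)
p = sum(1 / (len(bag_x) * len(bag_y)) for _ in set(bag_x) & set(bag_y))
for n in (1, 5, 20, 50):
    missed = sum(1 for r in results if r is None or r > n) / trials
    print(f"P(no match after {n} rounds): simulated {missed:.3f}, "
          f"analytic {(1 - p) ** n:.3f}")
```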

by Patrick LaVictoire 1464 days ago | link

I think your proposal is more complicated than, say, mutually randomly choosing an ordering in one step. Does it have any superior properties to just doing that?

by Jim Babcock 1464 days ago | Patrick LaVictoire likes this | link

I don’t think the mechanics of the problem, as specified, let them jointly pick something at random without something like an externally provided probability distribution. This proposal is aimed at eliminating that requirement. But it may be that this issue isn’t very illuminating, and that it would be better addressed by adjusting the problem formulation to provide one.

by Patrick LaVictoire 1463 days ago | link

We already assumed a source of mutual randomness in order to guarantee that the feasible set is convex (caption to Figure 1).

by Jim Babcock 1463 days ago | link

To clarify, what I meant was not that they need a source of shared randomness, but that they need a shared probability distribution; i.e., having dice isn’t enough, they also need to coordinate on a way of interpreting the dice, which is similar to the original problem of coordinating on an ordering over points.
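
A tiny hypothetical illustration of that distinction (the orderings and the two conventions are made up): both agents see the same die roll, but without an agreed way of mapping rolls to orderings they still fail to coordinate.

```python
import random

# Shared randomness without a shared convention for interpreting it: both
# agents see the same roll, but map it to orderings differently, so they
# still end up uncoordinated.

orderings = ["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"]
shared_roll = random.randrange(6)          # the same die roll, visible to both

pick_x = orderings[shared_roll]            # X's convention: read the roll directly
pick_y = orderings[5 - shared_roll]        # Y's convention: read it in reverse
print(pick_x == pick_y)                    # always False here: no coordination
```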


Regarding (2), the main problem is that this creates an incentive for agents to choose orderings that favor themselves when there is overlap between the acceptable regions, which makes it likely that they won’t be able to agree on an ordering at all. Jessica Taylor’s solution solves the problem of not being able to find an ordering, but at the cost of all the surplus utility that was in the region of overlap. For example, if Janos and I are deciding how to divide a dollar, I offer that Janos keep it, and Janos offers that I keep it, then that solution would have us set it on fire instead.

Instead, perhaps we could redefine the algorithm so that “cooperation at point N” means entering another round of negotiation, where only points that each agent finds at least as good as N are considered, and negotiation continues until it reaches a fixed point.

How to actually convert this into an algorithm? I haven’t figured out all the technical details, but I think the key is having agents prove things of the form “we’ll coordinate on a point I find at least as good as point N”.
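
A toy sketch of what that iteration might look like (my own illustration; it ignores the modal/proof-theoretic side entirely, and the product-of-gains tie-break inside each round is a hypothetical choice, not something proposed above):

```python
# Each point is a pair (my utility, Janos's utility). "Cooperation at point N"
# opens a new round restricted to points each agent finds at least as good as
# N, and the process iterates until it reaches a fixed point.

def renegotiate(points, start):
    current = start
    while True:
        acceptable = [p for p in points
                      if p[0] >= current[0] and p[1] >= current[1]]
        # Hypothetical within-round rule: maximize the product of gains.
        candidate = max(acceptable,
                        key=lambda p: (p[0] - current[0]) * (p[1] - current[1]))
        if candidate == current:
            return current
        current = candidate

# Dollar example from the comment: instead of settling for the surplus-burning
# outcome (0, 0), renegotiating from it climbs into the overlap region.
splits = [(0.0, 0.0), (0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0)]
print(renegotiate(splits, start=(0.0, 0.0)))   # -> (0.5, 0.5)
```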

by Patrick LaVictoire 1464 days ago | Jessica Taylor likes this | link

I thought about that at some point, in the case where they’re biased in their own directions, but of course there it just reintroduces the incentive for playing hardball. In the case where they’re each overly generous, they already have the incentive to bias slightly more in their own direction.

However, there’s not an obvious way to translate the second round of negotiation into the modal framework…


This relates to what in Boston we’ve been calling the Ensemble Stability problem: given multiple utility functions, some of which may be incorrect, how do you keep the AI from sacrificing the other values for the incorrect one(s)? Maximin is a step in the right direction, but I don’t think it fully solves the problem.

I see two main issues. First, suppose one of the utility functions in the set is erroneous, and the AI predicts that in the future we’ll realize this and create a different AI that optimizes without it. Then the AI will be incentivized to prevent the creation of that AI, or to modify it to include the erroneous value. The second issue is that, if one of the utility functions is offset so it outputs a score well below the others, the other utility functions will be crowded out in the AI’s attention and resource allocation.

One approach to the latter problem might be to make a utility function aggregation that approaches maximin behavior in the limit as the AI’s resources go to infinity, but starts out more linear.
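
One concrete family with roughly that shape (my suggestion, not something proposed in the thread) is a soft minimum: it is close to the average of the utilities for small beta and approaches the hard minimum as beta grows, so tying beta to the AI’s resources would give the interpolation described above.

```python
import numpy as np

# A hedged sketch of one aggregation with the property described above: a
# "soft minimum" that is roughly linear (average-like) for small beta and
# approaches the hard minimum as beta grows. Tying beta to available
# resources is my assumption, not something specified in the thread.

def soft_min(utilities, beta):
    u = np.asarray(utilities, dtype=float)
    # -(1/beta) * log(mean(exp(-beta * u))) -> mean(u) as beta -> 0,
    #                                          min(u)  as beta -> infinity
    return -(1.0 / beta) * np.log(np.mean(np.exp(-beta * u)))

utilities = [3.0, 0.5, 2.0]
for beta in (0.01, 1.0, 100.0):
    print(beta, soft_min(utilities, beta))
# beta = 0.01 -> about the mean (~1.83); beta = 100 -> about the min (~0.51)
```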

by Jessica Taylor 1463 days ago | Patrick LaVictoire likes this | link

> First, suppose one of the utility functions in the set is erroneous, and the AI predicts that in the future we’ll realize this and create a different AI that optimizes without it. Then the AI will be incentivized to prevent the creation of that AI, or to modify it to include the erroneous value.

The utility functions are normalized so that they all assign 0 to the status quo. The status quo includes humans designing an AI to optimize something. So the minimax agent won’t do anything worse for the values of the later AI than what would happen normally, unless the future AI’s utility function is not in minimax’s ensemble.

> The second issue is that, if one of the utility functions is offset so it outputs a score well below the others, the other utility functions will be crowded out in the AI’s attention and resource allocation.

Since they’re normalized to return 0 on the status quo, this won’t quite happen, but it could be that one is a lot harder to increase above 0 than the others, and so more resources will go to increasing that one above 0 than to the others.
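
A minimal sketch of that normalization (my illustration; the actions and utility numbers are made up): subtract each utility function’s status-quo value, then pick the action with the best worst-case normalized utility.

```python
# Each utility function is shifted so the status quo scores 0, and the agent
# maximizes the worst-case normalized value.

utilities = {                            # action -> (u1, u2, u3)
    "status_quo": (5.0, 2.0, 7.0),
    "plan_a":     (9.0, 2.5, 7.5),
    "plan_b":     (12.0, 1.0, 8.0),      # great for u1, worse than status quo for u2
}

baseline = utilities["status_quo"]

def normalized_min(action):
    return min(u - b for u, b in zip(utilities[action], baseline))

best = max(utilities, key=normalized_min)
print(best, normalized_min(best))   # picks plan_a (min gain 0.5); plan_b drops u2 below 0
```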


This generalizes nicely. The asteroid problem comes with a natural partitioning into two pieces, such that either piece alone has no effect but the two pieces together do. Most problems won’t have such a partition built in.

If we want the answer to a yes/no question, the first instinct is that no such partitioning is possible: if each of two AIs provides less than one bit of information about the answer, then combining their outputs won’t give a reliable answer. But we can make it work by combining the yes/no question with another problem, as follows.

Suppose you want the answer to a yes-or-no question Q. Pick a hard problem H: an inconsequential yes-or-no question that AIs can solve reliably, but which humans can’t, and for which P(H)=0.5. Take two AIs X and Y. The first AI outputs X=xor(Q,H), and believes that the second AI will output a coin flip. The second AI outputs Y=H, and believes that the first AI will output a coin flip. Then the answer can be recovered by combining the two outputs: Q=xor(X,Y).
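
A quick sketch of the construction (my illustration; the bias on Q is just to make the point visible): each AI’s output on its own is distributed like a fair coin, while xor-ing the two outputs recovers Q exactly.

```python
import random

# Q is the question we care about (here deliberately biased towards "yes");
# H is an inconsequential hard question with P(H) = 0.5. Each AI's individual
# output looks like a fair coin, yet xor-ing the two outputs recovers Q.

def run_once():
    Q = 1 if random.random() < 0.9 else 0   # true answer, biased towards 1
    H = random.choice([0, 1])               # hard question, P(H) = 0.5
    X = Q ^ H                               # first AI outputs xor(Q, H)
    Y = H                                   # second AI outputs H
    assert (X ^ Y) == Q                     # combining the outputs recovers Q
    return X

counts = {0: 0, 1: 0}
for _ in range(100_000):
    counts[run_once()] += 1
print(counts)   # roughly 50/50: X alone carries no information about Q
```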

by Stuart Armstrong 1463 days ago | link

Interesting generalisation.

The next step is to allow more interaction between AI and world, while still minimising impact safely…
