Intelligent Agent Foundations Forum
Where's the first benign agent?
link by Jacob Kopczynski 43 days ago | Patrick LaVictoire and Paul Christiano like this | 15 comments

by Daniel Dewey 34 days ago | Ryan Carey likes this | link

My comment, for the record:

I’m glad to see people critiquing Paul’s work – it seems very promising to me relative to other alignment approaches, so I put high value on finding out about problems with it. By your definition of “benign”, I don’t think humans are benign, so I’m not going to argue with that. Instead, I’ll say what I think about building aligned AIs out of simulated human judgement.

I agree with you that listing and solving problems with such systems until we can’t think of more problems is unsatisfying, and that we should have positive arguments for confidence that we won’t hit unforeseen problems; maybe at some point we need to give up on getting those arguments and do the best we can without them, but it doesn’t feel like we’re at that point yet. I’m guessing the main difference here is that I’m hopeful about producing those arguments and you think it’s not likely to work.

Here’s an example of how an argument might go. It’s sloppy, but I think it shows the flavor that makes me hopeful. Meta-execution preserving a “non-corrupting” invariant:

  (i) define a naturally occurring set of queries nQ.

  (ii) have some reason to think that nq in nQ are very unlikely to cause significant value drift in Som in 1 hour (nq are “non-corrupting”).

  (iii) let Q be the closure of nQ under “Som spends an hour splitting q into sub-queries”.

  (iv) have some reason to think that Som’s processing never purposefully converts non-corrupting queries into corrupting ones.

  (v) have some defense against random noise producing corrupting nq or q.

  (vi) conclude that all q in Q are non-corrupting, and so the system won’t involve any value-drifted Soms.
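The inductive structure above can be written as a toy sketch. Everything here is hypothetical: in reality we have no oracle for “corrupting”, which is exactly what the base case and the inductive step have to establish by other means.

```python
def som_split(query):
    """Stand-in for "Som spends an hour splitting q into sub-queries".
    Hypothetical rule: split a query in half until it is small."""
    if len(query) <= 4:          # small queries are answered directly
        return []
    mid = len(query) // 2
    return [query[:mid], query[mid:]]

def closure(natural_queries):
    """Q: the closure of the natural queries nQ under Som's splitting."""
    seen, frontier = set(), list(natural_queries)
    while frontier:
        q = frontier.pop()
        if q in seen:
            continue
        seen.add(q)
        frontier.extend(som_split(q))
    return seen

# If every nq in nQ is non-corrupting (base case) and som_split never
# turns a non-corrupting query into a corrupting one (inductive step),
# then by induction every q in closure(nQ) is non-corrupting.
```

The code only shows the shape of the induction; the hard part is supplying the two assumptions in the comment, which is what the “have some reason to think” clauses gesture at.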

This kind of system would run sort of like your (2) or Paul’s meta-execution.

There are some domains where this argument seems clearly true and Som isn’t just being used as a microprocessor, e.g. Go problems or conjectures to be proven. In these cases it seems like (ii), (iii), and (iv) are true by virtue of the domain – no Go problems are corrupting – and Som’s processing doesn’t contribute to the truth of (iii).

For some other sets Q, it seems like (ii) will be true because of the nature of the domain (e.g. almost no naturally occurring single pages of text are value-corrupting in an hour), (iv) will be true because it would take significant work on Som’s part to convert a non-scary q into a scary q’ and that Som wouldn’t want to do this unless they were already corrupted, and (v) can be made true by using a lot of different “noise seeds” and some kind of voting system to wash out noise-produced corruption.
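The last clause, washing out noise-produced corruption by averaging over many “noise seeds”, can be sketched as a majority vote. The corruption probability and the answer strings below are made up purely for illustration:

```python
import random
from collections import Counter

def run_som(query, seed):
    """Hypothetical single run of Som on a query under one noise seed;
    with small probability, random noise corrupts the answer."""
    rng = random.Random(f"{query}:{seed}")
    if rng.random() < 0.05:      # rare, noise-produced corruption
        return "corrupted-answer"
    return "good-answer"

def voted_answer(query, n_seeds=31):
    """Run many independently seeded Soms and take a majority vote,
    so that rare noise-produced corruption is washed out."""
    votes = Counter(run_som(query, seed) for seed in range(n_seeds))
    return votes.most_common(1)[0][0]
```

With 31 seeds and 5% per-run corruption, a corrupted majority needs 16 or more corrupted runs, which is astronomically unlikely. Note this only washes out independent random noise; it does nothing against a systematic tendency shared by every seed.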

Obviously this argument is frustratingly informal, and maybe I could become convinced that it can’t be strengthened, but I think I’d mostly be convinced by trying and failing, and it seems reasonably likely to me that we could succeed.

Paul seems to have another kind of argument for another kind of system in mind here, with a sketch of an argument at “I have a rough angle of attack in mind”. Obviously this isn’t an argument yet, but it seems worth looking into.

FWIW, Paul is thinking and writing about the kinds of problems you point out, e.g. in this post, this post, or this post (search “virus” on that page). Not sure if his thoughts are helpful to you.

If you’re planning to follow up this post, I’d be most interested in whether you think it’s unlikely to be possible to design a process that we can be confident will avoid Som drift. I’d also be interested to know if there are other approaches to alignment that seem more promising to you.


by Wei Dai 34 days ago | link

>(e.g. almost no naturally occurring single pages of text are value-corrupting in an hour)

I don’t see what “naturally occurring” here could mean (even informally) that would both make this statement true and make it useful to try to design a system that could safely process “naturally occurring single pages of text”. And how would a system like this know whether a given input is “naturally occurring” and hence safe to process? Please explain?


by Daniel Dewey 33 days ago | link

“naturally occurring” means “could be inputs to this AI system from the rest of the world”; naturally occurring inputs don’t need to be recognized, they’re here as a base case for the induction. Does that make sense?

If there are other really powerful reasoners in the world, then they could produce value-corrupting single pages of text (and I would then worry about Soms becoming corrupted). If there aren’t, I’d guess that possible input single pages of text aren’t value-corrupting in an hour. (I would certainly want a much better answer than “I guess it’s fine” if we were really running something like this.)

To clarify my intent here, I wanted to show a possible structure of an argument that could make us confident that value drift wasn’t going to kill us. If you think it’s really unlikely that any argument of this inductive form could be run, I’d be interested in that (or if Paul or someone else thought I’m on the wrong track / making some kind of fundamental mistake).


by Wei Dai 33 days ago | Patrick LaVictoire and Vladimir Nesov like this | link

Yes, that clarifies what you were intending to say. Paul typically assumes a need to compete with other AIs with comparable resources, so I wasn’t expecting “naturally occurring” to mean coming from an environment with no other powerful reasoners.

I think if we’re in an actual non-competitive scenario (and hence can tolerate large inefficiencies in our AI design), then some sort of argument like this can possibly be made to work, but it will probably be much trickier than you seem to suggest. Here are some problems I can see aside from the ones you’ve already acknowledged.

  1. We probably don’t need “powerful reasoners” to produce value-corrupting single pages of text, just a model of Som and relatively crude optimization techniques. In other words, for this system to be safe you probably need more of a tech lead/monopoly than you had in mind. Exactly how much is needed seems hard to know, so how do you achieve high confidence of safety?

  2. Presumably you’re building this system to be at least as capable as a human but at less risk of value drift. In order to do work comparable to a human contemplating the input query over a few days or years, the system needs some way of transmitting information between the short-lived Soms. Given that, what is preventing the overall system from undergoing value drift because, for example, one of the Soms has an (incorrect and value-corrupting) philosophical epiphany and transmits it to other Soms through this communication channel? Nothing in your argument seems to adequately deal with this. If this happens by pure chance, then you can try to wash it out by averaging over different “noise seeds”, but what if Som has a tendency towards certain incorrect lines of argument, either naturally or because you placed him in a weird environment like this one?

  3. The output of this system isn’t “naturally occurring” so subsequent inputs to it won’t be either. If we’re to use this system a second time in a way that preserves your “non-corrupting invariant”, we have to either prevent all information flow from the first output to the second input, or have another argument for why this system along with whatever part of human civilization the information flows through preserves “non-corrupting” as a whole. Otherwise, someone could for example submit a bunch of queries to the system that would help them eventually craft a value-corrupting single page of text, and disguise these queries as innocent queries.
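The first objection, that a model of Som plus relatively crude optimization might suffice, can be sketched without any powerful reasoner at all. The scoring function below is a made-up stand-in for a learned model of what shifts Som’s values; every name here is hypothetical:

```python
import random

def som_model_score(text):
    """Hypothetical learned model scoring how value-corrupting a page
    of text would be for Som (higher = worse). Purely illustrative."""
    return sum(text.count(w) for w in ("epiphany", "obvious", "must"))

def crude_search(vocabulary, length=20, iters=2000, seed=0):
    """Random hill-climbing over word sequences: no powerful reasoner,
    just a model of Som and a very crude optimizer."""
    rng = random.Random(seed)
    best = [rng.choice(vocabulary) for _ in range(length)]
    best_score = som_model_score(" ".join(best))
    for _ in range(iters):
        candidate = list(best)
        candidate[rng.randrange(length)] = rng.choice(vocabulary)
        score = som_model_score(" ".join(candidate))
        if score > best_score:    # keep strictly better mutations
            best, best_score = candidate, score
    return " ".join(best), best_score
```

How good the model of Som and the optimizer would need to be in practice is an empirical question, which is the point of the objection: the required tech lead is hard to bound.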


by Daniel Dewey 33 days ago | link

These objections are all reasonable, and 3 is especially interesting to me – it seems like the biggest objection to the structure of the argument I gave. Thanks.

I’m afraid that the point I was trying to make didn’t come across, or that I’m not understanding how your response bears on it. Basically, I thought the post was prematurely assuming that schemes like Paul’s are not amenable to any kind of argument for confidence, and we will only ever be able to say “well, I ran out of ideas for how to break it”, so I wanted to sketch an argument structure to explain why I thought we might be able to make positive arguments for safety.

Do you think it’s unlikely that we’ll be able to make positive arguments for the safety of schemes like Paul’s? If so, I’d be really interested in why – apologies if you’ve already tried to explain this and I just haven’t figured that out.


by Wei Dai 33 days ago | link

>I’m afraid that the point I was trying to make didn’t come across, or that I’m not understanding how your response bears on it.

I guess my point was that any argument for confidence will likely be subject to the kinds of problems I listed, and I don’t see a realistic plan on Paul’s (or anyone else’s) part to deal with them.

>Do you think it’s unlikely that we’ll be able to make positive arguments for the safety of schemes like Paul’s?

It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level. Without that, I don’t see how we can reason about a system that can encounter philosophical arguments, and make strong conclusions about whether it’s able to process them correctly. This seems intuitively obvious to me, but I don’t totally rule out that there is some sort of counterintuitive approach that could somehow work out.


by Daniel Dewey 32 days ago | link

Ah, gotcha. I’ll think about those points – I don’t have a good response. (Actually adding “think about”+(link to this discussion) to my todo list.)

>It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level.

Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?


by Wei Dai 32 days ago | Daniel Dewey likes this | link

>Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?

Interesting question. I guess it depends on the form that the metaphilosophical knowledge arrives in, but it’s currently hard to see what that could be. I can only think of two possibilities, and neither seems highly plausible.

  1. It comes as a set of instructions that humans (or emulations/models of humans) can use to safely and correctly process philosophical arguments, along with justifications for why those instructions are safe/correct. Kind of like a detailed design for meta-execution along with theory/evidence for why it works. But natural language is fuzzy and imprecise, and humans are full of unknown security holes, so it’s hard to see how such instructions could possibly make us safe/correct, or what kind of information could possibly make us confident of that.

  2. It comes as a set of algorithms for reasoning about philosophical problems in a formal language, along with instructions/algorithms for how to translate natural language philosophical problems/arguments into this formal language, and justifications for why these are all safe/correct. But this kind of result seems very far from any of our current knowledge bases, and it doesn’t seem compatible with any of the current trends in AI design (including things like deep learning, decision-theory-based ideas, and Paul’s kinds of designs).

So I’m not very optimistic that a metaphilosophical approach will succeed either. If it ultimately does, it seems like maybe there will have to be some future insights whose form I can’t foresee. (Edit: Either that or a lot of time free from arms-race pressure to develop the necessary knowledge base and compatible AI design for 2.)


by Jacob Kopczynski 33 days ago | link

Your point 2 is an excellent summary of my reasons for being skeptical of relying on human reasoning. (I also expect more outlandish means of transmissible value-corruption would show up, on the principles that edge cases are hard to predict and that we don’t really understand our minds.)


by Paul Christiano 36 days ago | link

(I replied last weekend, but the comment is awaiting moderation.)


by Jacob Kopczynski 35 days ago | Daniel Dewey likes this | link

Apologies, I stopped getting moderation emails at some point and haven’t fixed it properly.


by Daniel Dewey 35 days ago | link

I also commented there last week and am awaiting moderation. Maybe we should post our replies here soon?


by Wei Dai 34 days ago | link

(I’m replying to your comment here since I don’t trust personal blogs to stay alive and I don’t want my comments to disappear with them.)

Your point about not giving up too easily seems a good one. There could well be some ideas that are counterintuitive (to most people) but ultimately workable after a lot of effort, like public-key crypto in another area that I’m familiar with. I also think you’re overly optimistic, but that’s not necessarily a bad thing if it helps you explore some areas that others wouldn’t.

But I’m worried that unlike typical CS fields, where it’s relatively easy to define technical concepts (and then prove theorems about them) and run algorithms to test/debug them, the analogous things in AI alignment will be many times harder, so we can’t achieve high confidence that something works even if it actually does, or narrow down the precise right idea from the neighborhood that it sits in. Even in crypto, it took decades to refine the idea of “security” into things like “indistinguishability under adaptive chosen ciphertext attack” and then find actually secure algorithms. All of the earliest public-key crypto algorithms deployed were in fact broken, even though they formed the basis for later algorithms.

If ideas about AI alignment evolve in a similar way (but on an even longer timescale, due to concepts being even harder to define and experiments being harder to run), it’s hard to see how things will turn out well. If the best we can achieve in the relevant time-frame are plausible AI alignment ideas or algorithms that are “in the right neighborhood”, that could even make things worse (than not having them at all) by causing people to feel safer to pursue/deploy AI capability or not invest as much in other ways of preventing AI risk.


by Wei Dai 34 days ago | link

217/PDV (I assume you’re the same person?), I agree with much of what you wrote, but do you have your own ideas for how to achieve Friendly AI? It seems like most of the objections against Paul’s ideas also apply to other people’s (such as MIRI’s). The fact that humans aren’t benign (or can’t be determined to be benign) under a sufficiently large set of environments/inputs, suffer from value drift, have unknown/unpatchable security holes all pose similar problems for CEV, for instance, which nobody has proposed a plausible way to solve, AFAIK.

In a way, I guess Paul has actually done more to explicitly acknowledge these problems than just about anyone else, even if I think (as you do) that he is too optimistic about the prospect of solving them using the ideas he has sketched out.


by Jacob Kopczynski 33 days ago | Daniel Dewey likes this | link

(Yes, same person.)

I agree that no one else has solved the problem or made much progress. I object to Paul’s approach here because it couples the value problem more closely to other problems in architecture and value stability. I would much prefer holding off on attacking it for the moment, rather than this approach, which, to my reading, takes for granted that the problem is not hard and rests further work on top of it. Holding off at least leaves room for other pieces nearby to be carved out and provide a better idea of what properties a solution would have; this approach seems to be based on the solution looking vastly simpler than I think is true.

I also have a general intuitive prior that reinforcement learning approaches are untrustworthy and are “building on sand”, but that’s neither precise nor persuasive so I’m not writing it up except on questions like this where it’s more solid. I’ve put much less work into this field than Paul or others, so I don’t want to challenge things except where I’m confident.





