Intelligent Agent Foundations Forum

Thanks for writing this – I think it’s a helpful kind of reflection for people to do!


by Daniel Dewey 816 days ago | Ryan Carey likes this | link | parent | on: Where's the first benign agent?

My comment, for the record:

I’m glad to see people critiquing Paul’s work – it seems very promising to me relative to other alignment approaches, so I put high value on finding out about problems with it. By your definition of “benign”, I don’t think humans are benign, so I’m not going to argue with that. Instead, I’ll say what I think about building aligned AIs out of simulated human judgement.

I agree with you that listing and solving problems with such systems until we can’t think of more problems is unsatisfying, and that we should have positive arguments for confidence that we won’t hit unforeseen problems; maybe at some point we need to give up on getting those arguments and do the best we can without them, but it doesn’t feel like we’re at that point yet. I’m guessing the main difference here is that I’m hopeful about producing those arguments and you think it’s not likely to work.

Here’s an example of how an argument might go. It’s sloppy, but I think it shows the flavor that makes me hopeful. Meta-execution preserving a “non-corrupting” invariant:

  1. define a naturally occurring set of queries nQ.

  2. have some reason to think that nq in nQ are very unlikely to cause significant value drift in Som in 1 hour (nq are “non-corrupting”).

  3. let Q be the closure of nQ under “Som spends an hour splitting q into sub-queries”.

  4. have some reason to think that Som’s processing never purposefully converts non-corrupting queries into corrupting ones.

  5. have some defense against random noise producing corrupting nq or q.

  6. conclude that all q in Q are non-corrupting, and so the system won’t involve any value-drifted Soms.
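The inductive shape of steps 1 to 6 can be sketched as a reachability check. This is only a toy under loud assumptions: queries are plain strings, `split` stands in for “Som spends an hour splitting q into sub-queries”, and `is_non_corrupting` is a hypothetical oracle that nobody actually has. All names here are illustrative and do not come from the comment; the real argument would have to be made by reasoning, not enumeration.

```python
from collections import deque
from typing import Callable, Iterable, List, Set


def closure_preserves_invariant(
    natural_queries: Iterable[str],
    split: Callable[[str], List[str]],
    is_non_corrupting: Callable[[str], bool],
    max_queries: int = 10_000,
) -> bool:
    """Return True if every query reachable from nQ by splitting passes the check."""
    seen: Set[str] = set()
    frontier = deque(natural_queries)  # step 1: the base set nQ
    while frontier:
        q = frontier.popleft()
        if q in seen:
            continue
        if not is_non_corrupting(q):  # steps 2 and 4 are supposed to rule this out
            return False
        seen.add(q)
        if len(seen) > max_queries:
            raise RuntimeError("closure too large to enumerate")
        frontier.extend(split(q))  # step 3: close under splitting
    return True  # step 6: every q in Q is non-corrupting


# Toy stand-ins (hypothetical): "!" marks a corrupting query.
def halve(q: str) -> List[str]:
    if len(q) <= 1:
        return []
    mid = len(q) // 2
    return [q[:mid], q[mid:]]


def careless_halve(q: str) -> List[str]:
    # A splitter that violates step 4 by emitting a corrupting sub-query.
    return [] if len(q) <= 1 else [q[: len(q) // 2], "!"]


def no_bang(q: str) -> bool:
    return "!" not in q


print(closure_preserves_invariant(["abcd", "xyz"], halve, no_bang))    # True
print(closure_preserves_invariant(["abcd"], careless_halve, no_bang))  # False
```

The second call shows why step 4 carries weight: a single splitter that turns a harmless query into a harmful one breaks the invariant for the whole closure.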

This kind of system would run sort of like your (2) or Paul’s meta-execution.

There are some domains where this argument seems clearly true and Som isn’t just being used as a microprocessor, e.g. Go problems or conjectures to be proven. In these cases it seems like (ii), (iii), and (iv) are true by virtue of the domain – no Go problems are corrupting – and Som’s processing doesn’t contribute to the truth of (iii).

For some other sets Q, it seems like (ii) will be true because of the nature of the domain (e.g. almost no naturally occurring single pages of text are value-corrupting in an hour), (iv) will be true because it would take significant work on Som’s part to convert a non-scary q into a scary q’ and that Som wouldn’t want to do this unless they were already corrupted, and (v) can be made true by using a lot of different “noise seeds” and some kind of voting system to wash out noise-produced corruption.
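A rough sketch of the “noise seeds plus voting” idea in (v), assuming that corruption events are independent across seeds and that each run is corrupted with the same probability `p`. Both assumptions are mine, as are the function names; `toy_run` is a hypothetical stand-in for one Som-based run. Voting of this kind does nothing against errors that are correlated across seeds.

```python
from collections import Counter
from math import comb
from typing import Callable, Hashable, Iterable


def majority_vote(
    run: Callable[[str, int], Hashable], query: str, seeds: Iterable[int]
) -> Hashable:
    """Run the same query once per noise seed and return the most common answer."""
    counts = Counter(run(query, seed) for seed in seeds)
    answer, _ = counts.most_common(1)[0]
    return answer


def p_majority_corrupted(p: float, n: int) -> float:
    """P(more than half of n independent runs are corrupted), each with probability p."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1)
    )


def toy_run(query: str, seed: int) -> str:
    # Hypothetical: every fifth seed happens to produce a corrupted run.
    return "corrupted" if seed % 5 == 0 else "clean answer to " + query


print(majority_vote(toy_run, "q", range(10)))  # clean answer to q
for n in (1, 3, 11):
    print(n, p_majority_corrupted(0.1, n))
```

Under the independence assumption the chance that a majority of runs is corrupted shrinks quickly with the number of seeds, which is the sense in which noise gets washed out.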

Obviously this argument is frustratingly informal, and maybe I could become convinced that it can’t be strengthened, but I think I’d mostly be convinced by trying and failing, and it seems reasonably likely to me that we could succeed.

Paul seems to have another kind of argument for another kind of system in mind here, with a sketch of an argument at “I have a rough angle of attack in mind”. Obviously this isn’t an argument yet, but it seems worth looking into.

FWIW, Paul is thinking and writing about the kinds of problems you point out, e.g. in this post, this post, or this post (search “virus” on that page). Not sure if his thoughts are helpful to you.

If you’re planning to follow up this post, I’d be most interested in whether you think it’s not likely to be possible to design a process that we can be confident will avoid Som drift. I’d also be interested to know if there are other approaches to alignment that seem more promising to you.


by Wei Dai 816 days ago | link

>(e.g. almost no naturally occurring single pages of text are value-corrupting in an hour)

I don’t see what “naturally occurring” here could mean (even informally) that would both make this statement true and make it useful to try to design a system that could safely process “naturally occurring single pages of text”. And how would a system like this know whether a given input is “naturally occurring” and hence safe to process? Please explain?


by Daniel Dewey 816 days ago | link

“naturally occurring” means “could be inputs to this AI system from the rest of the world”; naturally occurring inputs don’t need to be recognized, they’re here as a base case for the induction. Does that make sense?

If there are other really powerful reasoners in the world, then they could produce value-corrupting single pages of text (and I would then worry about Soms becoming corrupted). If there aren’t, I’d guess that possible input single pages of text aren’t value-corrupting in an hour. (I would certainly want a much better answer than “I guess it’s fine” if we were really running something like this.)

To clarify my intent here, I wanted to show a possible structure of an argument that could make us confident that value drift wasn’t going to kill us. If you think it’s really unlikely that any argument of this inductive form could be run, I’d be interested in that (or if Paul or someone else thought I’m on the wrong track / making some kind of fundamental mistake.)


by Wei Dai 816 days ago | Patrick LaVictoire and Vladimir Nesov like this | link

Yes, that clarifies what you were intending to say. Paul typically assumes a need to compete with other AIs with comparable resources, so I wasn’t expecting “naturally occurring” to mean coming from an environment with no other powerful reasoners.

I think if we’re in an actual non-competitive scenario (and hence can tolerate large inefficiencies in our AI design), then some sort of argument like this can possibly be made to work, but it will probably be much trickier than you seem to suggest. Here are some problems I can see aside from the ones you’ve already acknowledged.

  1. We probably don’t need “powerful reasoners” to produce value-corrupting single pages of text, just a model of Som and relatively crude optimization techniques. In other words, for this system to be safe you probably need more of a tech lead/monopoly than you had in mind. Exactly how much is needed seems hard to know, so how do you achieve high confidence of safety?

  2. Presumably you’re building this system to be at least as capable as a human but at less risk of value drift. In order to do work comparable to a human contemplating the input query over a few days or years, the system needs some way of transmitting information between the short-lived Soms. Given that, what is preventing the overall system from undergoing value drift because, for example, one of the Soms has an (incorrect and value-corrupting) philosophical epiphany and transmits it to other Soms through this communication channel? Nothing in your argument seems to adequately deal with this. If this happens by pure chance then you can try to wash it out by averaging over different “noise seeds”, but what if Som has a tendency towards certain incorrect lines of argument, either naturally or because you placed him in a weird environment like this one?

  3. The output of this system isn’t “naturally occurring” so subsequent inputs to it won’t be either. If we’re to use this system a second time in a way that preserves your “non-corrupting invariant”, we have to either prevent all information flow from the first output to the second input, or have another argument for why this system along with whatever part of human civilization the information flows through preserves “non-corrupting” as a whole. Otherwise, someone could for example submit a bunch of queries to the system that would help them eventually craft a value-corrupting single page of text, and disguise these queries as innocent queries.


by Daniel Dewey 816 days ago | link

These objections are all reasonable, and 3 is especially interesting to me – it seems like the biggest objection to the structure of the argument I gave. Thanks.

I’m afraid that the point I was trying to make didn’t come across, or that I’m not understanding how your response bears on it. Basically, I thought the post was prematurely assuming that schemes like Paul’s are not amenable to any kind of argument for confidence, and we will only ever be able to say “well, I ran out of ideas for how to break it”, so I wanted to sketch an argument structure to explain why I thought we might be able to make positive arguments for safety.

Do you think it’s unlikely that we’ll be able to make positive arguments for the safety of schemes like Paul’s? If so, I’d be really interested in why – apologies if you’ve already tried to explain this and I just haven’t figured that out.


by Wei Dai 815 days ago | link

>I’m afraid that the point I was trying to make didn’t come across, or that I’m not understanding how your response bears on it.

I guess my point was that any argument for confidence will likely be subject to the kinds of problems I listed, and I don’t see a realistic plan on Paul’s (or anyone else’s) part to deal with them.

>Do you think it’s unlikely that we’ll be able to make positive arguments for the safety of schemes like Paul’s?

It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level. Without that, I don’t see how we can reason about a system that can encounter philosophical arguments, and make strong conclusions about whether it’s able to process them correctly. This seems intuitively obvious to me, but I don’t totally rule out that there is some sort of counterintuitive approach that could somehow work out.


by Daniel Dewey 815 days ago | link

Ah, gotcha. I’ll think about those points – I don’t have a good response. (Actually adding “think about”+(link to this discussion) to my todo list.)

>It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level.

Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?


by Wei Dai 814 days ago | Daniel Dewey likes this | link

>Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?

Interesting question. I guess it depends on the form that the metaphilosophical knowledge arrives in, but it’s currently hard to see what that could be. I can only think of two possibilities, but neither seems highly plausible.

  1. It comes as a set of instructions that humans (or emulations/models of humans) can use to safely and correctly process philosophical arguments, along with justifications for why those instructions are safe/correct. Kind of like a detailed design for meta-execution along with theory/evidence for why it works. But natural language is fuzzy and imprecise, humans are full of unknown security holes, so it’s hard to see how such instructions could possibly make us safe/correct, or what kind of information could possibly make us confident of that.

  2. It comes as a set of algorithms for reasoning about philosophical problems in a formal language, along with instructions/algorithms for how to translate natural language philosophical problems/arguments into this formal language, and justifications for why these are all safe/correct. But this kind of result seems very far from any of our current knowledge bases, nor does it seem compatible with any of the current trends in AI design (including things like deep learning, decision theory based ideas and Paul’s kinds of designs).

So I’m not very optimistic that a metaphilosophical approach will succeed either. If it ultimately does, it seems like maybe there will have to be some future insights whose form I can’t foresee. (Edit: Either that or a lot of time free from arms-race pressure to develop the necessary knowledge base and compatible AI design for 2.)


by Jacob Kopczynski 815 days ago | link

Your point 2 is an excellent summary of my reasons for being skeptical of relying on human reasoning. (I also expect more outlandish means of transmissible value-corruption would show up, on the principles that edge cases are hard to predict and that we don’t really understand our minds.)


I also commented there last week and am awaiting moderation. Maybe we should post our replies here soon?


FWIW, this also reminded me of some discussion in Paul’s post on capability amplification, where Paul asks whether we can even define good behavior in some parts of capability-space, e.g.:

>The next step would be to ask: can we sensibly define “good behavior” for policies in the inaccessible part H? I suspect this will help focus our attention on the most philosophically fraught aspects of value alignment.

I’m not sure if that’s relevant to your point, but it seemed like you might be interested.


I’m not sure you’ve gotten ALBA quite right here, and I think that causes a problem for your objection. Relevant writeups: most recent and original ALBA.

As I understand it, ALBA proposes the following process:

  1. H trains A to choose actions that would get the best immediate feedback from H. A is benign (assuming that H could give not-catastrophic immediate feedback for all actions and that the learning process is robust). H defines the feedback, and so A doesn’t make decisions that are more effective at anything than H is; A is just faster.
  2. A (and possibly H) is (are) used to define a slow process A+ that makes “better” decisions than A or H would. (Better is in quotes because we don’t have a definition of better; the best anyone knows how to do right now is look at the amplification process and say “yep, that should do better.”) Maybe H uses A as an assistant, maybe a copy of A breaks down a decision problem into parts and hands them off to other copies of A, maybe A makes decisions that guide a much larger cognitive process.
  3. The whole loop starts over with A+ used as H.

The claim is that step 2 produces a system that is able to give “better” feedback than the human could – feedback that considers more consequences more accurately in more complex decision situations, that has spent more effort introspecting, etc. This should make it able to handle circumstances further and further outside human-ordinary, eventually scaling up to extraordinary circumstances. So, while you say that the best case to hope for is \(r_i\rightarrow r\), it seems like ALBA is claiming to do more.

A second objection is that while you call each \(r_i\) a “reward function”, each system is only trained to take actions that maximize the very next reward they get (not sum of future rewards). This means that each system is only effective at anything insofar as the feedback function it’s maximizing at each step considers the long-term consequences of each action. So, if \(r_i\rightarrow r\), we don’t have reason to think that the system will be competent at anything outside of the “normal circumstances + a few exceptions” you describe – all of its planning power comes from \(r_i\), so we should expect it to be basically incompetent where \(r_i\) is incompetent.
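A small illustration of the second objection, as a sketch only. The three-state environment, its action names, and its reward numbers are all invented for this example. It shows that an agent which maximizes only the very next feedback acts sensibly over the long run only when the feedback function itself accounts for long-term consequences.

```python
# Toy deterministic environment: state -> action -> (immediate_reward, next_state).
TRANSITIONS = {
    "start": {"grab": (1.0, "dead_end"), "invest": (0.0, "payoff")},
    "dead_end": {"wait": (0.0, "end")},
    "payoff": {"collect": (5.0, "end")},
}


def myopic_action(state, feedback):
    """Pick the action with the best immediate feedback (no sum over the future)."""
    return max(TRANSITIONS[state], key=lambda action: feedback(state, action))


def immediate_reward(state, action):
    """Feedback that ignores everything after the current step."""
    return TRANSITIONS[state][action][0]


def best_return(state):
    """Best achievable sum of rewards from a state onward."""
    if state not in TRANSITIONS:
        return 0.0
    return max(r + best_return(nxt) for r, nxt in TRANSITIONS[state].values())


def farsighted_feedback(state, action):
    """Feedback from an overseer who has already thought through the consequences."""
    reward, next_state = TRANSITIONS[state][action]
    return reward + best_return(next_state)


print(myopic_action("start", immediate_reward))     # grab
print(myopic_action("start", farsighted_feedback))  # invest
```

The agent’s selection rule is identical in both calls; all of the planning lives in the feedback function, which matches the claim that the system should be expected to be incompetent wherever \(r_i\) is incompetent.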


by Stuart Armstrong 827 days ago | link

>that is able to give “better” feedback than the human could – feedback that considers more consequences more accurately in more complex decision situations, that has spent more effort introspecting

This is roughly how I would run ALBA in practice, and why I said it was better in practice than in theory. I’d work with the considerations I mentioned in this post and try to formalise how to extend utilities/rewards to new settings.


by Daniel Dewey 826 days ago | link

If I read Paul’s post correctly, ALBA is supposed to do this in theory – I don’t understand the theory/practice distinction you’re making.


by Stuart Armstrong 824 days ago | link

I disagree. I’m arguing that the concept of “aligned at a certain capacity” makes little sense, and this is key to ALBA in theory.


Discussed briefly in Concrete Problems, FYI:


by Daniel Dewey 895 days ago | link | parent | on: Learning Impact in RL

This is a neat idea! I’d be interested to hear why you don’t think it’s satisfying from a safety point of view, if you have thoughts on that.


by Owen Cotton-Barratt 895 days ago | Jessica Taylor and Patrick LaVictoire like this | link

Seems to me like there are a bunch of challenges. For example you need extra structure on your space to add things or tell what’s small; and you really want to keep track of long-term impact not just at the next time-step. Particularly the long-term one seems thorny (for low-impact in general, not just for this).

Nevertheless, I think this idea looks promising enough to explore further; I would also like to hear David’s reasons.


by David Krueger 891 days ago | Daniel Dewey likes this | link

It was mostly a gut feeling when I posted, but let me try to articulate a few:

  1. It relies on having a good representation. Small problems with the representation might make it unworkable. Learning a good enough representation and verifying that you’ve done so doesn’t seem very feasible. Impact may be missed if the representation doesn’t properly capture unobserved things and long-term dependencies. Things like the creation of sub-agents seem likely to crop up in subtle, hard-to-learn ways.

  2. I haven’t looked into it, but ATM I have no theory about when this scheme could be expected to recover the “correct” model (I don’t even know how that would be defined… I’m trying to “learn” my way around the problem :P)

To put #1 another way, I’m not sure that I’ve gained anything compared with proposals to penalize impact in the input space, or some learned representation space (with the learning not directed towards discovering impact).

On the other hand, I was inspired to consider this idea when thinking about Yoshua’s proposal about causal disentangling mentioned at the end of his Asilomar talk. This (and maybe some other similar work, e.g. on empowerment) seems to provide a way to direct an agent’s learning towards maximizing its influence, which might help… although having an agent learn based on maximizing its influence seems like a bad idea… but I guess you might be able to then add a conflicting objective (like a regularizer) to actually limit the impact…

So then you’d end up with some sort of adversarial-ish set-up, where the agent is trying to both (1) maximize potential impact (i.e. by understanding its ability to influence the world) and (2) minimize actual impact (i.e. by refraining from taking actions which turn out (eventually) to have a large impact).
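One very rough way to picture the two objectives, assuming the agent already has a learned `impact_model` that scores each action. That is a large assumption, and how to learn such a model is exactly the open problem. The action names, rewards, and the linear penalty are hypothetical choices made for illustration, not part of the proposal.

```python
from typing import Callable, Sequence


def potential_impact(actions: Sequence[str], impact_model: Callable[[str], float]) -> float:
    """Objective 1: how much the agent understands it could influence the world."""
    return max(impact_model(action) for action in actions)


def choose_action(
    actions: Sequence[str],
    task_reward: Callable[[str], float],
    impact_model: Callable[[str], float],
    penalty: float,
) -> str:
    """Objective 2: act on task reward minus a regularizer on predicted impact."""
    return max(actions, key=lambda a: task_reward(a) - penalty * impact_model(a))


# Toy numbers (hypothetical).
ACTIONS = ["noop", "tidy", "bulldoze"]
REWARD = {"noop": 0.0, "tidy": 1.0, "bulldoze": 3.0}
IMPACT = {"noop": 0.0, "tidy": 0.2, "bulldoze": 10.0}

print(potential_impact(ACTIONS, IMPACT.get))                    # 10.0
print(choose_action(ACTIONS, REWARD.get, IMPACT.get, penalty=0.0))  # bulldoze
print(choose_action(ACTIONS, REWARD.get, IMPACT.get, penalty=1.0))  # tidy
```

The same impact model serves both roles: it tells the agent what it could do, and the penalty term keeps it from doing it.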

Having just finished typing this, I feel more optimistic about this last proposal than the original idea :D We want an agent to learn about how to maximize its impact in order to avoid doing so.

(How) can an agent confidently predict its potential impact without trying potentially impactful actions?
I think it certainly can, because humans can. We use a powerful predictive model of the world to do this. … and that’s all I have to say ATM


by David Krueger 891 days ago | link

Yes, as Owen points out, there are general problems with reduced impact that apply to this idea, i.e. measuring long-term impacts.


Thanks for writing this, Jessica – I expect to find it helpful when I read it more carefully!


Thanks Jessica. This was helpful, and I think I see more what the problem is.

Re point 1: I see what you mean. The intuition behind my post is that it seems like it should be possible to make a bounded system that can eventually come to hold any computable hypothesis given enough evidence, including a hypothesis that contains a model of itself of arbitrary precision (which is different from Solomonoff, which can clearly never think about systems like itself). It’s clearly not possible for the system to hold and update infinitely many hypotheses the way Solomonoff does, and a system would need some kind of logical uncertainty or other magic to evaluate complex or self-referential hypotheses, but it seems like these hypotheses should be “in its class”. Does this make sense, or do you think there is a mistake there?

Re point 2: I’m not confident that’s an accurate summary; I’m precisely proposing that the agent learn a model of the world containing a model of the agent (approximate or precise). I agree that evaluating this kind of model will require logical uncertainty or similar magic, since it will be expensive and possibly self-referential.

Re point 3: I see what you mean, though for self-modeling the agent being predicted should only be as smart as the agent doing the prediction. It seems like approximation and logical uncertainty are the main ingredients needed here. Are there particular parts of the unbounded problem that are not solved by reflective oracles?


by Jessica Taylor 1232 days ago | Patrick LaVictoire likes this | link

Re point 1: Suppose the agent considers all hypotheses of length up to \(l\) bits that run in up to \(t\) time. Then the agent takes \(2^l t\) time to run. For an individual hypothesis to reason about the agent, it must use \(t\) computation time to reason about a computation of size \(2^l t\). A theoretical understanding of how this works would solve a large part of the logical uncertainty / naturalized induction / Vingean reflection problem.

Maybe it’s possible for this to work without having a theoretical understanding of why it works, but the theoretical understanding is useful too (it seems like you agree with this). I think there are some indications that naive solutions won’t automatically work; see e.g. this post.

Re point 2: It seems like this is learning a model from the state and action to state, and a model from state to state that ignores the agent. But it isn’t learning a model that e.g. reasons about the agent’s source code to predict the next state. An integrated model should be able to do reasoning like this.

Re point 3: I think you still have a Vingean reflection problem if a hypothesis that runs in \(t\) time predicts a computation of size \(2^l t\). Reflective Solomonoff induction solves a problem with an unrealistic computation model, and doesn’t translate to a solution with a finite (but large) amount of computing resources. The main part not solved is the general issue of predicting aspects of large computations using a small amount of computing power.
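The counting argument in points 1 and 3 can be made concrete with a few lines of arithmetic. This sketch takes the comment’s own approximation of roughly \(2^l\) hypotheses, each run for \(t\) steps; the function names and the sample values of `l` and `t` are illustrative choices.

```python
def agent_cost(l: int, t: int) -> int:
    """Steps needed to run about 2**l hypotheses for t steps each."""
    return 2**l * t


def self_modeling_gap(l: int, t: int) -> int:
    """How many times larger the agent's computation is than one hypothesis's budget."""
    return agent_cost(l, t) // t


l, t = 20, 1000
print(agent_cost(l, t))         # 1048576000
print(self_modeling_gap(l, t))  # 1048576
```

Even at a modest 20 bits, a hypothesis that wants to reason about the whole agent has to predict a computation about a million times larger than anything it can run directly, and the gap doubles with every extra bit of hypothesis length.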


by Daniel Dewey 1231 days ago | link

Thanks. I agree that these are problems. It seems to me that the root of these problems is logical uncertainty / Vingean reflection (which seem like two sides of the same coin); I find myself less confused when I think about self-modeling as being basically an application of “figuring out how to think about big / self-like hypotheses”. Is that how you think of it, or are there aspects of the problem that you think are missed by this framing?


by Jessica Taylor 1231 days ago | Daniel Dewey likes this | link

Yes, this is also how I think about it. I don’t know anything specific that doesn’t fit into this framing.


“Additionally, the fact that the predictor uses consequentialist reasoning indicates that you probably need to understand consequentialist reasoning to build the predictor in the first place.”

I’ve had this conversation with Nate before, and I don’t understand why I should think it’s true. Presumably we think we will eventually be able to make predictors that predict a wide variety of systems without us understanding every interesting subset ahead of time, right? Why are consequentialists different?


by Paul Christiano 1233 days ago | Daniel Dewey and Tsvi Benson-Tilsen like this | link

Here is my understanding of the argument:

(Warning, very long post, partly thinking out loud. But I endorse the summary. I would be most interested in Eliezer’s response.)

  • Something vaguely “consequentialist” is an important part of how humans reason about hard cognitive problems of all kinds (e.g. we must decide what cognitive strategy to use, what to focus our attention on, what topics to explore and which to ignore).
  • It’s not clear what prediction problems require this kind of consequentialism and what kinds of prediction problems can be solved directly by a brute force search for predictors. (I think Ilya has suggested that the cutoff is something like “anything a human can do in 100ms, you can train directly.”)
  • However, the behavior of an intelligent agent is in some sense a “universal” example of a hard-to-predict-without-consequentialism phenomenon.
  • If someone claims to have a solution that “just” requires a predictor, then they haven’t necessarily reduced the complexity of the problem, given that a good predictor depends on something consequentialist. If the predictor only needs to apply in some domain, then maybe the domain is easy and you can attack it more directly. But if that domain includes predicting intelligent agents, then it’s obviously not easy.
  • Actually building an agent that solves these hard prediction problems will probably require building some kind of consequentialism. So it offers just as much opportunity to kill yourself as the general AI problem.
  • And if you don’t explicitly build in consequentialism, then you’ve just made the situation even worse. There is still probably consequentialism somewhere inside your model, you just don’t even understand how it works because it was produced by a brute force search.

I think that this argument is mostly right. I also think that many thoughtful ML researchers would agree with the substantive claims, though they might disagree about language. We aren’t going to be able to directly train a simple model to solve all of the cognitive problems a human can solve, but there is a real hope that we could train a simple model to control computational machinery in a way that solves hard cognitive problems. And those policies will be “consequentialist” in the sense that their behavior is optimized to achieve a desired consequence. (An NTM is a simple mostly theoretical example of this; there are more practical instantiations as well, and moreover I think it is clear that you can’t actually use full differentiability forever and at least some of the system is going to have to be trained by RL.)

I get off the boat once we start drawing inferences about what AI control research should look like—at this point I think Eliezer’s argument becomes quite weak.

If Eliezer or Nate were to lay out a precise argument I think it would be easy to find the precise point where I object. Unfortunately no one is really in the position to be making precise arguments, so everything is going to be a bit blurrier. But here are some of the observations that seem to lead me to a very different conclusion:


Many decision theories, priors, etc. are reflectively consistent. Eliezer imagines an agent which uses the “right” settings in the long run because it started out with the right settings (or some as-yet-unknown framework for handling its uncertainty) and so stuck with them. I imagine an agent which uses the “right” settings in the long run because it defers to humans, and which may in the short term use incorrect decision theory/priors/etc. This is a central advantage of the act-based approach, and in my view no one has really offered a strong response.

The most natural response would be that using a wrong decision theory/prior/etc. in the short term would lead to bad outcomes, even if one had appropriate deference to humans. The strongest version of this argument goes something like “we’ve encountered some surprises in the past, like the simulation argument, blackmail by future superintelligences, some weird stuff with aliens etc., Pascal’s mugging, and it’s hard to know that we won’t encounter more surprises unless we figure out many of these philosophical issues.”

I think this argument has some merit, but these issues seem to be completely orthogonal to the development of AI (humans might mess these things up about as well as an act-based AI), and so they should be evaluated separately. I think they look a lot less urgent than AI control—I think the only way you end up with MIRI’s level of interest is if you see our decisions about AI as involving a long-term commitment.


I think that Eliezer at least does not yet understand, or has not yet thought deeply about, the situation where we use RL to train agents how to think. He repeatedly makes remarks about how AI control research targeted at deep learning will not generalize to extremely powerful AI systems, while consistently avoiding engagement with the most plausible scenario where deep learning is a central ingredient of powerful AI.


There appears to be a serious methodological disagreement about how AI control research should work.

For existing RL systems, the alignment problem is open. Without solving this problem, it is hard to see how we could build an aligned system which used existing techniques in any substantive way.

Future AI systems may involve new AI techniques that present new difficulties.

I think that we should first resolve, or try our best to resolve, the difficulties posed by existing techniques—whether or not we believe that new techniques will emerge. Once we resolve that problem, we can think about how new techniques will complicate the alignment problem, and try to produce new solutions that will scale to accommodate a wider range of future developments.

Part of my view is that it is much easier to work on problems for which we have a concrete model. Another part is that our work on the alignment problem matters radically more if AI is developed soon. There are a bunch of other issues at play; I discuss a subset here.

I think that Eliezer’s view is something like: we know that future techniques will introduce some qualitatively new difficulties, and those are most likely to be the real big ones. If we understand how to handle those difficulties, then we will be in a radically better position with respect to value alignment. And if we don’t, we are screwed. So we should focus on those difficulties.

Eliezer also believes that the alignment problem is most likely to be very difficult or impossible for systems of the kind that we currently build, such that some new AI techniques are necessary before anyone can build an aligned AI, and such that it is particularly futile to try to solve the alignment problem for existing techniques.


by Daniel Dewey 1232 days ago | link

Thanks, Paul – I missed this response earlier, and I think you’ve pointed out some of the major disagreements here.

I agree that there’s something somewhat consequentialist going on during all kinds of complex computation. I’m skeptical that we need better decision theory to do this reliably – are there reasons or intuition-pumps you know of that have a bearing on this?


by Paul Christiano 1232 days ago | Daniel Dewey likes this | link

I mentioned two (which I don’t find persuasive):

  1. Different decision theories / priors / etc. are reflectively consistent, so you may want to make sure to choose the right ones the first time. (I think that the act-based approach basically avoids this.)
  2. We have encountered some surprising possible failure modes, like blackmail by distant superintelligences, and might be concerned that we will run into new surprises if we don’t understand consequentialism well.

I guess there is one more:

  3. If we want to understand what our agents are doing, we need to have a pretty good understanding of how effective decision-making ought to work. Otherwise algorithms whose consequentialism we understand will tend to be beaten out by algorithms whose consequentialism we don’t understand. This may make alignment way harder.


by Jessica Taylor 1235 days ago | link

Here’s the argument as I understand it (paraphrasing Nate). If we have a system predict a human making plans, then we need some story for why it can do this effectively. One story is that, like Solomonoff induction, it’s learning a physical model of the human and simulating the human this way. However, in practice, this is unlikely to be the reason an actual prediction engine predicts humans well (it’s too computationally difficult).

So we need some other argument for why the predictor might work. Here’s one argument: perhaps it’s looking at a human making plans, figuring out what humans are planning towards, and using its own planning capabilities (towards the same goal) to predict what plans the human will make. But it seems like, to be confident that this will work, you need to have some understanding of how the predictor’s planning capabilities work. In particular, humans trying to study correct planning run into some theoretical problems (including e.g. decision theory and logical uncertainty), and it seems like a system would need to answer some of these same questions in order to predict humans well.

I’m not sure what to think of this argument. Paul’s current proposal contains reinforcement learning agents who plan towards an objective defined by a more powerful agent, so it is leaning on the reinforcement learner’s ability to plan towards desirable goals. Rather than understand how the reinforcement learner works internally, Paul proposes giving the reinforcement learner a good enough objective (defined by the powerful agent) such that optimizing this objective is equivalent to optimizing what the humans want. This raises some problems, so probably some additional ingredient is necessary. I suspect I’ll have better opinions on this after thinking about the informed oversight problem some more.

I also asked Nate about the analogy between computer vision and learning to predict a human making plans. It seems like computer vision is an easier problem for a few reasons: it doesn’t require serial thought (so it can be done by e.g. a neural network with a fixed number of layers), humans solve the problem using something similar to neural networks anyway, and planning towards the wrong goal is much more dangerous than recognizing objects incorrectly.


by Daniel Dewey 1235 days ago | link

Thanks, Jessica. This argument still doesn’t seem right to me – let me try to explain why.

It seems to me like something more tractable than Solomonoff induction, like an approximate cognitive-level model of a human or the other kinds of models that are being produced now (or will be produced in the future) in machine learning (neural nets, NTMs, etc.), could be used to approximately predict the actions of humans making plans. This is how I expect most kinds of modeling and inference to work, about humans and about other systems of interest in the world, and it seems like most of my behaviors are approximately predictable using a model of me that falls far short of modeling my full brain. This makes me think that an AI won’t need to have hand-made planning faculties to learn to predict planners (human or otherwise), any more than it’ll need weather faculties to predict weather or physics faculties to predict physical systems. Does that make sense?

(I think the analogy to computer vision points toward the learnability of planning; humans use neural nets to plan, after all!)


by Jessica Taylor 1234 days ago | Patrick LaVictoire likes this | link

It seems like an important part of how humans make plans is that we use some computations to decide what other computations are worth performing. Roughly, we use shallow pattern recognition on a question to determine what strategy to use to think further thoughts, and after thinking those thoughts use shallow pattern recognition to figure out what thought to have after that, eventually leading to answering the question. (I expect the brain’s actual algorithm to be much more complicated than this simple model, but to share some aspects of it).

A system predicting what a human would do would presumably also have to figure out which further thoughts are worth thinking, upon being asked to predict how a human answers a question. For example, if I’m answering a complex math question that I have to break into parts to solve it, then for the system to predict my (presumably correct) answer, it might also break the problem into pieces and solve each piece. If it’s bad at determining which thoughts are worth thinking to predict the human’s answer (e.g. it chooses to break the problem into unhelpful pieces), then it will think thoughts that are not very useful for predicting the answer, so it will not be very effective without a huge amount of hardware. I think this is clear when the human is thinking for a long time (e.g. 2 weeks) and less clear for much shorter time periods (e.g. 1 minute, which you might be able to do with shallow pattern recognition in some cases?).

At the point where the system is able to figure out what thoughts to think in order to predict the human well, its planning to determine which thoughts to think looks at least as competent as a human’s planning to answer the question, without necessarily using similar intermediate steps in the plan.

It seems like ordinary neural nets can’t decide what to think about (they can only recognize shallow patterns), and perhaps NTMs can. But if an NTM could predict how I answer some questions well (because it’s able to plan out what thoughts to think), I would be scared to ask it to predict my answer to future questions. It seems to be a competent planner, and not one that internally looks like my own thinking or anything I could easily understand. I see the internal approval-direction approach as trying to make systems whose internal planning looks more like planning understood by humans (by supervising the intermediate steps of planning); without internal supervision, we would be running a system capable of making complex plans in a way humans do not understand, which seems dangerous. As an example of a potential problem (not necessarily the most likely one), perhaps the system is very good at planning towards objectives but mediocre at figuring out what objective humans are planning towards, so it predicts plans well during training but occasionally outputs plans optimized for the wrong objective during testing.

It seems likely that very competent physics or weather predictions would also require at least some primitive form of planning what thoughts to think (e.g. maybe the system decides to actually simulate the clouds in an important region). But it seems like you can get a decent performance on these tasks with only primitive planning, whereas I don’t expect this for e.g. predicting a human doing novel research over >1-hour timescales.

Did this help explain the argument better? (I still haven’t thought about this argument enough to be that confident that it goes through, but I don’t see any obvious problems at the moment).


by Daniel Dewey 1234 days ago | link

I agree with paragraphs 1, 2, and 3. To recap, the question we’re discussing is “do you need to understand consequentialist reasoning to build a predictor that can predict consequentialist reasoners?”

A couple of notes on paragraph 4:

  • I’m not claiming that neural nets or NTMs are sufficient, just that they represent the kind of thing I expect to increasingly succeed at modeling human decisions (and many other things of interest): model classes that are efficiently learnable, and that don’t include built-in planning faculties.

  • You are bringing up understandability of an NTM-based human-decision-predictor. I think that’s a fine thing to talk about, but it’s different from the question we were talking about.

  • You’re also bringing up the danger of consequentialist hypotheses hijacking the overall system. This is fine to talk about as well, but it is also different from the question we were talking about.

In paragraph 5, you seem to be proposing that to make any competent predictor, we’ll need to understand planning. This is a broader assertion, and the argument in favor of it is different from the original argument (“predicting planners requires planning faculties so that you can emulate the planner” vs “predicting anything requires some amount of prioritization and decision-making”). In these cases, I’m more skeptical that a deep theoretical understanding of decision-making is important, but I’m open to talking about it – it just seems different from the original question.

Overall, I feel like this response is out-of-scope for the current question – does that make sense, or do I seem off-base?


by Jessica Taylor 1234 days ago | link

Regarding paragraph 4:

I see more now what you’re saying about NTMs. In some sense NTMs don’t have “built-in” planning capabilities; to the extent that they plan well, it’s because they learned that transition functions which make plans work better for predicting some things. I think it’s likely that you can get planning capabilities in this manner, without actually understanding how the planning works internally. So it seems like there isn’t actually disagreement on this point (sorry for misinterpreting the question). The more controversial point is that you need to understand planning to train safe predictors of humans making plans.

I don’t think I was bringing up consequentialist hypotheses hijacking the system in this paragraph. I was noting the danger of having a system (which is in some sense just trying to predict humans well) output a plan it thinks a human would produce after thinking a very long time, given that it is good at predicting plans toward an objective but bad at predicting the humans’ objective.

Regarding paragraph 5: I was trying to say that you probably only need primitive planning abilities for a lot of prediction tasks, in some cases ones we already understand today. For example, you might use a deep neural net for deciding which weather simulations are worth running, and reinforce the deep neural net on the extent to which running the weather simulation changed the system’s accuracy. This is probably sufficient for a lot of applications.
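As a rough illustration of the "reinforce on how much the simulation improved accuracy" idea, here is a minimal sketch. It swaps the deep neural net for a much simpler epsilon-greedy bandit, and the simulation names, gain values, and noise level are all made up for the example; nothing here comes from a real weather system.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical average accuracy gain from running each candidate simulation.
TRUE_GAIN = {"coastal_clouds": 0.30, "inland_wind": 0.05, "upper_atmosphere": 0.10}

def run_simulation(name: str) -> float:
    """Return a noisy measurement of how much this simulation improved accuracy."""
    return TRUE_GAIN[name] + random.gauss(0.0, 0.02)

estimates = {name: 0.0 for name in TRUE_GAIN}  # running mean reward per simulation
counts = {name: 0 for name in TRUE_GAIN}       # how often each one was run
EPSILON = 0.1                                  # exploration rate (assumed)

for _ in range(500):
    if random.random() < EPSILON:
        choice = random.choice(list(TRUE_GAIN))      # explore
    else:
        choice = max(estimates, key=estimates.get)   # exploit current best guess
    reward = run_simulation(choice)
    counts[choice] += 1
    # Incremental update of the running mean reward for the chosen simulation.
    estimates[choice] += (reward - estimates[choice]) / counts[choice]

best = max(estimates, key=estimates.get)
print(best, counts)
```

The selector here only learns which simulation tends to pay off; it has no model of why, which is roughly the sense of "primitive planning" meant above.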


by Daniel Dewey 1233 days ago | link

Thanks Jessica – sorry I misunderstood about hijacking. A couple of questions:

  • Is there a difference between “safe” and “accurate” predictors? I’m now thinking that you’re worried about NTMs basically making inaccurate predictions, and that accurate predictors of planning will require us to understand planning.

  • My feeling is that our current understanding of planning – if I run this computation, I will get the result, and if I run it again, I’ll get the same one – is sufficient for harder prediction tasks. Are there particular aspects of planning that we don’t yet understand well that you expect to be important for planning computation during prediction?


by Jessica Taylor 1233 days ago | Daniel Dewey likes this | link

EDIT: rewrote this comment

So, I wrote a comment, got confused, read over the conversation some more, and have a little bit of a model for thinking about this now. This is probably discontinuous with the previous stuff in the conversation (which I don’t have a steelman for anymore). Here is my current model:

Suppose we create 1000 NTM instances. We see which make good predictions of a human, throwing out the ones that make bad predictions (by asking an actual human). At the end we have some NTMs that make good predictions of a human making plans. Why would one of these NTMs be doing this?

  1. Perhaps it’s learning a detailed model of the human and simulating this. But, this seems like it requires a lot of computational resources.

  2. Perhaps it’s looking at the human’s behavior and then making plans optimized for similar goals to the human, using reasonable decision theory / logical uncertainty / etc.

  3. Perhaps it’s looking at the human’s behavior and then making plans optimized for similar goals, but it gets something subtly wrong (e.g. it uses the wrong decision theory).

  4. Perhaps it’s a UFAI that mostly acts like 2 (or something else that makes good predictions) because it doesn’t want to get thrown out for making bad predictions.

  5. Perhaps it’s something different.

If one of the NTM instances ends up using algorithm 2 and one uses algorithm 3, then the one using algorithm 3 gets thrown out because it makes bad predictions. If one of the NTM instances ends up using 2 and one ends up using 4, then 4 has to make the same predictions as 2 or it gets thrown out, so hijacking is less of a concern. So things seem fine if one of the NTMs ends up learning how to do algorithm 2.

I think the concern is that we don’t know how to write algorithm 2 (since we haven’t solved decision theory, logical uncertainty, etc). So we’re not sure if the system will actually have 2 in its hypothesis class. If none of the NTMs is 2, then maybe none is 4 either, so we just get a lot of versions of 3 that make different predictions due to using different incorrect decision theories etc. Eventually we throw out some versions of 3 that make bad predictions, until we’re left with just a few. These may also make bad predictions, but we don’t detect this, since we threw out all the other NTMs. Then we end up with something making plans using a bad decision theory.
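The two cases above can be made concrete with a toy sketch. Everything in it is hypothetical: the "human" is a trivial function, "algorithm 2" simply matches it, and each "algorithm 3" variant differs from the human on exactly one query. It is only meant to show why filtering works when 2 is in the pool and can silently fail when it is not.

```python
from typing import Callable, Dict, List

Predictor = Callable[[int], int]

def human(q: int) -> int:
    """Stand-in for the human's answer to query q (hypothetical)."""
    return q % 3

def algorithm_2(q: int) -> int:
    """Plans toward the human's goals correctly, so it always matches the human."""
    return human(q)

def make_algorithm_3(bad_query: int) -> Predictor:
    """Subtly wrong planner: matches the human except on one query."""
    def predict(q: int) -> int:
        return human(q) + 1 if q == bad_query else human(q)
    return predict

def filter_pool(pool: Dict[str, Predictor], queries: List[int]) -> Dict[str, Predictor]:
    """Keep only predictors that agree with the human on every checked query."""
    return {name: p for name, p in pool.items()
            if all(p(q) == human(q) for q in queries)}

checked = list(range(10))  # the queries we actually ask the human about

# Case A: algorithm 2 is in the pool, and the flawed variant gets thrown out.
pool_a = {"alg2": algorithm_2, "alg3_bad_at_4": make_algorithm_3(4)}
survivors_a = filter_pool(pool_a, checked)

# Case B: no algorithm 2, only variants of 3. One variant's flaw lies outside
# the checked queries, so it survives and its error goes undetected.
pool_b = {"alg3_bad_at_4": make_algorithm_3(4), "alg3_bad_at_50": make_algorithm_3(50)}
survivors_b = filter_pool(pool_b, checked)

print(sorted(survivors_a))
print(sorted(survivors_b))
print(survivors_b["alg3_bad_at_50"](50) == human(50))
```

In case B the surviving predictor looks perfect on everything that was checked, and there is no other predictor left whose disagreement would flag the query where it goes wrong.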

But now that I think about it more, this seems like a resolvable problem. If the NTMs get to look at the predictions of the other NTMs when making their own predictions (there’s probably a fixed-point way to do this), then maybe there’s one out there that copies one of the versions of 3 but makes adjustments for 3’s bad decision theory. Or maybe it just tags the other NTM’s prediction as potentially flawed in some other way. I don’t know if things like this work (they probably also face the informed oversight problem), but they seem worth investigating.

To think about this more, I want to think about extending the selective sampling / active learning framework some more and see if this fixes the problem. I’m not actually sure that decision theoretic etc. errors are the most important here; in general we want the predictor to have no detectable bias in its predictions (including bias from using a bad decision theory to predict the human).

Not sure how helpful this was. I don’t currently have a good steelman for the position of “we should study decision theory and logical uncertainty because these are necessary to make good human-predictors”, other than the generic reason that AI based on logical uncertainty/tiling is generally more principled than NTMs. I wouldn’t study the problem of creating good human-predictors by studying decision theory / logical uncertainty; I would do it by thinking about extensions to active learning and selective sampling.


by Paul Christiano 1233 days ago | link

(The discussion seems to apply without modification to any predictor.)

It seems like “gets the wrong decision theory” is a really mild failure mode. If you can’t cope with that, there is no way you are going to cope with actually malignant failures.

Maybe the designer wasn’t counting on dealing with malignant failures at all, and this is an extra reminder that there can be subtle errors that don’t manifest most of the time. But I don’t think it’s much of an argument for understanding philosophy in particular.


by Jessica Taylor 1233 days ago | link

I agree with this.


by Paul Christiano 1233 days ago | link

“If the NTMs get to look at the predictions of the other NTMs when making their own predictions (there’s probably a fixed-point way to do this), then maybe there’s one out there that copies one of the versions of 3 but makes adjustments for 3’s bad decision theory.”

Why not say “If X is a model using a bad decision theory, there is a closely related model X’ that uses a better decision theory and makes better predictions. So once we have some examples that distinguish the two cases, we will use X’ rather than X.”

Sometimes this kind of argument doesn’t work and you can get tighter guarantees by considering the space of modifications (by coincidence this exact situation arises here), but I don’t see why this case in particular would bring up that issue.


by Jessica Taylor 1233 days ago | link

Suppose there are N binary dimensions that predictors can vary on. Then we’d need \(2^N\) predictors to cover every possibility. On the other hand, we would only need to consider N possible modifications to a predictor. Of course, if the dimensions factor that nicely, then you can probably make enough assumptions about the hypothesis class that you can learn from the \(2^N\) experts efficiently.
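A small sketch of the counting argument, under the strong assumption that the dimensions factor nicely (here, that a predictor's error is just the number of dimensions on which it differs from some "right" setting). The value of N, the target configuration, and the helper names are all illustrative.

```python
from itertools import product
from typing import Tuple

Config = Tuple[int, ...]

N = 4
target: Config = (1, 0, 1, 1)  # hypothetical "right" setting of each dimension

# Option 1: one expert per full configuration, which needs 2^N experts.
experts = list(product([0, 1], repeat=N))

def flip(config: Config, i: int) -> Config:
    """Apply the i-th modification: toggle dimension i."""
    return config[:i] + (1 - config[i],) + config[i + 1:]

def errors(config: Config) -> int:
    """Assumed error measure: number of dimensions that differ from the target."""
    return sum(c != t for c, t in zip(config, target))

def correct_by_modifications(config: Config) -> Config:
    """Option 2: try each of the N modifications once, keeping those that help."""
    for i in range(N):
        candidate = flip(config, i)
        if errors(candidate) < errors(config):
            config = candidate
    return config

start: Config = (0, 0, 0, 0)
fixed = correct_by_modifications(start)
print(len(experts), N, fixed == target)
```

With N = 4 this is 16 experts versus 4 modifications. The single pass over modifications only works because the assumed error measure decomposes across dimensions; if the dimensions interacted, fixing one at a time would not be guaranteed to reach the target.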

Overall it seems nicer to have a guarantee of the form “if there is a predictable bias in the predictions, then the system will correct this bias” rather than “if there is a strictly better predictor than a bad predictor, then the system will listen to the good predictor”, since it allows capabilities to be distributed among predictors instead of needing to be concentrated in a single predictor. But maybe things work anyway for the reason you gave.


by Daniel Dewey 1232 days ago | link

Thanks Jessica, I think we’re on similar pages – I’m also interested in how to ensure that predictions of humans are accurate and non-adversarial, and I think there are probably a lot of interesting problems there.






