Intelligent Agent Foundations Forum
On motivations for MIRI's highly reliable agent design research
post by Jessica Taylor 58 days ago | Ryan Carey, Daniel Dewey, Nate Soares, Patrick LaVictoire, Paul Christiano, Tsvi Benson-Tilsen and Vladimir Nesov like this | 10 comments

(this post came out of a conversation between me and Owen Cotton-Barratt, plus a follow-up conversation with Nate)

I want to clarify my understanding of some of the motivations of MIRI’s highly reliable agent design (HRAD) research (e.g. logical uncertainty, decision theory, multi level models).

Top-level vs. subsystem reasoning

I’ll distinguish between an AI system’s top-level reasoning and subsystem reasoning. Top-level reasoning is the reasoning the system is doing in a way its designers understand (e.g. using well-understood algorithms); subsystem reasoning is reasoning produced by the top-level reasoning that its designers (by default) don’t understand at an algorithmic level.

Here are a few examples:


AlphaGo

Top-level reasoning: MCTS, self-play, gradient descent, …

Subsystem reasoning: whatever reasoning the policy network is doing, which might involve some sort of “prediction of consequences of moves”

Deep Q learning

Top-level reasoning: the Q-learning algorithm, gradient descent, random exploration, …

Subsystem reasoning: whatever reasoning the Q network is doing, which might involve some sort of “prediction of future score”
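To make the top-level/subsystem split concrete, here is a minimal tabular Q-learning sketch (the chain environment and all hyperparameters are my own toy choices, not from the post). The update rule inside the loop is the well-understood top-level algorithm; the learned value table, which a deep Q network replaces in practice, is where any opaque "prediction of future score" would live:

```python
import random

# Toy tabular Q-learning on a 5-state chain: the agent must walk right to a
# reward. The environment and constants are illustrative assumptions.
N_STATES = 5
ACTIONS = (-1, +1)            # step left or step right
alpha, gamma, eps = 0.5, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s2 = max(0, min(N_STATES - 1, s + a))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

random.seed(0)
for _ in range(500):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy exploration: part of the transparent top level.
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r = step(s, a)
        # The Q-learning update: the designer-understood top-level reasoning.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The learned greedy policy (always "move right" once training succeeds) is
# the stand-in for the subsystem: its content was produced, not designed.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

In deep Q learning the table is replaced by a network, so even this learned component can no longer be read off directly.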

Solomonoff induction

Top-level reasoning: selecting (Cartesian) hypotheses by seeing which make the best predictions

Subsystem reasoning: the reasoning of the consequentialist reasoners who come to dominate Solomonoff induction, who will use something like naturalized induction and updateless decision theory
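A toy, finite stand-in makes the structure visible (illustrative only: real Solomonoff induction mixes over all computable hypotheses and is uncomputable, and this tiny hypothesis class is my own invention). The top-level reasoning is just "discard hypotheses that contradict the data and weight the survivors by a simplicity prior of 2^-length"; the subsystem reasoning is whatever the surviving hypotheses compute internally:

```python
# Finite caricature of Solomonoff induction over binary sequences.
HYPOTHESES = [
    # (description length in bits, predictor: history -> next bit)
    (2, lambda h: 0),                  # "always 0"
    (2, lambda h: 1),                  # "always 1"
    (4, lambda h: len(h) % 2),         # "alternate, starting with 0"
    (4, lambda h: (len(h) + 1) % 2),   # "alternate, starting with 1"
    (8, lambda h: h[-1] if h else 0),  # "repeat the previous bit"
]

def predict(history):
    """Predict the next bit by a prior-weighted vote of consistent hypotheses."""
    weights = [
        2.0 ** -length
        if all(f(history[:i]) == bit for i, bit in enumerate(history))
        else 0.0
        for length, f in HYPOTHESES
    ]
    total = sum(weights)
    if total == 0:
        return 0  # no hypothesis fits; fall back to a default
    p_one = sum(w * f(history) for w, (_, f) in zip(weights, HYPOTHESES)) / total
    return 1 if p_one > 0.5 else 0

print(predict([1, 0, 1, 0]))  # the "alternate, starting with 1" hypothesis dominates
```

The worry in the post arises when the hypotheses are arbitrary programs rather than five hand-written lambdas: the top-level selection rule stays transparent, but the internals of the dominant hypotheses do not.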

Genetic selection

It is possible to imagine a system that learns to play video games by finding (encodings of) policies that get high scores on training games, and combining encodings of policies that do well to produce new policies.

Top-level reasoning: genetic selection

Subsystem reasoning: whatever reasoning the policies are doing (which is something like “predicting the consequences of different actions”)
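A minimal genetic-selection sketch (every detail here is my own toy choice: bit-string genomes, a trivial fitness function standing in for "score on training games"). The select/crossover/mutate loop is the fully transparent top level; the content of the evolved genomes is the part the designer never inspects:

```python
import random

random.seed(1)
GENOME_LEN, POP, GENS = 20, 30, 60

def fitness(genome):
    # Stand-in for "score achieved on training games": count of 1-bits.
    return sum(genome)

# Random initial population of encoded "policies".
pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP)]

for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]                    # keep the fitter half
    children = []
    while len(parents) + len(children) < POP:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, GENOME_LEN)   # one-point crossover
        child = a[:cut] + b[cut:]
        i = random.randrange(GENOME_LEN)        # one-bit mutation
        child[i] ^= 1
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
print(fitness(best))
```

With a real video-game benchmark, the genome would encode a policy whose internal "reasoning" the designer understands no better than the weights of a trained network.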

In both the Solomonoff induction case and the genetic selection case, if the algorithm is run with enough computation, the subsystem reasoning is likely to overwhelm the top-level reasoning (i.e. the system running Solomonoff induction or genetic selection will eventually come to be dominated by opaque consequentialist reasoners).

Good consequentialist reasoning

Humans are capable of good consequentialist reasoning (at least in comparison to current AI systems). Humans can:

  • make medium-term predictions about complex systems containing other humans
  • make plans that take months or years to execute
  • learn and optimize proxies for long-term success (e.g. learning skills, gaining money)
  • reason about how to write a computer program

and so on. Current AI systems are not capable of good consequentialist reasoning. Superintelligent AGI systems would be capable of good consequentialist reasoning (though superintelligent narrow AI systems might not in full generality).

The concern

Using these concepts, MIRI’s main concern motivating HRAD research can be stated as something like:

  1. The first AI systems capable of pivotal acts will use good consequentialist reasoning.

  2. The default AI development path will not produce good consequentialist reasoning at the top level.

  3. Therefore, on the default AI development path, the first AI systems capable of pivotal acts will have good consequentialist subsystem reasoning but not good consequentialist top-level reasoning.

  4. Consequentialist subsystem reasoning will likely come “packaged with a random goal” in some sense, and this goal will not be aligned with human interests.

  5. Therefore, the default AI development path will produce, as the first AI systems capable of pivotal acts, AI systems with goals not aligned with human interests, causing catastrophe.

Note that, even if the AI system is doing good consequentialist reasoning at the top level rather than in subsystems, this top-level reasoning must still be directed towards the correct objective for the system to be aligned. So HRAD research does not address the entire AI alignment problem.

Possible paths

Given this concern, a number of possible paths to aligned AI emerge:

Limited/tool AI

One might reject premise 1 and attempt to accomplish pivotal acts using AI systems that do not use good consequentialist reasoning. Roughly, the proposal is to have humans do the good consequentialist reasoning, and to use AI systems as tools.

The main concern with this proposal is that a system of humans and limited AIs might be much less effective (for a given level of computing resources) than an AI system capable of good consequentialist reasoning. In particular, (a) a limited AI might require a lot of human labor to do the good consequentialist reasoning, and (b) human consequentialist reasoning is likely to be less effective than superintelligent AI consequentialist reasoning.

The main hope, despite this concern, is that either “general consequentialist reasoning” is not particularly important for the kinds of tasks people will want to use AI systems for (including pivotal acts), or that some sort of global coordination will make the efficiency disadvantage less relevant.

Example research topics:

Hope that top-level reasoning stays dominant on the default AI development path

Currently, it seems like most AI systems’ consequentialist reasoning is explainable in terms of top-level algorithms. For example, AlphaGo’s performance is mostly explained by MCTS and the way it’s trained through self-play. The subsystem reasoning is subsumed by the top-level reasoning and does not overwhelm it.

One could hope that algorithms likely to be developed in the future by default (e.g. model-based reinforcement learning) continue to be powerful enough that the top-level consequentialist reasoning is more powerful than subsystem consequentialist reasoning.

The biggest indication that this might not happen by default is that we currently don’t have an in-principle theory for good reasoning (e.g. we’re currently confused about logical uncertainty and multi-level models), and it doesn’t look like these theories will be developed without a concerted effort. Usually, theory lags behind common practice.

Despite this, a possible reason for hope is that perhaps it’s possible for AI researchers to develop enough tacit understanding of these theories for practical purposes. Currently, algorithms such as MCTS implicitly handle some subproblem of “logical uncertainty” without a full formal theory, and this does not seem problematic yet. It’s conceivable that future algorithms will be similar to MCTS, implicitly handling larger parts of these theories in a way that is as well understood as MCTS, such that good consequentialist reasoning in subsystems does not overwhelm the top-level consequentialist reasoning.
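The "implicit handling" point can be illustrated with UCB1, the selection rule at the heart of standard MCTS (the three-option bandit below is my own toy example, not from the post). The true value of each option is a fixed, in-principle-computable quantity, so uncertainty about it is logical in flavor, yet the rule manages that uncertainty with sampling plus a shrinking confidence bonus and no formal theory:

```python
import math
import random

random.seed(0)
TRUE_VALUES = [0.2, 0.5, 0.8]   # hidden, deterministic values of three "moves"
N_ARMS = len(TRUE_VALUES)

counts = [0] * N_ARMS
sums = [0.0] * N_ARMS
for t in range(1, 2001):
    if 0 in counts:
        arm = counts.index(0)   # try every option once first
    else:
        # UCB1: empirical mean plus an exploration bonus that shrinks
        # as an option is visited more often.
        arm = max(range(N_ARMS), key=lambda a:
                  sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]))
    # Noisy "rollout" estimate of the option's deterministic value.
    reward = 1.0 if random.random() < TRUE_VALUES[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward

best = max(range(N_ARMS), key=lambda a: sums[a] / counts[a])
print(best, counts)
```

The sampling concentrates on the best option without the algorithm ever representing "uncertainty about a deterministic computation" explicitly, which is the sense in which the theory is handled tacitly rather than formally.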

“MIRI” has a strong intuition that this won’t be the case, and personally I’m somewhat confused about the details; see Nate’s comments below for details.

On this path, the most important research topics are those that relate to directing top-level consequentialist reasoning (implemented using algorithms on the default AI development path) towards useful objectives. (Note that these research problems are also important on other paths; goals have to be specified at some point in all cases).

Example research topics:

  • Avoiding negative side effects
  • Avoiding reward hacking
  • Robustness to distribution shift
  • Safe exploration
  • Inverse reinforcement learning
  • Neural network transparency

(research topics like these are discussed in Concrete Problems in AI Safety and Alignment for Advanced Machine Learning Systems)

Figure out some core of good consequentialist reasoning and ensure that AI is developed through this paradigm

This is the main purpose of MIRI’s research in HRAD. The main hope is that there is some simple core of good reasoning that can be discovered through theoretical research.

On this pathway, it isn’t currently cleanly argued that the right way to research good consequentialist reasoning is to study the particular MIRI research topics such as decision theory. One could imagine other approaches to studying good consequentialist reasoning (e.g. thinking about how to train model-based reinforcement learners). I think the focus on problems like decision theory is mostly based on intuitions that are (currently) hard to explicitly argue for.

Example research topics:

  • Logical uncertainty
  • Decision theory
  • Multi level models
  • Vingean reflection

(see the agent foundations technical agenda paper for details)

Figure out how to align a “messy” AI whose good consequentialist reasoning is in a subsystem

This is the main part of Paul Christiano’s research program. Disagreements about the viability of this approach are quite technical; I have previously written about some aspects of this disagreement here.

Example research topics:

Interaction with task AGI

Given this concern, it isn’t immediately clear how task AGI fits into the picture. I think the main motivation for task AGI is that it alleviates some aspects of this concern but not others; ideally it requires knowing fewer aspects of good consequentialist reasoning (e.g. perhaps some decision-theoretic problems can be dodged), and has subsystems “small” enough that they will not develop good consequentialist reasoning independently.


I hope I have clarified what the main argument motivating HRAD research is, and what positions are possible to take on this argument. There seem to be significant opportunities for further clarification of arguments and disagreements, especially the MIRI intuition that there is a small core of good consequentialist reasoning that is important for AI capabilities and that can be discovered through theoretical research.

by Nate Soares 58 days ago | Ryan Carey, Jessica Taylor and Patrick LaVictoire like this | link

As I noted when we chatted about this in person, my intuition is less “there is some small core of good consequentialist reasoning (it has “low Kolmogorov complexity” in some sense), and this small core will be quite important for AI capabilities” and more “good consequentialist reasoning is low-K and those who understand it will be better equipped to design AGI systems where the relevant consequentialist reasoning happens in transparent boxes rather than black boxes.”

Indeed, if I thought one had to understand good consequentialist reasoning in order to design a highly capable AI system, I’d be less worried by a decent margin.


by Jessica Taylor 57 days ago | link

The way I wrote it, I didn’t mean to imply “the designers need to understand the low-K thing for the system to be highly capable”, merely “the low-K thing must appear in the system somewhere for it to be highly capable”. Does the second statement seem right to you?

(perhaps a weaker statement, like “for the system to be highly capable, the low-K thing must be the correct high-level understanding of the system, and so the designers must understand the low-K thing to understand the behavior of the system at a high level”, would be better?)


by Nate Soares 53 days ago | link

The second statement seems pretty plausible (when we consider human-accessible AGI designs, at least), but I’m not super confident of it, and I’m not resting my argument on it.

The weaker statement you provide doesn’t seem like it’s addressing my concern. I expect there are ways to get highly capable reasoning (sufficient for, e.g., gaining decisive strategic advantage) without understanding low-K “good reasoning”; the concern is that said systems are much more difficult to align.


by Owen Cotton-Barratt 56 days ago | Jessica Taylor and Nate Soares like this | link

Thanks for the write-up, this is helpful for me (Owen).

My initial takes on the five steps of the argument as presented, in approximately decreasing order of how much I am on board:

  • Number 3 is a logical entailment, no quarrel here
  • Number 5 is framed as “therefore”, but adds the assumption that this will lead to catastrophe. I think this is quite likely if the systems in question are extremely powerful, but less likely if they are of modest power.
  • Number 4 splits my intuitions. I begin with some intuition that selection pressure would significantly constrain the goal (towards something reasonable in many cases), but the example of Solomonoff Induction was surprising to me and makes me more unsure. I feel inclined to defer intuitions on this to others who have considered it more.
  • Number 2 I don’t have a strong opinion on. I can tell myself stories which point in either direction, and neither feels compelling.
  • Number 1 is the step I feel most sceptical about. It seems to me likely that the first AIs which can perform pivotal acts will not perform fully general consequentialist reasoning. I expect that they will perform consequentialist reasoning within certain domains (e.g. AlphaGo in some sense reasons about consequences of moves, but has no conception of consequences in the physical world). This isn’t enough to alleviate concern: some such domains might be general enough that something misbehaving in them would cause large problems. But it is enough for me to think that paying attention to scope of domains is a promising angle.


by Jessica Taylor 55 days ago | link

  • For #5, it seems like “capable of pivotal acts” is doing the work of implying that the systems are extremely powerful.
  • For #4, I think that selection pressure does not constrain the goal much, since different terminal goals produce similar convergent instrumental goals. I’m still uncertain about this, though; it seems at least plausible (though not likely) that an agent’s goals are going to be aligned with a given task if e.g. their reproductive success is directly tied to performance on the task.
  • Agree on #2; I can kind of see it both ways too.
  • I’m also somewhat skeptical of #1. I usually think of it in terms of “how much of a competitive edge does general consequentialist reasoning give an AI project” and “how much of a competitive edge will safe AI projects have over unsafe ones, e.g. due to having more resources”.


by Owen Cotton-Barratt 54 days ago | Jessica Taylor and Nate Soares like this | link

For #5, OK, there’s something to this. But:

  • It’s somewhat plausible that stabilising pivotal acts will be available before world-destroying ones;
  • Actually there’s been a supposition smuggled in already with “the first AI systems capable of performing pivotal acts”. Perhaps there will at no point be a system capable of a pivotal act. I’m not quite sure whether it’s appropriate to talk about the collection of systems that exist being together capable of pivotal acts if they will not act in concert. Perhaps we’ll have a collection of systems which if aligned would produce a win, or if acting together towards an unaligned goal would produce catastrophe. It’s unclear whether, if they each have different unaligned goals, we necessarily get catastrophe (though it’s certainly not a comfortable scenario).

I like your framing for #1.


by Jessica Taylor 54 days ago | link

I agree that things get messier when there is a collection of AI systems rather than a single one. “Pivotal acts” mostly make sense in the context of local takeoff. In nonlocal takeoff, one of the main concerns is that goal-directed agents not aligned with human values are going to find a way to cooperate with each other.


by Stuart Armstrong 56 days ago | link

I kinda get your point here, but it would work better with specific examples, including some non-trivial toy examples of failures for superintelligent agents.


by Jessica Taylor 56 days ago | link

It seems like the Solomonoff induction example is illustrative; what do you think it doesn’t cover?


by David Krueger 42 days ago | link

This post helped me understand HRAD a lot better. I’m quite confident that subsystems (SS) will be smarter than top-level systems (TS) (because meta-learning will work). So on that it seems we agree. Although, I’m not sure we have the same thing in mind by “smarter” (e.g., I don’t mean that SSs will use some kind of reasoning which is different from model-based RL, just that we won’t be able to easily/tractably identify what algo is being run at the SS level, because: 1. interpretability will be hard and 2. it will be a pile of hacks that won’t have a simple, complete description).

I think this is the main disagreement: I don’t believe that SS will work better because it will stumble upon some low-K reasoning core; I just think SS will be much better at rapid iterative improvements to AI algos. Actually, it seems possible that, lacking some good TS reasoning, SS will eventually hack itself to death :P.

I’m still a bit put-off by talking about TS vs. SS being “dominant”, and I think there is possibly some difference of views lurking behind this language.





