Intelligent Agent Foundations Forum
by Owen Cotton-Barratt 414 days ago | Jessica Taylor and Nate Soares like this | link | parent

Thanks for the write-up, this is helpful for me (Owen).

My initial takes on the five steps of the argument as presented, in approximately decreasing order of how much I am on board:

  • Number 3 is a logical entailment, no quarrel here
  • Number 5 is framed as a “therefore”, but it adds the assumption that this will lead to catastrophe. I think this is quite likely if the systems in question are extremely powerful, but less likely if they are of modest power.
  • Number 4 splits my intuitions. I begin with some intuition that selection pressure would significantly constrain the goal (towards something reasonable in many cases), but the example of Solomonoff Induction was surprising to me and makes me more unsure. I feel inclined to defer intuitions on this to others who have considered it more.
  • Number 2 I don’t have a strong opinion on. I can tell myself stories which point in either direction, and neither feels compelling.
  • Number 1 is the step I feel most sceptical about. It seems to me likely that the first AIs which can perform pivotal acts will not perform fully general consequentialist reasoning. I expect that they will perform consequentialist reasoning within certain domains (e.g. AlphaGo in some sense reasons about consequences of moves, but has no conception of consequences in the physical world). This isn’t enough to alleviate concern: some such domains might be general enough that something misbehaving in them would cause large problems. But it is enough for me to think that paying attention to scope of domains is a promising angle.

by Jessica Taylor 413 days ago | link

  • For #5, it seems like “capable of pivotal acts” is doing the work of implying that the systems are extremely powerful.
  • For #4, I think that selection pressure does not constrain the goal much, since different terminal goals produce similar convergent instrumental goals. I’m still uncertain about this, though; it seems at least plausible (though not likely) that an agent’s goals are going to be aligned with a given task if e.g. their reproductive success is directly tied to performance on the task.
  • Agree on #2; I can kind of see it both ways too.
  • I’m also somewhat skeptical of #1. I usually think of it in terms of “how much of a competitive edge does general consequentialist reasoning give an AI project” and “how much of a competitive edge will safe AI projects have over unsafe ones, e.g. due to having more resources”.


by Owen Cotton-Barratt 413 days ago | Jessica Taylor and Nate Soares like this | link

For #5, OK, there’s something to this. But:

  • It’s somewhat plausible that stabilising pivotal acts will be available before world-destroying ones;
  • Actually, a supposition has already been smuggled in with “the first AI systems capable of performing pivotal acts”. Perhaps at no point will there be a system capable of a pivotal act. I’m not quite sure whether it’s appropriate to talk about the collection of existing systems being together capable of pivotal acts if they will not act in concert. Perhaps we’ll have a collection of systems which, if aligned, would produce a win, or which, acting together towards a single unaligned goal, would produce catastrophe. It’s unclear whether, if they each have different unaligned goals, we necessarily get catastrophe (though it’s certainly not a comfortable scenario).

I like your framing for #1.


by Jessica Taylor 413 days ago | link

I agree that things get messier when there is a collection of AI systems rather than a single one. “Pivotal acts” mostly make sense in the context of local takeoff. In nonlocal takeoff, one of the main concerns is that goal-directed agents not aligned with human values are going to find a way to cooperate with each other.





