Intelligent Agent Foundations Forum
by Owen Cotton-Barratt 322 days ago | Jessica Taylor and Nate Soares like this | link | parent

Thanks for the write-up, this is helpful for me (Owen).

My initial takes on the five steps of the argument as presented, in approximately decreasing order of how much I am on board:

  • Number 3 is a logical entailment; no quarrel here.
  • Number 5 is framed as “therefore”, but adds the assumption that this will lead to catastrophe. I think this is quite likely if the systems in question are extremely powerful, but less likely if they are of modest power.
  • Number 4 splits my intuitions. I begin with some intuition that selection pressure would significantly constrain the goal (towards something reasonable in many cases), but the example of Solomonoff Induction was surprising to me and makes me more unsure. I feel inclined to defer intuitions on this to others who have considered it more.
  • Number 2 I don’t have a strong opinion on. I can tell myself stories which point in either direction, and neither feels compelling.
  • Number 1 is the step I feel most sceptical about. It seems to me likely that the first AIs which can perform pivotal acts will not perform fully general consequentialist reasoning. I expect that they will perform consequentialist reasoning within certain domains (e.g. AlphaGo in some sense reasons about consequences of moves, but has no conception of consequences in the physical world). This isn’t enough to alleviate concern: some such domains might be general enough that something misbehaving in them would cause large problems. But it is enough for me to think that paying attention to the scope of domains is a promising angle.

by Jessica Taylor 321 days ago | link

  • For #5, it seems like “capable of pivotal acts” is doing the work of implying that the systems are extremely powerful.
  • For #4, I think that selection pressure does not constrain the goal much, since different terminal goals produce similar convergent instrumental goals. I’m still uncertain about this, though; it seems at least plausible (though not likely) that an agent’s goals are going to be aligned with a given task if e.g. their reproductive success is directly tied to performance on the task.
  • Agree on #2; I can kind of see it both ways too.
  • I’m also somewhat skeptical of #1. I usually think of it in terms of “how much of a competitive edge does general consequentialist reasoning give an AI project” and “how much of a competitive edge will safe AI projects have over unsafe ones, e.g. due to having more resources”.


by Owen Cotton-Barratt 321 days ago | Jessica Taylor and Nate Soares like this | link

For #5, OK, there’s something to this. But:

  • It’s somewhat plausible that stabilising pivotal acts will be available before world-destroying ones;
  • Actually there’s been a supposition smuggled in already with “the first AI systems capable of performing pivotal acts”. Perhaps there will at no point be a system capable of a pivotal act. I’m not quite sure whether it’s appropriate to talk about the collection of systems that exist being together capable of pivotal acts if they will not act in concert. Perhaps we’ll have a collection of systems which if aligned would produce a win, or if acting together towards an unaligned goal would produce catastrophe. If they each have different unaligned goals, it’s unclear that we necessarily get catastrophe (though it’s certainly not a comfortable scenario).

I like your framing for #1.


by Jessica Taylor 321 days ago | link

I agree that things get messier when there is a collection of AI systems rather than a single one. “Pivotal acts” mostly make sense in the context of local takeoff. In nonlocal takeoff, one of the main concerns is that goal-directed agents not aligned with human values are going to find a way to cooperate with each other.





