Intelligent Agent Foundations Forum
by Paul Christiano 442 days ago

I’ve been trying to convince people that there will be strong trade-offs between safety and performance

What do you see as the best arguments for this claim? I haven’t seen much public argument for it and am definitely interested in seeing more. I grant that it’s prima facie plausible (as is the alternative).

Some caveats:

It’s obvious there are trade-offs between safety and performance in the usual sense of “safety.” But we are interested in a special kind of failure, where a failed system ends up controlling a significant share of the entire universe’s resources (rather than e.g. causing an explosion), and it’s less obvious that preventing such failures necessarily requires a significant cost.

It’s also obvious that there is an additional cost to be paid in order to solve control, e.g. consider the fact that we are currently spending time on it. But the question is how much additional work needs to be done. Does building aligned systems require 1000% more work? 10%? 0.1%? I don’t see why it should be obvious that this number is on the order of 100% rather than 1%.

Similarly for performance costs. I’m willing to grant that an aligned system will be more expensive to run. But is that cost an extra 1000% or an extra 0.1%? Both seem quite plausible. From a theoretical perspective, the question is whether the required overhead is linear or sublinear.

I haven’t seen strong arguments for the “linear overhead” side, and my current guess is that the answer is sublinear. But again, both positions seem quite plausible.

(There are currently a few major obstructions to my approach that could plausibly give a tight theoretical argument for linear overhead, such as the translation example in the discussion with Wei Dai. In the past such obstructions have ended up seeming surmountable, but I think that it is totally plausible that eventually one won’t. And at that point I hope to be able to make clean statements about exactly what kind of thing we can’t hope to do efficiently+safely / exactly what kinds of additional assumptions we would have to make / what the key obstructions are).

Personally, I tend to think that we ought to address the coordination problem head-on

I think this is a good idea and a good project, which I would really like to see more people working on. In the past I may have seemed more dismissive, and if so I apologize for being misguided. I’ve spent a little bit of time thinking about it recently, and my feeling is that there is a lot of productive and promising work to do.

My current guess is that AI control is the more valuable thing for me personally to work on, though I could imagine being convinced otherwise.

I feel that AI control is valuable given that (a) it has a reasonable chance of succeeding even if we can’t solve these coordination problems, and (b) convincing evidence that the problem is hard would be a useful input into getting the AI community to coordinate.

If you managed to get AI researchers to effectively coordinate around conditionally restricting access to AI (if it proved to be dangerous), then that would seriously undermine argument (b). I believe that a sufficiently persuasive/charismatic/accomplished person could probably do this today.

If I ended up becoming convinced that AI control was impossible, this would undermine argument (a) (though hopefully the impossibility argument could itself be used to satisfy desideratum (b)).




