Intelligent Agent Foundations Forum
by Paul Christiano 347 days ago

I’ve been trying to convince people that there will be strong trade-offs between safety and performance

What do you see as the best arguments for this claim? I haven’t seen much public argument for it and am definitely interested in seeing more. I definitely grant that it’s prima facie plausible (as is the alternative).

Some caveats:

It’s obvious there are trade-offs between safety and performance in the usual sense of “safety.” But we are interested in a special kind of failure, where a failed system ends up controlling a significant share of the entire universe’s resources (rather than e.g. causing an explosion), and it’s less obvious that preventing such failures necessarily requires a significant cost.

It’s also obvious that there is an additional cost to be paid in order to solve control, e.g. consider the fact that we are currently spending time on it. But the question is how much additional work needs to be done. Does building aligned systems require 1000% more work? 10%? 0.1%? I don’t see why it should be obvious that this number is on the order of 100% rather than 1%.

Similarly for performance costs. I’m willing to grant that an aligned system will be more expensive to run. But is that cost an extra 1000% or an extra 0.1%? Both seem quite plausible. From a theoretical perspective, the question is whether the required overhead is linear or sublinear.

I haven’t seen strong arguments for the “linear overhead” side, and my current guess is that the answer is sublinear. But again, both positions seem quite plausible.
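
To make the linear/sublinear distinction concrete, here is a minimal sketch. It rests on assumptions that are not stated above: that the unaligned system costs n units to run, that “linear overhead” means an extra cost of the form c·n, and that “sublinear overhead” means an extra cost that grows more slowly than n (a square root is used purely for illustration). The function names and the constants c and k are hypothetical.

```python
import math


def linear_overhead(n: float, c: float = 0.1) -> float:
    """Extra cost proportional to the baseline cost n (assumed form: c * n)."""
    return c * n


def sublinear_overhead(n: float, k: float = 10.0) -> float:
    """Extra cost growing more slowly than n (assumed form: k * sqrt(n))."""
    return k * math.sqrt(n)


def relative_overhead(extra: float, n: float) -> float:
    """Overhead expressed as a fraction of the baseline cost n."""
    return extra / n


if __name__ == "__main__":
    # Compare the two regimes as the baseline cost grows.
    for n in (1e2, 1e4, 1e6, 1e8):
        lin = relative_overhead(linear_overhead(n), n)
        sub = relative_overhead(sublinear_overhead(n), n)
        print(f"n={n:.0e}  linear: {lin:.1%}  sublinear: {sub:.3%}")
```

Under these assumed forms, the linear case stays at a fixed percentage of the baseline however large the system gets, while the sublinear case shrinks toward zero as n grows. Which regime actually applies is exactly the open question.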

(There are currently a few major obstructions to my approach that could plausibly give a tight theoretical argument for linear overhead, such as the translation example in the discussion with Wei Dai. In the past such obstructions have ended up seeming surmountable, but I think that it is totally plausible that eventually one won’t. And at that point I hope to be able to make clean statements about exactly what kind of thing we can’t hope to do both efficiently and safely, exactly what kinds of additional assumptions we would have to make, and what the key obstructions are.)

Personally, I tend to think that we ought to address the coordination problem head-on

I think this is a good idea and a good project, which I would really like to see more people working on. In the past I may have seemed more dismissive and if so I apologize for being misguided. I’ve spent a little bit of time thinking about it recently and my feeling is that there is a lot of productive and promising work to do.

My current guess is that AI control is the more valuable thing for me personally to do, though I could imagine being convinced otherwise.

I feel that AI control is valuable given that (a) it has a reasonable chance of succeeding even if we can’t solve these coordination problems, and (b) convincing evidence that the problem is hard would be a useful input into getting the AI community to coordinate.

If you managed to get AI researchers to effectively coordinate around conditionally restricting access to AI (if it proved to be dangerous), then that would seriously undermine argument (b). I believe that a sufficiently persuasive/charismatic/accomplished person could probably do this today.

If I ended up becoming convinced that AI control was impossible, this would undermine argument (a) (though hopefully that impossibility argument could itself be used to satisfy desideratum (b)).




