Intelligent Agent Foundations Forumsign up / log in
by Paul Christiano 350 days ago | link | parent

Can you be more explicit and formal about what you’re looking for?

Consider some particular research program that might yield powerful AI systems, e.g. (search for better model classes for deep learning, search for improved optimization algorithms, deploy these algorithms on increasingly large hardware). For each such research program I would like to have some general recipe that takes as input the intermediate products of that program (i.e. the hardware and infrastructure, the model class, the optimization algorithms) and uses them to produce an benign AI which is competitive with the output of the research program. The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong).

I suspect this is possible for some research programs and not others. I expect there are some programs for which this goal is demonstrably hopeless. I think that those research programs need to be treated with care. Moreover, if I had a demonstration that a research programs was dangerous in this way, I expect that I could convince people that it needs to be treated with care.

The people who need convincing will just think there’s almost certainly other ways to build a competitive aligned AI that doesn’t involve transforming the hard-to-align design.

Yes, at best someone might agree that a particular research program is dangerous/problematic. That seems like enough though—hopefully they could either be convinced to pursue other research programs that aren’t problematic, or would continue with the problematic research program and could then agree that other measures are needed to avert the risk.



by Wei Dai 341 days ago | link

If an AI causes its human controller to converge to false philosophical conclusions (especially ones relevant to their values), either directly through its own actions or indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can’t be considered benign. But given our current lack of metaphilosophical understanding, how do you hope to show that any particular AI (e.g., the output of a proposed transformation/recipe) won’t cause that? Or is the plan to accept a lower burden of proof, namely assume that the AI is benign as long as no one can show that it does cause its human controller to converge to false philosophical conclusions?

The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong).

If such a recipe existed for a project, that doesn’t mean it’s not problematic. For example, suppose there are projects A and B, each of which had such a recipe. It’s still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI (at sublinear additional cost). Similarly, if recipes didn’t exist for projects A and B individually, it might still exist for A+B. It seems like to make a meaningful statement you have to treat the entire world as one big research program. Do you agree?

reply

by Paul Christiano 340 days ago | link

If such a recipe existed for a project, that doesn’t mean it’s not problematic. For example, suppose there are projects A and B, each of which had such a recipe. It’s still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI

My hope is to get technique A working, then get technique B working, and then get A+B working, and so on, prioritizing based on empirical guesses about what combinations will end up being deployed in practice (and hoping to develop general understanding that can be applied across many programs and combinations). I expect that in many cases, if you can handle A and B you can handle A+B, though some interactions will certainly introduce new problems. This program doesn’t have much chance of success without new abstractions.

indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can’t be considered benign

This is definitely benign on my accounting. There is a further question of how well you do in conflict. A window is benign but won’t protect you from inputs that will drive you crazy. The hope is that if you have an AI that is benign + powerful than you may be OK.

directly through its own actions

If the agent is trying to implement deliberation in accordance with the user’s preferences about deliberation, then I want to call that benign. There is a further question of whether we mess up deliberation, which could happen with or without AI. We would like to set things up in such a way that we aren’t forced to deliberate earlier than we would otherwise want to. (And this included in the user’s preferences about deliberation, i.e. a benign AI will be trying to secure for the user the option of deliberating later, if the user believes that deliberating later is better than deliberating in concert with the AI now.)

Malign just means “actively optimizing for something bad,” the hope is to avoid that, but this doesn’t rule out other kinds of problems (e.g. causing deliberation to go badly due to insufficient competence, blowing up the world due to insufficient competence, etc.)

Overall, my current best guess is that this disagreement is better to pursue after my research program is further along, we know things like whether “benign” makes sense as an abstraction, I have considered some cases where benign agents necessarily seem to be less efficient, and so on.

I am still interested in arguments that might (a) convince me to not work on this program, e.g. because I should be working on alternative social solutions, or (b) convince others to work on this program, e.g. because they currently don’t see how it could succeed but might work on it if they did, or (c) which clarify the key obstructions for this research program.

reply



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

This is exactly the sort of
by Stuart Armstrong on Being legible to other agents by committing to usi... | 0 likes

When considering an embedder
by Jack Gallagher on Where does ADT Go Wrong? | 0 likes

The differences between this
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Looking "at the very
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Without reading closely, this
by Paul Christiano on Policy Selection Solves Most Problems | 1 like

>policy selection converges
by Stuart Armstrong on Policy Selection Solves Most Problems | 0 likes

Indeed there is some kind of
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

Very nice. I wonder whether
by Vadim Kosoy on Hyperreal Brouwer | 0 likes

Freezing the reward seems
by Vadim Kosoy on Resolving human inconsistency in a simple model | 0 likes

Unfortunately, it's not just
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

>We can solve the problem in
by Wei Dai on The Happy Dance Problem | 1 like

Maybe it's just my browser,
by Gordon Worley III on Catastrophe Mitigation Using DRL | 2 likes

At present, I think the main
by Abram Demski on Looking for Recommendations RE UDT vs. bounded com... | 0 likes

In the first round I'm
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

Fine with it being shared
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

RSS

Privacy & Terms