Intelligent Agent Foundations Forum
by Wei Dai 458 days ago | Jessica Taylor and Stuart Armstrong like this

Essentially, the AI has to be able to do moral philosophy exactly as a human would, and to do it well. Without us being able to define what “exactly as a human would” means. And it has to continue this, as both it and humans change and we’re confronted by a world completely transformed, and situations we can’t currently imagine.

Despite AI safety becoming a more mainstream topic, I still haven’t seen a lot of people outside of FHI/MIRI/LessWrong acknowledge or discuss this part of the problem. (An alternative to AI being able to do moral philosophy correctly is developing an AI/human ecosystem that somehow preserves our collective ability to eventually discover our values and optimize for them, while not having a clear specification of what our values are or how to do moral philosophy in the meantime. But that doesn’t seem any easier and I haven’t seen people outside of FHI/MIRI/LessWrong talk about that either.)

I’m curious, since you probably have a much better idea of this than I do, do people who for example proposed (C)IRL without acknowledging the difficulties you described in this post actually understand these difficulties and just want to write papers that show some sort of forward progress, or are they not aware of them?




[…] do people who for example proposed (C)IRL without acknowledging the difficulties you described in this post actually understand these difficulties and just want to write papers that show some sort of forward progress, or are they not aware of them?

As someone who has worked on IRL a little bit, my impression is that such algorithms are not intended to capture human value to its full extent, but rather to learn shorter-term instrumental preferences. Paul gives some arguments for such “narrow value learning” here. This scenario, where human abilities are augmented using AI assistants, falls under your AI/human ecosystem category. I don’t think many people view “moral philosophy” as a separate type of activity that differentially benefits less from augmentation. Rather, AI assistants are seen as helping with essentially all tasks, including analyzing the consequences of decisions that have potentially far-reaching impacts, deciding when to keep our options open, and engineering the next-generation AI/human system in a way that maintains alignment. I don’t think this sort of bootstrapping process is understood very well, though.


by Wei Dai 453 days ago | Patrick LaVictoire likes this

(I saw your comment several days ago but couldn’t reply until now. Apparently it was in some sort of moderation state.)

I don’t think many people view “moral philosophy” as a separate type of activity that differentially benefits less from augmentation.

This is what worries me, though. It seems obvious to me that AI will augment some activities more than others, or earlier than others, and looking at the past, the activities that have benefited most or earliest from AI augmentation are the ones we understand best in a computational or mathematical sense: scientific computation, finding mathematical proofs, chess. I'd expect moral philosophy to be one of the last activities to benefit significantly from AI augmentation, since it seems really hard to understand what we're doing when we try to figure out our values, or even how to recognize a correct solution to this problem.

So in this approach we have to somehow build an efficient/competitive aligned system around a core (the human) who doesn't know what their values are, and doesn't explicitly know how to find out what their values are, or worse, thinks they know but is just plain wrong. (The latter perhaps applies to the great majority of the world's population.) I'd feel a lot better if people recognized this as a core difficulty, instead of brushing it away by assuming that moral philosophy won't differentially benefit less from augmentation (if that is indeed what they're doing). BTW, I think Paul does recognize this, but I'm talking about people outside of FHI/MIRI/LessWrong.


by Paul Christiano 453 days ago

Do you think that we can consider this as its own problem, of technology outpacing philosophy, which we can evaluate separately from other aspects of AI risk? Or are these problems tied together in a critical way?

In the past people have argued that we needed to resolve a wide range of philosophical questions prior to constructing AI because we would need to lock in answers to those questions at that point. I would like to push back against that view, while acknowledging that there may be object-level issues where we pay a cost because we lack philosophical understanding (e.g. how to trade off haste vs. extinction risk, how to deal with the possibility of strange physics, how to bargain effectively…). And I would further acknowledge that AI may have a differential effect on progress in physical technology vs. philosophy.

My current tentative view is that the total object-level cost from philosophical error is modest over the next subjective century. I also believe that you overestimate the differential effects of AI, but that’s also not very firm. If my view changed on these points it might make me more enthusiastic about philosophy or metaphilosophy as research projects.

I have a much stronger belief that we should treat metaphilosophy and AI control as separate problems, and in particular that these concerns about metaphilosophy should not significantly dampen my enthusiasm for my current approach to resolving control problems.


by Vladimir Nesov 453 days ago | Patrick LaVictoire likes this

I agree with the sentiment that there are philosophical difficulties that AI needs to take into account, but that would very likely take far too long to formulate in advance. Simpler kinds of indirect normativity that involve prediction of uploads allow delaying that work until after AI.

So this issue doesn't block all actionable work, as a straightforward reading would suggest: there may be no need for these activities to happen in this order in physical time. Instead it motivates work on the simpler kinds of indirect normativity that would allow such philosophical investigations to take place inside the AI's values. In particular, it motivates figuring out what kind of thing the AI's values are, in sufficient generality that they could represent the results of unexpected future philosophical progress.


by Wei Dai 451 days ago

If we could model humans as having well-defined values but irrational in predictable ways (e.g., due to computational constraints or having a limited repertoire of heuristics), then some variant of CIRL might be sufficient (along with solving certain other technical problems such as corrigibility and preventing bugs) for creating aligned AIs. I was (and still am) worried that some researchers think this is actually true, or by not mentioning further difficulties, give the wrong impression to policymakers and other researchers.

If you are already aware of the philosophical/metaphilosophical problems mentioned here, and have an approach that you think can work despite them, then it’s not my intention to dampen your enthusiasm. We may differ on how much expected value we think your approach can deliver, but I don’t really know another approach that you can more productively spend your time on.


by Patrick LaVictoire 457 days ago | Jessica Taylor likes this

The authors of the CIRL paper are in fact aware of them, and are pondering them for future work. I’ve had fruitful conversations with Dylan Hadfield-Menell (one of the authors), talking about how a naive implementation goes wrong for irrational humans, and about what a tractable non-naive implementation might look like (trying to model probabilities of a human’s action under joint hypotheses about the correct reward function and about the human’s psychology); he’s planning future work relevant to that question.
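The "joint hypotheses" idea above can be illustrated with a toy sketch (a minimal illustration, not Dylan's actual formulation; the reward hypotheses, candidate rationality levels, and Boltzmann noise model here are all assumptions chosen for concreteness). The agent maintains a posterior over pairs of (reward function, rationality parameter), so that evidence from the human's actions can be attributed either to their values or to their degree of irrationality:

```python
import math
from itertools import product

# Hypothetical toy reward functions over three actions, and candidate
# rationality levels beta for a Boltzmann-rational human model
# (low beta = noisy/irrational, high beta = near-optimal).
REWARDS = {
    "likes_a": {"a": 1.0, "b": 0.0, "c": 0.0},
    "likes_b": {"a": 0.0, "b": 1.0, "c": 0.0},
}
BETAS = [0.5, 5.0]

def action_likelihood(action, reward, beta):
    """P(action | reward, beta): softmax over action rewards."""
    z = sum(math.exp(beta * r) for r in reward.values())
    return math.exp(beta * reward[action]) / z

def joint_posterior(observed_actions):
    """Bayesian posterior over (reward, beta) pairs, uniform prior."""
    hyps = list(product(REWARDS, BETAS))
    post = {h: 1.0 / len(hyps) for h in hyps}
    for a in observed_actions:
        for (rname, beta) in hyps:
            post[(rname, beta)] *= action_likelihood(a, REWARDS[rname], beta)
        total = sum(post.values())
        post = {h: p / total for h, p in post.items()}
    return post

# The human mostly picks "a" but once picks "b"; the posterior must decide
# whether that reflects their values or their noise level.
posterior = joint_posterior(["a", "a", "b"])
for hyp, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(hyp, round(p, 3))
```

Note how the two hypothesis dimensions interact: an occasional "wrong" action shifts probability mass toward low-beta (more irrational) hypotheses rather than forcing a revision of the inferred reward, which is exactly why a naive implementation that assumes a fully rational human goes wrong.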

Also note Dylan’s talk on CIRL, value of information, and the shutdown problem, which doesn’t solve the problem entirely but which significantly improved my opinion of the usefulness of approaches like CIRL. (The writeup of this result is forthcoming.)


by Paul Christiano 455 days ago | Jessica Taylor likes this

Stuart Russell’s view seems to be similar to the one described by 180 in another comment: humans have preferences about how to do moral deliberation, and an IRL agent ought to learn to deliberate in a way that humans endorse, and then actually execute that deliberation, rather than directly learning arbitrarily complex values about e.g. population ethics.

(At least, I discussed this issue with him once and this was the impression I got, but I may have misunderstood.)

This view looks very reasonable to me. You and I have gone back and forth on this point a little bit but I don’t understand your position as well as I would like.


by Stuart Armstrong 457 days ago

An alternative to AI being able to do moral philosophy correctly is developing an AI/human ecosystem that somehow preserves our collective ability to eventually discover our values and optimize for them, while not having a clear specification of what our values are or how to do moral philosophy in the meantime.

That’s what I hope the various low-impact ideas will do.

[…] actually understand these difficulties

I think they do, partially. CIRL is actually a decent step forward, but I think they thought it was more of a step forward than it was.

Or maybe they thought that a little bit of extra work (a bit of meta-preferences, for instance) would be enough to make CIRL work.



