Intelligent Agent Foundations Forum
How to judge moral learning failure
post by Stuart Armstrong 373 days ago | 2 comments

A putative new idea for AI control; index here.

I’m finding many different results that show problems and biases in the reward learning process.

But there is a meta problem, which is answering the question: “If the AI gets it wrong, how bad is it?” It’s clear that there are some outcomes that might be slightly suboptimal – like the complete extinction of all intelligent life across the entire reachable universe. But it’s not so clear what to do if the error is smaller.


For instance, suppose that there are two moral theories, M1 and M2. An AI following M1 would lead to outcome O1, a trillion trillion people leading superb lives. An AI following M2 would lead to outcome O2, a trillion people leading superbly superb lives.

If one of M1 or M2 were the better moral theory from the human perspective, how would we assess the consequences of the AI choosing the wrong one? One natural way of doing this is to use these moral theories to assess each other. How bad, from M1’s perspective, is O2 compared with O1? And vice versa for M2. Because of how value accumulates under different theories, it’s plausible that M1(O1) could be a million times better than M1(O2) (as O1 is finely adapted to M1’s constraints). And similarly, M2(O1) could be a million times worse than M2(O2).

So from that perspective, the cost of choosing the wrong moral theory is disastrous, by a factor of a million. But from my current perspective (and that of most humans), the costs do not seem so large; both O1 and O2 seem pretty neat. I certainly wouldn’t be willing to run a 99.9999% chance of human extinction, in exchange for the AI choosing the better moral theory.
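To make this concrete, here is a toy sketch in Python. The M1 and M2 rows simply encode the million-fold ratios stipulated above; the “current” row, standing in for my present-day evaluation of the two outcomes, is an illustrative assumption, not something derived from M1 or M2.

```python
# Toy cross-evaluation table. M1/M2 use the million-fold ratios from the example;
# the "current" row is an assumed stand-in for our present-day perspective.
value = {
    "M1":      {"O1": 1e6, "O2": 1.0},
    "M2":      {"O1": 1.0, "O2": 1e6},
    "current": {"O1": 1.0, "O2": 1.2},
}

def fractional_loss(theory, chosen):
    """Fraction of value lost, by `theory`'s lights, if the AI realises
    `chosen` rather than the outcome that `theory` ranks highest."""
    best = max(value[theory].values())
    return 1 - value[theory][chosen] / best

print(fractional_loss("M1", "O2"))       # 0.999999: disastrous by M1's own lights
print(fractional_loss("M2", "O1"))       # 0.999999: disastrous by M2's own lights
print(fractional_loss("current", "O1"))  # ~0.17: both outcomes look fine from here
```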

Convincing humans what to value

At least part of this stems, I think, from the fact that humans can be convinced of many things. A superintelligent AI could convince us to value practically anything. Even if we restrict ourselves to “non-coercive” or “non-manipulative” convincing (and defining those terms is a large part of the challenge), there’s still a very wide space of possible future values. Even if we restrict ourselves massively – to future values we might have with no AI convincing at all, just our continuing to live our lives according to the vagaries of fortune – that’s still a wide span.

So our values are underdetermined in important ways (personal example: I didn’t expect I’d become an effective altruist or an expected utility maximiser, or ever come to respect (some) bureaucracies). So saying “you will come to value M1, and M1 ranks O1 way above O2” doesn’t mean that you should value O1 way above O2, since it’s possible, given different future interactions, that you would come to value M2, or something similar.

We should also give some thought to our future values in impossible universes. It’s perfectly plausible that if we existed in a specific universe (e.g. a fantasy Tolkien universe), we might come to value, non-coercively and non-manipulatively, a moral theory M3 – even though it’s almost certain we would never come to value M3 in our actual physical or social universe. We are still proto-M3 believers.

I’m thinking of modelling this as classical moral uncertainty over plausible value/reward functions in a set R={Ri}, but with the constraint that the probability of any given Ri never drops below a certain floor. There’s an irreducible core of multiple moralities that never goes away. The purpose of this core is not to inform our future decisions, or to train the AI, but purely to assess the goodness of different moralities in the future.
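Here is a minimal sketch of what I have in mind, with the details (a finite set R, Bayes-style updates, and a floor enforced by mixing with the uniform distribution) chosen purely for illustration:

```python
import numpy as np

class FlooredMoralUncertainty:
    """Moral uncertainty over a finite set of reward functions R = {R_i},
    where no R_i's probability is ever allowed to drop below `floor`.
    The distribution is used only to assess outcomes, not to choose actions."""

    def __init__(self, rewards, floor=0.01):
        self.rewards = rewards                              # callables: outcome -> value
        self.p = np.full(len(rewards), 1.0 / len(rewards))  # start uniform
        self.floor = floor

    def update(self, likelihoods):
        """Bayes-style update on evidence about what we would come to value,
        then mix with the uniform distribution so every R_i keeps at least
        `floor` probability (one simple scheme among many)."""
        n = len(self.p)
        assert self.floor <= 1.0 / n, "floor too large for this many theories"
        posterior = self.p * np.asarray(likelihoods, dtype=float)
        posterior /= posterior.sum()
        self.p = (1 - n * self.floor) * posterior + self.floor

    def assess(self, outcome):
        """Expected goodness of `outcome` under the floored distribution."""
        return sum(pi * R(outcome) for pi, R in zip(self.p, self.rewards))
```

With rewards=[M1, M2] and floor=0.1, even arbitrarily strong evidence in favour of M1 leaves M2 with at least a 10% say in how outcomes are assessed – which is exactly the irreducible core described above.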



by Ryan Carey 372 days ago | link

> I’m thinking of modelling this as classical moral uncertainty over plausible value/reward functions in a set R={Ri}, but with the constraint that the probability of any given Ri never drops below a certain floor.

It’s surprising to me that you would want your probabilities of each reward function to not approach zero, even asymptotically. In regular bandit problems, if the probability of selecting some suboptimal action never asymptotes toward zero, then you will necessarily keep making some kinds of mistakes forever, incurring linear regret. The same should be true, for some suitable definition of regret, if you stubbornly continue to behave according to some “wrong” moral theory.
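As a quick illustration of the bandit claim (the specific setup below is mine, added only to make the point concrete): a policy whose probability of playing a suboptimal arm is floored away from zero accumulates regret linearly in the horizon.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-armed bandit: arm 0 is optimal (mean reward 1.0), arm 1 is not (mean 0.5).
means = np.array([1.0, 0.5])
floor = 0.01     # the suboptimal arm is always played with at least this probability
T = 100_000

regret = 0.0
for t in range(T):
    # A policy that has already identified arm 0 as best, but keeps the floor on arm 1.
    arm = rng.choice(2, p=[1 - floor, floor])
    regret += means[0] - means[arm]

print(regret)    # ≈ floor * (means[0] - means[1]) * T = 500, i.e. linear in T
```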

reply

by Stuart Armstrong 368 days ago | link

But I’m arguing that using these moral theories to assess regret is the wrong thing to do.

reply


