Intelligent Agent Foundations Forum
How to judge moral learning failure
post by Stuart Armstrong 683 days ago | 2 comments

A putative new idea for AI control; index here.

I’m finding many different results that show problems and biases in the reward learning process.

But there is a meta problem, which is answering the question: “If the AI gets it wrong, how bad is it?”. It’s clear that there are some outcomes which might be slightly suboptimal – like the complete extinction of all intelligent life across the entire reachable universe. But it’s not so clear what to do if the error is smaller.


For instance, suppose that there are two moral theories, M1 and M2. An AI following M1 would lead to outcome O1, a trillion trillion people leading superb lives. An AI following M2 would lead to outcome O2, a trillion people leading superbly superb lives.

If one of M1 or M2 was the better moral theory from the human perspective, how would we assess the consequences of the AI choosing the wrong one? One natural way of doing this is to use these moral theories to assess each other. How bad, from M1’s perspective, is O2 compared with O1? And vice versa for M2. Because of how value accumulates on different theories, it’s plausible that M1(O1) could be a million times better than M1(O2) (as O1 is finely adapted to M1’s constraints). And similarly, M2(O1) could be a million times worse than M2(O2).
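To make the asymmetry concrete, here is a minimal numerical sketch (the specific utility figures are invented purely for illustration, not derived from either theory):

```python
# Toy cross-evaluation of the two moral theories on the two outcomes.
# The utility numbers are invented purely to illustrate the asymmetry.

M1 = {"O1": 1_000_000.0, "O2": 1.0}  # O1 is finely adapted to M1's constraints
M2 = {"O1": 1.0, "O2": 1_000_000.0}  # O2 is finely adapted to M2's constraints

for name, theory in (("M1", M1), ("M2", M2)):
    ratio = theory["O1"] / theory["O2"]
    print(f"Under {name}, O1 is {ratio:g} times as good as O2")

# Each theory rates the other's favoured outcome about a million times worse,
# even though, from our current perspective, both outcomes look pretty good.
```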

So from that perspective, the cost of choosing the wrong moral theory is disastrous, by a factor of a million. But from my current perspective (and that of most humans), the costs do not seem so large; both O1 and O2 seem pretty neat. I certainly wouldn’t be willing to run a 99.9999% chance of human extinction, in exchange for the AI choosing the better moral theory.

Convincing humans what to value

At least part of this stems, I think, from the fact that humans can be convinced of many things. A superintelligent AI could convince us to value practically anything. Even if we restrict ourselves to “non-coercive” or “non-manipulative” convincing (and defining those terms is a large part of the challenge), there’s still a very wide space of possible future values. Even if we restrict ourselves massively – to future values we might have with no AI convincing at all, just our continuing to live our lives according to the vagaries of fortune – that’s still a wide span.

So our values are underdetermined in important ways (personal example: I didn’t expect I’d become an effective altruist or an expected utility maximiser, or ever come to respect (some) bureaucracies). Saying “you will come to value M1, and M1 ranks O1 way above O2” doesn’t mean that you should value O1 way above O2, since it’s possible, given different future interactions, that you would instead have come to value M2, or something similar.

We should also give some thought to our future values in impossible universes. It’s perfectly plausible that if we existed in a specific universe (e.g. a fantasy Tolkien universe), we might come to value, non-coercively and non-manipulatively, a moral theory M3, even though it’s almost certain we would never come to value M3 in our actual physical or social universe. We are still proto-M3 believers.

I’m thinking of modelling this as classical moral uncertainty over plausible value/reward functions in a set R={Ri}, but with the probability of a given Ri never allowed to go below a certain minimum. There’s an irreducible core of multiple moralities that never goes away. The purpose of this core is not to inform our future decisions, or to train the AI, but purely to assess the goodness of different moralities in the future.
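A minimal sketch of what that might look like, assuming Bayesian updating over R and an arbitrary floor value (neither of which is meant as the actual proposal):

```python
import numpy as np

def update_with_floor(probs, likelihoods, floor=0.01):
    """Bayesian update over candidate reward functions R_i, but with each
    posterior probability prevented from dropping below `floor` -- a crude
    stand-in for the 'irreducible core' of moralities that never goes away.

    The floor value and the clip-then-renormalise scheme are illustrative
    assumptions, not part of the proposal itself."""
    posterior = probs * likelihoods
    posterior = posterior / posterior.sum()
    posterior = np.maximum(posterior, floor)  # enforce the irreducible core
    return posterior / posterior.sum()

# Three candidate reward functions, initially equally plausible.
p = np.array([1/3, 1/3, 1/3])

# Repeated evidence strongly favouring R_1 over R_2 and R_3.
for _ in range(20):
    p = update_with_floor(p, np.array([0.9, 0.05, 0.05]))

print(p)  # R_2 and R_3 hover near the floor instead of vanishing
```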



by Ryan Carey 682 days ago | link

I’m thinking of modelling this as classical moral uncertainty over plausible value/reward functions in a set R={Ri}, but with the probability of a given Ri never allowed to go below a certain minimum.

It’s surprising to me that you would want your probabilities of each reward function to not approach zero, even asymptotically. In regular bandit problems, if the probability with which you select some suboptimal action never asymptotes toward zero, then you will necessarily keep making some kinds of mistakes forever, incurring linear regret. The same should be true, for some suitable definition of regret, if you stubbornly continue to behave according to some “wrong” moral theory.
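A toy illustration of the linear-regret point (the two-armed bandit setup and the parameter values are arbitrary, chosen purely for illustration):

```python
import random

def simulated_regret(steps, floor=0.05, gap=1.0):
    """Toy two-armed bandit: the good arm pays `gap` more than the bad arm.
    A policy that keeps pulling the bad arm with probability `floor` forever
    accumulates roughly floor * gap regret per step -- i.e. linear regret.
    The parameter values are arbitrary, for illustration only."""
    regret = 0.0
    for _ in range(steps):
        if random.random() < floor:  # never-vanishing pull of the worse arm
            regret += gap
    return regret

for steps in (1_000, 10_000, 100_000):
    print(steps, round(simulated_regret(steps)))  # grows roughly linearly with steps
```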


by Stuart Armstrong 678 days ago | link

But I’m arguing that using these moral theories to assess regret is the wrong thing to do.



