Learning doesn't solve philosophy of ethics
discussion post by Stuart Armstrong

A putative new idea for AI control; index here.

This post uses the formalism of a previous post to illustrate some well-known philosophical thought experiments, and to show why learning algorithms are not sufficient to solve them.

Examples

Death and life extension

A human consists of two agents, \(A_{100}\) and \(A_1\). The agent \(A_{100}\) is a long-term agent; it has the preference that the human not live longer than a century. The agent \(A_1\) is a short-term agent; it prefers that the human survive for the coming year.

The human's meta-preferences \(M\) are that \(A_{100}\) and \(A_{1}\) eventually be brought into compatibility with each other.

By observation and prediction, the AI knows that, under the normal course of events, \(A_1\) will never sync with \(A_{100}\): the human will continue to believe that it shouldn’t live another hundred years, but will never want to die that year.

The AI can trigger human introspection \(I\) in two ways: the first removes the long-term death preference in \(A_{100}\); the second removes the short-term death-avoidance in \(A_1\) at some later point, so that the human will act consistently with its current \(A_{100}\) (and thus die within the century).

Just based on this information, what are the human's preferences?
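
To see why observation alone cannot settle this, here is a minimal sketch of the underdetermination (the class, the introspection operators, and their names are illustrative assumptions, not part of the original formalism):

```python
# Minimal sketch (assumed names and values): two sub-agents, a meta-preference M
# that they eventually agree, and two introspection operators the AI could
# trigger. Both resolutions satisfy M, but imply opposite "true" preferences.
from dataclasses import dataclass

@dataclass
class Human:
    a_100: str  # long-term agent A_100's preference
    a_1: str    # short-term agent A_1's preference

    def satisfies_M(self) -> bool:
        """Meta-preference M: the two sub-agents agree."""
        return self.a_100 == self.a_1

# Current state: the sub-agents conflict about life extension.
human = Human(a_100="do not live past a century", a_1="survive the coming year")

def remove_long_term_death_preference(h: Human) -> Human:
    # First introspection: A_100 drops its objection; the human seeks life extension.
    return Human(a_100=h.a_1, a_1=h.a_1)

def remove_short_term_death_avoidance(h: Human) -> Human:
    # Second introspection: A_1 defers to A_100; the human accepts dying within the century.
    return Human(a_100=h.a_100, a_1=h.a_100)

for introspect in (remove_long_term_death_preference, remove_short_term_death_avoidance):
    resolved = introspect(human)
    assert resolved.satisfies_M()
    print(introspect.__name__, "->", resolved)

# Both end states are fully consistent with M and with everything the AI has
# observed, so more behavioural data alone cannot pick out the "true" preference.
```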

Total utilitarianism

The human has the preference \(r\) that humans not be reduced to a large population of barely-happy individuals. They also have the meta-preference \(M\) that individual utility be additive.

The AI can trigger the human’s awareness of the repugnant conclusion. And it can do this in a differential or integral fashion, which will cause the human to either reject its current \(r\) (and embrace the repugnant conclusion) or reject \(M\) (and reject the repugnant conclusion).
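
For concreteness, a worked instance of the tension, with purely illustrative numbers: if the meta-preference \(M\) holds and individual utilities add, then

\[
U(10^{10} \text{ people at utility } 10) = 10^{11} \;<\; 10^{12} = U(10^{13} \text{ people at utility } 0.1),
\]

so the huge, barely-happy population is ranked strictly higher, directly contradicting \(r\). The human must give up one of \(r\) or \(M\), and which one they give up depends on how the conflict is presented.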

Just based on this information, what are the human's preferences?

The malarial drowning child

Peter Singer has an argument about a drowning child and our duty to them.

To model that contradiction in a human, let \(r\) contain the preference to save a drowning child in front of them, and a preference not to send money to distant people dying of malaria. Let \(M\) contain the desire that the human's preferences not differ across different ways of dying or across physical distance.

As before, the right presentation on the AI’s part, within the usual bounds of how humans reason, can cause the human to emphasise their preferences or their meta-preferences.

Balconies with a view

The human is modelled as two agents \(A_1\) (basically system 1) and \(A_2\) (system 2).

The human travels a lot, and likes to go out on the balcony to look at various views. They have an instinctive (\(A_1\)) fear of falling, but typically override this with reason (\(A_2\)). Except that \(A_1\)'s fear varies in intensity: it wants to avoid wooden balconies with a (consciously imperceptible) faint smell of rot, and it also wants to avoid balconies around sunset.

Given that faint rot increases danger and sunsets don’t, what are we to make of this agent’s true preferences?
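
As a minimal sketch of why behavioural data alone underdetermines the answer (the features and the two models below are illustrative assumptions, not the post's formalism), consider two reward-plus-bias models that predict exactly the same balcony behaviour:

```python
# Illustrative sketch (assumed features and models): the same observed
# behaviour is compatible with two different attributions of "true" preference.

balconies = [
    {"faint_rot": True,  "sunset": False, "dangerous": True},   # rot smell, genuinely risky
    {"faint_rot": False, "sunset": True,  "dangerous": False},  # sunset, perfectly safe
    {"faint_rot": False, "sunset": False, "dangerous": False},  # plain, safe
]

def observed_behaviour(b):
    """A_1's fear wins on rot and on sunsets; A_2 overrides it otherwise."""
    return "avoid" if (b["faint_rot"] or b["sunset"]) else "enjoy the view"

def model_danger_averse(b):
    # True reward: avoid danger. The sunset fear is labelled an A_1 bias, not a preference.
    irrational_bias = b["sunset"]
    return "avoid" if (b["dangerous"] or irrational_bias) else "enjoy the view"

def model_fear_averse(b):
    # True reward: avoid whatever A_1 fears. Nothing is labelled a bias.
    return "avoid" if (b["faint_rot"] or b["sunset"]) else "enjoy the view"

# Both models reproduce the observed behaviour on every balcony...
assert all(model_danger_averse(b) == observed_behaviour(b) == model_fear_averse(b)
           for b in balconies)
# ...but they disagree about the human's true preferences, so choosing between
# them needs principles about which A_1 reactions to endorse, not more data.
print("Both models fit the data; the true preference is underdetermined.")
```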

The big question: what’s tolerable?

Now, the first three examples illustrate big differences in outcomes: the difference between being a total utilitarian and not being one is non-trivial, wanting life-extension technology or not could make a huge difference in outcome, and so on.

However, all are within the scope of "tolerable outcomes", very broadly defined. None result in optimisation of the universe for money or paperclips, or in immediate human extinction. We could extend the models to get those situations (e.g. by having some of these agents in a position to make long-term or large-impact decisions).

But the key question remains: even if we add more details to the model of human rationality, along with some principles for resolving these types of conflicts (principles which the AI can't simply "learn"), we will still likely end up with the AI's computed reward function being unpredictable within a large class of functions. Can we nonetheless ensure it is "tolerable", or does anything less than perfect modelling of human irrationality result in a disastrously optimised outcome?
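
One way to restate this more formally (the notation \(\mathcal{R}\) and \(\mathcal{T}\) is introduced here for illustration, and is not from the original formalism): let \(\mathcal{R}\) be the class of reward functions the AI might compute given its data and an approximate model of human irrationality, and let \(\mathcal{T}\) be the broadly-defined set of tolerable reward functions. The question is whether we can guarantee \(\mathcal{R} \subseteq \mathcal{T}\) even when the rationality model is only approximate, rather than needing \(\mathcal{R}\) to collapse to the single "true" reward.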

How approximately can we input human irrationalities into a learning AI?


