Learning doesn't solve philosophy of ethics
discussion post by Stuart Armstrong 663 days ago

A putative new idea for AI control; index here.

This post uses the formalism of a previous post to illustrate some well-known philosophical thought experiments, and to show why learning algorithms are not sufficient to solve them.

# Death and life extension

A human consists of two agents, $$A_{100}$$ and $$A_1$$. The agent $$A_{100}$$ is a long-term agent; it has the preference that the human not live longer than a century. The agent $$A_1$$ is a short-term agent; it prefers that the human survive for the coming year.

The human meta-preferences $$M$$ are that $$A_{100}$$ and $$A_{1}$$ eventually be brought into compatibility with each other.

By observation and prediction, the AI knows that, under the normal course of events, $$A_1$$ will never sync with $$A_{100}$$: the human will continue to believe that it shouldn’t live another hundred years, but will never want to die that year.

The AI can trigger human introspection $$I$$ in two ways: the first will remove the long-term death preference in $$A_{100}$$; the second will, at some later point, remove the short-term death-avoidance in $$A_1$$, so that the human will act consistently with its current $$A_{100}$$ (and thus die within the century).

Just based on this information, what are the human’s preferences?
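To make the underdetermination concrete, here is a minimal sketch in Python. It is not the formalism referred to above; the `Human` class, its boolean flags, and the function names are all hypothetical simplifications introduced for illustration, and the sketch ignores the fact that the second introspection only takes effect at some later point.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Human:
    # Hypothetical encoding: one flag per sub-agent preference.
    a100_wants_death_within_century: bool  # long-term agent A_100
    a1_wants_to_survive_this_year: bool    # short-term agent A_1


def satisfies_m(h: Human) -> bool:
    """Meta-preference M: the two sub-agents must be compatible.

    In this toy model they clash only when the human wants to die
    within the century yet wants to survive every single year.
    """
    return not (h.a100_wants_death_within_century
                and h.a1_wants_to_survive_this_year)


def introspection_one(h: Human) -> Human:
    # First trigger: removes the long-term death preference in A_100.
    return replace(h, a100_wants_death_within_century=False)


def introspection_two(h: Human) -> Human:
    # Second trigger: removes the short-term death-avoidance in A_1.
    return replace(h, a1_wants_to_survive_this_year=False)


human = Human(True, True)
outcome_one = introspection_one(human)
outcome_two = introspection_two(human)

print(satisfies_m(human))                                   # False
print(satisfies_m(outcome_one), satisfies_m(outcome_two))   # True True
print(outcome_one == outcome_two)                           # False
```

Both outcomes satisfy $$M$$, yet they are different humans with opposite attitudes to life extension. Nothing in the observed data tells a learner which of the two the AI should bring about.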

# Total utilitarianism

The human has the preference $$r$$ that humans not be reduced to a large population of barely-happy individuals. They also have the meta-preference $$M$$ that individual utility be additive.

The AI can trigger the human’s awareness of the repugnant conclusion. And it can do this in a differential or integral fashion, which will cause the human to either reject its current $$r$$ (and embrace the repugnant conclusion) or reject $$M$$ (and reject the repugnant conclusion).

Just based on this information, what are the human’s preferences?
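The clash between $$r$$ and $$M$$ can be seen with a little arithmetic. The sketch below is only an illustration: the population sizes and per-person utilities are made-up numbers, and `total_utility` is a hypothetical helper standing in for the additivity that $$M$$ demands.

```python
def total_utility(population: int, utility_each: int) -> int:
    # Meta-preference M: individual utility is additive.
    return population * utility_each


# Hypothetical worlds, chosen only to exhibit the conflict.
flourishing = total_utility(10**10, 100)   # fewer, very happy people
barely_happy = total_utility(10**13, 1)    # many barely-happy people

# M ranks worlds by total utility, so it favours the larger sum.
m_prefers_barely_happy = barely_happy > flourishing

# Preference r, by assumption: the human rejects the barely-happy world.
r_accepts_barely_happy = False

conflict = m_prefers_barely_happy and not r_accepts_barely_happy
print(flourishing, barely_happy, conflict)
```

With these numbers $$M$$ endorses the very world that $$r$$ forbids, so one of the two has to give. Which one gives depends on how the AI presents the conclusion, not on anything the AI could learn from the human's current state.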

# The malarial drowning child

Peter Singer has an argument about a drowning child and our duty to them.

To model that contradiction in a human, let $$r$$ contain the preference to save a drowning child in front of them, and a preference not to send money to distant people dying of malaria. Let $$M$$ contain the desire that the human’s preferences not differ across ways of dying or across physical distance.

As before, the right presentation on the AI’s part, within the usual bounds of how humans reason, can cause the human to emphasise their preferences or their meta-preferences.

# Balconies with a view

The human is modelled as two agents $$A_1$$ (basically system 1) and $$A_2$$ (system 2).

The human travels a lot, and likes to go out on the balcony to look at various views. They have an instinctive ($$A_1$$) fear of falling, but typically override this with reason ($$A_2$$). Except that $$A_1$$’s fear varies in intensity. It wants to avoid wooden balconies with a (consciously imperceptible) faint smell of rot. It also wants to avoid balconies around sunset.

Given that faint rot increases danger and sunsets don’t, what are we to make of this agent’s true preferences?

## The big question: what’s tolerable?

Now, the first three examples illustrate big differences in outcomes: the difference between being a total utilitarian and not is non-trivial, wanting life extension technology or not could make a huge difference in outcome, and so on.

However, all are within the scope of “tolerable outcomes”, very broadly defined. None result in optimisation of the universe for money or paperclips or immediate human extinction. We could extend the models to get those situations (eg by having some of these agents in a position to make long term or large impact decisions).

But the key question remains: if we add more details to the model of human rationality, along with some principles for resolving these types of conflicts (principles which the AI can’t simply “learn”), we will still likely end up with the AI’s computed reward function being something unpredictable in a large class of functions. However, can we ensure it’s “tolerable”, or does anything less than perfect modelling of human irrationality result in a disastrously optimised outcome?

How approximately can we input human irrationalities into a learning AI?
