Intelligent Agent Foundations Forum
Learning doesn't solve philosophy of ethics
discussion post by Stuart Armstrong 447 days ago

A putative new idea for AI control; index here.

This post will use the formalism of this post to illustrate some well-known philosophical thought experiments and show why learning algorithms are not sufficient to solve them.


Death and life extension

A human consists of two agents, \(A_{100}\) and \(A_1\). The agent \(A_{100}\) is a long-term agent; it has the preference that the human not live longer than a century. The agent \(A_1\) is a short-term agent; it prefers that the human survive for the coming year.

The human meta-preferences \(M\) are that \(A_{100}\) and \(A_{1}\) eventually be brought into compatibility with each other.

By observation and prediction, the AI knows that, under the normal course of events, \(A_1\) will never sync with \(A_{100}\): the human will continue to believe that it shouldn’t live longer than a century, but will never want to die in any given year.

The AI can trigger human introspection \(I\) in two ways: the first will remove the long-term death preference in \(A_{100}\); the second will remove the short-term death-avoidance in \(A_1\) at some later point, so that the human will act consistently with its current \(A_{100}\) (and thus die within the century).

Just based on this information, what are the human’s preferences?
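To make the ambiguity concrete, here is a minimal Python sketch of the example. Everything in it is an illustrative assumption rather than part of the formalism: the `Human` class, the boolean encoding of the two agents' preferences, the toy `satisfies_m` check standing in for \(M\), and the two functions standing in for the two ways of triggering \(I\).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Human:
    # Hypothetical boolean encoding of the two agents' preferences.
    a100_wants_death_within_century: bool  # long-term agent A_100
    a1_wants_to_survive_this_year: bool    # short-term agent A_1


def satisfies_m(h: Human) -> bool:
    # Toy stand-in for M: the agents count as compatible unless
    # both of the conflicting preferences are still held.
    return not (h.a100_wants_death_within_century
                and h.a1_wants_to_survive_this_year)


def drop_long_term(h: Human) -> Human:
    # First introspection: remove A_100's death preference.
    return Human(False, h.a1_wants_to_survive_this_year)


def drop_short_term(h: Human) -> Human:
    # Second introspection: remove A_1's death-avoidance.
    return Human(h.a100_wants_death_within_century, False)


human = Human(True, True)
outcomes = {f.__name__: f(human) for f in (drop_long_term, drop_short_term)}

for name, result in outcomes.items():
    print(name, result, "satisfies M:", satisfies_m(result))
```

In this toy model both introspections produce a human who satisfies \(M\), yet the two resulting humans hold opposite preferences. Nothing in the observed data picks one over the other.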

Total utilitarianism

The human has the preference \(r\) that humans not be reduced to a large population of barely-happy individuals. They also have the meta-preference \(M\) that individual utility be additive.

The AI can trigger the human’s awareness of the repugnant conclusion. And it can do this in a differential or integral fashion, which will cause the human to either reject its current \(r\) (and embrace the repugnant conclusion) or reject \(M\) (and reject the repugnant conclusion).

Just based on this information, what are the human’s preferences?

The malarial drowning child

Peter Singer has an argument about a drowning child and our duty to them.

To model that contradiction in a human, let \(r\) contain the preference to save a drowning child in front of them, and a preference not to send money to distant people dying of malaria. Let \(M\) contain the desire that the human preferences not be different across different ways of dying or physical distance.

As before, the right presentation on the AI’s part, within the usual bounds of how humans reason, can cause the human to emphasise their preferences or their meta-preferences.

Balconies with a view

The human is modelled as two agents \(A_1\) (basically system 1) and \(A_2\) (system 2).

The human travels a lot, and likes to go out on the balcony to look at various views. They have an instinctive fear (\(A_1\)) of falling, but typically override this with reason (\(A_2\)). Except that \(A_1\)’s fear varies in intensity. It wants to avoid wooden balconies with a (consciously imperceptible) faint smell of rot. It also wants to avoid balconies around sunset.

Given that faint rot increases danger and sunsets don’t, what are we to make of this agent’s true preferences?
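A small Python sketch of why behaviour alone does not settle this. The two predicates are illustrative assumptions: `a1_avoids` encodes the two triggers of \(A_1\)'s fear described above, and `actually_dangerous` encodes the stated fact that only rot increases danger.

```python
from itertools import product


def a1_avoids(rot: bool, sunset: bool) -> bool:
    # Assumed model of A_1: fear spikes on faint rot or around sunset.
    return rot or sunset


def actually_dangerous(rot: bool, sunset: bool) -> bool:
    # Assumed ground truth from the text: only rot increases danger.
    return rot


cases = list(product([False, True], repeat=2))  # (rot, sunset) pairs
disagreements = [
    (rot, sunset)
    for rot, sunset in cases
    if a1_avoids(rot, sunset) != actually_dangerous(rot, sunset)
]

for rot, sunset in disagreements:
    print(f"rot={rot}, sunset={sunset}: A_1 avoids, but no extra danger")
```

An AI that only observes avoidance sees the same kind of signal for rot and for sunsets. Deciding that one is a preference to respect and the other a bias to correct needs the extra fact about danger, plus a principle saying that danger is what matters.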

The big question: what’s tolerable?

Now, the first three examples illustrate big differences in outcomes: the difference between being a total utilitarian and not is non-trivial, wanting life extension technology or not could make a huge difference in outcome, and so on.

However, all are within the scope of “tolerable outcomes”, very broadly defined. None result in optimisation of the universe for money or paperclips or immediate human extinction. We could extend the models to get those situations (eg by having some of these agents in a position to make long term or large impact decisions).

But the key question remains: if we add more details to the model of human rationality, along with some principles for resolving these types of conflicts (principles which the AI can’t simply “learn”), we will still likely end up with the AI’s computed reward function being something unpredictable in a large class of functions. However, can we ensure it’s “tolerable”, or does anything less than perfect modelling of human irrationality result in a disastrously optimised outcome?

How approximately can we input human irrationalities into a learning AI?




