Intelligent Agent Foundations Forum
Learning doesn't solve philosophy of ethics
discussion post by Stuart Armstrong 544 days ago

A putative new idea for AI control; index here.

This post uses the formalism of an earlier post to illustrate some well-known philosophical thought experiments, and to show why learning algorithms are not sufficient to solve them.


Death and life extension

A human consists of two agents, \(A_{100}\) and \(A_1\). The agent \(A_{100}\) is a long-term agent; it has the preference that the human not live longer than a century. The agent \(A_1\) is a short-term agent; it prefers that the human survive for the coming year.

The human’s meta-preferences \(M\) are that \(A_{100}\) and \(A_{1}\) eventually be brought into compatibility with each other.

By observation and prediction, the AI knows that, under the normal course of events, \(A_1\) will never sync with \(A_{100}\): the human will continue to believe that it shouldn’t live another hundred years, but will never want to die that year.

The AI can trigger human introspection \(I\) in two ways: the first will remove the long-term death preference in \(A_{100}\); the second will, at some later point, remove the short-term death-avoidance in \(A_1\), so that the human will act consistently with its current \(A_{100}\) (and thus die within the century).

Just based on this information, what are the human’s preferences?
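The ambiguity can be made concrete with a toy sketch. The Python below is only an illustration under stated assumptions, not the formalism referred to above: the `Human` dataclass, the boolean encoding of the two preferences, `satisfies_M` and the two `introspect_*` functions are all hypothetical names introduced here. It shows that both introspection paths leave the human satisfying \(M\), yet with opposite final preferences.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Human:
    # A_100: long-term preference not to live beyond a century.
    wants_death_within_century: bool
    # A_1: short-term preference to survive the coming year.
    wants_to_survive_this_year: bool


def satisfies_M(h: Human) -> bool:
    """Meta-preference M: A_100 and A_1 must be compatible.

    Assumed encoding: the two conflict exactly when the human wants to
    die within the century and also wants to survive every single year.
    """
    return not (h.wants_death_within_century and h.wants_to_survive_this_year)


def introspect_drop_long_term(h: Human) -> Human:
    # First introspection path: remove the long-term death preference.
    return replace(h, wants_death_within_century=False)


def introspect_drop_short_term(h: Human) -> Human:
    # Second introspection path: remove the short-term death-avoidance.
    return replace(h, wants_to_survive_this_year=False)


human = Human(wants_death_within_century=True, wants_to_survive_this_year=True)
life_extension = introspect_drop_long_term(human)
accept_death = introspect_drop_short_term(human)

print(satisfies_M(human))           # the starting human violates M
print(satisfies_M(life_extension))  # first path satisfies M
print(satisfies_M(accept_death))    # second path satisfies M
print(life_extension == accept_death)
```

In this toy model nothing in `human` or in `satisfies_M` singles out one path over the other; the choice is made by whoever decides which `introspect_*` function to call.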

Total utilitarianism

The human has the preference \(r\) that humans not be reduced to a large population of barely-happy individuals. They also have the meta-preference \(M\) that individual utility be additive.

The AI can trigger the human’s awareness of the repugnant conclusion. And it can do this in a differential or integral fashion, which will cause the human to either reject its current \(r\) (and embrace the repugnant conclusion) or reject \(M\) (and reject the repugnant conclusion).

Just based on this information, what are the human’s preferences?
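A small numerical sketch may help show where \(r\) and \(M\) collide. All numbers, the `BARELY_HAPPY_THRESHOLD` cut-off and the function names are hypothetical, chosen only for illustration; the post itself gives no figures.

```python
def total_utility(population: int, welfare_per_person: float) -> float:
    """Additive aggregation of individual utility, as M demands."""
    return population * welfare_per_person


# Hypothetical cut-off below which a life counts as "barely happy".
BARELY_HAPPY_THRESHOLD = 5.0


def r_rejects(world) -> bool:
    """Preference r: reject worlds whose inhabitants are barely happy."""
    _population, welfare = world
    return 0 < welfare < BARELY_HAPPY_THRESHOLD


# (population, welfare per person); both worlds are made up.
world_a = (10_000_000_000, 100.0)      # fewer people, high welfare
world_z = (10_000_000_000_000, 1.0)    # vast population, barely happy

m_prefers_z = total_utility(*world_z) > total_utility(*world_a)

print(m_prefers_z)          # additive utility ranks Z above A
print(r_rejects(world_z))   # r rejects Z
print(r_rejects(world_a))   # r accepts A
```

Under these assumptions \(M\) ranks the large population higher while \(r\) rejects it, so one of the two has to give; the arithmetic does not say which.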

The malarial drowning child

Peter Singer has an argument about a drowning child and our duty to them.

To model that contradiction in a human, let \(r\) contain the preference to save a drowning child in front of them, and a preference not to send money to distant people dying of malaria. Let \(M\) contain the desire that the human preferences not be different across different ways of dying or physical distance.

As before, the right presentation on the AI’s part, within the usual bounds of how humans reason, can cause the human to emphasise their preferences or their meta-preferences.

Balconies with a view

The human is modelled as two agents \(A_1\) (basically system 1) and \(A_2\) (system 2).

The human travels a lot, and likes to go out on the balcony to look at various views. They have an instinctive (\(A_1\)) fear of falling, but typically override this with reason (\(A_2\)). However, \(A_1\)’s fear varies in intensity: it wants to avoid wooden balconies with a (consciously imperceptible) faint smell of rot, and it also wants to avoid balconies around sunset.

Given that faint rot increases danger and sunsets don’t, what are we to make of this agent’s true preferences?
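The difficulty can be sketched in a few lines of Python. Everything here is a hypothetical encoding introduced for illustration: the feature names, the two sets, and in particular `endorsed_avoidance`, which is only one candidate way of resolving the conflict and not an answer the post proposes.

```python
from typing import AbstractSet

# What A_1 reacts to; this is all that behaviour reveals.
FEAR_TRIGGERS = frozenset({"rot_smell", "sunset"})

# A fact about the world, not visible in the human's behaviour.
ACTUALLY_DANGEROUS = frozenset({"rot_smell"})


def a1_avoids(balcony: AbstractSet[str]) -> bool:
    """Observed behaviour: A_1 avoids any balcony with a fear trigger."""
    return bool(balcony & FEAR_TRIGGERS)


def endorsed_avoidance(balcony: AbstractSet[str]) -> bool:
    """One candidate resolution (an assumption, not the post's answer):
    keep A_1's fear only where it tracks real danger."""
    return bool(balcony & FEAR_TRIGGERS & ACTUALLY_DANGEROUS)


rotten = {"wooden", "rot_smell"}
sunset = {"stone", "sunset"}

# Behaviour alone treats the two cases identically...
print(a1_avoids(rotten), a1_avoids(sunset))
# ...and they separate only once the danger facts and a resolution
# principle are supplied from outside.
print(endorsed_avoidance(rotten), endorsed_avoidance(sunset))
```

A learner that only sees `a1_avoids` has no basis for treating the two triggers differently; the split comes entirely from `ACTUALLY_DANGEROUS` and from the choice of resolution rule.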

The big question: what’s tolerable?

Now, the first three examples illustrate big differences in outcomes: the differences between being a total utilitarian and not are non-trivial, wanting life extension technology or not could make a huge difference in outcome, and so on.

However, all are within the scope of “tolerable outcomes”, very broadly defined. None result in optimisation of the universe for money or paperclips or immediate human extinction. We could extend the models to get those situations (eg by having some of these agents in a position to make long term or large impact decisions).

But the key question remains. If we add more details to the model of human rationality, along with some principles for resolving these types of conflicts (principles which the AI can’t simply “learn”), we will still likely end up with the AI’s computed reward function being something unpredictable within a large class of functions. Can we ensure it’s “tolerable”, or does anything less than perfect modelling of human irrationality result in a disastrously optimised outcome?

How approximately can we input human irrationalities into a learning AI?




