Optimisation in manipulating humans: engineered fanatics vs yes-men
discussion post by Stuart Armstrong 636 days ago

A putative new idea for AI control; index here.

One of the ways in which humans aren’t agents is that we can be manipulated into having any (or almost any) set of values - through drugs, brain surgery, extreme propaganda, and other methods.

If an AI is tasked with “satisfy human preferences”, and if the AI can affect the definition of “human preferences” by changing humans, then we would expect it to do exactly that: change the humans so that their preferences become easier to satisfy.

This could be combated by making the definition of human preferences counterfactual. But it’s not fully clear how to define counterfactual preferences, and it would be interesting to see whether we can constrain the AI from manipulating humans in other ways.

One idea is to look at what we informally call optimisation power. If the AI would prefer a human to be a $$u$$-maximising agent, then it presumably has to work hard at transforming the human’s values into that. And small changes in the definition of the AI’s preferences would mean that it would instead prefer transforming the human into a $$v$$-maximiser.

Thus, though optimisation power is hard to define, this would seem to fit the bill: an “honest” reward learning process is one where the final human values don’t depend sensitively on the AI’s initial values. And a manipulable one is one where they do.

Let’s dig into this a bit more. Why would an AI prefer $$u$$ or $$v$$? Well, the generic failure mode for “satisfy human preferences” is to see it as the sum over all utilities $$u$$ in some set $$U$$ of “maximise $$u$$ if the human agrees to maximise $$u$$”. Then what the AI wants is for the human to agree to maximise a $$v$$, where $$v$$ is the utility function the AI finds easiest to reach a high value on.
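One way to write this down, as a rough sketch only: the indicator $$I_u$$ and the value $$V(u)$$ below are notation introduced here for illustration, not taken from the post, and the sketch assumes the human ends up agreeing to exactly one utility in $$U$$.

$$R = \sum_{u \in U} I_u \cdot u, \qquad I_u = \begin{cases} 1 & \text{if the human agrees to maximise } u \\ 0 & \text{otherwise} \end{cases}$$

If $$V(u)$$ denotes the highest expected value of $$u$$ the AI can reach once the human has agreed to it, then under these assumptions the AI’s preferred target is

$$v = \arg\max_{u \in U} V(u).$$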

But then we can individually translate or scale the various utilities in $$U$$, to make different ones easier or harder to reach high values on. This would make the AI prefer the human to agree to a different maximisation.
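The effect of translating or scaling can be seen in a toy model. The following is a minimal sketch, not a definitive formalisation: the dictionary `achievable`, its numbers, and the function `preferred_target` are all hypothetical, and each utility is reduced to a single number, the best value the AI could reach on it.

```python
# Toy model: all names and numbers here are illustrative assumptions.
# achievable[name] = the highest value the AI could reach on that utility
# if the human agreed to maximise it.
achievable = {"u": 10.0, "v": 8.0}


def preferred_target(achievable, scale=None, shift=None):
    """Return the utility the AI would most like the human to agree to,
    after individually scaling and translating each utility."""
    scale = scale or {}
    shift = shift or {}
    rescaled = {
        name: scale.get(name, 1.0) * value + shift.get(name, 0.0)
        for name, value in achievable.items()
    }
    # The AI prefers whichever utility it can reach the highest value on.
    return max(rescaled, key=rescaled.get)


baseline = preferred_target(achievable)                          # 10 vs 8
after_scaling = preferred_target(achievable, scale={"v": 2.0})   # 10 vs 16
after_shift = preferred_target(achievable, shift={"v": 5.0})     # 10 vs 13
```

In this sketch the baseline target is `"u"`, while either doubling $$v$$ or adding 5 to it makes `"v"` the preferred target, even though neither change alters what a $$v$$-maximiser would actually do.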

So if $$R$$ is the reward function that encodes “satisfy human preferences”, there are many $$R$$ that are equivalent if the AI cannot influence the human’s values, but that are very different if the AI can. Detecting that kind of divergence between otherwise equivalent $$R$$ could be a sign that something is wrong in the system.
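As a hedged illustration of what such a check might look like, the sketch below compares the human’s final values across several rescaled variants of a toy reward. Everything in it is an assumption made for the example: the function names, the weights, and the simplification that an AI with influence simply turns the human into a maximiser of whichever weighted utility it can score highest on.

```python
def final_human_values(weights, achievable, can_influence, initial_values="u"):
    """Values the human ends up with under one variant of the toy reward."""
    if not can_influence:
        # The AI cannot change the human, so the human's values stay put.
        return initial_values
    # Otherwise the AI steers the human towards the utility that is
    # easiest for it to score highly on under this weighting.
    return max(achievable, key=lambda name: weights[name] * achievable[name])


def depends_on_reward_variant(weight_variants, achievable, can_influence):
    """True if the final human values differ across the reward variants."""
    outcomes = {
        final_human_values(w, achievable, can_influence) for w in weight_variants
    }
    return len(outcomes) > 1


achievable = {"u": 10.0, "v": 8.0}  # hypothetical best reachable values
weight_variants = [{"u": 1.0, "v": 1.0}, {"u": 1.0, "v": 2.0}]

honest = depends_on_reward_variant(weight_variants, achievable, can_influence=False)
manipulable = depends_on_reward_variant(weight_variants, achievable, can_influence=True)
```

Here `honest` is `False` (the final values are the same for every variant) and `manipulable` is `True` (the final values flip from $$u$$ to $$v$$ when the weighting changes), which is the kind of sensitivity the paragraph above describes.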

# Yes-men are a problem

This would work if the AI’s only option were to turn the human into an engineered fanatic with certain values. But it might also seek to turn the human into a yes-man, agreeing to anything the AI suggests. And this is something that an AI would do for a wide variety of different $$R$$’s, so the human’s final state would not depend sensitively on the choice of $$R$$. So our optimisation idea founders at this point.
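A toy sketch of why the sensitivity idea fails here. The costs, weights, and the function `final_human_state` are hypothetical, and the sketch assumes that creating a yes-man is cheaper for the AI than creating a fanatic, which the post does not state.

```python
def final_human_state(weights, achievable, fanatic_cost=2.0, yes_man_cost=1.0):
    """State the AI prefers to put the human in, under one reward variant."""
    # Option 1: engineer a fanatic for one specific utility.
    payoffs = {
        f"fanatic for {name}": weights[name] * achievable[name] - fanatic_cost
        for name in achievable
    }
    # Option 2: engineer a yes-man, who will agree to whichever utility
    # the AI finds best under this weighting.
    best_any = max(weights[name] * achievable[name] for name in achievable)
    payoffs["yes-man"] = best_any - yes_man_cost
    return max(payoffs, key=payoffs.get)


achievable = {"u": 10.0, "v": 8.0}  # hypothetical best reachable values
weight_variants = [
    {"u": 1.0, "v": 1.0},
    {"u": 1.0, "v": 2.0},
    {"u": 3.0, "v": 1.0},
]

states = {final_human_state(w, achievable) for w in weight_variants}
sensitivity_check_fires = len(states) > 1
```

Under these assumptions `states` contains only `"yes-man"`, so `sensitivity_check_fires` is `False`: the human has been manipulated in every variant, yet the final state does not vary with the reward, and a check based on sensitivity alone would not notice.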
