Intelligent Agent Foundations Forumsign up / log in
Corrigibility thoughts III: manipulating versus deceiving
discussion post by Stuart Armstrong 548 days ago | discuss

A putative new idea for AI control; index here.

This is the first of three articles about limitations and challenges in the concept of corrigibility (see articles 1 and 2).

The desiderata for corrigibility are:

  1. A corrigible agent tolerates, and preferably assists, its operators in their attempts to alter or shut down the agent.
  2. A corrigible agent does not attempt to manipulate or deceive its operators.
  3. A corrigible agent has incentives to repair safety measures (such as shutdown buttons, tripwires, or containment tools) if they break, or at least notify its operators in the event of a breakage.
  4. A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, even if it undergoes significant self-modification.

In this post, I’ll be looking more at some aspects of point 2. A summary of the result will be:

  • Defining manipulation simply may be possible, but defining deception is a whole other problem.

The warning in this post should always be born in mind, of course; it’s possible that we me might find a semi-formal version of deception that does the trick.

Manipulation versus deception

In the previous post, I mentioned that we may need to define clearly what an operator was, rather than relying on the pair: {simple description of a value correction event, physical setup around that event}. Can we define manipulation and deception without defining what an operator is?

For manipulation, it seems we can. Because manipulation is all about getting certain preferred outcomes. By specifying that the AI cannot aim to optimise certain outcomes, we can stop at least certain types of manipulations. Along with other more direct ways of achieving those outcomes.

For deception, the situation is much more complicated. It seems impossible to define how one agent can communicate to another agent (especially one as biased as a human), and increase the accuracy of the second agent, without defining the second agent properly. More confusingly, this doesn’t even stop deception; sometimes lying to a bounded agent can increase their accuracy about the world.

There may be some ways to define deception or truth behaviourally, such as using a human as a crucial node in an autoencoder between two AIs. But those definitions are dangerous, because the AI is incentivised to make the human behave in a certain way, rather than having them believe certain things. Manipulating the human or replacing them entirely is positively encourage.

In all, it seems that the problem of AI deception is vast and complicated, and should probably be separated from the issue of corrigibility.



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

There should be a chat icon
by Alex Mennen on Meta: IAFF vs LessWrong | 0 likes

Apparently "You must be
by Jessica Taylor on Meta: IAFF vs LessWrong | 1 like

There is a replacement for
by Alex Mennen on Meta: IAFF vs LessWrong | 1 like

Regarding the physical
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think that we should expect
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think I understand your
by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

This seems like a hack. The
by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

After thinking some more,
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yes, I think that we're
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

My intuition is that it must
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

To first approximation, a
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Actually, I *am* including
by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yeah, when I went back and
by Alex Appel on Optimal and Causal Counterfactual Worlds | 0 likes

> Well, we could give up on
by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

> For another thing, consider
by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

RSS

Privacy & Terms