Intelligent Agent Foundations Forum
Corrigibility thoughts III: manipulating versus deceiving
discussion post by Stuart Armstrong

A putative new idea for AI control; index here.

This is the third of three articles about limitations and challenges in the concept of corrigibility (see articles 1 and 2).

The desiderata for corrigibility are:

  1. A corrigible agent tolerates, and preferably assists, its operators in their attempts to alter or shut down the agent.
  2. A corrigible agent does not attempt to manipulate or deceive its operators.
  3. A corrigible agent has incentives to repair safety measures (such as shutdown buttons, tripwires, or containment tools) if they break, or at least notify its operators in the event of a breakage.
  4. A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, even if it undergoes significant self-modification.

In this post, I’ll be looking more at some aspects of point 2. A summary of the result will be:

  • Defining manipulation simply may be possible, but defining deception is a whole other problem.

The warning in this post should always be borne in mind, of course; it’s possible that we might find a semi-formal version of deception that does the trick.

Manipulation versus deception

In the previous post, I mentioned that we may need to define clearly what an operator was, rather than relying on the pair: {simple description of a value correction event, physical setup around that event}. Can we define manipulation and deception without defining what an operator is?

For manipulation, it seems we can, because manipulation is all about achieving certain preferred outcomes. By specifying that the AI cannot aim to optimise certain outcomes, we can block at least certain types of manipulation, along with other, more direct ways of achieving those outcomes.
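
As a toy illustration of that restriction (a minimal sketch of my own, not a construction from the post; the model interface and all names here are hypothetical):

    # Toy sketch: an agent barred from optimising a designated outcome.
    # Hypothetical interface: model.predict(action, outcome) returns the
    # probability of `outcome` if `action` is taken; None means inaction.

    FORBIDDEN = "operator_presses_shutdown"  # outcome the agent must not target

    def choose_action(model, actions, utility, tolerance=0.01):
        """Pick the highest-utility action among those that leave the
        forbidden outcome's probability close to its baseline level."""
        baseline = model.predict(None, FORBIDDEN)  # probability under inaction
        permitted = [
            a for a in actions
            if abs(model.predict(a, FORBIDDEN) - baseline) <= tolerance
        ]
        # The agent never compares actions by their effect on the forbidden
        # outcome; it optimises utility only within the permitted set, so it
        # cannot gain by steering that outcome in either direction.
        return max(permitted, key=utility) if permitted else None

Note what this does and does not need: it refers only to an outcome, not to any definition of the operator.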

For deception, the situation is much more complicated. It seems impossible to define how one agent can communicate with another agent (especially one as biased as a human) and increase the accuracy of the second agent’s beliefs, without defining the second agent properly. More confusingly, even a proper definition doesn’t straightforwardly rule out deception; sometimes lying to a bounded agent can increase their accuracy about the world.
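
To see how that last point can happen, here is a small worked example (my own construction, purely illustrative): a listener who systematically over-updates on reported probabilities ends up closer to the truth when given a deliberately understated, false report.

    # Toy model of a biased listener: given a reported probability p,
    # they over-update, ending up more extreme than p.

    def biased_belief(reported_p, prior=0.5, overweight=2.0):
        """Listener treats the report as `overweight` independent copies
        of the same evidence (a simple over-updating bias)."""
        odds = (prior / (1 - prior)) * (reported_p / (1 - reported_p)) ** overweight
        return odds / (1 + odds)

    truth = 0.8
    honest = biased_belief(0.8)    # ~0.94: honest report overshoots the truth
    lie    = biased_belief(0.667)  # ~0.80: the understated lie lands on it

    print(abs(honest - truth), abs(lie - truth))
    # The false report leaves the listener *more* accurate than honesty does.

So a criterion like “communication must increase the listener’s accuracy” fails to separate honesty from lies.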

There may be some ways to define deception or truth behaviourally, such as using a human as a crucial node in an autoencoder between two AIs. But those definitions are dangerous, because the AI is incentivised to make the human behave in a certain way, rather than to have them believe certain things. Manipulating the human, or replacing them entirely, is positively encouraged.
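
To make the autoencoder idea concrete, here is a sketch of the setup and where its flaw shows up (my reading of the proposal; `encoder`, `human`, and `decoder` are hypothetical stand-ins, not real APIs):

    # Human-as-autoencoder-node sketch. `human(msg)` is whatever the
    # human observably says or does after receiving the message.

    def reconstruction_loss(encoder, human, decoder, fact):
        message = encoder(fact)            # AI-1 explains `fact` to the human
        relayed = human(message)           # the human's observable response
        reconstructed = decoder(relayed)   # AI-2 tries to recover the fact
        return 0.0 if reconstructed == fact else 1.0

    # The flaw: the loss only touches the human's observable output. Any
    # policy that turns that output into a reliable channel scores
    # perfectly, including drilling the human into rote repetition or
    # routing around their understanding entirely. Nothing in the
    # objective rewards the human holding true beliefs.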

In all, it seems that the problem of AI deception is vast and complicated, and should probably be separated from the issue of corrigibility.
