Intelligent Agent Foundations Forumsign up / log in
AI safety: three human problems and one AI issue
post by Stuart Armstrong 123 days ago | Ryan Carey and Daniel Dewey like this | 2 comments

A putative new idea for AI control; index here.

There have been various attempts to classify the problems in AI safety research. Our old Oracle paper that classified then-theoretical methods of control, to more recent classifications that grow out of modern more concrete problems.

These all serve their purpose, but I think a more enlightening classification of the AI safety problems is to look at what the issues we are trying to solve or avoid. And most of these issues are problems about humans.


Specifically, I feel AI safety issues can be classified as three human problems and one central AI issue. The human problems are:

  1. Humans don’t know their own values (sub-issue: humans know their values better in retrospect than in prediction).
  2. Humans are not (idealised) agents and don’t have stable values (sub-issue: humanity itself is even less of an agent).
  3. Humans have poor predictions of an AI’s behaviour.

And the central AI issue is:

  1. AIs could become extremely powerful.

Obviously if humans were agents and knew their own values and could predict whether a given AI would follow those values or not, there would be not problem. Conversely, if AIs were weak, then the human failings wouldn’t matter so much.

The points about human values is relatively straightforward, but what’s the problem with humans not being agents? Essentially, humans can be threatened, tricked, seduced, exhausted, drugged, modified, and so on, in order to act seemingly against our interests and values.

If humans were clearly defined agents, then what counts as a trick or a modification would be easy to define and exclude. But since this is not the case, we’re reduced to trying to figure out the extent to which something like a heroin injection is a valid way to influence human preferences. This makes both humans susceptible to manipulation, and human values hard to define.

Finally, the issue of humans having poor predictions of AI is more general than it seems. If you want to ensure that an AI has the same behaviour in the testing and training environment, then you’re essentially trying to guarantee that you can predict that the testing environment behaviour will be the same as the (presumably safe) training environment behaviour.

How to classify methods and problems

That’s well and good, but how to various traditional AI methods or problems fit into this framework? This should give us an idea as to whether the framework is useful.

It seems to me that:

  • Friendly AI is trying to solve the values problem directly.
  • IRL and Cooperative IRL are also trying to solve the values problem. The greatest weakness of these methods is the not agents problem.
  • Corrigibility/interruptibility are also addressing the issue of humans not knowing their own values, using the sub-issue that human values are clearer in retrospect. These methods also overlap with poor predictions.
  • AI transparency is aimed at getting round the poor predictions problem.
  • Laurent’s work on carefully defining the properties of agents is mainly also about solving the poor predictions problem.
  • Low impact and Oracles are aimed squarely at preventing AIs from becoming powerful. Methods that restrict the Oracle’s output implicitly accept that humans are not agents.
  • Robustness of the AI to changes between testing and training environment, degradation and corruption, etc… ensures that humans won’t be making poor predictions about the AI.
  • Robustness to adversaries is dealing with the sub-issue that humanity is not an agent.
  • The modular approach of Eric Drexler is aimed at preventing AIs from becoming too powerful, while reducing our poor predictions.
  • Logical uncertainty, if solved, would reduce the scope for certain types of poor predictions about AIs.
  • Wireheading, when the AI takes control of reward channel, is a problem that humans don’t know their values (and hence use an indirect reward) and that the humans make poor predictions about the AI’s actions.
  • Wireheading, when the AI takes control of the human, is as above but also a problem that humans are not agents.
  • Incomplete specifications are either a problem of not knowing our own values (and hence missing something important in the reward/utility) or making poor predictions (when we though that a situation was covered by our specification, but it turned out not to be).
  • AIs modelling human knowledge seem to be mostly about getting round the fact that humans are not agents.

Putting this all in a table:

Further refinements of the framework

It seems to me that the third category – poor predictions – is the most likely to be expandable. For the moment, it just incorporates all our lack of understanding about how AIs would behave, but this might more useful to subdivide.



by Daniel Dewey 123 days ago | Ryan Carey likes this | link

Thanks for writing this – I think it’s a helpful kind of reflection for people to do!

reply

by Vadim Kosoy 76 days ago | link

It seems to me that “friendly AI” is a name for the entire field rather than a particular approach, otherwise I don’t understand what you mean by “friendly AI”? More generally, it would be nice to provide a link for each of the approaches.

reply



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

Note that the problem with
by Vadim Kosoy on Open Problems Regarding Counterfactuals: An Introd... | 0 likes

Typos on page 5: *
by Vadim Kosoy on Open Problems Regarding Counterfactuals: An Introd... | 0 likes

Ah, you're right. So gain
by Abram Demski on Smoking Lesion Steelman | 0 likes

> Do you have ideas for how
by Jessica Taylor on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

I think I understand what
by Wei Dai on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

>You don’t have to solve
by Wei Dai on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

Your confusion is because you
by Vadim Kosoy on Delegative Inverse Reinforcement Learning | 0 likes

My confusion is the
by Tom Everitt on Delegative Inverse Reinforcement Learning | 0 likes

> First of all, it seems to
by Abram Demski on Smoking Lesion Steelman | 0 likes

> figure out what my values
by Vladimir Slepnev on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

I agree that selection bias
by Jessica Taylor on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

>It seems quite plausible
by Wei Dai on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

> defending against this type
by Paul Christiano on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

2. I think that we can avoid
by Paul Christiano on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

I hope you stay engaged with
by Wei Dai on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

RSS

Privacy & Terms