Intelligent Agent Foundations Forum
New(ish) AI control ideas
post by Stuart Armstrong 638 days ago | 1 comment

The list of posts is getting unwieldy, so I’ll post the up-to-date stuff at the beginning:

Human inconsistencies:

Reward function learning:

Understanding humans:

Framework:

Acausal trade:

Oracle designs:

Extracting human values:

Corrigibility:

Indifference:

AIs in virtual worlds:

True answers from AI:

Miscellanea:


Migrating my old post over from Less Wrong.

I recently went on a two-day intense solitary “AI control retreat”, with the aim of generating new ideas for making safe AI. The “retreat” format wasn’t really a success (“focused uninterrupted thought” was the main gain, not “two days of solitude”; it would have been more effective in three-hour sessions), but I did manage to generate a lot of new ideas. These ideas will now go before the baying bloodthirsty audience (that’s you, folks) to test them for viability.

A central thread running through them could be: if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach.

To provide inspiration and direction to my thought process, I first listed all the easy responses that we generally give to most proposals for AI control. If someone comes up with a new/old brilliant idea for AI control, it can normally be dismissed by appealing to one of these responses:

  1. The AI is much smarter than us.
  2. It’s not well defined.
  3. The setup can be hacked.
     • By the agent.
     • By outsiders, including other AI.
     • Adding restrictions encourages the AI to hack them, not obey them.
  4. The agent will resist changes.
  5. Humans can be manipulated, hacked, or seduced.
  6. The design is not stable.
     • Under self-modification.
     • Under subagent creation.
     • Unrestricted search is dangerous.
  7. The agent has, or will develop, dangerous goals.

Important background ideas:

I decided to try and attack as many of these ideas as I could, head on, and see if there was any way of turning these objections. A key concept is that we should never just expect a system to behave “nicely” by default (see e.g. here). If we want that, we should define what “nicely” is, and put that in by hand.

I came up with sixteen main ideas, of varying usefulness and quality, which I will be posting in the coming weekdays in comments (the following links will go live after each post). The ones I feel most important (or most developed) are:

While the less important or developed ideas are:

Please let me know your impressions on any of these! The ideas are roughly related to each other as follows (where the arrow Y→X can mean “X depends on Y”, “Y is useful for X”, “X complements Y on this problem” or even “Y inspires X”):

EDIT: I’ve decided to use this post as a sort of central repository of my new ideas on AI control. So adding the following links:

Short tricks:

High impact from low impact:

High impact from low impact, best advice:

Overall meta-thoughts:

Pareto-improvements to corrigible agents:

AIs in virtual worlds:

Low importance AIs:

Wireheading:

AI honesty and testing:

Goal completion:



by David Krueger 629 days ago

Thanks! I love having central repos.

A quick question / comment, RE: “I decided to try and attack as many of these ideas as I could, head on, and see if there was any way of turning these objections.”

Q: What do you mean (or have in mind) in terms of “turning […] objections”? I’m not very familiar with the phrase.

Comment: One trend I see is that technical safety proposals are often dismissed by appealing to one of the 7 responses you’ve given. Recently I’ve been thinking that we should be a bit less focused on finding airtight solutions, and more focused on thinking about which proposed techniques could be applied in various scenarios to significantly reduce risk. For example, boxing an agent (e.g. by limiting its sensors/actuators) might significantly increase how long it takes to escape.
