Intelligent Agent Foundations Forum
Three Oracle designs
post by Stuart Armstrong 492 days ago

A putative new idea for AI control; index here.

An initial draft looking at three ways of getting information out of Oracles: information that is useful and safe, at least in theory.

One thing I may need to do is find slightly better names for them ^_^

Good and safe uses of AI Oracles

Abstract:


An Oracle is a design for potentially high-powered artificial intelligences (AIs), in which the AI is made safe by restricting it to only answering questions. Unfortunately, most designs motivate the Oracle to manipulate humans through the contents of its answers. A second challenge is to get the AI to provide accurate and useful answers. This paper presents three Oracle designs that get around the manipulation and accuracy problems in different ways: the Counterfactually Unread Agent, the Verified Selective Agent, and the Virtual-world Time-bounded Agent. It demonstrates how each design is safe (given that humans stick to the protocols), and shows that each allows different types of questions and answers. Finally, it investigates what happens when the implementation is slightly imperfect, concluding that the first two agent designs are robust to this, but the third is not.

Images of the three designs:

Counterfactually Unread Agent:

Verified Selective Agent:

Virtual-world Time-bounded Agent:


