A putative new idea for AI control; index here.
An initial draft looking at three ways of getting information out of Oracles that is, in theory, both useful and safe.
One thing I may need to do is find slightly better names for them ^_^
Good and safe uses of AI Oracles
An Oracle is a design for potentially high-powered artificial intelligences (AIs), in which the AI is made safe by restricting it to only answering questions. Unfortunately, most designs motivate the Oracle to manipulate humans with the content of its answers. A second challenge is getting the AI to provide accurate and useful answers. This paper presents three Oracle designs that get around the manipulation and accuracy problems in different ways: the Counterfactually Unread Agent, the Verified Selective Agent, and the Virtual-world Time-bounded Agent. It demonstrates how each design is safe (provided humans stick to the protocols) while allowing different types of questions and answers. Finally, it investigates what happens when the implementation is slightly imperfect, concluding that the first two designs are robust to this, but the third is not.
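To make the first design's name concrete, here is a minimal sketch of one plausible reading of a "counterfactually unread" protocol: with small probability an answer is sealed and never shown to humans, and only in that event is the Oracle scored (against later ground truth); whenever humans do read the answer, the Oracle's reward is fixed, so manipulating the readers gains it nothing. The function name, parameters, and scoring hook below are illustrative assumptions, not taken from the paper.

```python
import random

def counterfactually_unread_query(oracle_answer: str,
                                  score_against_truth,
                                  erasure_prob: float = 0.01):
    """One episode of an assumed counterfactually-unread protocol.

    With probability `erasure_prob`, the answer stays sealed (unread
    by humans) and the oracle is rewarded by an automated check
    against later ground truth.  Otherwise humans read the answer and
    the oracle's reward is a constant (zero here), independent of how
    the readers react, removing the incentive to manipulate them.
    """
    if random.random() < erasure_prob:
        # Erasure event: answer remains sealed; automated scoring only.
        reward = score_against_truth(oracle_answer)
        shown_to_humans = False
    else:
        # Normal event: humans see the answer; reward does not depend
        # on their reaction to it.
        reward = 0.0
        shown_to_humans = True
    return reward, shown_to_humans
```

The key design point in this sketch is that the reward-bearing branch and the human-visible branch never coincide, which is what (in theory) decouples the Oracle's incentives from the effect of its answers on readers.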
Images of the three designs:
Counterfactually Unread Agent:
Verified Selective Agent:
Virtual-world Time-bounded Agent: