Humans as a truth channel
post by Stuart Armstrong

A putative new idea for AI control; index here.

Defining truth and accuracy is tricky, so when I’ve proposed designs for things like Oracles, I’ve either used a very specific and formal question, or an indirect criterion for truth.

Here I’ll try to build a more direct system, so that an AI will tell the human the truth about a question in a way the human actually understands.


The basic idea is simple. The first AI wishes to communicate certain facts to a second AI, but has to use the human as an intermediary. The first AI talks to the human, and then the human talks with the second AI. If the facts are to be transmitted accurately, the human has to understand them.

Of course, this will immediately fail for a variety of reasons. Even if neither AI hacks the human directly, the human doesn’t have to understand anything to serve as a conduit. So the rest of this post is dedicated to making that basic idea much more secure. If you just want the final idea without the justifications, you can skip directly to the end.

The knowledge to pass on

The human would have a good understanding of a subject if they were able to answer a large number of questions about it. But a long conversation with the second AI is tiring, among other things: a human simply won’t be able to answer ten million questions about what they understand, at least not well.

So the idea could be to have the second AI ask a single question, randomly chosen from a large collection of questions, and have the human give a single yes or no answer. Both AIs are then graded on the accuracy of the answer.
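A minimal sketch of what one such scored round might look like, with placeholder interfaces of my own (the post doesn’t specify any of this machinery, so all names here are illustrative):

```python
import random

def evaluation_round(question_pool, human_answer, ground_truth):
    """One scored round, illustratively.

    question_pool: a large collection of candidate boolean questions.
    human_answer:  callable returning the human's yes/no (True/False) reply.
    ground_truth:  callable giving the answer the grading is measured against.
    """
    question = random.choice(question_pool)   # sampled at random, unknown in advance
    answer = human_answer(question)           # the human gives a single yes/no
    correct = ground_truth(question)
    return 1.0 if answer == correct else 0.0  # both AIs are graded on this score
```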

The answer…

What does the second AI want to do? It basically wants to dissect the human (figuratively or literally), suck out all their knowledge, and infer what the first AI was trying to communicate. Note that this doesn’t require the human to understand anything, just that the second AI can infer what the first one wanted by taking the human apart.

So instead of allowing the second AI unrestricted access to the human, it instead formulates its (boolean) question without interference; the human then chooses yes or no, and that choice is counted as the answer.

…and the question

We could have a list of actual questions, but that makes the second AI redundant, and means that the first AI will optimise for certain answers from a set of questions, not for genuine understanding.

Instead, the second AI gets a technical boolean question about some distribution. Since the human will be the one deciding yes or no, and the AI will not be able to interact with them further, it’s in the second AI’s interest to formulate its question so that the human understands the technical boolean question they’re being asked, and can then answer it correctly using their private information.
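For concreteness, here is one illustrative shape such a technical boolean question could take; the sampling access, event predicate and threshold are my own assumptions, not part of the proposal:

```python
def boolean_question_answer(samples, event, threshold):
    """Illustrative only: one form of a "technical boolean question about a
    distribution" is "under distribution P, is the probability of event E
    greater than threshold t?"  Here `samples` stands in for whatever formal
    access the grading system has to P, and `event` is a predicate over outcomes.
    Returns the yes/no answer the human is being asked to supply."""
    frequency = sum(1 for s in samples if event(s)) / len(samples)
    return frequency > threshold
```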

The information from the first AI

What distribution will the second AI be asking a boolean question about? This could be anything, but I’m thinking of some conditional probability distribution along the lines of “what will the world look like if X happens?”. The distribution needs to be formally defined, but the first AI’s communication with the human is informal, and up to that AI. As long as it can’t hack the human in the third sense, its only route to good answers is to make the human understand the distribution as well as it can.

Then the second AI is given a randomly chosen boolean technical question about this distribution. Since the first AI doesn’t know in advance which question will be asked, it’s in its interest to ensure the human has a good general understanding of the distribution.

Making it all counterfactual

It would be very inefficient to have the human shuffle back and forth between AIs every time the first AI gives an answer. Instead, we could use the actual setup one time in a thousand or so, and let the first AI be motivated to get the correct answer in those one-in-a-thousand cases.
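A minimal sketch of that counterfactual trigger, with the one-in-a-thousand figure taken from above and everything else assumed:

```python
import random

TEST_PROBABILITY = 1 / 1000   # "one time in a thousand or so" from the post

def maybe_run_full_protocol(run_full_protocol, deploy_answer):
    """Illustrative: most of the time the first AI's output is simply used;
    occasionally the full human-as-channel evaluation actually runs, and the
    first AI's reward depends only on those rare counterfactual cases."""
    if random.random() < TEST_PROBABILITY:
        return run_full_protocol()   # human relays to the second AI and is graded
    return deploy_answer()           # normal use, no grading this time
```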

In summary

The whole setup can be summarised by this graphic:
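In pseudocode, one full round might look roughly like this (a sketch under my own assumed interfaces; none of these names come from the post):

```python
def truth_channel_round(distribution, first_ai, human, second_ai,
                        sample_technical_question, grade):
    """One full round of the setup, illustratively (all interfaces assumed).

    distribution: the formally defined distribution the facts concern,
    e.g. "what will the world look like if X happens?".
    """
    # 1. The first AI informally explains the distribution to the human,
    #    in whatever way it chooses (it cannot hack the human).
    first_ai.explain_to(human, distribution)

    # 2. The second AI receives a randomly chosen technical boolean question
    #    about the distribution and rephrases it for the human, with no
    #    further interaction allowed.
    technical_question = sample_technical_question(distribution)
    human_facing_question = second_ai.rephrase(technical_question)

    # 3. The human gives a single yes/no answer; both AIs are graded on its accuracy.
    answer = human.answer_yes_no(human_facing_question)
    return grade(answer, technical_question, distribution)
```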

There are some subtle questions about what distribution the first AI should be using, versus the questions the second one can ask, so as to guarantee genuine human understanding of issues the human cares about. There are also some issues about how the setup fits with siren worlds, and other vulnerabilities. But I’ll defer that analysis to another time.


