Intelligent Agent Foundations Forum
Humans as a truth channel
post by Stuart Armstrong 116 days ago | discuss

Defining truth and accuracy is tricky, so when I’ve proposed designs for things like Oracles, I’ve either used a very specific and formal question, or an indirect criterion for truth.

Here I’ll try to construct a more direct system, so that an AI will tell the human the truth about a question, in a way that the human understands.

The basic idea is simple. The first AI wishes to communicate certain facts to a second AI, but has to use the human as an intermediary. The first AI talks to the human, and then the human talks with the second AI. If the facts are to be transmitted accurately, the human has to understand them.

Of course, this will immediately fail for a variety of reasons. Even if neither AI hacks the human directly, the human doesn’t have to understand anything to serve as a conduit. So the rest of this post is dedicated to making that basic idea much more secure. If you just want the final idea and want to avoid the justifications, you can skip directly to the end.

The knowledge to pass on

The human would have a good understanding of a subject if they were able to answer a large number of questions about it. But a long conversation with the second AI is impractical: answering is tiring, among other problems, and a human simply won’t be able to answer ten million questions about what they understand, at least not well.

So the idea could be to have the second AI ask a single question, randomly chosen from a large collection of questions, and have the human give a single yes or no answer. Both AIs are then graded on the accuracy of the answer.
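As a toy sketch of this grading step (the function names and the 0/1 scoring rule are my own illustrative assumptions, not part of the proposal):

```python
import random

def grade_round(question_bank, human_answer, ground_truth):
    """One round of the proposed protocol: the second AI draws a single
    boolean question at random, the human answers yes or no, and both
    AIs receive the same accuracy-based score."""
    question = random.choice(question_bank)
    answer = human_answer(question)                  # True = yes, False = no
    score = 1.0 if answer == ground_truth(question) else 0.0
    return question, answer, score                   # the score goes to both AIs
```

Because the question is only sampled at the end, neither AI can optimise for one specific question in advance; they can only score well in expectation by making the human’s understanding broadly accurate.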

The answer…

What does the second AI want to do? It basically wants to dissect the human (figuratively or literally), suck out all their knowledge, and infer what the first AI was trying to communicate. Note that this doesn’t require the human to understand anything, just that the second AI can infer what the first one wanted by taking the human apart.

So instead of allowing the second AI unrestricted access to the human, it formulates its (boolean) question without interference; the human then chooses yes or no, and that choice is counted as the answer.

…and the question

We could have a list of actual questions, but that makes the second AI redundant, and means that the first AI will optimise for certain answers from a set of questions, not for genuine understanding.

Instead, the second AI gets a technical boolean question about some distribution. Since the human will be deciding yes or no, and the AI will not be able to interact with them further, it’s in the second AI’s interest to phrase its question so that the human understands the technical boolean question being asked, and can then answer it correctly using their private information.
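To make “technical boolean question about a distribution” concrete, here is one illustrative form such a question could take (the threshold question and the sample-based representation of the distribution are my own assumptions for the sketch):

```python
import random

def prob_event_exceeds(samples, event, threshold=0.5):
    """A toy technical boolean question: given samples from the
    distribution, is the probability of `event` greater than `threshold`?
    The human must answer yes or no, relying on whatever understanding
    of the distribution the first AI managed to convey."""
    freq = sum(1 for s in samples if event(s)) / len(samples)
    return freq > threshold

# Toy stand-in for "the world if X happens": a Gaussian with mean 1
random.seed(0)
samples = [random.gauss(1.0, 1.0) for _ in range(10_000)]
print(prob_event_exceeds(samples, lambda x: x > 0))  # most of the mass lies above 0
```

The point is that the question is stated formally over the distribution, while the human only ever holds an informal picture of it, so answering well genuinely requires understanding.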

The information from the first AI

What distribution will the second AI be asking a boolean question about? This can be anything, but I’m thinking of some conditional probability distribution along the lines of “what will the world look like if X happens?”. This needs to be formally defined, but the first AI’s communication with the human is informal, and up to the AI. As long as the AI can’t hack the human in the third sense, it needs to make them understand the distribution as best it can.

Then the second AI is given a randomly chosen boolean technical question about this distribution. Since it doesn’t know what question will be asked, it’s in the first AI’s interest to ensure the human has a good general understanding of the distribution.

Making it all counterfactual

It would be very inefficient to have the human shuffle back and forth between AIs every time the first AI gives an answer. Instead, we could use the actual setup one time in a thousand or so, and let the first AI be motivated to get the correct answer in those one-in-a-thousand cases.
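A minimal sketch of that counterfactual trigger (the one-in-a-thousand figure is the post’s example; everything else here is illustrative):

```python
import random

def episode_is_real(p_real=0.001):
    """With small probability, actually run the full human-in-the-loop
    grading; otherwise skip it. Since the first AI doesn't know in
    advance which episodes are real, and is only rewarded in the real
    ones, it should behave as if every episode counted."""
    return random.random() < p_real
```

The small probability keeps the expected incentive intact while sparing the human from shuttling between the AIs on every episode.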

In summary

The whole setup can be summarised by this graphic:

There are some subtle questions about what distribution the first AI should be using, versus what questions the second one can ask, so as to guarantee genuine human understanding of the issues the human cares about. There are also some issues about how the setup fits with siren worlds, and other vulnerabilities. But I’ll defer that analysis to another time.
