Intelligent Agent Foundations Forum
Humans as a truth channel
post by Stuart Armstrong

A putative new idea for AI control; index here.

Defining truth and accuracy is tricky, so when I’ve proposed designs for things like Oracles, I’ve either used a very specific and formal question, or an indirect criterion for truth.

Here I’ll try to build a more direct system: one in which an AI tells the human the truth about a question, in such a way that the human understands it.


The basic idea is simple. The first AI wishes to communicate certain facts to a second AI, but has to use the human as an intermediary. The first AI talks to the human, and then the human talks with the second AI. If the facts are to be accurate, the human has to understand them.

Of course, this will immediately fail for a variety of reasons. Even if neither AI hacks the human directly, the human doesn’t have to understand anything to serve as a conduit. So the rest of this post is dedicated to making that basic idea much more secure. If you just want the final idea without the justifications, you can skip directly to the end.

The knowledge to pass on

The human has a good understanding of a subject if they can answer a large quantity of questions about it. But a long conversation with the second AI is tiring, among other things: a human simply won’t be able to answer ten million questions about what they understand, at least not well.

So the idea could be to have the second AI ask a single question, randomly chosen from a large collection of questions, and have the human give a single yes or no answer. Both AIs are then graded on the accuracy of the answer.
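A minimal sketch of this grading step, with all names hypothetical stand-ins for the real AIs and the human:

```python
import random

def run_grading_round(question_bank, human_answer, true_answer):
    """Grade one round: sample a single boolean question uniformly
    at random, take the human's yes/no answer, and score both AIs
    on whether it matches the true answer."""
    question = random.choice(question_bank)  # neither AI picks which question
    answer = human_answer(question)          # True for "yes", False for "no"
    return 1.0 if answer == true_answer(question) else 0.0
```

Both AIs receive the same score, so the first AI is rewarded for making the human genuinely able to answer, and the second for asking comprehensibly.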

The answer…

What does the second AI want to do? It basically wants to dissect the human (figuratively or literally), suck out all their knowledge, and infer what the first AI was trying to communicate. Note that this doesn’t require the human to understand anything, just that the second AI can infer what the first one wanted by taking the human apart.

So instead of allowing the second AI unrestricted access to the human, it formulates its (boolean) question without interference; the human then chooses yes or no, and that choice is counted as the answer.

…and the question

We could have a list of actual questions, but that makes the second AI redundant, and means that the first AI will optimise for certain answers from a set of questions, not for genuine understanding.

Instead, the second AI gets a technical boolean question about some distribution. Since the human will be deciding yes or no, and the AI will not be able to interact with them further, it’s in the second AI’s interest to formulate its question so that the human understands the technical boolean question they’re being asked, and can then use their private information to answer it correctly.

The information from the first AI

What distribution will the second AI be asking a boolean question about? This can be anything, but I’m thinking of some conditional probability distribution along the lines of “what will the world look like if X happens?”. This needs to be formally defined, but the AI’s communication with the human is informal, and up to the AI. Assuming the AI can’t hack the human in the third sense, its best option is to make them understand the distribution as well as it can.

Then the second AI is given a randomly chosen boolean technical question about this distribution. Since it doesn’t know what question will be asked, it’s in the first AI’s interest to ensure the human has a good general understanding of the distribution.
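Under this random-question scheme, the first AI’s expected score is just the human’s average accuracy over the whole question bank, which is why teaching narrow answers to a few anticipated questions doesn’t pay. A hypothetical sketch:

```python
def expected_score(question_bank, human_answer, true_answer):
    """Expected grade for the first AI when the graded question is
    drawn uniformly from the bank: the human's average accuracy.
    Only broad understanding of the distribution raises this number."""
    correct = sum(human_answer(q) == true_answer(q) for q in question_bank)
    return correct / len(question_bank)
```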

Making it all counterfactual

It would be very inefficient to have the human shuffle back and forth between AIs every time the first AI gives an answer. Instead, we could use the actual setup one time in a thousand or so, and let the first AI be motivated to get the correct answer in those one-in-a-thousand cases.
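The counterfactual trigger might be sketched like this (the probability `p_real` and the function names are illustrative assumptions, not part of the original proposal):

```python
import random

def counterfactual_round(run_full_protocol, p_real=0.001, rng=random):
    """Run the expensive human-intermediary protocol only with small
    probability p_real; in the common case, skip the grading entirely.
    The first AI is motivated by the rare graded episodes."""
    if rng.random() < p_real:
        return run_full_protocol()  # rare: human shuttles between AIs, answer graded
    return None                     # common: no grading this time
```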

In summary

The whole setup can be summarised as follows:

1. The first AI informally explains a formally defined distribution to the human.
2. The second AI is given a randomly chosen technical boolean question about that distribution, and formulates its own question to the human without interference.
3. The human answers yes or no, and both AIs are graded on the accuracy of that answer.
4. The full setup is only actually run, and graded, roughly one time in a thousand.

There are some subtle questions about what distribution the first AI should be using, versus the questions the second one can ask, so as to guarantee genuine human understanding of issues the human cares about. There are also some issues about how the setup fits into siren worlds, and other vulnerabilities. But I’ll defer that analysis to another time.


