A putative new idea for AI control; index here.
I’ll try to clarify what I was doing with the AI truth setup in a previous post. First I’ll explain the nature of the challenge, and then how the setup tries to solve it.
The nature of the challenge is to have an AI give genuine understanding to a human. Getting the truth out of an AI or Oracle is not that hard, conceptually: you get the AI to report some formal property of its model. The problem is that that truth can be completely misleading, or, more likely, incomprehensible.
For misleading you have old staples like “If you implement this strategy, the incidence of cancer will go down” (neglecting to mention that this will be because everyone will be dead). For incomprehensible, there are statements like “ratios of FtsZ to Pikachurin plus C3-convertase to Von Willebrand Factor will rise in proportion to the average electronegativity in the Earth’s orbit”. That might be true, but is incomprehensible to a layman and not much better to an expert.
In fact, the AI’s answer might be a confusing mix of misleading and incomprehensible. If we’ve managed to find a good formal definition of “humans have cancer” but not of “humans are still alive”, then we’ll confront that mix: we’ll know this will solve cancer, but we can’t figure out whether humanity will survive or not.
Then there’s the omnipresent risk of human biases: confirmation bias, the backfire effect, optimism bias, pessimism bias… Should the AI take these into account, and present us with inaccurate information that nonetheless leaves us with a more accurate impression at the end? And how are we going to measure that “accurate impression” anyway?
Measuring true understanding with an exam
Ignore the second AI in the setup for the moment, and focus simply on the first. One mark of true understanding in a human is that they are capable of accurately answering questions about what they’ve just learnt. This suggests an exam as a true measure of understanding. And thus the goal for the first AI: it must prep its human to successfully pass the exam.
Now, many have argued that exams are not the be-all and end-all of understanding. But passing exams does seem to be a necessary component of true understanding. If you understand something better but can’t answer anything new about it, I’d be very curious as to what your understanding consists of.
Now, if there are only a few questions in the exam, the first AI can teach to the test, and simply have the human memorise a few meaningless answers. Technically, it could also do this for billions and billions of questions, but the frailty of human memory partially protects us here: it’s much more plausible that we have a process for figuring out the answers to that many questions (i.e. a sort of understanding) than a long list to follow.
However, it’s not plausible for normal humans to sit down and answer a billion questions with any degree of competence or speed. Fortunately, that will not be necessary: instead of answering N questions, the human must simply be capable of answering one question chosen randomly from among the N. This means that N can be very large indeed, combinatorially large, far too large for even a machine to memorise. Some sort of compression or understanding or simplification will be needed.
So, formally, the first AI is rewarded when the human gives the right answer to a randomly chosen question. Obviously the AI must not know the question in advance, nor actively advise the human on the answer once the question is given, or the whole setup falls apart.
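As a minimal sketch of this reward rule (all names here are hypothetical, not from the original setup), the key point is that the question is only drawn after the teaching is over:

```python
import random

def run_exam(questions, human_answer, true_answer):
    """Draw one question at random from the full bank; the first AI is
    rewarded only if the human answers it correctly.  Because the draw
    happens after the AI's teaching is done, the AI cannot know which
    question to coach for, nor advise on the answer once it's drawn."""
    q = random.choice(questions)
    return 1.0 if human_answer(q) == true_answer(q) else 0.0
```

A human whose answer procedure matches the truth on every question gets reward 1.0 on any draw; a purely memorised list only helps if it happens to cover the drawn question.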
The roles of the second AI and the surprise-test counterfactual
So, what’s the point of having the second AI? Well, the setup above will work if we have a list of N questions that are a) formally defined for the first AI, and b) understandable to the human.
But that’s rarely the case. And if it were the case, we might have alternatives, like simply having the AI directly answer a whole bunch of the understandable questions.
So typically, the questions are going to be like that “ratios of FtsZ…” statement above, and completely beyond human comprehension. Enter the second AI. Its only role is to take these formally defined questions and translate them into terms the human is likely to understand – and hence, answer correctly. Then both AIs are given the motivation to have the human choose the correct answer.
So the second AI is like an exam invigilator clarifying the questions on the exam sheet. Now, if the invigilator knows the answer to the question and really wants the student to pass, it can simply tell the human the answer. So the second AI must itself be ignorant of the answers: only the first AI knows them. Similarly, if the second AI has a long conversation with the human, it can extract the information the first AI gave them, use that to figure out the answer, and then tell the human what to answer.
Finally, if the questions are biased (for instance the answer is “no” for \(99\%\) of the questions), then the first AI could give up on giving genuine understanding and simply tell the human to always answer “no”.
All these points can be addressed by question selection and information control. Both AIs start with a prior \(P\) over whatever is being analysed. Then N boolean questions are selected, where N is a very large number. These questions are selected so that they are almost 50-50, given \(P\). That means that they might have a definite answer, given \(P\), but the slightest bit of extra evidence is likely to push them towards “yes” or “no” with (almost) equal ease. Furthermore, for every question among the N, its negation is also among them, so half of the questions will always have a “yes” answer and half a “no”, whatever happens.
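The selection step above can be sketched as follows (a toy illustration with hypothetical names; the prior is modelled as a function from questions to “yes” probabilities):

```python
def select_questions(candidates, prior, n, tol=0.01):
    """Keep only boolean questions whose 'yes' probability under the
    shared prior P is within tol of 50%, then close the set under
    negation.  Negating a question flips its answer, so exactly half
    of the final bank is 'yes' and half 'no', whatever the truth
    turns out to be -- blocking the 'always answer no' strategy."""
    near_even = [q for q in candidates if abs(prior(q) - 0.5) <= tol]
    chosen = near_even[:n]
    # represent the negation of question q as the tagged pair ('not', q)
    return chosen + [('not', q) for q in chosen]
```

The negation-closure trick means the bank's base rate carries no information at all: guessing blindly scores exactly 50%, so any better-than-random performance must come from actual understanding.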
Then the first AI gets extra evidence, and computes a posterior \(P'\). It then talks with the human, trying to get them to understand \(P'\). The human goes over to the second AI, and a question is drawn randomly from the N. The second AI only knows \(P\), so it knows that the question it is asking is roughly equally likely to be answered by “yes” or “no”. It doesn’t get to see or hear or interact with the human in any way; it simply creates an explanation of the question, and leaves the human to act on it. Thus, it cannot update \(P\) in any way, and so can’t pass on extra information about \(P'\) – because it doesn’t have any.
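The information barrier can be made explicit in the interfaces themselves (a hypothetical sketch, not the actual setup): the explainer’s function simply never receives the posterior, the lesson, or any channel back from the human.

```python
def first_ai_teach(posterior):
    """Teacher: sees the updated model P' and produces a lesson for the
    human.  It does not know which question will later be drawn."""
    return {"lesson_from": posterior}      # stand-in for a real lesson

def second_ai_explain(question, prior):
    """Explainer: sees only the drawn question and the shared prior P.
    No access to P', the lesson, or the human -- so it has no extra
    information about the answer that it could leak."""
    return f"In plain terms, the question asks: {question}"

def human_answers(lesson, explanation):
    """The human combines the lesson (grounded in P') with the
    explanation (grounded in P) to pick 'yes' or 'no'."""
    return "yes"                           # placeholder for human judgement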
And then the human, armed with the knowledge of \(P'\) from the first AI and the explanation of the question from the second, will hopefully be able to answer that question with better than random accuracy.
Finally, it would be ponderous to run this exam every single time the first AI communicates anything. So instead the exam is run as a surprise test: with probability \(\epsilon\) it happens, so the first AI had better be on its toes all the time, while the human is freed up, most of the time, to simply receive the understanding and act on it.
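The surprise-test counterfactual is then a one-line gate (again a hypothetical sketch): the first AI cannot tell in advance which interactions will be graded, so it must teach honestly in all of them.

```python
import random

def interact(eps, teach, exam):
    """Most of the time the human just receives the lesson and acts on
    it; with probability eps a surprise exam is run instead.  Since the
    teaching happens before the coin flip, the first AI's incentives
    are the same on every interaction."""
    lesson = teach()
    if random.random() < eps:
        return exam(lesson)      # graded: reward depends on the answer
    return lesson                # ungraded: the human acts on the lesson
```

With small \(\epsilon\), exams are rare, but the expected reward still depends entirely on exam performance, so the teaching pressure is undiluted.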