As background, I suggest reading the imitation learning section of this post.
Suppose we have a generative adversarial model for answering questions. This consists of two agents, an actor and a critic. The actor generates a human-like answer given a question; the critic sees the question and answer, and tries to guess whether the answer was produced by the actor or the human.
The actor could differ from human behavior in important but hard-to-detect ways. For example, it could plagiarize or insert steganography into its answers, as in the informed oversight problem.
To resolve this problem, perhaps we want to make the actor transparent to the critic, so the critic is able to detect plagiarism. But this makes the critic’s job far too easy: if the critic sees a trace of the actor’s computation, it can trivially tell whether the answer was produced by the actor (just check whether the actor’s trace is consistent with the answer to the question).
In some sense, we want the critic to have logical information about the actor’s computation, but not physical information about the random bits the actor has access to. Here’s a way of doing this.
Introduce two more agents, an adviser and a misinformer. Both of these agents know whether the answer was produced by a human or the actor, and they both have access to the actor’s internal state. The critic will be able to interact with both of these agents, but will not know which is the adviser and which is the misinformer. The adviser’s objective is for the critic to guess correctly, while the misinformer’s objective is for the critic to guess incorrectly.
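To make the incentives concrete, here is a minimal sketch of one round of this game. It rests on assumptions that are not in the proposal itself: each helper sends the critic a single message (rather than interacting freely), payoffs are 0 or 1, and the names `Episode`, `Helper`, `Critic`, and `play_round` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Episode:
    question: str
    answer: str
    from_actor: bool   # ground truth; visible to both helpers, hidden from the critic
    actor_state: str   # actor's internal state; visible to both helpers only


# Hypothetical interfaces (assumption): each helper sends one message per episode.
Helper = Callable[[Episode], str]
# The critic sees the question, the answer, and two unlabeled messages.
# It returns True to guess "actor" and False to guess "human".
Critic = Callable[[str, str, Tuple[str, str]], bool]


def play_round(ep: Episode, adviser: Helper, misinformer: Helper,
               critic: Critic, rng: random.Random) -> Dict[str, float]:
    """Play one round and return each agent's payoff."""
    messages = [adviser(ep), misinformer(ep)]
    rng.shuffle(messages)  # the critic cannot tell which helper sent which message
    guess = critic(ep.question, ep.answer, (messages[0], messages[1]))
    correct = 1.0 if guess == ep.from_actor else 0.0
    # The adviser is paid when the critic is right, the misinformer when it is wrong.
    return {"critic": correct, "adviser": correct, "misinformer": 1.0 - correct}
```

In this sketch the adviser and the critic share a payoff and the misinformer receives its complement, which is one simple way to encode the opposed objectives described above.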
Suppose the actor plagiarizes (in a case where the human would not plagiarize). Then the adviser may be able to look at the actor’s internal state and compile a report showing where the actor has plagiarized. The adviser may give this report to the critic, which will help the critic know that the answer was produced by the actor and not the human. Furthermore, since the critic can check that the plagiarism report is accurate, the presence of the misinformer does not hinder this process. In some sense, by having access to an adviser (who can be confused with a misinformer), the critic can gain logical information about the actor’s computations but not physical information.
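The reason the misinformer cannot interfere here is that the critic only acts on claims it can verify for itself. A toy sketch of that idea follows, under strong simplifying assumptions: plagiarism means a verbatim copied passage, the critic has access to a corpus of known documents, and the critic guesses "human" whenever no report checks out. `Report`, `report_checks_out`, and `critic_guess` are hypothetical names.

```python
from typing import Dict, Iterable, NamedTuple


class Report(NamedTuple):
    source_id: str   # which known document the passage was allegedly copied from
    passage: str     # the allegedly copied text


def report_checks_out(report: Report, answer: str, corpus: Dict[str, str],
                      min_len: int = 20) -> bool:
    """A report is accepted only if the critic can verify it on its own."""
    source = corpus.get(report.source_id, "")
    return (len(report.passage) >= min_len
            and report.passage in answer
            and report.passage in source)


def critic_guess(answer: str, reports: Iterable[Report],
                 corpus: Dict[str, str]) -> bool:
    """Return True for 'actor'. Unverifiable reports are simply ignored."""
    return any(report_checks_out(r, answer, corpus) for r in reports)
```

Because the critic does not need to know who sent a report in order to check it, a fabricated report from the misinformer fails verification and has no effect on the guess.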
In some cases, this setup will make the critic’s job too easy. Suppose the adviser always gives the critic a trace of the actor’s computation, whenever the answer was produced by the actor. The critic may check this trace for plausibility. To cause the critic to make errors, the misinformer must, when seeing that the answer was produced by a human, synthesize a plausible trace in which the actor outputs the same answer. This is only possible if the actor’s computation is “reversible” in some sense: it must be possible to (probabilistically) guess the actor’s random decisions given its answer. So whenever the actor uses “non-reversible” computations, it will be easy for the critic to tell whether the answer was produced by the actor or a human: it just checks whether it has been given a plausible trace in which the actor produces the given answer (provided by the adviser or the misinformer, although only the adviser has an incentive to provide this trace).
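The following toy sketch illustrates the reversibility point. It assumes, purely for illustration, that the actor is a deterministic function of the question and a random seed, that a "trace" is just that seed, and that the misinformer forges traces by searching over candidate seeds. Both actors and all function names are hypothetical, and the hash-based actor is a deliberately extreme example of a non-reversible computation.

```python
import hashlib
from typing import Callable, Iterable, Optional

Actor = Callable[[str, int], str]  # (question, random seed) -> answer

PHRASINGS = ["yes", "no", "maybe", "unclear"]


def reversible_actor(question: str, seed: int) -> str:
    # The random choice can be read straight off the answer.
    return PHRASINGS[seed % len(PHRASINGS)]


def hashing_actor(question: str, seed: int) -> str:
    # The answer depends on the seed through SHA-256, so recovering a seed
    # from the answer is infeasible without already knowing it.
    return hashlib.sha256(f"{question}|{seed}".encode()).hexdigest()


def trace_is_plausible(actor: Actor, question: str, answer: str, seed: int) -> bool:
    """Critic's check: does rerunning the actor with this seed give the answer?"""
    return actor(question, seed) == answer


def forge_trace(actor: Actor, question: str, answer: str,
                candidate_seeds: Iterable[int]) -> Optional[int]:
    """Misinformer's task: find some seed under which the actor outputs `answer`."""
    for seed in candidate_seeds:
        if actor(question, seed) == answer:
            return seed
    return None
```

With `reversible_actor`, the misinformer can always produce a seed that passes `trace_is_plausible` for an answer the actor could have given, so a plausible trace tells the critic nothing. With `hashing_actor`, only someone who knows the true seed can supply a passing trace, so the mere availability of one reveals that the actor produced the answer.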
I’m not sure if this proposal is substantially different from variational autoencoders. Both proposals only allow the actor to use reversible computations, although perhaps the notion of reversibility for variational autoencoders is different from the notion of reversibility for the proposal in this post.