Some proposed ways to train a system to imitate a human involve having one system imitate a human, while another system tries to tell the first system apart from an actual human. If the first system can get classified as human by the second system, then (one might think) it is imitating a human well, as long as the second system is sufficiently smart.
I described a possible problem with these approaches in a LessWrong thread (this originally came up in a discussion with Paul):
Here’s a scenario that doesn’t seem completely implausible. Suppose Bob is someone whose public key is easily available on the internet. The first AI will read things on the internet and output a message. Some of the message will get put on the public internet. Bob suspects that the AI might have secretly sent him a message (e.g. giving him advice on which stocks to buy). So he tries using his private key to decrypt some of the AI’s output (e.g. the lowest-order bits in some images the AI has output).
Knowing that Bob (or someone else like Bob) will likely do this, the AI (which has access to Bob’s public key, due to reading the internet) encrypts a message with Bob’s public key, and puts the ciphertext in the lowest-order bits of some image. The ciphertext looks like random noise to the second AI. Maybe the plaintext contains both actual stock tips and some dangerous messages that trick Bob into running an unrestrained copy of the first AI (I’m assuming the AI is good at psychological manipulation).
As Paul says later in the thread, the underlying problem is that it is easy for the first AI to change its output in an important way without the second AI noticing. There isn’t selection pressure to do this, but there isn’t selection pressure against it either.
Here’s a proposed way of implementing human-imitation while avoiding this problem. The human-imitator is a single AI that is given a prefix of a string produced by a human and predicts the next bit. It is rewarded for predicting the next bit accurately (using some proper scoring rule). We can use this system to imitate a human by sampling from its implied probability distribution over strings, bit by bit. If the system is very good at predicting the next bit in a human-produced string, then this implied probability distribution will be an accurate prediction of the human string.
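As a minimal sketch of this setup, the Python below scores a bit-by-bit predictor with the logarithmic scoring rule (one example of a proper scoring rule) and samples from its implied distribution over strings. The predictor `predict_next_bit` is a hypothetical toy stand-in that just favors repeating the previous bit; it and the other function names are assumptions made for illustration, not part of the proposal.

```python
import math
import random

def predict_next_bit(prefix):
    """Hypothetical toy predictor: returns P(next bit = 1 | prefix).

    Assumption for illustration only: bits tend to repeat the previous bit.
    """
    if not prefix:
        return 0.5
    return 0.8 if prefix[-1] == 1 else 0.2

def log_score(bits):
    """Total logarithmic score of the predictor on a given bit string.

    This equals the log-probability of the whole string under the
    distribution implied by the bit-by-bit predictions.
    """
    total = 0.0
    for i, bit in enumerate(bits):
        p_one = predict_next_bit(bits[:i])
        total += math.log(p_one if bit == 1 else 1.0 - p_one)
    return total

def sample_string(length, rng):
    """Sample from the implied distribution over strings, one bit at a time."""
    bits = []
    for _ in range(length):
        bits.append(1 if rng.random() < predict_next_bit(bits) else 0)
    return bits

if __name__ == "__main__":
    rng = random.Random(0)
    imitation = sample_string(8, rng)
    print(imitation, log_score(imitation))
```

Because the per-bit scores add up to the log-probability of the whole string, a predictor that maximizes expected score is pushed toward the true distribution over human-produced strings.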
Unfortunately, predicting the string bit-by-bit might be computationally harder than producing a string that is hard to tell apart from what a human would produce. Here’s an extension that makes the problem slightly easier.
We want some way to represent a distribution over strings. One way of representing a distribution over strings S is to introduce some auxiliary data A and a function \(f : A \rightarrow S\), and represent a distribution \(p\) over A. For example, if the model for generating strings is a PCFG (probabilistic context-free grammar), then the auxiliary data A is the actual parse tree. We could assume that \(p\) is represented by a bit-by-bit predictor (which takes a prefix of the data A and returns a probability for the next bit). In the PCFG example, this is easy; just order the data A in a “generative” order, and these bit-by-bit predictions will be easy.
Given the actual string S, it is possible to show a lower bound on this string’s marginal probability under \(p\) by importance sampling possible parse trees consistent with a given string. That is, the system will represent some distribution \(q\) over A (after seeing the string S), which only outputs A values for which \(f(A) = S\). This sampler can be used to estimate the total probability under \(p\) of \(A\) values for which \(f(A) = S\).
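To spell out why this gives a lower bound, here is a short derivation. The symbols \(a_i\), \(n\), and \(\hat{p}_n\) are introduced here for illustration, and \(q(a)\) denotes the probability that the sampler outputs \(a\) after seeing S. Since \(q\) only outputs values with \(f(a) = S\),

\[
\mathbb{E}_{a \sim q}\left[\frac{p(a)}{q(a)}\right] = \sum_{a \,:\, q(a) > 0} p(a) \;\le\; \sum_{a \,:\, f(a) = S} p(a),
\]

with equality when \(q\) puts positive probability on every \(a\) such that \(f(a) = S\) and \(p(a) > 0\). Drawing \(a_1, \ldots, a_n\) independently from \(q\) and setting \(\hat{p}_n = \frac{1}{n} \sum_{i=1}^{n} \frac{p(a_i)}{q(a_i)}\), Jensen's inequality gives

\[
\mathbb{E}\left[\log \hat{p}_n\right] \;\le\; \log \mathbb{E}\left[\hat{p}_n\right] \;\le\; \log \sum_{a \,:\, f(a) = S} p(a),
\]

so the estimated log-probability is, in expectation, a lower bound on the true log-probability of S under \(p\).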
Note that this is only one method of estimating the probability. Another method is model-counting using hash functions. Probably, there are useful methods other than these two.
So in order to predict a human-produced string, the AI:
1. outputs a distribution \(p\) over A, as a bit-by-bit generative model
2. sees the human-produced string S
3. outputs the distribution \(q\) over A, as a bit-by-bit generative model
4. gets a score according to the estimated log-probability of the string S according to \(p\), using \(q\) for importance sampling
We could represent a bit-by-bit generative model as e.g. a split copy of the AI that outputs a probability for the next bit given previous bits. Note that steps 2-4 only happen during training; during testing, you only need to run step 1 to get \(p\), take a sample \(a\) from \(p\), and then return \(f(a)\).
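The whole procedure can be sketched on a toy problem. Everything concrete below is a hypothetical choice made for illustration: the auxiliary data is 3 bits, `f` maps it to a 2-bit string, and `p_next` and `make_q_next` are small hand-written bit-by-bit models standing in for what the AI would output. The training score is the log of the importance-sampling estimate; the test-time path only samples from \(p\) and applies \(f\).

```python
import itertools
import math
import random

A_LEN = 3  # hypothetical: auxiliary data A is a tuple of 3 bits

def f(a):
    """Hypothetical map from auxiliary data (3 bits) to a string (2 bits)."""
    return (a[0] ^ a[1], a[2])

def p_next(prefix):
    """Toy bit-by-bit model p over A: P(next bit = 1 | prefix)."""
    if len(prefix) == 0:
        return 0.7
    if len(prefix) == 1:
        return 0.6 if prefix[0] == 1 else 0.3
    return 0.9 if prefix[1] == 1 else 0.2

def make_q_next(s):
    """Toy bit-by-bit model q over A given s; it only outputs a with f(a) == s."""
    def q_next(prefix):
        if len(prefix) == 0:
            return 0.5                      # a[0] is unconstrained
        if len(prefix) == 1:
            return float(prefix[0] ^ s[0])  # a[1] forced so a[0] ^ a[1] == s[0]
        return float(s[1])                  # a[2] forced to equal s[1]
    return q_next

def sample(next_bit, rng):
    """Draw a value of A from a bit-by-bit model."""
    bits = []
    for _ in range(A_LEN):
        bits.append(1 if rng.random() < next_bit(bits) else 0)
    return tuple(bits)

def prob(a, next_bit):
    """Probability of a full value of A under a bit-by-bit model."""
    result = 1.0
    for i, bit in enumerate(a):
        p_one = next_bit(a[:i])
        result *= p_one if bit == 1 else 1.0 - p_one
    return result

def estimate_log_prob(s, num_samples, rng):
    """Training score (step 4): log of the importance-sampling estimate of p(s)."""
    q_next = make_q_next(s)
    weights = []
    for _ in range(num_samples):
        a = sample(q_next, rng)
        assert f(a) == s  # q must only propose consistent auxiliary data
        weights.append(prob(a, p_next) / prob(a, q_next))
    return math.log(sum(weights) / len(weights))

def exact_log_prob(s):
    """Brute-force marginal, feasible only because the toy A is tiny."""
    total = sum(prob(a, p_next)
                for a in itertools.product([0, 1], repeat=A_LEN) if f(a) == s)
    return math.log(total)

def imitate(rng):
    """Test time: sample a from p and return f(a)."""
    return f(sample(p_next, rng))

if __name__ == "__main__":
    rng = random.Random(0)
    s = (1, 1)
    print(estimate_log_prob(s, 20000, rng), exact_log_prob(s))
    print(imitate(rng))
```

In this toy case \(q\) covers every consistent value of A, so the estimate converges to the exact marginal; a \(q\) that missed some consistent values would give a strictly lower score, which is the sense in which the score is only a lower bound.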
This kind of method seems computationally easier than directly predicting a human-produced string bit-by-bit, but it might still be harder than the original adversarial proposal.
