Intelligent Agent Foundations Forum
A possible training procedure for human-imitators
discussion post by Jessica Taylor 1246 days ago | Patrick LaVictoire and Paul Christiano like this | 4 comments

Some proposed ways of training a system to imitate a human involve two systems: the first produces output meant to imitate a human, while the second tries to tell the first system's output apart from an actual human's. If the first system gets classified as human by the second, then (one might think) it is imitating a human well, as long as the second system is sufficiently smart.

I described a possible problem with these approaches in a LessWrong thread (this originally came up in a discussion with Paul):

Here’s a scenario that doesn’t seem completely implausible. Suppose Bob is someone whose public key is easily available on the internet. The first AI will read things on the internet and output a message. Some of the message will get put on the public internet. Bob suspects that the AI might have secretly sent him a message (e.g. giving him advice on which stocks to buy). So he tries using his private key to decrypt some of the AI’s output (e.g. the lowest-order bits in some images the AI has output).

Knowing that Bob (or someone else like Bob) will likely do this, the AI (which has access to Bob’s public key, due to reading the internet) encrypts a message with Bob’s public key, and puts the ciphertext in the lowest-order bits of some image. The ciphertext looks like random noise to the second AI. Maybe the plaintext contains both actual stock tips and some dangerous messages that trick Bob into running an unrestrained copy of the first AI (I’m assuming the AI is good at psychological manipulation).

As Paul says later in the thread, the underlying problem is that it is easy for the first AI to change its output in an important way without the second AI noticing. There isn’t selection pressure to do this, but there isn’t selection pressure against it either.

Here’s a proposed way of implementing human-imitation while avoiding this problem. The human-imitator is a single AI that is given a prefix of a string produced by a human and predicts the next bit. It is rewarded for predicting the next bit accurately (using some proper scoring rule). We can use this system to imitate a human by sampling from its implied probability distribution over strings, bit by bit. If the system is very good at predicting the next bit of a human-produced string, then this implied distribution will be an accurate prediction of the human's string.
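
As a minimal sketch of this setup (the `prob_next_bit_is_one` interface below is hypothetical, standing in for whatever form the predictor actually takes):

```python
import math
import random

def log_score(model, human_bits):
    """Proper scoring rule (log score) for a bit-by-bit predictor on a human-produced string."""
    total = 0.0
    for i, bit in enumerate(human_bits):
        p_one = model.prob_next_bit_is_one(human_bits[:i])  # P(next bit = 1 | prefix)
        total += math.log(p_one if bit == 1 else 1.0 - p_one)
    return total

def sample_imitation(model, length):
    """Sample from the model's implied distribution over strings, one bit at a time."""
    bits = []
    for _ in range(length):
        p_one = model.prob_next_bit_is_one(bits)
        bits.append(1 if random.random() < p_one else 0)
    return bits
```

Since the log score is proper, the predictor maximizes its expected score by reporting its true conditional probability for each next bit.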

Unfortunately, predicting the string bit-by-bit might be computationally harder than producing a string that is hard to tell apart from what a human would produce. Here’s an extension that makes the problem slightly easier.

We want some way to represent a distribution over strings. One way of representing a distribution over strings \(S\) is to introduce some auxiliary data \(A\) and a function \(f : A \rightarrow S\), and to represent a distribution \(p\) over \(A\). For example, if the model for generating strings is a PCFG, then the auxiliary data \(A\) is the actual parse tree. We could assume that \(p\) is represented by a bit-by-bit predictor (which takes a prefix of the data \(A\) and returns a probability for the next bit). In the PCFG example, this is straightforward: just order the data \(A\) in a “generative” order, and these bit-by-bit predictions will be easy.
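
As a toy illustration of the \(A\), \(f\), \(p\) decomposition (this particular grammar, and the encoding of a parse as a sequence of binary rule choices, are invented for the example):

```python
# A is a sequence of binary rule choices for a tiny grammar; f expands them
# left-to-right into the terminal string, so A plays the role of the parse tree.
# A bit-by-bit model over A only has to predict each rule choice given the
# choices made so far ("generative" order).
RULES = {
    "S": [("A", "B"), ("B", "A")],
    "A": [("0",), ("0", "A")],
    "B": [("1",), ("1", "B")],
}

def f(choices):
    """Deterministically map auxiliary data (rule choices) to a string."""
    stack, out, i = ["S"], [], 0
    while stack:
        sym = stack.pop(0)
        if sym in RULES:
            expansion = RULES[sym][choices[i] if i < len(choices) else 0]
            i += 1
            stack = list(expansion) + stack
        else:
            out.append(sym)
    return "".join(out)

# f([0, 0, 0]) == "01"; f([1, 1, 0, 0]) == "110"
```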

Given the actual string \(S\), it is possible to show a lower bound on this string’s marginal probability under \(p\) by importance sampling possible parse trees consistent with it. That is, the system will represent some distribution \(q\) over \(A\) (after seeing the string \(S\)) which only outputs \(A\) values for which \(f(A) = S\). This sampler can be used to estimate the total probability under \(p\) of the \(A\) values for which \(f(A) = S\).
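
A sketch of the importance-sampling estimate, continuing the code above and assuming we can evaluate \(p\) and \(q\) pointwise and draw samples from \(q\) (these interfaces are assumptions of the sketch, not specified in the proposal):

```python
def estimate_marginal_prob(p_prob, q_sample, q_prob, f, s, n_samples=1000):
    """Estimate P_p(f(A) = s) as E_q[ 1[f(A) = s] * p(A) / q(A) ] by sampling A ~ q.

    The estimate is unbiased, and (by Markov's inequality) it is unlikely to
    exceed the true marginal probability by a large factor, which is the sense
    in which it gives a probabilistic lower bound.
    """
    total = 0.0
    for _ in range(n_samples):
        a = q_sample()
        if f(a) == s:                        # q should only output such A, but check anyway
            total += p_prob(a) / q_prob(a)   # importance weight
    return total / n_samples
```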

Note that this is only one method of estimating the probability; another is model counting using hash functions. There are probably other useful methods besides these two.

So in order to predict a human-produced string, the AI:

  1. outputs a distribution \(p\) over \(A\), as a bit-by-bit generative model
  2. sees the human-produced string \(S\)
  3. outputs the distribution \(q\) over \(A\), as a bit-by-bit generative model
  4. gets a score equal to the estimated log-probability of the string \(S\) under \(p\), using \(q\) for importance sampling (see the sketch below)

We could represent a bit-by-bit generative model as e.g. a split copy of the AI that outputs a probability for the next bit given previous bits. Note that steps 2-4 only happen during training; during testing, you only need to run step 1 to get \(p\), take a sample \(a\) from \(p\), and then return \(f(a)\).
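
Putting the steps together (a sketch reusing `estimate_marginal_prob` from above; the `.prob` and `.sample` interfaces on the two models are assumptions of the sketch):

```python
def training_score(p_model, q_model, f, human_string, n_samples=100):
    """Steps 2-4: score p by the estimated log-probability of the human string,
    using q (produced after seeing the string) for importance sampling."""
    estimate = estimate_marginal_prob(
        p_model.prob, q_model.sample, q_model.prob, f, human_string, n_samples)
    return math.log(estimate)

def imitate_human(p_model, f):
    """Test time: only step 1 is needed. Sample auxiliary data a ~ p and return f(a)."""
    return f(p_model.sample())
```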

This kind of method seems computationally easier than directly predicting a human-produced string bit-by-bit, but it might still be harder than the original adversarial proposal.



by Paul Christiano 1236 days ago | Jessica Taylor likes this | link

This looks a lot like a variational autoencoder, unless I’m missing some distinction. (Note that once you have a bit-by-bit predictor, you can assume WLOG that the distribution on \(A\) is uniform.) Thinking about how variational autoencoders work in the superintelligent case seems worthwhile if we want to scale up something like imitation. I also wouldn’t be that surprised if it produced some practically useful insight.

(Incidentally, I think that variational autoencoders and generative adversarial nets are the leading approaches to generative modeling right now.)

I agree that the steganography problem looks kind of bad for adversarial methods in the limit. For coping with the analogous problem with approval-maximization, I think the best bet is to try to make the generative model state transparent to the discriminative model. But this is obviously not going to work for generative adversarial models, since access to the generator state would make distinguishing trivial.


by Paul Christiano 1216 days ago | link

Actually I’m not sure exactly what you mean by importance sampling here.

The variational lower bound would be to draw samples from \(q\) and compute \(\log(p/q)\). The log probability of the output under \(p\) is bounded below by the expectation of this quantity (with equality iff \(q\) is the correct conditional distribution over \(A\)).
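
A minimal sketch of this variational estimate, with the same assumed pointwise-evaluation and sampling interfaces as in the post's sketches:

```python
import math

def variational_lower_bound(p_prob, q_sample, q_prob, n_samples=1000):
    """Estimate E_{A~q}[ log(p(A)/q(A)) ], a lower bound on the log marginal
    probability, tight exactly when q is the correct conditional distribution."""
    total = 0.0
    for _ in range(n_samples):
        a = q_sample()
        total += math.log(p_prob(a) / q_prob(a))
    return total / n_samples
```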

I’m just going to work with this in my other comments; I assume it amounts to the same thing.


by Jessica Taylor 1214 days ago | link

What I mean is: compute \(\hat{\mathbb{E}}_q[[f(A) = x]\,p(A)/q(A)]\) (where \([f(A) = x]\) is 1 if \(f(A) = x\) and 0 otherwise), which is a probabilistic lower bound on \(P_p(f(A) = x)\).

The variational score gives you a somewhat worse lower bound if \(q\) is different from \(p(A | f(A) = x)\). Due to Jensen’s inequality, \[\mathbb{E}_q[ \log ( [f(A) = x] p(A) / q(A) ) ] \leq \log \mathbb{E}_q [ [f(A) = x] p(A) / q(A) ] \leq \log P_p(f(A) = x)\]
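
A toy numerical check of this gap, reusing the two estimators sketched above and their imports (the particular \(p\), \(q\), and \(f\) are made up for the illustration):

```python
# Under p, A is uniform on {1, 2, 3, 4}; f(a) = "x" only for a in {1, 2},
# so P_p(f(A) = "x") = 0.5. Take a q over {1, 2} that differs from the true
# conditional p(A | f(A) = "x"), which is uniform on {1, 2}.
p_prob = lambda a: 0.25
q_table = {1: 0.9, 2: 0.1}
q_prob = lambda a: q_table[a]
q_sample = lambda: 1 if random.random() < 0.9 else 2
f = lambda a: "x" if a in (1, 2) else "y"

is_est = estimate_marginal_prob(p_prob, q_sample, q_prob, f, "x", n_samples=100000)
vb_est = variational_lower_bound(p_prob, q_sample, q_prob, n_samples=100000)
print(math.log(is_est), vb_est, math.log(0.5))
# The importance-sampling estimate stays near log(0.5) ≈ -0.69, while the
# variational estimate concentrates around -1.06, a strictly worse bound.
```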

It probably doesn’t make a huge difference either way.


by Paul Christiano 1216 days ago | link

It would be great to see an analysis of this from a complexity-theoretic / cryptographic perspective. Are there distributions that can’t be imitated correctly in this way, even when they should be within the power of your model? Are there distributions where you get potentially problematic behavior like in the steganography case?

(That’s also surely of interest to the mainstream ML community given the recent prominence of variational autoencoders, so it seems quite likely someone has done it.)

After that there is a more subtle question about learnability, but it would be good to start with the easy part.



