Summary: Given that both imitation and maximization have flaws, it might be reasonable to interpolate between these two extremes. It’s possible to use Rényi divergence (a family of measures of distance between distributions) to define a family of these interpolations.
Rényi divergence is a measure of distance between two probability distributions. It is defined as
\[D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{X \sim P}\left[\left( \frac{P(X)}{Q(X)} \right)^{\alpha - 1}\right]\]
with \(\alpha \geq 0\). Special values \(D_0, D_1, D_{\infty}\) can be filled in by their limits.
Particularly interesting values are \(D_1(P \| Q) = D_{KL}(P \| Q)\) and \(D_{\infty}(P \| Q) = \log \max_x \frac{P(x)}{Q(x)}\).
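To make the definition concrete, here is a minimal sketch for discrete distributions given as probability vectors. The function name `renyi_divergence` is made up for illustration, and the sketch assumes \(Q(x) > 0\) wherever \(P(x) > 0\); the cases \(\alpha = 1\) and \(\alpha = \infty\) are filled in with the limits stated above.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Renyi divergence D_alpha(P || Q) in nats for discrete P, Q.

    Illustrative sketch: assumes alpha >= 0 and Q(x) > 0 wherever P(x) > 0.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # points with P(x) = 0 contribute nothing
    p, q = p[mask], q[mask]
    if np.any(q <= 0):
        raise ValueError("this sketch assumes Q(x) > 0 wherever P(x) > 0")
    log_ratio = np.log(p) - np.log(q)
    if alpha == 1:                    # limit alpha -> 1: KL divergence
        return float(np.sum(p * log_ratio))
    if np.isinf(alpha):               # limit alpha -> infinity: log max ratio
        return float(np.max(log_ratio))
    # log E_P[(P/Q)^(alpha - 1)], with the largest exponent factored out
    z = (alpha - 1) * log_ratio
    m = np.max(z)
    return float((m + np.log(np.sum(p * np.exp(z - m)))) / (alpha - 1))

p = [0.5, 0.5]
q = [0.25, 0.75]
for a in (0.5, 1, 2, 10, np.inf):
    print(a, renyi_divergence(p, q, a))
```

For this pair, \(D_2\) is \(\log \frac{4}{3}\) and \(D_\infty\) is \(\log 2\), which gives a quick sanity check on the code.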
Consider some agent choosing a distribution \(P\) over actions to simultaneously maximize a score function \(s\) and minimize Rényi divergence from some base distribution \(Q\). That is, score \(P\) according to
\[v_{\alpha}(P) = \mathbb{E}_{X \sim P}[s(X)] - \gamma D_{\alpha}(P \| Q)\]
where \(\gamma > 0\) controls how much the secondary objective is emphasized. Define \(P_\alpha^* = \arg\max_P v_\alpha(P)\). We have \(P_1^*(x) \propto Q(x) e^{s(x) / \gamma}\), and \(P_\infty^*\) is a quantilizer with score function \(s\) and base distribution \(Q\) (with the amount of quantilization being some function of \(\gamma\), \(Q\), and \(s\)). For \(1 < \alpha < \infty\), \(P_\alpha^*\) will be some interpolation between \(P_1^*\) and \(P_{\infty}^*\).
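The \(\alpha = 1\) case can be checked numerically on a small discrete example. The sketch below is illustrative only: the helper names `kl_regularized_optimum` and `v1` and the toy values of \(Q\), \(s\), and \(\gamma\) are assumptions made here, and \(Q\) is assumed to have full support. It builds \(P_1^*(x) \propto Q(x) e^{s(x)/\gamma}\) and compares its value of \(v_1\) against randomly drawn alternatives.

```python
import numpy as np

def kl_regularized_optimum(q, s, gamma):
    """P_1^*(x) proportional to Q(x) * exp(s(x) / gamma), normalized stably."""
    q = np.asarray(q, dtype=float)
    s = np.asarray(s, dtype=float)
    logits = np.log(q) + s / gamma
    logits = logits - logits.max()    # avoid overflow for small gamma
    w = np.exp(logits)
    return w / w.sum()

def v1(p, q, s, gamma):
    """v_1(P) = E_P[s] - gamma * KL(P || Q) for discrete distributions."""
    p, q, s = (np.asarray(a, dtype=float) for a in (p, q, s))
    mask = p > 0                      # 0 * log 0 is taken to be 0
    kl = np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))
    return float(np.dot(p, s) - gamma * kl)

# Toy example (hypothetical numbers): three actions.
q = np.array([0.7, 0.2, 0.1])
s = np.array([0.0, 1.0, 2.0])
gamma = 0.5

p_star = kl_regularized_optimum(q, s, gamma)
best = v1(p_star, q, s, gamma)

# No randomly drawn distribution should score higher than p_star.
rng = np.random.default_rng(0)
others = rng.dirichlet(np.ones(3), size=1000)
print(p_star, best, max(v1(p, q, s, gamma) for p in others))
```

In the same example, a very large \(\gamma\) returns a distribution close to \(Q\) (pure imitation), while a very small \(\gamma\) concentrates nearly all mass on the highest-scoring action (pure maximization).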
It’s not necessarily possible to compute \(v_{\alpha}(P)\). To approximate this quantity, take samples \(x_1, ..., x_n \sim P\) and compute
\[\frac{1}{n} \sum_{i=1}^n s(x_i) - \frac{\gamma}{\alpha - 1} \log \left( \frac{1}{n} \sum_{i=1}^n \left( \frac{P(x_i)}{Q(x_i)} \right)^{\alpha - 1} \right)\]
Of course, this requires \(P\) and \(Q\) to be specified in a form that allows efficiently estimating probabilities of particular values. For example, \(P\) and \(Q\) could both be variational autoencoders.
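A minimal sketch of this sample-based estimate follows, assuming the caller can supply \(s(x_i)\), \(\log P(x_i)\), and \(\log Q(x_i)\) for each sample. The function name `estimate_v_alpha` and the toy distributions are hypothetical. The `gamma` argument scales the divergence term as in the definition of \(v_\alpha\). Working with log-probabilities keeps large ratios from overflowing. Because it takes the log of a sample mean, the estimate is consistent but biased for finite \(n\).

```python
import numpy as np

def estimate_v_alpha(scores, log_p, log_q, alpha, gamma=1.0):
    """Monte Carlo estimate of v_alpha(P) from samples x_i ~ P.

    scores[i] = s(x_i), log_p[i] = log P(x_i), log_q[i] = log Q(x_i).
    Illustrative sketch: assumes alpha is finite and not equal to 1.
    """
    scores = np.asarray(scores, dtype=float)
    log_ratio = np.asarray(log_p, dtype=float) - np.asarray(log_q, dtype=float)
    z = (alpha - 1) * log_ratio
    m = z.max()
    # log of the sample mean of (P/Q)^(alpha - 1), computed stably
    log_mean = m + np.log(np.mean(np.exp(z - m)))
    return float(scores.mean() - gamma * log_mean / (alpha - 1))

# Toy check against the exact value on a three-point space (made-up numbers).
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
s = np.array([1.0, 0.0, 2.0])
alpha, gamma = 2.0, 1.0

exact = float(P @ s - gamma / (alpha - 1) * np.log(np.sum(P * (P / Q) ** (alpha - 1))))

rng = np.random.default_rng(0)
idx = rng.choice(len(P), size=200_000, p=P)   # samples x_i ~ P
approx = estimate_v_alpha(s[idx], np.log(P[idx]), np.log(Q[idx]), alpha, gamma)
print(exact, approx)
```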
As \(\alpha\) approaches 1, this limits to
\[\frac{1}{n} \sum_{i=1}^n s(x_i) - \frac{\gamma}{n} \sum_{i=1}^n \log \frac{P(x_i)}{Q(x_i)} = \frac{1}{n} \sum_{i=1}^n \left(s(x_i) - \gamma \log P(x_i) + \gamma \log Q(x_i)\right)\]
As \(\alpha\) approaches \(\infty\), this limits to
\[\frac{1}{n} \sum_{i=1}^n s(x_i) - \gamma \log \max_i \frac{P(x_i)}{Q(x_i)}\]
Like a true quantilizer, a distribution \(P\) trained to maximize this value (an approximate quantilizer) will avoid assigning much higher probability to any action than \(Q\) does.
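Both limits of the sampled divergence term can be checked directly. As a sketch, write \(r_i = P(x_i)/Q(x_i)\) and \(\epsilon = \alpha - 1\) (notation introduced only for this derivation), and assume every \(r_i\) is finite and positive. As \(\epsilon \to 0\), expanding \(r_i^{\epsilon} = e^{\epsilon \log r_i} = 1 + \epsilon \log r_i + O(\epsilon^2)\) gives
\[\frac{1}{\epsilon} \log \left( \frac{1}{n} \sum_{i=1}^n r_i^{\epsilon} \right) = \frac{1}{\epsilon} \log \left( 1 + \frac{\epsilon}{n} \sum_{i=1}^n \log r_i + O(\epsilon^2) \right) \to \frac{1}{n} \sum_{i=1}^n \log r_i\]
For \(\epsilon > 0\), the sample mean of \(r_i^{\epsilon}\) lies between \(\frac{1}{n} \max_i r_i^{\epsilon}\) and \(\max_i r_i^{\epsilon}\), so
\[\log \max_i r_i - \frac{\log n}{\epsilon} \leq \frac{1}{\epsilon} \log \left( \frac{1}{n} \sum_{i=1}^n r_i^{\epsilon} \right) \leq \log \max_i r_i\]
and the middle term tends to \(\log \max_i r_i\) as \(\epsilon \to \infty\). For finite \(\alpha > 1\), the sampled penalty therefore falls short of the max-ratio penalty by at most \(\frac{\log n}{\alpha - 1}\).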
These approximations yield training objectives for agents which will interpolate between imitating \(Q\) and maximizing \(s\). What do we use these for? Patrick suggested that \(Q\) could be an estimate of the distribution of actions a human would take (trained using something like this training procedure). Then, the distribution \(P\) maximizing the combined objective \(v_{\alpha}\) will try to maximize score in a somewhat humanlike way; it will interpolate between imitation and score-maximization.
There are problems, though. Suppose a human and an AI can both solve Sudoku, but the AI can’t solve it the way a human would. Suppose the AI trains a distribution \(Q\) over ways of filling out the puzzle to imitate the human. \(Q\) will usually not solve the puzzle, since the AI can’t solve the puzzle the way a human would. Suppose the AI is choosing a distribution \(P\) over ways of filling out the puzzle to maximize a combined objective based on solving the puzzle and having low Rényi divergence from \(Q\). If \(\alpha = \infty\), then \(P\) will be an approximate quantilizer with base distribution \(Q\), so it is unlikely to solve the puzzle unless \(\gamma\) is very low (since \(Q\) very rarely solves the puzzle). With \(\alpha < \infty\), there is not much of a guarantee that the AI is solving the puzzle the way a human would; unlike a quantilizer, a distribution trained with \(\alpha < \infty\) may assign much higher probability to some ways of filling out the puzzle than \(Q\) does. Something like meeting halfway might be necessary to ensure that the AI solves the problem in a humanlike way.
