Rényi divergence as a secondary objective

discussion post by Jessica Taylor 984 days ago

Summary: Given that both imitation and maximization have flaws, it might be reasonable to interpolate between these two extremes. It’s possible to use Rényi divergence (a family of measures of distance between distributions) to define a family of these interpolations.

Rényi divergence is a measure of distance between two probability distributions. It is defined as

$D_\alpha(P || Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{X \sim P}\left[\left( \frac{P(X)}{Q(X)} \right)^{\alpha-1}\right]$

with $$\alpha \geq 0$$. Special values $$D_0, D_1, D_{\infty}$$ can be filled in by their limits. Particularly interesting values are $$D_1(P || Q) = D_{KL}(P || Q)$$ and $$D_{\infty}(P || Q) = \log \max_x \frac{P(x)}{Q(x)}$$.

Consider some agent choosing a distribution $$P$$ over actions to simultaneously maximize a score function $$s$$ and minimize Rényi divergence from some base distribution $$Q$$. That is, score $$P$$ according to

$v_{\alpha}(P) = \mathbb{E}_{X \sim P}[s(X)] - \gamma D_{\alpha}(P || Q)$

where $$\gamma > 0$$ controls how much the secondary objective is emphasized. Define $$P_\alpha^* = \arg\max_P v_\alpha(P)$$. We have $$P_1^*(x) \propto Q(x) e^{s(x) / \gamma}$$, and $$P_\infty^*$$ is a quantilizer with score function $$s$$ and base distribution $$Q$$ (with the amount of quantilization being some function of $$\gamma$$, $$Q$$, and $$s$$). For $$1 < \alpha < \infty$$, $$P_\alpha^*$$ will be some interpolation between $$P_1^*$$ and $$P_{\infty}^*$$.

It’s not necessarily possible to compute $$v_{\alpha}(P)$$ exactly.
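When the action space is small and finite, the quantities above can be computed exactly, which makes the closed form for $$P_1^*$$ easy to check numerically. The following is a minimal Python sketch under that assumption; the function names (`renyi_divergence`, `value`, `softmax_optimum`) and the toy numbers are illustrative and not from the post.

```python
import math


def renyi_divergence(p, q, alpha):
    """D_alpha(P || Q) for finite distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0. The cases alpha = 1 and
    alpha = inf are filled in by their limits (KL and log max ratio).
    """
    support = [(pi, qi) for pi, qi in zip(p, q) if pi > 0]
    if alpha == 1:
        return sum(pi * math.log(pi / qi) for pi, qi in support)
    if math.isinf(alpha):
        return math.log(max(pi / qi for pi, qi in support))
    # E_{X ~ P}[(P(X) / Q(X))^(alpha - 1)]
    moment = sum(pi * (pi / qi) ** (alpha - 1) for pi, qi in support)
    return math.log(moment) / (alpha - 1)


def value(p, q, s, alpha, gamma):
    """v_alpha(P) = E_P[s] - gamma * D_alpha(P || Q)."""
    expected_score = sum(pi * si for pi, si in zip(p, s))
    return expected_score - gamma * renyi_divergence(p, q, alpha)


def softmax_optimum(q, s, gamma):
    """Closed form for alpha = 1: P(x) proportional to Q(x) * exp(s(x) / gamma)."""
    weights = [qi * math.exp(si / gamma) for qi, si in zip(q, s)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical toy problem: three actions, base distribution q, scores s.
    q = [0.5, 0.3, 0.2]
    s = [0.0, 1.0, 2.0]
    gamma = 1.0
    p_star = softmax_optimum(q, s, gamma)
    print("P_1^* =", p_star)
    print("v_1(P_1^*) =", value(p_star, q, s, 1, gamma))
    print("v_1(Q)     =", value(q, q, s, 1, gamma))
```

On this toy example, `value(p_star, q, s, 1, gamma)` should agree with $$\gamma \log \sum_x Q(x) e^{s(x)/\gamma}$$, the standard closed form for the optimum of the KL-regularized objective.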
To approximate $$v_{\alpha}(P)$$, take samples $$x_1, ..., x_n \sim P$$ and compute

$\frac{1}{n} \sum_{i=1}^n s(x_i) - \frac{\gamma}{\alpha - 1} \log \left( \frac{1}{n} \sum_{i=1}^n \left( \frac{P(x_i)}{Q(x_i)} \right)^{\alpha - 1} \right)$

Of course, this requires $$P$$ and $$Q$$ to be specified in a form that allows efficiently estimating probabilities of particular values. For example, $$P$$ and $$Q$$ could both be variational autoencoders.

As $$\alpha$$ approaches 1, this limits to

$\frac{1}{n} \sum_{i=1}^n s(x_i) - \frac{\gamma}{n} \sum_{i=1}^n \log \frac{P(x_i)}{Q(x_i)} = \frac{1}{n} \sum_{i=1}^n (s(x_i) - \gamma \log P(x_i) + \gamma \log Q(x_i))$

As $$\alpha$$ approaches $$\infty$$, this limits to

$\frac{1}{n} \sum_{i=1}^n s(x_i) - \gamma \log \max_i \frac{P(x_i)}{Q(x_i)}$

Like a true quantilizer, a distribution $$P$$ trained to maximize this value (an approximate quantilizer) will avoid assigning much higher probability to any action than $$Q$$ does.

These approximations yield training objectives for agents which will interpolate between imitating $$Q$$ and maximizing $$s$$.

What do we use these for? Patrick suggested that $$Q$$ could be an estimation of the distribution of actions a human would take (trained using something like this training procedure). Then, the distribution $$P$$ maximizing the combined objective $$v_{\alpha}$$ will try to maximize score in a somewhat human-like way; it will interpolate between imitation and score-maximization.

There are problems, though. Suppose a human and an AI can both solve Sudoku, but the AI can’t solve it the way a human would. Suppose the AI trains a distribution $$Q$$ over ways of filling out the puzzle to imitate the human. $$Q$$ will usually not solve the puzzle, since the AI can’t solve the puzzle the way a human would. Suppose the AI is choosing a distribution $$P$$ over ways of filling out the puzzle to maximize a combined objective based on solving the puzzle and having low Rényi divergence from $$Q$$.
If $$\alpha = \infty$$, then $$P$$ will be an approximate quantilizer with base distribution $$Q$$, so it is unlikely to solve the puzzle unless $$\gamma$$ is very low (since $$Q$$ very rarely solves the puzzle). With $$\alpha < \infty$$, there is not much of a guarantee that the AI is solving the puzzle the way a human would; unlike a quantilizer, a distribution trained with $$\alpha < \infty$$ may assign much higher probability to some ways of filling out the puzzle than $$Q$$ does. Something like meeting halfway might be necessary to ensure that the AI solves the problem in a human-like way.
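
For concreteness, the sample-based estimates of $$v_\alpha(P)$$ described earlier in the post can be sketched in Python. This is only an illustration under stated assumptions: `estimate_value`, `score`, `log_p`, and `log_q` are hypothetical names, the caller is assumed to be able to evaluate $$\log P(x)$$ and $$\log Q(x)$$ for any sample, and `gamma` defaults to 1.

```python
import math
import random


def estimate_value(samples, score, log_p, log_q, alpha, gamma=1.0):
    """Sample-based estimate of v_alpha(P) from samples x_1..x_n drawn from P.

    score(x), log_p(x), and log_q(x) are caller-supplied callables
    (hypothetical interface, not from the post).
    """
    n = len(samples)
    mean_score = sum(score(x) for x in samples) / n
    log_ratios = [log_p(x) - log_q(x) for x in samples]
    if alpha == 1:
        # Limit alpha -> 1: average log ratio (a KL estimate).
        divergence = sum(log_ratios) / n
    elif math.isinf(alpha):
        # Limit alpha -> infinity: log of the largest sampled ratio.
        divergence = max(log_ratios)
    else:
        # log of the mean of ratio^(alpha - 1), shifted by the largest
        # term so the exponentials do not overflow.
        scaled = [(alpha - 1) * r for r in log_ratios]
        top = max(scaled)
        log_mean = top + math.log(sum(math.exp(z - top) for z in scaled) / n)
        divergence = log_mean / (alpha - 1)
    return mean_score - gamma * divergence


if __name__ == "__main__":
    # Hypothetical toy setup: three actions, score equal to the action index.
    p = [0.25, 0.25, 0.5]
    q = [0.5, 0.3, 0.2]
    rng = random.Random(0)
    samples = rng.choices(range(3), weights=p, k=1000)
    for alpha in (1, 2, 10, math.inf):
        estimate = estimate_value(
            samples,
            score=float,
            log_p=lambda x: math.log(p[x]),
            log_q=lambda x: math.log(q[x]),
            alpha=alpha,
        )
        print(alpha, estimate)
```

Working with log ratios rather than raw ratios is a design choice made here for numerical stability. For a fixed set of samples the divergence term is nondecreasing in $$\alpha$$, so the estimate can only fall as $$\alpha$$ grows.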

by Patrick LaVictoire 983 days ago

Note also that $$v_0(P)$$ is maximized when the support of $$P$$ covers the support of $$Q$$ and when $$s$$ has a high average on $$P$$. That is, it’s at most $$\epsilon$$ from maximized when $$P$$ is $$(1-\epsilon)$$ times a delta function on an $$s$$-maximizing point, plus $$\epsilon$$ times the distribution of $$Q$$. So $$v_0$$ essentially corresponds to a raw maximizer, and $$v_\alpha$$ for $$0<\alpha<1$$ interpolates between maximizing and softmax.
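
One way to spell out the step in this comment, as a sketch that assumes a finite action space (the symbols $$P_\epsilon$$ and $$x^*$$ are introduced here for illustration): taking the limit $$\alpha \to 0$$ in the definition of Rényi divergence gives

$D_0(P || Q) = -\log Q(\{x : P(x) > 0\})$

which is nonnegative, and zero exactly when the support of $$P$$ covers the support of $$Q$$. For the mixture $$P_\epsilon = (1-\epsilon)\delta_{x^*} + \epsilon Q$$ with $$x^* \in \arg\max_x s(x)$$ and $$\epsilon > 0$$, the divergence term vanishes, so

$v_0(P_\epsilon) = (1-\epsilon) s(x^*) + \epsilon \mathbb{E}_{X \sim Q}[s(X)] = \max_x s(x) - \epsilon \left( \max_x s(x) - \mathbb{E}_{X \sim Q}[s(X)] \right)$

Since $$v_0(P) \leq \max_x s(x)$$ for every $$P$$, the gap to the optimum is at most $$\epsilon$$ whenever the scores lie in an interval of length at most 1, for example $$s(x) \in [0, 1]$$.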
