Informed oversight through an entropy-maximization objective discussion post by Jessica Taylor 1085 days ago | 9 comments The informed oversight problem is a serious challenge for approval-directed agents (I recommend reading the post if you haven’t already). Here is one approach to the problem that works by adding an entropy-maximization objective. Let agent B be overseeing agent A. It seems that some of the problem is that A has many different possible strategies that B evaluates as good. Thus, A may choose among these good-looking strategies arbitrarily. If some of the good-looking strategies are actually bad, then A may choose one of these bad strategies. This is not a problem if B’s evaluation function $$B(x, \cdot)$$ has a single global maximum, and solutions significantly different from this one are necessarily rated as worse. It would be nice to have a general way of turning a problem with multiple global maxima into one with a unique global maximum. Here’s one attempt at going this. Given the original evaluation function $$b$$ mapping strings to reals, construct a new evaluation function $$v$$ mapping distributions over strings to reals. Specifically, for some other distribution of strings $$f$$ and a constant $$\gamma > 0$$, define $v(d) = -\gamma D_{KL}(f || d ) + \mathbb{E}_{y \sim d} [ b(y) ] = \gamma H(d) + \mathbb{E}_{y \sim d}[ \gamma \log f(y) + b(y) ]$ where the equality holds because $$D_{KL}(f || d) = H(d, f) - H(d) = -\mathbb{E}_{y \sim d}[\log f(y)] - H(d)$$. Observe that $$v$$ is strongly concave, so it has a single global maximum and no other local maxima. This global maximum is $d(y) \propto f(y) e^{b(y) / \gamma}$ So the optimal solution to this problem is to choose a distribution that is somewhat similar to $$f$$ but overweights $$y$$ values with a high $$b(y)$$ value (with the rationality parameter $$\gamma$$ determining how much to overweight). The higher $$\gamma$$ is, the more strongly concave the problem is and the more $$d$$ will imitate $$f$$; the lower $$\gamma$$ is, the more this problem looks like the original $$b$$-maximization problem. This interpolation is similar to quantilization, but is somewhat different mathematically. Intuitively, optimizing $$v$$ seems harder than optimizing $$b$$: the distribution $$d$$ must be able to provide all possible good solutions to $$b$$, rather than just one. But I think standard reinforcement learning algorithms can be adapted to optimize $$v$$. Really, you just need to optimize $$H(d) + \mathbb{E}_{y \sim d}[b(y)]$$, for some $$b$$, since you can wrap the $$\gamma \log f(y)$$ and $$b(y)$$ terms together into a single function. So the agent must be able to maximize the sum of some original objective $$b$$ and the entropy of its own actions. Consider Q-learning. An agent using Q-learning, in a given state $$s$$, will take the action $$a$$ that maximizes $$Q(s, a)$$, which is the expected total reward resulting from taking this action in this state (including utility from all future actions). Instead of choosing an action $$a$$ to maximize $$Q(s, a)$$, suppose the agent chooses a distribution over actions $$d$$ to maximize $$H(d) + \mathbb{E}_{a \sim d}[Q(s, a)]$$. Then the agent takes a random action $$a \sim d$$ and receives the normal reward plus an extra reward equal to $$H(d)$$ (so that the learned $$Q$$ takes into account the entropy objective). As far as I can tell, this algorithm works for maximizing the original reward plus the entropy of the agent’s sequence of actions. I’m not sure how well this works as a general solution to the informed oversight problem. It replaces the original objective with a slightly different one, and I don’t have intuitions about whether this new objective is harder to optimize than the original one. Still, it doesn’t seem a lot harder to optimize. I’m also not sure whether it’s always possible to set $$\gamma$$ low enough to incentivize good performance on the original objective $$b$$ and high enough for $$v$$ to be strongly concave enough to isolate a unique solution. It’s also not clear whether the strong concavity will be sufficient: even if the global maximum of $$v$$ is desirable, other strategies A might use could approximately optimize $$v$$ while being bad. So while there’s some selection pressure against bad strategies, it might not be enough.

 by Paul Christiano 1084 days ago | link I like the idea of adding in additional constraints. I think it is complicated a lot by the boundedness of the agent—finding the actual optimum will naturally be intractable, so the question is what it looks like to imperfectly optimize this objective. I’m not sure it does what we want (in the toy model from my post, I think that you may just end up with $$O(zxx)$$ for a random $$z$$). But I think the most fundamental difficulty is approximating the entropy during training. One approach is to train a second model to estimate the density of $$d$$, and then use the density estimates to compute the payoff $$-log(d(y))$$. But this actually won’t work here, for precisely the same reason that providing oversight is so hard. Our model for estimating the density of $$d$$ also can’t tell the difference between the good actions and the bad actions, even if $$d$$ is exclusively outputting bad actions, and so can’t actually penalize $$d$$ for outputting a bad action (unless it also penalizes $$d$$ for outputting a good action). As far as I can tell, most approaches to training $$d$$ run into this problem in one form or another. reply
 by Jessica Taylor 1084 days ago | link I propose training $$d$$ using something like this: Suppose $$d$$ is outputting a binary string of length $$n$$. Then it starts by outputting a probability. The first bit gets sampled using this probability. The agent is told this bit, and then it outputs the probability for the next bit. Then the second bit gets sampled using this probability. And so on, until $$n$$ bits are output. Now it’s easy to estimate the entropy of $$d$$. Perhaps outputting the distribution over strings bit-by-bit like this is difficult. But there are other ways of representing distributions that allows sampling from them and estimating their entropy. Consider the following algebraic data type, meant to represent distributions: data Distr a where Unit :: Distr () Flip :: Double -> Distr Bool InvFmap :: (a <-> b) -> Distr a -> Distr b Samp :: (a -> Distr b) -> Distr a -> Distr (a, b) Unit is a delta distribution on (); Flip flips a coin with a certain probability; InvFmap composes a distribution with an invertible function (which could be represented as a pair of functions $$f$$ and $$g$$; to run the invertible function on an input $$x$$, take $$f(x)$$ and make sure that $$g(f(x)) = x$$ or else throw an error); Samp is similar to the monadic bind >>= but keeps the original sample of $$a$$ around; a sample from Samp f a is (x, y) where $$x \sim a$$ and $$y \sim f(x)$$. The intention is to have everything be fully reversible, so there is no way to throw away information. (note that it’s also possible to have Unsamp :: (a -> Distr b) -> Distr (a, b) -> Distr a, intended as an inverse of Samp; I can give more details if you’re interested). reply
 by Jessica Taylor 1084 days ago | link Additional note: I think you can implement the approach I sketched (which I’m calling “reversible probabilistic programming”) using autoencoders. Represent each layer of the autoencoder by a single-layer neural net f : a -> Distr b and approximate inverse g : b -> Distr a. Given a distribution for the previous layer x : Distr a, get the distribution for the next layer by taking Unsamp g (InvFmap ($$x, y) -> (y, x)) (Samp f x)) :: Distr b. Compose a lot of these together to get a multi-layer generative model. This seems simple enough that there’s probably a simple more direct way to estimate the entropy of a generative model represented by an autoencoder. (actually, I think this might not work, because the inverses won’t be very accurate, but maybe something like this works?) reply  by Paul Christiano 1070 days ago | link How do you estimate the entropy of this distribution? If you want to force the computation to end up with no garbage, then this significantly restricts the range of functions you compute, since the sampling process was invertible. In particular, they need to be functions such that you can compute the agent’s entire computation history only given the output, which feels like a restriction that may be dodging the hard cases of the informed oversight problem. If you allow the agent to end up with garbage, then you can calculate the probability of that particular garbage+outputs, but the entropy of that distribution will be unboundedly larger than the entropy of the distribution over outputs. reply  by Jessica Taylor 1069 days ago | link The approach I have in mind is (roughly) to let the agent output some number of bits of garbage, but penalize for the number of bits of garbage (so generating additional uniformly random garbage doesn’t make a difference to the score). I think this can be done using autoencoders (use layer \(n+1$$ to compress layer $$n$$ into a small number of bits of garbage). It’s not clear whether this approach is practical for complex agents, though. reply
 by Paul Christiano 1069 days ago | link In the OWF example, the garbage is necessarily low-entropy though (at least $$k$$ bits short on entropy, where $$k$$ is the size of advice needed to invert the OWF). Right? reply
 by Jessica Taylor 1069 days ago | link Yes, that seems right. So this won’t work for that example. reply
 by Paul Christiano 1070 days ago | link finding the actual optimum will naturally be intractable, so the question is what it looks like to imperfectly optimize this objective Actually even getting the actual optimum doesn’t seem to really address the problem. You are then OK if the unintended solutions are rare. But it’s easy to imagine cases where they are dense (e.g. where good plans are highly constrained, but plans with a subtle bug in one step have lots of degrees of freedom in the other steps). So at a minimum it seems like this will only address some cases, or would require some additional analysis. reply
 by Jessica Taylor 1069 days ago | link This seems correct. While an approach like this could isolate one particular distribution over solutions to the problem as the optimal one, you also have to engineer the problem such that this distribution over solutions is desirable. reply

### NEW DISCUSSION POSTS

[Note: This comment is three
 by Ryan Carey on A brief note on factoring out certain variables | 0 likes

There should be a chat icon
 by Alex Mennen on Meta: IAFF vs LessWrong | 0 likes

Apparently "You must be
 by Jessica Taylor on Meta: IAFF vs LessWrong | 1 like

There is a replacement for
 by Alex Mennen on Meta: IAFF vs LessWrong | 1 like

Regarding the physical
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think that we should expect
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think I understand your
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

This seems like a hack. The
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

After thinking some more,
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yes, I think that we're
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

My intuition is that it must
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

To first approximation, a
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Actually, I *am* including
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yeah, when I went back and
 by Alex Appel on Optimal and Causal Counterfactual Worlds | 0 likes

> Well, we could give up on
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes