Quantilizers maximize expected utility subject to a conservative cost constraint
post by Jessica Taylor 790 days ago

Summary: this post reframes ideas from “Learning a concept using only positive examples” and “The AI, the best human advisor” in terms of a strong constraint for conservative behavior, and shows that quantilizers are optimal under this constraint.

## Defining a quantilizer

A $$q$$-quantilizer is an agent that, when faced with a decision problem, returns a random action from the top $$q$$ proportion of some base distribution over actions, ranked by expected utility. More formally, let $$A$$ be the action set and $$p_A$$ be the base distribution over actions. For any $$0 < q \leq 1$$, a $$q$$-quantilizer returns an action $$a$$ with probability

$P(a) = 0 \text{ if } \sum_{a' \in A} p_A(a') [ \mathbb{E}[U(a', W)] > \mathbb{E}[U(a, W)] ] > q$

$P(a) = p_A(a) / q \text{ if } \sum_{a' \in A} p_A(a') [ \mathbb{E}[U(a', W)] \geq \mathbb{E}[U(a, W)] ] < q$

$P(a) = \frac{1}{q}\frac{p_A(a)}{\sum_{a' \in T} p_A(a')} \left( q - \sum_{a' \in A} p_A(a') [ \mathbb{E}[U(a', W)] > \mathbb{E}[U(a, W)] ]\right) \text{ otherwise}$

(where $$T$$ is the set of actions falling into this third case)

where $$U$$ is the quantilizer’s utility function. The logic in the third case handles ties at the cutoff; the details are not especially important. In the limit as $$q \to 0$$, a quantilizer becomes a maximizer. A 0.5-quantilizer (which we could also call a 50-percentilizer) samples a random above-median action, a 0.01-quantilizer samples a random action from the top 1%, and a 1-quantilizer samples a random action from $$p_A$$.
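For a finite action set, the definition above can be sketched in a few lines of Python. This is a minimal illustration (not from the original post); it sidesteps the tie-handling third case by assuming all expected utilities are distinct:

```python
import random

def quantilize(actions, p, expected_utility, q, rng=random):
    """Sample an action from the top-q proportion of p by expected utility.

    A minimal sketch for finite action sets; it ignores the tie-handling
    case in the formal definition by assuming distinct expected utilities.
    """
    # Rank actions from best to worst by expected utility.
    ranked = sorted(actions, key=expected_utility, reverse=True)
    # Collect base-distribution mass until we cover the top q proportion.
    top, mass = [], 0.0
    for a in ranked:
        top.append(a)
        mass += p[a]
        if mass >= q:
            break
    # Sample from p restricted (and renormalized) to the top-q set.
    return rng.choices(top, weights=[p[a] for a in top], k=1)[0]
```

For instance, with ten equiprobable actions whose utility equals their index, `quantilize` with `q = 0.2` only ever returns actions 8 or 9.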

## The conservative cost constraint

Why might we use a quantilizer? Suppose we believe there is some unknown cost function $$c : A \rightarrow \mathbb{R}^+$$, which quantifies the negative contribution of the action to our utility function (e.g. we could define $$c(a) = \max\{0, \mathbb{E}[V(\text{do nothing}, W)] - \mathbb{E}[V(a, W)]\}$$, where $$V$$ is our true utility function). For example, an action that turns the universe into paperclips would be very costly. As explained previously in “Learning a concept using only positive examples”, the fact that an agent’s distribution over actions fits under $$p_A$$ indicates that it does not accrue many times more expected cost than would be accrued by sampling from $$p_A$$. It turns out that for $$0 < q \leq 1$$, a $$q$$-quantilizer maximizes expected utility subject to the conservative cost constraint

$\forall (c : A \rightarrow \mathbb{R}^+) \frac{\sum_{a \in A} P(a) c(a)}{\sum_{a \in A} p_A(a) c(a)} \leq 1/q$

The cost constraint states that, for any cost function, the expected cost of an action drawn from $$P$$ is no more than $$1/q$$ times the expected cost of an action drawn from $$p_A$$.
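As a quick numerical sanity check (an illustration, not part of the original argument), we can compute a $$q$$-quantilizer's full distribution over a small finite action set and verify that the cost ratio stays below $$1/q$$ for arbitrary nonnegative cost functions:

```python
import random

def quantilizer_dist(p, eu, q):
    """Return the q-quantilizer's distribution P over a finite action set.

    A minimal sketch assuming distinct expected utilities: P(a) = p(a)/q on
    the top-q set (with the boundary action's mass clipped) and 0 elsewhere.
    """
    ranked = sorted(p, key=eu, reverse=True)
    P, mass = {a: 0.0 for a in p}, 0.0
    for a in ranked:
        take = max(0.0, min(p[a], q - mass))  # clip mass at the q boundary
        P[a] = take / q
        mass += take
        if mass >= q:
            break
    return P

# Verify the conservative cost constraint E_P[c] <= (1/q) E_{p_A}[c]
# for many random nonnegative cost functions.
q = 0.25
actions = range(20)
p = {a: 1 / 20 for a in actions}
P = quantilizer_dist(p, lambda a: a, q)
rng = random.Random(0)
for _ in range(100):
    c = {a: rng.random() for a in actions}
    assert sum(P[a] * c[a] for a in actions) <= \
        (1 / q) * sum(p[a] * c[a] for a in actions) + 1e-9
```

Here the top 5 of 20 uniform actions each get probability $$0.05 / 0.25 = 0.2$$, and no nonnegative cost function can push the ratio above 4.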

## Optimality of quantilizers under the cost constraint

From the cost constraint we can prove $\forall (a \in A) P(a) \leq p_A(a) / q$

Suppose this were not true. Then for some $$a^*$$ we have $$P(a^*) > p_A(a^*) / q$$. Now define $$c(a) = [a = a^*] / p_A(a^*)$$. We have $\sum_{a \in A} p_A(a) c(a) = p_A(a^*) c(a^*) = 1$ Also, $\sum_{a \in A} P(a) c(a) = P(a^*) c(a^*) > p_A(a^*) c(a^*) / q = 1 / q$ Jointly, these statements contradict the cost constraint.

Since we cannot assign more than $$p_A(a) / q$$ probability to any action $$a$$, the best we can do is assign exactly $$p_A(a) / q$$ probability to the best actions and 0 to the rest. This recovers the definition of a quantilizer.
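The adversarial cost function from the proof can be checked concretely. Here is a hypothetical two-action instance (the names and numbers are illustrative) where $$P$$ over-weights one action relative to $$p_A(a)/q$$, and the proof's cost function pushes the cost ratio above $$1/q$$:

```python
# Hypothetical two-action example: P puts more than p_A(a)/q mass on "a",
# and the proof's cost function c(x) = [x == "a"] / p_A("a") then pushes
# the cost ratio above 1/q.
q = 0.25
p_A = {"a": 0.1, "b": 0.9}
P = {"a": 0.5, "b": 0.5}            # violates P("a") <= p_A("a") / q = 0.4
c = {"a": 1 / p_A["a"], "b": 0.0}   # concentrate all cost on "a"

base_cost = sum(p_A[x] * c[x] for x in p_A)  # = 1
agent_cost = sum(P[x] * c[x] for x in P)     # = P("a") / p_A("a") = 5

assert abs(base_cost - 1.0) < 1e-9
assert agent_cost > 1 / q  # 5 > 4: the constraint is violated
```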

## Repeated quantilization of games is not quantilization of repeated games

Suppose we have two independent games. In game $$i$$, we choose an action $$a$$ and receive utility $$U(a, W_i)$$, where each $$W_i$$ independently comes from some distribution over worlds. We could $$q$$-quantilize each game independently, choosing from the best $$q$$ proportion of actions in each game separately. Alternatively, we could $$q$$-quantilize the pair of games jointly. Here we define $$W' = (W_1, W_2)$$ and $$U'((a_1, a_2), (w_1, w_2)) = U(a_1, w_1) + U(a_2, w_2)$$. The expected utility of an action pair is $$\mathbb{E}[U'((a_1, a_2), W')] = \mathbb{E}[U(a_1, W_1)] + \mathbb{E}[U(a_2, W_2)]$$

A joint $$q$$-quantilizer will sample a random pair of actions $$(a_1, a_2)$$ from the best $$q$$ proportion of action pairs. For example, if each game consists of saying a number in the unit interval and then gaining that much utility, and $$p_A$$ is uniform, then a 12.5-percentilizer will sample uniformly from the set $$\{(a_1, a_2) \in [0, 1]^2 \mid a_1 + a_2 \geq 1.5\}$$, a triangle of area $$1/8$$. On the other hand, if we 12.5-percentilize each game independently, we will sample our action pair uniformly from the set $$[0.875, 1]^2$$.

What could explain this difference? If all of the cost comes from action pairs in the set $$[0.875, 1]^2$$ (so $$c(a_1, a_2) = 0$$ outside this set), then uniformly sampling from $$[0.875, 1]^2$$ accrues 64 times the expected cost of sampling from $$p_A$$ twice independently, since that set has probability $$1/64$$ under independent sampling. On the other hand, uniformly sampling from $$\{(a_1, a_2) \in [0, 1]^2 \mid a_1 + a_2 \geq 1.5\}$$ accrues only 8 times that cost, since the triangle has probability $$1/8$$ and contains all of $$[0.875, 1]^2$$.
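These two ratios can be verified with exact rational arithmetic (a sketch under the stated assumption that all cost lives on $$[0.875, 1]^2$$):

```python
from fractions import Fraction as F

# All cost is concentrated on the square [0.875, 1]^2, which has area 1/64.
cost_area = F(1, 64)

# Baseline: two independent draws from p_A (uniform on [0,1]) hit the
# cost region with probability 1/64.
p_base = cost_area

# Independent 1/8-quantilization samples uniformly from [0.875, 1]^2,
# so it hits the cost region with probability 1: 64x the baseline.
p_indep = F(1)

# Joint 1/8-quantilization samples uniformly from {a1 + a2 >= 1.5}, a
# region of area 1/8 that contains all of [0.875, 1]^2 (0.875 + 0.875 = 1.75).
p_joint = cost_area / F(1, 8)

assert p_indep / p_base == 64
assert p_joint / p_base == 8
```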

Suppose we added the assumption that $$c$$ is additive over the two games, so $$c(a_1, a_2) = c'(a_1) + c'(a_2)$$ for some $$c' : A \rightarrow \mathbb{R}^+$$. Then we get the additive cost constraint $\forall (c' : A \rightarrow \mathbb{R}^+) \frac{\sum_{a_1, a_2 \in A} P(a_1, a_2) (c'(a_1) + c'(a_2))}{\sum_{a_1, a_2 \in A} p_A(a_1)p_A(a_2) (c'(a_1) + c'(a_2))} \leq 1/q$

Then a repeated $$q$$-quantilizer would get total expected cost $\mathbb{E}_P[c(A_1, A_2)] = 2\mathbb{E}_P[c'(A_1)] \leq \frac{2}{q} \mathbb{E}_{p_A}[c'(A_1)] = \frac{1}{q} \mathbb{E}_{p_A}[c(A_1, A_2)]$ so repeated quantilization would be considered acceptable under the additive cost constraint.

However, adding this assumption may be dangerous. It is possible to create scenarios in which almost all of the cost comes from correlation between actions in different trials.

For example, say we’re creating a stock-trading AI. Every minute, the AI decides either to buy or to sell a single share of Google. We set $$p_A$$ to be a uniform distribution over $$\{ \text{buy}, \text{sell} \}$$. For some reason, there is another computer somewhere that detects whether the AI bought or sold a share each minute and appends a bit to a file accordingly (0 for buy, 1 for sell). After collecting a million bits, it runs the file (say, as a Python program).

Now suppose the AI sells only on minutes corresponding to 1 bits in a UFAI’s source code (which is less than a million bits long), and buys otherwise. This would cause the UFAI to be run. Since $$p_A$$ makes any one of these decisions (to buy or sell a share in a particular minute) with probability 50%, if we assumed that the cost were additive over iterations, we would conclude that this AI is only twice as costly as independent samples from $$p_A$$ would be. While independent quantilization would not lead to this outcome, this hypothetical scenario shows that assuming that the cost function is additive can be very bad.
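A toy numerical version of this gap (with 20 decisions standing in for a million, so the numbers stay readable): the additive accounting bounds the cost ratio by 2, while the true ratio, driven by the probability of $$p_A$$ emitting the exact dangerous bit string, is $$2^n$$:

```python
# Toy version of the stock-trading scenario: n binary buy/sell decisions,
# p_A uniform over {buy, sell} each minute, and all cost concentrated on a
# single specific n-bit sequence (the one that writes out the UFAI program).
n = 20  # stand-in for "a million bits"

# Additive accounting: each decision the AI makes deterministically has
# p_A-probability 1/2, so the per-decision (and hence total additive)
# cost ratio is bounded by 2.
per_decision_ratio = 1 / 0.5

# True (non-additive) cost ratio: the deterministic policy emits the
# dangerous sequence with probability 1, while p_A emits it with
# probability 2**-n, so the ratio is 2**n.
true_ratio = 2 ** n

assert per_decision_ratio == 2
assert true_ratio == 1_048_576
```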

## Are there reasonable less-restrictive constraints on cost that allow something like independent quantilization?

Intuitively, it seems like independent quantilization is usually not dangerous. Perhaps there is some cost constraint that allows independent quantilization (or something like it) while forbidding the UFAI encoder. We want something less restrictive than the original cost constraint, but more restrictive than the additive cost constraint.

## Can we use this framework to define conservative classifiers (and other learners)?

We would like to create classifiers that learn from training data and, when deciding what to do with a new example, either classify it (if it looks like the training data) or refuse to classify it (if it looks quite different from the training data). For example, if we train an AI to classify smiling humans as positive and frowning humans as negative, perhaps we want it to refuse to classify molecular smiley faces as either positive or negative. These classifiers should usually classify points when they come from the same distribution as the training data, and they should obey some kind of conservative cost constraint. I am working on writing up a solution to this for a future forum post.

## When do “ordinary” $$p_A$$ distributions have high cost?

It could be that most ordinary actions that people take are either very good or very bad (and a superintelligence could predict which one it is), due to butterfly effects. In this case, the expected cost of sampling from fairly ordinary $$p_A$$ distributions is actually quite high, but it is cancelled out by an equally high expected benefit. However, a quantilizer may preserve most of the expected cost of $$p_A$$ without preserving the expected benefit. How much of a problem is this in practice, and how can it be reduced (e.g. is it possible to use false thermodynamic miracles to interfere with butterfly effects)?

## Can we use something like a quantilizer to “win”?

How could an organization with access to superintelligent AI use a conservative agent (such as a quantilizer or a conservative classifier) to drastically reduce existential risk (e.g. by shutting down rogue AI projects)? It would be useful to know what kind of capabilities would be necessary to solve the strategic situation, so that we can see if a conservative agent can deliver these capabilities.

by Paul Christiano 747 days ago:

In general it seems like there is no tractable algorithm that samples from the given space, or performs well on the given conservative cost maximization game. On top of that, it’s not clear how you would train a conservative maximizer of this kind; e.g. even if you have a simple model class that contains a good quantilizer, you don’t have access to the objective and so can’t train your model in any normal way.

The most obvious solution to both problems (and a few others) is to maximize over efficient and learnable cost functions $$c$$ rather than all cost functions. This almost immediately gives you a practical algorithm which we could begin to test (you can train both $$c$$ and the quantilizer in parallel, e.g. by gradient descent). It also looks a lot closer to things that people do in practice: this is a sensible-looking adjustment to normal apprenticeship learning.
by Jessica Taylor 737 days ago:

If $$q$$ is not too low, then you can do this by taking a bunch of samples and evaluating them by expected utility. Of course, it might be expensive to evaluate this many samples.

I think that you can also do this with an adversarial game, as in your post on mimicry. You can have one system that takes some action, and another system that bets at some odds that the action was produced by the AI rather than the base distribution. This seems to work without learning the cost function.
by Paul Christiano 732 days ago:

I was imagining the case where $$O(q^{-1})$$ is too slow, i.e. where we want the AI to actually perform a search. The second paragraph is what I had in mind. Note that in this case you are maximizing over learnable cost functions rather than all cost functions.
