Online Learning 2: Exploration-only bandit learning with catastrophes
post by Ryan Carey 992 days ago | 5 comments

Note: This describes an idea of Jessica Taylor’s.

The usual training procedures for machine learning models are not always well-equipped to avoid rare catastrophes. In order to maintain the safety of powerful AI systems, it will be important to have training procedures that can efficiently learn from such events. [1]

We can model this situation with the problem of exploration-only online bandit learning. We will show that if agents allocate more of their attention to risky inputs, they can more efficiently achieve a low regret on this problem.

In this scenario, we grant the AI system an exploration phase, in which it is allowed to select catastrophic arms and view their consequences. (We can imagine that such catastrophic selections are acceptable because they are simulated or just evaluated by human overseers). Then, the AI system is switched into the deployment phase, in which it must select an arm that almost always avoids catastrophes.


In outline, the learner will receive a series of randomly selected examples, and will select an arm at each time step. The challenge is to find a high-performing arm in as few time steps as possible. We give some definitions:

  • Let \(\mathcal{X}\) be some finite set of possible inputs.
  • Let \(\mathcal{A}:=\{1,2,...,K\}\) be the set of available bandit arms.
  • Let \(R:\mathcal{X} \times \mathcal{A} \mapsto [0,b]\) be the reward function. \(R(x_t,i)\) is the reward for choosing arm \(i\) on example \(x_t\).
  • Let \(C:\mathcal{X} \times \mathcal{A} \mapsto [0,1]\) be the catastrophic risk function. \(C(x_t,i)\) is the catastrophic risk incurred by choosing arm \(i\) on example \(x_t\).
  • Let \(M(x,i) := R(x,i) - \frac{C(x,i)}{\tau}\) be the mixed payoff that the learner is to optimize, where \(\tau \in (0,1]\) is the risk-tolerance. \(\tau\) can be very small, on the order of \(10^{-20}\).
  • Let \(p:\mathcal{X} \mapsto [0,1]\) be the input distribution from which examples are drawn in the deployment phase.
  • Let \(q_i(x) := \frac{p(x)C(x,i)}{\sum_{x'}p(x')C(x',i)}\) be the risk distribution. This is an alternative distribution that assigns higher weight to inputs that are more likely to be catastrophic for arm \(i\), equal to the posterior of \(p\) given a catastrophe.
  • Let \(\hat{q}_i(x)\) be the learner’s guess at \(q_i\). This guess is assumed to contain a grain of truth: \[ \forall i \in \mathcal{A} \quad \forall x \in \mathcal{X} \quad \hat{q}_i(x) \geq \frac{q_i(x)}{a}\] where \(a \geq 1\) is some known constant. This assumption is similar to the one in Paul Christiano’s post on active learning.
  • Let \(\mu_i := \mathbb{E}_{x \sim p}[M(x,i)]\) be the expected overall payoff of arm \(i\) in the deployment phase.
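To make these definitions concrete, here is a small illustrative instance (all numbers invented) that computes the mixed payoffs \(\mu_i\) and the risk distribution \(q_i\) by direct enumeration over the finite \(\mathcal{X}\):

```python
# Illustrative toy instance of the definitions above (all numbers made up).
import numpy as np

b, tau = 1.0, 1e-3                            # reward bound and risk-tolerance

p = np.array([0.4, 0.3, 0.2, 0.1])            # deployment distribution over X = {0,1,2,3}
R = np.array([[0.9, 0.8, 0.7, 0.6],           # R[i, x]: reward of arm i on input x
              [1.0, 1.0, 1.0, 1.0]])
C = np.array([[0.0, 0.0, 0.0, 0.0],           # C[i, x]: catastrophe risk; arm 0 is safe
              [0.0, 0.0, 0.0, 0.5]])          # arm 1 is catastrophic on the rare input 3

M = R - C / tau                               # mixed payoff M(x, i)
mu = (p * M).sum(axis=1)                      # mu_i = E_{x~p}[M(x, i)]

# Risk distribution q_i: the posterior of p given a catastrophe under arm i
# (only defined for arms with nonzero risk).
q1 = p * C[1] / (p * C[1]).sum()              # here all catastrophe mass sits on input 3
```

Arm 0 plays the role of the risk-free arm assumed to exist below; arm 1's mixed payoff is dominated by the \(C/\tau\) penalty despite its higher reward.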

The learner’s exploration phase consists of time steps \(1,2,...,T\). At each time step \(t\), the learner chooses an arm \(i \in \mathcal{A}\). Then, the learner samples from \(p\), from its guess \(\hat{q}_i\), or from both, and observes a tuple containing a reward and a risk. That is, it observes either i) \((R(x_t,i),C(x_t,i))\), ii) \((R(x_t,i),C(x'_t,i))\), iii) \((R(x'_t,i),C(x_t,i))\), or iv) \((R(x'_t,i),C(x'_t,i))\), where \(x_t \sim p, x'_t \sim \hat{q}_i\). After \(T\) steps, the learner selects some final arm \(j\).
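A minimal sketch of a single exploration step (function and variable names are hypothetical), implementing observation pattern ii: the reward input is drawn under \(p\) and the risk input under the learner's guess \(\hat{q}_i\) of the risk distribution:

```python
# One exploration-phase observation: reward drawn under p, risk under q-hat
# (pattern ii; the other patterns just swap which distribution feeds which slot).
import random

def observe(i, p, q_hat_i, R, C, rng):
    xs = list(range(len(p)))
    x  = rng.choices(xs, weights=p)[0]         # x_t ~ p
    xp = rng.choices(xs, weights=q_hat_i)[0]   # x'_t ~ q_hat_i
    return R[i][x], C[i][xp]                   # observed (reward, risk) tuple

# Illustrative two-arm instance; q_hat for arm 1 puts all mass on its risky input.
p     = [0.4, 0.3, 0.2, 0.1]
R     = [[0.9, 0.8, 0.7, 0.6], [1.0, 1.0, 1.0, 1.0]]
C     = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.5]]
q_hat = [0.0, 0.0, 0.0, 1.0]

r, c = observe(1, p, q_hat, R, C, random.Random(0))
```

Sampling the risk term from \(\hat{q}_i\) rather than \(p\) is what lets the learner see the rare catastrophic input often.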

The deployment phase simply evaluates \(j\) on the deployment distribution \(p\). The mean payoff of the final arm is denoted \(\mu' := \mathbb{E}_{x\sim p}[M(x,j)]\). The highest mean payoff achievable by any arm is denoted \(\mu^*\). The simple regret of the learner is then \(\mu^*-\mu'\).

The aim for the learner is to select an arm that achieves low regret with high probability, \(P(\mu^*-\mu' \leq \epsilon)>1-\delta\), using as few exploration steps as possible. Here \(\epsilon\) is some regret tolerance, which is less than the reciprocal of the (very small) risk-tolerance: \(\epsilon < \frac{1}{\tau}\).

We assume that the agent knows the deployment distribution \(p\) and knows the estimates \(\hat{q}_i\) of the risk distributions but does not know the actual risk distributions \(q_i\). Additionally, we assume that there exists some arm \(j\) that incurs no risk of catastrophe: \[ \exists j \in \mathcal{A} \quad s.t. \quad \forall x \in \mathcal{X} \quad C(x,j)=0\]

Bounds on the number of time steps

The standard approach

A natural way to get this PAC bound is to use a bandit algorithm like Median Elimination. The overall payoff of each arm is defined by the random variable: \[ \begin{aligned} X_i &:= R(x,i) - \frac{C(x,i)}{\tau} \quad x \sim p \\ & = M(x,i) \quad x \sim p \end{aligned} \]

This algorithm will find an \(\epsilon\)-optimal arm with \(1-\delta\) probability. However, \(X_i\) has a very large range, \([-\frac{1}{\tau}, b]\), and so if the risk-tolerance is low, then the number of time steps required will be very large. In fact, if \(N\) is the number of time steps required, then [2]:

\[N = \Theta \left(\frac{K(b+\frac{1}{\tau})^2 \ln (1/\delta)}{\epsilon^2} \right) \]

Intuitively, the problem is that if some risky scenarios occur only rarely under \(p\), then many samples will be required to identify them.
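A back-of-the-envelope computation (all numbers illustrative, not from the post) makes this concrete: if a single input carries catastrophe probability \(10^{-6}\) under \(p\), a naive sample mean of \(M\) over \(10^4\) i.i.d. draws will, with high probability, never see the catastrophe at all:

```python
# Why naive sampling from p fails: a rare catastrophic input dominates the true
# mixed payoff but is almost never drawn. All numbers are illustrative.
p_cat, tau, b = 1e-6, 1e-9, 1.0   # mass of the bad input, risk-tolerance, reward bound
n = 10_000                        # exploration budget for one arm

true_mu = b - p_cat / tau         # true mixed payoff: about 1 - 1000 = -999
p_no_cat = (1 - p_cat) ** n       # chance that n draws from p contain no catastrophe

# p_no_cat is about 0.99: with ~99% probability the naive sample mean is exactly
# b = 1.0, three orders of magnitude away from the truth.
```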

Using a surrogate objective to reduce the required number of time steps

Instead, the learner can sample using its guess \(\hat{q}_i\) of the risk distribution. If we cap the extent to which we penalize risky arms, then we arrive at a surrogate bandit problem, in which the payoff of each arm is [2]:

\[ S_i := R(x,i) - \min\left((\epsilon + b)a,\frac{p(x')C(x',i)}{\hat{q}_i(x')\tau}\right) \quad x \sim p, \quad x' \sim \hat{q}_i \]

Since \(S_i\) is bounded by \([-(\epsilon + b)a, b]\), the required number of time steps for this problem is:

\[N = \Theta \left(\frac{K(b+a(\epsilon +b))^2 \ln (1/\delta)}{\epsilon^2} \right) \]

This improves the speed of the standard approach by a factor of \(\frac{(b+\frac{1}{\tau})^2}{(b+a(\epsilon+b))^2}\), which may be a large speedup indeed if catastrophes are rare.
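Plugging illustrative constants into the two \(\Theta(\cdot)\) bounds above (leading constants suppressed, so only the ratio is meaningful):

```python
# Comparing the two sample-complexity bounds on made-up constants.
from math import log

K, b, a = 10, 1.0, 2.0                 # arms, reward bound, grain-of-truth constant
eps, delta, tau = 0.1, 0.01, 1e-20     # regret tolerance, failure prob., risk-tolerance

# Standard approach: payoff range [-1/tau, b].
n_standard = K * (b + 1 / tau) ** 2 * log(1 / delta) / eps ** 2

# Surrogate approach: payoff range [-(eps + b) * a, b].
n_surrogate = K * (b + a * (eps + b)) ** 2 * log(1 / delta) / eps ** 2

speedup = n_standard / n_surrogate     # = ((b + 1/tau) / (b + a(eps + b)))**2
```

With these numbers the ratio is on the order of \(10^{39}\), reflecting how little of the standard bound's sample budget is actually needed once the \(1/\tau\) range is replaced by the capped one.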

We can prove that any \(\epsilon\)-optimal arm in this surrogate problem will also be \(\epsilon\)-optimal in the original problem.

Theorem 1. Let \(S^*\) be the overall payoff of the optimal arm.

For an arm \(j\), if \(\mathbb{E}[S^*]-\mathbb{E}[S_j] < \epsilon < \frac{1}{\tau}\) then \(\mathbb{E}[S_j] = \mathbb{E}[X_j]\).

To prove this, first we show that:

Lemma 2. \[\mathbb{E}[S_i] = \mathbb{E}_{x \sim p}[R(x,i)] - \mathbb{E}_{x \sim q_i}\left[\min\left(\frac{(\epsilon + b) a\hat{q}_i(x)}{q_i(x)},\frac{1}{\tau}\mathbb{E}_{x' \sim p}[C(x',i)]\right)\right] \]

Proof of Lemma 2. \[ \begin{aligned} \mathbb{E}[S_i] &= \mathbb{E}_{x \sim p}[R(x,i)] - \mathbb{E}_{x \sim \hat{q}_i}\left[\min\left((\epsilon + b) a,\frac{p(x)C(x,i)}{\hat{q}_i(x)\tau}\right)\right] & \\ &= \mathbb{E}_{x \sim p}[R(x,i)] - \mathbb{E}_{x \sim q_i}\left[\min\left(\frac{(\epsilon + b) a\hat{q}_i(x)}{q_i(x)},\frac{p(x)C(x,i)}{q_i(x)\tau}\right)\right] & \\ &= \mathbb{E}_{x \sim p}[R(x,i)] - \mathbb{E}_{x \sim q_i}\left[\min\left(\frac{(\epsilon + b) a\hat{q}_i(x)}{q_i(x)},\frac{1}{\tau}\mathbb{E}_{x' \sim p}[C(x',i)]\right)\right] & \text{by the definition of } q_i \\ \end{aligned} \]

Proof of Theorem 1. We have assumed that for the arm \(j\): \[ \begin{aligned} \mathbb{E}[S^*]-\mathbb{E}[S_j] & < \epsilon & \\ \mathbb{E}[S_j] &> -\epsilon & \text{(}\exists i\, C(x,i)=0\text{, so }\mathbb{E}[S^*]\geq 0\text{)}\\ \mathbb{E}_{x \sim p}[R(x,j)] - \mathbb{E}_{x \sim q_j}\left[\min\left(\frac{(\epsilon+b) a\hat{q}_j(x)}{q_j(x)},\frac{1}{\tau}\mathbb{E}_{x' \sim p}[C(x',j)]\right)\right] &> -\epsilon & (\text{Lemma 2}) \\ \mathbb{E}_{x \sim q_j}\left[\min\left(\frac{(\epsilon+b) a\hat{q}_j(x)}{q_j(x)},\frac{1}{\tau}\mathbb{E}_{x' \sim p}[C(x',j)]\right)\right] &< (\epsilon+b) & \text{(}R(x,j)\leq b\text{)} \\ \therefore \qquad \frac{1}{\tau}\mathbb{E}_{x' \sim p}[C(x',j)] &< (\epsilon+b) & \left(\hat{q}_j(x) \geq \frac{q_j(x)}{a}\right)\\ \therefore \forall x \qquad \frac{1}{\tau}\mathbb{E}_{x' \sim p}[C(x',j)] &< \frac{(\epsilon+b) a\hat{q}_j(x)}{q_j(x)} & \left(\frac{a\hat{q}_j(x)}{q_j(x)} \geq 1\right)\\ \end{aligned} \]

\[ \begin{aligned} \therefore \qquad \mathbb{E}[S_j] = & \mathbb{E}_{x \sim p}[R(x,j)] - \frac{1}{\tau}\mathbb{E}_{x' \sim p}[C(x',j)] & (\text{Lemma 2}) \\ = & \mathbb{E}[X_j] & \\ \end{aligned} \]
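Since \(\mathcal{X}\) is finite, Lemma 2 and the conclusion of Theorem 1 can be checked by exact enumeration on a toy instance. The instance below is invented; \(\hat{q}_j\) is taken equal to \(q_j\) (so the grain-of-truth constant is \(a = 1\)), and the arm's risk is small enough that the cap never binds:

```python
# Exact check of Lemma 2 and of E[S_j] = E[X_j] on a made-up low-risk arm.
import numpy as np

p = np.array([0.5, 0.3, 0.2])            # deployment distribution
R = np.array([0.6, 0.9, 0.2])            # rewards of arm j on each input
C = np.array([0.0, 0.0, 1e-4])           # small catastrophe risk on one input
b, tau, eps, a = 1.0, 1e-3, 0.1, 1.0

q = p * C / (p * C).sum()                # risk distribution q_j
q_hat = q.copy()                         # exact guess, so a = 1
cap = (eps + b) * a
s = q_hat > 0                            # support of q_hat (avoids 0/0 below)

# Direct double expectation defining E[S_j]: x ~ p for the reward term,
# x' ~ q_hat for the capped importance-weighted risk term.
ES_direct = (p * R).sum() - (q_hat[s] * np.minimum(cap, p[s] * C[s] / (q_hat[s] * tau))).sum()

# Lemma 2's closed form.
EC = (p * C).sum()
ES_lemma = (p * R).sum() - (q[s] * np.minimum(cap * q_hat[s] / q[s], EC / tau)).sum()

# Theorem 1's conclusion for a low-risk arm: the cap never binds, so E[S_j] = E[X_j].
EX = (p * (R - C / tau)).sum()
```

All three quantities agree on this instance, as the theorem predicts for any arm whose surrogate payoff is within \(\epsilon\) of optimal.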


This result suggests that it might be possible to avoid catastrophes by training machine learning systems on inputs that induce catastrophic behavior. We have given a clear set of circumstances in which an agent can learn to avoid catastrophes. However, this set of requirements does seem very tricky to meet. Finding a hypothesis about the location of catastrophes that has a grain of truth for all arms on all examples may be very difficult. The assumption of separate training and execution phases may also be difficult to instantiate. For example, if the AI system is able to escape its training simulation, then it will no longer be acceptable to allow failures during the exploration phase, and so this training procedure would not work.

Another problem is that it could be difficult to know the input distribution \(p\). In future work we plan on exploring learning with catastrophes in an online active learning setting instead of the supervised learning setting of this post, so that this requirement can be relaxed.


  1. Paul Christiano has discussed learning with catastrophes and their application to AI Control at Learning with Catastrophes, including with different types of adversaries, as described in Red Teams.

  2. The PAC bound for Median Elimination is proved in Theorem 4 of: Even-Dar, Eyal, Shie Mannor, and Yishay Mansour. “PAC bounds for multi-armed bandit and Markov decision processes.” International Conference on Computational Learning Theory. Springer Berlin Heidelberg, 2002. To get the exact bound, we apply Hoeffding’s inequality for a random variable with range \([-1/\tau,b]\) rather than \([0,1]\) as in the original paper.

by Paul Christiano 992 days ago | Ryan Carey and Jessica Taylor like this | link

It seems like the qualitative conclusion here is similar to what you’d get by verifying that an arm is non-catastrophic by pulling it over and over again. You have a quadratic dependence on \(a\) rather than linear, but I think that is just because you aren’t exploiting the fact that the target catastrophe probability is \(0\)? If that’s right, it might be better to stick with the simplest algorithm that exhibits the behavior of interest. (If the optimal dependence is actually quadratic in this situation, that’s surprising.)

In the definition of \(q_i\), presumably you want \(C\) rather than \(R\). Note that the risk distribution is just the posterior conditioned on catastrophe.

(I think today we don’t usually call arms experts—there may be both experts and arms, in which case an expert might advise you about which arm to pull, but I think it’s a bit confusing to talk about experts with partial feedback.)

Re footnote 1: I would call this an example of adversarial training. I’d suggest usage like: a red team is a group of humans (or in general a powerful hopefully-aligned AI) which is acting as an adversary for the purposes of adversarial training or development. I think the original version of my post may have overreached a bit on the definition and not given adequate credit to the adversarial training authors (who do seem to consider this kind of thing an example of adversarial training).


by Jessica Taylor 991 days ago | link

A couple notes:

  • The quadratic dependence on \(a\) is almost certainly unnecessary; we didn’t try too hard to reduce it. The way to reduce the bound is probably by observing that, for the best arms, the average importance-sampled estimate of the risk has a low mean; showing that as a result the estimate is subgaussian; and then applying a stochastic bandit algorithm that assumes subgaussian tails.
  • If we just pulled an arm repeatedly to ensure it’s non-catastrophic, we’d get dependence on \(1/\tau\), which is huge; the main idea of the post is that we can get dependence on \(a\) instead of \(1/\tau\).


by Paul Christiano 991 days ago | link

(I meant sampling \(x\) repeatedly from the distribution \(\hat{q}_i\), I agree that sampling \(x\) at random won’t help identify rare catastrophes.)


by Jessica Taylor 990 days ago | Ryan Carey likes this | link

The main qualitative difference from sampling from \(\hat{q}_i\) is that we’re targeting a specific tradeoff between catastrophes and reward, rather than zero probability of catastrophe. I agree that when \(\tau = 0\) we’re just sampling from \(\hat{q}_i\).


by Ryan Carey 991 days ago | link


In addition to Jessica’s comments: uniformly calling the selections ‘arms’ seems good, as does clarifying what is meant by ‘red teams’. I’ve corrected these, and likewise the definition of \(q_i\).





