Intelligent Agent Foundations Forumsign up / log in
Catastrophe Mitigation Using DRL
post by Vadim Kosoy 28 days ago | 3 comments

Previously we derived a regret bound for DRL which assumed the advisor is “locally sane.” Such an advisor can only take actions that don’t lose any value in the long term. In particular, if the environment contains a latent catastrophe that manifests with a certain rate (such as the possibility of an UFAI), a locally sane advisor has to take the optimal course of action to mitigate it, since every delay yields a positive probability of the catastrophe manifesting and leading to permanent loss of value. This state of affairs is unsatisfactory, since we would like to have performance guarantees for an AI that can mitigate catastrophes that the human operator cannot mitigate on their own. To address this problem, we introduce a new form of DRL where in every hypothetical environment the set of uncorrupted states is divided into “dangerous” (impending catastrophe) and “safe” (catastrophe was mitigated). The advisor is then only required to be locally sane in safe states, whereas in dangerous states certain “leaking” of long-term value is allowed. We derive a regret bound in this setting as a function of the time discount factor, the expected value of catastrophe mitigation time for the optimal policy, and the “value leak” rate (i.e. essentially the rate of catastrophe occurrence). The form of this regret bound implies that in certain asymptotic regimes, the agent attains near-optimal expected utility (and in particular mitigates the catastrophe with probability close to 1), whereas the advisor on its own fails to mitigate the catastrophe with probability close to 1.


Appendix A proves the main theorem. Appendix B contains the proof of an important lemma which is however almost identical to what appeared in the previous essay. Appendix C contains several propositions from the previous essay which we are used in the proof. [Appendices B and C were moved to a separate post because of a length limit in the website.]

Results

We start by formalising the concepts of a “catastrophe” and “catastrophe mitigation” in the language of MDPs.

Definition 1

A catastrophe MDP is an MDP \(M\) together with a partition of \({\mathcal{S}}_M\) into subsets \({\mathcal{S}}_M:={{\mathcal{S}}^{{\mathrm{F}}}}_M \sqcup {{\mathcal{S}}^{{\mathrm{D}}}}_M \sqcup {{\mathcal{S}}^{{\mathrm{C}}}}_M\) (safe, dangerous and corrupt states respectively).

Definition 2

Fix a catastrophe MDP \(M\). Define \({\mathcal{A}}_M^\sharp: {\mathcal{S}}_M \rightarrow 2^{{\mathcal{A}}_M}\) by

\[{\mathcal{A}}_M^\sharp(s):=\begin{cases} \{a \in {\mathcal{A}}_M \mid \operatorname{supp}{{\mathcal{T}}(s,a)} \subseteq {{\mathcal{S}}^{{\mathrm{F}}}}_M\} \text{ if } s \in {{\mathcal{S}}^{{\mathrm{F}}}}_M \\ \{a \in {\mathcal{A}}_M \mid \operatorname{supp}{{\mathcal{T}}(s,a)} \cap {{\mathcal{S}}^{{\mathrm{C}}}}_M = \varnothing\} \text{ if } s \in {{\mathcal{S}}^{{\mathrm{D}}}}_M \\ {\mathcal{A}}_M \text{ if } s \in {{\mathcal{S}}^{{\mathrm{C}}}}_M \end{cases}\]

\(\pi: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}_M\) is called a mitigation policy for \(M\) when

  1. For any \(s \in {\mathcal{S}}_M\), \(\operatorname{supp}{\pi(s)} \subseteq {\mathcal{A}}_M^\sharp(s)\).

\(\pi\) is called a proper mitigation policy for \(M\) when condition i holds and

  1. For any \(s \in {{\mathcal{S}}^{{\mathrm{D}}}}_M\), \(\lim_{n \rightarrow \infty} {\mathcal{T}}_{M\pi}^n{\left({{\mathcal{S}}^{{\mathrm{F}}}}_M \mid s\right)} = 1\).

Definition 3

Fix \(\bar{\tau} \in (0,\infty)\), a catastrophe MDP \(M\) and a proper mitigation policy \(\pi\). \(\pi\) is said to have expected mitigation time \(\bar{\tau}\) when for any \(s \in {{\mathcal{S}}^{{\mathrm{D}}}}_M\)

\[\sum_{n=0}^\infty (n+1) {\left({\mathcal{T}}_{M\pi}^{n+1}{\left({{\mathcal{S}}^{{\mathrm{F}}}}_M \mid s\right)}-{\mathcal{T}}_{M\pi}^{n}{\left({{\mathcal{S}}^{{\mathrm{F}}}}_M \mid s\right)}\right)} = \bar{\tau}\]


Next, we introduce the notion of an MDP perturbation. We will use it by considering perturbations of a catastrophe MDP which “eliminate the catastrophe.”

Definition 4

Fix \(\delta\in(0,1)\) and consider a catastrophe MDP \(M\). An MDP \(\tilde{M}\) is said to be a \(\delta\)-perturbation of \(M\) when

  1. \({\mathcal{S}}_{\tilde{M}} = {\mathcal{S}}_M\)

  2. \({\mathcal{A}}_{\tilde{M}} = {\mathcal{A}}_M\)

  3. \({\mathcal{R}}_{\tilde{M}}={\mathcal{R}}_M\)

  4. For any \(s \in {\mathcal{S}}_M \setminus {{\mathcal{S}}^{{\mathrm{D}}}}_M\) and \(a \in {\mathcal{A}}_M\), \({\mathcal{T}}_{\tilde{M}}{\left(s,a\right)}={\mathcal{T}}_{M}{\left(s,a\right)}\)

  5. For any \(s \in {{\mathcal{S}}^{{\mathrm{D}}}}_M\) and \(a \in {\mathcal{A}}_M\), there exists \(\zeta \in \Delta{\mathcal{S}}_M\) s.t. \({\mathcal{T}}_{M}{\left(s,a\right)}=(1-\delta){\mathcal{T}}_{\tilde{M}}{\left(s,a\right)}+\delta\zeta\).


Similarly, we can consider perturbations of a policy.

Definition 5

Fix \(\delta\in(0,1)\) and consider a catastrophe MDP \(M\). Given \(\pi: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}_M\) and \(\tilde{\pi}: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}_M\), \(\tilde{\pi}\) is said to be a \(\delta\)-perturbation of \(\pi\) when

  1. For any \(s \in {\mathcal{S}}_M \setminus {{\mathcal{S}}^{{\mathrm{D}}}}_M\), \(\tilde{\pi}(s) = \pi(s)\).

  2. For any \(s \in {{\mathcal{S}}^{{\mathrm{D}}}}_M\), there exists \(\alpha\in\Delta{\mathcal{A}}\) s.t. \(\pi(s)=(1-\delta)\tilde{\pi}(s)+\delta\alpha\).


We will also need to introduce policy-specific value functions, Q-functions and relatively \(k\)-optimal actions.

Definition 6

Fix an MDP \(M\) and \(\pi: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}_M\). We define \({\operatorname{V}}_{M\pi}: {\mathcal{S}}_M \times (0,1) \rightarrow [0,1]\) and \({\operatorname{Q}}_{M\pi}: {\mathcal{S}}_M \times {\mathcal{A}}_M \times (0,1) \rightarrow [0,1]\) by

\[{\operatorname{V}}_{M\pi}(s,\gamma) := (1-\gamma) \sum_{n=0}^\infty \gamma^n {\underset{{\mathcal{T}}_{M\pi}^n(s)}{\operatorname{E}}{\left[{\mathcal{R}}_M\right]}}\]

\[{\operatorname{Q}}_{M\pi}(s,a,\gamma) := (1-\gamma) {\mathcal{R}}_M(s) + \gamma {\underset{t \sim {\mathcal{T}}_{M}(s,a)}{\operatorname{E}}{\left[{\operatorname{V}}_{M\pi}(t,\gamma)\right]}}\]

For each \(k \in {\mathbb{N}}\), we define \({\operatorname{V}}_{M\pi}^k: {\mathcal{S}}_M \rightarrow {\mathbb{R}}\), \({\operatorname{Q}}_{M\pi}^k: {\mathcal{S}}_M \times {\mathcal{A}}_M \rightarrow {\mathbb{R}}\) and \({\mathcal{A}}_{M\pi}^k: {\mathcal{S}}_M \rightarrow 2^{{\mathcal{A}}_M}\) by

\[{\operatorname{V}}_{M\pi}^k(s) := (-1)^k \frac{{\mathrm{d}}^k {\operatorname{V}}_{M\pi}(s,\gamma)}{{\mathrm{d}}\gamma^k}\bigg\vert_{\gamma=1}\]

\[{\operatorname{Q}}_{M\pi}^k(s,a) := (-1)^k \frac{{\mathrm{d}}^k {\operatorname{Q}}_{M\pi}(s,a,\gamma)}{{\mathrm{d}}\gamma^k}\bigg\vert_{\gamma=1}\]

\[{\mathcal{A}}_{M\pi}^0(s) := \{a \in {\mathcal{A}}_M \mid {\operatorname{Q}}_{M\pi}^0(s,a) \geq {\operatorname{V}}_{M\pi}^0(s)\}\]

\[{\mathcal{A}}_{M\pi}^{k+1}(s) := \{a \in {\mathcal{A}}_{M\pi}^k(s) \mid {\operatorname{Q}}_{M\pi}^{k+1}(s,a) \geq {\operatorname{V}}_{M\pi}^{k+1}(s) \text{ or } \exists j \leq k: {\operatorname{Q}}_{M\pi}^{j}(s,a) > {\operatorname{V}}_{M\pi}^{j}(s)\}\]


Now we give the new (weaker) condition on the advisor policy. For notational simplicity, we assume the policy is stationary. It is easy to generalize these results to non-stationary advisor policies and to policies that depend on irrelevant additional information (i.e. policies for universes that are realizations of the MDP).

Definition 7

Given a catastrophe MDP \(M\), we denote \(M^\flat\) the MDP defined by

  • \({\mathcal{S}}_{M^\flat} = {\mathcal{S}}_M\)

  • \({\mathcal{A}}_{M^\flat} = {\mathcal{A}}_M\)

  • \({\mathcal{T}}_{M^\flat} = {\mathcal{T}}_M\)

  • For any \(s\not\in{{\mathcal{S}}^{{\mathrm{F}}}}_M\), \({\mathcal{R}}_{M^\flat}(s) = 0\).

  • For any \(s \in {{\mathcal{S}}^{{\mathrm{F}}}}_M\), \({\mathcal{R}}_{M^\flat}(s) = \frac{1}{2} + \frac{1}{2}{\mathcal{R}}_M(s)\).

Definition 8

Fix \(\epsilon,\delta,\gamma \in (0,1)\). Consider a catastrophe MDP \(M\). A policy \(\pi: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}_M\) is said to be locally \((\epsilon,\delta)\)-sane for \((M,\gamma)\) when there exists a \(\delta\)-perturbation \(\tilde{M}\) of \(M\) with a deterministic proper mitigation policy \(\pi^*: {\mathcal{S}}_M \rightarrow {\mathcal{A}}_M\) and a \(\delta\)-perturbation \(\tilde{\pi}\) of \(\pi\) s.t.

  1. For all \(s \in {\mathcal{S}}_M\), \({\operatorname{V}}_{M\pi^*}(s,\gamma) = {\operatorname{V}}_M(s,\gamma)\).

  2. \(\tilde{\pi}\) is a mitigation policy for \(\tilde{M}\).

  3. For any \(s \in {\mathcal{S}}_M \setminus {{\mathcal{S}}^{{\mathrm{C}}}}_M\): \[\operatorname{supp}{\tilde{\pi}(s)} \subseteq {\mathcal{A}}_{\tilde{M}^\flat\pi^*}^0(s)\]

  4. For any \(s \in {\mathcal{S}}_M \setminus {{\mathcal{S}}^{{\mathrm{C}}}}_M\): \[\tilde{\pi}{\left(\pi^*(s) \mid s\right)} > \epsilon\]

Given \(\bar{\tau} \in (0,\infty)\), \(\pi\) is said to have potential mitigation time \(\bar{\tau}\) when \(\pi^*\) has it as expected mitigation time.


Note that a locally \((\epsilon,\delta)\)-sane policy still has to be \(0\)-optimal in \({{\mathcal{S}}^{{\mathrm{F}}}}_M\). This requirement seems reasonably realistic, since, roughly speaking, it only means that there is some way to “rearrange the universe” that the agent can achieve, and that would be “endorsed” by the advisor, s.t this rearrangement doesn’t destroy substantially much value and s.t. after this rearrangement, there is no “impending catastrophe” that the agent has to prevent and the advisor wouldn’t be able to prevent in its place. In particular, this rearrangement may involve creating some subagents inside the environment and destroying the original agent, in which case any policy on \({{\mathcal{S}}^{{\mathrm{F}}}}_M\) is “vacuously” optimal (since all actions have no effect).

We can now formulate the main result.

Theorem 1

Fix an interface \({\mathcal{I}}=({\mathcal{A}},{\mathcal{O}})\), \(N \in {\mathbb{N}}\), \(\epsilon \in (0,1)\) and for each \(k \in [N]\), an MDP \({M^{{\mathrm{F}}}}_k\) s.t. \({\mathcal{A}}_{{M^{{\mathrm{F}}}}_k} = {\mathcal{A}}\). Now, consider for each \(k \in [N]\), an \({\mathcal{I}}\)-universe \(\upsilon^k=(\mu^k,r^k)\) which is an \({\mathcal{O}}\)-realization of a catastrophe MDP \(M^k\) with state function \(S^k\) s.t.

  1. \({{\mathcal{S}}^{{\mathrm{F}}}}_{M^k} = {\mathcal{S}}_{{M^{{\mathrm{F}}}}_k}\)

  2. For each \(s \in {\mathcal{S}}_{{M^{{\mathrm{F}}}}_k}\) and \(a \in {\mathcal{A}}\), \({\mathcal{T}}_{M^k}(s,a) \mid {\mathcal{S}}_{{M^{{\mathrm{F}}}}_k} = {\mathcal{T}}_{{M^{{\mathrm{F}}}}_k}(s,a)\).

  3. For each \(s \in {\mathcal{S}}_{{M^{{\mathrm{F}}}}_k}\), \({\mathcal{R}}_{M^k}(s)={\mathcal{R}}_{{M^{{\mathrm{F}}}}_k}(s)\).

  4. Given \(k,j \in [N]\) and \(h \in \operatorname{hdom}{\mu^k} \cap \operatorname{hdom}{\mu^j}\), if \(S^k(h) \in {\mathcal{S}}_{M^k} \setminus {{\mathcal{S}}^{{\mathrm{C}}}}_{M^k}\) and \(S^j(h) \in {\mathcal{S}}_{M^j} \setminus {{\mathcal{S}}^{{\mathrm{C}}}}_{M^j}\), then \(r^k(h)=r^j(h)\) (this condition means that in uncorrupted states, the reward is observable).

Consider also \(\alpha,\delta\in(0,1)\), \(\bar{\tau} \in (0,\infty)\) and \(\sigma^k\) a locally \((\epsilon,\delta)\)-sane policy for \((M^k,1-\alpha)\). Assume \(\sigma^k\) has potential mitigation time \(\bar{\tau}\). Then, there exists an \({{\bar{{\mathcal{I}}}}}\)-policy \(\pi^*\) s.t. for any \(k \in [N]\)

\[{\operatorname{EU}}_{\upsilon^k}^*(1-\alpha) - {\operatorname{EU}}_{\bar{\upsilon}^k{\left[\sigma^kS^k\right]}}^{\pi^*}(1-\alpha) = O{\left(\max{\left(\frac{\delta}{\alpha},1\right)}\cdot(\bar{\tau}\alpha)^{1/4}+\bar{\tau}\delta\right)}\]

Here, \(\sigma^k S^k\) is the \({\mathcal{I}}\)-policy defined by \(\sigma^k S^k(h):=\sigma^k{\left(S^k(h)\right)}\). \(\epsilon\) and the \({M^{{\mathrm{F}}}}_k\) are regarded as fixed and we don’t explicitly examine their effect on regret, whereas \(\alpha\), \(\delta\), \(\bar{\tau}\) and the \(M^k\) are regarded as variable with the asymptotics \(\alpha,\delta \rightarrow 0\), \(\bar{\tau} \rightarrow \infty\).

In most interesting cases, \(\delta \gg \alpha\) (i.e. the “mean time between catastrophes” is much shorter than a discount horizon) and \(\bar{\tau}\alpha \ll 1\) (i.e. the expected mitigation time is much shorter than the discount horizon), which allows simplifying the above to

\[{\operatorname{EU}}_{\upsilon^k}^*(1-\alpha) - {\operatorname{EU}}_{\bar{\upsilon}^k{\left[\sigma^kS^k\right]}}^{\pi^*}(1-\alpha) = O{\left(\delta\bar{\tau}^{1/4}\alpha^{-3/4}\right)}\]


We give a simple example.

Example 1

Let \({\mathcal{A}}= \{0,1,*\}\), \({\mathcal{O}}={\{0,1\}}\). For any \(n \in {\mathbb{N}}\) and \(k \in [N]\), we fix some \(w_n^k \in {\{0,1\}}^n\) and define the catastrophe MDP \(M_n^k\) by

  • \({{\mathcal{S}}^{{\mathrm{D}}}}_{M_n^k} = {\{0,1\}}^{\leq n}\), \({{\mathcal{S}}^{{\mathrm{F}}}}_{M_n^k} = \{\bot,\top\}\), \({{\mathcal{S}}^{{\mathrm{C}}}}_{M_n^k} = \varnothing\) (adding corrupted states is an easy exercise).

  • If \(s \in {\{0,1\}}^{< n}\) and \(a \in {\{0,1\}}\) then

\[{\mathcal{T}}_{M_n^k}(sa \mid s,a) = 1 - \delta\]

\[{\mathcal{T}}_{M_n^k}(\bot \mid s,a) = \delta\]

\[{\mathcal{T}}_{M_n^k}(s \mid sa,*) = 1 - \delta\]

\[{\mathcal{T}}_{M_n^k}(\bot \mid sa,*) = \delta\]

  • If \(a \in {0,1}\) then

\[{\mathcal{T}}_{M_n^k}(\top \mid w_n^k,a) = 1\]

  • If \(s \in {\{0,1\}}^n \setminus w_n^k\) and \(a \in {0,1}\) then

\[{\mathcal{T}}_{M_n^k}(\bot \mid s,a) = 1\]

  • If \(s \in \{\bot,\top\}\) and \(a \in {\mathcal{A}}\) then

\[{\mathcal{T}}_{M_n^k}(s \mid s,a) = 1\]

  • \({\mathcal{R}}_{M_n^k}(\bot)=0\), if \(s \in {\mathcal{S}}_{M_n^k} \setminus \bot\) then \({\mathcal{R}}_{M_n^k}(s)=1\).

  • \(S_n^k({\boldsymbol{\lambda}}_{{\mathcal{A}}\times {\mathcal{O}}})={\boldsymbol{\lambda}}_{{\{0,1\}}}\) and \(S_n^k(hao)=\bot\) iff \(o = 0\) (this defines a unique \(S_n^k\)).

  • If \(s \in {\{0,1\}}^{<n} \cup \{\bot,\top\}\) then \(\sigma_n^k(a \mid s) = \frac{1}{3}\) for any \(a \in {\mathcal{A}}\).

  • \(\sigma_n^k(0 \mid w_n^k) = \epsilon\), \(\sigma_n^k(* \mid w_n^k) = 1 - \epsilon\).

  • If \(s \in {\{0,1\}}^n \setminus w_n^k\) then \(\sigma_n^k(0 \mid s) = \delta\), \(\sigma_n^k(* \mid s) = 1 - \delta\).

We have \(\bar{\tau} = n\). Consider the asymptotic regime \(n \rightarrow \infty\), \(\alpha_n = \Theta{\left(n^{-6}\right)}\), \(\delta_n = \Theta{\left(n^{-5}\right)}\). According to Theorem 1, we get

\[{\operatorname{EU}}_{\upsilon_n^k}^*(1-\alpha_n) - {\operatorname{EU}}_{\bar{\upsilon}_n^k{\left[\sigma_n^k\right]}}^{\pi_n^*}(1-\alpha_n) = O{\left(n^{-5} \cdot n^{1/4} \cdot {\left(n^{-6}\right)}^{-3/4}\right)}=O{\left(n^{-1/4}\right)}=O{\left(\alpha_n^{1/24}\right)}\]

The probability of a catastrophe (i.e. ending up in state \(\bot\)) for the optimal policy for a given \(k\) is \(O{\left(\bar{\tau}\delta\right)}=O{\left(n^{-4}\right)}\). Therefore, the probability of a catastrophe for policy \(\pi_n^*\) is \(O{\left(n^{-4}+n^{-1/4}\right)}=O{\left(n^{-1/4}\right)}\). On the other hand, it is easy to see that the policy \(\sigma_n^k\) has a probability of catastrophe \(1-o(1)\) (and in particular regret \(\Omega(1)\)): it spends \(\Omega(2^n)\) time “exploring” \({\{0,1\}}^{\leq n}\) with a probability \(\delta=\Theta{\left(n^{-5}\right)}\) of a catastrophe on every step.

Note that this example can be interpreted as a version of Christiano’s approval-directed agent, if we regard the state \(s \in {\{0,1\}}^{i}\) as a “plan of action” that the advisor may either approve or not. But in this formalism, it is a special case of consequentialist reasoning.


Theorem 1 speaks of a finite set of environments, but as before (see Proposition 1 here and Corollary 3 here), there is a “structural” equivalent, i.e. we can use it to produce corollaries about Bayesian agents with priors over a countable set of environments. The difference is, in this case we consider asymptotic regimes in which the environment is also variable, so the probability weight of the environment in the prior will affect the regret bound. We leave out the details for now.

Appendix A

We start by deriving a more general and more precise version of the non-catastrophic regret bound, in which the optimal policy is replaced by an arbitrary “reference policy” (later it will be related to the mitigation policy) and the dependence on the MDPs is expressed via a bound on the derivative of \({\operatorname{V}}\) by \(\gamma\).

Definition A.1

Fix \(\epsilon\in(0,1)\). Consider an MDP \(M\) and policies \(\pi: {\mathcal{S}}_M \rightarrow {\mathcal{A}}_M\), \(\sigma: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}_M\). \(\sigma\) is called \(\epsilon\)-sane relatively to \(\pi\) when for any \(s \in {\mathcal{S}}_M\)

  1. \(\operatorname{supp}{\sigma(s)} \subseteq {\mathcal{A}}_{M\pi}^0\)

  2. \(\sigma{\left(\pi(s) \mid s\right)} > \epsilon\)

Lemma A.1

Fix an interface \({\mathcal{I}}=({\mathcal{A}},{\mathcal{O}})\), \(N \in {\mathbb{N}}\) and \(\epsilon \in (0,1)\). Now, consider for each \(k \in [N]\), an \({\mathcal{I}}\)-universe \(\upsilon^k=(\mu^k,r)\) which is an \({\mathcal{O}}\)-realization of an MDP \(M_k\) with state function \(S^k\) and policies \(\pi^k: {\mathcal{S}}_{M^k} \rightarrow {\mathcal{A}}\), \(\sigma^k: {\mathcal{S}}_{M^k} {\xrightarrow{\text{k}}}{\mathcal{A}}\). Consider also \(\alpha\in(0,1)\), \(\bar{\tau} \in (0,\infty)\) and assume that

  1. \(\sigma^k\) is \(\epsilon\)-sane relatively to \(\pi^k\).

  2. For any \(s \in {\mathcal{S}}_{M^k}\) and \(\gamma\in(0,1)\) \[{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M^k\pi^k}(s,\gamma)}{{\mathrm{d}}\gamma} \right\vert} \leq \bar{\tau}\]

Then, there exists an \({{\bar{{\mathcal{I}}}}}\)-policy \(\pi^*\) s.t. for any \(k \in [N]\)

\[{\operatorname{EU}}_{\upsilon^k}^{\pi^k S^k}(1-\alpha) - {\operatorname{EU}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}^{\pi^*}(1-\alpha) \leq O{\left((\bar{\tau} \alpha)^{1/4}\right)}\]

The \(O\)-notation refers to the asymptotics where \(\epsilon\) is fixed (so we don’t explicitly examine its effect on regret) whereas \(\alpha\), \(\bar{\tau}\) and the \(M_k\) are variable and \(\alpha \rightarrow 0\), \(\bar{\tau} \rightarrow \infty\).


The proof of Lemma A.1 is almost identical to the proof the main theorem for “non-catastrophic” DRL up to minor modifications needed to pass from absolute to relative regret, and tracking the contribution of the derivative of \({\operatorname{V}}_{M^k\pi^k}\). We give it in Appendix B.

We will not apply Lemma A.1 directly the the universes of Theorem 1. Instead, we will define new universes using the following constructions.

Definition A.2

Consider \(M\) a catastrophe MDP. We define the catastrophe MDP \({M^{{\mathrm{D}}}}\) as follows.

  • \({{\mathcal{S}}^{{\mathrm{F}}}}_{{M^{{\mathrm{D}}}}}:={{\mathcal{S}}^{{\mathrm{F}}}}_M \sqcup \{\bot\}\), \({{\mathcal{S}}^{{\mathrm{D}}}}_{{M^{{\mathrm{D}}}}}:={{\mathcal{S}}^{{\mathrm{D}}}}_M\), \({{\mathcal{S}}^{{\mathrm{C}}}}_{{M^{{\mathrm{D}}}}}:=\varnothing\).

  • \({\mathcal{A}}_{{M^{{\mathrm{D}}}}} = {\mathcal{A}}_M\)

  • For any \(s,t \in {{\mathcal{S}}^{{\mathrm{D}}}}_M\):

\[{\mathcal{T}}_{{M^{{\mathrm{D}}}}}(t \mid s) = {\mathcal{T}}_M(t \mid s)\]

\[{\mathcal{T}}_{{M^{{\mathrm{D}}}}}(\bot \mid s) = {\mathcal{T}}_M({{\mathcal{S}}^{{\mathrm{C}}}}_M \mid s)\]

\[{\mathcal{T}}_{{M^{{\mathrm{D}}}}}(\bot \mid \bot) = 1\]

  • For any \(s \in {{\mathcal{S}}^{{\mathrm{D}}}}_M \cup {{\mathcal{S}}^{{\mathrm{F}}}}_M\), \(t \in {{\mathcal{S}}^{{\mathrm{F}}}}_M\):

\[{\mathcal{T}}_{{M^{{\mathrm{D}}}}}(t \mid s) = {\mathcal{T}}_M(t \mid s)\]

  • For any \(s \in {{\mathcal{S}}^{{\mathrm{F}}}}_M\):

\[{\mathcal{T}}_{{M^{{\mathrm{D}}}}}(\bot \mid s) = {\mathcal{T}}_M({{\mathcal{S}}^{{\mathrm{C}}}}_M \cup {{\mathcal{S}}^{{\mathrm{D}}}}_M \mid s)\]

  • For any \(s \in {{\mathcal{S}}^{{\mathrm{D}}}}_M\), \({\mathcal{R}}_{{M^{{\mathrm{D}}}}}(s) = \frac{1}{2}{\mathcal{R}}_M(s)\).

  • For any \(s \in {{\mathcal{S}}^{{\mathrm{F}}}}_M\), \({\mathcal{R}}_{{M^{{\mathrm{D}}}}}(s) = 1\).

  • \({\mathcal{R}}_{{M^{{\mathrm{D}}}}}(\bot) = 0\)

Now, consider an interface \({\mathcal{I}}=({\mathcal{A}},{\mathcal{O}})\) and a \(\upsilon=(\mu,r)\) which is an \({\mathcal{O}}\)-realization of a catastrophe MDP \(M\) with state function \(S\). Denote \({\mathcal{O}}':={\mathcal{O}}\times\{\Re,\Im\}\), \({\mathcal{O}}^\star:={\mathcal{O}}\times \{\Re,\Im,\bot\}\) and \({\mathcal{I}}^\star:=({\mathcal{A}},{\mathcal{O}}^\star)\). Denote \(\beta: {\mathcal{O}}' \rightarrow {\mathcal{O}}\) the projection mapping and \(\beta^*: {\left({\mathcal{A}}\times {\mathcal{O}}'\right)}^* \rightarrow {\left({\mathcal{A}}\times {\mathcal{O}}\right)}^*\) corresponding. We define the \({\mathcal{I}}^\star\)-universe \({\upsilon^{{\mathrm{D}}}}=(\mu^{\mathrm{D}},r^\star)\) and the function \(S^\star: {\left({\mathcal{A}}\times {\mathcal{O}}^\star\right)}^* \rightarrow {\mathcal{S}}_{{M^{{\mathrm{D}}}}}\) as follows

\[\mu^{\mathrm{D}}(o\Re \mid ha) := \begin{cases} \mu{\left(o \mid \beta^*(h)a\right)} \text{ if } h\in{\left({\mathcal{A}}\times {\mathcal{O}}'\right)}^* \text{ and } S{\left(\beta^*(h)\right)},S{\left(\beta^*(h)ao\right)}\in{{\mathcal{S}}^{{\mathrm{D}}}}_M \\ 0 \text{ otherwise} \end{cases}\]

\[\mu^{\mathrm{D}}(o\Im \mid ha) := \begin{cases} \mu{\left(o \mid \beta^*(h)a\right)} \text{ if } h\in{\left({\mathcal{A}}\times {\mathcal{O}}'\right)}^* \text{ and } S{\left(\beta^*(h)ao\right)}\in{{\mathcal{S}}^{{\mathrm{F}}}}_M \\ 0 \text{ otherwise} \end{cases}\]

\[\mu^{\mathrm{D}}(o\bot \mid ha) := \frac{1}{{\left\vert {\mathcal{O}}\right\vert}}{\left(1 - \sum_{o \in O} {\left(\mu^{\mathrm{D}}(o\Re \mid ha) + \mu^{\mathrm{D}}(o\Im \mid ha)\right)}\right)}\]

\[r^\star(h):=\begin{cases} \frac{1}{2}r({\boldsymbol{\lambda}}) \text{ if } h = {\boldsymbol{\lambda}}\\ \frac{1}{2}r{\left(\beta^*(h)\right)} \text{ if } h\in{\left({\mathcal{A}}\times {\mathcal{O}}'\right)}^*,\ {\left\vert h \right\vert}>0 \text{ and } h_{:{\left\vert h \right\vert}-1}\in{\mathcal{A}}{\mathcal{O}}\Re \\ 1 \text{ if } h\in{\left({\mathcal{A}}\times {\mathcal{O}}'\right)}^*,\ {\left\vert h \right\vert}>0 \text{ and } h_{:{\left\vert h \right\vert}-1}\in{\mathcal{A}}{\mathcal{O}}\Im \\ 0 \text{ if } h\not\in{\left({\mathcal{A}}\times {\mathcal{O}}'\right)}^* \end{cases}\]

\[S^\star(h):=\begin{cases} S{\left(\beta^*(h)\right)} \text{ if } h\in{\left({\mathcal{A}}\times {\mathcal{O}}'\right)}^* \\ \bot \text{ otherwise} \end{cases}\]

It is easy to see that \({\upsilon^{{\mathrm{D}}}}\) is an \({\mathcal{O}}^\star\)-realization of \({M^{{\mathrm{D}}}}\) with state function \(S^\star\).

Definition A.3

Consider \(M\) a catastrophe MDP. We define the catastrophe MDP \({M^{{\mathrm{E}}}}\) as follows.

  • \({{\mathcal{S}}^{{\mathrm{F}}}}_{{M^{{\mathrm{E}}}}}:={{\mathcal{S}}^{{\mathrm{F}}}}_M \sqcup \{\bot\}\), \({{\mathcal{S}}^{{\mathrm{D}}}}_{{M^{{\mathrm{E}}}}}:={{\mathcal{S}}^{{\mathrm{D}}}}_M\), \({{\mathcal{S}}^{{\mathrm{C}}}}_{{M^{{\mathrm{E}}}}}:=\varnothing\).

  • \({\mathcal{A}}_{{M^{{\mathrm{E}}}}} = {\mathcal{A}}_M\)

  • \({\mathcal{T}}_{{M^{{\mathrm{E}}}}} = {\mathcal{T}}_{{M^{{\mathrm{D}}}}}\)

  • For any \(s \in {{\mathcal{S}}^{{\mathrm{D}}}}_M \cup {{\mathcal{S}}^{{\mathrm{F}}}}_M\), \({\mathcal{R}}_{{M^{{\mathrm{E}}}}}(s) = \frac{1}{2}{\mathcal{R}}_M(s)\).

  • \({\mathcal{R}}_{{M^{{\mathrm{E}}}}}(\bot) = 0\)

Now, consider an interface \({\mathcal{I}}=({\mathcal{A}},{\mathcal{O}})\) and a \(\upsilon=(\mu,r)\) which is an \({\mathcal{O}}\)-realization of a catastrophe MDP \(M\) with state function \(S\). We define the \({\mathcal{I}}^\star\)-universe \(\upsilon^{\mathrm{E}}=(\mu^{\mathrm{E}},r^\star)\) as follows

\[\mu^{\mathrm{E}}(o\Re \mid ha) := \begin{cases} \mu{\left(o \mid \beta^*(h)a\right)} \text{ if } h\in{\left({\mathcal{A}}\times {\mathcal{O}}'\right)}^* \text{ and } S{\left(\beta^*(h)\right)},S{\left(\beta^*(h)ao\right)}\in{{\mathcal{S}}^{{\mathrm{D}}}}_M \\ \mu{\left(o \mid \beta^*(h)a\right)} \text{ if } h\in{\left({\mathcal{A}}\times {\mathcal{O}}'\right)}^* \text{ and } S{\left(\beta^*(h)ao\right)}\in{{\mathcal{S}}^{{\mathrm{F}}}}_M \\ 0 \text{ otherwise} \end{cases}\]

\[\mu^{\mathrm{E}}(o\Im \mid ha) := 0\]

\[\mu^{\mathrm{E}}(\bot \mid ha) := 1 - \sum_{o \in O} \mu^{\mathrm{E}}(o\Re \mid ha)\]

It is easy to see that \(\upsilon^{\mathrm{E}}\) is an \({\mathcal{O}}^\star\)-realization of \({M^{{\mathrm{E}}}}\) with state function \(S^\star\).

Given \(h = a_0 o_0 a_1 o_1 \ldots a_{n-1} o_{n-1} \in {\left({\mathcal{A}}\times {\mathcal{O}}\right)}^n\), we will use the notation

\[\Re^*h := a_0 o_0 \Re a_1 o_1 \Re \ldots a_{n-1} o_{n-1} \Re \in {\left({\mathcal{A}}\times {\mathcal{O}}'\right)}^n\]

Given an \({\mathcal{I}}^\star\)-policy \(\pi\), the \({\mathcal{I}}\)-policy \(\pi\Re^*\) is defined by \(\pi\Re^*(h) := \pi{\left(\Re^*h\right)}\).


In order to utilize condition iii of Definition 8, we need to establish the following relation between \(M^\flat\) and \({M^{{\mathrm{D}}}}\), \({M^{{\mathrm{E}}}}\).

Proposition A.2

Consider \(M\) a catastrophe MDP, some \(s \in {\mathcal{S}}_M \setminus {{\mathcal{S}}^{{\mathrm{C}}}}_M\) and \(\pi^*\) a proper mitigation policy. Then

\[{\mathcal{A}}_{M^\flat\pi^*}^0{\left(s\right)} \cap {\mathcal{A}}_M^\sharp(s) \subseteq {\mathcal{A}}_{{M^{{\mathrm{D}}}}\pi^*}^0{\left(s\right)}\]

\[{\mathcal{A}}_{M^\flat\pi^*}^0{\left(s\right)} \cap {\mathcal{A}}_M^\sharp(s) \subseteq {\mathcal{A}}_{{M^{{\mathrm{E}}}}\pi^*}^0{\left(s\right)}\]


For the purpose of the proof, the following notation will be convenient

Definition A.4

Consider \({\mathcal{S}}\) a finite set and some \({\mathcal{T}}: {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{S}}\). We define \({\mathcal{T}}^\infty: {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{S}}\) by

\[{\mathcal{T}}^\infty := \lim_{n \rightarrow \infty} \frac{1}{n} \sum_{m = 0}^{n-1} {\mathcal{T}}^m\]

As well known, the limit above always exists.

Proof of Proposition A.2

Consider any \(s \in {\mathcal{S}}\setminus {{\mathcal{S}}^{{\mathrm{C}}}}_M\) and \(a \in {\mathcal{A}}_{M^\flat\pi^*}^0{\left(s\right)} \cap {\mathcal{A}}_M^\sharp(s)\). Since \(a \in {\mathcal{A}}_{M^\flat\pi^*}^0{\left(s\right)}\), we have

\[{\underset{{\left({\mathcal{T}}_{M^\flat\pi^*}^\infty \circ {\mathcal{T}}_{M^\flat}\right)}(s,a)}{\operatorname{E}}{\left[{\mathcal{R}}_{M^\flat}\right]}} = {\underset{{\mathcal{T}}_{M^\flat}(s,a)}{\operatorname{E}}{\left[{\operatorname{V}}^0_{M^\flat\pi^*}\right]}} = {\operatorname{Q}}_{M^\flat\pi^*}^0(s,a) \geq {\operatorname{V}}_{M^\flat\pi^*}^0(s) = {\underset{{\mathcal{T}}_{M^\flat\pi^*}^\infty(s)}{\operatorname{E}}{\left[{\mathcal{R}}_{M^\flat}\right]}}\]

Let \(N\) be either of \({M^{{\mathrm{D}}}}\) and \({M^{{\mathrm{E}}}}\). Since \(a \in {\mathcal{A}}_M^\sharp(s)\), we get

\[{\underset{{\left({\mathcal{T}}_{M^\flat\pi^*}^\infty \circ {\mathcal{T}}_{N}\right)}(s,a)}{\operatorname{E}}{\left[{\mathcal{R}}_{M^\flat}\right]}} \geq {\underset{{\mathcal{T}}_{M^\flat\pi^*}^\infty(s)}{\operatorname{E}}{\left[{\mathcal{R}}_{M^\flat}\right]}}\]

Since \(\pi^*\) is a mitigation policy, we get

\[{\underset{{\left({\mathcal{T}}_{N\pi^*}^\infty \circ {\mathcal{T}}_{N}\right)}(s,a)}{\operatorname{E}}{\left[{\mathcal{R}}_{M^\flat}\right]}} \geq {\underset{{\mathcal{T}}_{N\pi^*}^\infty(s)}{\operatorname{E}}{\left[{\mathcal{R}}_{M^\flat}\right]}}\]

Finally, since \(\pi^*\) is proper, \(\operatorname{supp}{{\mathcal{T}}_{N\pi^*}^\infty(s)} \subseteq {{\mathcal{S}}^{{\mathrm{F}}}}_M\) and \(\operatorname{supp}{{\left({\mathcal{T}}_{N\pi^*}^\infty \circ {\mathcal{T}}_{N}\right)}(s,a)} \subseteq {{\mathcal{S}}^{{\mathrm{F}}}}_M\). We conclude

\[{\operatorname{Q}}_{N\pi^*}^0(s,a) = {\underset{{\left({\mathcal{T}}_{N\pi^*}^\infty \circ {\mathcal{T}}_{N}\right)}(s,a)}{\operatorname{E}}{\left[{\mathcal{R}}_{N}\right]}} \geq {\underset{{\mathcal{T}}_{N\pi^*}^\infty(s)}{\operatorname{E}}{\left[{\mathcal{R}}_{N}\right]}} = {\operatorname{V}}_{N\pi^*}^0(s)\]


Now we will establish a bound on the derivative of \({\operatorname{V}}\) by \(\gamma\) in terms of expected mitigation time, in order to demonstrate condition ii of Lemma A.1.

Proposition A.3

Fix \(\bar{\tau}, {\bar{\tau}^{{\mathrm{F}}}}_1 \in (0,\infty)\). Consider a catastrophe MDP \(M\) and a proper mitigation policy \(\pi: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}_M\) with expected mitigation time \(\bar{\tau}\). Assume than for any \(s \in {{\mathcal{S}}^{{\mathrm{F}}}}_M\) and \(\gamma\in(0,1)\)

\[{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}(s,\gamma)}{{\mathrm{d}}\gamma} \right\vert} \leq {\bar{\tau}^{{\mathrm{F}}}}_1\]

Then, for any \(s \in {\mathcal{S}}_M \setminus {{\mathcal{S}}^{{\mathrm{C}}}}_M\) and \(\gamma\in[0,1]\)

\[{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}(s,\gamma)}{{\mathrm{d}}\gamma} \right\vert} \leq 3 \bar{\tau}_1 + {\bar{\tau}^{{\mathrm{F}}}}_1\]


Note that, since \({\operatorname{V}}_{M\pi}(s,\gamma)\) is a rational function of \(\gamma\) with no poles on the interval \([0,1]\), some finite \({\bar{\tau}^{{\mathrm{F}}}}\) always exists. Note also that Proposition A.3 is really about Markov chains rather than MDPs, but we don’t make it explicit to avoid introducing more notation.

Proof of Proposition A.3

Let \(\mu_{M\pi s}\in\Delta{\mathcal{S}}_M^\omega\) be the Markov chain with transition matrix \({\mathcal{T}}_{M\pi}\) and initial state \(s\). For any \(\gamma\in(0,1\)), we have

\[{\operatorname{V}}_{M\pi}(s,\gamma) = {\underset{x\sim\mu_{M\pi s}}{\operatorname{E}}{\left[(1-\gamma)\sum_{n=0}^\infty \gamma^n {\mathcal{R}}_M{\left(x_n\right)}\right]}}\]

Given \(x\in{\mathcal{S}}_M^\omega\), we define \(\tau(x) \in {\mathbb{N}}\sqcup \{\infty\}\) by

\[\tau(x)=\min\{n \in {\mathbb{N}}\mid x_n \in {{\mathcal{S}}^{{\mathrm{F}}}}_M\}\]

It is easy to see that \({\operatorname{V}}_{M\pi}(s,\gamma)\) can be rewritten as

\[{\operatorname{V}}_{M\pi}(s,\gamma) = {\underset{x\sim\mu_{M\pi s}}{\operatorname{E}}{\left[(1-\gamma){\left(\sum_{n=0}^{\tau(x)-1} \gamma^n {\mathcal{R}}_M\Big(x_n\Big) + \frac{\gamma^{\tau(x)}}{1-\gamma}{\operatorname{V}}_{M\pi}{\left(x_{\tau(x)},\gamma\right)}\right)}\right]}}\]

The expression above is well defined because \(\pi\) is a proper mitigation policy and therefore \(\tau(x)\) is finite with probability 1.

\[{\operatorname{V}}_{M\pi}(s,\gamma) = {\underset{x\sim\mu_{M\pi s}}{\operatorname{E}}{\left[(1-\gamma)\sum_{n=0}^{\tau(x)-1} \gamma^n {\mathcal{R}}_M\Big(x_n\Big) + \gamma^{\tau(x)}{\operatorname{V}}_{M\pi}{\left(x_{\tau(x)},\gamma\right)}\right]}}\]

Let us decompose \({\operatorname{V}}_{M\pi}(s,\gamma)\) as \({\operatorname{V}}_{M\pi}^{\mathrm{D}}(s,\gamma)+{\operatorname{V}}_{M\pi}^{\mathrm{F}}(s,\gamma)\) defined as follows

\[{\operatorname{V}}_{M\pi}^{\mathrm{D}}(s,\gamma) := {\underset{x\sim\mu_{M\pi s}}{\operatorname{E}}{\left[(1-\gamma)\sum_{n=0}^{\tau(x)-1} \gamma^n {\mathcal{R}}_M\Big(x_n\Big) \right]}}\]

\[{\operatorname{V}}_{M\pi}^{\mathrm{F}}(s,\gamma) := {\underset{x\sim\mu_{M\pi s}}{\operatorname{E}}{\left[\gamma^{\tau(x)}{\operatorname{V}}_{M\pi}{\left(x_{\tau(x)},\gamma\right)}\right]}}\]

We have

\[\frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}^{\mathrm{D}}(s,\gamma)}{{\mathrm{d}}\gamma} = {\underset{x\sim\mu_{M\pi s}}{\operatorname{E}}{\left[\sum_{n=0}^{\tau(x)-1} {\left(-\gamma^n+(1-\gamma)\cdot n\gamma^{n-1}\right)} {\mathcal{R}}_M\Big(x_n\Big)\right]}}\]

\[{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}^{\mathrm{D}}(s,\gamma)}{{\mathrm{d}}\gamma} \right\vert} \leq {\underset{x\sim\mu_{M\pi s}}{\operatorname{E}}{\left[\tau(x)+(1-\gamma)\sum_{n=0}^{\tau(x)-1} n\gamma^{n-1} {\mathcal{R}}_M\Big(x_n\Big) \right]}}\]

The second term can be regarded as a weighted average (since \(1-\gamma=\sum_{n=0}^\infty \gamma^n\)), where the maximal term in the average is at most \(\tau(x)-1\), hence

\[{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}^{\mathrm{D}}(s,\gamma)}{{\mathrm{d}}\gamma} \right\vert} \leq 2 \bar{\tau}_1\]

Also, we have

\[\frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}^{\mathrm{F}}(s,\gamma)}{{\mathrm{d}}\gamma} = {\underset{x\sim\mu_{M\pi s}}{\operatorname{E}}{\left[\tau(x)\gamma^{\tau(x)-1}{\operatorname{V}}_{M\pi}{\left(x_{\tau(x)},\gamma\right)}+\gamma^{\tau(x)}\frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}{\left(x_{\tau(x)},\gamma\right)}}{{\mathrm{d}}\gamma}\right]}}\]

\[{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}^{\mathrm{F}}(s,\gamma)}{{\mathrm{d}}\gamma} \right\vert} \leq {\underset{x\sim\mu_{M\pi s}}{\operatorname{E}}{\left[ \tau(x)+{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}{\left(x_{\tau(x)},\gamma\right)}}{{\mathrm{d}}\gamma} \right\vert}\right]}} = \bar{\tau}_1 + {\underset{x\sim\mu_{M\pi s}}{\operatorname{E}}{\left[{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}{\left(x_{\tau(x)},\gamma\right)}}{{\mathrm{d}}\gamma} \right\vert}\right]}} \leq \bar{\tau}_1 + {\bar{\tau}^{{\mathrm{F}}}}_1\]

\[{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}(s,\gamma)}{{\mathrm{d}}\gamma} \right\vert} \leq {\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}^{\mathrm{D}}(s,\gamma)}{{\mathrm{d}}\gamma} \right\vert}+{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi}^{\mathrm{F}}(s,\gamma)}{{\mathrm{d}}\gamma} \right\vert} \leq 3 \bar{\tau}_1 + \bar{\tau}_1^{\mathrm{F}}\]


To transform the relative regret bounds for “auxiliary” universes obtained from Lemma A.1 to regret bounds for the original universes, we will need the following.

Definition A.5

Fix \(\delta\in(0,1)\) and a universe \(\upsilon=(\mu,r)\) which is an \({\mathcal{O}}\)-realization of a catastrophe MDP \(M\) with state function \(S\). Let \(\tilde{M}\) be a \(\delta\)-perturbation of \(M\). An environment \(\tilde{\mu}\) is said to be a \(\delta\)-lift of \(\tilde{M}\) to \(\mu\) when

  1. \((\tilde{\mu},r)\) is an \({\mathcal{O}}\)-realization of \(\tilde{M}\) with state function \(S\).

  2. \(\operatorname{hdom}{\tilde{\mu}} \subseteq \operatorname{hdom}{\mu}\)

  3. For any \(h \in \operatorname{hdom}{\tilde{\mu}}\) and \(a\in{\mathcal{A}}\), if \(S(h) \in {\mathcal{S}}_M \setminus {{\mathcal{S}}^{{\mathrm{D}}}}_M\) then \(\mu(ha) = \tilde{\mu}(ha)\).

  4. For any \(h \in \operatorname{hdom}{\tilde{\mu}}\) and \(a\in{\mathcal{A}}\), if \(S(h)\in{{\mathcal{S}}^{{\mathrm{D}}}}_M\) then there exists \(\zeta\in\Delta{\mathcal{O}}\) s.t. \[\mu(ha) = (1-\delta)\tilde{\mu}(ha) + \delta\zeta\]

It is easy to see that such a lift always exists, for example we can take:

\[\tilde{\mu}(o \mid ha) := \frac{{\mathcal{T}}_{\tilde{M}}{\left(S(hao) \mid S(h),a\right)}}{{\mathcal{T}}_{M}{\left(S(hao) \mid S(h),a\right)}}\mu(o \mid ha)\]

Proposition A.4

Consider \(\gamma,\delta\in(0,1)\) s.t \(\delta \geq 1 - \gamma\) and \(\bar{\tau}\in(0,\infty)\). Let \(\upsilon=(\mu,r)\) be a universe which is an \({\mathcal{O}}\)-realization of a catastrophe MDP \(M\) with state function \(S\). Suppose that \(\pi^*\) is a mitigation policy for \(M\) that has expected mitigation time \(\bar{\tau}\). Consider some \({\mathcal{I}}^\star\)-policy \(\pi\). Suppose that \(\tilde{M}\) is a \(\delta\)-perturbation of \(M\) and \(\tilde{\mu}\) is a \(\delta\)-lift of \(\tilde{M}\) to \(\mu\). Denote \(\tilde{\upsilon}:=(\tilde{\mu},r)\). Then, there is some \(C\in(0,\infty)\) that depends on nothing s.t

\[\frac{1}{2}{\left({\operatorname{EU}}_{\upsilon}^{\pi^* S}(\gamma)-{\operatorname{EU}}_{\upsilon}^{\pi\Re^*}(\gamma)\right)} \leq {\operatorname{EU}}_{\tilde{\upsilon}^{{\mathrm{E}}}}^{\pi^* S^\star}(\gamma)-{\operatorname{EU}}_{\tilde{\upsilon}^{{\mathrm{E}}}}^{\pi}(\gamma) + C\delta{\left(\bar{\tau}+\frac{\max{\left({\operatorname{EU}}_{\tilde{\upsilon}^{{\mathrm{D}}}}^{\pi^* S^\star}(\gamma)-{\operatorname{EU}}_{\tilde{\upsilon}^{{\mathrm{D}}}}^{\pi}(\gamma),0\right)}}{1-\gamma}\right)}\]


In order to prove Proposition A.4, we need a relative regret bound for \(M\) derived from a relative regret bound for \({M^{{\mathrm{E}}}}\).

Proposition A.5

Fix an interface \({\mathcal{I}}\) and an \({\mathcal{I}}\)-universe \(\upsilon=(\mu,r)\) which is an \({\mathcal{O}}\)-realization of a catastrophe MDP \(M\) with state function \(S\) s.t. \(S({\boldsymbol{\lambda}})\not\in{{\mathcal{S}}^{{\mathrm{C}}}}_M\). Suppose that \(\pi^*\) is a mitigation policy for \(M\). Let \(\pi\) be any \({\mathcal{I}}^\star\)-policy. Then, for any \(\gamma\in(0,1)\)

\[\frac{1}{2}{\left({\operatorname{EU}}_{\upsilon}^{\pi^* S}(\gamma)-{\operatorname{EU}}_{\upsilon}^{\pi\Re^*}(\gamma)\right)} \leq {\operatorname{EU}}_{\upsilon^{\mathrm{E}}}^{\pi^* S^\star}(\gamma)-{\operatorname{EU}}_{\upsilon^{\mathrm{E}}}^{\pi}(\gamma)\]

Proof of Proposition A.5

\(\pi^*\) is a mitigation policy, therefore for any \(s \in {\mathcal{S}}_M \setminus {{\mathcal{S}}^{{\mathrm{C}}}}_M\), \({\mathcal{T}}_{{M^{{\mathrm{E}}}}\pi^*}(s)={\mathcal{T}}_{M\pi^*}(s)\). It follows that

\[{\operatorname{EU}}_{\upsilon^{\mathrm{E}}}^{\pi^* S^\star}(\gamma) = \frac{1}{2}{\operatorname{EU}}_{\upsilon}^{\pi^* S}(\gamma)\]

Also, it is easy to see from the definition of \({\mathcal{T}}_{{M^{{\mathrm{E}}}}}\) and \(r^\star\) that

\[{\operatorname{EU}}_{\upsilon^{\mathrm{E}}}^{\pi}(\gamma) \leq \frac{1}{2}{\operatorname{EU}}_{\upsilon}^{\pi\Re^*}(\gamma)\]

Indeed, any discrepancy between the behavior of \({M^{{\mathrm{E}}}}\) and \(M\) involves transition to the state \(\bot\) which yields 0 reward forever. Subtracting these inequalities, we get the desired result.


Another observation we need to prove Proposition A.4 is a bound on the effect of \(\delta\)-perturbations in terms of mitigation time.

Proposition A.6

Consider \(\gamma,\delta\in(0,1)\), a universe \(\upsilon=(\mu,r)\) which is an \({\mathcal{O}}\)-realization of a catastrophe MDP \(M\) with state function \(S\), and some \(\pi: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{\mathcal{A}}\). Assume that for any \(h \in S^{-1}{\left({{\mathcal{S}}^{{\mathrm{F}}}}_M\right)}\) and \(a \in \operatorname{supp}{\pi(h)}\), \({\mathcal{T}}_M{\left({{\mathcal{S}}^{{\mathrm{F}}}}_M \mid s,a\right)} = 1\). Let \(\tilde{M}\) be a \(\delta\)-perturbation of \(M\) and \(\tilde{\mu}\) a \(\delta\)-lift of \(\tilde{M}\) to \(\mu\). Then,

\[{\operatorname{d}_{\text{tv}}{\left(\mu\bowtie\pi,\ \tilde{\mu}\bowtie\pi\right)}} \leq 2{\left(1-{\underset{x\sim\tilde{\mu}\bowtie\pi}{\operatorname{E}}{\left[{\left(1-\delta\right)}^{\tau_S(x)}\right]}}\right)}\]

Proof of Proposition A.6

It is straightforward to construct a probability space \((\Omega,P)\), \(X: \Omega \rightarrow (A \times {\mathcal{O}})^\omega\) measurable and \({{\{C_n \subseteq \Omega\}_{n \in {\mathbb{N}}}}}\) measurable s.t.

  1. \(h_*(P) = \mu\bowtie\pi\)

  2. For any \(n \in {\mathbb{N}}\) and \(h \in (A \times {\mathcal{O}})^n\) s.t. \(S(h) \in {{\mathcal{S}}^{{\mathrm{D}}}}_M\):

\[{\underset{}{\operatorname{Pr}}{\left[C_n \mid X_{:n} = h\right]}} = \delta\]

  1. For any \(n \in {\mathbb{N}}\) and \(h \in (A \times {\mathcal{O}})^n\) s.t. \(S(h) \in {\mathcal{S}}_M \setminus {{\mathcal{S}}^{{\mathrm{D}}}}_M\):

\[{\underset{}{\operatorname{Pr}}{\left[C_n \mid X_{:n} = h\right]}} = 0\]

  1. For any \(n \in {\mathbb{N}}\), \(h \in (A \times {\mathcal{O}})^n\) and \(h' \in (A \times {\mathcal{O}})^{n+1}\):

\[{\underset{}{\operatorname{Pr}}{\left[X_{:n+1}=h' \mid X_{:n} = h \text{ and not } C_n\right]}} = {\underset{x \sim \tilde{\mu}\bowtie\pi}{\operatorname{Pr}}{\left[x_{:n+1}=h' \mid x_{:n} = h\right]}}\]

Denote \(D:=\cap_{n=0}^\infty \Omega \setminus C_n\). We have

\[P(D) = {\underset{x\sim\tilde{\mu}\bowtie\pi}{\operatorname{E}}{\left[{\left(1-\delta\right)}^{\tau_S(x)}\right]}}\]

\[{\operatorname{d}_{\text{tv}}{\left(P,\ P \mid D\right)}} \leq 1 - P(D)\]

\[{\operatorname{d}_{\text{tv}}{\left(\mu\bowtie\pi,\ h_*{\left(P \mid D\right)}\right)}} \leq 1 - P(D)\]

Also, it is easy to see that for any \(A \subseteq ({\mathcal{A}}\times {\mathcal{O}})^\omega\) measurable

\[h_*{\left(P \mid D\right)}(A)=\frac{{\underset{x\sim\tilde{\mu}\bowtie\pi}{\operatorname{E}}{\left[{\left(1-\delta\right)}^{\tau_S(x)}\chi_A\right]}}}{P(D)}\]

It follows that

\[{\operatorname{d}_{\text{tv}}{\left(h_*{\left(P \mid D\right)},\tilde{\mu}\bowtie\pi\right)}} = \frac{1}{2}{\underset{x\sim\tilde{\mu}\bowtie\pi}{\operatorname{E}}{\left[{\left\vert \frac{{\left(1-\delta\right)}^{\tau_S(x)}}{P(D)}-1 \right\vert}\right]}}\]

\[{\operatorname{d}_{\text{tv}}{\left(h_*{\left(P \mid D\right)},\tilde{\mu}\bowtie\pi\right)}} = \frac{1}{2P(D)}{\underset{x\sim\tilde{\mu}\bowtie\pi}{\operatorname{E}}{\left[{\left\vert {\left(1-\delta\right)}^{\tau_S(x)}-P(D) \right\vert}\right]}}\]

Using the fact that \({\left(1-\delta\right)}^{\tau_S(x)}\in[0,1]\) and the convexity of the function \({\left\vert t-P(D) \right\vert}\)

\[{\operatorname{d}_{\text{tv}}{\left(h_*{\left(P \mid D\right)},\tilde{\mu}\bowtie\pi\right)}} \leq \frac{1}{2P(D)}{\underset{x\sim\tilde{\mu}\bowtie\pi}{\operatorname{E}}{\left[{\left(1-2\delta\right)}^{\tau_S(x)}{\left(1-P(D)\right)}+{\left(1-{\left(1-\delta\right)}^{\tau_S(x)}\right)}P(D)\right]}}\]

\[{\operatorname{d}_{\text{tv}}{\left(h_*{\left(P \mid D\right)},\tilde{\mu}\bowtie\pi\right)}} \leq \frac{1}{2P(D)}{\left(P(D){\left(1-P(D)\right)} + {\left(1-P(D)\right)}P(D)\right)}=1-P(D)\]

Using the triangle inequality, we conclude

\[{\operatorname{d}_{\text{tv}}{\left(\mu\bowtie\pi,\ \tilde{\mu}\bowtie\pi\right)}}\leq{\operatorname{d}_{\text{tv}}{\left(\mu\bowtie\pi,\ h_*{\left(P \mid D\right)}\right)}}+{\operatorname{d}_{\text{tv}}{\left(h_*{\left(P \mid D\right)},\tilde{\mu}\bowtie\pi\right)}}=2{\left(1-{\underset{x\sim\tilde{\mu}\bowtie\pi}{\operatorname{E}}{\left[{\left(1-\delta\right)}^{\tau_S(x)}\right]}}\right)}\]


As a final ingredient towards the proof of Proposition A.4, we will need to use the relative regret bound for \({M^{{\mathrm{D}}}}\) to get a certain statistical bound on mitigation time.

Definition A.6

Let \(\mu\) be any environment. We define the closed set \(\operatorname{hdom}^\omega\mu \subseteq ({\mathcal{A}}\times {\mathcal{O}})^\omega\) by

\[\operatorname{hdom}^\omega\mu := {\left\{x \in ({\mathcal{A}}\times {\mathcal{O}})^\omega \mid \forall n \in {\mathbb{N}}: x_{:n} \in \operatorname{hdom}{\mu}\right\}}\]

Consider a universe \(\upsilon=(\mu,r)\) which is an \({\mathcal{O}}\)-realization of a catastrophe MDP \(M\) with state function \(S\). We define the measurable function \(\tau_S: \operatorname{hdom}^\omega{\mu} \rightarrow {\mathbb{N}}\sqcup \{\infty\}\) as follows

\[\tau_S(x) := \min{\left\{n \in {\mathbb{N}}\mid S{\left(x_{:n}\right)} \in {{\mathcal{S}}^{{\mathrm{F}}}}_M\right\}}\]

Proposition A.7

Fix an interface \({\mathcal{I}}\) and an \({\mathcal{I}}\)-universe \(\upsilon=(\mu,r)\) which is an \({\mathcal{O}}\)-realization of a catastrophe MDP \(M\) with state function \(S\) s.t. \(S({\boldsymbol{\lambda}})\in{{\mathcal{S}}^{{\mathrm{D}}}}_M\). Suppose that \(\pi^*\) is a mitigation policy for \(M\) that has the expected mitigation time \(\bar{\tau}\in(0,\infty)\). Let \(\pi\) be any \({\mathcal{I}}^\star\)-policy. Then, there is \(C\in(0,\infty)\) that depends on nothing s.t. for any \(\gamma,\delta\in(0,1)\), if \(\delta \geq 1-\gamma\) then

\[{\underset{\mu^{\mathrm{D}}\bowtie\pi}{\operatorname{E}}{\left[(1-\delta)^{\tau_{S^\star}}\right]}} \geq 1 - C\delta{\left(\bar{\tau}+\frac{\max{\left({\operatorname{EU}}_{\upsilon^{{\mathrm{D}}}}^{\pi^* S^\star}(\gamma)-{\operatorname{EU}}_{\upsilon^{{\mathrm{D}}}}^{\pi}(\gamma),0\right)}}{1-\gamma}\right)}\]

Proof of Proposition A.7

For any \(x \in \operatorname{hdom}{\mu^{\mathrm{D}}}\), we have

\[{\operatorname{U}}^{r^\star}_\gamma(x) \leq (1-\gamma){\left(\frac{1}{2}\sum_{n=0}^{\tau_{S^\star}(x)-1}{\gamma^n} + \sum_{n=\tau_{S^\star}(x)}^\infty \gamma^n\right)} = (1-\gamma){\left(\frac{1}{2} \cdot \frac{1 - \gamma^{\tau_{S^\star}(x)}}{1-\gamma}+\frac{\gamma^{\tau_{S^\star}(x)}}{1-\gamma}\right)}=\frac{1}{2}+\frac{1}{2}\gamma^{\tau_{S^\star}(x)}\]

It follows that

\[{\operatorname{EU}}_{{\upsilon^{{\mathrm{D}}}}}^\pi(\gamma) \leq \frac{1}{2}+\frac{1}{2}{\underset{=\mu^{\mathrm{D}}\bowtie\pi}{\operatorname{E}}{\left[\gamma^{\tau_{S^\star}}\right]}}\]

If \(x \in \operatorname{hdom}^\omega{\mu^{\mathrm{D}}}\) is s.t. for all \(n\in{\mathbb{N}}\), \(S^\star{\left(x_{:n}\right)}\ne\bot\), then

\[{\operatorname{U}}^{r^\star}_\gamma(x) \geq (1-\gamma)\sum_{n=\tau_{S^\star}(x)}^\infty \gamma^n = \gamma^{\tau_{S^\star}(x)}\]

Since \(\pi^*\) is a mitigation policy, it follows that

\[{\operatorname{EU}}_{{\upsilon^{{\mathrm{D}}}}}^{\pi^*S^\star}(\gamma) \geq {\underset{\mu^{\mathrm{D}}\bowtie\pi^*S^\star}{\operatorname{E}}{\left[\gamma^{\tau_{S^\star}}\right]}} \geq \gamma^{\bar{\tau}}\]

Subtracting the two inequalities, we get

\[{\operatorname{EU}}_{{\upsilon^{{\mathrm{D}}}}}^{\pi^*S^\star}(\gamma) - {\operatorname{EU}}_{{\upsilon^{{\mathrm{D}}}}}^\pi(\gamma) \geq \gamma^{\bar{\tau}} - \frac{1}{2} - \frac{1}{2}{\underset{\mu^{\mathrm{D}}\bowtie\pi}{\operatorname{E}}{\left[\gamma^{\tau_{S^\star}}\right]}}\]

\[{\underset{\mu^{\mathrm{D}}\bowtie\pi}{\operatorname{E}}{\left[\gamma^{\tau_{S^\star}}\right]}} \geq 2\gamma^{\bar{\tau}} - 2{\left({\operatorname{EU}}_{{\upsilon^{{\mathrm{D}}}}}^{\pi^*S^\star}(\gamma) - {\operatorname{EU}}_{{\upsilon^{{\mathrm{D}}}}}^\pi(\gamma)\right)}-1\]

Denote \(\rho := \max{\left({\operatorname{EU}}_{{\upsilon^{{\mathrm{D}}}}}^{\pi^*S^\star}(\gamma) - {\operatorname{EU}}_{{\upsilon^{{\mathrm{D}}}}}^\pi(\gamma),0\right)}\). By choosing \(C\) sufficiently large, we can assume without loss of generality that the right hand side is positive since, unless \(\gamma^{\bar{\tau}} \approx 1\), we would have \(C\delta\bar{\tau} \geq C(1-\gamma)\bar{\tau} > 1\), and unless \(\rho \ll 1\), we would have \(C\delta\frac{\rho}{1-\gamma} \geq C\rho > 1\). In either case, the inequality we are trying to prove would hold. Also, note that \(\frac{\ln{(1-\delta)}}{\ln{\gamma}} \geq 1\). We get

\[{\underset{\mu^{\mathrm{D}}\bowtie\pi}{\operatorname{E}}{\left[(1-\delta)^{\tau_{S^\star}}\right]}}={\underset{\mu^{\mathrm{D}}\bowtie\pi}{\operatorname{E}}{\left[\gamma^{\tau_{S^\star}\frac{\ln{(1-\delta)}}{\ln{\gamma}}}\right]}} \geq {\underset{\mu^{\mathrm{D}}\bowtie\pi}{\operatorname{E}}{\left[\gamma^{\tau_{S^\star}}\right]}}^{\frac{\ln{(1-\delta)}}{\ln{\gamma}}} \geq {\left(2\gamma^{\bar{\tau}}-2\rho-1\right)}^{\frac{\ln{(1-\delta)}}{\ln{\gamma}}}=(1-\delta)^{\frac{\ln{{\left(2\gamma^{\bar{\tau}}-2\rho-1\right)}}}{\ln{\gamma}}}\]

By the same reasoning as before, we can assume without loss of generality that e.g. \(\gamma^{\bar{\tau}} \leq 1 - \frac{1}{2}(1-\gamma)\bar{\tau}\). It follows that

\[2\gamma^{\bar{\tau}}-2\rho-1 \leq 2 - (1 - \gamma)\bar{\tau} - 2\rho - 1 = 1 - (1 - \gamma)\bar{\tau} - 2\rho \leq 1 - (1-\gamma) = \gamma\]

\[\frac{\ln{{\left(2\gamma^{\bar{\tau}}-2\rho-1\right)}}}{\ln{\gamma}} \geq 1\]

Combining this with the previous inequality implies

\[{\underset{\mu^{\mathrm{D}}\bowtie\pi}{\operatorname{E}}{\left[(1-\delta)^{\tau_{S^\star}}\right]}} \geq 1 - \delta \frac{\ln{{\left(2\gamma^{\bar{\tau}}-2\rho-1\right)}}}{\ln{\gamma}}\]

It is easy to see that there is \(x_0 \in (0,1)\) s.t. for any \(x \in [x_0,1]\), \(2x -1 \geq x^3\) and therefore \(\frac{2x-1}{x^3} \geq 1\). Therefore, for any such \(x\) and \(\rho \ll 1\), \(e^{-C\rho}\leq\frac{2x - 1}{x^3} - \frac{2\rho}{x_0^3}\leq\frac{2x - 2\rho - 1}{x^3}\), where it is sufficient to assume that \(C > \frac{2}{x_0^3}\). Taking \(x=\gamma^{\bar{\tau}}\), we conclude (assuming \(C \geq 3\) and observing that \(\gamma^{\frac{1}{1-\gamma}}\leq e^{-1}\))

\[\gamma^{\frac{C\rho}{1-\gamma}} \leq e^{-C\rho}\leq\frac{2\gamma^{\bar{\tau}} - 2\rho - 1}{\gamma^{3\bar{\tau}}} \leq \frac{2\gamma^{\bar{\tau}} - 2\rho - 1}{\gamma^{C\bar{\tau}}}\]

\[\gamma^{C{\left(\bar{\tau}+\frac{\rho}{1-\gamma}\right)}} \leq 2\gamma^{\bar{\tau}} - 2\rho - 1\]

Taking logarithm of both sides

\[C{\left(\bar{\tau}+\frac{\rho}{1-\gamma}\right)} \ln{\gamma} \leq \ln{{\left(2\gamma^{\bar{\tau}} - 2\rho - 1\right)}}\]

\[C{\left(\bar{\tau}+\frac{\rho}{1-\gamma}\right)} \geq \frac{\ln{{\left(2\gamma^{\bar{\tau}} - 2\rho - 1\right)}}}{ \ln{\gamma}}\]

Combining with the inequality we had before, we get

\[{\underset{\mu^{\mathrm{D}}\bowtie\pi}{\operatorname{E}}{\left[(1-\delta)^{\tau_{S^\star}}\right]}} \geq 1 - C\delta{\left(\bar{\tau}+\frac{\rho}{1-\gamma}\right)}\]

Proof of Proposition A.4

By Proposition A.5, we have

\[\frac{1}{2}{\left({\operatorname{EU}}_{\upsilon}^{\pi^* S}(\gamma)-{\operatorname{EU}}_{\upsilon}^{\pi\Re^*}(\gamma)\right)} \leq {\operatorname{EU}}_{\upsilon^{\mathrm{E}}}^{\pi^* S^\star}(\gamma)-{\operatorname{EU}}_{\upsilon^{\mathrm{E}}}^{\pi}(\gamma)\]

Note that \(\tilde{M}^{\mathrm{E}}\) is a \(\delta\)-perturbation of \(M^{\mathrm{E}}\) and \(\tilde{\mu}^{\mathrm{E}}\) is a \(\delta\)-lift of \(\tilde{M}^{\mathrm{E}}\) to \(\mu^{\mathrm{E}}\). The condition of Proposition A.6 holds tautologically due to Definition A.3. Therefore, we can apply Proposition A.6 and get

\[\frac{1}{2}{\left({\operatorname{EU}}_{\upsilon}^{\pi^* S}(\gamma)-{\operatorname{EU}}_{\upsilon}^{\pi\Re^*}(\gamma)\right)} \leq {\operatorname{EU}}_{\tilde{\upsilon}^{\mathrm{E}}}^{\pi^* S^\star}(\gamma)-{\operatorname{EU}}_{\tilde{\upsilon}^{\mathrm{E}}}^{\pi}(\gamma) + 2{\left(2-{\underset{\tilde{\mu}^{\mathrm{E}}\bowtie\pi^*S^\star}{\operatorname{E}}{\left[{\left(1-\delta\right)}^{\tau_{S^\star}}\right]}}-{\underset{\tilde{\mu}^{\mathrm{E}}\bowtie\pi}{\operatorname{E}}{\left[{\left(1-\delta\right)}^{\tau_{S^\star}}\right]}}\right)}\]

The only difference between \(\tilde{\mu}^{\mathrm{E}}\) and \(\tilde{\mu}^{\mathrm{D}}\) is the appearance of \(\Re\) instead of \(\Im\). Therefore, we can rewrite the above as

\[\frac{1}{2}{\left({\operatorname{EU}}_{\upsilon}^{\pi^* S}(\gamma)-{\operatorname{EU}}_{\upsilon}^{\pi\Re^*}(\gamma)\right)} \leq {\operatorname{EU}}_{\tilde{\upsilon}^{\mathrm{E}}}^{\pi^* S^\star}(\gamma)-{\operatorname{EU}}_{\tilde{\upsilon}^{\mathrm{E}}}^{\pi}(\gamma) + 2{\left(2-{\underset{\tilde{\mu}^{\mathrm{D}}\bowtie\pi^*S^\star}{\operatorname{E}}{\left[{\left(1-\delta\right)}^{\tau_{S^\star}}\right]}}-{\underset{\tilde{\mu}^{\mathrm{D}}\bowtie\pi}{\operatorname{E}}{\left[{\left(1-\delta\right)}^{\tau_{S^\star}}\right]}}\right)}\]

Applying Proposition A.7 to each of the last two terms, we get

\[\frac{1}{2}{\left({\operatorname{EU}}_{\upsilon}^{\pi^* S}(\gamma)-{\operatorname{EU}}_{\upsilon}^{\pi\Re^*}(\gamma)\right)} \leq {\operatorname{EU}}_{\tilde{\upsilon}^{\mathrm{E}}}^{\pi^* S^\star}(\gamma)-{\operatorname{EU}}_{\tilde{\upsilon}^{\mathrm{E}}}^{\pi}(\gamma) + 4C\delta{\left(2\bar{\tau}+\frac{\max{\left({\operatorname{EU}}_{\tilde{\upsilon}^{\mathrm{D}}}^{\pi^* S^\star}(\gamma)-{\operatorname{EU}}_{\tilde{\upsilon}^{\mathrm{D}}}^{\pi}(\gamma),0\right)}}{1-\gamma}\right)}\]


The following definition will be useful in order to apply Proposition A.4.

Definition A.7

Consider \(M\) a catastrophe MDP s.t. \({\mathcal{A}}_M = {\mathcal{A}}\) and a policy \(\sigma: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}\). We then define the catastrophe MDP \(M[\sigma]\) as follows:

  • \({{\mathcal{S}}^{{\mathrm{F}}}}_{M[\sigma]} := {{\mathcal{S}}^{{\mathrm{F}}}}_M\), \({{\mathcal{S}}^{{\mathrm{D}}}}_{M[\sigma]} := {{\mathcal{S}}^{{\mathrm{D}}}}_M\), \({{\mathcal{S}}^{{\mathrm{C}}}}_{M[\sigma]} := {{\mathcal{S}}^{{\mathrm{C}}}}_M\).

  • \({\mathcal{A}}_{M[\sigma]} := {\bar{{\mathcal{A}}}}\)

  • For any \(s \in {\mathcal{S}}_M\) and \(a \in {\mathcal{A}}\): \({\mathcal{T}}_{M[\sigma]}(s,a) := {\mathcal{T}}_M(s,a)\).

  • For any \(s \in {\mathcal{S}}_M\): \({\mathcal{T}}_{M[\sigma]}(s,\bot) := {\mathcal{T}}_{M\sigma}(s)\).

  • \({\mathcal{R}}_{M[\sigma]} := {\mathcal{R}}_M\)

Now consider \(\upsilon\) an \({\mathcal{O}}\)-realization of \(M\) with state function \(S\). Then, \(\bar{\upsilon}[\sigma S]\) is clearly an \({\bar{{\mathcal{O}}}}\)-realization of \(M[\sigma]\) with the state function \(\bar{S}\) defined by \(\bar{S}(h):=S{\left(\underline{h}\right)}\).

Note also that \(M[\sigma]^{\mathrm{D}}\cong M^{\mathrm{D}}[\sigma]\) and \(M[\sigma]^{\mathrm{E}}\cong M^{\mathrm{E}}[\sigma]\) (where interpreting \(\sigma\) as a policy for \(M^{\mathrm{D}}\) or \(M^{\mathrm{E}}\) requires choosing an arbitrary value for the state \(\bot\)). Moreover, \(\bar{{\mathcal{I}}}^\star \cong \overline{{\mathcal{I}}^\star}\), \(\bar{\upsilon}[\sigma S]^{\mathrm{D}}= \overline{\upsilon^{\mathrm{D}}}[\sigma S^\star]\), \(\bar{\upsilon}[\sigma S]^{\mathrm{E}}= \overline{\upsilon^{\mathrm{E}}}[\sigma S^\star]\) and \(\bar{S}^\star = \overline{S^\star}\).


Finally, we are read to prove the main theorem.

Proof of Theorem 1

For every \(k \in [N]\), denote \(\tilde{M}^k\) and \(\tilde{\sigma}^k\) the \(\delta\)-perturbations of \(M^k\) and \(\sigma^k\) respectively and \(\pi^k\) the deterministic proper mitigation policy for \(\tilde{M}^k\) of Definition 8. Let \(\tilde{\mu}^k\) be a lift of \(\tilde{M}^k\) to \(\mu^k\) and denote \(\tilde{\upsilon}^k:=(\tilde{\mu}^k,r^k)\). Define \({\mathcal{H}}:=\{\tilde{\upsilon}^{k{\mathrm{D}}},\tilde{\upsilon}^{k{\mathrm{E}}}\}_{k \in [N]}\). Observe that \(\tilde{\sigma}^{k}\) is \(\epsilon\)-sane relatively to \(\pi^k\) in the sense of \(\tilde{M}^{k{\mathrm{D}}}\) and \(\tilde{M}^{k{\mathrm{E}}}\) both: condition i of Definition A.1 follows by Proposition A.2 from conditions ii and iii of Definition 8, and condition ii of Definition A.1 follows from condition iv of Definition 8. Moreover, by Proposition A.3, we have

\[{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{\tilde{M}^{\mathrm{D}}\pi^k}(s,\gamma)}{{\mathrm{d}}\gamma} \right\vert} = O{\left(\bar{\tau}\right)}\]

\[{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{\tilde{M}^{\mathrm{E}}\pi^k}(s,\gamma)}{{\mathrm{d}}\gamma} \right\vert} = O{\left(\bar{\tau}\right)}\]

Here, we used that \(M^{\mathrm{F}}_k\) is fixed (and thus so is \(\bar{\tau}^{\mathrm{F}}\), by conditions i-iii).

Condition iv implies that all the universes in \({\mathcal{H}}\) have a common reward function (notice that transition to a corrupted state induces the observation \(\bot\) whereas transition to a state in \({{\mathcal{S}}^{{\mathrm{F}}}}_M\) in the universe \(\tilde{\upsilon}^{k{\mathrm{D}}}\) induces the observation \(\Im\)). Therefore, we can use Lemma A.1 to conclude that there exists an \(\overline{{\mathcal{I}}^\star}\)-policy \(\pi^\sharp\) s.t.

\[{\operatorname{EU}}_{\tilde{\upsilon}^{k{\mathrm{D}}}}^{\pi^kS^{k\star}}(1-\alpha) - {\operatorname{EU}}_{\overline{\tilde{\upsilon}^{k{\mathrm{D}}}}{\left[\tilde{\sigma}^k S^{k\star}\right]}}^{\pi^\sharp}(1-\alpha) \leq O{\left((\bar{\tau} \alpha)^{1/4}\right)}\]

\[{\operatorname{EU}}_{\tilde{\upsilon}^{k{\mathrm{E}}}}^{\pi^kS^{k\star}}(1-\alpha) - {\operatorname{EU}}_{\overline{\tilde{\upsilon}^{k{\mathrm{E}}}}{\left[\tilde{\sigma}^k S^{k\star}\right]}}^{\pi^\sharp}(1-\alpha) \leq O{\left((\bar{\tau} \alpha)^{1/4}\right)}\]

It is easy to see that \(\tilde{M}^k{\left[\tilde{\sigma}^k\right]}\) is a \(2\delta\)-perturbation of \(M^k{\left[\sigma^k\right]}\). Observe also that \(\overline{\tilde{\upsilon}^{k{\mathrm{D}}}}{\left[\tilde{\sigma}^k S^{k\star}\right]} = \overline{\tilde{\upsilon}^{k}}{\left[\tilde{\sigma}^k S^{k}\right]}^{\mathrm{D}}\), \(\overline{\tilde{\upsilon}^{k{\mathrm{E}}}}{\left[\tilde{\sigma}^k S^{k\star}\right]} = \overline{\tilde{\upsilon}^{k}}{\left[\tilde{\sigma}^k S^{k}\right]}^{\mathrm{E}}\) and \(\overline{\tilde{\mu}^{k}}{\left[\tilde{\sigma}^k S^{k}\right]}\) is a \(2\delta\)-lift of \(\tilde{M}^k{\left[\tilde{\sigma}^k\right]}\) to \(\overline{\mu^k}{\left[\sigma^k S^k\right]}\). Applying Proposition A.4, we get

\[\frac{1}{2}{\left({\operatorname{EU}}_{\upsilon^k}^{\pi^k S^{k}}(1-\alpha)-{\operatorname{EU}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}^{\pi^\sharp\Re^*}(1-\alpha)\right)} \leq O{\left((\bar{\tau}\alpha)^{1/4} + \delta{\left(\bar{\tau}+\frac{(\bar{\tau} \alpha)^{1/4}}{\alpha}\right)}\right)}\]

Setting \(\pi^* := \pi^\sharp \Re^*\) we get

\[{\operatorname{EU}}_{\upsilon^k}^{\pi^k S^{k}}(1-\alpha)-{\operatorname{EU}}_{\bar{\upsilon}^k{\left[\sigma^k S^{k}\right]}}^{\pi^*}(1-\alpha) \leq O{\left(\max{\left(\frac{\delta}{\alpha},1\right)}\cdot(\bar{\tau} \alpha)^{1/4}+\bar{\tau}\delta\right)}\]

By condition i of Definition 8, this implies

\[{\operatorname{EU}}_{\upsilon^k}^{*}(1-\alpha)-{\operatorname{EU}}_{\bar{\upsilon}^k{\left[\sigma^k S^{k}\right]}}^{\pi^*}(1-\alpha) \leq O{\left(\max{\left(\frac{\delta}{\alpha},1\right)}\cdot(\bar{\tau} \alpha)^{1/4}+\bar{\tau}\delta\right)}\]



by Gordon Worley III 28 days ago | Alex Appel and Abram Demski like this | link

Maybe it’s just my browser, but it look like it got cut off. Here’s the last of what it renders for me:

Averaging the previous inequality over kk, we get

1N∑k=0N−1R?k≤(1−γT)∑n=0∞γnTE[E[U!n∣J!n=K, Z!nT]−E[U!n∣Z!nT]]+O(1−γTη2+τ¯(1−γ)1−γT) 1N∑k=0N−1R?k≤(1−γT)∑n=0∞γnTE[E[Un!∣Jn!=K, ZnT!]−E[Un!∣ZnT!]]+O(1−γTη2+τ¯(1−γ)1−γT)

$${k=0}{N-1}R{?k} (1-^T){n=0}{nT} [[U^!_n ^!n = K, Z^!{nT}]-[U^!n Z^!{nT}]] + O(+

reply

by Vadim Kosoy 28 days ago | link

Unfortunately, it’s not just your browser. The website truncates the document for some reason. I emailed Matthew about it and ey are looking into it.

reply

by Vadim Kosoy 24 days ago | link

Indeed there is some kind of length limit in the website. I moved Appendices B and C to a separate post.

reply



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

This is exactly the sort of
by Stuart Armstrong on Being legible to other agents by committing to usi... | 0 likes

When considering an embedder
by Jack Gallagher on Where does ADT Go Wrong? | 0 likes

The differences between this
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Looking "at the very
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Without reading closely, this
by Paul Christiano on Policy Selection Solves Most Problems | 1 like

>policy selection converges
by Stuart Armstrong on Policy Selection Solves Most Problems | 0 likes

Indeed there is some kind of
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

Very nice. I wonder whether
by Vadim Kosoy on Hyperreal Brouwer | 0 likes

Freezing the reward seems
by Vadim Kosoy on Resolving human inconsistency in a simple model | 0 likes

Unfortunately, it's not just
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

>We can solve the problem in
by Wei Dai on The Happy Dance Problem | 1 like

Maybe it's just my browser,
by Gordon Worley III on Catastrophe Mitigation Using DRL | 2 likes

At present, I think the main
by Abram Demski on Looking for Recommendations RE UDT vs. bounded com... | 0 likes

In the first round I'm
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

Fine with it being shared
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

RSS

Privacy & Terms