Catastrophe Mitigation Using DRL (Appendices)
discussion post by Vadim Kosoy 25 days ago | discuss

These are Appendices B and C for the essay Catastrophe Mitigation Using DRL. They appear in a separate post because of a length limit in the website.

## Appendix B

Given $$p=(a,o)\in{\mathcal{A}}\times{\mathcal{O}}$$, we denote $$p^{\mathcal{A}}:=a$$, $$p^{\mathcal{O}}:=o$$.

# Proposition B.1

Consider a universe $$\upsilon=(\mu,r)$$ which an $${\mathcal{O}}$$-realization of an MDP $$M$$ with state function $$S$$, a stationary policy $$\pi^*: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}$$, an arbitrary $${\mathcal{I}}$$-policy $$\pi^0$$ and some $$\gamma \in (0,1)$$. Then,

${\operatorname{EU}}_{\upsilon}^{\pi^* S}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^0}(\gamma)=\sum_{n=0}^\infty {\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}_{M\pi^*}{\left(S{\left(x_{:n}\right)},\gamma\right)}-{\operatorname{Q}}_{M\pi^*}{\left(S{\left(x_{:n}\right)},x_n^{\mathcal{A}},\gamma\right)}\right]}}}$

# Proof of Proposition B.1

For the sake of encumbering the notation less, we will omit the parameter $$\gamma$$ in functions that depend on it. We will use $$S$$ implicitly, i.e. given $$F$$ a function on $${\mathcal{S}}_M$$ and $$h \in \operatorname{hdom}{\mu}$$, $$F(h):=F{\left(S(h)\right)}$$. Finally, we will omit $$M\pi^*$$, using the shorthand notations $${\operatorname{V}}:={\operatorname{V}}_{M\pi^*}$$, $${\operatorname{Q}}:={\operatorname{Q}}_{M\pi^*}$$.

For any $$x \in \operatorname{hdom}^\omega \mu$$, it is easy to see that

${\operatorname{EU}}_{\upsilon}^{\pi^* S}={\operatorname{V}}{\left({\boldsymbol{\lambda}}\right)}=\sum_{n=0}^\infty \gamma^n {\left({\operatorname{V}}{\left(x_{:n}\right)}-\gamma{\operatorname{V}}{\left(x_{:n+1}\right)}\right)}$

${\operatorname{U}}^{r}(x)=(1-\gamma)\sum_{n=0}^\infty \gamma^n r{\left(x_{:n}\right)}$

${\operatorname{EU}}_{\upsilon}^{\pi^* S} - {\operatorname{U}}^{r}(x)=\sum_{n=0}^\infty \gamma^n {\left({\operatorname{V}}{\left(x_{:n}\right)}-(1-\gamma)r{\left(x_{:n}\right)}-\gamma{\operatorname{V}}{\left(x_{:n+1}\right)}\right)}$

${\operatorname{EU}}_{\upsilon}^{\pi^* S} - {\operatorname{U}}^{r}(x)=\sum_{n=0}^\infty \gamma^n {\left({\operatorname{V}}{\left(x_{:n}\right)}-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}}\right)}+{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}}\right)}-(1-\gamma)r{\left(x_{:n}\right)}-\gamma{\operatorname{V}}{\left(x_{:n+1}\right)}\right)}$

Taking expected value over $$x$$, we get

${\operatorname{EU}}_{\upsilon}^{\pi^* S} - {\operatorname{EU}}_{\upsilon}^{\pi^0}=\sum_{n=0}^\infty \gamma^n {\left({\underset{\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}{\left(x_{:n}\right)}-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}}\right)}\right]}}+{\underset{\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}}\right)}-(1-\gamma)r{\left(x_{:n}\right)}-\gamma{\operatorname{V}}{\left(x_{:n+1}\right)}\right]}}\right)}$

It is easy to see that the second term vanishes, yielding the desired result.

# Proposition B.2

Consider some $$\tau\in(0,\infty)$$, $$T\in{\mathbb{N}}^+$$, a universe $$\upsilon=(\mu,r)$$ that is an $${\mathcal{O}}$$-realization of $$M$$ with state function $$S$$, a stationary policy $$\pi^*: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}$$ and an arbitrary $${\mathcal{I}}$$-policy $$\pi^0: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{\mathcal{A}}$$. For any $$n \in {\mathbb{N}}$$, let $$\pi^*_n$$ be an $${\mathcal{I}}$$-policy s.t. for any $$h \in \operatorname{hdom}{\mu}$$

$\pi^*_n(h):=\begin{cases} \pi^0(h) \text{ if } {\left\vert h \right\vert} < nT \\ \pi^*(h) \text{ otherwise} \end{cases}$

Assume that

1. For any $$h \in \operatorname{hdom}{\mu}$$ $\operatorname{supp}{\pi^0(h)} \subseteq {\mathcal{A}}_{M\pi^*}^0{\left(S(h)\right)}$

2. For any $$s \in {\mathcal{S}}_M$$ and $$\gamma\in(0,1)$$ ${\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi^*}{\left(s,\gamma\right)}}{{\mathrm{d}}\gamma} \right\vert} \leq \tau$

Then, for any $$\gamma\in(0,1)$$,

${\operatorname{EU}}^{\pi^*}_\upsilon(\gamma)-{\operatorname{EU}}^{\pi^0}_\upsilon(\gamma) \leq (1-\gamma)\sum_{n=0}^\infty \sum_{m=0}^{T-1} \gamma^{nT+m}\left({\underset{x\sim\mu\bowtie\pi^*_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]-{\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]\right) + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T}$

# Proof of Proposition B.2

For the sake of encumbering the notation less, we will use $$S$$ implicitly, i.e. given $$F$$ a function on $${\mathcal{S}}_M$$ and $$h \in \operatorname{hdom}{\mu}$$, $$F(h):=F{\left(S(h)\right)}$$. Also, we will omit $$M\pi^*$$, using the shorthand notations $${\operatorname{V}}:={\operatorname{V}}_{M\pi^*}$$, $${\operatorname{Q}}:={\operatorname{Q}}_{M\pi^*}$$.

By Proposition B.1, for any $$l \in {\mathbb{N}}$$

${\operatorname{EU}}_{\upsilon}^{\pi^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi_l^*}(\gamma) = \sum_{n=0}^\infty{\gamma^n {\underset{x\sim\mu\bowtie\pi_l^*}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}$

$$\pi^*_l$$ coincides with $$\pi^*$$ after $$lT$$, therefore the corresponding expected values vanish.

${\operatorname{EU}}_{\upsilon}^{\pi^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi_l^*}(\gamma) = \sum_{n=0}^{lT-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}$

Subtracting the equalities for $$l+1$$ and $$l$$, we get

${\operatorname{EU}}_{\upsilon}^{\pi_{l}^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi_{l+1}^*}(\gamma) = \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}$

$(1-\gamma)\sum_{n=0}^\infty {\gamma^n{\left({\underset{x \sim \mu\bowtie\pi^*_l}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}-{\underset{x \sim \mu\bowtie\pi^*_{l+1}}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}\right)}} = \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}$

$$\pi^*_l$$ and $$\pi^*_{l+1}$$ coincide until $$lT$$, therefore

$(1-\gamma)\sum_{n=lT}^\infty {\gamma^n{\left({\underset{x \sim \mu\bowtie\pi^*_l}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}-{\underset{x \sim \mu\bowtie\pi^*_{l+1}}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}\right)}} = \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}$

Denote $$\rho^*_l:=\mu\bowtie\pi^*_l$$, $$\rho^0:=\mu\bowtie\pi^0$$. We also use the shorthand notations $$r_n:=r{\left(x_{:n}\right)}$$, $${\operatorname{V}}_n(\gamma):={\operatorname{V}}{\left(x_{:n},\gamma\right)}$$, $${\operatorname{Q}}_n(\gamma):={\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}$$. Both $$\pi^*_l$$ and $$\pi^*_{l+1}$$ coincide with $$\pi^*$$ after $$(l+1)T$$, therefore

$(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\left({\underset{\rho^*_l}{\operatorname{E}}{\left[r_n\right]}}-{\underset{\rho^0}{\operatorname{E}}{\left[r_n\right]}}\right)}} + \gamma^{(l+1)T}{\left({\underset{\rho^*_l}{\operatorname{E}}{\left[{\operatorname{V}}_{(l+1)T}(\gamma)\right]}}-{\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_{(l+1)T}(\gamma)\right]}}\right)}= \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n(\gamma)-{\operatorname{Q}}_n(\gamma)\right]}}}$

Denote $${\operatorname{V}}'(s,\gamma):=\frac{{\mathrm{d}}{\operatorname{V}}(s,\gamma)}{{\mathrm{d}}\gamma}$$. By the mean value theorem, for each $$s\in{\mathcal{S}}_M$$ there is $$\gamma^*\in(\gamma,1)$$ s.t.

${\operatorname{V}}(s,\gamma) = {\operatorname{V}}^0(s) - {\operatorname{V}}'(s,\gamma^*)\cdot(1-\gamma)$

${\operatorname{V}}^0(s) - \tau(1-\gamma) \leq {\operatorname{V}}(s,\gamma) \leq {\operatorname{V}}^0(s) + \tau(1-\gamma)$

It follows that

$(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\underset{\rho^*_l - \rho^0}{\operatorname{E}}{\left[r_n\right]}}} + \gamma^{(l+1)T}{\left({\underset{\rho^*_l-\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}^0_{(l+1)T}\right]}}+ 2\tau(1-\gamma)\right)}\geq \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n(\gamma)-{\operatorname{Q}}_n(\gamma)\right]}}}$

Here, an expected value w.r.t. the difference of two probability measures is understood to mean the corresponding difference of expected values.

It is easy to see that assumption i implies that $${\operatorname{V}}_n^0$$ is a submartingale for $$\rho^0$$ (whereas it is a martingale for $$\mu\bowtie\pi^*$$) and therefore

${\underset{\rho^*_l-\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}^0_{(l+1)T}\right]}} \leq 0$

We get

$(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\underset{\rho^*_l - \rho^0}{\operatorname{E}}{\left[r_n\right]}}} + 2\tau\gamma^{(l+1)T}(1-\gamma)\geq \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n(\gamma)-{\operatorname{Q}}_n(\gamma)\right]}}}$

Summing over $$l$$, we get

$(1-\gamma) \sum_{l=0}^\infty \sum_{n=lT}^{(l+1)T-1} {\gamma^n{\underset{\rho^*_l - \rho^0}{\operatorname{E}}{\left[r_n\right]}}} + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T}\geq \sum_{n=0}^{\infty}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n(\gamma)-{\operatorname{Q}}_n(\gamma)\right]}}}$

Applying Proposition B.1 to the right hand side

$(1-\gamma) \sum_{l=0}^\infty \sum_{n=lT}^{(l+1)T-1} {\gamma^n{\underset{\rho^*_l - \rho^0}{\operatorname{E}}{\left[r_n\right]}}} + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T} \geq {\operatorname{EU}}_{\upsilon}^{\pi^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^0}(\gamma)$

# Proof of Lemma A.1

Fix $$\gamma \in (0,1)$$, $$\eta\in\left(0,N^{-1}\right)$$ and $$T \in {\mathbb{N}}^+$$. Denote $$\nu^k:=\bar{\mu}^k\left[\sigma^k S^k\right]$$. To avoid cumbersome notation, whenever $$M^k$$ should appear a subscript, we will replace it by $$k$$. Let $$(\Omega,P \in \Delta\Omega)$$ be a probability space. Let $$K: \Omega \rightarrow [N]$$ be a random variable and the following be stochastic processes

${Z}_n,\tilde{{Z}}_n: \Omega \rightarrow \Delta[N]$

${J}_n: \Omega \rightarrow [N]$

$\Psi_n: \Omega \rightarrow {\mathcal{A}}$

$A_n: \Omega \rightarrow {\bar{{\mathcal{A}}}}$

$\Theta_n: \Omega \rightarrow {\bar{{\mathcal{O}}}}$

We also define $$A\Theta_{:n}: \Omega \rightarrow {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}$$ by

$A\Theta_{:n}:= A_0\Theta_0A_1\Theta_1 \ldots A_{n-1}\Theta_{n-1}$

(The following conditions on $$A$$ and $$\Theta$$ imply that the range of the above is indeed in $${{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}$$.) Let $${\mathcal{D}}$$ and $${\mathcal{D}}^{!k}$$ be as in Proposition C.1 (we assume w.l.o.g. that $$\epsilon < \frac{1}{{\left\vert {\mathcal{A}}\right\vert}}$$). We construct $$\Omega$$, $$K$$, $${Z}$$, $$\tilde{{Z}}$$, $${J}$$, $$\Psi$$, $$A$$ and $$\Theta$$ s.t $$K$$ is uniformly distributed and for any $$k \in [N]$$, $$l \in {\mathbb{N}}$$, $$m \in [T]$$ and $$o \in {\mathcal{O}}$$, denoting $$n = lT+m$$

$\tilde{{Z}}_0(k)\equiv\frac{1}{N}$

${Z}_{n}(k) = \frac{\tilde{{Z}}_{n}(k)[[\tilde{{Z}}_{n}(k) \geq \eta]] }{\sum_{j = 0}^{N-1}\tilde{{Z}}_{n}(j)[[\tilde{{Z}}_{n}(j) \geq \eta]]}$

$\Pr\left[{J}_{l} = k \mid Z_{lT}\right] = {Z}_{lT}\left(k\right)$

$\Psi_{n} = \pi^{J_l}{\left(A\Theta_{:n}\right)}$

$\Pr\left[\Theta_{n} = o \mid A\Theta_{:n}\right] = \nu^K\left(o \mid A\Theta_{:n}\right)$

$A_n = {\mathcal{D}}\left(A\Theta_{:n}, \Psi_n\right)$

$\tilde{{Z}}_{n+1}(k)\sum_{j = 0}^{N-1} {Z}_n(j) [[A_n = {\mathcal{D}}^{!j}\left(A\Theta_{:n}, \Psi_n\right)]] \nu^j(\Theta_n \mid A\Theta_{:n}A_n)={Z}_{n}(k) [[A_n = {\mathcal{D}}^{!k}\left(A\Theta_{:n}, \Psi_n\right)]] \nu^k\left(\Theta_{n} \mid A\Theta_{:n}A_{n}\right)$

Note that the last equation has the form of a Bayesian update which is allowed to be arbitrary when update is on “impossible” information.

We now construct the $${{\bar{{\mathcal{I}}}}}$$-policy $$\pi^*$$ s.t. for any $$n \in {\mathbb{N}}$$, $$h \in {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}$$ s.t. $$\Pr\left[A\Theta_{:n}=h\right] > 0$$ and $$a \in {\bar{{\mathcal{A}}}}$$

$\pi^*(a \mid h):=\Pr\left[A_n = a \mid A\Theta_{:n} = h\right]$

That is, we perform Thompson sampling at time intervals of size $$T$$, moderated by the delegation routine $${\mathcal{D}}$$, and discard from our belief state hypotheses whose probability is below $$\eta$$ and hypotheses sampling which resulted in recommending “unsafe” actions i.e. actions that $${\mathcal{D}}$$ refused to perform.

In order to prove $$\pi^*$$ has the desired property, we will define the stochastic processes $${Z}^!$$, $$\tilde{{Z}}^!$$, $${J}^!$$, $$\Psi^!$$, $$A^!$$ and $$\Theta^!$$, each process of the same type as its shriekless counterpart (thus $$\Omega$$ is constructed to accommodate them). These processes are required to satisfy the following:

$\tilde{{Z}}^!_0(k)\equiv\frac{1}{N}$

${Z}_{n}^!(k) = \frac{\tilde{{Z}}^!_{n}(k)[[\tilde{{Z}}^!_{n}(k) \geq \eta]] }{\sum_{j = 0}^{N-1}\tilde{{Z}}^!_{n}(j)[[\tilde{{Z}}^!_{n}(j) \geq \eta]]}[[\tilde{{Z}}^!_{n}(K) \geq \eta]] + [[K = k]]\cdot [[\tilde{{Z}}^!_{n}(K) < \eta]]$

$\Pr\left[{J}^!_{l} = k \mid Z^!_{lT}\right] = {Z}^!_{lT}\left(k\right)$

$\Psi^!_{n} = \pi^{{J}^!_l}{\left(A\Theta^!_{:n}\right)}$

$\Pr\left[\Theta^!_{n} = o \mid A\Theta^!_{:n}\right] = \nu^K\left(o \mid A\Theta^!_{:n}\right)$

$A^!_n = {\mathcal{D}}^{!K}\left(A\Theta^!_{:n}, \Psi^!_n\right)$

$\tilde{{Z}}^!_{n+1}(k)=\frac{{Z}^!_{n}(k) [[A^!_n = {\mathcal{D}}^{!k}\left(A\Theta^!_{:n}, \Psi^!_n\right)]] \nu^k\left(\Theta^!_{n} \mid A\Theta^!_{:n}A^!_{n}\right)}{\sum_{j = 0}^{N-1} {Z}^!_n(j) [[A^!_n = {\mathcal{D}}^{!j}\left(A\Theta^!_{:n}, \Psi^!_n\right)]] \nu^j(\Theta^!_n \mid A\Theta^!_{:n}A^!_n)}$

For any $$k \in [N]$$, we construct the $${{\bar{{\mathcal{I}}}}}$$-policy $$\pi^{?k}$$ s.t. for any $$n \in {\mathbb{N}}$$, $$h \in {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}$$ s.t. $$\Pr\left[A\Theta^!_{:n}=h,\ K = k\right] > 0$$ and $$a \in {\bar{{\mathcal{A}}}}$$

$\pi^{?k}(a \mid h):=\Pr\left[A^!_n = a \mid A\Theta^!_{:n} = h,\ K = k\right]$

Given any $${{\bar{{\mathcal{I}}}}}$$-policy $$\pi$$ and $${\mathcal{I}}$$-policy $$\sigma$$ we define $$\alpha_{\sigma\pi}: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}$$ by

$\alpha_{\sigma\pi} (g \mid h) := [[h = \underline{g}]]C_h\prod_{n = 0}^{{\left\vert h \right\vert}-1} \sum_{a \in {\mathcal{A}}}\left([[g_n \in \bot a{\mathcal{O}}]] \pi\left(\bot \mid g_{:n}\right)\sigma\left(a \mid h_{:n}\right)+[[g_n \in a\bot{\mathcal{O}}]]\pi\left(a \mid g_{:n}\right)\right)$

Here, $$C_h \in {\mathbb{R}}$$ is a constant defined s.t. the probabilities sum to 1. We define the $${\mathcal{I}}$$-policy $$\left[\sigma\right]\underline{\pi}$$ by

$\left[\sigma\right]\underline{\pi}(a \mid h):=\Pr_{g \sim \alpha_{\sigma\pi}(h)}\left[\pi\left(g\right)=a \lor \left(\pi\left(g\right)=\bot \land \sigma(h)=a\right)\right]$

Condition iii of Proposition C.1 and condition i of Definition A.1 imply that for any $$h \in \operatorname{hdom}{\mu^k}$$

$\operatorname{supp}{\left[\sigma^k\right]\underline{\pi}^{?k}(h)} \subseteq {\mathcal{A}}^0_{M^k\pi^k}\left(S^k\left(h\right)\right)$

This means we can apply Proposition B.2 and get

${\operatorname{EU}}^{\pi^k S^k}_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{?k}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}(\gamma) \leq (1-\gamma)\sum_{n=0}^\infty \sum_{m=0}^{T-1} \gamma^{nT+m}\left({\underset{x\sim\mu^k\bowtie\pi^{*k}_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]-{\underset{x\sim\nu^k\bowtie\pi^{?k}}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}$

Here, the $${\mathcal{I}}$$-policy $$\pi^{*k}_n$$ is defined as $$\pi^*_n$$ in Proposition B.2. We also define the $${{\bar{{\mathcal{I}}}}}$$-policies $$\pi^{!k}_n$$ and $$\pi^{!!k}_n$$ by

$\pi^{!k}_n(a \mid h):=\begin{cases} \pi^{?k}(a \mid h) \text{ if } {\left\vert h \right\vert} < nT \\ \Pr\left[A^!_{{\left\vert h \right\vert}} = a \mid A\Theta^!_{:{{\left\vert h \right\vert}}} = h,\ K = k,\ {J}^!_n = k\right] \text{ otherwise} \end{cases}$

$\pi^{!!k}_n(a \mid h):=\begin{cases} \pi^{?k}(a \mid h) \text{ if } {\left\vert h \right\vert} < nT \\ \pi^{!k}_n(a \mid h) + \pi^{!k}_n(\bot \mid h) \cdot \pi^{*k}_n\left(a \mid \underline{h}\right) \text{ if } {\left\vert h \right\vert} \geq nT \text{ and } a \ne \bot \\ 0 \text{ if } {\left\vert h \right\vert} \geq nT \text{ and } a = \bot \end{cases}$

Denote

$\rho^{*k}_n:=\mu^k\bowtie\pi^{*k}_n$

$\rho^{!!k}_n:=\nu^k\bowtie\pi^{!!k}_n$

$\rho^{!k}_n:=\nu^k\bowtie\pi^{!k}_n$

$\rho^{?k}:=\nu^k\bowtie\pi^{?k}$

$R^{?k}:={\operatorname{EU}}^{\pi^k S^k}_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{?k}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}(\gamma)$

For each $$n \in {\mathbb{N}}$$, denote

${\operatorname{EU}}_n^{*k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{*k}_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]$

${\operatorname{EU}}_n^{!!k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{!!k}_n}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]$

${\operatorname{EU}}_n^{!k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{!k}_n}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]$

${\operatorname{EU}}_n^{?k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{?k}}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]$

We have

$R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{*k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}$

$R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{*k}_n(\gamma)-{\operatorname{EU}}^{!!k}_n(\gamma)+{\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma)+{\operatorname{EU}}^{!k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}$

Condition iv of Proposition C.1 and condition ii of Definition A.1 imply that, given $$h \in \operatorname{hdom}{\nu^k}$$ s.t. $${\left\vert h \right\vert} \geq nT$$

$\operatorname{supp}{\pi^{!k}_n(h)} \subseteq {\left\{\pi^k{\left(S^k(h)\right)},\bot\right\}}$

$\pi^{!!k}_n{\left(\pi^k{\left(S^k(h)\right)} \mid h\right)} = 1$

Therefore, $$\pi^{!!k}_n = \pi^{*k}_n$$, and we remain with

$R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma)+{\operatorname{EU}}^{!k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}$

We have

${\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma) \leq \Pr_{x\sim\rho^{!k}_n}\left[\exists m \in [T]: x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right]$

Since $${Z}_{nT}^{!}(K) \geq \eta$$, it follows that

${\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma) \leq \frac{1}{\eta}\Pr_{x\sim\rho^{?k}}\left[\exists m \in [T]: x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right] \leq \frac{1}{\eta}{\underset{x\sim\rho^{?k}}{\operatorname{E}}{\left[{\left\vert {\left\{m \in [T] \mid x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right\}} \right\vert}\right]}}$

$\sum_{n=0}^\infty {\left({\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma)\right)} \leq \frac{1}{\eta}{\underset{x\sim\rho^{?k}}{\operatorname{E}}{\left[{\left\vert {\left\{n \in {\mathbb{N}}\mid x_n \in \bot{\bar{{\mathcal{O}}}}\right\}} \right\vert}\right]}}$

Using condition i of Proposition C.1, we conclude

$R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{!k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)$

Define the random variables $${{\{U^!_n : \Omega \rightarrow [0,1]\}_{n \in {\mathbb{N}}}}}$$ by

$U^!_n:=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m} r\left(A\Theta^!_{:nT+m}\right)$

Averaging the previous inequality over $$k$$, we get

$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[{\underset{}{\operatorname{E}}}\left[U^!_n \mid {J}^!_n = K,\ Z^!_{nT}\right]-{\underset{}{\operatorname{E}}}\left[U^!_n \mid Z^!_{nT}\right]\right] + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)$

$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{(1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[\left({\underset{}{\operatorname{E}}}\left[U^!_n \mid {J}^!_n = K,\ Z^!_{nT}\right]-{\underset{}{\operatorname{E}}}\left[U^!_n \mid Z^!_{nT}\right]\right)^2\right]} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)$

We apply Proposition C.2 to each term in the sum over $$n$$.

$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{(1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[\frac{1}{2\eta}{\underset{}{\operatorname{I}}}\left[K;{J}^!_n,U^!_n \mid Z^!_{nT}\right]\right]} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)$

$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{\frac{1-\gamma^T}{2\eta}\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[{\operatorname{H}}\Big(Z^!_{nT}\Big)-{\operatorname{H}}\left(Z^!_{(n+1)T}\right)\right]} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)$

$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{\frac{1-\gamma^T}{2\eta}\ln N} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)$

$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = O\left(\sqrt{\frac{1-\gamma^T}{\eta}} +\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)$

Condition ii of Proposition C.1 implies that

${\operatorname{d}_{\text{tv}}}\left(\frac{1}{N}\sum_{k=0}^{N-1}{\bar{\nu}^k\left[\sigma^k\right]\bowtie\pi^*},\ \frac{1}{N}\sum_{k=0}^{N-1}{\bar{\nu}^k\left[\sigma^k\right]\bowtie\pi^{?k}}\right) \leq 2(N-1)\eta$

Here, the factor of 2 comes from the difference between the equations for $$Z_n$$ and $$Z^!_n$$ (we can construct and intermediate policy between $$\pi^*$$ and $$\pi^{?k}$$ and use the triangle inequality for $${\operatorname{d}_{\text{tv}}}$$). We conclude

${\operatorname{EU}}^{\pi^k S^k}_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{*}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}(\gamma) = O\left(\eta+\sqrt{\frac{1-\gamma^T}{\eta}} +\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)$

Now we set

$\eta:=\bar{\tau}^{1/4} (1-\gamma)^{1/4}$

$T:={\left\lceil \bar{\tau}^{3/4}(1-\gamma)^{-1/4} \right\rceil}$

Without loss of generality, we can assume that $$\bar{\tau}(1-\gamma) \ll 1$$ (because of the form of the bound we are proving), which implies that $$T(1-\gamma) \ll 1$$ and $$1-\gamma^T \approx T(1-\gamma) \approx \bar{\tau}^{3/4}(1-\gamma)^{3/4}$$. We get

${\operatorname{EU}}^{\pi^k S^k}_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{*}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}(\gamma) = O\left(\bar{\tau}^{1/4}(1-\gamma)^{1/4}\right)$

## Appendix C

The following is a simple special case of what appeared as “Proposition A.2” in the previous essay, where we restrict $$\pi$$ to be single-valued (the more general case isn’t needed).

# Proposition C.1

Fix an interface $${\mathcal{I}}=({\mathcal{A}},{\mathcal{O}})$$, $$N \in {\mathbb{N}}$$, $$\epsilon \in (0,\frac{1}{{\left\vert {\mathcal{A}}\right\vert}})$$, $$\eta \in (0,\frac{1}{N})$$. Consider some $$\{\sigma^k: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{\mathcal{A}}\}_{k \in [N]}$$. Then, there exist $${\mathcal{D}}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times {\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}$$ and $$\{{\mathcal{D}}^{!k}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times {\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}\}_{k \in [N]}$$ with the following properties. Given $$x \in \left(2^{\mathcal{A}}\times {\overline{{\mathcal{A}}\times {\mathcal{O}}}}\right)^*$$, we denote $$\underline{x}$$ its projection to $${{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}$$. Thus, $$\underline{\underline{x}}\in{({\mathcal{A}}\times {\mathcal{O}})^*}$$. Given $$\mu$$ an $${\mathcal{I}}$$-environment, $$\pi: \operatorname{hdom}{\mu} {\xrightarrow{\text{k}}}{\mathcal{A}}$$, $${\mathcal{D}}': {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times {\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}$$ and $$k \in [N]$$, we can define $$\Xi\left[\mu,\sigma^k,{\mathcal{D}}',\pi\right]\in \Delta\left({\mathcal{A}}\times {\overline{{\mathcal{A}}\times {\mathcal{O}}}}\right)^\omega$$ as follows

$\Xi\left[\mu,\sigma^k,{\mathcal{D}}',\pi\right]\left(b,a,o \mid x\right):=\pi\left(b \mid \underline{\underline{x}}\right){\mathcal{D}}'\left(a \mid \underline{x},b\right) \bar{\mu}[\sigma^k]\left(o \mid \underline{x}a\right)$

We require that for every $$\pi$$, $$\mu$$ and $$k$$ as above, the following conditions hold

1. ${\underset{x \sim\Xi\left[\mu,\sigma^k,{\mathcal{D}}^{!k},\pi\right]}{\operatorname{E}}}\left[{\left\vert \{n \in {\mathbb{N}}\mid x_n \in {\mathcal{A}}\times \bot \times \bar{{\mathcal{O}}}\} \right\vert}\right] \leq \frac{\ln N}{\eta \ln\left(1 + \epsilon(1-\epsilon)^{(1-\epsilon)/\epsilon}\right)}=O\left(\frac{\ln N}{\eta \epsilon}\right)$

2. $${\operatorname{d}_{\text{tv}}}\left(\frac{1}{N}\sum_{j=0}^{N-1}{\Xi\left[\mu,\sigma^j,{\mathcal{D}},\pi\right]},\frac{1}{N}\sum_{j=0}^{N-1}{\Xi\left[\mu,\sigma^j,{\mathcal{D}}^{!j},\pi\right]}\right) \leq (N-1)\eta$$

3. For all $$x \in \operatorname{hdom}{\bar{\mu}[\sigma^k]}$$, if $${\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \ne \bot$$ then $$\sigma^k\left({\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \mid \underline{x}\right) > 0$$

4. For all $$x \in \operatorname{hdom}{\bar{\mu}[\sigma^k]}$$, if $${\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \not\in {\left\{\pi\left(\underline{x}\right), \bot\right\}}$$ then $$\sigma^k\left(\pi\left(\underline{x}\right) \mid \underline{x}\right) \leq \epsilon$$

The following appeared in the previous essay as “Proposition A.1”.

# Proposition C.2

Consider a probability space $$(\Omega, P \in \Delta\Omega)$$, $$N \in {\mathbb{N}}$$, $$R \subseteq [0,1]$$ a finite set and random variables $$U: \Omega \rightarrow R$$, $$K: \Omega \rightarrow [N]$$ and $${J}: \Omega \rightarrow [N]$$. Assume that $$K_*P = J_*P = \zeta \in \Delta[N]$$ and $${\underset{}{\operatorname{I}}}[K;J] = 0$$. Then

${\underset{}{\operatorname{I}}}\left[K;J,U\right] \geq 2 \left(\min_{i \in [N]} {\zeta(i)}\right) \left({\underset{}{\operatorname{E}}}\left[U \mid J = K\right]-{\underset{}{\operatorname{E}}}\left[U\right]\right)^2$

### RECENT COMMENTS

This is exactly the sort of
 by Stuart Armstrong on Being legible to other agents by committing to usi... | 0 likes

When considering an embedder
 by Jack Gallagher on Where does ADT Go Wrong? | 0 likes

The differences between this
 by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Looking "at the very
 by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Without reading closely, this
 by Paul Christiano on Policy Selection Solves Most Problems | 1 like

>policy selection converges
 by Stuart Armstrong on Policy Selection Solves Most Problems | 0 likes

Indeed there is some kind of
 by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

Very nice. I wonder whether
 by Vadim Kosoy on Hyperreal Brouwer | 0 likes

Freezing the reward seems
 by Vadim Kosoy on Resolving human inconsistency in a simple model | 0 likes

Unfortunately, it's not just
 by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

>We can solve the problem in
 by Wei Dai on The Happy Dance Problem | 1 like

Maybe it's just my browser,
 by Gordon Worley III on Catastrophe Mitigation Using DRL | 2 likes

At present, I think the main
 by Abram Demski on Looking for Recommendations RE UDT vs. bounded com... | 0 likes

In the first round I'm
 by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

Fine with it being shared
 by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

Privacy & Terms