Intelligent Agent Foundations Forumsign up / log in
Catastrophe Mitigation Using DRL (Appendices)
discussion post by Vadim Kosoy 25 days ago | discuss

These are Appendices B and C for the essay Catastrophe Mitigation Using DRL. They appear in a separate post because of a length limit in the website.

Appendix B

Given \(p=(a,o)\in{\mathcal{A}}\times{\mathcal{O}}\), we denote \(p^{\mathcal{A}}:=a\), \(p^{\mathcal{O}}:=o\).

Proposition B.1

Consider a universe \(\upsilon=(\mu,r)\) which an \({\mathcal{O}}\)-realization of an MDP \(M\) with state function \(S\), a stationary policy \(\pi^*: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}\), an arbitrary \({\mathcal{I}}\)-policy \(\pi^0\) and some \(\gamma \in (0,1)\). Then,

\[{\operatorname{EU}}_{\upsilon}^{\pi^* S}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^0}(\gamma)=\sum_{n=0}^\infty {\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}_{M\pi^*}{\left(S{\left(x_{:n}\right)},\gamma\right)}-{\operatorname{Q}}_{M\pi^*}{\left(S{\left(x_{:n}\right)},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

Proof of Proposition B.1

For the sake of encumbering the notation less, we will omit the parameter \(\gamma\) in functions that depend on it. We will use \(S\) implicitly, i.e. given \(F\) a function on \({\mathcal{S}}_M\) and \(h \in \operatorname{hdom}{\mu}\), \(F(h):=F{\left(S(h)\right)}\). Finally, we will omit \(M\pi^*\), using the shorthand notations \({\operatorname{V}}:={\operatorname{V}}_{M\pi^*}\), \({\operatorname{Q}}:={\operatorname{Q}}_{M\pi^*}\).

For any \(x \in \operatorname{hdom}^\omega \mu\), it is easy to see that

\[{\operatorname{EU}}_{\upsilon}^{\pi^* S}={\operatorname{V}}{\left({\boldsymbol{\lambda}}\right)}=\sum_{n=0}^\infty \gamma^n {\left({\operatorname{V}}{\left(x_{:n}\right)}-\gamma{\operatorname{V}}{\left(x_{:n+1}\right)}\right)}\]

\[{\operatorname{U}}^{r}(x)=(1-\gamma)\sum_{n=0}^\infty \gamma^n r{\left(x_{:n}\right)}\]

\[{\operatorname{EU}}_{\upsilon}^{\pi^* S} - {\operatorname{U}}^{r}(x)=\sum_{n=0}^\infty \gamma^n {\left({\operatorname{V}}{\left(x_{:n}\right)}-(1-\gamma)r{\left(x_{:n}\right)}-\gamma{\operatorname{V}}{\left(x_{:n+1}\right)}\right)}\]

\[{\operatorname{EU}}_{\upsilon}^{\pi^* S} - {\operatorname{U}}^{r}(x)=\sum_{n=0}^\infty \gamma^n {\left({\operatorname{V}}{\left(x_{:n}\right)}-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}}\right)}+{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}}\right)}-(1-\gamma)r{\left(x_{:n}\right)}-\gamma{\operatorname{V}}{\left(x_{:n+1}\right)}\right)}\]

Taking expected value over \(x\), we get

\[{\operatorname{EU}}_{\upsilon}^{\pi^* S} - {\operatorname{EU}}_{\upsilon}^{\pi^0}=\sum_{n=0}^\infty \gamma^n {\left({\underset{\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}{\left(x_{:n}\right)}-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}}\right)}\right]}}+{\underset{\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}}\right)}-(1-\gamma)r{\left(x_{:n}\right)}-\gamma{\operatorname{V}}{\left(x_{:n+1}\right)}\right]}}\right)}\]

It is easy to see that the second term vanishes, yielding the desired result.

Proposition B.2

Consider some \(\tau\in(0,\infty)\), \(T\in{\mathbb{N}}^+\), a universe \(\upsilon=(\mu,r)\) that is an \({\mathcal{O}}\)-realization of \(M\) with state function \(S\), a stationary policy \(\pi^*: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}\) and an arbitrary \({\mathcal{I}}\)-policy \(\pi^0: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{\mathcal{A}}\). For any \(n \in {\mathbb{N}}\), let \(\pi^*_n\) be an \({\mathcal{I}}\)-policy s.t. for any \(h \in \operatorname{hdom}{\mu}\)

\[\pi^*_n(h):=\begin{cases} \pi^0(h) \text{ if } {\left\vert h \right\vert} < nT \\ \pi^*(h) \text{ otherwise} \end{cases}\]

Assume that

  1. For any \(h \in \operatorname{hdom}{\mu}\) \[\operatorname{supp}{\pi^0(h)} \subseteq {\mathcal{A}}_{M\pi^*}^0{\left(S(h)\right)}\]

  2. For any \(s \in {\mathcal{S}}_M\) and \(\gamma\in(0,1)\) \[{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi^*}{\left(s,\gamma\right)}}{{\mathrm{d}}\gamma} \right\vert} \leq \tau\]

Then, for any \(\gamma\in(0,1)\),

\[{\operatorname{EU}}^{\pi^*}_\upsilon(\gamma)-{\operatorname{EU}}^{\pi^0}_\upsilon(\gamma) \leq (1-\gamma)\sum_{n=0}^\infty \sum_{m=0}^{T-1} \gamma^{nT+m}\left({\underset{x\sim\mu\bowtie\pi^*_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]-{\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]\right) + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T}\]

Proof of Proposition B.2

For the sake of encumbering the notation less, we will use \(S\) implicitly, i.e. given \(F\) a function on \({\mathcal{S}}_M\) and \(h \in \operatorname{hdom}{\mu}\), \(F(h):=F{\left(S(h)\right)}\). Also, we will omit \(M\pi^*\), using the shorthand notations \({\operatorname{V}}:={\operatorname{V}}_{M\pi^*}\), \({\operatorname{Q}}:={\operatorname{Q}}_{M\pi^*}\).

By Proposition B.1, for any \(l \in {\mathbb{N}}\)

\[{\operatorname{EU}}_{\upsilon}^{\pi^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi_l^*}(\gamma) = \sum_{n=0}^\infty{\gamma^n {\underset{x\sim\mu\bowtie\pi_l^*}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

\(\pi^*_l\) coincides with \(\pi^*\) after \(lT\), therefore the corresponding expected values vanish.

\[{\operatorname{EU}}_{\upsilon}^{\pi^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi_l^*}(\gamma) = \sum_{n=0}^{lT-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

Subtracting the equalities for \(l+1\) and \(l\), we get

\[{\operatorname{EU}}_{\upsilon}^{\pi_{l}^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi_{l+1}^*}(\gamma) = \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

\[(1-\gamma)\sum_{n=0}^\infty {\gamma^n{\left({\underset{x \sim \mu\bowtie\pi^*_l}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}-{\underset{x \sim \mu\bowtie\pi^*_{l+1}}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}\right)}} = \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

\(\pi^*_l\) and \(\pi^*_{l+1}\) coincide until \(lT\), therefore

\[(1-\gamma)\sum_{n=lT}^\infty {\gamma^n{\left({\underset{x \sim \mu\bowtie\pi^*_l}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}-{\underset{x \sim \mu\bowtie\pi^*_{l+1}}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}\right)}} = \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

Denote \(\rho^*_l:=\mu\bowtie\pi^*_l\), \(\rho^0:=\mu\bowtie\pi^0\). We also use the shorthand notations \(r_n:=r{\left(x_{:n}\right)}\), \({\operatorname{V}}_n(\gamma):={\operatorname{V}}{\left(x_{:n},\gamma\right)}\), \({\operatorname{Q}}_n(\gamma):={\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\). Both \(\pi^*_l\) and \(\pi^*_{l+1}\) coincide with \(\pi^*\) after \((l+1)T\), therefore

\[(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\left({\underset{\rho^*_l}{\operatorname{E}}{\left[r_n\right]}}-{\underset{\rho^0}{\operatorname{E}}{\left[r_n\right]}}\right)}} + \gamma^{(l+1)T}{\left({\underset{\rho^*_l}{\operatorname{E}}{\left[{\operatorname{V}}_{(l+1)T}(\gamma)\right]}}-{\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_{(l+1)T}(\gamma)\right]}}\right)}= \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n(\gamma)-{\operatorname{Q}}_n(\gamma)\right]}}}\]

Denote \({\operatorname{V}}'(s,\gamma):=\frac{{\mathrm{d}}{\operatorname{V}}(s,\gamma)}{{\mathrm{d}}\gamma}\). By the mean value theorem, for each \(s\in{\mathcal{S}}_M\) there is \(\gamma^*\in(\gamma,1)\) s.t.

\[{\operatorname{V}}(s,\gamma) = {\operatorname{V}}^0(s) - {\operatorname{V}}'(s,\gamma^*)\cdot(1-\gamma)\]

\[{\operatorname{V}}^0(s) - \tau(1-\gamma) \leq {\operatorname{V}}(s,\gamma) \leq {\operatorname{V}}^0(s) + \tau(1-\gamma)\]

It follows that

\[(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\underset{\rho^*_l - \rho^0}{\operatorname{E}}{\left[r_n\right]}}} + \gamma^{(l+1)T}{\left({\underset{\rho^*_l-\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}^0_{(l+1)T}\right]}}+ 2\tau(1-\gamma)\right)}\geq \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n(\gamma)-{\operatorname{Q}}_n(\gamma)\right]}}}\]

Here, an expected value w.r.t. the difference of two probability measures is understood to mean the corresponding difference of expected values.

It is easy to see that assumption i implies that \({\operatorname{V}}_n^0\) is a submartingale for \(\rho^0\) (whereas it is a martingale for \(\mu\bowtie\pi^*\)) and therefore

\[{\underset{\rho^*_l-\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}^0_{(l+1)T}\right]}} \leq 0\]

We get

\[(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\underset{\rho^*_l - \rho^0}{\operatorname{E}}{\left[r_n\right]}}} + 2\tau\gamma^{(l+1)T}(1-\gamma)\geq \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n(\gamma)-{\operatorname{Q}}_n(\gamma)\right]}}}\]

Summing over \(l\), we get

\[(1-\gamma) \sum_{l=0}^\infty \sum_{n=lT}^{(l+1)T-1} {\gamma^n{\underset{\rho^*_l - \rho^0}{\operatorname{E}}{\left[r_n\right]}}} + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T}\geq \sum_{n=0}^{\infty}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n(\gamma)-{\operatorname{Q}}_n(\gamma)\right]}}}\]

Applying Proposition B.1 to the right hand side

\[(1-\gamma) \sum_{l=0}^\infty \sum_{n=lT}^{(l+1)T-1} {\gamma^n{\underset{\rho^*_l - \rho^0}{\operatorname{E}}{\left[r_n\right]}}} + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T} \geq {\operatorname{EU}}_{\upsilon}^{\pi^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^0}(\gamma)\]

Proof of Lemma A.1

Fix \(\gamma \in (0,1)\), \(\eta\in\left(0,N^{-1}\right)\) and \(T \in {\mathbb{N}}^+\). Denote \(\nu^k:=\bar{\mu}^k\left[\sigma^k S^k\right]\). To avoid cumbersome notation, whenever \(M^k\) should appear a subscript, we will replace it by \(k\). Let \((\Omega,P \in \Delta\Omega)\) be a probability space. Let \(K: \Omega \rightarrow [N]\) be a random variable and the following be stochastic processes

\[{Z}_n,\tilde{{Z}}_n: \Omega \rightarrow \Delta[N]\]

\[{J}_n: \Omega \rightarrow [N]\]

\[\Psi_n: \Omega \rightarrow {\mathcal{A}}\]

\[A_n: \Omega \rightarrow {\bar{{\mathcal{A}}}}\]

\[\Theta_n: \Omega \rightarrow {\bar{{\mathcal{O}}}}\]

We also define \(A\Theta_{:n}: \Omega \rightarrow {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\) by

\[A\Theta_{:n}:= A_0\Theta_0A_1\Theta_1 \ldots A_{n-1}\Theta_{n-1}\]

(The following conditions on \(A\) and \(\Theta\) imply that the range of the above is indeed in \({{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\).) Let \({\mathcal{D}}\) and \({\mathcal{D}}^{!k}\) be as in Proposition C.1 (we assume w.l.o.g. that \(\epsilon < \frac{1}{{\left\vert {\mathcal{A}}\right\vert}}\)). We construct \(\Omega\), \(K\), \({Z}\), \(\tilde{{Z}}\), \({J}\), \(\Psi\), \(A\) and \(\Theta\) s.t \(K\) is uniformly distributed and for any \(k \in [N]\), \(l \in {\mathbb{N}}\), \(m \in [T]\) and \(o \in {\mathcal{O}}\), denoting \(n = lT+m\)

\[\tilde{{Z}}_0(k)\equiv\frac{1}{N}\]

\[{Z}_{n}(k) = \frac{\tilde{{Z}}_{n}(k)[[\tilde{{Z}}_{n}(k) \geq \eta]] }{\sum_{j = 0}^{N-1}\tilde{{Z}}_{n}(j)[[\tilde{{Z}}_{n}(j) \geq \eta]]}\]

\[\Pr\left[{J}_{l} = k \mid Z_{lT}\right] = {Z}_{lT}\left(k\right)\]

\[\Psi_{n} = \pi^{J_l}{\left(A\Theta_{:n}\right)}\]

\[\Pr\left[\Theta_{n} = o \mid A\Theta_{:n}\right] = \nu^K\left(o \mid A\Theta_{:n}\right)\]

\[A_n = {\mathcal{D}}\left(A\Theta_{:n}, \Psi_n\right)\]

\[\tilde{{Z}}_{n+1}(k)\sum_{j = 0}^{N-1} {Z}_n(j) [[A_n = {\mathcal{D}}^{!j}\left(A\Theta_{:n}, \Psi_n\right)]] \nu^j(\Theta_n \mid A\Theta_{:n}A_n)={Z}_{n}(k) [[A_n = {\mathcal{D}}^{!k}\left(A\Theta_{:n}, \Psi_n\right)]] \nu^k\left(\Theta_{n} \mid A\Theta_{:n}A_{n}\right)\]

Note that the last equation has the form of a Bayesian update which is allowed to be arbitrary when update is on “impossible” information.

We now construct the \({{\bar{{\mathcal{I}}}}}\)-policy \(\pi^*\) s.t. for any \(n \in {\mathbb{N}}\), \(h \in {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\) s.t. \(\Pr\left[A\Theta_{:n}=h\right] > 0\) and \(a \in {\bar{{\mathcal{A}}}}\)

\[\pi^*(a \mid h):=\Pr\left[A_n = a \mid A\Theta_{:n} = h\right]\]

That is, we perform Thompson sampling at time intervals of size \(T\), moderated by the delegation routine \({\mathcal{D}}\), and discard from our belief state hypotheses whose probability is below \(\eta\) and hypotheses sampling which resulted in recommending “unsafe” actions i.e. actions that \({\mathcal{D}}\) refused to perform.

In order to prove \(\pi^*\) has the desired property, we will define the stochastic processes \({Z}^!\), \(\tilde{{Z}}^!\), \({J}^!\), \(\Psi^!\), \(A^!\) and \(\Theta^!\), each process of the same type as its shriekless counterpart (thus \(\Omega\) is constructed to accommodate them). These processes are required to satisfy the following:

\[\tilde{{Z}}^!_0(k)\equiv\frac{1}{N}\]

\[{Z}_{n}^!(k) = \frac{\tilde{{Z}}^!_{n}(k)[[\tilde{{Z}}^!_{n}(k) \geq \eta]] }{\sum_{j = 0}^{N-1}\tilde{{Z}}^!_{n}(j)[[\tilde{{Z}}^!_{n}(j) \geq \eta]]}[[\tilde{{Z}}^!_{n}(K) \geq \eta]] + [[K = k]]\cdot [[\tilde{{Z}}^!_{n}(K) < \eta]]\]

\[\Pr\left[{J}^!_{l} = k \mid Z^!_{lT}\right] = {Z}^!_{lT}\left(k\right)\]

\[\Psi^!_{n} = \pi^{{J}^!_l}{\left(A\Theta^!_{:n}\right)}\]

\[\Pr\left[\Theta^!_{n} = o \mid A\Theta^!_{:n}\right] = \nu^K\left(o \mid A\Theta^!_{:n}\right)\]

\[A^!_n = {\mathcal{D}}^{!K}\left(A\Theta^!_{:n}, \Psi^!_n\right)\]

\[\tilde{{Z}}^!_{n+1}(k)=\frac{{Z}^!_{n}(k) [[A^!_n = {\mathcal{D}}^{!k}\left(A\Theta^!_{:n}, \Psi^!_n\right)]] \nu^k\left(\Theta^!_{n} \mid A\Theta^!_{:n}A^!_{n}\right)}{\sum_{j = 0}^{N-1} {Z}^!_n(j) [[A^!_n = {\mathcal{D}}^{!j}\left(A\Theta^!_{:n}, \Psi^!_n\right)]] \nu^j(\Theta^!_n \mid A\Theta^!_{:n}A^!_n)}\]

For any \(k \in [N]\), we construct the \({{\bar{{\mathcal{I}}}}}\)-policy \(\pi^{?k}\) s.t. for any \(n \in {\mathbb{N}}\), \(h \in {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\) s.t. \(\Pr\left[A\Theta^!_{:n}=h,\ K = k\right] > 0\) and \(a \in {\bar{{\mathcal{A}}}}\)

\[\pi^{?k}(a \mid h):=\Pr\left[A^!_n = a \mid A\Theta^!_{:n} = h,\ K = k\right]\]

Given any \({{\bar{{\mathcal{I}}}}}\)-policy \(\pi\) and \({\mathcal{I}}\)-policy \(\sigma\) we define \(\alpha_{\sigma\pi}: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\) by

\[\alpha_{\sigma\pi} (g \mid h) := [[h = \underline{g}]]C_h\prod_{n = 0}^{{\left\vert h \right\vert}-1} \sum_{a \in {\mathcal{A}}}\left([[g_n \in \bot a{\mathcal{O}}]] \pi\left(\bot \mid g_{:n}\right)\sigma\left(a \mid h_{:n}\right)+[[g_n \in a\bot{\mathcal{O}}]]\pi\left(a \mid g_{:n}\right)\right)\]

Here, \(C_h \in {\mathbb{R}}\) is a constant defined s.t. the probabilities sum to 1. We define the \({\mathcal{I}}\)-policy \(\left[\sigma\right]\underline{\pi}\) by

\[\left[\sigma\right]\underline{\pi}(a \mid h):=\Pr_{g \sim \alpha_{\sigma\pi}(h)}\left[\pi\left(g\right)=a \lor \left(\pi\left(g\right)=\bot \land \sigma(h)=a\right)\right]\]

Condition iii of Proposition C.1 and condition i of Definition A.1 imply that for any \(h \in \operatorname{hdom}{\mu^k}\)

\[\operatorname{supp}{\left[\sigma^k\right]\underline{\pi}^{?k}(h)} \subseteq {\mathcal{A}}^0_{M^k\pi^k}\left(S^k\left(h\right)\right)\]

This means we can apply Proposition B.2 and get

\[{\operatorname{EU}}^{\pi^k S^k}_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{?k}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}(\gamma) \leq (1-\gamma)\sum_{n=0}^\infty \sum_{m=0}^{T-1} \gamma^{nT+m}\left({\underset{x\sim\mu^k\bowtie\pi^{*k}_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]-{\underset{x\sim\nu^k\bowtie\pi^{?k}}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}\]

Here, the \({\mathcal{I}}\)-policy \(\pi^{*k}_n\) is defined as \(\pi^*_n\) in Proposition B.2. We also define the \({{\bar{{\mathcal{I}}}}}\)-policies \(\pi^{!k}_n\) and \(\pi^{!!k}_n\) by

\[\pi^{!k}_n(a \mid h):=\begin{cases} \pi^{?k}(a \mid h) \text{ if } {\left\vert h \right\vert} < nT \\ \Pr\left[A^!_{{\left\vert h \right\vert}} = a \mid A\Theta^!_{:{{\left\vert h \right\vert}}} = h,\ K = k,\ {J}^!_n = k\right] \text{ otherwise} \end{cases}\]

\[\pi^{!!k}_n(a \mid h):=\begin{cases} \pi^{?k}(a \mid h) \text{ if } {\left\vert h \right\vert} < nT \\ \pi^{!k}_n(a \mid h) + \pi^{!k}_n(\bot \mid h) \cdot \pi^{*k}_n\left(a \mid \underline{h}\right) \text{ if } {\left\vert h \right\vert} \geq nT \text{ and } a \ne \bot \\ 0 \text{ if } {\left\vert h \right\vert} \geq nT \text{ and } a = \bot \end{cases}\]

Denote

\[\rho^{*k}_n:=\mu^k\bowtie\pi^{*k}_n\]

\[\rho^{!!k}_n:=\nu^k\bowtie\pi^{!!k}_n\]

\[\rho^{!k}_n:=\nu^k\bowtie\pi^{!k}_n\]

\[\rho^{?k}:=\nu^k\bowtie\pi^{?k}\]

\[R^{?k}:={\operatorname{EU}}^{\pi^k S^k}_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{?k}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}(\gamma)\]

For each \(n \in {\mathbb{N}}\), denote

\[{\operatorname{EU}}_n^{*k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{*k}_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]\]

\[{\operatorname{EU}}_n^{!!k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{!!k}_n}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\]

\[{\operatorname{EU}}_n^{!k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{!k}_n}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\]

\[{\operatorname{EU}}_n^{?k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{?k}}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\]

We have

\[R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{*k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}\]

\[R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{*k}_n(\gamma)-{\operatorname{EU}}^{!!k}_n(\gamma)+{\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma)+{\operatorname{EU}}^{!k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}\]

Condition iv of Proposition C.1 and condition ii of Definition A.1 imply that, given \(h \in \operatorname{hdom}{\nu^k}\) s.t. \({\left\vert h \right\vert} \geq nT\)

\[\operatorname{supp}{\pi^{!k}_n(h)} \subseteq {\left\{\pi^k{\left(S^k(h)\right)},\bot\right\}}\]

\[\pi^{!!k}_n{\left(\pi^k{\left(S^k(h)\right)} \mid h\right)} = 1\]

Therefore, \(\pi^{!!k}_n = \pi^{*k}_n\), and we remain with

\[R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma)+{\operatorname{EU}}^{!k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}\]

We have

\[{\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma) \leq \Pr_{x\sim\rho^{!k}_n}\left[\exists m \in [T]: x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right]\]

Since \({Z}_{nT}^{!}(K) \geq \eta\), it follows that

\[{\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma) \leq \frac{1}{\eta}\Pr_{x\sim\rho^{?k}}\left[\exists m \in [T]: x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right] \leq \frac{1}{\eta}{\underset{x\sim\rho^{?k}}{\operatorname{E}}{\left[{\left\vert {\left\{m \in [T] \mid x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right\}} \right\vert}\right]}}\]

\[\sum_{n=0}^\infty {\left({\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma)\right)} \leq \frac{1}{\eta}{\underset{x\sim\rho^{?k}}{\operatorname{E}}{\left[{\left\vert {\left\{n \in {\mathbb{N}}\mid x_n \in \bot{\bar{{\mathcal{O}}}}\right\}} \right\vert}\right]}}\]

Using condition i of Proposition C.1, we conclude

\[R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{!k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

Define the random variables \({{\{U^!_n : \Omega \rightarrow [0,1]\}_{n \in {\mathbb{N}}}}}\) by

\[U^!_n:=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m} r\left(A\Theta^!_{:nT+m}\right)\]

Averaging the previous inequality over \(k\), we get

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[{\underset{}{\operatorname{E}}}\left[U^!_n \mid {J}^!_n = K,\ Z^!_{nT}\right]-{\underset{}{\operatorname{E}}}\left[U^!_n \mid Z^!_{nT}\right]\right] + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{(1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[\left({\underset{}{\operatorname{E}}}\left[U^!_n \mid {J}^!_n = K,\ Z^!_{nT}\right]-{\underset{}{\operatorname{E}}}\left[U^!_n \mid Z^!_{nT}\right]\right)^2\right]} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

We apply Proposition C.2 to each term in the sum over \(n\).

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{(1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[\frac{1}{2\eta}{\underset{}{\operatorname{I}}}\left[K;{J}^!_n,U^!_n \mid Z^!_{nT}\right]\right]} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{\frac{1-\gamma^T}{2\eta}\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[{\operatorname{H}}\Big(Z^!_{nT}\Big)-{\operatorname{H}}\left(Z^!_{(n+1)T}\right)\right]} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{\frac{1-\gamma^T}{2\eta}\ln N} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = O\left(\sqrt{\frac{1-\gamma^T}{\eta}} +\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

Condition ii of Proposition C.1 implies that

\[{\operatorname{d}_{\text{tv}}}\left(\frac{1}{N}\sum_{k=0}^{N-1}{\bar{\nu}^k\left[\sigma^k\right]\bowtie\pi^*},\ \frac{1}{N}\sum_{k=0}^{N-1}{\bar{\nu}^k\left[\sigma^k\right]\bowtie\pi^{?k}}\right) \leq 2(N-1)\eta\]

Here, the factor of 2 comes from the difference between the equations for \(Z_n\) and \(Z^!_n\) (we can construct and intermediate policy between \(\pi^*\) and \(\pi^{?k}\) and use the triangle inequality for \({\operatorname{d}_{\text{tv}}}\)). We conclude

\[{\operatorname{EU}}^{\pi^k S^k}_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{*}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}(\gamma) = O\left(\eta+\sqrt{\frac{1-\gamma^T}{\eta}} +\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

Now we set

\[\eta:=\bar{\tau}^{1/4} (1-\gamma)^{1/4}\]

\[T:={\left\lceil \bar{\tau}^{3/4}(1-\gamma)^{-1/4} \right\rceil}\]

Without loss of generality, we can assume that \(\bar{\tau}(1-\gamma) \ll 1\) (because of the form of the bound we are proving), which implies that \(T(1-\gamma) \ll 1\) and \(1-\gamma^T \approx T(1-\gamma) \approx \bar{\tau}^{3/4}(1-\gamma)^{3/4}\). We get

\[{\operatorname{EU}}^{\pi^k S^k}_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{*}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}(\gamma) = O\left(\bar{\tau}^{1/4}(1-\gamma)^{1/4}\right)\]

Appendix C

The following is a simple special case of what appeared as “Proposition A.2” in the previous essay, where we restrict \(\pi\) to be single-valued (the more general case isn’t needed).

Proposition C.1

Fix an interface \({\mathcal{I}}=({\mathcal{A}},{\mathcal{O}})\), \(N \in {\mathbb{N}}\), \(\epsilon \in (0,\frac{1}{{\left\vert {\mathcal{A}}\right\vert}})\), \(\eta \in (0,\frac{1}{N})\). Consider some \(\{\sigma^k: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{\mathcal{A}}\}_{k \in [N]}\). Then, there exist \({\mathcal{D}}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times {\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}\) and \(\{{\mathcal{D}}^{!k}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times {\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}\}_{k \in [N]}\) with the following properties. Given \(x \in \left(2^{\mathcal{A}}\times {\overline{{\mathcal{A}}\times {\mathcal{O}}}}\right)^*\), we denote \(\underline{x}\) its projection to \({{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\). Thus, \(\underline{\underline{x}}\in{({\mathcal{A}}\times {\mathcal{O}})^*}\). Given \(\mu\) an \({\mathcal{I}}\)-environment, \(\pi: \operatorname{hdom}{\mu} {\xrightarrow{\text{k}}}{\mathcal{A}}\), \({\mathcal{D}}': {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times {\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}\) and \(k \in [N]\), we can define \(\Xi\left[\mu,\sigma^k,{\mathcal{D}}',\pi\right]\in \Delta\left({\mathcal{A}}\times {\overline{{\mathcal{A}}\times {\mathcal{O}}}}\right)^\omega\) as follows

\[\Xi\left[\mu,\sigma^k,{\mathcal{D}}',\pi\right]\left(b,a,o \mid x\right):=\pi\left(b \mid \underline{\underline{x}}\right){\mathcal{D}}'\left(a \mid \underline{x},b\right) \bar{\mu}[\sigma^k]\left(o \mid \underline{x}a\right)\]

We require that for every \(\pi\), \(\mu\) and \(k\) as above, the following conditions hold

  1. \[{\underset{x \sim\Xi\left[\mu,\sigma^k,{\mathcal{D}}^{!k},\pi\right]}{\operatorname{E}}}\left[{\left\vert \{n \in {\mathbb{N}}\mid x_n \in {\mathcal{A}}\times \bot \times \bar{{\mathcal{O}}}\} \right\vert}\right] \leq \frac{\ln N}{\eta \ln\left(1 + \epsilon(1-\epsilon)^{(1-\epsilon)/\epsilon}\right)}=O\left(\frac{\ln N}{\eta \epsilon}\right)\]

  2. \({\operatorname{d}_{\text{tv}}}\left(\frac{1}{N}\sum_{j=0}^{N-1}{\Xi\left[\mu,\sigma^j,{\mathcal{D}},\pi\right]},\frac{1}{N}\sum_{j=0}^{N-1}{\Xi\left[\mu,\sigma^j,{\mathcal{D}}^{!j},\pi\right]}\right) \leq (N-1)\eta\)

  3. For all \(x \in \operatorname{hdom}{\bar{\mu}[\sigma^k]}\), if \({\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \ne \bot\) then \(\sigma^k\left({\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \mid \underline{x}\right) > 0\)

  4. For all \(x \in \operatorname{hdom}{\bar{\mu}[\sigma^k]}\), if \({\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \not\in {\left\{\pi\left(\underline{x}\right), \bot\right\}}\) then \(\sigma^k\left(\pi\left(\underline{x}\right) \mid \underline{x}\right) \leq \epsilon\)


The following appeared in the previous essay as “Proposition A.1”.

Proposition C.2

Consider a probability space \((\Omega, P \in \Delta\Omega)\), \(N \in {\mathbb{N}}\), \(R \subseteq [0,1]\) a finite set and random variables \(U: \Omega \rightarrow R\), \(K: \Omega \rightarrow [N]\) and \({J}: \Omega \rightarrow [N]\). Assume that \(K_*P = J_*P = \zeta \in \Delta[N]\) and \({\underset{}{\operatorname{I}}}[K;J] = 0\). Then

\[{\underset{}{\operatorname{I}}}\left[K;J,U\right] \geq 2 \left(\min_{i \in [N]} {\zeta(i)}\right) \left({\underset{}{\operatorname{E}}}\left[U \mid J = K\right]-{\underset{}{\operatorname{E}}}\left[U\right]\right)^2\]



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

This is exactly the sort of
by Stuart Armstrong on Being legible to other agents by committing to usi... | 0 likes

When considering an embedder
by Jack Gallagher on Where does ADT Go Wrong? | 0 likes

The differences between this
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Looking "at the very
by Abram Demski on Policy Selection Solves Most Problems | 0 likes

Without reading closely, this
by Paul Christiano on Policy Selection Solves Most Problems | 1 like

>policy selection converges
by Stuart Armstrong on Policy Selection Solves Most Problems | 0 likes

Indeed there is some kind of
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

Very nice. I wonder whether
by Vadim Kosoy on Hyperreal Brouwer | 0 likes

Freezing the reward seems
by Vadim Kosoy on Resolving human inconsistency in a simple model | 0 likes

Unfortunately, it's not just
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

>We can solve the problem in
by Wei Dai on The Happy Dance Problem | 1 like

Maybe it's just my browser,
by Gordon Worley III on Catastrophe Mitigation Using DRL | 2 likes

At present, I think the main
by Abram Demski on Looking for Recommendations RE UDT vs. bounded com... | 0 likes

In the first round I'm
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

Fine with it being shared
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

RSS

Privacy & Terms