Intelligent Agent Foundations Forumsign up / log in
Catastrophe Mitigation Using DRL (Appendices)
discussion post by Vadim Kosoy 187 days ago | discuss

These are Appendices B and C for the essay Catastrophe Mitigation Using DRL. They appear in a separate post because of a length limit in the website.

Appendix B

Given \(p=(a,o)\in{\mathcal{A}}\times{\mathcal{O}}\), we denote \(p^{\mathcal{A}}:=a\), \(p^{\mathcal{O}}:=o\).

Proposition B.1

Consider a universe \(\upsilon=(\mu,r)\) which an \({\mathcal{O}}\)-realization of an MDP \(M\) with state function \(S\), a stationary policy \(\pi^*: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}\), an arbitrary \({\mathcal{I}}\)-policy \(\pi^0\) and some \(\gamma \in (0,1)\). Then,

\[{\operatorname{EU}}_{\upsilon}^{\pi^* S}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^0}(\gamma)=\sum_{n=0}^\infty {\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}_{M\pi^*}{\left(S{\left(x_{:n}\right)},\gamma\right)}-{\operatorname{Q}}_{M\pi^*}{\left(S{\left(x_{:n}\right)},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

Proof of Proposition B.1

For the sake of encumbering the notation less, we will omit the parameter \(\gamma\) in functions that depend on it. We will use \(S\) implicitly, i.e. given \(F\) a function on \({\mathcal{S}}_M\) and \(h \in \operatorname{hdom}{\mu}\), \(F(h):=F{\left(S(h)\right)}\). Finally, we will omit \(M\pi^*\), using the shorthand notations \({\operatorname{V}}:={\operatorname{V}}_{M\pi^*}\), \({\operatorname{Q}}:={\operatorname{Q}}_{M\pi^*}\).

For any \(x \in \operatorname{hdom}^\omega \mu\), it is easy to see that

\[{\operatorname{EU}}_{\upsilon}^{\pi^* S}={\operatorname{V}}{\left({\boldsymbol{\lambda}}\right)}=\sum_{n=0}^\infty \gamma^n {\left({\operatorname{V}}{\left(x_{:n}\right)}-\gamma{\operatorname{V}}{\left(x_{:n+1}\right)}\right)}\]

\[{\operatorname{U}}^{r}(x)=(1-\gamma)\sum_{n=0}^\infty \gamma^n r{\left(x_{:n}\right)}\]

\[{\operatorname{EU}}_{\upsilon}^{\pi^* S} - {\operatorname{U}}^{r}(x)=\sum_{n=0}^\infty \gamma^n {\left({\operatorname{V}}{\left(x_{:n}\right)}-(1-\gamma)r{\left(x_{:n}\right)}-\gamma{\operatorname{V}}{\left(x_{:n+1}\right)}\right)}\]

\[{\operatorname{EU}}_{\upsilon}^{\pi^* S} - {\operatorname{U}}^{r}(x)=\sum_{n=0}^\infty \gamma^n {\left({\operatorname{V}}{\left(x_{:n}\right)}-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}}\right)}+{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}}\right)}-(1-\gamma)r{\left(x_{:n}\right)}-\gamma{\operatorname{V}}{\left(x_{:n+1}\right)}\right)}\]

Taking expected value over \(x\), we get

\[{\operatorname{EU}}_{\upsilon}^{\pi^* S} - {\operatorname{EU}}_{\upsilon}^{\pi^0}=\sum_{n=0}^\infty \gamma^n {\left({\underset{\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}{\left(x_{:n}\right)}-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}}\right)}\right]}}+{\underset{\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}}\right)}-(1-\gamma)r{\left(x_{:n}\right)}-\gamma{\operatorname{V}}{\left(x_{:n+1}\right)}\right]}}\right)}\]

It is easy to see that the second term vanishes, yielding the desired result.

Proposition B.2

Consider some \(\tau\in(0,\infty)\), \(T\in{\mathbb{N}}^+\), a universe \(\upsilon=(\mu,r)\) that is an \({\mathcal{O}}\)-realization of \(M\) with state function \(S\), a stationary policy \(\pi^*: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}\) and an arbitrary \({\mathcal{I}}\)-policy \(\pi^0: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{\mathcal{A}}\). For any \(n \in {\mathbb{N}}\), let \(\pi^*_n\) be an \({\mathcal{I}}\)-policy s.t. for any \(h \in \operatorname{hdom}{\mu}\)

\[\pi^*_n(h):=\begin{cases} \pi^0(h) \text{ if } {\left\vert h \right\vert} < nT \\ \pi^*{\left(S(h)\right)} \text{ otherwise} \end{cases}\]

Assume that

  1. For any \(h \in \operatorname{hdom}{\mu}\) \[\operatorname{supp}{\pi^0(h)} \subseteq {\mathcal{A}}_{M\pi^*}^0{\left(S(h)\right)}\]

  2. For any \(s \in {\mathcal{S}}_M\) and \(\gamma\in(0,1)\) \[{\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M\pi^*}{\left(s,\gamma\right)}}{{\mathrm{d}}\gamma} \right\vert} \leq \tau\]

Then, for any \(\gamma\in(0,1)\),

\[{\operatorname{EU}}^{\pi^*S}_\upsilon(\gamma)-{\operatorname{EU}}^{\pi^0}_\upsilon(\gamma) \leq (1-\gamma)\sum_{n=0}^\infty \sum_{m=0}^{T-1} \gamma^{nT+m}\left({\underset{x\sim\mu\bowtie\pi^*_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]-{\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]\right) + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T}\]

Proof of Proposition B.2

For the sake of encumbering the notation less, we will use \(S\) implicitly, i.e. given \(F\) a function on \({\mathcal{S}}_M\) and \(h \in \operatorname{hdom}{\mu}\), \(F(h):=F{\left(S(h)\right)}\). Also, we will omit \(M\pi^*\), using the shorthand notations \({\operatorname{V}}:={\operatorname{V}}_{M\pi^*}\), \({\operatorname{Q}}:={\operatorname{Q}}_{M\pi^*}\).

By Proposition B.1, for any \(l \in {\mathbb{N}}\)

\[{\operatorname{EU}}_{\upsilon}^{\pi^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi_l^*}(\gamma) = \sum_{n=0}^\infty{\gamma^n {\underset{x\sim\mu\bowtie\pi_l^*}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

\(\pi^*_l\) coincides with \(\pi^*\) after \(lT\), therefore the corresponding expected values vanish.

\[{\operatorname{EU}}_{\upsilon}^{\pi^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi_l^*}(\gamma) = \sum_{n=0}^{lT-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

Subtracting the equalities for \(l+1\) and \(l\), we get

\[{\operatorname{EU}}_{\upsilon}^{\pi_{l}^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi_{l+1}^*}(\gamma) = \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

\[(1-\gamma)\sum_{n=0}^\infty {\gamma^n{\left({\underset{x \sim \mu\bowtie\pi^*_l}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}-{\underset{x \sim \mu\bowtie\pi^*_{l+1}}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}\right)}} = \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

\(\pi^*_l\) and \(\pi^*_{l+1}\) coincide until \(lT\), therefore

\[(1-\gamma)\sum_{n=lT}^\infty {\gamma^n{\left({\underset{x \sim \mu\bowtie\pi^*_l}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}-{\underset{x \sim \mu\bowtie\pi^*_{l+1}}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}\right)}} = \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}\Big(x_{:n},\gamma\Big)-{\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

Denote \(\rho^*_l:=\mu\bowtie\pi^*_l\), \(\rho^0:=\mu\bowtie\pi^0\). We also use the shorthand notations \(r_n:=r{\left(x_{:n}\right)}\), \({\operatorname{V}}_n(\gamma):={\operatorname{V}}{\left(x_{:n},\gamma\right)}\), \({\operatorname{Q}}_n(\gamma):={\operatorname{Q}}{\left(x_{:n},x_n^{\mathcal{A}},\gamma\right)}\). Both \(\pi^*_l\) and \(\pi^*_{l+1}\) coincide with \(\pi^*\) after \((l+1)T\), therefore

\[(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\left({\underset{\rho^*_l}{\operatorname{E}}{\left[r_n\right]}}-{\underset{\rho^0}{\operatorname{E}}{\left[r_n\right]}}\right)}} + \gamma^{(l+1)T}{\left({\underset{\rho^*_l}{\operatorname{E}}{\left[{\operatorname{V}}_{(l+1)T}(\gamma)\right]}}-{\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_{(l+1)T}(\gamma)\right]}}\right)}= \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n(\gamma)-{\operatorname{Q}}_n(\gamma)\right]}}}\]

Denote \({\operatorname{V}}'(s,\gamma):=\frac{{\mathrm{d}}{\operatorname{V}}(s,\gamma)}{{\mathrm{d}}\gamma}\). By the mean value theorem, for each \(s\in{\mathcal{S}}_M\) there is \(\gamma^*\in(\gamma,1)\) s.t.

\[{\operatorname{V}}(s,\gamma) = {\operatorname{V}}^0(s) - {\operatorname{V}}'(s,\gamma^*)\cdot(1-\gamma)\]

\[{\operatorname{V}}^0(s) - \tau(1-\gamma) \leq {\operatorname{V}}(s,\gamma) \leq {\operatorname{V}}^0(s) + \tau(1-\gamma)\]

It follows that

\[(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\underset{\rho^*_l - \rho^0}{\operatorname{E}}{\left[r_n\right]}}} + \gamma^{(l+1)T}{\left({\underset{\rho^*_l-\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}^0_{(l+1)T}\right]}}+ 2\tau(1-\gamma)\right)}\geq \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n(\gamma)-{\operatorname{Q}}_n(\gamma)\right]}}}\]

Here, an expected value w.r.t. the difference of two probability measures is understood to mean the corresponding difference of expected values.

It is easy to see that assumption i implies that \({\operatorname{V}}_n^0\) is a submartingale for \(\rho^0\) (whereas it is a martingale for \(\mu\bowtie\pi^*\)) and therefore

\[{\underset{\rho^*_l-\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}^0_{(l+1)T}\right]}} \leq 0\]

We get

\[(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\underset{\rho^*_l - \rho^0}{\operatorname{E}}{\left[r_n\right]}}} + 2\tau\gamma^{(l+1)T}(1-\gamma)\geq \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n(\gamma)-{\operatorname{Q}}_n(\gamma)\right]}}}\]

Summing over \(l\), we get

\[(1-\gamma) \sum_{l=0}^\infty \sum_{n=lT}^{(l+1)T-1} {\gamma^n{\underset{\rho^*_l - \rho^0}{\operatorname{E}}{\left[r_n\right]}}} + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T}\geq \sum_{n=0}^{\infty}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n(\gamma)-{\operatorname{Q}}_n(\gamma)\right]}}}\]

Applying Proposition B.1 to the right hand side

\[(1-\gamma) \sum_{l=0}^\infty \sum_{n=lT}^{(l+1)T-1} {\gamma^n{\underset{\rho^*_l - \rho^0}{\operatorname{E}}{\left[r_n\right]}}} + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T} \geq {\operatorname{EU}}_{\upsilon}^{\pi^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^0}(\gamma)\]

Proof of Lemma A.1

Fix \(\gamma \in (0,1)\), \(\eta\in\left(0,N^{-1}\right)\) and \(T \in {\mathbb{N}}^+\). Denote \(\nu^k:=\bar{\mu}^k\left[\sigma^k S^k\right]\). To avoid cumbersome notation, whenever \(M^k\) should appear a subscript, we will replace it by \(k\). Let \((\Omega,P \in \Delta\Omega)\) be a probability space. Let \(K: \Omega \rightarrow [N]\) be a random variable and the following be stochastic processes

\[{Z}_n,\tilde{{Z}}_n: \Omega \rightarrow \Delta[N]\]

\[{J}_n: \Omega \rightarrow [N]\]

\[\Psi_n: \Omega \rightarrow {\mathcal{A}}\]

\[A_n: \Omega \rightarrow {\bar{{\mathcal{A}}}}\]

\[\Theta_n: \Omega \rightarrow {\bar{{\mathcal{O}}}}\]

We also define \(A\Theta_{:n}: \Omega \rightarrow {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\) by

\[A\Theta_{:n}:= A_0\Theta_0A_1\Theta_1 \ldots A_{n-1}\Theta_{n-1}\]

(The following conditions on \(A\) and \(\Theta\) imply that the range of the above is indeed in \({{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\).) Let \({\mathcal{D}}\) and \({\mathcal{D}}^{!k}\) be as in Proposition C.1 (we assume w.l.o.g. that \(\epsilon < \frac{1}{{\left\vert {\mathcal{A}}\right\vert}}\)). We construct \(\Omega\), \(K\), \({Z}\), \(\tilde{{Z}}\), \({J}\), \(\Psi\), \(A\) and \(\Theta\) s.t \(K\) is uniformly distributed and for any \(k \in [N]\), \(l \in {\mathbb{N}}\), \(m \in [T]\) and \(o \in {\mathcal{O}}\), denoting \(n = lT+m\)

\[\tilde{{Z}}_0(k)\equiv\frac{1}{N}\]

\[{Z}_{n}(k) = \frac{\tilde{{Z}}_{n}(k)[[\tilde{{Z}}_{n}(k) \geq \eta]] }{\sum_{j = 0}^{N-1}\tilde{{Z}}_{n}(j)[[\tilde{{Z}}_{n}(j) \geq \eta]]}\]

\[\Pr\left[{J}_{l} = k \mid Z_{lT}\right] = {Z}_{lT}\left(k\right)\]

\[\Psi_{n} = \pi^{J_l}{\left(A\Theta_{:n}\right)}\]

\[\Pr\left[\Theta_{n} = o \mid A\Theta_{:n}\right] = \nu^K\left(o \mid A\Theta_{:n}\right)\]

\[A_n = {\mathcal{D}}\left(A\Theta_{:n}, \Psi_n\right)\]

\[\tilde{{Z}}_{n+1}(k)\sum_{j = 0}^{N-1} {Z}_n(j) [[A_n = {\mathcal{D}}^{!j}\left(A\Theta_{:n}, \Psi_n\right)]] \nu^j(\Theta_n \mid A\Theta_{:n}A_n)={Z}_{n}(k) [[A_n = {\mathcal{D}}^{!k}\left(A\Theta_{:n}, \Psi_n\right)]] \nu^k\left(\Theta_{n} \mid A\Theta_{:n}A_{n}\right)\]

Note that the last equation has the form of a Bayesian update which is allowed to be arbitrary when update is on “impossible” information.

We now construct the \({{\bar{{\mathcal{I}}}}}\)-policy \(\pi^*\) s.t. for any \(n \in {\mathbb{N}}\), \(h \in {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\) s.t. \(\Pr\left[A\Theta_{:n}=h\right] > 0\) and \(a \in {\bar{{\mathcal{A}}}}\)

\[\pi^*(a \mid h):=\Pr\left[A_n = a \mid A\Theta_{:n} = h\right]\]

That is, we perform Thompson sampling at time intervals of size \(T\), moderated by the delegation routine \({\mathcal{D}}\), and discard from our belief state hypotheses whose probability is below \(\eta\) and hypotheses sampling which resulted in recommending “unsafe” actions i.e. actions that \({\mathcal{D}}\) refused to perform.

In order to prove \(\pi^*\) has the desired property, we will define the stochastic processes \({Z}^!\), \(\tilde{{Z}}^!\), \({J}^!\), \(\Psi^!\), \(A^!\) and \(\Theta^!\), each process of the same type as its shriekless counterpart (thus \(\Omega\) is constructed to accommodate them). These processes are required to satisfy the following:

\[\tilde{{Z}}^!_0(k)\equiv\frac{1}{N}\]

\[{Z}_{n}^!(k) = \frac{\tilde{{Z}}^!_{n}(k)[[\tilde{{Z}}^!_{n}(k) \geq \eta]] }{\sum_{j = 0}^{N-1}\tilde{{Z}}^!_{n}(j)[[\tilde{{Z}}^!_{n}(j) \geq \eta]]}[[\tilde{{Z}}^!_{n}(K) \geq \eta]] + [[K = k]]\cdot [[\tilde{{Z}}^!_{n}(K) < \eta]]\]

\[\Pr\left[{J}^!_{l} = k \mid Z^!_{lT}\right] = {Z}^!_{lT}\left(k\right)\]

\[\Psi^!_{n} = \pi^{{J}^!_l}{\left(A\Theta^!_{:n}\right)}\]

\[\Pr\left[\Theta^!_{n} = o \mid A\Theta^!_{:n}\right] = \nu^K\left(o \mid A\Theta^!_{:n}\right)\]

\[A^!_n = {\mathcal{D}}^{!K}\left(A\Theta^!_{:n}, \Psi^!_n\right)\]

\[\tilde{{Z}}^!_{n+1}(k)=\frac{{Z}^!_{n}(k) [[A^!_n = {\mathcal{D}}^{!k}\left(A\Theta^!_{:n}, \Psi^!_n\right)]] \nu^k\left(\Theta^!_{n} \mid A\Theta^!_{:n}A^!_{n}\right)}{\sum_{j = 0}^{N-1} {Z}^!_n(j) [[A^!_n = {\mathcal{D}}^{!j}\left(A\Theta^!_{:n}, \Psi^!_n\right)]] \nu^j(\Theta^!_n \mid A\Theta^!_{:n}A^!_n)}\]

For any \(k \in [N]\), we construct the \({{\bar{{\mathcal{I}}}}}\)-policy \(\pi^{?k}\) s.t. for any \(n \in {\mathbb{N}}\), \(h \in {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\) s.t. \(\Pr\left[A\Theta^!_{:n}=h,\ K = k\right] > 0\) and \(a \in {\bar{{\mathcal{A}}}}\)

\[\pi^{?k}(a \mid h):=\Pr\left[A^!_n = a \mid A\Theta^!_{:n} = h,\ K = k\right]\]

Given any \({{\bar{{\mathcal{I}}}}}\)-policy \(\pi\) and \({\mathcal{I}}\)-policy \(\sigma\) we define \(\alpha_{\sigma\pi}: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\) by

\[\alpha_{\sigma\pi} (g \mid h) := [[h = \underline{g}]]C_h\prod_{n = 0}^{{\left\vert h \right\vert}-1} \sum_{a \in {\mathcal{A}}}\left([[g_n \in \bot a{\mathcal{O}}]] \pi\left(\bot \mid g_{:n}\right)\sigma\left(a \mid h_{:n}\right)+[[g_n \in a\bot{\mathcal{O}}]]\pi\left(a \mid g_{:n}\right)\right)\]

Here, \(C_h \in {\mathbb{R}}\) is a constant defined s.t. the probabilities sum to 1. We define the \({\mathcal{I}}\)-policy \(\left[\sigma\right]\underline{\pi}\) by

\[\left[\sigma\right]\underline{\pi}(a \mid h):=\Pr_{g \sim \alpha_{\sigma\pi}(h)}\left[\pi\left(g\right)=a \lor \left(\pi\left(g\right)=\bot \land \sigma(h)=a\right)\right]\]

Condition iii of Proposition C.1 and condition i of Definition A.1 imply that for any \(h \in \operatorname{hdom}{\mu^k}\)

\[\operatorname{supp}{\left[\sigma^k\right]\underline{\pi}^{?k}(h)} \subseteq {\mathcal{A}}^0_{M^k\pi^k}\left(S^k\left(h\right)\right)\]

This means we can apply Proposition B.2 and get

\[{\operatorname{EU}}^{\pi^k S^k}_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{?k}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}(\gamma) \leq (1-\gamma)\sum_{n=0}^\infty \sum_{m=0}^{T-1} \gamma^{nT+m}\left({\underset{x\sim\mu^k\bowtie\pi^{*k}_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]-{\underset{x\sim\nu^k\bowtie\pi^{?k}}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}\]

Here, the \({\mathcal{I}}\)-policy \(\pi^{*k}_n\) is defined as \(\pi^*_n\) in Proposition B.2. We also define the \({{\bar{{\mathcal{I}}}}}\)-policies \(\pi^{!k}_n\) and \(\pi^{!!k}_n\) by

\[\pi^{!k}_n(a \mid h):=\begin{cases} \pi^{?k}(a \mid h) \text{ if } {\left\vert h \right\vert} < nT \\ \Pr\left[A^!_{{\left\vert h \right\vert}} = a \mid A\Theta^!_{:{{\left\vert h \right\vert}}} = h,\ K = k,\ {J}^!_n = k\right] \text{ otherwise} \end{cases}\]

\[\pi^{!!k}_n(a \mid h):=\begin{cases} \pi^{?k}(a \mid h) \text{ if } {\left\vert h \right\vert} < nT \\ \pi^{!k}_n(a \mid h) + \pi^{!k}_n(\bot \mid h) \cdot \pi^{*k}_n\left(a \mid \underline{h}\right) \text{ if } {\left\vert h \right\vert} \geq nT \text{ and } a \ne \bot \\ 0 \text{ if } {\left\vert h \right\vert} \geq nT \text{ and } a = \bot \end{cases}\]

Denote

\[\rho^{*k}_n:=\mu^k\bowtie\pi^{*k}_n\]

\[\rho^{!!k}_n:=\nu^k\bowtie\pi^{!!k}_n\]

\[\rho^{!k}_n:=\nu^k\bowtie\pi^{!k}_n\]

\[\rho^{?k}:=\nu^k\bowtie\pi^{?k}\]

\[R^{?k}:={\operatorname{EU}}^{\pi^k S^k}_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{?k}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}(\gamma)\]

For each \(n \in {\mathbb{N}}\), denote

\[{\operatorname{EU}}_n^{*k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{*k}_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]\]

\[{\operatorname{EU}}_n^{!!k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{!!k}_n}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\]

\[{\operatorname{EU}}_n^{!k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{!k}_n}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\]

\[{\operatorname{EU}}_n^{?k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{?k}}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\]

We have

\[R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{*k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}\]

\[R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{*k}_n(\gamma)-{\operatorname{EU}}^{!!k}_n(\gamma)+{\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma)+{\operatorname{EU}}^{!k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}\]

Condition iv of Proposition C.1 and condition ii of Definition A.1 imply that, given \(h \in \operatorname{hdom}{\nu^k}\) s.t. \({\left\vert h \right\vert} \geq nT\)

\[\operatorname{supp}{\pi^{!k}_n(h)} \subseteq {\left\{\pi^k{\left(S^k(h)\right)},\bot\right\}}\]

\[\pi^{!!k}_n{\left(\pi^k{\left(S^k(h)\right)} \mid h\right)} = 1\]

Therefore, \(\pi^{!!k}_n = \pi^{*k}_n\), and we remain with

\[R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma)+{\operatorname{EU}}^{!k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + \frac{2\bar{\tau}\gamma^T(1-\gamma)}{1-\gamma^T}\]

We have

\[{\left\vert {\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma) \right\vert} \leq \Pr_{x\sim\rho^{!k}_n}\left[\exists m \in [T]: x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right]\]

Since \({Z}_{nT}^{!}(K) \geq \eta\), it follows that

\[{\left\vert {\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma) \right\vert} \leq \frac{1}{\eta}\Pr_{x\sim\rho^{?k}}\left[\exists m \in [T]: x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right] \leq \frac{1}{\eta}{\underset{x\sim\rho^{?k}}{\operatorname{E}}{\left[{\left\vert {\left\{m \in [T] \mid x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right\}} \right\vert}\right]}}\]

\[\sum_{n=0}^\infty {\left\vert {\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma) \right\vert} \leq \frac{1}{\eta}{\underset{x\sim\rho^{?k}}{\operatorname{E}}{\left[{\left\vert {\left\{n \in {\mathbb{N}}\mid x_n \in \bot{\bar{{\mathcal{O}}}}\right\}} \right\vert}\right]}}\]

Using condition i of Proposition C.1, we conclude

\[R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{!k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

Define the random variables \({{\{U^!_n : \Omega \rightarrow [0,1]\}_{n \in {\mathbb{N}}}}}\) by

\[U^!_n:=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m} r\left(A\Theta^!_{:nT+m}\right)\]

Averaging the previous inequality over \(k\), we get

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[{\underset{}{\operatorname{E}}}\left[U^!_n \mid {J}^!_n = K,\ Z^!_{nT}\right]-{\underset{}{\operatorname{E}}}\left[U^!_n \mid Z^!_{nT}\right]\right] + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{(1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[\left({\underset{}{\operatorname{E}}}\left[U^!_n \mid {J}^!_n = K,\ Z^!_{nT}\right]-{\underset{}{\operatorname{E}}}\left[U^!_n \mid Z^!_{nT}\right]\right)^2\right]} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

We apply Proposition C.2 to each term in the sum over \(n\).

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{(1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[\frac{1}{2\eta}{\underset{}{\operatorname{I}}}\left[K;{J}^!_n,U^!_n \mid Z^!_{nT}\right]\right]} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{\frac{1-\gamma^T}{2\eta}\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[{\operatorname{H}}\Big(Z^!_{nT}\Big)-{\operatorname{H}}\left(Z^!_{(n+1)T}\right)\right]} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{\frac{1-\gamma^T}{2\eta}\ln N} + O\left(\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = O\left(\sqrt{\frac{1-\gamma^T}{\eta}} +\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

Condition ii of Proposition C.1 implies that

\[{\operatorname{d}_{\text{tv}}}\left(\frac{1}{N}\sum_{k=0}^{N-1}{\bar{\nu}^k\left[\sigma^k\right]\bowtie\pi^*},\ \frac{1}{N}\sum_{k=0}^{N-1}{\bar{\nu}^k\left[\sigma^k\right]\bowtie\pi^{?k}}\right) \leq 2(N-1)\eta\]

Here, the factor of 2 comes from the difference between the equations for \(Z_n\) and \(Z^!_n\) (we can construct and intermediate policy between \(\pi^*\) and \(\pi^{?k}\) and use the triangle inequality for \({\operatorname{d}_{\text{tv}}}\)). We conclude

\[{\operatorname{EU}}^{\pi^k S^k}_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{*}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}(\gamma) = O\left(\eta+\sqrt{\frac{1-\gamma^T}{\eta}} +\frac{1-\gamma^T}{\eta^2}+\frac{\bar{\tau}(1-\gamma)}{1-\gamma^T}\right)\]

Now we set

\[\eta:=\bar{\tau}^{1/4} (1-\gamma)^{1/4}\]

\[T:={\left\lceil \bar{\tau}^{3/4}(1-\gamma)^{-1/4} \right\rceil}\]

Without loss of generality, we can assume that \(\bar{\tau}(1-\gamma) \ll 1\) (because of the form of the bound we are proving), which implies that \(T(1-\gamma) \ll 1\) and \(1-\gamma^T \approx T(1-\gamma) \approx \bar{\tau}^{3/4}(1-\gamma)^{3/4}\). We get

\[{\operatorname{EU}}^{\pi^k S^k}_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{*}}_{\bar{\upsilon}^k{\left[\sigma^k S^k\right]}}(\gamma) = O\left(\bar{\tau}^{1/4}(1-\gamma)^{1/4}\right)\]

Appendix C

The following is a simple special case of what appeared as “Proposition A.2” in the previous essay, where we restrict \(\pi\) to be single-valued (the more general case isn’t needed).

Proposition C.1

Fix an interface \({\mathcal{I}}=({\mathcal{A}},{\mathcal{O}})\), \(N \in {\mathbb{N}}\), \(\epsilon \in (0,\frac{1}{{\left\vert {\mathcal{A}}\right\vert}})\), \(\eta \in (0,\frac{1}{N})\). Consider some \(\{\sigma^k: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{\mathcal{A}}\}_{k \in [N]}\). Then, there exist \({\mathcal{D}}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times {\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}\) and \(\{{\mathcal{D}}^{!k}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times {\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}\}_{k \in [N]}\) with the following properties. Given \(x \in \left({\mathcal{A}}\times {\overline{{\mathcal{A}}\times {\mathcal{O}}}}\right)^*\), we denote \(\underline{x}\) its projection to \({{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\). Thus, \(\underline{\underline{x}}\in{({\mathcal{A}}\times {\mathcal{O}})^*}\). Given \(\mu\) an \({\mathcal{I}}\)-environment, \(\pi: \operatorname{hdom}{\mu} {\xrightarrow{\text{k}}}{\mathcal{A}}\), \({\mathcal{D}}': {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times {\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}\) and \(k \in [N]\), we can define \(\Xi\left[\mu,\sigma^k,{\mathcal{D}}',\pi\right]\in \Delta\left({\mathcal{A}}\times {\overline{{\mathcal{A}}\times {\mathcal{O}}}}\right)^\omega\) as follows

\[\Xi\left[\mu,\sigma^k,{\mathcal{D}}',\pi\right]\left(b,a,o \mid x\right):=\pi\left(b \mid \underline{\underline{x}}\right){\mathcal{D}}'\left(a \mid \underline{x},b\right) \bar{\mu}[\sigma^k]\left(o \mid \underline{x}a\right)\]

We require that for every \(\pi\), \(\mu\) and \(k\) as above, the following conditions hold

  1. \[\frac{1}{N}\sum_{j=0}^{N-1}{\underset{x \sim\Xi\left[\mu,\sigma^j,{\mathcal{D}}^{!j},\pi\right]}{\operatorname{E}}}\left[{\left\vert \{n \in {\mathbb{N}}\mid x_n \in {\mathcal{A}}\times \bot \times \bar{{\mathcal{O}}}\} \right\vert}\right] \leq \frac{\ln N}{\eta \ln\left(1 + \epsilon(1-\epsilon)^{(1-\epsilon)/\epsilon}\right)}=O\left(\frac{\ln N}{\eta \epsilon}\right)\]

  2. \({\operatorname{d}_{\text{tv}}}\left(\frac{1}{N}\sum_{j=0}^{N-1}{\Xi\left[\mu,\sigma^j,{\mathcal{D}},\pi\right]},\frac{1}{N}\sum_{j=0}^{N-1}{\Xi\left[\mu,\sigma^j,{\mathcal{D}}^{!j},\pi\right]}\right) \leq (N-1)\eta\)

  3. For all \(x \in \operatorname{hdom}{\bar{\mu}[\sigma^k]}\), if \({\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \ne \bot\) then \(\sigma^k\left({\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \mid \underline{x}\right) > 0\)

  4. For all \(x \in \operatorname{hdom}{\bar{\mu}[\sigma^k]}\), if \({\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \not\in {\left\{\pi\left(\underline{x}\right), \bot\right\}}\) then \(\sigma^k\left(\pi\left(\underline{x}\right) \mid \underline{x}\right) \leq \epsilon\)


The following appeared in the previous essay as “Proposition A.1”.

Proposition C.2

Consider a probability space \((\Omega, P \in \Delta\Omega)\), \(N \in {\mathbb{N}}\), \(R \subseteq [0,1]\) a finite set and random variables \(U: \Omega \rightarrow R\), \(K: \Omega \rightarrow [N]\) and \({J}: \Omega \rightarrow [N]\). Assume that \(K_*P = J_*P = \zeta \in \Delta[N]\) and \({\underset{}{\operatorname{I}}}[K;J] = 0\). Then

\[{\underset{}{\operatorname{I}}}\left[K;J,U\right] \geq 2 \left(\min_{i \in [N]} {\zeta(i)}\right) \left({\underset{}{\operatorname{E}}}\left[U \mid J = K\right]-{\underset{}{\operatorname{E}}}\left[U\right]\right)^2\]



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

Note: I currently think that
by Jessica Taylor on Predicting HCH using expert advice | 0 likes

Counterfactual mugging
by Jessica Taylor on Doubts about Updatelessness | 0 likes

What do you mean by "in full
by David Krueger on Doubts about Updatelessness | 0 likes

It seems relatively plausible
by Paul Christiano on Maximally efficient agents will probably have an a... | 1 like

I think that in that case,
by Alex Appel on Smoking Lesion Steelman | 1 like

Two minor comments. First,
by Sam Eisenstat on No Constant Distribution Can be a Logical Inductor | 1 like

A: While that is a really
by Alex Appel on Musings on Exploration | 0 likes

> The true reason to do
by Jessica Taylor on Musings on Exploration | 0 likes

A few comments. Traps are
by Vadim Kosoy on Musings on Exploration | 1 like

I'm not convinced exploration
by Abram Demski on Musings on Exploration | 0 likes

Update: This isn't really an
by Alex Appel on A Difficulty With Density-Zero Exploration | 0 likes

If you drop the
by Alex Appel on Distributed Cooperation | 1 like

Cool! I'm happy to see this
by Abram Demski on Distributed Cooperation | 0 likes

Caveat: The version of EDT
by 258 on In memoryless Cartesian environments, every UDT po... | 2 likes

[Delegative Reinforcement
by Vadim Kosoy on Stable Pointers to Value II: Environmental Goals | 1 like

RSS

Privacy & Terms