More precise regret bound for DRL
post by Vadim Kosoy 91 days ago | Alex Appel likes this | discuss

We derive a regret bound for DRL reflecting dependence on:

• Number of hypotheses

• Mixing time of MDP hypotheses

• The probability with which the advisor takes optimal actions

That is, the regret bound we get is fully explicit up to a multiplicative constant (which can also be made explicit). Currently we focus on plain (as opposed to catastrophe) and uniform (finite number of hypotheses, uniform prior) DRL, although this result can and should be extended to the catastrophe and/or non-uniform settings.

Appendix A contains the proofs, Appendix B recalls a few lemmas we need from previous essays.

## Results

First, we briefly recall some properties of Markov chains.

# Definition 1

Consider $${\mathcal{S}}$$ a finite set and $${\mathcal{T}}: {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{S}}$$. We say that $$k \in {\mathbb{N}}^+$$ is a period of $${\mathcal{T}}$$ when there is $$s \in {\mathcal{S}}$$ an essential state of $${\mathcal{T}}$$ (that is, $${\mathcal{T}}^\infty(s \mid s) > 0$$) s.t. $$k$$ is its period, i.e. $$k = \gcd {\left\{n \in {\mathbb{N}}^+ \mid {\mathcal{T}}^n(s \mid s) > 0\right\}}$$. We denote $${P}_{\mathcal{T}}$$ the set of periods of $${\mathcal{T}}$$.

The following is a corollary of the Perron-Frobenius theorem which we give without proof. [I believe this is completely standard and would be grateful to get a source for this which treats the reducible case; of course I can produce the proof but it seems redundant.]

# Proposition 1

Consider $${\mathcal{S}}$$ a finite set and $${\mathcal{T}}: {\mathcal{S}}{\xrightarrow{\text{k}}}{\mathcal{S}}$$. Then, there are $$F \in (0,\infty)$$, $$\lambda\in(0,1)$$, $$\zeta: {\mathcal{S}}{\xrightarrow{\text{k}}}{P}_{\mathcal{T}}$$ and $$\xi: {\mathcal{S}}\times {P}_{\mathcal{T}}{\xrightarrow{\text{k}}}{\mathcal{S}}$$ s.t. for any $$s \in {\mathcal{S}}$$

$\forall k \in {P}_{\mathcal{T}}: {\mathcal{T}}^k \xi(s,k) = \xi(s,k)$

$\forall n \in {\mathbb{N}}: {\operatorname{d}_{\text{tv}}{\left({\mathcal{T}}^n(s),{\underset{k \sim \zeta(s)}{\operatorname{E}}{\left[{\mathcal{T}}^n \xi(s,k)\right]}}\right)}} \leq F \lambda^n$

For the purpose of this essay, we will use a definition of local sanity slightly stronger than what previously appeared as “Definition 4.” We think this strengthening is not substantial, but also the current analysis can be generalized to the weaker case by adding a term proportional to the 2nd derivative of $${\operatorname{V}}$$ (or the 2nd moment of the mixing time). We leave out the details for the time being.

We will use the notation $${\mathcal{A}}_M^\omega(s):=\bigcap_{k\in{\mathbb{N}}} {\mathcal{A}}_M^k(s)$$.

# Definition 2

Let $$\upsilon = (\mu,r)$$ be a universe and $$\epsilon \in (0,1)$$. A policy $$\sigma$$ is said to be locally $$\epsilon$$-sane for $$\upsilon$$ when there are $$M$$, $$S$$ and $${\mathcal{U}}_M \subseteq {\mathcal{S}}_M$$ (the set of uncorrupt states) s.t. $$\upsilon$$ is an $${\mathcal{O}}$$-realization of $$M$$ with state function $$S$$, $$S({\boldsymbol{\lambda}}) \in {\mathcal{U}}_M$$ and for any $$h \in \operatorname{hdom}{\mu}$$, if $$S(h) \in {\mathcal{U}}_M$$ then

1. $\forall a \in \operatorname{supp}{\sigma(h)}: {\mathcal{T}}_M{\left({\mathcal{U}}_M \mid S(h),a\right)} = 1$

2. $\operatorname{supp}{\sigma(h)} \subseteq {\mathcal{A}}_M^0\left(S(h)\right)$

3. $\exists a \in {\mathcal{A}}_M^\omega{\left(S(h)\right)}: \sigma(a \mid h) > \epsilon$

Recall that for any MPD $$M$$, there is $$\gamma_M\in(0,1)$$ s.t. for any $$\gamma\in[\gamma_M,1)$$, $$a \in {\mathcal{A}}_M^\omega(s)$$ if and only if $${\operatorname{Q}}_M(s,a,\gamma)={\operatorname{V}}_M(s,\gamma)$$.

We are now ready to state the regret bound.

# Theorem 1

There is some $$C \in (0,\infty)$$ s.t. the following holds.

Consider $${\mathcal{I}}$$ an interface, $$\alpha,\epsilon \in (0,1)$$, $${\mathcal{H}}= \{\upsilon^k = (\mu^k,r^k) \in \Upsilon_{{\mathcal{I}}}\}_{k \in [N]}$$ for some $$N \in {\mathbb{N}}$$. For each $$k \in [N]$$, let $$\sigma^k$$ be a locally $$\epsilon$$-sane policy for $$\upsilon^k$$. For each $$k \in [N]$$, let $$M^k$$ be the corresponding MDP and $${\mathcal{U}}^k \subseteq {\mathcal{S}}_{M^k}$$ the corresponding set of uncorrupt states. Assume that

1. For each $$k \in [N]$$, $$\alpha \leq 1 - \gamma_{M^k}$$.

2. $$\alpha \leq \frac{\epsilon}{N^2 \ln{N}}$$

3. For any $$k,j \in [N]$$ and $$h \in \operatorname{hdom}{\mu^k} \cap \operatorname{hdom}{\mu^j}$$, if $$S^k(h) \in {\mathcal{U}}^k$$ and $$S^j(h) \in {\mathcal{U}}^j$$, then $$r^k(h)=r^j(h)$$.

Then, there is an $$\bar{{\mathcal{I}}}$$-policy $$\pi^*$$ s.t. for any $$k \in [N]$$

${\operatorname{EU}}_{\upsilon^k}^*(1-\alpha) - {\operatorname{EU}}_{\bar{\upsilon}^k\left[\sigma^k\right]}^{\pi^*}(1-\alpha) \leq C{\left(\alpha N^5 \ln{N} {\left(\epsilon^{-1}+{\left\vert {\mathcal{A}}\right\vert}\right)} \sum_{j = 0}^{N-1} {\left(\max_{s \in {\mathcal{U}}^j} \sup_{\gamma \in [1-\alpha,1)} {\left\vert \frac{{\mathrm{d}}{{\operatorname{V}}_{M^j}(s,\gamma)}}{{\mathrm{d}}\gamma} \right\vert}+1\right)}\right)}^{1/4}$

The factor $$\max_{s \in {\mathcal{U}}^k} \sup_{\gamma \in [1-\alpha,1]} {\left\vert \frac{{\mathrm{d}}{{\operatorname{V}}_{M^j}(s,\gamma)}}{{\mathrm{d}}\gamma} \right\vert}$$ might seem difficult to understand. However, it can be bounded as follows.

# Proposition 2

Let $$M$$ be an MDP, $$\pi$$ a Blackwell optimal policy for $$M$$ and $$F \in (0,\infty)$$, $$\lambda \in (0,1)$$, $${P}\subseteq {\mathbb{N}}^+$$ as in Proposition 1 applied to the Markov chain $${\mathcal{T}}_{M\pi}$$. Then

$\max_{s \in {\mathcal{S}}_M} \sup_{\gamma \in [\gamma_M,1)} {\left\vert \frac{{\mathrm{d}}{{\operatorname{V}}_M(s,\gamma)}}{{\mathrm{d}}\gamma} \right\vert} \leq F\frac{1+\lambda}{1-\lambda}+\max{{P}}$

Theorem 1 and Proposition 2 immediately give the following:

# Corollary 1

There is some $$C' \in (0,\infty)$$ s.t. the following holds.

Under the conditions of Theorem 1, let $$\pi^k: {\mathcal{S}}_{M^k} \rightarrow {\mathcal{A}}$$ be s.t. for any $$s \in {\mathcal{U}}^k$$

1. $\pi^k(s) \in {\mathcal{A}}_{M^k}^\omega(s)$

2. ${\mathcal{T}}_{M^k \pi^k}{\left({\mathcal{U}}^k \mid s\right)} = 1$

Assuming w.l.o.g. that all uncorrupt states are reachable from $$S^k({\boldsymbol{\lambda}})$$, $$\pi^k$$ is guaranteed to exist thanks to condition iii of Definition 2 (if some uncorrupt state is unreachable, we can consider it to be corrupt.) Let $$F_k\in(0,\infty)$$, $$\lambda_k\in(0,1)$$ and $${P}_k \subseteq {\mathbb{N}}^+$$ be as in Proposition 1, for the Markov chain $${\mathcal{T}}_{M^k\pi^k}: {\mathcal{U}}^k {\xrightarrow{\text{k}}}{\mathcal{U}}^k$$. Then, there is an $$\bar{{\mathcal{I}}}$$-policy $$\pi^*$$ s.t. for any $$k \in [N]$$

${\operatorname{EU}}_{\upsilon^k}^*(1-\alpha) - {\operatorname{EU}}_{\bar{\upsilon}^k\left[\sigma^k\right]}^{\pi^*}(1-\alpha) \leq C'{\left(\alpha N^5 \ln{N} {\left(\epsilon^{-1}+{\left\vert {\mathcal{A}}\right\vert}\right)} \sum_{j = 0}^{N-1} {\left(\frac{F_j}{1-\lambda_j}+\max{{P}_j}\right)}\right)}^{1/4}$

## Appendix A

The proof of Theorem 1 mostly repeats the proof of the previous “Theorem 1”, except that we keep track of the bounds more carefully.

# Proof of Theorem 1

Fix $$\eta\in\left(0,N^{-1}\right)$$ and $$T \in {\mathbb{N}}^+$$. Denote $$\nu^k:=\bar{\mu}^k\left[\sigma^k S^k\right]$$. To avoid cumbersome notation, whenever $$M^k$$ should appear a subscript, we will replace it by $$k$$. Let $$(\Omega,P \in \Delta\Omega)$$ be a probability space. Let $$K: \Omega \rightarrow [N]$$ be a random variable and the following be stochastic processes

${Z}_n,\tilde{{Z}}_n: \Omega \rightarrow \Delta[N]$

${J}_n: \Omega \rightarrow [N]$

$\Psi_n: \Omega \rightarrow {\mathcal{A}}$

$A_n: \Omega \rightarrow {\bar{{\mathcal{A}}}}$

$\Theta_n: \Omega \rightarrow {\bar{{\mathcal{O}}}}$

We also define $$A\Theta_{:n}: \Omega \rightarrow {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}$$ by

$A\Theta_{:n}:= A_0\Theta_0A_1\Theta_1 \ldots A_{n-1}\Theta_{n-1}$

(The following conditions on $$A$$ and $$\Theta$$ imply that the range of the above is indeed in $${{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}$$.) Let $${\mathcal{D}}$$ and $${\mathcal{D}}^{!k}$$ be as in Proposition B.1 (the form of the bound we are proving allows assuming w.l.o.g. that $$\epsilon < \frac{1}{{\left\vert {\mathcal{A}}\right\vert}}$$). By condition iii of Definition 2, there is $$\pi^k : \operatorname{hdom}{\mu^k} \rightarrow {\mathcal{A}}$$ s.t. for any $$h \in \operatorname{hdom}{\mu^k}$$, if $$S^k(h) \in {\mathcal{U}}^k$$ then

$\sigma^k{\left(\pi^k(h) \mid h\right)} > \epsilon$

$\pi^k(h) \in {\mathcal{A}}_k^\omega{\left(S^k(h)\right)}$

We construct $$\Omega$$, $$K$$, $${Z}$$, $$\tilde{{Z}}$$, $${J}$$, $$\Psi$$, $$A$$ and $$\Theta$$ s.t $$K$$ is uniformly distributed and for any $$k \in [N]$$, $$l \in {\mathbb{N}}$$, $$m \in [T]$$ and $$o \in {\mathcal{O}}$$, denoting $$n = lT+m$$

$\tilde{{Z}}_0(k)\equiv\frac{1}{N}$

${Z}_{n}(k) = \frac{\tilde{{Z}}_{n}(k)[[\tilde{{Z}}_{n}(k) \geq \eta]] }{\sum_{j = 0}^{N-1}\tilde{{Z}}_{n}(j)[[\tilde{{Z}}_{n}(j) \geq \eta]]}$

$\Pr\left[{J}_{l} = k \mid Z_{lT}\right] = {Z}_{lT}\left(k\right)$

$\Psi_{n} = \pi^{J_l}{\left(A\Theta_{:n}\right)}$

$\Pr\left[\Theta_{n} = o \mid A\Theta_{:n}\right] = \nu^K\left(o \mid A\Theta_{:n}\right)$

$A_n = {\mathcal{D}}\left(A\Theta_{:n}, \Psi_n\right)$

$\tilde{{Z}}_{n+1}(k)\sum_{j = 0}^{N-1} {Z}_n(j) [[A_n = {\mathcal{D}}^{!j}\left(A\Theta_{:n}, \Psi_n\right)]] \nu^j(\Theta_n \mid A\Theta_{:n}A_n)={Z}_{n}(k) [[A_n = {\mathcal{D}}^{!k}\left(A\Theta_{:n}, \Psi_n\right)]] \nu^k\left(\Theta_{n} \mid A\Theta_{:n}A_{n}\right)$

Note that the last equation has the form of a Bayesian update which is allowed to be arbitrary when update is on “impossible” information.

We now construct the $${{\bar{{\mathcal{I}}}}}$$-policy $$\pi^*$$ s.t. for any $$n \in {\mathbb{N}}$$, $$h \in {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}$$ s.t. $$\Pr\left[A\Theta_{:n}=h\right] > 0$$ and $$a \in {\bar{{\mathcal{A}}}}$$

$\pi^*(a \mid h):=\Pr\left[A_n = a \mid A\Theta_{:n} = h\right]$

That is, we perform Thompson sampling at time intervals of size $$T$$, moderated by the delegation routine $${\mathcal{D}}$$, and discard from our belief state hypotheses whose probability is below $$\eta$$ and hypotheses sampling which resulted in recommending “unsafe” actions i.e. actions that $${\mathcal{D}}$$ refused to perform.

In order to prove $$\pi^*$$ has the desired property, we will define the stochastic processes $${Z}^!$$, $$\tilde{{Z}}^!$$, $${J}^!$$, $$\Psi^!$$, $$A^!$$ and $$\Theta^!$$, each process of the same type as its shriekless counterpart (thus $$\Omega$$ is constructed to accommodate them). These processes are required to satisfy the following:

$\tilde{{Z}}^!_0(k)\equiv\frac{1}{N}$

${Z}_{n}^!(k) = \frac{\tilde{{Z}}^!_{n}(k)[[\tilde{{Z}}^!_{n}(k) \geq \eta]] }{\sum_{j = 0}^{N-1}\tilde{{Z}}^!_{n}(j)[[\tilde{{Z}}^!_{n}(j) \geq \eta]]}[[\tilde{{Z}}^!_{n}(K) \geq \eta]] + [[K = k]]\cdot [[\tilde{{Z}}^!_{n}(K) < \eta]]$

$\Pr\left[{J}^!_{l} = k \mid Z^!_{lT}\right] = {Z}^!_{lT}\left(k\right)$

$\Psi^!_{n} = \pi^{{J}^!_l}{\left(A\Theta^!_{:n}\right)}$

$\Pr\left[\Theta^!_{n} = o \mid A\Theta^!_{:n}\right] = \nu^K\left(o \mid A\Theta^!_{:n}\right)$

$A^!_n = {\mathcal{D}}^{!K}\left(A\Theta^!_{:n}, \Psi^!_n\right)$

$\tilde{{Z}}^!_{n+1}(k)=\frac{{Z}^!_{n}(k) [[A^!_n = {\mathcal{D}}^{!k}\left(A\Theta^!_{:n}, \Psi^!_n\right)]] \nu^k\left(\Theta^!_{n} \mid A\Theta^!_{:n}A^!_{n}\right)}{\sum_{j = 0}^{N-1} {Z}^!_n(j) [[A^!_n = {\mathcal{D}}^{!j}\left(A\Theta^!_{:n}, \Psi^!_n\right)]] \nu^j(\Theta^!_n \mid A\Theta^!_{:n}A^!_n)}$

For any $$k \in [N]$$, we construct the $${{\bar{{\mathcal{I}}}}}$$-policy $$\pi^{?k}$$ s.t. for any $$n \in {\mathbb{N}}$$, $$h \in {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}$$ s.t. $$\Pr\left[A\Theta^!_{:n}=h,\ K = k\right] > 0$$ and $$a \in {\bar{{\mathcal{A}}}}$$

$\pi^{?k}(a \mid h):=\Pr\left[A^!_n = a \mid A\Theta^!_{:n} = h,\ K = k\right]$

Given any $${{\bar{{\mathcal{I}}}}}$$-policy $$\pi$$ and $${\mathcal{I}}$$-policy $$\sigma$$ we define $$\alpha_{\sigma\pi}: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}$$ by

$\alpha_{\sigma\pi} (g \mid h) := [[h = \underline{g}]]C_h\prod_{n = 0}^{{\left\vert h \right\vert}-1} \sum_{a \in {\mathcal{A}}}\left([[g_n \in \bot a{\mathcal{O}}]] \pi\left(\bot \mid g_{:n}\right)\sigma\left(a \mid h_{:n}\right)+[[g_n \in a\bot{\mathcal{O}}]]\pi\left(a \mid g_{:n}\right)\right)$

Here, $$C_h \in {\mathbb{R}}$$ is a constant defined s.t. the probabilities sum to 1. We define the $${\mathcal{I}}$$-policy $$\left[\sigma\right]\underline{\pi}$$ by

$\left[\sigma\right]\underline{\pi}(a \mid h):=\Pr_{g \sim \alpha_{\sigma\pi}(h)}\left[\pi\left(g\right)=a \lor \left(\pi\left(g\right)=\bot \land \sigma(h)=a\right)\right]$

Define $$\tau^k \in (0,1)$$ by

$\tau^k := \max_{s \in {\mathcal{U}}^k} \sup_{\gamma \in [1 - \alpha, 1)} {\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_k(s,\gamma)}{{\mathrm{d}}\gamma} \right\vert}+1$

Condition iii of Proposition B.1 and condition ii of Definition 2 imply that for any $$h \in \operatorname{hdom}{\mu^k}$$

$\operatorname{supp}{{\left[\sigma^k\right]}\underline{\pi}^{?k}(h)} \subseteq {\mathcal{A}}^0_k{\left(S^k(h)\right)}$

This means we can apply Proposition B.2 and get

${\operatorname{EU}}^*_{\upsilon^k}(1 - \alpha)-{\operatorname{EU}}^{\pi^{?k}}_{\bar{\upsilon}^k{\left[\sigma^k\right]}}(1 - \alpha) \leq \alpha\sum_{n=0}^\infty \sum_{m=0}^{T-1} (1-\alpha)^{nT+m}\left({\underset{x\sim\mu^k\bowtie\pi^{*k}_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]-{\underset{x\sim\nu^k\bowtie\pi^{?k}}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\right) + \frac{2\tau^k\alpha}{1-(1-\alpha)^T}$

Here, the $${\mathcal{I}}$$-policy $$\pi^{*k}_n$$ is defined as $$\pi^*_n$$ in Proposition B.2, with $$\pi^k$$ in place of $$\pi^*$$. We also define the $${{\bar{{\mathcal{I}}}}}$$-policies $$\pi^{!k}_n$$ and $$\pi^{!!k}_n$$ by

$\pi^{!k}_n(a \mid h):=\begin{cases} \pi^{?k}(a \mid h) \text{ if } {\left\vert h \right\vert} < nT \\ \Pr\left[A^!_{{\left\vert h \right\vert}} = a \mid A\Theta^!_{:{{\left\vert h \right\vert}}} = h,\ K = k,\ {J}^!_n = k\right] \text{ otherwise} \end{cases}$

$\pi^{!!k}_n(a \mid h):=\begin{cases} \pi^{?k}(a \mid h) \text{ if } {\left\vert h \right\vert} < nT \\ \pi^{!k}_n(a \mid h) + \pi^{!k}_n(\bot \mid h) \cdot \pi^k_n\left(a \mid \underline{h}\right) \text{ if } {\left\vert h \right\vert} \geq nT \text{ and } a \ne \bot \\ 0 \text{ if } {\left\vert h \right\vert} \geq nT \text{ and } a = \bot \end{cases}$

Denote

$\rho^{*k}_n:=\mu^k\bowtie\pi^{*k}_n$

$\rho^{!!k}_n:=\nu^k\bowtie\pi^{!!k}_n$

$\rho^{!k}_n:=\nu^k\bowtie\pi^{!k}_n$

$\rho^{?k}:=\nu^k\bowtie\pi^{?k}$

$R^{?k}:={\operatorname{EU}}^{*}_{\upsilon^k}(1-\alpha)-{\operatorname{EU}}^{\pi^{?k}}_{\bar{\upsilon}^k{\left[\sigma^k\right]}}(1-\alpha)$

For each $$n \in {\mathbb{N}}$$, denote

${\operatorname{EU}}_n^{*k}:=\frac{\alpha}{1-(1-\alpha)^T}\sum_{m=0}^{T-1} (1-\alpha)^{m}{\underset{x\sim\rho^{*k}_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]$

${\operatorname{EU}}_n^{!!k}:=\frac{\alpha}{1-(1-\alpha)^T}\sum_{m=0}^{T-1} (1-\alpha)^{m}{\underset{x\sim\rho^{!!k}_n}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]$

${\operatorname{EU}}_n^{!k}:=\frac{\alpha}{1-(1-\alpha)^T}\sum_{m=0}^{T-1} (1-\alpha)^{m}{\underset{x\sim\rho^{!k}_n}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]$

${\operatorname{EU}}_n^{?k}:=\frac{\alpha}{1-(1-\alpha)^T}\sum_{m=0}^{T-1} (1-\alpha)^{m}{\underset{x\sim\rho^{?k}}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]$

We have

$R^{?k} \leq {\left(1-(1-\alpha)^T\right)}\sum_{n=0}^\infty (1-\alpha)^{nT} \left({\operatorname{EU}}^{*k}_n-{\operatorname{EU}}^{?k}_n\right) + \frac{2\tau^k\alpha}{1-(1-\alpha)^T}$

$R^{?k} \leq{\left(1-(1-\alpha)^T\right)}\sum_{n=0}^\infty (1-\alpha)^{nT} \left({\operatorname{EU}}^{*k}_n-{\operatorname{EU}}^{!!k}_n+{\operatorname{EU}}^{!!k}_n-{\operatorname{EU}}^{!k}_n+{\operatorname{EU}}^{!k}_n-{\operatorname{EU}}^{?k}_n\right) + \frac{2\tau^k\alpha}{1-(1-\alpha)^T}$

Condition iv of Proposition B.1 implies that, given $$h \in \operatorname{hdom}{\nu^k}$$ s.t. $${\left\vert h \right\vert} \geq nT$$

$\operatorname{supp}{\pi^{!k}_n(h)} \subseteq {\left\{\pi^k{\left(S^k(h)\right)},\bot\right\}}$

$\pi^{!!k}_n{\left(\pi^k{\left(S^k(h)\right)} \mid h\right)} = 1$

Therefore, $$\pi^{!!k}_n = \pi^{*k}_n$$, and we remain with

$R^{?k} \leq{\left(1-(1-\alpha)^T\right)}\sum_{n=0}^\infty (1-\alpha)^{nT} \left({\operatorname{EU}}^{!!k}_n-{\operatorname{EU}}^{!k}_n+{\operatorname{EU}}^{!k}_n-{\operatorname{EU}}^{?k}_n\right) + \frac{2\tau^k\alpha}{1-(1-\alpha)^T}$

We have

${\left\vert {\operatorname{EU}}^{!!k}_n-{\operatorname{EU}}^{!k}_n \right\vert} \leq \Pr_{x\sim\rho^{!k}_n}\left[\exists m \in [T]: x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right]$

Since $${Z}_{nT}^{!}(K) \geq \eta$$, it follows that

${\left\vert {\operatorname{EU}}^{!!k}_n-{\operatorname{EU}}^{!k}_n \right\vert} \leq \frac{1}{\eta}\Pr_{x\sim\rho^{?k}}\left[\exists m \in [T]: x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right] \leq \frac{1}{\eta}{\underset{x\sim\rho^{?k}}{\operatorname{E}}{\left[{\left\vert {\left\{m \in [T] \mid x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right\}} \right\vert}\right]}}$

$\sum_{n=0}^\infty {\left\vert {\operatorname{EU}}^{!!k}_n-{\operatorname{EU}}^{!k}_n \right\vert} \leq \frac{1}{\eta}{\underset{x\sim\rho^{?k}}{\operatorname{E}}{\left[{\left\vert {\left\{n \in {\mathbb{N}}\mid x_n \in \bot{\bar{{\mathcal{O}}}}\right\}} \right\vert}\right]}}$

Using condition i of Proposition B.1, we conclude

$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} \leq\frac{1-(1-\alpha)^T}{N}\sum_{k=0}^{N-1}\sum_{n=0}^\infty (1-\alpha)^{nT} \left({\operatorname{EU}}^{!k}_n-{\operatorname{EU}}^{?k}_n\right) + O{\left(\frac{{\left(1-(1-\alpha)^T\right)}\ln N}{\eta^2\epsilon}+\frac{\tau^k\alpha}{1-(1-\alpha)^T}\right)}$

Define the random variables $${{\{U^!_n : \Omega \rightarrow [0,1]\}_{n \in {\mathbb{N}}}}}$$ by

$U^!_n:=\frac{\alpha}{1-(1-\alpha)^T}\sum_{m=0}^{T-1} (1-\alpha)^{m} r\left(A\Theta^!_{:nT+m}\right)$

Denote

$\bar{\tau} := \frac{1}{N}\sum_{k=0}^{N-1} {\tau^k}$

$\beta:=\frac{{\left(1-(1-\alpha)^T\right)}\ln N}{\eta^2\epsilon}+\frac{\bar{\tau}\alpha}{1-(1-\alpha)^T}$

We get

$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} \leq {\left(1-(1-\alpha)^T\right)}\sum_{n=0}^\infty (1-\alpha)^{nT} {\underset{}{\operatorname{E}}}\left[{\underset{}{\operatorname{E}}}\left[U^!_n \mid {J}^!_n = K,\ K,\ Z^!_{nT}\right]-{\underset{}{\operatorname{E}}}\left[U^!_n \mid Z^!_{nT}\right]\right] + O{\left(\beta\right)}$

$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{{\left(1-(1-\alpha)^T\right)}\sum_{n=0}^\infty (1-\alpha)^{nT} {\underset{}{\operatorname{E}}}\left[\left({\underset{}{\operatorname{E}}}\left[U^!_n \mid {J}^!_n = K,\ K,\ Z^!_{nT}\right]-{\underset{}{\operatorname{E}}}\left[U^!_n \mid Z^!_{nT}\right]\right)^2\right]} + O{\left(\beta\right)}$

We apply Proposition B.3 to each term in the sum over $$n$$.

$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{{\left(1-(1-\alpha)^T\right)}\sum_{n=0}^\infty (1-\alpha)^{nT} {\underset{}{\operatorname{E}}}\left[\frac{1}{2\eta}{\underset{}{\operatorname{I}}}\left[K;{J}^!_n,U^!_n \mid Z^!_{nT}\right]\right]} + O{\left(\beta\right)}$

$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{\frac{1-(1-\alpha)^T}{2\eta}\sum_{n=0}^\infty {\underset{}{\operatorname{E}}}\left[{\operatorname{H}}\Big(Z^!_{nT}\Big)-{\operatorname{H}}\left(Z^!_{(n+1)T}\right)\right]} + O{\left(\beta\right)}$

$\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = O\left(\sqrt{\frac{1-(1-\alpha)^T}{\eta}\ln N} +\frac{{\left(1-(1-\alpha)^T\right)}\ln N}{\eta^2\epsilon}+\frac{\bar{\tau}\alpha}{1-(1-\alpha)^T}\right)$

Condition ii of Proposition B.1 implies that

${\operatorname{d}_{\text{tv}}}\left(\frac{1}{N}\sum_{k=0}^{N-1}{\bar{\nu}^k\left[\sigma^k\right]\bowtie\pi^*},\ \frac{1}{N}\sum_{k=0}^{N-1}{\bar{\nu}^k\left[\sigma^k\right]\bowtie\pi^{?k}}\right) \leq 2(N-1)\eta$

Here, the factor of 2 comes from the difference between the equations for $$Z_n$$ and $$Z^!_n$$ (we can construct and intermediate policy between $$\pi^*$$ and $$\pi^{?k}$$ and use the triangle inequality for $${\operatorname{d}_{\text{tv}}}$$). We conclude

$\frac{1}{N}\sum_{k=0}^{N-1}{\left({\operatorname{EU}}^{*}_{\upsilon^k}(1-\alpha)-{\operatorname{EU}}^{\pi^{*}}_{\bar{\upsilon}^k{\left[\sigma^k\right]}}(1-\alpha)\right)} = O\left(N\eta +\sqrt{\frac{1-(1-\alpha)^T}{\eta}\ln N} +\frac{{\left(1-(1-\alpha)^T\right)}\ln N}{\eta^2\epsilon}+\frac{\bar{\tau}\alpha}{1-(1-\alpha)^T}\right)$

For each $$k \in [N]$$, we get

${\operatorname{EU}}^{*}_{\upsilon^k}(1-\alpha)-{\operatorname{EU}}^{\pi^{*}}_{\bar{\upsilon}^k{\left[\sigma^k\right]}}(1-\alpha) = N \cdot O\left(N\eta +\sqrt{\frac{1-(1-\alpha)^T}{\eta}\ln N} +\frac{{\left(1-(1-\alpha)^T\right)}\ln N}{\eta^2\epsilon}+\frac{\bar{\tau}\alpha}{1-(1-\alpha)^T}\right)$

Now we set

$\eta:=\alpha^{1/4} N^{-1/2} {\left(\ln N\right)}^{1/4} \epsilon^{-1/4} \bar{\tau}^{1/4}$

$T:={\left\lceil \alpha^{-1/4}N^{-1/2} {\left(\ln N\right)}^{-1/4} \epsilon^{1/4} \bar{\tau}^{3/4} \right\rceil}$

Without loss of generality, we can assume this to be consistent with the assumption $$\eta < \frac{1}{N}$$ since the bound contains a term of $$N^2\eta$$ anyway. Similarly, the second term in the bound implies we can assume that $$1 - (1-\alpha)^T \ll 1$$ and therefore $$1 - (1-\alpha)^T = O(T\alpha)$$. Finally, note that assumption ii implies that the expression we round to get $$T$$ is $$\geq 1$$. We get

${\operatorname{EU}}^{*}_{\upsilon^k}(1-\alpha)-{\operatorname{EU}}^{\pi^{*}}_{\bar{\upsilon}^k{\left[\sigma^k\right]}}(1-\alpha) = O{\left({\left(\alpha N^6 \ln{N} \epsilon^{-1} \bar{\tau}\right)}^{1/4}\right)}$

# Proposition A.1

Fix $$n \in {\mathbb{N}}$$, $$m \in [n]$$. Then, for $$x \in (0,1)$$ we have

${\left\vert \frac{{\mathrm{d}}}{{\mathrm{d}}x}{\left(x^m\frac{1-x}{1-x^n}\right)} \right\vert} \leq 1$

# Proof of Proposition A.1

$\frac{{\mathrm{d}}}{{\mathrm{d}}x}{\left(x^m\frac{1-x}{1-x^n}\right)} = mx^{m-1}\frac{1-x}{1-x^n}+x^m\frac{{\mathrm{d}}}{{\mathrm{d}}x}{\left(\frac{1-x}{1-x^n}\right)}$

We will establish the claim by showing that the first term is in $$[0,1]$$ and the second term is in $$[-1,0]$$.

For the first term, we have (assuming w.l.o.g. that $$m \geq 1$$)

$0 < mx^{m-1}\frac{1-x}{1-x^n} = \frac{mx^{m-1}}{\sum_{k=0}^{n-1} x^k} \leq \frac{mx^{m-1}}{\sum_{k=0}^{m-1} x^k} \leq \frac{mx^{m-1}}{\sum_{k=0}^{m-1} x^{m-1}}=1$

For the second term, we have

$\frac{{\mathrm{d}}}{{\mathrm{d}}x}{\left(\frac{1-x}{1-x^n}\right)} = \frac{(-1)\cdot{\left(1-x^n\right)}-(1-x)\cdot{\left(-nx^{n-1}\right)}}{{\left(1-x^n\right)}^2}=\frac{-(n-1)x^n+nx^{n-1}-1}{{\left(1-x^n\right)}^2}$

Assume w.l.o.g. that $$n \geq 2$$. First, we show that the numerator (and hence the entire fraction) is negative. Denote $$p(x):=-(n-1)x^n+nx^{n-1}-1$$. We have

$\frac{{\mathrm{d}}{p(x)}}{{\mathrm{d}}x}=n(n-1){\left(x^{n-2}-x^{n-1}\right)}$

Therefore, $$p(x)$$ is increasing in $$[0,1]$$. Observing that $$p(1)=0$$, we conclude that $$p$$ is negative in $$(0,1)$$.

Now we show that the fraction is $$\geq -1$$. Equivalently, we need to show that $$p(x) + {\left(1-x^n\right)}^2 \geq 0$$. We have

$p(x) + {\left(1-x^n\right)}^2 = -(n-1)x^n+nx^{n-1}-1 + 1 - 2x^n + x^{2n} = x^{2n} - (n+1)x^n + nx^{n-1} = x^{n-1}{\left(x^{n+1}-(n+1)x+n\right)}$

Denote $$q(x):=x^{n+1}-(n+1)x+n$$. We have

$\frac{{\mathrm{d}}{q(x)}}{{\mathrm{d}}x} = (n+1){\left(x^n - 1\right)}$

Therefore, $$q(x)$$ is decreasing in $$[0,1]$$. Observing that $$q(1) = 0$$ we conclude that $$q$$ is positive in $$(0,1)$$, establishing the desired result.

# Proposition A.2

For any $$x,y\in(0,1)$$:

$\frac{1-x}{{\left(1-xy\right)}^2} \leq \frac{1}{1-y}$

# Proof of Proposition A.2

Consider $$p(z):=z^3-3z+2$$. We have

$\frac{{\mathrm{d}}{p(z)}}{{\mathrm{d}}z} = 3z^2 - 3$

$$p(z)$$ has a local minimum at $$z=1$$ and a local maximum at $$z = -1$$. Therefore, for $$z \geq -1$$, $$p(z) \geq p(1) = 0$$. For $$z \geq 0$$, we get

$z^2 {\left(z^2 - 3\right)} + 2z = z^4-3z^2+2z = zp(z) \geq 0$

Taking $$z := \frac{1}{2}(x+y)$$, we get

${\left(\frac{x+y}{2}\right)}^2 {\left({\left(\frac{x+y}{2}\right)}^2 - 3\right)} + x + y \geq 0$

Now, $$\frac{1}{2}(x+y) \geq \sqrt{xy}$$ and $$t(t-3)$$ is a decreasing function of $$t$$ for $$t\in{\left[0,\frac{3}{2}\right]}$$. We conclude

$xy {\left(xy - 3\right)} + x + y \geq 0$

$x^2y^2 - 3xy + x + y \geq 0$

$x^2 y^2 - 2xy + 1 \geq xy - x - y + 1$

${\left(1-xy\right)}^2 \geq (1-x)(1-y)$

$\frac{1}{1-y} \geq \frac{1-x}{{\left(1-xy\right)}^2}$

# Proof of Proposition 2

${\operatorname{V}}_M(s,\gamma) = (1-\gamma) \sum_{n=0}^\infty \gamma^n \sum_{t \in {\mathcal{S}}_M} {\mathcal{T}}_{M\pi}^n (t \mid s) {\mathcal{R}}_M(t)$

By Proposition 1, we have

$\sum_{t \in {\mathcal{S}}_M} {\mathcal{T}}_{M\pi}^n(t \mid s) {\mathcal{R}}_M(t) = {\underset{k \sim \zeta(s)}{\operatorname{E}}{\left[\sum_{t \in {\mathcal{S}}_M} {\mathcal{T}}_{M\pi}^n \xi(t \mid s,k) {\mathcal{R}}_M(t)\right]}} + \delta_n$

where, $${\left\vert \delta_n \right\vert} \leq F \lambda^n$$. Denote

${\operatorname{V}}_M^{\text{per}}(s,\gamma) := (1-\gamma) \sum_{n=0}^\infty \gamma^n \delta_n$

${\operatorname{V}}_{Mk}^{\text{cyc}}(s,\gamma) := (1-\gamma) \sum_{n=0}^\infty \gamma^n \sum_{t \in {\mathcal{S}}_M} {\mathcal{T}}_{M\pi}^n \xi(t \mid s,k) {\mathcal{R}}_M(t)$

We get

${\operatorname{V}}_M(s,\gamma) = {\underset{k \sim \zeta(s)}{\operatorname{E}}{\left[{\operatorname{V}}_{Mk}^{\text{cyc}}(s,\gamma)\right]}} + {\operatorname{V}}_M^{\text{per}}(s,\gamma)$

$\frac{{\mathrm{d}}{{\operatorname{V}}_M(s,\gamma)}}{{\mathrm{d}}\gamma} = {\underset{k \sim \zeta(s)}{\operatorname{E}}{\left[\frac{{\mathrm{d}}{{\operatorname{V}}_{Mk}^{\text{cyc}}(s,\gamma)}}{{\mathrm{d}}\gamma}\right]}} + \frac{{\mathrm{d}}{{\operatorname{V}}_M^{\text{per}}(s,\gamma)}}{{\mathrm{d}}\gamma}$

We now analyze each term separately.

Denote

$r_{kn}:=\sum_{t \in {\mathcal{S}}_M} {\mathcal{T}}_{M\pi}^n \xi(t \mid s,k) {\mathcal{R}}_M(t)$

Note that, due to the property of $$\xi$$, $$r_{kn}$$ is periodic in $$n$$ with period $$k$$. We have

${\operatorname{V}}_{Mk}^{\text{cyc}}(s,\gamma) := (1-\gamma) \sum_{n=0}^{\infty} \gamma^{n} r_{kn} = (1-\gamma) \sum_{i=0}^{k-1}\sum_{n=0}^\infty \gamma^{nk+i} r_{ki} = \sum_{i=0}^{k-1} \frac{(1-\gamma)\gamma^i}{1-\gamma^k}r_{ki}$

${\left\vert \frac{{\mathrm{d}}{{\operatorname{V}}_{Mk}^{\text{cyc}}(s,\gamma)}}{{\mathrm{d}}\gamma} \right\vert} \leq \sum_{i=0}^{k-1} {\left\vert \frac{{\mathrm{d}}}{{\mathrm{d}}\gamma}{\left(\frac{(1-\gamma)\gamma^i}{1-\gamma^k}\right)} \right\vert}r_{ki}$

By Proposition A.1, we get

${\left\vert \frac{{\mathrm{d}}{{\operatorname{V}}_{Mk}^{\text{cyc}}(s,\gamma)}}{{\mathrm{d}}\gamma} \right\vert} \leq \sum_{i=0}^{k-1} r_{ki} \leq k$

Now we analyze the second term.

${\operatorname{V}}_M^{\text{per}}(s,\gamma) = \delta_0 + \sum_{n=0}^\infty \gamma^{n+1} {\left(\delta_{n+1} - \delta_n\right)}$

$\frac{{\mathrm{d}}{{\operatorname{V}}_M^{\text{per}}(s,\gamma)}}{{\mathrm{d}}\gamma} = \sum_{n=0}^\infty (n+1)\gamma^n {\left(\delta_{n+1} - \delta_n\right)}=\sum_{n=0}^\infty {\left(n\gamma^{n-1}-(n+1)\gamma^n\right)} \delta_n = \sum_{n=0}^\infty {\left({\left(1-\gamma\right)}n\gamma^{n-1}-\gamma^n\right)} \delta_n$

${\left\vert \frac{{\mathrm{d}}{{\operatorname{V}}_M^{\text{per}}(s,\gamma)}}{{\mathrm{d}}\gamma} \right\vert} \leq F \sum_{n=0}^\infty {\left({\left(1-\gamma\right)}n\gamma^{n-1}+\gamma^n\right)} \lambda^n=F{\left((1-\gamma)\sum_{n=0}^\infty n \gamma^{n-1}\lambda^n+\frac{1}{1-\gamma\lambda}\right)}$

${\left\vert \frac{{\mathrm{d}}{{\operatorname{V}}_M^{\text{per}}(s,\gamma)}}{{\mathrm{d}}\gamma} \right\vert} \leq F{\left((1-\gamma)\frac{{\mathrm{d}}}{{\mathrm{d}}\gamma}\sum_{n=0}^\infty \gamma^n\lambda^n+\frac{1}{1-\gamma\lambda}\right)}=F{\left((1-\gamma)\frac{{\mathrm{d}}}{{\mathrm{d}}\gamma}{\left(\frac{1}{1-\gamma\lambda}\right)}+\frac{1}{1-\gamma\lambda}\right)}$

${\left\vert \frac{{\mathrm{d}}{{\operatorname{V}}_M^{\text{per}}(s,\gamma)}}{{\mathrm{d}}\gamma} \right\vert} \leq F{\left(\lambda\cdot\frac{1-\gamma}{{\left(1-\gamma\lambda\right)}^2}+\frac{1}{1-\gamma\lambda}\right)}$

Applying Proposition A.2 to the first term, we get

${\left\vert \frac{{\mathrm{d}}{{\operatorname{V}}_M^{\text{per}}(s,\gamma)}}{{\mathrm{d}}\gamma} \right\vert} \leq F{\left(\lambda\cdot\frac{1}{1-\lambda}+\frac{1}{1-\gamma\lambda}\right)} \leq F\frac{1+\lambda}{1-\lambda}$

Collecting the terms, we have

${\left\vert \frac{{\mathrm{d}}{{\operatorname{V}}_M(s,\gamma)}}{{\mathrm{d}}\gamma} \right\vert} \leq {\underset{k \sim \zeta(s)}{\operatorname{E}}{\left[k\right]}} + F\frac{1+\lambda}{1-\lambda} \leq \max {P}+ F\frac{1+\lambda}{1-\lambda}$

## Appendix B

The following appeared in a previous essay as “Proposition C.1”.

# Proposition B.1

Fix an interface $${\mathcal{I}}=({\mathcal{A}},{\mathcal{O}})$$, $$N \in {\mathbb{N}}$$, $$\epsilon \in (0,\frac{1}{{\left\vert {\mathcal{A}}\right\vert}})$$, $$\eta \in (0,\frac{1}{N})$$. Consider some $$\{\sigma^k: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{\mathcal{A}}\}_{k \in [N]}$$. Then, there exist $${\mathcal{D}}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times {\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}$$ and $$\{{\mathcal{D}}^{!k}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times {\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}\}_{k \in [N]}$$ with the following properties. Given $$x \in \left({\mathcal{A}}\times {\overline{{\mathcal{A}}\times {\mathcal{O}}}}\right)^*$$, we denote $$\underline{x}$$ its projection to $${{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}$$. Thus, $$\underline{\underline{x}}\in{({\mathcal{A}}\times {\mathcal{O}})^*}$$. Given $$\mu$$ an $${\mathcal{I}}$$-environment, $$\pi: \operatorname{hdom}{\mu} {\xrightarrow{\text{k}}}{\mathcal{A}}$$, $${\mathcal{D}}': {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times {\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}$$ and $$k \in [N]$$, we can define $$\Xi\left[\mu,\sigma^k,{\mathcal{D}}',\pi\right]\in \Delta\left({\mathcal{A}}\times {\overline{{\mathcal{A}}\times {\mathcal{O}}}}\right)^\omega$$ as follows

$\Xi\left[\mu,\sigma^k,{\mathcal{D}}',\pi\right]\left(b,a,o \mid x\right):=\pi\left(b \mid \underline{\underline{x}}\right){\mathcal{D}}'\left(a \mid \underline{x},b\right) \bar{\mu}[\sigma^k]\left(o \mid \underline{x}a\right)$

We require that for every $$\pi$$, $$\mu$$ and $$k$$ as above, the following conditions hold

1. $\frac{1}{N}\sum_{j=0}^{N-1}{\underset{x \sim\Xi\left[\mu,\sigma^j,{\mathcal{D}}^{!j},\pi\right]}{\operatorname{E}}}\left[{\left\vert \{n \in {\mathbb{N}}\mid x_n \in {\mathcal{A}}\times \bot \times \bar{{\mathcal{O}}}\} \right\vert}\right] \leq \frac{\ln N}{\eta \ln\left(1 + \epsilon(1-\epsilon)^{(1-\epsilon)/\epsilon}\right)}=O\left(\frac{\ln N}{\eta \epsilon}\right)$

2. $${\operatorname{d}_{\text{tv}}}\left(\frac{1}{N}\sum_{j=0}^{N-1}{\Xi\left[\mu,\sigma^j,{\mathcal{D}},\pi\right]},\frac{1}{N}\sum_{j=0}^{N-1}{\Xi\left[\mu,\sigma^j,{\mathcal{D}}^{!j},\pi\right]}\right) \leq (N-1)\eta$$

3. For all $$x \in \operatorname{hdom}{\bar{\mu}[\sigma^k]}$$, if $${\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \ne \bot$$ then $$\sigma^k\left({\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \mid \underline{x}\right) > 0$$

4. For all $$x \in \operatorname{hdom}{\bar{\mu}[\sigma^k]}$$, if $${\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \not\in {\left\{\pi\left(\underline{x}\right), \bot\right\}}$$ then $$\sigma^k\left(\pi\left(\underline{x}\right) \mid \underline{x}\right) \leq \epsilon$$

The following is a simple modification of what appeared there as “Proposition B.2” (the corresponding modification of the proof is trivial and we leave it out.)

# Proposition B.2

Consider some $$\gamma\in(0,1)$$, $$\tau\in(0,\infty)$$, $$T\in{\mathbb{N}}^+$$, a universe $$\upsilon=(\mu,r)$$ that is an $${\mathcal{O}}$$-realization of $$M$$ with state function $$S$$, some $$\pi^*: \operatorname{hdom}{\mu} \rightarrow {\mathcal{A}}$$ and some $$\pi^0: \operatorname{hdom}{\mu} {\xrightarrow{\text{k}}}{\mathcal{A}}$$. Assume that $$\gamma \geq \gamma_M$$. For any $$n \in {\mathbb{N}}$$, let $$\pi^*_n$$ be an $${\mathcal{I}}$$-policy s.t. for any $$h \in \operatorname{hdom}{\mu}$$

$\pi^*_n(h):=\begin{cases} \pi^0(h) \text{ if } {\left\vert h \right\vert} < nT \\ \pi^*(h) \text{ otherwise} \end{cases}$

Assume that for any $$h \in \operatorname{hdom}{\mu}$$

1. $\pi^*(s) \in {\mathcal{A}}_{M}^\omega{\left(S(h)\right)}$

2. $\operatorname{supp}{\pi^0(h)} \subseteq {\mathcal{A}}_{M}^0{\left(S(h)\right)}$

3. For any $$\theta\in(\gamma,1)$$ ${\left\vert \frac{{\mathrm{d}}{\operatorname{V}}_{M}{\left(S(h),\theta\right)}}{{\mathrm{d}}\theta} \right\vert} \leq \tau$

Then

${\operatorname{EU}}^{*}_\upsilon(\gamma)-{\operatorname{EU}}^{\pi^0}_\upsilon(\gamma) \leq (1-\gamma)\sum_{n=0}^\infty \sum_{m=0}^{T-1} \gamma^{nT+m}\left({\underset{x\sim\mu\bowtie\pi^*_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]-{\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]\right) + \frac{2\tau\gamma^T(1-\gamma)}{1-\gamma^T}$

The following appeared previously as “Proposition A.1”.

# Proposition B.3

Consider a probability space $$(\Omega, P \in \Delta\Omega)$$, $$N \in {\mathbb{N}}$$, $$R \subseteq [0,1]$$ a finite set and random variables $$U: \Omega \rightarrow R$$, $$K: \Omega \rightarrow [N]$$ and $${J}: \Omega \rightarrow [N]$$. Assume that $$K_*P = J_*P = \zeta \in \Delta[N]$$ and $${\underset{}{\operatorname{I}}}[K;J] = 0$$. Then

${\underset{}{\operatorname{I}}}\left[K;J,U\right] \geq 2 \left(\min_{i \in [N]} {\zeta(i)}\right) \left({\underset{}{\operatorname{E}}}\left[U \mid J = K,\ K\right]-{\underset{}{\operatorname{E}}}\left[U\right]\right)^2$

### NEW DISCUSSION POSTS

If you drop the
 by Alex Appel on Distributed Cooperation | 1 like

Cool! I'm happy to see this
 by Abram Demski on Distributed Cooperation | 0 likes

Caveat: The version of EDT
 by 258 on In memoryless Cartesian environments, every UDT po... | 2 likes

[Delegative Reinforcement
 by Vadim Kosoy on Stable Pointers to Value II: Environmental Goals | 1 like

Intermediate update: The
 by Alex Appel on Further Progress on a Bayesian Version of Logical ... | 0 likes

Since Briggs [1] shows that
 by 258 on In memoryless Cartesian environments, every UDT po... | 2 likes

This doesn't quite work. The
 by Nisan Stiennon on Logical counterfactuals and differential privacy | 0 likes

I at first didn't understand
 by Sam Eisenstat on An Untrollable Mathematician | 1 like

This is somewhat related to
 by Vadim Kosoy on The set of Logical Inductors is not Convex | 0 likes

This uses logical inductors
 by Abram Demski on The set of Logical Inductors is not Convex | 0 likes

Nice writeup. Is one-boxing
 by Tom Everitt on Smoking Lesion Steelman II | 0 likes

Hi Alex! The definition of
 by Vadim Kosoy on Delegative Inverse Reinforcement Learning | 0 likes

A summary that might be
 by Alex Appel on Delegative Inverse Reinforcement Learning | 1 like

I don't believe that
 by Alex Appel on Delegative Inverse Reinforcement Learning | 0 likes

This is exactly the sort of
 by Stuart Armstrong on Being legible to other agents by committing to usi... | 0 likes