Intelligent Agent Foundations Forumsign up / log in
Delegative Reinforcement Learning with a Merely Sane Advisor
post by Vadim Kosoy 82 days ago | discuss

Previously, we defined a setting called “Delegative Inverse Reinforcement Learning” (DIRL) in which the agent can delegate actions to an “advisor” and the reward is only visible to the advisor as well. We proved a sublinear regret bound (converted to traditional normalization in online learning, the bound is \(O(n^{2/3})\)) for one-shot DIRL (as opposed to standard regret bounds in RL which are only applicable in the episodic setting). However, this required a rather strong assumption about the advisor: in particular, the advisor had to choose the optimal action with maximal likelihood. Here, we consider “Delegative Reinforcement Learning” (DRL), i.e. a similar setting in which the reward is directly observable by the agent. We also restrict our attention to finite MDP environments (we believe these results can be generalized to a much larger class of environments, but not to arbitrary environments). On the other hand, the assumption about the advisor is much weaker: the advisor is only required to avoid catastrophic actions (i.e. actions that lose value to zeroth order in the interest rate) and assign some positive probability to a nearly optimal action. As before, we prove a one-shot regret bound (in traditional normalization, \(O(n^{3/4})\)). Analogously to before, we allow for “corrupt” states in which both the advisor and the reward signal stop being reliable.


Appendix A contains the proofs and Appendix B contains propositions proved before.

Notation

The notation \(K: X {\xrightarrow{\text{k}}}Y\) means \(K\) is Markov kernel from \(X\) to \(Y\). When \(Y\) is a finite set, this is the same as a measurable function \(K: X \rightarrow \Delta Y\), and we use these notations interchangeably. Given \(K: X {\xrightarrow{\text{k}}}Y\) and \(A \subseteq Y\) (corr. \(y \in Y\)), we will use the notation \(K(A \mid x)\) (corr \(K(y \mid x)\)) to stand for \(\Pr_{y' \sim K(x)}\left[y' \in A\right]\) (corr. \(\Pr_{y' \sim K(x)}\left[y' = y\right]\)). Given \(Y\) is a finite set, \(\mu \in \Delta Y^\omega\), \(h \in Y^*\) and \(y \in Y\), the notation \(\mu(y \mid h)\) means \(\Pr_{x \sim \mu}\left[x_{{\lvert h \rvert}} = y \mid x_{:{\lvert h \rvert}}=h\right]\).

Given \(\Omega\) a measurable space, \(\mu \in \Delta \Omega\), \(n,m \in {\mathbb{N}}\), \(\{A_k\}_{k \in [n]}\), \(\{B_j\}_{j \in [m]}\) finite sets and \(\{X_k: \Omega \rightarrow A_k\}_{k \in [n]}\), \(\{Y_j: \Omega \rightarrow B_j\}_{j \in [m]}\) random variables (measurable mappings), the mutual information between the joint distribution of the \(X_k\) and the joint distribution of the \(Y_j\) will be denoted

\[{\underset{\omega \sim \mu}{\operatorname{I}}}\left[X_0, X_1 \ldots X_{n-1}; Y_0, Y_1 \ldots Y_{m-1}\right]\]

We will parameterize our geometric time discount by \(\gamma=e^{-1/t}\), thus all functions that were previously defined to depend on \(t\) are now considered functions of \(\gamma\).

Results

We start by explaining the relation between the formalism of general environments we used before and the formalism of finite MDPs.

Definition 1

A finite Markov Decision Process (MDP) is a tuple

\[M=\left({\mathcal{S}}_M,\ {\mathcal{A}}_M,\ {\mathcal{T}}_M: {\mathcal{S}}_M \times {\mathcal{A}}_M {\xrightarrow{\text{k}}}{\mathcal{S}}_M,\ {\mathcal{R}}_M: {\mathcal{S}}_M \rightarrow [0,1]\right)\]

Here, \({\mathcal{S}}_M\) is a finite set (the set of states), \({\mathcal{A}}_M\) is a finite set (the set of actions), \({\mathcal{T}}_M\) is the transition kernel and \({\mathcal{R}}_M\) is the reward function.

A stationary policy for \(M\) is any \(\pi: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{A}}_M\). The space of stationary policies is denoted \(\Pi_M\). Given \(\pi \in \Pi_M\), we define \({\mathcal{T}}_{M\pi}: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{S}}_M\) by

\[{\mathcal{T}}_{M\pi}(t \mid s) := \sum_{a \in {\mathcal{A}}_M} {{\mathcal{T}}_M\left(t \mid s, a\right) \pi(a \mid s)}\]

We define \({\operatorname{V}}_M : {\mathcal{S}}_M \times (0,1) \rightarrow [0,1]\) and \({\operatorname{Q}}_M: {\mathcal{S}}_M \times {\mathcal{A}}_M \times (0,1) \rightarrow [0,1]\) by

\[{\operatorname{V}}_M(s,\gamma):=(1-\gamma)\max_{\pi \in \Pi_M} \sum_{n=0}^\infty{\gamma^n {\underset{{\mathcal{T}}_{M\pi}^n(s)}{\operatorname{E}}}{\left[{\mathcal{R}}_M\right]}}\]

\[{\operatorname{Q}}_M(s,a,\gamma):=(1-\gamma){\mathcal{R}}_M(s)+\gamma{\underset{t \sim {\mathcal{T}}_M(s,a)}{\operatorname{E}}}\left[{\operatorname{V}}_M(t,\gamma)\right]\]

Here, \({\mathcal{T}}_{M\pi}^n: {\mathcal{S}}_M {\xrightarrow{\text{k}}}{\mathcal{S}}_M\) is just the \(n\)-th power of \({\mathcal{T}}_{M\pi}\) in the sense of Markov kernel composition.

As well known, \({\operatorname{V}}_M\) and \({\operatorname{Q}}_M\) are rational functions in \(\gamma\) for \(1-\gamma \ll 1\), therefore in this limit we have the Taylor expansions

\[{\operatorname{V}}_M(s,\gamma)=\sum_{k=0}^\infty {\frac{1}{k!} {\operatorname{V}}_M^k(s)\cdot(1-\gamma)^k}\]

\[{\operatorname{Q}}_M(s,a,\gamma)=\sum_{k=0}^\infty {\frac{1}{k!} {\operatorname{Q}}_M^k(s,a)\cdot(1-\gamma)^k}\]

Given any \(s \in {\mathcal{S}}_M\), we define \({\{{\mathcal{A}}_M^k(s) \subseteq {\mathcal{A}}_M\}_{k \in {\mathbb{N}}}}\) recursively by

\[{\mathcal{A}}^0_M(s) := {\underset{a \in {\mathcal{A}}_M}{\operatorname{arg\,max}}\,} {\operatorname{Q}}_M^0(s,a)\]

\[{\mathcal{A}}^{k+1}_M(s) := {\underset{a \in {\mathcal{A}}_M^k(s)}{\operatorname{arg\,max}}\,} {\operatorname{Q}}_M^{k+1}(s,a)\]


All MDPs will be assumed to be finite, so we drop the adjective “finite” from now on.

Definition 2

Let \({\mathcal{I}}=({\mathcal{A}},{\mathcal{O}})\) be an interface. An \({\mathcal{I}}\)-universe \(\upsilon=(\mu,r)\) is said to be an \({\mathcal{O}}\)-realization of MDP \(M\) with state function \(S: \operatorname{hdom}{\mu} \rightarrow {\mathcal{S}}_M\) when \({\mathcal{A}}_M={\mathcal{A}}\) and for any \(h \in \operatorname{hdom}{\mu}\), \(a \in {\mathcal{A}}\) and \(o \in {\mathcal{O}}\)

\[{\mathcal{T}}_M\left(s \mid S(h),a\right) =\Pr_{o \sim \mu(ha)}\left[S(hao)=s\right]\]

\[r(h)={\mathcal{R}}_M\left(S(h)\right)\]


Now now define the relevant notion of a “good advisor.”

Definition 3

Let \(\upsilon = (\mu,r)\) be a universe and \(\epsilon > 0\). A policy \(\pi\) is said to be \(\epsilon\)-sane for \(\upsilon\) when there are \(M\), \(S\) s.t. \(\upsilon\) is an \({\mathcal{O}}\)-realization of \(M\) with state function \(S\) and for any \(h \in \operatorname{hdom}{\mu}\)

  1. \(\operatorname{supp}{\pi(h)} \subseteq {\mathcal{A}}_M^0\left(S(h)\right)\)

  2. \(\exists a \in {\mathcal{A}}_M^1\left(S(h)\right): \pi(a \mid h) > \epsilon\)


We can now formulate the regret bound.

Theorem 1

Fix an interface \({\mathcal{I}}\) and \(\epsilon > 0\). Consider \({\mathcal{H}}= \{\upsilon^k = (\mu^k,r) \in \Upsilon_{{\mathcal{I}}}\}_{k \in [N]}\) for some \(N \in {\mathbb{N}}\) (note that \(r\) doesn’t depend on \(k\)). Assume that for each \(k \in [N]\), \(\sigma^k\) is an \(\epsilon\)-sane policy for \(\upsilon^k\). Then, there is an \(\bar{{\mathcal{I}}}\)-metapolicy \(\pi^*\) s.t. for any \(k \in [N]\)

\[{\operatorname{EU}}_{\upsilon^k}^*(\gamma) - {\operatorname{EU}}_{\bar{\upsilon}^k\left[\sigma^k\right]}^{\pi^*}(\gamma) = O\left((1-\gamma)^{1/4}\right)\]

Corollary 1

Fix an interface \({\mathcal{I}}\) and \(\epsilon > 0\). Consider \({\mathcal{H}}= \{\upsilon^k = (\mu^k,r) \in \Upsilon_{{\mathcal{I}}}\}_{k \in {\mathbb{N}}}\). Assume that for each \(k \in {\mathbb{N}}\), \(\sigma^k\) is a \(\epsilon\)-sane policy for \(\upsilon^k\). Define \(\bar{{\mathcal{H}}}:={{\{\bar{\upsilon}^k\left[\sigma^k\right]\}_{n \in {\mathbb{N}}}}}\). Then, \(\bar{{\mathcal{H}}}\) is learnable.


Now, we deal with corrupt states.

Definition 4

Let \(\upsilon = (\mu,r)\) be a universe and \(\epsilon > 0\). A policy \(\pi\) is said to be locally \(\epsilon\)-sane for \(\upsilon\) when there are \(M\), \(S\) and \({\mathcal{U}}\subseteq {\mathcal{S}}_M\) (the set of uncorrupt states) s.t. \(\upsilon\) is an \({\mathcal{O}}\)-realization of \(M\) with state function \(S\), \(S({\boldsymbol{\lambda}}) \in {\mathcal{U}}\) and for any \(h \in \operatorname{hdom}{\mu}\), if \(S(h) \in {\mathcal{U}}\) then

  1. If \(a \in \operatorname{supp}{\pi(h)}\) and \(o \in \operatorname{supp}{\mu(ha)}\) then \(S\left(hao\right) \in {\mathcal{U}}\).

  2. \(\operatorname{supp}{\pi(h)} \subseteq {\mathcal{A}}_M^0\left(S(h)\right)\)

  3. \(\exists a \in {\mathcal{A}}_M^1\left(S(h)\right): \pi(a \mid h) > \epsilon\)

  4. \(\exists a \in {\mathcal{A}}_M^2{\left(S(h)\right)}: {\mathcal{T}}_M{\left({\mathcal{U}}\mid S(h), a\right)} = 1\)


Of course, this requirement is still unrealistic for humans in the real world. In particular, it makes the formalism unsuitable for modeling the use of AI for catastrophe mitigation (which is ultimately what we are interested in!) since it assumes the advisor is already capable of avoiding any catastrophe. In following work, we plan to relax the assumptions further.

Corollary 2

Fix an interface \({\mathcal{I}}\) and \(\epsilon > 0\). Consider \({\mathcal{H}}= \{\upsilon^k = (\mu^k,r^k) \in \Upsilon_{{\mathcal{I}}}\}_{k \in [N]}\) for some \(N \in {\mathbb{N}}\). Assume that for each \(k \in [N]\), \(\sigma^k\) is locally \(\epsilon\)-sane for \(\upsilon^k\). For each \(k \in [N]\), let \({\mathcal{U}}^k \subseteq {\mathcal{S}}_{M^k}\) be the corresponding set of uncorrupt states. Assume further that for any \(k,j \in {\mathbb{N}}\) and \(h \in \operatorname{hdom}{\mu^k} \cap \operatorname{hdom}{\mu^j}\), if \(S^k(h) \in {\mathcal{U}}^k\) and \(S^j(h) \in {\mathcal{U}}^j\), then \(r^k(h)=r^j(h)\). Then, there is an \(\bar{{\mathcal{I}}}\)-policy \(\pi^*\) s.t. for any \(k \in [N]\)

\[{\operatorname{EU}}_{\upsilon^k}^*(\gamma) - {\operatorname{EU}}_{\bar{\upsilon}^k\left[\sigma^k\right]}^{\pi^*}(\gamma) = O\left((1-\gamma)^{1/4}\right)\]

Corollary 3

Assume the same conditions as in Corollary 2, except that \({\mathcal{H}}\) may be countable infinite. Then, \(\bar{{\mathcal{H}}}\) is learnable.

Appendix A

First, we prove an information theoretic bound that shows that for Thompson sampling, the expected information gain is bounded below by a function of the loss.

Proposition A.1

Consider a probability space \((\Omega, P \in \Delta\Omega)\), \(N \in {\mathbb{N}}\), \(R \subseteq [0,1]\) a finite set and random variables \(U: \Omega \rightarrow R\), \(K: \Omega \rightarrow [N]\) and \({J}: \Omega \rightarrow [N]\). Assume that \(K_*P = J_*P = \zeta \in \Delta[N]\) and \({\underset{}{\operatorname{I}}}[K;J] = 0\). Then

\[{\underset{}{\operatorname{I}}}\left[K;J,U\right] \geq 2 \left(\min_{i \in [N]} {\zeta(i)}\right) \left({\underset{}{\operatorname{E}}}\left[U \mid J = K\right]-{\underset{}{\operatorname{E}}}\left[U\right]\right)^2\]

Proof of Proposition A.1

We have

\[{\underset{}{\operatorname{I}}}\left[K;J,U\right] = {\underset{}{\operatorname{I}}}\big[K;J\big] + {\underset{}{\operatorname{I}}}\left[K;U \mid J\right] = {\underset{}{\operatorname{I}}}\left[K;U \mid J\right] = {\underset{}{\operatorname{E}}}\left[{\operatorname{D}_{\mathrm{KL}}\left(U_*\left(P \mid K,J\right) \| \ U_*\left(P \mid J\right)\right)}\right]\]

Using Pinsker’s inequality, we get

\[{\underset{}{\operatorname{I}}{\left[K;J,U\right]}} \geq 2{\underset{}{\operatorname{E}}{\left[{\operatorname{d}_{\text{tv}}{\left(U_*\left(P \mid K,J\right),U_*\left(P \mid J\right)\right)}}^2\right]}} \geq 2{\underset{}{\operatorname{E}}{\left[{\left({\underset{}{\operatorname{E}}{\left[U \mid K,J\right]}}-{\underset{}{\operatorname{E}}{\left[U \mid J\right]}}\right)}^2\right]}}\]

Denote \(U_{kj} := {\underset{}{\operatorname{E}}{\left[U \mid K = k, J = j\right]}}\). We get

\[{\underset{}{\operatorname{I}}{\left[K;J,U\right]}} \geq 2{\underset{(k,j)\sim\zeta\times\zeta}{\operatorname{E}}{\left[{\left(U_{kj}-{\underset{k'\sim\zeta}{\operatorname{E}}{\left[U_{k'j}\right]}}\right)}^2\right]}} \geq 2{\underset{(k,j)\sim\zeta\times\zeta}{\operatorname{E}}{\left[[[k=j]]{\left(U_{kj}-{\underset{k'\sim\zeta}{\operatorname{E}}{\left[U_{k'j}\right]}}\right)}^2\right]}}\]

\[{\underset{}{\operatorname{I}}{\left[K;J,U\right]}} \geq 2{\underset{j \sim \zeta}{\operatorname{E}}{\left[\zeta(j){\left(U_{jj}-{\underset{k\sim\zeta}{\operatorname{E}}{\left[U_{kj}\right]}}\right)}^2\right]}} \geq 2 {\left(\min_{j \in [N]} \zeta(j)\right)}{\underset{j \sim \zeta}{\operatorname{E}}{\left[{\left(U_{jj}-{\underset{k\sim\zeta}{\operatorname{E}}{\left[U_{kj}\right]}}\right)}^2\right]}}\]

\[{\underset{}{\operatorname{I}}{\left[K;J,U\right]}}\geq 2 {\left(\min_{j \in [N]} \zeta(j)\right)} {\left({\underset{j \sim \zeta}{\operatorname{E}}{\left[U_{jj}\right]}}-{\underset{(k,j)\sim\zeta\times\zeta}{\operatorname{E}}{\left[U_{kj}\right]}}\right)}^2=2 \left(\min_{i \in [N]} {\zeta(i)}\right) \left({\underset{}{\operatorname{E}}}\left[U \mid J = K\right]-{\underset{}{\operatorname{E}}}\left[U\right]\right)^2\]


Now, we describe a “delegation routine” \({\mathcal{D}}\) that can transform any “proto-policy” \(\pi\) that recommends some set of actions from \({\mathcal{A}}\) into an actual \({{\bar{{\mathcal{I}}}}}\)-policy s.t (i) with high probability, on each round, either a “safe” recommended action is taken, or all recommended actions are “unsafe” or delegation is performed and (ii) the expected number of delegations is small. For technical reasons, we also need to the modified routines \({\mathcal{D}}^{!k}\) which behave the same way as \({\mathcal{D}}\) except for some low probability cases.

Proposition A.2

Fix an interface \({\mathcal{I}}=({\mathcal{A}},{\mathcal{O}})\), \(N \in {\mathbb{N}}\), \(\epsilon \in (0,\frac{1}{{\lvert {\mathcal{A}}\rvert}})\), \(\delta \in (0,\frac{1}{N})\). Consider some \(\{\sigma^k: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{\mathcal{A}}\}_{k \in [N]}\). Then, there exist \({\mathcal{D}}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times 2^{\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}\) and \(\{{\mathcal{D}}^{!k}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times 2^{\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}\}_{k \in [N]}\) with the following properties. Given \(x \in \left(2^{\mathcal{A}}\times {\overline{{\mathcal{A}}\times {\mathcal{O}}}}\right)^*\), we denote \(\underline{x}\) its projection to \({{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\). Thus, \(\underline{\underline{x}}\in{({\mathcal{A}}\times {\mathcal{O}})^*}\). Given \(\mu\) an \({\mathcal{I}}\)-environment, \(\pi: \operatorname{hdom}{\mu} {\xrightarrow{\text{k}}}2^{\mathcal{A}}\), \({\mathcal{D}}': {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times 2^{\mathcal{A}}\rightarrow {\bar{{\mathcal{A}}}}\) and \(k \in [N]\), we can define \(\Xi\left[\mu,\sigma^k,{\mathcal{D}}',\pi\right]\in \Delta\left(2^{\mathcal{A}}\times {\overline{{\mathcal{A}}\times {\mathcal{O}}}}\right)^\omega\) as follows

\[\Xi\left[\mu,\sigma^k,{\mathcal{D}}',\pi\right]\left({\mathcal{B}},a,o \mid x\right):=\pi\left({\mathcal{B}}\mid \underline{\underline{x}}\right){\mathcal{D}}'\left(a \mid \underline{x},{\mathcal{B}}\right) \bar{\mu}[\sigma^k]\left(o \mid \underline{x}a\right)\]

We require that for every \(\pi\), \(\mu\) and \(k\) as above, the following conditions hold

  1. \[{\underset{x \sim\Xi\left[\mu,\sigma^k,{\mathcal{D}}^{!k},\pi\right]}{\operatorname{E}}}\left[{\lvert \{n \in {\mathbb{N}}\mid x_n \in 2^{\mathcal{A}}\times \bot \times \bar{{\mathcal{O}}}\} \rvert}\right] \leq \frac{\ln N}{\delta \ln\left(1 + \epsilon(1-\epsilon)^{(1-\epsilon)/\epsilon}\right)}=O\left(\frac{\ln N}{\delta \epsilon}\right)\]

  2. \({\operatorname{d}_{\text{tv}}}\left(\frac{1}{N}\sum_{j=0}^{N-1}{\Xi\left[\mu,\sigma^j,{\mathcal{D}},\pi\right]},\frac{1}{N}\sum_{j=0}^{N-1}{\Xi\left[\mu,\sigma^j,{\mathcal{D}}^{!j},\pi\right]}\right) \leq (N-1)\delta\)

  3. For all \(x \in \operatorname{hdom}{\bar{\mu}[\sigma^k]}\), if \({\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \ne \bot\) then \(\sigma^k\left({\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \mid \underline{x}\right) > 0\)

  4. For all \(x \in \operatorname{hdom}{\bar{\mu}[\sigma^k]}\), if \({\mathcal{D}}^{!k}\left(x,\pi\left(\underline{x}\right)\right) \not\in \pi\left(\underline{x}\right) \cup \{\bot\}\) then \(\forall a \in \pi\left(\underline{x}\right): \sigma^k\left(a \mid \underline{x}\right) \leq \epsilon\)


In order to prove Proposition A.2, we need another mutual information bound.

Proposition A.3

Consider \(N \in {\mathbb{N}}\), \({\mathcal{A}}\) a finite set, \(\epsilon \in (0,\frac{1}{{\lvert {\mathcal{A}}\rvert}})\), \(\delta \in (0,1)\), \({\mathcal{B}}\subseteq {\mathcal{A}}\), \(\zeta \in \Delta[N]\) and \(\sigma: [N] {\xrightarrow{\text{k}}}{\mathcal{A}}\). Suppose that for every \(a \in {\mathcal{A}}\)

\[{\underset{k \sim \zeta}{\operatorname{Pr}}{\left[\sigma(a \mid k) > 0 \land {\left(a \in {\mathcal{B}}\lor \forall b \in {\mathcal{B}}: \sigma(b \mid k) \leq \epsilon\right)}\right]}} \leq 1 - \delta\]

Then

\[{\underset{(k,a)\sim\zeta\ltimes\sigma}{\operatorname{I}}{\left[k;a\right]}} \geq \delta \ln\left(1 + \epsilon(1-\epsilon)^{(1-\epsilon)/\epsilon}\right)\]

Proof of Proposition A.3

We have

\[{\underset{(k,a)\sim\zeta\ltimes\sigma}{\operatorname{I}}{\left[k;a\right]}} = {\underset{k \sim \zeta}{\operatorname{E}}{\left[{\operatorname{D}_{\mathrm{KL}}\left(\sigma(k) \| \sigma_*\zeta\right)}\right]}}\]

Define \(q \in (0,1)\) by

\[q:=\frac{1}{\epsilon+(1-\epsilon)^{-(1-\epsilon)/\epsilon}}\]

Let \(a^* \in {\mathcal{A}}\) be s.t. \({\left(\sigma_*\zeta\right)}{\left(a^*\right)} > q\epsilon\) and either \(a^* \in {\mathcal{B}}\) or every \(a \in {\mathcal{B}}\) has \({\left(\sigma_*\zeta\right)}(a) \leq q\epsilon\). For every \(k \in [N]\), denote

\[{\mathcal{A}}_k := \{a \in {\mathcal{A}}\mid \sigma(a \mid k) > 0 \land {\left(a \in {\mathcal{B}}\lor \forall b \in {\mathcal{B}}: \sigma(b \mid k) \leq \epsilon\right)}\}\]

If \(a^* \not\in {\mathcal{A}}_k\) then either \(\sigma{\left(a^* \mid k\right)} = 0\) or there is \(a \in {\mathcal{B}}\) s.t. \({\left(\sigma_*\zeta\right)}(a) \leq q\epsilon\) and \(\sigma(b \mid k) \geq \epsilon\). This implies

\[{\operatorname{D}_{\mathrm{KL}}\left(\sigma(k) \| \sigma_*\zeta\right)} \geq \min{\left({\operatorname{D}_{\mathrm{KL}}\left(0 \| q\epsilon\right)},{\operatorname{D}_{\mathrm{KL}}\left(\epsilon \| q\epsilon\right)}\right)}\]

We have

\[{\operatorname{D}_{\mathrm{KL}}\left(0 \| q\epsilon\right)} = \ln{\frac{1}{1-q\epsilon}}=\ln{\frac{1}{1-\frac{\epsilon}{\epsilon+(1-\epsilon)^{-(1-\epsilon)/\epsilon}}}}=\ln{\frac{\epsilon+(1-\epsilon)^{-(1-\epsilon)/\epsilon}}{\epsilon+(1-\epsilon)^{-(1-\epsilon)/\epsilon}-\epsilon}}=\ln{{\left(1+\epsilon(1-\epsilon)^{(1-\epsilon)/\epsilon}\right)}}\]

\[{\operatorname{D}_{\mathrm{KL}}\left(\epsilon \| q\epsilon\right)} = \epsilon \ln{\frac{\epsilon}{q\epsilon }}+(1-\epsilon)\ln{\frac{1-\epsilon}{1-q\epsilon}}= \epsilon \ln{\frac{1}{q}}+(1-\epsilon)\ln{(1-\epsilon)} - \epsilon \ln{\frac{1}{1-q\epsilon}}+\ln{\frac{1}{1-q\epsilon}}\]

\[{\operatorname{D}_{\mathrm{KL}}\left(\epsilon \| q\epsilon\right)} = \epsilon \ln{\frac{1-q\epsilon}{q}}+\ln{(1-\epsilon)^{1-\epsilon}} +\ln{\frac{1}{1-q\epsilon}}=\epsilon \ln{{\left(\frac{1}{q}-\epsilon\right)}}+\ln{(1-\epsilon)^{1-\epsilon}} +\ln{\frac{1}{1-q\epsilon}}\]

\[{\operatorname{D}_{\mathrm{KL}}\left(\epsilon \| q\epsilon\right)} = \epsilon \ln{(1-\epsilon)^{-(1-\epsilon)/\epsilon}}+\ln{(1-\epsilon)^{1-\epsilon}} +\ln{\frac{1}{1-q\epsilon}} = \ln{(1-\epsilon)^{-(1-\epsilon)}}+\ln{(1-\epsilon)^{1-\epsilon}} +\ln{\frac{1}{1-q\epsilon}}=\ln{\frac{1}{1-q\epsilon}}\]

\[{\operatorname{D}_{\mathrm{KL}}\left(\epsilon \| q\epsilon\right)} = \ln{{\left(1+\epsilon(1-\epsilon)^{(1-\epsilon)/\epsilon}\right)}}\]

It follows that

\[{\underset{(k,a)\sim\zeta\ltimes\sigma}{\operatorname{I}}{\left[k;a\right]}} \geq {\underset{k \sim \zeta}{\operatorname{Pr}}{\left[a^* \not \in {\mathcal{A}}_k\right]}} \ln{{\left(1+\epsilon(1-\epsilon)^{(1-\epsilon)/\epsilon}\right)}} \geq \delta \ln{{\left(1+\epsilon(1-\epsilon)^{(1-\epsilon)/\epsilon}\right)}}\]

Proof of Proposition A.2

We define \({\{{\mathcal{A}}_k: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\times 2^{\mathcal{A}}\rightarrow 2^A\}_{k \in {\mathbb{N}}}}\), \(\tilde{\zeta}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}{\xrightarrow{\text{k}}}[N]\), \(\zeta: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}{\xrightarrow{\text{k}}}[N]\), \({\{\tilde{\zeta}^{!k}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}{\xrightarrow{\text{k}}}[N]\}_{k \in {\mathbb{N}}}}\), \({\{\zeta^{!k}: {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}{\xrightarrow{\text{k}}}[N]\}_{k \in {\mathbb{N}}}}\), \({\mathcal{D}}\) and \({\{{\mathcal{D}}^{!k}\}_{k \in {\mathbb{N}}}}\) recursively. For each \(h \in {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\), \(a \in {\bar{{\mathcal{A}}}}\), \(o \in {\bar{{\mathcal{O}}}}\), \({\mathcal{B}}\subseteq {\mathcal{A}}\) and \(j,k \in [N]\), we require

\[{\mathcal{A}}_k(h, {\mathcal{B}}):=\{a \in {\mathcal{A}}\mid \sigma^k{\left(a \mid \underline{h}\right)} > 0 \land {\left(a \in {\mathcal{B}}\lor \forall b \in {\mathcal{B}}: \sigma^k{\left(b \mid \underline{h}\right)} \leq \epsilon\right)}\}\]

\[\tilde{\zeta}(k \mid {\boldsymbol{\lambda}})=\tilde{\zeta}^{!j}(k \mid {\boldsymbol{\lambda}}):=\frac{1}{N}\]

\[\tilde{\zeta}(k \mid hao):=\begin{cases} \zeta(k \mid h) \text{ if } a\ne\bot \\ \frac{1}{N} \text{ if } a=\bot,\ \sum_{i=0}^{N-1} \zeta(i \mid h)\cdot \sigma^i{\left(b \mid \underline{h}\right)}=0 \\ {\left(\sum_{i=0}^{N-1} \zeta(i \mid h)\cdot \sigma^i{\left(b \mid \underline{h}\right)}\right)}^{-1} \zeta(k \mid h)\cdot \sigma^k{\left(b \mid \underline{h}\right)} \text{ otherwise, assuming } o \in b{\mathcal{O}}\end{cases}\]

\[\tilde{\zeta}^{!j}(k \mid hao):=\begin{cases} \zeta^{!j}(k \mid h) \text{ if } a\ne\bot \\ {\left(\sum_{i=0}^{N-1} \zeta^{!j}(i \mid h)\cdot \sigma^i{\left(b \mid \underline{h}\right)}\right)}^{-1} \zeta^{!j}(k \mid h)\cdot \sigma^k{\left(b \mid \underline{h}\right)} \text{ if } a=\bot,\ o \in b{\mathcal{O}}\end{cases}\]

\[\zeta(k \mid h) := \frac{\tilde{\zeta}(k \mid h) [[\tilde{\zeta}(k \mid h) \geq \delta]]}{\sum_{i=0}^{N-1} \tilde{\zeta}(i \mid h) [[\tilde{\zeta}(i \mid h) \geq \delta]]}\]

\[\zeta^{!j}(k \mid h) := \frac{\tilde{\zeta}^{!j}(k \mid h) [[\tilde{\zeta}^{!j}(k \mid h) \geq \delta]]}{\sum_{i=0}^{N-1} \tilde{\zeta}^{!j}(i \mid h) [[\tilde{\zeta}^{!j}(i \mid h) \geq \delta]]}[[\tilde{\zeta}^{!j}(j \mid h) \geq \delta]] + [[k = j]] \cdot [[\tilde{\zeta}^{!j}(j \mid h) < \delta]]\]

\[{\mathcal{D}}(h,{\mathcal{B}}) := \begin{cases} \text{any } a \text{ s.t. } \forall k \in \operatorname{supp}\zeta(h): a \in {\mathcal{A}}_k(h,{\mathcal{B}}) \text{ if such exists} \\ \bot \text{ otherwise} \end{cases}\]

\[{\mathcal{D}}^{!j}(h,{\mathcal{B}}) := \begin{cases} \text{any } a \text{ s.t. } \forall k \in \operatorname{supp}\zeta^{!j}(h): a \in {\mathcal{A}}_k(h,{\mathcal{B}}) \text{ if such exists} \\ \bot \text{ otherwise} \end{cases}\]

Denote \(\Xi^{!k}:= \Xi{\left[\mu,\sigma^k,{\mathcal{D}}^{!k},\pi\right]}\). Proposition A.3 implies that for any \(k \in [N]\)

\[\ln N \leq \sum_{n=0}^\infty {\underset{x \sim \Xi^{!k}}{\operatorname{E}}{\left[{\operatorname{H}{\left(\zeta^{!k}{\left(\underline{x}_{:n+1}\right)}\right)}}-{\operatorname{H}{\left(\zeta^{!k}{\left(\underline{x}_{:n}\right)}\right)}}\right]}} \leq \sum_{n=0}^\infty {\underset{x \sim \Xi^{!k}}{\operatorname{Pr}}{\left[x_n \in 2^{\mathcal{A}}\times\bot\times{\mathcal{O}}\right]}}\delta \ln\left(1 + \epsilon(1-\epsilon)^{\frac{1-\epsilon}{\epsilon}}\right)\]

This gives us condition i.

Condition ii follows because the only difference is in the equation for \(\zeta\) vs. \(\zeta^{!j}\). That is, \(\zeta\) may “discard” the “correct” element of \([N]\) but this happens with probability at most \(\delta\) per discard and there are at most \(N-1\) discards.

Conditions iii and iv are obvious from the definition.

Definition A.1

Consider \(k \in {\mathbb{N}}\) and a universe \(\upsilon=(\mu,r)\) that is an \({\mathcal{O}}\)-realization of \(M\) with state function \(S\). A policy \(\pi\) is called \(k\)-optimal for \(\upsilon\) when for any \(h \in \operatorname{hdom}{\mu}\)

\[\operatorname{supp}\pi(h) \subseteq {\mathcal{A}}_M^k\left(S(h)\right)\]

\(\pi\) is called Blackwell optimal for \(\upsilon\) when it is \(k\)-optimal for any \(k \in {\mathbb{N}}\). Obviously, a stationary Blackwell optimal policy always exists. It is a standard (up to straightforward adaption to our formalism) result in MDP theory that any Blackwell optimal policy has maximal expected utility for any \(\gamma\) sufficiently close to 1.


The following proposition relates optimality in terms of expected utility to expected truncated utility, where truncated utility is defined by only summing rewards within a time duration \(T\).

Proposition A.4

Fix an MDP \(M\). Then, for any \(\gamma\in(0,1)\), \(T \in {\mathbb{N}}\), universe \(\upsilon=(\mu,r)\) that is an \({\mathcal{O}}\)-realization of \(M\) with state function \(S\), \(\pi^1\) a \(1\)-optimal policy for \(\upsilon\) and \(\pi^*\) a Blackwell optimal policy for \(\upsilon\), we have

\[\sum_{n=0}^{T-1} \gamma^n \left({\underset{x \sim \mu\bowtie\pi^*}{\operatorname{E}}}\left[r\left(x_{:n}\right)\right]-{\underset{x \sim \mu\bowtie\pi^1}{\operatorname{E}}}\left[r\left(x_{:n}\right)\right]\right) \leq O\left(1 + T\left(1-\gamma\right)\right)\]

Proof of Proposition A.4

Let \(\pi^*_1\) be a policy s.t. for any \(h \in \operatorname{hdom}{\mu}\)

\[\pi^*_1(h):=\begin{cases} \pi^1(h) \text{ if } {\lvert h \rvert} < T \\ \pi^*(h) \text{ otherwise} \end{cases}\]

By Proposition B.1

\[{\operatorname{EU}}_{\upsilon}^{*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^*_1}(\gamma) = \sum_{n=0}^\infty{\gamma^n {\underset{x\sim\mu\bowtie\pi^*_1}{\operatorname{E}}{\left[{\operatorname{V}}^{\upsilon}_\gamma\Big(x_{:n}\Big)-{\operatorname{Q}}^{\upsilon}_\gamma{\left(x_{:n},x_n^{\mathcal{A}}\right)}\right]}}}\]

\[{\operatorname{EU}}_{\upsilon}^{*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^*_1}(\gamma) = \sum_{n=0}^\infty{\gamma^n {\underset{x\sim\mu\bowtie\pi^*_1}{\operatorname{E}}{\left[{\operatorname{V}}_M\Big(S{\left(x_{:n}\right)},\gamma\Big)-{\operatorname{Q}}_M{\left(S{\left(x_{:n}\right)},x_n^{\mathcal{A}},\gamma\right)}\right]}}}\]

\[{\operatorname{EU}}_{\upsilon}^{*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^*_1}(\gamma) = \sum_{n=0}^\infty{\gamma^n \sum_{k=0}^\infty{\frac{(1-\gamma)^k}{k!} {\underset{x\sim\mu\bowtie\pi^*_1}{\operatorname{E}}{\left[{\operatorname{V}}_M^k\Big(S{\left(x_{:n}\right)}\Big)-{\operatorname{Q}}_M^k{\left(S{\left(x_{:n}\right)},x_n^{\mathcal{A}}\right)}\right]}}}}\]

Using the Blackwell optimality of \(\pi^*\) and 1-optimality of \(\pi^1\), we get that for \(1-\gamma \ll 1\)

\[{\operatorname{EU}}_{\upsilon}^{*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^*_1}(\gamma) = \sum_{n=0}^{T-1}{\gamma^n \sum_{k=2}^\infty{\frac{(1-\gamma)^k}{k!} {\underset{x\sim\mu\bowtie\pi^1}{\operatorname{E}}{\left[{\operatorname{V}}_M^k\Big(S{\left(x_{:n}\right)}\Big)-{\operatorname{Q}}_M^k{\left(S{\left(x_{:n}\right)},x_n^{\mathcal{A}}\right)}\right]}}}}\]

\[{\operatorname{EU}}_{\upsilon}^{*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^*_1}(\gamma) \leq \sum_{n=0}^{T-1}{\gamma^n \max_{s \in {\mathcal{S}}_M} \max_{a \in {\mathcal{A}}_M^1(s)} \sum_{k=2}^\infty{\frac{(1-\gamma)^k}{k!} {\left({\operatorname{V}}_M^k(s)-{\operatorname{Q}}_M^k{\left(s,a\right)}\right)}}}\]

\[{\operatorname{EU}}_{\upsilon}^{*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^*_1}(\gamma) = O{\left(T(1-\gamma)^2\right)}\]

\[(1-\gamma)\sum_{n=0}^{\infty} \gamma^n \left({\underset{x \sim \mu\bowtie\pi^*}{\operatorname{E}}}\left[r\left(x_{:n}\right)\right]-{\underset{x \sim \mu\bowtie\pi^*_1}{\operatorname{E}}}\left[r\left(x_{:n}\right)\right]\right) = O{\left(T(1-\gamma)^2\right)}\]

\[\sum_{n=0}^{\infty} \gamma^n \left({\underset{x \sim \mu\bowtie\pi^*}{\operatorname{E}}}\left[r\left(x_{:n}\right)\right]-{\underset{x \sim \mu\bowtie\pi^*_1}{\operatorname{E}}}\left[r\left(x_{:n}\right)\right]\right) = O{\left(T(1-\gamma)\right)}\]

Denote \(\rho^*:=\mu\bowtie\pi^*\), \(\rho^1:=\mu\bowtie\pi^1\). Using again the Blackwell optimality of \(\pi^*\)

\[\sum_{n=0}^{T - 1} \gamma^n \left({\underset{x \sim \rho^*}{\operatorname{E}}}\left[r\left(x_{:n}\right)\right]-{\underset{x \sim \rho^1}{\operatorname{E}}}\left[r\left(x_{:n}\right)\right]\right) + \frac{\gamma^T}{1-\gamma}{\left({\underset{x \sim \rho^*}{\operatorname{E}}}\left[{\operatorname{V}}_M{\left(S{\left(x_{:T}\right)}\right)}\right]-{\underset{x \sim \rho^1}{\operatorname{E}}}\left[{\operatorname{V}}_M{\left(S{\left(x_{:T}\right)}\right)}\right]\right)} = O{\left(T(1-\gamma)\right)}\]

Since both \(\pi^*\) and \(\pi^1\) are in particular 0-optimal, we have

\[{\underset{x \sim \rho^*}{\operatorname{E}}{\left[{\operatorname{V}}_M^0{\left(S{\left(x_{:T}\right)}\right)}\right]}}={\underset{x \sim \rho^1}{\operatorname{E}}{\left[{\operatorname{V}}_M^0{\left(S{\left(x_{:T}\right)}\right)}\right]}}={\operatorname{V}}_M^0{\left(S{\left({\boldsymbol{\lambda}}\right)}\right)}\]

It follows

\[\sum_{n=0}^{T - 1} \gamma^n \left({\underset{x \sim \rho^*}{\operatorname{E}}}\left[r\left(x_{:n}\right)\right]-{\underset{x \sim \rho^1}{\operatorname{E}}}\left[r\left(x_{:n}\right)\right]\right) \pm \frac{\gamma^T}{1-\gamma}O{\left(1-\gamma\right)}= O{\left(T(1-\gamma)\right)}\]

\[\sum_{n=0}^{T - 1} \gamma^n \left({\underset{x \sim \rho^*}{\operatorname{E}}}\left[r\left(x_{:n}\right)\right]-{\underset{x \sim \rho^1}{\operatorname{E}}}\left[r\left(x_{:n}\right)\right]\right) = O{\left(\gamma^T+T(1-\gamma)\right)}\]


The following shows that for any policy that doesn’t make “irreversible errors,” regret can be approximated by “episodic regret” for sufficiently large episode duration.

Proposition A.5

Consider a universe \(\upsilon=(\mu,r)\) that is an \({\mathcal{O}}\)-realization of \(M\) with state function \(S\). Suppose that \(\pi^*\) is a Blackwell optimal policy for \(\upsilon\) and \(\pi^0\) is a \(0\)-optimal policy for \(\upsilon\). For any \(n \in {\mathbb{N}}\), let \(\pi^*_n\) be a policy s.t. for any \(h \in \operatorname{hdom}{\mu}\)

\[\pi^*_n(h):=\begin{cases} \pi^0(h) \text{ if } {\lvert h \rvert} < nT \\ \pi^*(h) \text{ otherwise} \end{cases}\]

Then, for any \(\gamma\in(0,1)\) and \(T \in {\mathbb{N}}^+\)

\[{\operatorname{EU}}^*_\upsilon(\gamma)-{\operatorname{EU}}^{\pi^0}_\upsilon(\gamma) \leq (1-\gamma)\sum_{n=0}^\infty \sum_{m=0}^{T-1} \gamma^{nT+m}\left({\underset{x\sim\mu\bowtie\pi^*_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]-{\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]\right) + O\left(\frac{1-\gamma}{1-\gamma^T}\right)\]

Proof of Proposition A.5

By Proposition B.1, for any \(l \in {\mathbb{N}}\)

\[{\operatorname{EU}}_{\upsilon}^{*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi_l^*}(\gamma) = \sum_{n=0}^\infty{\gamma^n {\underset{x\sim\mu\bowtie\pi_l^*}{\operatorname{E}}{\left[{\operatorname{V}}^{\upsilon}_\gamma\Big(x_{:n}\Big)-{\operatorname{Q}}^{\upsilon}_\gamma{\left(x_{:n},x_n^{\mathcal{A}}\right)}\right]}}}\]

\(\pi^*_l\) becomes Blackwell optimal after \(lT\), therefore for \(1 - \gamma \ll 1\),

\[{\operatorname{EU}}_{\upsilon}^{*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi_l^*}(\gamma) = \sum_{n=0}^{lT-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}^{\upsilon}_\gamma\Big(x_{:n}\Big)-{\operatorname{Q}}^{\upsilon}_\gamma{\left(x_{:n},x_n^{\mathcal{A}}\right)}\right]}}}\]

\[{\operatorname{EU}}_{\upsilon}^{\pi_{l}^*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi_{l+1}^*}(\gamma) = \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}^{\upsilon}_\gamma\Big(x_{:n}\Big)-{\operatorname{Q}}^{\upsilon}_\gamma{\left(x_{:n},x_n^{\mathcal{A}}\right)}\right]}}}\]

\[(1-\gamma)\sum_{n=0}^\infty {\gamma^n{\left({\underset{x \sim \mu\bowtie\pi^*_l}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}-{\underset{x \sim \mu\bowtie\pi^*_{l+1}}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}\right)}} = \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}^{\upsilon}_\gamma\Big(x_{:n}\Big)-{\operatorname{Q}}^{\upsilon}_\gamma{\left(x_{:n},x_n^{\mathcal{A}}\right)}\right]}}}\]

\(\pi^*_l\) and \(\pi^*_{l+1}\) coincide until \(lT\), therefore

\[(1-\gamma)\sum_{n=lT}^\infty {\gamma^n{\left({\underset{x \sim \mu\bowtie\pi^*_l}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}-{\underset{x \sim \mu\bowtie\pi^*_{l+1}}{\operatorname{E}}{\left[r{\left(x_{:n}\right)}\right]}}\right)}} = \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{x\sim\mu\bowtie\pi^0}{\operatorname{E}}{\left[{\operatorname{V}}^{\upsilon}_\gamma\Big(x_{:n}\Big)-{\operatorname{Q}}^{\upsilon}_\gamma{\left(x_{:n},x_n^{\mathcal{A}}\right)}\right]}}}\]

Denote \(\rho^*_l:=\mu\bowtie\pi^*_l\), \(\rho^0:=\mu\bowtie\pi^0\). We also use the shorthand notations \(r_n:=r{\left(x_{:n}\right)}\), \({\operatorname{V}}_n:={\operatorname{V}}^\upsilon_\gamma{\left(x_{:n}\right)}\), \({\operatorname{Q}}_n:={\operatorname{Q}}^{\upsilon}_\gamma{\left(x_{:n},x_n^{\mathcal{A}}\right)}\). Both \(\pi^*_l\) and \(\pi^*_{l+1}\) are Blackwell optimal after \((l+1)T\), therefore

\[(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\left({\underset{\rho^*_l}{\operatorname{E}}{\left[r_n\right]}}-{\underset{\rho^0}{\operatorname{E}}{\left[r_n\right]}}\right)}} + \gamma^{(l+1)T}{\left({\underset{\rho^*_l}{\operatorname{E}}{\left[{\operatorname{V}}_{(l+1)T}\right]}}-{\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_{(l+1)T}\right]}}\right)}= \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n-{\operatorname{Q}}_n\right]}}}\]

Since \(\pi^0\) is 0-optimal, we get

\[(1-\gamma)\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\left({\underset{\rho^*_l}{\operatorname{E}}{\left[r_n\right]}}-{\underset{\rho^0}{\operatorname{E}}{\left[r_n\right]}}\right)}} + O{\left(\gamma^{(l+1)T}(1-\gamma)\right)}= \sum_{n=lT}^{(l+1)T-1}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n-{\operatorname{Q}}_n\right]}}}\]

Summing over \(l\), we get

\[(1-\gamma)\sum_{l=0}^\infty\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\left({\underset{\rho^*_l}{\operatorname{E}}{\left[r_n\right]}}-{\underset{\rho^0}{\operatorname{E}}{\left[r_n\right]}}\right)}} + O{\left(\frac{\gamma^T(1-\gamma)}{1-\gamma^{T}}\right)}= \sum_{n=0}^{\infty}{\gamma^n {\underset{\rho^0}{\operatorname{E}}{\left[{\operatorname{V}}_n-{\operatorname{Q}}_n\right]}}}\]

Applying Proposition B.1 to the right hand side

\[(1-\gamma)\sum_{l=0}^\infty\sum_{n=lT}^{(l+1)T-1} {\gamma^n{\left({\underset{\rho^*_l}{\operatorname{E}}{\left[r_n\right]}}-{\underset{\rho^0}{\operatorname{E}}{\left[r_n\right]}}\right)}} + O{\left(\frac{\gamma^T(1-\gamma)}{1-\gamma^{T}}\right)}= {\operatorname{EU}}_{\upsilon}^{*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^0}(\gamma)\]

Proof of Theorem 1

Fix \(\gamma \in (0,1)\), \(\delta\in\left(0,N^{-1}\right)\) and \(T \in {\mathbb{N}}^+\). For each \(k \in [N]\), suppose \(\upsilon^k\) is an \({\mathcal{O}}\)-realization of \(M^k\) with state function \(S^k\) and denote \(\nu^k:=\bar{\mu}^k\left[\sigma^k\right]\). To avoid cumbersome notation, whenever \(M^k\) should appear a subscript, we will replace it by \(k\). Let \((\Omega,P \in \Delta\Omega)\) be a probability space. Let \(K: \Omega \rightarrow [N]\) be a random variable and the following be stochastic processes

\[{Z}_n,\tilde{{Z}}_n: \Omega \rightarrow \Delta[N]\]

\[{J}_n: \Omega \rightarrow [N]\]

\[\Psi_n: \Omega \rightarrow 2^{\mathcal{A}}\]

\[A_n: \Omega \rightarrow {\bar{{\mathcal{A}}}}\]

\[\Theta_n: \Omega \rightarrow {\bar{{\mathcal{O}}}}\]

We also define \(A\Theta_{:n}: \Omega \rightarrow {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\) by

\[A\Theta_{:n}:= A_0\Theta_0A_1\Theta_1 \ldots A_{n-1}\Theta_{n-1}\]

(The following conditions on \(A\) and \(\Theta\) imply that the range of the above is indeed in \({{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\).) Let \({\mathcal{D}}\) and \({\mathcal{D}}^{!k}\) be as in Proposition A.2 (we assume w.l.o.g. that \(\epsilon < \frac{1}{{\lvert {\mathcal{A}}\rvert}}\)). We construct \(\Omega\), \(K\), \({Z}\), \(\tilde{{Z}}\), \({J}\), \(\Psi\), \(A\) and \(\Theta\) s.t \(K\) is uniformly distributed and for any \(k \in [N]\), \(l \in {\mathbb{N}}\), \(m \in [T]\) and \(o \in {\mathcal{O}}\), denoting \(n = lT+m\)

\[\tilde{{Z}}_0(k)\equiv\frac{1}{N}\]

\[{Z}_{n}(k) = \frac{\tilde{{Z}}_{n}(k)[[\tilde{{Z}}_{n}(k) \geq \delta]] }{\sum_{j = 0}^{N-1}\tilde{{Z}}_{n}(j)[[\tilde{{Z}}_{n}(j) \geq \delta]]}\]

\[\Pr\left[{J}_{l} = k \mid Z_{lT}\right] = {Z}_{lT}\left(k\right)\]

\[\Psi_{n} = {\mathcal{A}}^1_{{J}_l}\left(S^{{J}_l}(A\Theta_{:n})\right)\]

\[\Pr\left[\Theta_{n} = o \mid A\Theta_{:n}\right] = \nu^K\left(o \mid A\Theta_{:n}\right)\]

\[A_n = {\mathcal{D}}\left(A\Theta_{:n}, \Psi_n\right)\]

\[\tilde{{Z}}_{n+1}(k)\sum_{j = 0}^{N-1} {Z}_n(j) [[A_n = {\mathcal{D}}^{!j}\left(A\Theta_{:n}, \Psi_n\right)]] \nu^j(\Theta_n \mid A\Theta_{:n}A_n)={Z}_{n}(k) [[A_n = {\mathcal{D}}^{!k}\left(A\Theta_{:n}, \Psi_n\right)]] \nu^k\left(\Theta_{n} \mid A\Theta_{:n}A_{n}\right)\]

Note that the last equation has the form of a Bayesian update which is allowed to be arbitrary when update is on “impossible” information.

We now construct the \({{\bar{{\mathcal{I}}}}}\)-policy \(\pi^*\) s.t. for any \(n \in {\mathbb{N}}\), \(h \in {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\) s.t. \(\Pr\left[A\Theta_{:n}=h\right] > 0\) and \(a \in {\bar{{\mathcal{A}}}}\)

\[\pi^*(a \mid h):=\Pr\left[A_n = a \mid A\Theta_{:n} = h\right]\]

That is, we perform Thompson sampling at time intervals of size \(T\), moderated by the delegation routine \({\mathcal{D}}\), and discard from our belief state hypotheses whose probability is below \(\delta\) and hypotheses sampling which resulted in recommending “unsafe” actions i.e. actions that \({\mathcal{D}}\) refused to perform.

In order to prove \(\pi^*\) has the desired property, we will define the stochastic processes \({Z}^!\), \(\tilde{{Z}}^!\), \({J}^!\), \(\Psi^!\), \(A^!\) and \(\Theta^!\), each process of the same type as its shriekless counterpart (thus \(\Omega\) is constructed to accommodate them). These processes are required to satisfy the following:

\[\tilde{{Z}}^!_0(k)\equiv\frac{1}{N}\]

\[{Z}_{n}^!(k) = \frac{\tilde{{Z}}^!_{n}(k)[[\tilde{{Z}}^!_{n}(k) \geq \delta]] }{\sum_{j = 0}^{N-1}\tilde{{Z}}^!_{n}(j)[[\tilde{{Z}}^!_{n}(j) \geq \delta]]}[[\tilde{{Z}}^!_{n}(K) \geq \delta]] + [[K = k]]\cdot [[\tilde{{Z}}^!_{n}(K) < \delta]]\]

\[\Pr\left[{J}^!_{l} = k \mid Z^!_{lT}\right] = {Z}^!_{lT}\left(k\right)\]

\[\Psi^!_{n} = {\mathcal{A}}^1_{{J}^!_l}\left(S^{{J}^!_l}(A\Theta^!_{:n})\right)\]

\[\Pr\left[\Theta^!_{n} = o \mid A\Theta^!_{:n}\right] = \nu^K\left(o \mid A\Theta^!_{:n}\right)\]

\[A^!_n = {\mathcal{D}}^{!K}\left(A\Theta^!_{:n}, \Psi^!_n\right)\]

\[\tilde{{Z}}^!_{n+1}(k)=\frac{{Z}^!_{n}(k) [[A^!_n = {\mathcal{D}}^{!k}\left(A\Theta^!_{:n}, \Psi^!_n\right)]] \nu^k\left(\Theta^!_{n} \mid A\Theta^!_{:n}A^!_{n}\right)}{\sum_{j = 0}^{N-1} {Z}^!_n(j) [[A^!_n = {\mathcal{D}}^{!j}\left(A\Theta^!_{:n}, \Psi^!_n\right)]] \nu^j(\Theta^!_n \mid A\Theta^!_{:n}A^!_n)}\]

For any \(k \in [N]\), we construct the \({{\bar{{\mathcal{I}}}}}\)-policy \(\pi^{?k}\) s.t. for any \(n \in {\mathbb{N}}\), \(h \in {{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\) s.t. \(\Pr\left[A\Theta^!_{:n}=h,\ K = k\right] > 0\) and \(a \in {\bar{{\mathcal{A}}}}\)

\[\pi^{?k}(a \mid h):=\Pr\left[A^!_n = a \mid A\Theta^!_{:n} = h,\ K = k\right]\]

Given any \({{\bar{{\mathcal{I}}}}}\)-policy \(\pi\) and \({\mathcal{I}}\)-policy \(\sigma\) we define \(\alpha_{\sigma\pi}: {({\mathcal{A}}\times {\mathcal{O}})^*}{\xrightarrow{\text{k}}}{{\overline{{\mathcal{A}}\times {\mathcal{O}}}}^*}\) by

\[\alpha_{\sigma\pi} (g \mid h) := [[h = \underline{g}]]C_h\prod_{n = 0}^{{\lvert h \rvert}-1} \sum_{a \in {\mathcal{A}}}\left([[g_n \in \bot a{\mathcal{O}}]] \pi\left(\bot \mid g_{:n}\right)\sigma\left(a \mid h_{:n}\right)+[[g_n \in a\bot{\mathcal{O}}]]\pi\left(a \mid g_{:n}\right)\right)\]

Here, \(C_h \in {\mathbb{R}}\) is a constant defined s.t. the probabilities sum to 1. We define the \({\mathcal{I}}\)-policy \(\left[\sigma\right]\underline{\pi}\) by

\[\left[\sigma\right]\underline{\pi}(a \mid h):=\Pr_{g \sim \alpha_{\sigma\pi}(h)}\left[\pi\left(g\right)=a \lor \left(\pi\left(g\right)=\bot \land \sigma(h)=a\right)\right]\]

Condition iii of Proposition A.2 and condition i of \(\epsilon\)-sanity for \(\sigma^k\) imply that for any \(h \in \operatorname{hdom}{\mu^k}\)

\[\operatorname{supp}{\left[\sigma^k\right]\underline{\pi}^{?k}(h)} \subseteq {\mathcal{A}}^0_k\left(S^k\left(h\right)\right)\]

This means we can apply Proposition A.5 and get

\[{\operatorname{EU}}^*_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{?k}}_{\bar{\upsilon}^k[\sigma^k]}(\gamma) \leq (1-\gamma)\sum_{n=0}^\infty \sum_{m=0}^{T-1} \gamma^{nT+m}\left({\underset{x\sim\mu^k\bowtie\pi^{*k}_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]-{\underset{x\sim\nu^k\bowtie\pi^{?k}}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\right) + O\left(\frac{1-\gamma}{1-\gamma^T}\right)\]

Here, the \({\mathcal{I}}\)-policy \(\pi^{*k}_n\) is defined as \(\pi^*_n\) in Proposition A.5. We also define the \({{\bar{{\mathcal{I}}}}}\)-policies \(\pi^{!k}_n\) and \(\pi^{!!k}_n\) by

\[\pi^{!k}_n(a \mid h):=\begin{cases} \pi^{?k}(a \mid h) \text{ if } {\lvert h \rvert} < nT \\ \Pr\left[A^!_{{\lvert h \rvert}} = a \mid A\Theta^!_{:{{\lvert h \rvert}}} = h,\ K = k,\ {J}^!_n = k\right] \text{ otherwise} \end{cases}\]

\[\pi^{!!k}_n(a \mid h):=\begin{cases} \pi^{?k}(a \mid h) \text{ if } {\lvert h \rvert} < nT \\ \pi^{!k}_n(a \mid h) + \pi^{!k}_n(\bot \mid h) \cdot \pi^{*k}_n\left(a \mid \underline{h}\right) \text{ if } {\lvert h \rvert} \geq nT \text{ and } a \ne \bot \\ 0 \text{ if } {\lvert h \rvert} \geq nT \text{ and } a = \bot \end{cases}\]

Denote \(\rho^{*k}_n:=\mu^k\bowtie\pi^{*k}_n\), \(\rho^{!!k}_n:=\nu^k\bowtie\pi^{!!k}_n\), \(\rho^{!k}_n:=\nu^k\bowtie\pi^{!k}_n\), \(\rho^{?k}:=\nu^k\bowtie\pi^{?k}\), \(R^{?k}={\operatorname{EU}}^*_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{?k}}_{\bar{\upsilon}^k[\sigma^k]}(\gamma)\). For each \(n \in {\mathbb{N}}\), denote

\[{\operatorname{EU}}_n^{*k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{*k}_n}{\operatorname{E}}}\left[r\left(x_{:nT+m}\right)\right]\]

\[{\operatorname{EU}}_n^{!!k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{!!k}_n}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\]

\[{\operatorname{EU}}_n^{!k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{!k}_n}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\]

\[{\operatorname{EU}}_n^{?k}(\gamma):=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m}{\underset{x\sim\rho^{?k}}{\operatorname{E}}}\left[r\left(\underline{x}_{:nT+m}\right)\right]\]

We have

\[R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{*k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + O\left(\frac{1-\gamma}{1-\gamma^T}\right)\]

\[R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{*k}_n(\gamma)-{\operatorname{EU}}^{!!k}_n(\gamma)+{\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma)+{\operatorname{EU}}^{!k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + O\left(\frac{1-\gamma}{1-\gamma^T}\right)\]

Condition iv of Proposition A.2 and condition ii of \(\epsilon\)-sanity for \(\sigma^k\) imply that, given \(h \in \operatorname{hdom}{\nu^k}\) s.t. \({\lvert h \rvert} \geq nT\)

\[\operatorname{supp}{\pi^{!k}_n(h)} \subseteq {\mathcal{A}}^1_k\left(S^k\left(\underline{h}\right)\right) \cup \{\bot\}\]

\[\operatorname{supp}{\pi^{!!k}_n(h)} \subseteq {\mathcal{A}}^1_k\left(S^k\left(\underline{h}\right)\right)\]

Therefore, we can apply Proposition A.4 to the terms \({\operatorname{EU}}^{*k}_n(\gamma)-{\operatorname{EU}}^{!!k}_n(\gamma)\) and get

\[R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma)+{\operatorname{EU}}^{!k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + O\left(\frac{1-\gamma}{1-\gamma^T}+T\frac{(1-\gamma)^2}{1-\gamma^T}\right)\]

We have

\[{\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma) \leq \Pr_{x\sim\rho^{!k}_n}\left[\exists m \in [T]: x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right]\]

Since \({Z}_{nT}^{!}(K) \geq \eta\), it follows that

\[{\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma) \leq \frac{1}{\eta}\Pr_{x\sim\rho^{?k}}\left[\exists m \in [T]: x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right] \leq \frac{1}{\eta}{\underset{x\sim\rho^{?k}}{\operatorname{E}}{\left[{\lvert {\left\{m \in [T] \mid x_{nT+m} \in \bot{\bar{{\mathcal{O}}}}\right\}} \rvert}\right]}}\]

\[\sum_{n=0}^\infty {\left({\operatorname{EU}}^{!!k}_n(\gamma)-{\operatorname{EU}}^{!k}_n(\gamma)\right)} \leq \frac{1}{\eta}{\underset{x\sim\rho^{?k}}{\operatorname{E}}{\left[{\lvert {\left\{n \in {\mathbb{N}}\mid x_n \in \bot{\bar{{\mathcal{O}}}}\right\}} \rvert}\right]}}\]

Using condition i of Proposition A.2, we conclude

\[R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} \left({\operatorname{EU}}^{!k}_n(\gamma)-{\operatorname{EU}}^{?k}_n(\gamma)\right) + O\left(\frac{1-\gamma^T}{\delta^2}+\frac{1-\gamma}{1-\gamma^T}+T\frac{(1-\gamma)^2}{1-\gamma^T}\right)\]

We denote

\[\xi(\gamma,T,\delta):=\frac{1-\gamma^T}{\delta^2}+\frac{1-\gamma}{1-\gamma^T}+T\frac{(1-\gamma)^2}{1-\gamma^T}\]

Define the random variables \({{\{U^!_n : \Omega \rightarrow [0,1]\}_{n \in {\mathbb{N}}}}}\) by

\[U^!_n:=\frac{1-\gamma}{1-\gamma^T}\sum_{m=0}^{T-1} \gamma^{m} r\left(A\Theta^!_{:nT+m}\right)\]

Averaging the previous inequality over \(k\), we get

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} \leq (1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[{\underset{}{\operatorname{E}}}\left[U^!_n \mid {J}^!_n = K,\ Z^!_{nT}\right]-{\underset{}{\operatorname{E}}}\left[U^!_n \mid Z^!_{nT}\right]\right] + O\left(\xi(\gamma,T,\delta)\right)\]

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{(1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[\left({\underset{}{\operatorname{E}}}\left[U^!_n \mid {J}^!_n = K,\ Z^!_{nT}\right]-{\underset{}{\operatorname{E}}}\left[U^!_n \mid Z^!_{nT}\right]\right)^2\right]} + O\left(\xi(\gamma,T,\delta)\right)\]

We apply Proposition A.1 to each term in the sum over \(n\).

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{(1-\gamma^T)\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[\frac{1}{2\delta}{\underset{}{\operatorname{I}}}\left[K;{J}^!_n,U^!_n \mid Z^!_{nT}\right]\right]} + O\left(\xi(\gamma,T,\delta)\right)\]

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{\frac{1-\gamma^T}{2\delta}\sum_{n=0}^\infty \gamma^{nT} {\underset{}{\operatorname{E}}}\left[{\operatorname{H}}\Big(Z^!_{nT}\Big)-{\operatorname{H}}\left(Z^!_{(n+1)T}\right)\right]} + O\left(\xi(\gamma,T,\delta)\right)\]

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = \sqrt{\frac{1-\gamma^T}{2\delta}\ln N} + O\left(\frac{1-\gamma^T}{\delta^2}+\frac{1-\gamma}{1-\gamma^T}+T\frac{(1-\gamma)^2}{1-\gamma^T}\right)\]

\[\frac{1}{N}\sum_{k=0}^{N-1}R^{?k} = O\left(\sqrt{\frac{1-\gamma^T}{\delta}}+\frac{1-\gamma^T}{\delta^2}+\frac{1-\gamma}{1-\gamma^T}+T\frac{(1-\gamma)^2}{1-\gamma^T}\right)\]

Condition ii of Proposition A.2 implies that

\[{\operatorname{d}_{\text{tv}}}\left(\frac{1}{N}\sum_{k=0}^{N-1}{\bar{\nu}^k\left[\sigma^k\right]\bowtie\pi^*},\ \frac{1}{N}\sum_{k=0}^{N-1}{\bar{\nu}^k\left[\sigma^k\right]\bowtie\pi^{?k}}\right) \leq 2(N-1)\delta\]

Here, the factor of 2 comes from the difference between the equations for \(Z_n\) and \(Z^!_n\) (we can construct and intermediate policy between \(\pi^*\) and \(\pi^{?k}\) and use the triangle inequality for \({\operatorname{d}_{\text{tv}}}\)). We conclude

\[{\operatorname{EU}}^*_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{*}}_{\bar{\upsilon}^k[\sigma^k]}(\gamma) = O\left(\delta+\sqrt{\frac{1-\gamma^T}{\delta}}+\frac{1-\gamma^T}{\delta^2}+\frac{1-\gamma}{1-\gamma^T}+T\frac{(1-\gamma)^2}{1-\gamma^T}\right)\]

Now we set \(\delta:=\left(1-\gamma\right)^{1/4}\), \(T:={\lfloor \left(1-\gamma\right)^{-1/4} \rfloor}\). Since \(\gamma^T \rightarrow 1\) as \(\gamma \rightarrow 1\), we can use the approximation \(1-\gamma^T \approx T(1-\gamma) \approx (1-\gamma)^{3/4}\). We get

\[{\operatorname{EU}}^*_{\upsilon^k}(\gamma)-{\operatorname{EU}}^{\pi^{*}}_{\bar{\upsilon}^k[\sigma^k]}(\gamma) = O\left((1-\gamma)^{1/4}\right)\]

Proof of Corollary 1

Follows immediately from Theorem 1 and Proposition B.2.

Definition A.2

Consider an MDP \(M\) and \({\mathcal{U}}\subseteq {\mathcal{S}}_M\). The MDP \(M/{{\mathcal{U}}^\complement}\) is defined by by

\[{\mathcal{S}}_{M / {{\mathcal{U}}^\complement}} := {\mathcal{U}}\sqcup \{\bot\}\]

\[{\mathcal{A}}_{M / {{\mathcal{U}}^\complement}} := {\mathcal{A}}_M\]

\[{\mathcal{T}}_{M / {{\mathcal{U}}^\complement}}(t \mid s, a) := \begin{cases} {\mathcal{T}}_M(t \mid s,a) \text{ if } t,s \in {\mathcal{U}}\\ {\mathcal{T}}_M({\mathcal{S}}_M \setminus {\mathcal{U}}\mid s,a) \text{ if } t = \bot,\ s \in {\mathcal{U}}\\ 0 \text { if } t\in{\mathcal{U}},\ s = \bot \\ 1 \text{ if } t=s=\bot\end{cases}\]

\[{\mathcal{R}}_{M / {{\mathcal{U}}^\complement}}(s) := \begin{cases} {\mathcal{R}}_M(s) \text{ if } s \in {\mathcal{U}}\\ 0 \text{ if } s=\bot \end{cases}\]

Proposition A.6

Consider an MDP \(M\) and some \({\mathcal{U}}\subseteq{\mathcal{S}}_M\). Suppose that for every \(s \in {\mathcal{U}}\) there is \(a \in {\mathcal{A}}_M^k(s)\) s.t. \({\mathcal{T}}_M({\mathcal{U}}\mid s, a) = 1\). Then, for every \(s \in {\mathcal{S}}_M\) and \(j \in [k]\)

\[{\operatorname{V}}_M^j(s) = {\operatorname{V}}_{M/{{\mathcal{U}}^\complement}}^j(s)\]

Moreover, if \(a \in {\mathcal{A}}\) is s.t. \({\mathcal{T}}_M({\mathcal{U}}\mid s,a) = 1\) then

\[{\operatorname{Q}}_M^j(s,a) = {\operatorname{Q}}_{M/{{\mathcal{U}}^\complement}}^j(s,a)\]

Proof of Proposition A.6

It is obvious that for any \(s \in {\mathcal{U}}\) and \(\gamma\in(0,1)\), \({\operatorname{V}}_{M/{{\mathcal{U}}^\complement}}(s,\gamma) \leq {\operatorname{V}}_M(s,\gamma)\). Now consider any \(\pi: {\mathcal{S}}_M \rightarrow {\mathcal{A}}\) s.t. for any \(s \in {\mathcal{U}}\), \(\pi(s) \in {\mathcal{A}}_M^k(s)\) and \({\mathcal{T}}_{M\pi}({\mathcal{U}}\mid s) = 1\). Fix \(s \in {\mathcal{U}}\). For any \(n \in {\mathbb{N}}\), \(\operatorname{supp}{{\mathcal{T}}_{M\pi}^n(s)} \subseteq {\mathcal{U}}\) and therefore

\[\sum_{n=0}^\infty{\gamma^n {\underset{{\mathcal{T}}_{M/{{\mathcal{U}}^\complement}\pi}^n(s)}{\operatorname{E}}}{\left[{\mathcal{R}}_{M/{{\mathcal{U}}^\complement}}\right]}} = \sum_{n=0}^\infty{\gamma^n {\underset{{\mathcal{T}}_{M\pi}^n(s)}{\operatorname{E}}}\Big[{\mathcal{R}}_M\Big]}\]

On the other hand, \(\pi\) is \(k\)-optimal and therefore

\[\sum_{n=0}^\infty{\gamma^n {\underset{{\mathcal{T}}_{M\pi}^n(s)}{\operatorname{E}}}\Big[{\mathcal{R}}_M\Big]} = {\operatorname{V}}_M(s) - O{\left((1-\gamma)^k\right)}\]

We conclude that

\[{\operatorname{V}}_{M/{{\mathcal{U}}^\complement}}(s,\gamma) \geq \sum_{n=0}^\infty{\gamma^n {\underset{{\mathcal{T}}_{M/{{\mathcal{U}}^\complement}\pi}^n(s)}{\operatorname{E}}}{\left[{\mathcal{R}}_{M/{{\mathcal{U}}^\complement}}\right]}} = {\operatorname{V}}_M(s,\gamma) - O{\left((1-\gamma)^k\right)}\]

\[{\lvert {\operatorname{V}}_{M/{{\mathcal{U}}^\complement}}(s,\gamma) - {\operatorname{V}}_M(s,\gamma) \rvert} = O{\left((1-\gamma)^k\right)}\]

This implies the equation for \({\operatorname{V}}^j\) and the equation for \({\operatorname{Q}}^j\) is implied in turn.

Definition A.4

Fix an interface \({\mathcal{I}}=({\mathcal{A}},{\mathcal{O}})\). Consider an \({\mathcal{I}}\)-universe \(\upsilon=(\mu,r)\) which is an \({\mathcal{O}}\)-realization of \(M\) with state function \(S\), and \({\mathcal{U}}\subseteq {\mathcal{S}}_M\) s.t. \(S({\boldsymbol{\lambda}}) \in {\mathcal{U}}\). Denote \({\mathcal{O}}':={\mathcal{O}}\sqcup\{\bot\}\) and \({\mathcal{I}}':=({\mathcal{A}},{\mathcal{O}}')\). The \({\mathcal{I}}'\)-universe \(\upsilon/_S{{\mathcal{U}}^\complement}=(\mu/_S{{\mathcal{U}}^\complement},r')\) is defined by

\[\mu/_S{{\mathcal{U}}^\complement}(o \mid ha) = \begin{cases} \mu(o \mid ha) \text{ if } hao \in {({\mathcal{A}}\times {\mathcal{O}})^*}\text{ and } S(hao)\in{\mathcal{U}}\\ 0 \text{ if } hao \in {({\mathcal{A}}\times {\mathcal{O}})^*}\text{ and } S(hao)\not\in{\mathcal{U}}\\ {\mathcal{T}}_{M/{{\mathcal{U}}^\complement}}{\left(\bot \mid S(h),a\right)} \text{ if } h \in {({\mathcal{A}}\times {\mathcal{O}})^*}\text{ and } o = \bot \\ 0 \text{ if } h \not\in {({\mathcal{A}}\times {\mathcal{O}})^*}\text{ and } o \ne \bot \\ 1 \text{ if } h\not\in{({\mathcal{A}}\times {\mathcal{O}})^*}\text{ and } o = \bot \end{cases}\]

\[r'(h):=\begin{cases} r(h) \text{ if } h \in {({\mathcal{A}}\times {\mathcal{O}})^*}\\ 0 \text{ otherwise} \end{cases}\]

It is easy to see that \(\upsilon/_S {{\mathcal{U}}^\complement}\) is an \({\mathcal{O}}'\)-realization of \(M/{{\mathcal{U}}^\complement}\) with state function \(S'\) defined by

\[S'(h):=\begin{cases} S(h) \text{ if } h \in {({\mathcal{A}}\times {\mathcal{O}})^*}\\ \bot \text{ otherwise} \end{cases}\]

Proposition A.7

Fix an interface \({\mathcal{I}}=({\mathcal{A}},{\mathcal{O}})\). Consider \(\epsilon > 0\) and \(\upsilon=(\mu,r)\) an \({\mathcal{O}}\)-realization of \(M\) with state function \(S\). Suppose that \(\pi\) is a locally \(\epsilon\)-sane policy for \(\mu\) and \({\mathcal{U}}\subseteq {\mathcal{S}}_M\) is the corresponding set of uncorrupt states. Let \(\pi'\) be any \({\mathcal{I}}'\)-policy s.t. for any \(h \in {({\mathcal{A}}\times {\mathcal{O}})^*}\), \(\pi(h) = \pi(h')\). Then, \(\pi'\) is \(\epsilon\)-sane for \(\upsilon/_S {{\mathcal{U}}^\complement}\).

Proof of Proposition A.7

Without loss of generality, assumes all states of \(M\) are reachable from \(S({\boldsymbol{\lambda}})\) (otherwise, \(\upsilon\) is an \({\mathcal{O}}\)-realization of the MDP we get by discarding the unreachable states). Consider any \(h \in \operatorname{hdom}{\mu/_S{{\mathcal{U}}^\complement}}\). If \(S'(h)=\bot\) then conditions i+ii of Definition 3 are trivial. Otherwise, we have \(h \in \operatorname{hdom}{\mu}\) and \(s:=S(h)\in{\mathcal{U}}\). We apply Proposition A.6 for \(k=2\) (by condition iv of Definition 4) and get that: conditions i+ii of Definition 4 imply condition i of Definition 3 and conditions i+iii of Definition 4 imply condition ii of Definition 3.

Proof of Corollary 2

For every \(k \in [N]\), denote \(\psi^k:=\upsilon^k/_{S^k} {\mathcal{U}}^{k\complement}\). By Proposition A.7, \(\sigma^k\) is \(\epsilon\)-sane for \(\psi^k\) (where we abuse notation by arbitrarily extending \(\sigma^k\) to an \({\mathcal{I}}'\)-policy). Moreover, it is easy to see that all the \(\psi^k\) share the same reward function \(r'\). Apply Theorem 1 to \({\mathcal{H}}':=\{\psi^k\}_{k \in [N]}\). We get \(\pi^*\) s.t. for every \(k \in [N]\) and \(\gamma\in(0,1)\)

\[{\operatorname{EU}}_{\psi^k}^*(\gamma) - {\operatorname{EU}}_{\bar{\psi}^k\left[\sigma^k\right]}^{\pi^*}(\gamma) = O\left((1-\gamma)^{1/4}\right)\]

Obviously \({\operatorname{EU}}_{\bar{\psi}^k\left[\sigma^k\right]}^{\pi^*}(\gamma) \leq {\operatorname{EU}}_{\bar{\upsilon}^k\left[\sigma^k\right]}^{\pi^*}(\gamma)\) and therefore

\[{\operatorname{EU}}_{\psi^k}^*(\gamma) - {\operatorname{EU}}_{\bar{\upsilon}^k\left[\sigma^k\right]}^{\pi^*}(\gamma) \leq O\left((1-\gamma)^{1/4}\right)\]

On the other hand, by Proposition A.6

\[{\operatorname{EU}}_{\psi^k}^*(\gamma) = {\operatorname{V}}_{M^k/{\mathcal{U}}^{k\complement}}{\left(S({\boldsymbol{\lambda}}),\gamma\right)} = {\operatorname{V}}_{M^k}{\left(S({\boldsymbol{\lambda}}),\gamma\right)} - O{\left(1-\gamma\right)} = {\operatorname{EU}}_{\upsilon^k}^*(\gamma) - O{\left(1-\gamma\right)}\]

We conclude

\[{\operatorname{EU}}_{\upsilon^k}^*(\gamma) - O{\left(1-\gamma\right)} - {\operatorname{EU}}_{\bar{\upsilon}^k\left[\sigma^k\right]}^{\pi^*}(\gamma) \leq O\left((1-\gamma)^{1/4}\right)\]

\[{\operatorname{EU}}_{\upsilon^k}^*(\gamma) - {\operatorname{EU}}_{\bar{\upsilon}^k\left[\sigma^k\right]}^{\pi^*}(\gamma) = O\left((1-\gamma)^{1/4}\right)\]

Proof of Corollary 3

Follows immediately from Corollary 2 and Proposition B.2.

Appendix B

The following appeared before as Proposition A.5:

Proposition B.1

Given \(\alpha \in {\mathcal{A}}\times {\mathcal{O}}\), \(\alpha^{\mathcal{A}}\in {\mathcal{A}}\) and \(\alpha^{\mathcal{O}}\in {\mathcal{O}}\) are defined s.t. \(\alpha = {\left(\alpha^{\mathcal{A}},\alpha^{\mathcal{O}}\right)}\).

Consider a universe \(\upsilon=(\mu,r)\), a policy \(\pi\) and \(\gamma \in (0,1)\). Then,

\[{\operatorname{EU}}_{\upsilon}^{*}(\gamma) - {\operatorname{EU}}_{\upsilon}^{\pi^0}(\gamma) = \sum_{n=0}^\infty{\gamma^n {\underset{x\sim\mu\bowtie\pi}{\operatorname{E}}{\left[{\operatorname{V}}^{\upsilon}_\gamma\Big(x_{:n}\Big)-{\operatorname{Q}}^{\upsilon}_\gamma{\left(x_{:n},x_n^{\mathcal{A}}\right)}\right]}}}\]


The followed appeared as Proposition 2:

Proposition B.2

Fix an interface \({\mathcal{I}}\). Let \({\mathcal{H}}\) be a countable set of meta-universes s.t. any finite \(\mathcal{G} \subseteq {\mathcal{H}}\) is learnable. Then, \({\mathcal{H}}\) is learnable.



NEW LINKS

NEW POSTS

NEW DISCUSSION POSTS

RECENT COMMENTS

Unfortunately, it's not just
by Vadim Kosoy on Catastrophe Mitigation Using DRL | 0 likes

>We can solve the problem in
by Wei Dai on The Happy Dance Problem | 1 like

Maybe it's just my browser,
by Gordon Worley III on Catastrophe Mitigation Using DRL | 2 likes

At present, I think the main
by Abram Demski on Looking for Recommendations RE UDT vs. bounded com... | 0 likes

In the first round I'm
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

Fine with it being shared
by Paul Christiano on Funding opportunity for AI alignment research | 0 likes

I think the point I was
by Abram Demski on Predictable Exploration | 0 likes

(also x-posted from
by Sören Mindermann on The Three Levels of Goodhart's Curse | 0 likes

(x-posted from Arbital ==>
by Sören Mindermann on The Three Levels of Goodhart's Curse | 0 likes

>If the other players can see
by Stuart Armstrong on Predictable Exploration | 0 likes

Thinking about this more, I
by Abram Demski on Predictable Exploration | 0 likes

> So I wound up with
by Abram Demski on Predictable Exploration | 0 likes

Hm, I got the same result
by Alex Appel on Predictable Exploration | 1 like

Paul - how widely do you want
by David Krueger on Funding opportunity for AI alignment research | 0 likes

I agree, my intuition is that
by Abram Demski on Smoking Lesion Steelman III: Revenge of the Tickle... | 0 likes

RSS

Privacy & Terms