Implementing CDT with optimal predictor systems
post by Vadim Kosoy 1066 days ago | Patrick LaVictoire likes this | 2 comments

We consider transparent games between bounded CDT agents (“transparent” meaning each player has a model of the other players). The agents compute the expected utility of each possible action by executing an optimal predictor of a causal counterfactual, i.e. an optimal predictor for a function that evaluates the other players and computes the utility for the selected action. Since the agents simultaneously attempt to predict each other, the optimal predictors form an optimal predictor system for the reflective system comprised by the causal counterfactuals of all agents. We show that for strict maximizers, the resulting outcome is a bounded analogue of an approximate Nash equilibrium, i.e. a strategy which is an optimal response within certain resource constraints up to an asymptotically small error. For “thermalizers” (agents that choose an action with probability proportional to $$2^{\frac{u}{T}}$$), we get a similar result with expected utility $$\operatorname{E}_s[u]$$ replaced by “free utility” $$\operatorname{E}_s[u]+T \operatorname{H}(s)$$. Thus, such optimal predictor systems behave like bounded counterparts of reflective oracles.

## Preliminaries

The proofs for this section are given in Appendix A.

We redefine $$\mathcal{E}_{2(ll,\phi)}$$ and $$\mathcal{E}_{2(ll)}$$ to be somewhat smaller proto-error spaces which nevertheless yield the same existence theorems as before. This is thanks to Lemma A.1.

# Construction 1

Given $$\phi \in \Phi$$, denote $$\mathcal{E}_{2(ll,\phi)}$$ the set of bounded functions $$\delta: {\mathbb{N}}^2 \rightarrow {\mathbb{R}}^{\geq 0}$$ s.t.

$\forall \psi \in \Phi: \psi \leq \phi \implies E_{\lambda_\psi^k}[\delta(k,j)] = O(\psi(k)^{-1})$

Denote

$\mathcal{E}_{2(ll)} := \bigcap_{\phi \in \Phi} \mathcal{E}_{2(ll,\phi)}$

# Proposition 1

If $$\phi \in \Phi$$ is s.t. $$\exists n: {\liminf_{k \rightarrow \infty}} 2^{-k^n} \phi(k) = 0$$, $$O(\phi^{-1})$$ is a proto-error space.

For any $$\phi \in \Phi$$, $$\mathcal{E}_{2(ll,\phi)}$$ is an ample proto-error space. When $$\phi$$ is non-decreasing, $$\mathcal{E}_{2(ll,\phi)}$$ is stable.

$$\mathcal{E}_{2(ll)}$$ is a stable ample proto-error space.

# Notation

We denote $$O(\phi^{-\frac{1}{\infty}}):=O(\phi^{-1})^{\frac{1}{\infty}}$$. We allow the same abuse of notation for this symbol as for usual big $$O$$ notation.

For any $$(poly,rlog)$$-bischeme $$\hat{Q}=(Q,r_Q,\sigma_Q)$$ we use $$\hat{\sigma}_Q^{kj}$$ to denote $$U^{r_Q(k,j)} \times \sigma_Q^{kj}$$.

For reflective systems $$\mathcal{R}=(\Sigma,f,\mu)$$ we write indices in $$\Sigma$$ as superscripts rather than subscripts as before.

Given $$\pi \in \Pi_\Sigma$$, we denote $$U(\pi)^{nkj}:=U^{\pi_2^{nkj}}$$ and $$\pi^{nkj}(x,y):=\beta(ev(\pi_1^{nkj};x,y))$$.

## Results

The proofs for this section are given in Appendix B.

We start by showing that choosing an action by maximizing over optimally predicted utility is an optimal strategy in some sense.

# Theorem 1

Fix a proto-error space $$\mathcal{E}$$. Consider $$\mathcal{A}$$ a finite set, $$\mu$$ a word ensemble and $$\{f_a: \operatorname{supp}\mu \rightarrow [0,1]\}_{a \in \mathcal{A}}$$. Suppose $$\{\hat{P}_a\}_{a \in \mathcal{A}}$$ are $$\mathcal{E}(poly,rlog)$$-optimal predictors for $$f$$. Assume that $$\sigma_{P_a}$$ and $$r_{P_a}$$ don’t depend on $$a$$ (this doesn’t limit generality since we can replace $$\sigma_{P_a}$$ by $$\prod_{b \in \mathcal{A}} \sigma_{P_b}$$ and $$r_{P_a}$$ by a polynomial $$r_P \geq \max_{b \in \mathcal{A}} r_{P_b}$$). Consider $$\hat{A}$$ an $$\mathcal{A}$$-valued $$(poly,rlog)$$-bischeme. Define $$\hat{M}$$ another $$\mathcal{A}$$-valued $$(poly,rlog)$$-bischeme where $$\hat{M}^{kj}(x)$$ is computed by sampling $$y$$ from $$\hat{\sigma}_P^{kj}$$ and choosing arbitrarily from $$a \in \mathcal{A}$$ s.t. $$\hat{P}_a^{kj}(x,y)$$ is maximal. Then, there is $$\delta \in \mathcal{E}^{\frac{1}{2}}$$ s.t.

$\operatorname{E}_{\mu^k \times \hat{\sigma}_M^{kj}}[f_{\hat{M}^{kj}}] \geq \operatorname{E}_{\mu^k \times \hat{\sigma}_A^{kj}}[f_{\hat{A}^{kj}}] - \delta(k,j)$

# Definition 1

Given suitable sets $$X,Y$$ and $$\phi \in \Phi$$, a $$\phi(poly,rlog)$$-scheme of signature $$X \rightarrow Y$$ is $$\hat{A}=(A:{\mathbb{N}}\times X \times {{\{ 0, 1 \}^*}}^2 \rightarrow Y, r_A: {\mathbb{N}}\rightarrow {\mathbb{N}}, \sigma_A: {\mathbb{N}}\times {{\{ 0, 1 \}^*}}\rightarrow [0,1])$$ s.t.

1. The runtime of $$A^k$$ is bounded by $$p(t_\phi(k))$$ for some polynomial $$p$$.

2. $$|r_A(k)| \leq p'(t_\phi(k))$$ for some polynomial $$p'$$.

3. $$\sigma_A^k$$ is a probability measure on $${{\{ 0, 1 \}^*}}$$ and there is $$c > 0$$ s.t. $$\operatorname{supp}\sigma_A^k \subseteq {{\{ 0, 1 \}^{\leq{c {\lfloor log(k+2)^{\phi(k)} \rfloor}}}}}$$.

A $$\phi(poly,rlog)$$-scheme of signature $${{\{ 0, 1 \}^*}}\rightarrow Y$$ is also called a $$Y$$-valued $$\phi(poly,rlog)$$-scheme.

As usual, we sometimes use $$\hat{A}^{kj}$$ as implicit notation in which $$A$$ receives a value sampled from $$U^{r_A(k)}$$ for the second argument and/or a value sampled from $$\sigma_A^k$$ for the third argument.

# Corollary 1

Fix $$\phi \in \Phi$$. Consider $$\mathcal{A}$$ a finite set, $$\mu$$ a word ensemble and $$\{f_a: \operatorname{supp}\mu \rightarrow [0,1]\}_{a \in \mathcal{A}}$$. Suppose $$\{\hat{P}_a\}_{a \in \mathcal{A}}$$ are $$\mathcal{E}_{2(ll,\phi)}(poly,rlog)$$-optimal predictors for $$f$$. Consider $$\epsilon \in (0,1)$$ and $$\hat{A}$$ an $$\mathcal{A}$$-valued $$\phi^{1-\epsilon}(poly,rlog)$$-scheme. Define $$\hat{M}$$ as in Theorem 1. Then

$\operatorname{E}_{\mu^k \times \lambda_\phi^k}[\operatorname{E}_{\hat{\sigma}_M^{kj}}[f_{\hat{M}^{kj}}]] \geq \operatorname{E}_{\mu^k \times \hat{\sigma}_A^{k}}[f_{\hat{A}^{k}}] - O(\phi^{-\frac{1}{\infty}})$

We now introduce the game theoretical setting.

# Definition 2

A distributional game is a quadruple $$G=(\mu,\mathcal{P},\mathcal{A},u)$$ where $$\mu$$ is a word ensemble, $$\mathcal{P}$$ is a finite set representing players, $$\{\mathcal{A}^n\}_{n \in \mathcal{P}}$$ are finite sets representing the possible actions of each player and $$\{u^n: \operatorname{supp}\mu \times \prod_{n \in \mathcal{P}} \mathcal{A}^n \rightarrow [0,1]\}_{n \in \mathcal{P}}$$ represent the utility functions.

For each $$x \in \operatorname{supp}\mu$$, $$G$$ defines an ordinary game in normal form. We think of $$G$$ as a single game where the strategies are (some class of) functions $$s: \operatorname{supp}\mu \rightarrow \mathcal{A}^n$$ and the payoffs are $$\operatorname{E}_\mu[u^n(\prod_{n \in \mathcal{P}} s^n(x))]$$.

# Construction 2

Consider $$G=(\mu,\mathcal{P},\mathcal{A},u)$$ a distributional game and $$\{\phi^n \in \Phi\}_{n \in \mathcal{P}}$$. We construct the reflective system $$\mathcal{R}_G^\phi=(\mathcal{P},u_\phi,\mu_G)$$.

$$\mu_G$$ is defined by

$\mu_G^{nk}(k,x,a)=\frac{\mu^k(x) \chi_{\mathcal{A}^n}(a)}{\#\mathcal{A}^n}$

Define $$M_G^n: {\mathbb{N}}^2 \times \operatorname{supp}\mu \times {{\{ 0, 1 \}^*}}\times \Pi_{\mathcal{P}} \rightarrow \wp(\mathcal{A}^n)$$ by

$M_G^{nkj}(x,y,\pi):={\underset{a \in \mathcal{A}}{\operatorname{arg\,max}}\,} \pi^{nkj}((k,x,a),y)$

Denote $$\Delta_G^n$$ to be the space of mixed actions of player $$n$$ i.e. the space of probability measures on $$\mathcal{A}^n$$.

Define $$s_{G,\phi}^n: {\mathbb{N}}\times \operatorname{supp}\mu \times \Pi_{\mathcal{P}} \rightarrow \Delta_G^n$$ by

$\operatorname{Pr}_{s_{G,\phi}^{nk}(x,\pi)}[a]:=\operatorname{E}_{\lambda_{\phi^n}^k}[\operatorname{E}_{U(\pi)^{nkj}}[\frac{\chi_{M_G^{nkj}(x,y,\pi)}}{\#M_G^{nkj}(x,y,\pi)}]]$

Define $$s_{G,\phi}: {\mathbb{N}}\times \operatorname{supp}\mu \times \Pi_{\mathcal{P}} \rightarrow \prod_{n \in \mathcal{P}} \Delta_G^n$$ by

$s_{G,\phi}^k(x,\pi) := \prod_{n \in \mathcal{P}} s_{G,\phi}^{nk}(x,\pi)$

Define $$s_{G,\phi}^{\bar{n}}: {\mathbb{N}}\times \operatorname{supp}\mu \times \Pi_{\mathcal{P}} \rightarrow \prod_{m \in \mathcal{P} \setminus n} \Delta_G^m$$ by

$s_{G,\phi}^{\bar{n}k}(x,\hat{Q}):=\prod_{m \in \mathcal{P} \setminus n} s_{G,\phi}^{mk}(x,\hat{Q}^m)$

Finally, define $$u_\phi$$ by

$u_\phi^{nk}(x,a,\pi):=\operatorname{E}_{a \times s_{G,\phi}^{\bar{n}k}(x,\pi)}[u^n]$

For any $$(poly,rlog)$$-predictor $$\hat{Q}$$ we will use the notation $$s_{G,\phi}^{nk}(x,\hat{Q}) \in \Delta_G^n$$ to mean

$s_{G,\phi}^{nk}(x,\hat{Q}):=\operatorname{E}_{\sigma_Q}[s_{G,\phi}^{nk}(x,\hat{Q}[z])]$

For a family $$\{\hat{Q}^n\}_{n \in \mathcal{P}}$$ of $$(poly,rlog)$$-predictors, we will use the notations $$s_{G,\phi}^k(x,\hat{Q}) \in \prod_{n \in \mathcal{P}} \Delta_G^n$$ and $$s_{G,\phi}^{\bar{n}k}(x,\hat{Q}) \in \prod_{m \in \mathcal{P} \setminus n} \Delta_G^m$$ to mean

$s_G^k(x,\hat{Q}):=\prod_{n \in \mathcal{P}} s_{G,\phi}^{nk}(x,\hat{Q}^n)$

$s_G^{\bar{n}k}(x,\hat{Q}):=\prod_{m \in \mathcal{P} \setminus n} s_{G,\phi}^{mk}(x,\hat{Q}^m)$

# Corollary 2

Fix $$G=(\mu,\mathcal{P},\mathcal{A},u)$$ a distributional game, $$\{\phi^n \in \Phi\}_{n \in \mathcal{P}}$$ and $$\hat{P}$$ an $$\mathcal{E}_{2(ll,\phi)}^*(poly,rlog)$$-optimal predictor system for $$\mathcal{R}_G^\phi$$. Consider $$n \in \mathcal{P}$$, $$\epsilon \in (0,1)$$ and $$\hat{A}$$ an $$\mathcal{A}^n$$-valued $$(\phi^n)^{1-\epsilon}(poly,rlog)$$ scheme. Then

$\operatorname{E}_{\mu^k}[\operatorname{E}_{s_{G,\phi}^k(x,\hat{P})}[u^n]] \geq \operatorname{E}_{\mu^k}[\operatorname{E}_{\hat{A}^k(x) \times s_{G,\phi}^{\bar{n}k}(x,\hat{P})}[u^n]] - O((\phi^n)^{-\frac{1}{\infty}})$

We now move from studying strict maximizers to studying thermalizers.

# Theorem 2

Fix a proto-error space $$\mathcal{E}$$. Consider $$\mathcal{A}$$ a finite set, $$\mu$$ a word ensemble and $$\{f_a: \operatorname{supp}\mu \rightarrow [0,1]\}_{a \in \mathcal{A}}$$. Suppose $$\{\hat{P}_a\}_{a \in \mathcal{A}}$$ are $$\mathcal{E}(poly,rlog)$$-optimal predictors for $$f$$ and $$T: {\mathbb{N}}^2 \rightarrow {\mathbb{R}}^{>0}$$ is some function. Assume that $$\sigma_{P_a}$$ and $$r_{P_a}$$ don’t depend on $$a$$. Consider $$\hat{A}$$ an $$\mathcal{A}$$-valued $$(poly,rlog)$$-bischeme. Suppose $$\hat{M}_T$$ is another $$\mathcal{A}$$-valued $$(poly,rlog)$$-bischeme satisfying

$|\operatorname{Pr}_{\hat{\sigma}_M}[M_T^{kj}(x)=a] - E_{\hat{\sigma}_P}[Z_T^{kj}(x,y)^{-1} 2^{\frac{P_a^{kj}(x,y)}{T^{kj}}}]| \leq \delta_r(k,j)$

Here $$y$$ is sampled from $$\hat{\sigma}_P$$, $$Z_T^{kj}(x,y):=\sum_{a \in \mathcal{A}} 2^{\frac{P_a^{kj}(x,y)}{T^{kj}}}$$ is the normalization factor and $$\delta_r,T\delta_r^{\frac{1}{2}} \in \mathcal{E}^{\frac{1}{4}}$$. Then, there is $$\delta \in \mathcal{E}^{\frac{1}{4}}$$ s.t.

$\operatorname{E}_{\mu^k}[\operatorname{E}_{\hat{\sigma}_{M_T}^{kj}}[f_{\hat{M}_T^{kj}(x)}(x)] + T^{kj} \operatorname{H}(\hat{M}_T^{kj}(x))] \geq \operatorname{E}_{\mu^k}[\operatorname{E}_{\hat{\sigma}_A^{kj}}[f_{\hat{A}^{kj}(x)}(x)] + T^{kj} \operatorname{H}(\hat{A}^{kj}(x))] - \delta(k,j)$

Here $$\operatorname{H}$$ for stands the (base $$2$$) Shannon entropy of a probability measure on $$\mathcal{A}$$.

# Corollary 3

Fix $$\phi \in \Phi$$. Consider $$\mathcal{A}$$ a finite set, $$\mu$$ a word ensemble and $$\{f_a: \operatorname{supp}\mu \rightarrow [0,1]\}_{a \in \mathcal{A}}$$. Suppose $$\{\hat{P}_a\}_{a \in \mathcal{A}}$$ are $$\mathcal{E}_{2(ll,\phi)}(poly,rlog)$$-optimal predictors for $$f$$ and $$T: {\mathbb{N}}\rightarrow {\mathbb{R}}^{>0}$$ a bounded function. Consider $$\epsilon \in (0,1)$$ and $$\hat{A}$$ an $$\mathcal{A}$$-valued $$\phi^{1-\epsilon}(poly,rlog)$$-scheme. Assume $$\hat{M}_T$$ is as in Theorem 2. Then

$\operatorname{E}_{\mu^k}[\operatorname{E}_{\lambda_\phi^k}[E_{\hat{\sigma}_{M_T}^{kj}}[f_{\hat{M}_T^{kj}(x)}(x)]] + T^k \operatorname{H}(\operatorname{E}_{\lambda_\phi^k}[ \hat{M}_T^{kj}(x)])] \geq \operatorname{E}_{\mu^k}[\operatorname{E}_{\hat{\sigma}_A^k}[f_{\hat{A}^k(x)}(x)] + T^k \operatorname{H}(\hat{A}^k(x))] - O(\phi^{-\frac{1}{\infty}})$

Here $$\operatorname{E}_{\lambda_\phi^k}[\hat{M}_T^{kj}(x)]$$ stands for averaging the probability measure $$\hat{M}_T^{kj}(x)$$ on $$\mathcal{A}$$ with respect to parameter $$j$$ using probability distribution $$\lambda_\phi^k$$.

# Construction 3

Consider $$G=(\mu,\mathcal{P},\mathcal{A},u)$$ a distributional game, $$\{\phi^n \in \Phi\}_{n \in \mathcal{P}}$$ and $$T: \mathcal{P} \times {\mathbb{N}}\rightarrow {\mathbb{R}}^{>0}$$. We construct the reflective system $$\mathcal{R}_G^{\phi,T}=(\mathcal{P},u_{\phi,T},\mu_G)$$.

Define $$s_{G,\phi,T}^n: {\mathbb{N}}\times \operatorname{supp}\mu \times \Pi_{\mathcal{P}} \rightarrow \Delta_G^n$$ by

$\operatorname{Pr}_{s_{G,\phi,T}^{nk}(x,\pi)}[a]:=\operatorname{E}_{\lambda_{\phi^n}^k}[\operatorname{E}_{U(\pi)^{nkj}}[Z_{G,\phi,T}^{nkj}(x,y,\pi)^{-1} 2^{(T^{nk})^{-1}\pi^{nkj}((k,x,a),y)}]]$

Here $$Z$$ is the normalization factor.

Define $$s_{G,\phi,T}: {\mathbb{N}}\times \operatorname{supp}\mu \times \Pi_{\mathcal{P}} \rightarrow \prod_{n \in \mathcal{P}} \Delta_G^n$$ by

$s_{G,\phi,T}^k(x,\pi) := \prod_{n \in \mathcal{P}} s_{G,\phi,T}^{nk}(x,\pi)$

Define $$s_{G,\phi,T}^{\bar{n}}: {\mathbb{N}}\times \operatorname{supp}\mu \times \Pi_{\mathcal{P}} \rightarrow \prod_{m \in \mathcal{P} \setminus n} \Delta_G^m$$ by

$s_{G,\phi,T}^{\bar{n}k}(x,\pi) := \prod_{m \in \mathcal{P} \setminus n} s_{G,\phi,T}^{mk}(x,\pi)$

Finally, define $$u_{\phi,T}$$ by

$u_{\phi,T}^{nk}(x,a,\pi):=\operatorname{E}_{a \times s_{G,\phi,T}^{\bar{n}k}(x,\pi)}[u^n]$

We define the notations $$s_{G,\phi,T}^{nk}(x,\hat{Q})$$, $$s_{G,\phi,T}^k(x,\hat{Q})$$ and $$s_{G,\phi,T}^{\bar{n}k}(x,\hat{Q})$$ analogously to before.

# Corollary 4

Fix $$G=(\mu,\mathcal{P},\mathcal{A},u)$$ a distributional game, $$\{\phi^n \in \Phi\}_{n \in \mathcal{P}}$$, $$T: \mathcal{P} \times {\mathbb{N}}\rightarrow {\mathbb{R}}^{>0}$$ bounded and $$\hat{P}$$ an $$\mathcal{E}_{2(ll,\phi)}^*(poly,rlog)$$-optimal predictor system for $$\mathcal{R}_G^{\phi,T}$$. Consider $$n \in \mathcal{P}$$, $$\epsilon \in (0,1)$$ and $$\hat{A}$$ an $$\mathcal{A}^n$$-valued $$(\phi^n)^{1-\epsilon}(poly,rlog)$$ scheme. Then

$\operatorname{E}_{\mu^k}[\operatorname{E}_{s_{G,\phi,T}^k(x,\hat{P})}[u^n] + T^k \operatorname{H}(s_{G,\phi,T}^{nk}(x,\hat{P}))] \geq \operatorname{E}_{\mu^k}[\operatorname{E}_{\hat{A}^k(x) \times s_{G,\phi,T}^{\bar{n}k}(x,\hat{P})}[u^n] + T^k \operatorname{H}(\hat{A}^k(x))] - O((\phi^n)^{-\frac{1}{\infty}})$

Finally, we show that assuming $$\phi^n$$ doesn’t depend on $$n$$ and $$T$$ doesn’t fall to zero too fast, deterministic advice is sufficient to construct an optimal predictor system for $$\mathcal{R}_G^{\phi,T}$$.

# Definition 3

Consider a reflective system $$\mathcal{R}=(\Sigma,f,\mu)$$ and $$\{\phi^n \in \Phi\}_{n \in \Sigma}$$. $$\mathcal{R}$$ is said to be $$\phi$$-Hoelder when there are $$\{c^{nk} \in {\mathbb{R}}^{>0}\}_{n \in \Sigma, k \in {\mathbb{N}}}$$, $$\{\rho^n: \Sigma \rightarrow [0,1]\}_{n \in \Sigma}$$, $$\{\psi^{nm} \in \Phi\}_{n,m \in \Sigma}$$, $$\{\alpha^n \in (0,1]\}_{n \in \Sigma}$$ and $$\{\delta^n: {\mathbb{N}}\rightarrow {\mathbb{R}}^{\geq 0}\}_{n \in \Sigma}$$ s.t.

1. $\forall n \in \Sigma \, \exists \epsilon^n > 0: c^{nk}=O(\phi^n(k)^{\frac{\alpha^n}{2} - \epsilon^n})$

2. $\sum_{m \in \Sigma} \rho^n(m) = 1$

3. $\psi^{nm} \geq \phi^n$

4. $\delta^n(k) = O(\phi^n(k)^{-\frac{1}{\infty}})$

5. $\operatorname{E}_{\mu^{nk}}[(f^n(x,\pi)-f^n(x,\tilde{\pi}))^2] \leq c^{nk} \operatorname{E}_{\rho^n}[\operatorname{E}_{\mu^{mk} \times \lambda_{\psi^{nm}}^k}[\operatorname{E}_{U(\pi)^{mkj} \times U(\tilde{\pi})^{mkj}}[(\pi^{mkj}(x,y)-\tilde{\pi}^{mkj}(x,y))^2]]]^{\alpha^n} + \delta^n(k)$

# Theorem 3

Consider a finite set $$\Sigma$$, a reflective system $$\mathcal{R}=(\Sigma,f,\mu)$$ and $$\{\phi^n \in \Phi\}_{n \in \Sigma}$$. If $$\mathcal{R}$$ is $$\phi$$-Hoelder then it has an $$\mathcal{E}_{2(ll,\phi)}^*(poly,log)$$-optimal predictor system.

# Theorem 4

Consider $$G=(\mu,\mathcal{P},\mathcal{A},u)$$ a distributional game, $$\phi \in \Phi$$ and $$T: \mathcal{P} \times {\mathbb{N}}\rightarrow {\mathbb{R}}^{>0}$$. Assume $$\frac{1}{T^{nk}} = O(\phi(k)^{\frac{1}{4} - \epsilon})$$ for some $$\epsilon > 0$$. Then $$\mathcal{R}_G^{\phi,T}$$ is $$\phi$$-Hoelder.

# Proposition A.1

Suppose $$h: {\mathbb{R}}^{\geq 0} \rightarrow {\mathbb{R}}^{\geq 0}$$ is a non-decreasing function s.t. $$\int_0^\infty h(x)^{-1} dx < \infty$$. Define $$\delta_h: {\mathbb{N}}^2 \rightarrow {\mathbb{R}}^{\geq 0}$$ by

$\delta_h(k,j):=\min(\frac{\log \log (k+3)}{h(\log \log (j+3))},1)$

Then, $$\delta_h \in \mathcal{E}_{2(ll)}$$.

# Proof of Proposition A.1

Consider $$\phi \in \Phi$$. We have

${\limsup_{k \rightarrow \infty}} \, \phi(k) E_{\lambda_\phi^k}[\delta_h(k,j)] = {\limsup_{k \rightarrow \infty}} \, \phi(k) \frac{\sum_{j=2}^{t_\phi(k)-1} (\log\log(j+1)-\log \log j) \delta_h(k,j)}{\log \log t_\phi(k)}$

${\limsup_{k \rightarrow \infty}} \, \phi(k) E_{\lambda_\phi^k}[\delta_h(k,j)] = {\limsup_{k \rightarrow \infty}} \, \phi(k) \frac{\int_{x=2}^{t_\phi(k)} \delta_h(k,{\lfloor x \rfloor}) d (\log \log x)}{\phi(k) \log\log k}$

${\limsup_{k \rightarrow \infty}} \, \phi(k) E_{\lambda_\phi^k}[\delta_h(k,j)] \leq {\limsup_{k \rightarrow \infty}} \, \int_{x=2}^{t_\phi(k)} h(\log\log ({\lfloor x \rfloor}+3))^{-1} d (\log \log x)$

${\limsup_{k \rightarrow \infty}} \, \phi(k) E_{\lambda_\phi^k}[\delta_h(k,j)] \leq {\limsup_{k \rightarrow \infty}} \, \int_{x=2}^{t_\phi(k)} h(\log\log x)^{-1} d (\log \log x)$

${\limsup_{k \rightarrow \infty}} \, \phi(k) E_{\lambda_\phi^k}[\delta_h(k,j)] \leq \int_0^\infty h(t)^{-1} dt$

# Proof of Proposition 1

Mostly obvious modulo Proposition A.1. To see $$\mathcal{E}_{2(ll,\phi)}$$ is stable for non-decreasing $$\phi$$, consider a non-constant polynomial $$p: {\mathbb{N}}\rightarrow {\mathbb{N}}$$ and $$\delta \in \mathcal{E}_{2(ll,\phi)}$$. Define $$\delta'(k,j):=\delta(p(k),j)$$. To get the desired condition for $$\delta'$$ and $$\psi \in \Phi$$ with $$\psi \leq \phi$$, consider any $$\psi' \in \Phi$$ s.t. $$\psi' \leq \phi$$ and for sufficiently large $$k$$ we have $$\psi'(p(k))=\psi(k)$$. Suche $$\psi'$$ exists since for for $$k$$ sufficiently large $$\phi(k) \leq \phi(p(k))$$. We have

${\limsup_{k \rightarrow \infty}} \, \psi'(k) E_{\lambda_{\psi'}^k}[\delta(k,j)] < \infty$

In particular

${\limsup_{k \rightarrow \infty}} \,\psi'(p(k)) E_{\lambda_{\psi'}^{p(k)}}[\delta(p(k),j)] < \infty$

${\limsup_{k \rightarrow \infty}} \, \psi(k) E_{\lambda_{\psi}^k}[\delta'(k,j)] < \infty$

# Proposition A.2

Consider a polynomial $$q: {\mathbb{N}}^2 \rightarrow {\mathbb{N}}$$. There is a function $$\lambda_q: {\mathbb{N}}^3 \rightarrow [0,1]$$ s.t.

1. $\forall k,j \in {\mathbb{N}}: \sum_{i \in {\mathbb{N}}} \lambda_q(k,j,i) = 1$

2. For any function $$\epsilon: {\mathbb{N}}^2 \rightarrow [0,1]$$ we have

$\epsilon(k,j) - \sum_{i \in {\mathbb{N}}} \lambda_q(k,j,i) \, \epsilon(k,q(k,j)+i) \in \mathcal{E}_{2(ll)}$

# Proof of Proposition A.2

Given functions $$q_1,q_2: {\mathbb{N}}^2 \rightarrow {\mathbb{N}}$$ s.t. $$q_1(k,j) \geq q_2(k,j)$$ for $$k,j \gg 0$$, the proposition for $$q_1$$ implies the proposition for $$q_2$$ by setting

$\lambda_{q_2}(k,j,i):=\begin{cases}\lambda_{q_1}(k,j,i-q_1(k,j)+q_2(k,j)) & \text{if } i-q_1(k,j)+q_2(k,j) \geq 0 \\ 0 & \text{if } i-q_1(k,j)+q_2(k,j) < 0 \end{cases}$

Therefore, it is enough to prove to proposition for functions of the form $$q(k,j)=j^{m+n \log k}$$ for $$m > 0$$.

Consider any $$\phi \in \Phi$$. We have

${\limsup_{k \rightarrow \infty}} \, \phi(k) \frac{\log\log k}{\phi(k) \log\log k} < \infty$

${\limsup_{k \rightarrow \infty}} \, \phi(k) \frac{\log (m+n \log k)}{\phi(k) \log \log k} < \infty$

${\limsup_{k \rightarrow \infty}} \, \phi(k) \frac{\int\limits_{x=2}^{2^{m+n \log k}} d(\log \log x)}{\phi(k) \log \log k} < \infty$

Since $$\epsilon$$ takes values in $$[0,1]$$

${\limsup_{k \rightarrow \infty}} \, \phi(k) \frac{\int\limits_{x=2}^{2^{m+n \log k}} \epsilon(k,{\lfloor x \rfloor}) d(\log \log x)}{\phi(k) \log \log k} < \infty$

Similarly

${\limsup_{k \rightarrow \infty}} \, \phi(k) \frac{\int\limits_{x=t_\phi(k)}^{t_\phi(k)^{m+n \log k}} \epsilon(k,{\lfloor x \rfloor}) d(\log \log x)}{\phi(k) \log \log k} < \infty$

The last two equations imply that

${\limsup_{k \rightarrow \infty}} \, \phi(k) \frac{\int\limits_{x=2}^{t_\phi(k)} \epsilon(k,{\lfloor x \rfloor}) d(\log \log x) - \int\limits_{x=2^{m+n \log k}}^{t_\phi(k)^{m+n \log k}} \epsilon(k,{\lfloor x \rfloor}) d(\log \log x)}{\phi(k) \log \log k} < \infty$

Raising $$x$$ to a power is equivalent to adding a constant to $$\log \log x$$, therefore

${\limsup_{k \rightarrow \infty}} \, \phi(k) \frac{\int\limits_{x=2}^{t_\phi(k)} \epsilon(k,{\lfloor x \rfloor}) d(\log \log x) - \int\limits_{x=2}^{t_\phi(k)} \epsilon(k,{\lfloor x^{m+n \log k} \rfloor}) d(\log \log x)}{\phi(k) \log \log k} < \infty$

${\limsup_{k \rightarrow \infty}} \, \phi(k) \frac{\int\limits_{x=2}^{t_\phi(k)} (\epsilon(k,{\lfloor x \rfloor})-\epsilon(k,{\lfloor x^{m+n \log k} \rfloor})) d(\log \log x)}{\phi(k) \log \log k} < \infty$

Since $${\lfloor x^{m+n \log k} \rfloor} \geq {\lfloor x \rfloor}^{m+n \log k}$$ we can choose $$\lambda_q$$ satisfying condition (i) so that

$\int\limits_{x=j}^{j+1} \epsilon(k,{\lfloor x^{m+n \log k} \rfloor}) d(\log\log x) = (\log\log(j+1)-\log\log j) \sum_i \lambda_q(k,j,i) \, \epsilon(k,j^{m+n \log k}+i)$

It follows that

$\int\limits_{x=j}^{j+1} \epsilon(k,{\lfloor x^{m+n \log k} \rfloor}) d(\log\log x) = \int\limits_{x=j}^{j+1} \sum_i \lambda_q(k,{\lfloor x \rfloor},i) \, \epsilon(k,{\lfloor x \rfloor}^{m+n \log k}+i) d(\log\log x)$

${\limsup_{k \rightarrow \infty}} \, \phi(k) \frac{\int\limits_{x=2}^{t_\phi(k)} (\epsilon(k,{\lfloor x \rfloor})-\sum_i \lambda_q(k,{\lfloor x \rfloor},i) \, \epsilon(k,{\lfloor x \rfloor}^{m+n \log k}+i)) d(\log \log x)}{\phi(k) \log \log k} < \infty$

${\limsup_{k \rightarrow \infty}} \, \phi(k) \frac{\sum_{j=2}^{t_\phi(k)-1} (\log\log(j+1)-\log\log j)(\epsilon(k,j)-\sum_i \lambda_q(k,j,i) \, \epsilon(k,j^{m+n \log k}+i))}{\phi(k) \log \log k} < \infty$

$\epsilon(k,j) - \sum_{i \in {\mathbb{N}}} \lambda_q(k,j,i) \, \epsilon(k,q(k,j)+i) \in \mathcal{E}_{2(ll)}$

# Lemma A.1

Consider $$(f, \mu)$$ a distributional estimation problem, $$\hat{P}$$, $$\hat{Q}$$ $$(poly,rlog)$$-predictors. Suppose $$p: {\mathbb{N}}^2 \rightarrow {\mathbb{N}}$$ a polynomial, $$\phi \in \Phi$$ and $$\delta \in \mathcal{E}_{2(ll,\phi)}$$ are s.t.

$\forall i,k,j \in {\mathbb{N}}: \operatorname{E}[(\hat{P}^{k,p(k,j)+i}-f)^2] \leq \operatorname{E}[(\hat{Q}^{kj}-f)^2] + \delta(k,j)$

Then $$\exists \delta' \in \mathcal{E}_{2(ll,\phi)}$$ s.t.

$\operatorname{E}[(\hat{P}^{kj}-f)^2] \leq \operatorname{E}[(\hat{Q}^{kj}-f)^2] + \delta'(k,j)$

The proof of Lemma A.1 is analogous to before and we omit it.

## Appendix B

The following is a slightly stronger version of one direction of the orthogonality lemma.

# Lemma B.1

Consider $$(f, \mu)$$ a distributional estimation problem and $$\hat{P}$$ an $$\mathcal{E}(poly,rlog)$$-optimal predictor for $$(f, \mu)$$. Then there are $$c_1,c_2 \in {\mathbb{R}}$$ and an $$\mathcal{E}^{\frac{1}{2}}$$-moderate function $$\delta: {\mathbb{N}}^5 \rightarrow [0,1]$$ s.t. for any $$k,j,r,l,t \in {\mathbb{N}}$$, $$\tau$$ a probability measure on $${{\{ 0, 1 \}^{\leq{l}}}}$$ and $$Q: \operatorname{supp}\mu^k \times {{\{ 0, 1 \}^{r_P(k,j)}}} \times {{\{ 0, 1 \}^*}}\times {{\{ 0, 1 \}^{r}}} \times {{\{ 0, 1 \}^{\leq{l}}}} \xrightarrow{alg} {\mathbb{Q}}$$ that runs in time $$t$$ on any valid input

${\lvert \operatorname{E}_{\mu^k \times U^{r_P(k,j)} \times \sigma_P^{kj} \times U^r \times \tau}[Q(x,y,z,u,w)(P^{kj}(x,y,z)-f(x))] \rvert} \leq (c_1 + c_2 E_{\mu^k \times U^{r_P(k,j)} \times \sigma_P^{kj} \times U^r \times \tau}[Q(x,y,z,u,w)^2]) \delta(k,j,t,2^l,2^{{\lvert Q \rvert}})$

The proof is analogous to the original and we omit it.

# Lemma B.2

Fix a proto-error space $$\mathcal{E}$$. Consider $$\mathcal{A}$$ a finite set, $$\mu$$ a word ensemble and $$\{f_a: \operatorname{supp}\mu \rightarrow [0,1]\}_{a \in \mathcal{A}}$$. Suppose $$\{\hat{P}_a\}_{a \in \mathcal{A}}$$ are $$\mathcal{E}(poly,rlog)$$-optimal predictors for $$f$$. Assume that $$\sigma_{P_a}$$ and $$r_{P_a}$$ don’t depend on $$a$$. Consider $$\hat{A}$$ an $$\mathcal{A}$$-valued $$(poly,rlog)$$-bischeme. Assume $$\hat{\sigma}_A$$ factors as $$\hat{\sigma}_P \times \tau$$. Then

${\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[P_{A(x,y,z)}(x,y) - f_{A(x,y,z)}] \rvert} \in \mathcal{E}^{\frac{1}{2}}$

# Proof of Lemma B.2

$\operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[P_{A(x,y,z)}(x,y) - f_{A(x,y,z)}(x)] = \sum_{a \in \mathcal{A}} \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[\delta_{A(x,y,z),a} (P_a(x,y) - f_a(x))]$

Applying Lemma B.1 we get the desired result.

# Proof of Theorem 1

Lemma B.2 implies

${\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \hat{\sigma}_A}[\hat{P}_{\hat{A}} - f_{\hat{A}}] \rvert} \in \mathcal{E}^{\frac{1}{2}}$

${\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[\hat{P}_{\hat{M}} - f_{\hat{M}}] \rvert} \in \mathcal{E}^{\frac{1}{2}}$

Combining the two

${\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[\hat{P}_{\hat{M}} - f_{\hat{M}}] \rvert} + {\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \hat{\sigma}_A}[\hat{P}_{\hat{A}} - f_{\hat{A}}] \rvert} \in \mathcal{E}^{\frac{1}{2}}$

${\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[\hat{P}_{\hat{M}} - f_{\hat{M}}] - \operatorname{E}_{\mu \times \hat{\sigma}_P \times \hat{\sigma}_A}[\hat{P}_{\hat{A}} - f_{\hat{A}}] \rvert} \in \mathcal{E}^{\frac{1}{2}}$

${\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau \times \hat{\sigma}_A}[\hat{P}_{\hat{M}} - \hat{P}_{\hat{A}}] - \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau \times \hat{\sigma}_A}[f_{\hat{M}} - f_{\hat{A}}] \rvert} \in \mathcal{E}^{\frac{1}{2}}$

The construction of $$\hat{M}$$ implies that for every $$x \in \operatorname{supp}\mu^k$$ we have

$\operatorname{E}_{\hat{\sigma}_P \times \tau}[P_{M^{kj}(x,y,z)}^{kj}(x,y)] \geq \operatorname{E}_{\hat{\sigma}_P \times \hat{\sigma}_A}[P_{A^{kj}(x,z)}^{kj}(x,y)]$

Combining with the above we get the desired result.

# Proposition B.1

Consider $$\phi \in \Phi$$, $$\epsilon \in (0,1)$$, $$\zeta: {\mathbb{N}}^2 \rightarrow {\mathbb{R}}$$ bounded below and $$\xi: {\mathbb{N}}\rightarrow {\mathbb{R}}$$ bounded. Assume that $$\forall k,j \in {\mathbb{N}}: j \geq t_{\phi^{1-\epsilon}}(k) \implies \zeta^{kj}=\xi^k$$. Then $$\operatorname{E}_{\lambda_\phi^k}[\zeta^{kj}] \geq \xi^k - O(\phi(k)^{-\epsilon})$$.

# Proof of Proposition B.1

$\operatorname{E}_{\lambda_\phi^k}[\zeta^{kj}] = \operatorname{Pr}_{\lambda_\phi^k}[j < t_{\phi^{1-\epsilon}}(k)] \operatorname{E}_{\lambda_\phi^k}[\zeta^{kj} \mid j < t_{\phi^{1-\epsilon}}(k)] + \operatorname{Pr}_{\lambda_\phi^k}[j \geq t_{\phi^{1-\epsilon}}(k)] \operatorname{E}_{\lambda_\phi^k}[\zeta^{kj} \mid j \geq t_{\phi^{1-\epsilon}}(k)]$

$\operatorname{E}_{\lambda_\phi^k}[\zeta^{kj}] \geq \frac{\log\log t_{\phi^{1-\epsilon}}(k)}{\log\log t_\phi(k)} \inf \zeta + \frac{\log\log t_\phi(k) - \log\log t_{\phi^{1-\epsilon}}(k)}{\log\log t_\phi(k)} \xi^k$

$\operatorname{E}_{\lambda_\phi^k}[\zeta^{kj}] \geq O(\phi(k)^{-\epsilon}) \inf \zeta +(1 - O(\phi(k)^{-\epsilon})) \xi^k$

$\operatorname{E}_{\lambda_\phi^k}[\zeta^{kj}] \geq \xi^k - O(\phi(k)^{-\epsilon})$

# Proof of Corollary 1

Choose some $$a_0 \in \mathcal{A}$$. Define the $$\mathcal{A}$$-valued $$(poly,rlog)$$-scheme $$\hat{B}$$ by

$\hat{B}^{kj}(x):=\begin{cases}\hat{A}^k(x) \text{ if } j \geq t_{\phi^{1-\epsilon}}(k) \\ a_0 \text{ otherwise} \end{cases}$

Applying Theorem 1 we get $$\delta \in \mathcal{E}_{2(ll,\phi)}^{\frac{1}{2}}$$ s.t.

$\operatorname{E}_{\mu^k \times \hat{\sigma}_M^{kj}}[f_{\hat{M}^{kj}}] \geq \operatorname{E}_{\mu^k \times \hat{\sigma}_B^{kj}}[f_{\hat{B}^{kj}}] - \delta(k,j)$

$\operatorname{E}_{\mu^k \times \lambda_\phi^k}[\operatorname{E}_{\hat{\sigma}_M^{kj}}[f_{\hat{M}^{kj}}]] \geq \operatorname{E}_{\mu^k \times \lambda_\phi^k}[\operatorname{E}_{\hat{\sigma}_B^{kj}}[f_{\hat{B}^{kj}}]] - \operatorname{E}_{\lambda_\phi^k}[\delta(k,j)]$

Proposition B.1 implies

$\operatorname{E}_{\mu^k \times \lambda_\phi^k}[\operatorname{E}_{\hat{\sigma}_B^{kj}}[f_{\hat{B}^{kj}}]] \geq \operatorname{E}_{\mu^k \times \hat{\sigma}_A^k}[f_{\hat{A}^k}] - O(\phi(k)^{-\epsilon})$

Also

$\operatorname{E}_{\lambda_\phi^k}[\delta(k,j)] \leq \operatorname{E}_{\lambda_\phi^k}[\delta(k,j)^2]^{\frac{1}{2}}$

$\operatorname{E}_{\lambda_\phi^k}[\delta(k,j)] = O(\phi(k)^{-\frac{1}{2}})$

Putting everything together we get the desired result.

# Proof of Corollary 2

For each $$a \in \mathcal{A}^n$$ define $$\hat{P}^{kj}_a(k,x):=\hat{P}^{kj}(k,x,a)$$ and $$u_{\phi,a}^n(k,x):=u_\phi^{nk}(x,a,\hat{P})$$. It is easy to see $$\hat{P}_a$$ is an $$\mathcal{E}_{2(ll,\phi)}^*(poly,rlog)$$-optimal predictor for $$u_{\phi,a}$$. We have

$\operatorname{E}_{s_{G,\phi}^k(x,\hat{P})}[u^n] = \operatorname{E}_{\lambda_{\phi^n}^k}[\operatorname{E}_{\hat{\sigma}_M^{kj}}[u_{\phi,\hat{M}^{kj}(x)}^{nk}(x)]] \pm O(\phi^n(k)^{-1})$

Here $$\hat{M}^{kj}(x,y,z)$$ selects an element of $$M_G^{nkj}(x,y,\hat{P}[z])$$ uniformly at random up to a small rounding error which yields the $$O(\phi^n(k)^{-1})$$ term. Also

$\operatorname{E}_{\hat{A}^k(x) \times s_{G,\phi}^{\bar{n}k}(x,\hat{P})}[u^n] = \operatorname{E}_{\hat{\sigma}_A^{k}}[u_{\phi,\hat{A}^{k}(x)}^{nk}(x)]$

Applying Corollary 1 we get the desired result.

# Proposition B.2

Consider $$\mathcal{A}$$ a finite set, $$u: \mathcal{A} \rightarrow {\mathbb{R}}$$ and $$T > 0$$. Define $$\vartheta: \mathcal{A} \rightarrow [0,1]$$ by $$\vartheta(a):=Z^{-1} 2^{\frac{u(a)}{T}}$$ where $$Z$$ is the normalization constant $$\sum_{a \in \mathcal{A}} 2^{\frac{u(a)}{T}}$$. Consider $$\rho: \mathcal{A} \rightarrow [0,1]$$ with $$\sum_{a \in \mathcal{A}} \rho(a) = 1$$. Then

$\operatorname{E}_{\vartheta}[u] + T \operatorname{H}(\vartheta) = \operatorname{E}_{\rho}[u] + T \operatorname{H}(\rho) + T {\operatorname{D}_{\mathrm{KL}}(\rho \| \vartheta)}$

# Proof of Proposition B.2

${\operatorname{D}_{\mathrm{KL}}(\rho \| \vartheta)} = \operatorname{E}_\rho[\log \frac{\rho}{\vartheta}]$

${\operatorname{D}_{\mathrm{KL}}(\rho \| \vartheta)} = \operatorname{E}_\rho[\log \frac{\rho}{Z^{-1} 2^{\frac{u}{T}}}]$

${\operatorname{D}_{\mathrm{KL}}(\rho \| \vartheta)} = \operatorname{E}_\rho[\log \rho + \log Z - \frac{u}{T}]$

${\operatorname{D}_{\mathrm{KL}}(\rho \| \vartheta)} = -\operatorname{H}(\rho) + \log Z - \frac{\operatorname{E}_\rho[u]}{T}$

$T {\operatorname{D}_{\mathrm{KL}}(\rho \| \vartheta)} = T \log Z - (\operatorname{E}_\rho[u] + T \operatorname{H}(\rho))$

On the other hand

$\operatorname{H}(\vartheta) = -\operatorname{E}_\vartheta[\log \vartheta]$

$\operatorname{H}(\vartheta) = -\operatorname{E}_\vartheta[\log (Z^{-1} 2^{\frac{u}{T}})]$

$\operatorname{H}(\vartheta) = -\operatorname{E}_\vartheta[-\log Z + \frac{u}{T}]$

$\operatorname{H}(\vartheta) = \log Z - \frac{\operatorname{E}_{\vartheta}[u]}{T}$

$T \log Z = \operatorname{E}_{\vartheta}[u] + T \operatorname{H}(\vartheta)$

Combining the two we get the desired result.

# Proposition B.3

Consider $$\mathcal{A}$$ a finite set, $$u_1,u_2: \mathcal{A} \rightarrow {\mathbb{R}}$$ and $$T > 0$$. Define the probability measures $$\vartheta_1, \vartheta_2: \mathcal{A} \rightarrow [0,1]$$ by $$\vartheta_i(a):=Z_i^{-1} 2^{\frac{u_i(a)}{T}}$$ where $$Z_i$$ are normalization factors. Then

${\operatorname{D}_{\mathrm{KL}}(\vartheta_1 \| \vartheta_2)} \leq T^{-1} \sum_{a \in \mathcal{A}} {\lvert u_1(a) - u_2(a) \rvert}$

# Proof of Proposition B.3

${\operatorname{D}_{\mathrm{KL}}(\vartheta_1 \| \vartheta_2)} = \operatorname{E}_{\vartheta_1}[\log \frac{\vartheta_1}{\vartheta_2}]$

${\operatorname{D}_{\mathrm{KL}}(\vartheta_1 \| \vartheta_2)} = \operatorname{E}_{\vartheta_1}[\log \frac{Z_1^{-1} 2^{\frac{u_1}{T}}}{Z_2^{-1} 2^{\frac{u_2}{T}}}]$

${\operatorname{D}_{\mathrm{KL}}(\vartheta_1 \| \vartheta_2)} = \operatorname{E}_{\vartheta_1}[\frac{u_1-u_2}{T}] + \log \frac{Z_2}{Z_1}$

${\operatorname{D}_{\mathrm{KL}}(\vartheta_1 \| \vartheta_2)} = \operatorname{E}_{\vartheta_1}[\frac{u_1-u_2}{T}] + \log \frac{\sum_{a \in \mathcal{A}} 2^{\frac{u_2(a)}{T}}}{Z_1}$

${\operatorname{D}_{\mathrm{KL}}(\vartheta_1 \| \vartheta_2)} = \operatorname{E}_{\vartheta_1}[\frac{u_1-u_2}{T}] + \log \sum_{a \in \mathcal{A}} Z_1^{-1} 2^{\frac{u_1(a)}{T}} 2^{\frac{u_2(a)-u_1(a)}{T}}$

${\operatorname{D}_{\mathrm{KL}}(\vartheta_1 \| \vartheta_2)} = \operatorname{E}_{\vartheta_1}[\frac{u_1-u_2}{T}] + \log \operatorname{E}_{\vartheta_1}[2^{\frac{u_2-u_1}{T}}]$

${\operatorname{D}_{\mathrm{KL}}(\vartheta_1 \| \vartheta_2)} \leq T^{-1}(\max_{a \in \mathcal{A}} (u_1(a) - u_2(a)) - \min_{a \in \mathcal{A}} (u_1(a) - u_2(a)))$

${\operatorname{D}_{\mathrm{KL}}(\vartheta_1 \| \vartheta_2)} \leq T^{-1} \sum_{a \in \mathcal{A}} {\lvert u_1(a) - u_2(a) \rvert}$

# Proposition B.4

Consider $$\mathcal{A}$$ a finite set. There is a $$c > 0$$ s.t. for any $$\rho_1, \rho_2$$ probability measures on $$\mathcal{A}$$ we have

${\lvert \operatorname{H}(\rho_1)-\operatorname{H}(\rho_2) \rvert} \leq c d_{TV}(\rho_1, \rho_2)^{\frac{1}{2}}$

Here $$d_{TV}(\rho_1, \rho_2) := \frac{1}{2} \sum_{a \in \mathcal{A}} {\lvert \rho_1(a)-\rho_2(a) \rvert}$$ is the total variation distance.

# Proof of Proposition B.4

Given $$0 \leq p \leq q \leq 1$$ we have

${\lvert q \log q - p \log p \rvert} = {\lvert \int_p^q d(x \log x) \rvert}$

${\lvert q \log q - p \log p \rvert} = {\lvert \int_p^q (\log x + \frac{1}{\ln 2}) dx \rvert}$

${\lvert q \log q - p \log p \rvert} = {\lvert \int_p^q \log (\mathrm{e}x) dx \rvert}$

${\lvert q \log q - p \log p \rvert} \leq \int_p^q {\lvert \log (\mathrm{e}x) \rvert} dx$

For some $$c' > 0$$ we have $${\lvert \log (\mathrm{e}x) \rvert} \leq \frac{c'}{2}x^{-\frac{1}{2}}$$ for $$x \in (0,1]$$ hence

${\lvert q \log q - p \log p \rvert} \leq \int_p^q \frac{c'}{2}x^{-\frac{1}{2}} dx$

${\lvert q \log q - p \log p \rvert} \leq c'(q^{\frac{1}{2}}-p^{\frac{1}{2}})$

${\lvert q \log q - p \log p \rvert} \leq c'(q-p)^{\frac{1}{2}}$

${\lvert \operatorname{H}(\rho_1)-\operatorname{H}(\rho_2) \rvert} = {\lvert \sum_{a \in \mathcal{A}} \rho_1(a) \log \rho_1(a) - \rho_2(a) \log \rho_2(a) \rvert}$

${\lvert \operatorname{H}(\rho_1)-\operatorname{H}(\rho_2) \rvert} \leq \sum_{a \in \mathcal{A}} {\lvert \rho_1(a) \log \rho_1(a) - \rho_2(a) \log \rho_2(a) \rvert}$

${\lvert \operatorname{H}(\rho_1)-\operatorname{H}(\rho_2) \rvert} \leq \sum_{a \in \mathcal{A}} c' {\lvert \rho_1(a)-\rho_2(a) \rvert}^{\frac{1}{2}}$

Using the concavity of the square root we complete the proof.

# Proof of Theorem 2

For any $$k,j \in {\mathbb{N}}$$, $$x \in \operatorname{supp}\mu^k$$ and $$y \in {{\{ 0, 1 \}^*}}^2$$ define $$\vartheta^{kj}(x)$$ and $$\vartheta^{kj}(x,y)$$ probability measures on $$\mathcal{A}$$ by

$\operatorname{Pr}_{\vartheta^{kj}(x)}[a]:=W^{kj}(x)^{-1} 2^{(T^{kj})^{-1} \operatorname{E}_{\hat{\sigma}_P}[\hat{P}_a^{kj}(x)]}$

$\operatorname{Pr}_{\vartheta^{kj}(x,y)}[a]:=Z_T^{kj}(x,y)^{-1} 2^{(T^{kj})^{-1} P_a^{kj}(x,y)}$

Here $$W$$ is the normalization factor. By the convexity of Kullback-Leibler divergence

${\operatorname{D}_{\mathrm{KL}}(\operatorname{E}_{\hat{\sigma}_P}[\vartheta^{kj}(x,y)] \| \vartheta^{kj}(x))} \leq \operatorname{E}_{\hat{\sigma}_P}[{\operatorname{D}_{\mathrm{KL}}(\vartheta^{kj}(x,y) \| \vartheta^{kj}(x))}]$

Applying Proposition B.3

${\operatorname{D}_{\mathrm{KL}}(\operatorname{E}_{\hat{\sigma}_P}[\vartheta^{kj}(x,y)] \| \vartheta^{kj}(x))} \leq \operatorname{E}_{\hat{\sigma}_P}[(T^{kj})^{-1} \sum_{a \in \mathcal{A}} {\lvert P_a^{kj}(x,y)-\operatorname{E}_{\hat{\sigma}_P}[P_a^{kj}(x,y')] \rvert}]$

${\operatorname{D}_{\mathrm{KL}}(\operatorname{E}_{\hat{\sigma}_P}[\vartheta^{kj}(x,y)] \| \vartheta^{kj}(x))} \leq (T^{kj})^{-1} \sum_{a \in \mathcal{A}} \operatorname{E}_{\hat{\sigma}_P}[(P_a^{kj}(x,y)-\operatorname{E}_{\hat{\sigma}_P}[P_a^{kj}(x,y')])^2]^{\frac{1}{2}}$

${\operatorname{D}_{\mathrm{KL}}(\operatorname{E}_{\hat{\sigma}_P}[\vartheta^{kj}(x,y)] \| \vartheta^{kj}(x))} \leq (T^{kj})^{-1} \sum_{a \in \mathcal{A}} \operatorname{E}_{\hat{\sigma}_P \times \hat{\sigma}_P}[(P_a^{kj}(x,y)-P_a^{kj}(x,y'))^2]^{\frac{1}{2}}$

Applying the uniqueness theorem, we get

$T^{kj} \operatorname{E}_{\mu^k}[{\operatorname{D}_{\mathrm{KL}}(\operatorname{E}_{\hat{\sigma}_P}[\vartheta^{kj}(x,y)] \| \vartheta^{kj}(x))}] \in \mathcal{E}^{\frac{1}{4}}$

We also have

$d_{TV}(\hat{M}_T^{kj}(x),\operatorname{E}_{\hat{\sigma}_P}[\vartheta^{kj}(x,y)]) \leq c_1 \delta_r(k,j)$

for some $$c_1 > 0$$. Applying Propositions B.2 and B.4 we get that

$T^{kj} {\lvert {\operatorname{D}_{\mathrm{KL}}(\hat{M}_T^{kj}(x) \| \vartheta^{kj}(x))} - {\operatorname{D}_{\mathrm{KL}}(\operatorname{E}_{\hat{\sigma}_P}[\vartheta^{kj}(x,y)] \| \vartheta^{kj}(x))} \rvert} \leq c_2 \delta_r(k,j)+ c_3 T^{kj} \delta_r(k,j)^{\frac{1}{2}}$

for some $$c_2, c_3 > 0$$. Combining with the above, we get

$T^{kj} \operatorname{E}_{\mu^k}[{\operatorname{D}_{\mathrm{KL}}(\hat{M}_T^{kj}(x) \| \vartheta^{kj}(x))}] \in \mathcal{E}^{\frac{1}{4}}$

In particular, for some $$\delta \in \mathcal{E}^{\frac{1}{4}}$$

$T^{kj} \operatorname{E}_{\mu^k}[{\operatorname{D}_{\mathrm{KL}}(\hat{M}_T^{kj}(x) \| \vartheta^{kj}(x))}] \leq T^{kj} \operatorname{E}_{\mu^k}[{\operatorname{D}_{\mathrm{KL}}(\hat{A}^{kj}(x) \| \vartheta^{kj}(x))}] + \delta(k,j)$

$-T^{kj} \operatorname{E}_{\mu^k}[{\operatorname{D}_{\mathrm{KL}}(\hat{M}_T^{kj}(x) \| \vartheta^{kj}(x))}] \geq -T^{kj} \operatorname{E}_{\mu^k}[{\operatorname{D}_{\mathrm{KL}}(\hat{A}^{kj}(x) \| \vartheta^{kj}(x))}] - \delta(k,j)$

Applying Proposition B.2

$\operatorname{E}_{\mu^k}[\operatorname{E}_{\hat{\sigma}_{M_T}^{kj}}[\hat{P}_{\hat{M}_T^{kj}(x)}(x)] + T^{kj} \operatorname{H}(\hat{M}_T^{kj}(x))] \geq \operatorname{E}_{\mu^k}[\operatorname{E}_{\hat{\sigma}_A^{kj}}[\hat{P}_{\hat{A}^{kj}(x)}(x)] + T^{kj} \operatorname{H}(\hat{A}^{kj}(x))] - \delta(k,j)$

Applying Lemma B.2 we get the desired result.

# Proof of Corollary 3

Choose some $$a_0 \in \mathcal{A}$$. Define the $$\mathcal{A}$$-valued $$(poly,rlog)$$-scheme $$\hat{B}$$ by

$\hat{B}^{kj}(x):=\begin{cases}\hat{A}^k(x) \text{ if } j \geq t_{\phi^{1-\epsilon}}(k) \\ a_0 \text{ otherwise} \end{cases}$

Applying Theorem 2 we get $$\delta \in \mathcal{E}_{2(ll,\phi)}^{\frac{1}{4}}$$ s.t.

$\operatorname{E}_{\mu^k}[\operatorname{E}_{\hat{\sigma}_{M_T}^{kj}}[f_{\hat{M}_T^{kj}(x)}(x)] + T^k \operatorname{H}(\hat{M}_T^{kj}(x))] \geq \operatorname{E}_{\mu^k}[\operatorname{E}_{\hat{\sigma}_B^{kj}}[f_{\hat{B}^{kj}(x)}(x)] + T^k \operatorname{H}(\hat{B}^{kj}(x))] - \delta(k,j)$

$\operatorname{E}_{\mu^k}[\operatorname{E}_{\lambda_\phi^k}[\operatorname{E}_{\hat{\sigma}_{M_T}^{kj}}[f_{\hat{M}_T^{kj}(x)}(x)]] + T^k \operatorname{E}_{\lambda_\phi^k}[\operatorname{H}(\hat{M}_T^{kj}(x))]] \geq \operatorname{E}_{\lambda_\phi^k \times \mu^k}[\operatorname{E}_{\hat{\sigma}_B^{kj}}[f_{\hat{B}^{kj}(x)}(x)] + T^k \operatorname{H}(\hat{B}^{kj}(x))] - \operatorname{E}_{\lambda_\phi^k}[\delta(k,j)]$

Using the concavity of entropy and the property of $$\delta$$

$\operatorname{E}_{\mu^k}[\operatorname{E}_{\lambda_\phi^k}[\operatorname{E}_{\hat{\sigma}_{M_T}^{kj}}[f_{\hat{M}_T^{kj}(x)}(x)]] + T^k \operatorname{H}(\hat{M}_T^{kj}(x))] \geq \operatorname{E}_{\lambda_\phi^k \times \mu^k}[\operatorname{E}_{\hat{\sigma}_B^{kj}}[f_{\hat{B}^{kj}(x)}(x)] + T^k \operatorname{H}(\hat{B}^{kj}(x))] - O(\phi(k)^{-\frac{1}{4}})$

Applying Proposition B.1 we get the desired result.

# Proof of Corollary 4

For each $$a \in \mathcal{A}^n$$ define $$\hat{P}^{kj}_a(k,x):=\hat{P}^{kj}(k,x,a)$$ and $$u_{\phi,T,a}^n(k,x):=u_{\phi,T}^{nk}(x,a,\hat{P})$$. It is easy to see $$\hat{P}_a$$ is an $$\mathcal{E}_{2(ll,\phi)}^*(poly,rlog)$$-optimal predictor for $$u_{\phi,T,a}$$. Construct the $$(poly,rlog)$$-bischeme $$\hat{M}_T$$ to satisfy the condition of Theorem 2 with $$\delta_r(k,j) \leq 2^{-j}$$. We have (with the help of Proposition B.4)

$\operatorname{E}_{s_{G,\phi,T}^k(x,\hat{P})}[u^n] = \operatorname{E}_{\lambda_{\phi^n}^k}[\operatorname{E}_{\hat{\sigma}_{M_T}^{kj}}[u_{\phi,T,\hat{M}_T^{kj}(x)}^{nk}(x)]] \pm O(\phi^n(k)^{-1})$

$\operatorname{H}(s_{G,\phi,T}^{nk}(x,\hat{P})) = \operatorname{H}(\operatorname{E}_{\lambda_{\phi^n}^k}[\hat{M}_T^{kj}(x)]) \pm O(\phi^n(k)^{-\frac{1}{2}})$

$\operatorname{E}_{\hat{A}^k(x) \times s_{G,\phi,T}^{\bar{n}k}(x,\hat{P})}[u^n] = \operatorname{E}_{\hat{\sigma}_A^{k}}[u_{\phi,T,\hat{A}^{k}(x)}^{nk}(x)]$

Applying Corollary 3 we get the desired result.

# Proposition B.5

Consider a finite set $$\Sigma$$, $$\mathcal{R}=(\Sigma,f,\mu)$$ a $$\phi$$-Hoelder reflective system and two collections of $$(poly,rlog)$$-predictors $$\{\hat{Q}_1^n\}_{n \in \Sigma}$$ and $$\{\hat{Q}_2^n\}_{n \in \Sigma}$$. Assume that $$\forall n \in \Sigma: \hat{Q}_1^n \underset{\mathcal{E}_{2(ll)}^{\frac{1}{2}}}{\overset{\mu^n}{\simeq}} \hat{Q}_2^n$$. Then

$\forall n \in \Sigma: \operatorname{E}_{\mu_n^k}[(\mathcal{R}[\hat{Q}_1]^n(x)-\mathcal{R}[\hat{Q}_1]^n(x))^2] = O(\phi^n(k)^{-\frac{1}{\infty}})$

# Proof of Proposition B.5

$\operatorname{E}_{\mu^{nk}}[(\mathcal{R}[\hat{Q}_1]^n(x)-\mathcal{R}[\hat{Q}_1]^n(x))^2] = \operatorname{E}_{\mu^{nk}}[(\operatorname{E}_{\sigma_1}[f^n(x,\hat{Q}_1[a])]-\operatorname{E}_{\sigma_2}[f^n(x,\hat{Q}_2[a])])^2]$

$\operatorname{E}_{\mu^{nk}}[(\mathcal{R}[\hat{Q}_1]^n(x)-\mathcal{R}[\hat{Q}_1]^n(x))^2] \leq E_{\mu^{nk} \times \sigma_1 \times \sigma_2}[(f^n(x,\hat{Q}_1[a_1])-f^n(x,\hat{Q}_2[a_2]))^2]$

$\operatorname{E}_{\mu^{nk}}[(\mathcal{R}[\hat{Q}_1]^n(x)-\mathcal{R}[\hat{Q}_1]^n(x))^2] \leq \operatorname{E}_{\sigma_1 \times \sigma_2}[c^{nk} \operatorname{E}_{\rho^n}[\operatorname{E}_{\lambda_{\psi^{nm}}^k}[\operatorname{E}_{\mu^{mk} \times U^{r_1(k,j)} \times U^{r_2(k,j)}}[(Q_1^{mkj}(x,y,a_1)-Q_2^{mkj}(x,y,a_2))^2]]]^{\alpha^n}] + \delta^n(k)$

$\operatorname{E}_{\mu^{nk}}[(\mathcal{R}[\hat{Q}_1]^n(x)-\mathcal{R}[\hat{Q}_1]^n(x))^2] \leq c^{nk} \operatorname{E}_{\rho^n}[\operatorname{E}_{\lambda_{\psi^{nm}}^k}[\operatorname{E}_{\mu^{mk} \times U^{r_1(k,j)} \times U^{r_2(k,j)} \times \sigma_1 \times \sigma_2}[(Q_1^{mkj}(x,y,a_1)-Q_2^{mkj}(x,y,a_2))^2]]]^{\alpha^n} + \delta^n(k)$

Using the similarity of $$\hat{Q}_1$$ and $$\hat{Q}_2$$ there are $$\{\tilde{\delta}^n: {\mathbb{N}}^2 \rightarrow [0,1] \in \mathcal{E}_{2(ll)}\}_{n \in \Sigma}$$ s.t.

$\operatorname{E}_{\mu^{nk}}[(\mathcal{R}[\hat{Q}_1]^n(x)-\mathcal{R}[\hat{Q}_1]^n(x))^2] \leq c^{nk} \operatorname{E}_{\rho^n}[\operatorname{E}_{\lambda_{\psi^{nm}}^k}[\tilde{\delta}^m(k,j)^{\frac{1}{2}}]]^{\alpha^n} + \delta^n(k)$

$\operatorname{E}_{\mu^{nk}}[(\mathcal{R}[\hat{Q}_1]^n(x)-\mathcal{R}[\hat{Q}_1]^n(x))^2] \leq c^{nk} \operatorname{E}_{\rho^n}[\operatorname{E}_{\lambda_{\psi^{nm}}^k}[\tilde{\delta}^m(k,j)]]^{\frac{\alpha^n}{2}} + \delta^n(k)$

$\operatorname{E}_{\mu^{nk}}[(\mathcal{R}[\hat{Q}_1]^n(x)-\mathcal{R}[\hat{Q}_1]^n(x))^2] \leq O(\phi^n(k)^{\frac{\alpha^n}{2} - \epsilon^n}) \operatorname{E}_{\rho^n}[O(\psi^{nm}(k)^{-1})]^{\frac{\alpha^n}{2}} + O(\phi^n(k)^{-\frac{1}{\infty}})$

Using $$\psi^{nm} \geq \phi^n$$ we get the desired result.

# Proof of Theorem 3

Fix a finite set $$\Sigma$$ and a collection $$\{\phi_n \in \Phi\}_{n \in \Sigma}$$. Consider $$\mathcal{R}$$ a $$\phi$$-Hoelder reflective system. By the general existence theorem, there is $$\hat{R}$$ an $$\mathcal{E}_{2(ll)}(poly,rlog)$$-optimal predictor system for $$\mathcal{R}$$. For each $$n \in \Sigma$$ we can choose $$\hat{P}^n$$, an $$\mathcal{E}_{2(ll)}(poly,log)$$-optimal predictor for $$\mathcal{R}[\hat{R}]^n$$. By the uniqueness theorem, we have $$\hat{P}^{n} \underset{\mathcal{E}_{2(ll)}^\frac{1}{2}}{\overset{\mu^n}{\simeq}} \hat{R}^{n}$$. By Proposition B.5 this implies $$E_{\mu^{nk}}[(\mathcal{R}[\hat{P}]^n(x)-\mathcal{R}[\hat{R}]^n(x))^2] = O(\phi^n(k)^{-\frac{1}{\infty}})$$. This means $$\hat{P}^n$$ is an $$\mathcal{E}_{2(ll,\phi^n)}^*(poly,log)$$-optimal predictor for $$\mathcal{R}[\hat{P}]^n$$ and $$\hat{P}$$ is an $$\mathcal{E}_{2(ll,\phi)}^*(poly,log)$$-optimal predictor system for $$\mathcal{R}$$.

# Proposition B.6

Fix a finite set $$\Sigma$$ and a collection $$\{\phi_n \in \Phi\}_{n \in \Sigma}$$. Consider $$\mathcal{R} = (\Sigma, f, \mu)$$ a reflective system. Assume there are $$\{c^{nmk} \in {\mathbb{R}}^{>0}\}_{n,m \in \Sigma, k \in {\mathbb{N}}}$$, $$\{\psi^{nm} \in \Phi\}_{n,m \in \Sigma}$$, $$\{\alpha^n \in (0,1]\}_{n \in \Sigma}$$ and $$\{\delta^{nm}: {\mathbb{N}}\rightarrow {\mathbb{R}}^{\geq 0}\}_{n,m \in \Sigma}$$ s.t.

1. $\forall n,m \in \Sigma \, \exists \epsilon^{nm} > 0: c^{nmk}=O(\phi^n(k)^{\frac{\alpha^n}{2} - \epsilon^{nm}})$

2. $\psi^{nm} \geq \phi^n$

3. $\delta^{nm}(k) = O(\phi^n(k)^{-\frac{1}{\infty}})$

4. For any $$m \in \Sigma$$ and $$\pi,\tilde{\pi} \in \Pi_\Sigma$$ s.t. $$\forall l \in \Sigma \setminus m,k,j \in {\mathbb{N}}: \pi^{lkj}=\tilde{\pi}^{lkj}$$ $\operatorname{E}_{\mu^{nk}}[(f^n(x,\pi)-f^n(x,\tilde{\pi}))^2] \leq c^{nmk} \operatorname{E}_{\mu^{mk} \times \lambda_{\psi^{nm}}^k}[\operatorname{E}_{U(\pi)^{mkj} \times U(\tilde{\pi})^{mkj}}[(\pi^{mkj}(x,y)-\tilde{\pi}^{mkj}(x,y))^2]]^{\alpha^n} + \delta^{nm}(k)$

Then, $$\mathcal{R}$$ is $$\phi$$-Hoelder.

# Proof of Proposition B.6

Identify $$\Sigma$$ with $$\{n \in {\mathbb{N}}\mid 1 \leq n \leq N\}$$ where $$N = \#\Sigma$$. Consider any $$\pi,\tilde{\pi} \in \Pi_\Sigma$$. Define $$\pi_n$$ recursively for all $$n \leq N$$ by $$\pi_0 := \pi$$, $$\pi_{n+1}^{n+1,kj}:=\tilde{\pi}^{n+1,kj}$$ and $$\forall m \in \Sigma \setminus \{n+1\}: \pi_{n+1}^{mkj}:=\pi_n^{mkj}$$. In particular, it follows that $$\pi_N = \tilde{\pi}$$. We have

$\operatorname{E}_{\mu^{nk}}[(f^n(x,\pi_{m-1})-f^n(x,\pi_m))^2] \leq c^{nmk} \operatorname{E}_{\mu^{mk} \times \lambda_{\psi^{nm}}^k}[\operatorname{E}_{U(\pi)^{mkj} \times U(\tilde{\pi})^{mkj}}[(\pi^{mkj}(x,y)-\tilde{\pi}^{mkj}(x,y))^2]]^{\alpha^n} + \delta^{nm}(k)$

$\frac{1}{N} \sum_{m=1}^N \operatorname{E}_{\mu^{nk}}[(f^n(x,\pi_{m-1})-f^n(x,\pi_m))^2] \leq \frac{1}{N} \sum_{m=1}^N (c^{nmk} \operatorname{E}_{\mu^{mk} \times \lambda_{\psi^{nm}}^k}[\operatorname{E}_{U(\pi)^{mkj} \times U(\tilde{\pi})^{mkj}}[(\pi^{mkj}(x,y)-\tilde{\pi}^{mkj}(x,y))^2]]^{\alpha^n} + \delta^{nm}(k))$

Denoting $$c^{nk}:=\max_{m \in \Sigma} c^{nmk}$$, $$\delta^n:=\frac{1}{N} \sum_{m=1}^N \delta^{nm}$$ we get

$\operatorname{E}_{\mu^{nk}}[(\frac{1}{N} \sum_{m=1}^N (f^n(x,\pi_{m-1})-f^n(x,\pi_m)))^2] \leq c^{nk} (\frac{1}{N} \sum_{m=1}^N \operatorname{E}_{\mu^{mk} \times \lambda_{\psi^{nm}}^k}[\operatorname{E}_{U(\pi)^{mkj} \times U(\tilde{\pi})^{mkj}}[(\pi^{mkj}(x,y)-\tilde{\pi}^{mkj}(x,y))^2]])^{\alpha^n} + \delta^n(k)$

$\operatorname{E}_{\mu^{nk}}[(f^n(x,\pi)-f^n(x,\tilde{\pi}))^2] \leq N^2 c^{nk} (\frac{1}{N} \sum_{m=1}^N \operatorname{E}_{\mu^{mk} \times \lambda_{\psi^{nm}}^k}[\operatorname{E}_{U(\pi)^{mkj} \times U(\tilde{\pi})^{mkj}}[(\pi^{mkj}(x,y)-\tilde{\pi}^{mkj}(x,y))^2]])^{\alpha^n} + N^2 \delta^n(k)$

# Proof of Theorem 4

Consider $$n,m \in \mathcal{P}$$ and $$\pi,\tilde{\pi} \in \Pi_{\mathcal{P}}$$ s.t. $$\forall l \in \mathcal{P} \setminus m,k,j \in {\mathbb{N}}: \pi^{lkj}=\tilde{\pi}^{lkj}$$.

$\operatorname{E}_{\mu_G^{nk}}[(u_{\phi,T}^{nk}(x,a,\pi)-u_{\phi,T}^{nk}(x,a,\tilde{\pi}))^2] \leq \operatorname{E}_{\mu^k}[d_{TV}(s_{G,\phi,T}^{\bar{n}k}(x,\pi),s_{G,\phi,T}^{\bar{n}k}(x,\tilde{\pi}))^2]$

$\operatorname{E}_{\mu_G^{nk}}[(u_{\phi,T}^{nk}(x,a,\pi)-u_{\phi,T}^{nk}(x,a,\tilde{\pi}))^2] \leq \operatorname{E}_{\mu^k}[d_{TV}(s_{G,\phi,T}^{mk}(x,\pi),s_{G,\phi,T}^{mk}(x,\tilde{\pi}))^2]$

$\operatorname{E}_{\mu_G^{nk}}[(u_{\phi,T}^{nk}(x,a,\pi)-u_{\phi,T}^{nk}(x,a,\tilde{\pi}))^2] \leq \operatorname{E}_{\mu^k}[(\frac{1}{2} \sum_{b \in \mathcal{A}^m} {\lvert \operatorname{Pr}_{s_{G,\phi,T}^{mk}(x,\pi)}[b]-\operatorname{Pr}_{s_{G,\phi,T}^{mk}(x,\tilde{\pi})}[b] \rvert})^2]$

Slightly abusing notation, define $$s_{G,\phi,T}^m: {\mathbb{N}}^2 \times \operatorname{supp}\mu \times {{\{ 0, 1 \}^*}}\times \Pi_{\mathcal{P}} \rightarrow \Delta_G^m$$ by

$\operatorname{Pr}_{s_{G,\phi,T}^{mkj}(x,y,\pi)}[a]:=Z_{G,\phi,T}^{mkj}(x,y,\pi)^{-1} 2^{(T^{mk})^{-1}\pi^{mkj}((k,x,a),y)}$

We have

$s_{G,\phi,T}^{mk}(x,\pi) = \operatorname{E}_{\lambda_\phi^k}[\operatorname{E}_{U(\pi)^{mkj}}[s_{G,\phi,T}^{mkj}(x,y,\pi)]]$

$\operatorname{E}_{\mu_G^{nk}}[(u_{\phi,T}^{nk}(x,a,\pi)-u_{\phi,T}^{nk}(x,a,\tilde{\pi}))^2] \leq \operatorname{E}_{\mu^k}[(\frac{1}{2} \sum_{b \in \mathcal{A}^m} {\lvert \operatorname{E}_{\lambda_\phi^k}[\operatorname{E}_{U(\pi)^{mkj}}[\operatorname{Pr}_{s_{G,\phi,T}^{mkj}(x,y,\pi)}[b]]-\operatorname{E}_{U(\tilde{\pi})^{mkj}}[\operatorname{Pr}_{s_{G,\phi,T}^{mkj}(x,y,\tilde{\pi})}[b]]] \rvert})^2]$

$\operatorname{E}_{\mu_G^{nk}}[(u_{\phi,T}^{nk}(x,a,\pi)-u_{\phi,T}^{nk}(x,a,\tilde{\pi}))^2] \leq \operatorname{E}_{\mu^k}[\operatorname{E}_{\lambda_\phi^k}[\operatorname{E}_{U(\pi)^{mkj} \times U(\tilde{\pi})^{mkj}}[d_{TV}(s_{G,\phi,T}^{mkj}(x,y,\pi),s_{G,\phi,T}^{mkj}(x,y,\tilde{\pi}))]]^2]$

$\operatorname{E}_{\mu_G^{nk}}[(u_{\phi,T}^{nk}(x,a,\pi)-u_{\phi,T}^{nk}(x,a,\tilde{\pi}))^2] \leq \operatorname{E}_{\mu^k \times \lambda_\phi^k}[\operatorname{E}_{U(\pi)^{mkj} \times U(\tilde{\pi})^{mkj}}[d_{TV}(s_{G,\phi,T}^{mkj}(x,y,\pi),s_{G,\phi,T}^{mkj}(x,y,\tilde{\pi}))^2]]$

Using Pinsker’s inequality

$\operatorname{E}_{\mu_G^{nk}}[(u_{\phi,T}^{nk}(x,a,\pi)-u_{\phi,T}^{nk}(x,a,\tilde{\pi}))^2] \leq \frac{1}{2} \operatorname{E}_{\mu^k \times \lambda_\phi^k}[\operatorname{E}_{U(\pi)^{mkj} \times U(\tilde{\pi})^{mkj}}[{\operatorname{D}_{\mathrm{KL}}(s_{G,\phi,T}^{mkj}(x,y,\pi) \| s_{G,\phi,T}^{mkj}(x,y,\tilde{\pi}))}]]$

Using Proposition B.3

$\operatorname{E}_{\mu_G^{nk}}[(u_{\phi,T}^{nk}(x,a,\pi)-u_{\phi,T}^{nk}(x,a,\tilde{\pi}))^2] \leq \frac{1}{2} (T^{mk})^{-1} \operatorname{E}_{\mu^k \times \lambda_\phi^k}[\operatorname{E}_{U(\pi)^{mkj} \times U(\tilde{\pi})^{mkj}}[\sum_{b \in \mathcal{A}^m}{\lvert \pi^{mkj}((k,x,b),y)-\tilde{\pi}^{mkj}((k,x,b),y) \rvert}]]$

$\operatorname{E}_{\mu_G^{nk}}[(u_{\phi,T}^{nk}(x,a,\pi)-u_{\phi,T}^{nk}(x,a,\tilde{\pi}))^2] \leq \frac{1}{2} \#\mathcal{A}^m (T^{mk})^{-1} \operatorname{E}_{\mu_G^{mk} \times \lambda_\phi^k}[\operatorname{E}_{U(\pi)^{mkj} \times U(\tilde{\pi})^{mkj}}[{\lvert \pi^{mkj}((k,x,b),y)-\tilde{\pi}^{mkj}((k,x,b),y) \rvert}]]$

$\operatorname{E}_{\mu_G^{nk}}[(u_{\phi,T}^{nk}(x,a,\pi)-u_{\phi,T}^{nk}(x,a,\tilde{\pi}))^2] \leq \frac{1}{2} \#\mathcal{A}^m (T^{mk})^{-1} \operatorname{E}_{\mu_G^{mk} \times \lambda_\phi^k}[\operatorname{E}_{U(\pi)^{mkj} \times U(\tilde{\pi})^{mkj}}[(\pi^{mkj}((k,x,b),y)-\tilde{\pi}^{mkj}((k,x,b),y))^2]]^{\frac{1}{2}}$

Applying Proposition B.6 we get the desired result.

 by Paul Christiano 1066 days ago | Patrick LaVictoire likes this | link I find these theorem statements quite hard to follow. It looks like this is doing roughly the right thing and would be better to work with than reflective oracles (i.e. to the extent that there were new complications, those complications would represent real and important phenomena that were simply covered up by the simplifying assumptions). In that case I am quite interested in AIXI-with-optimal-predictors (though probably I’m most interested in a more accessible presentation, and I really feel like it should be possible to have a cleaner formalism). As an example, it seems like this result might be of broad interest to computer scientists, since e.g. there is no polynomial time algorithm that finds an approximate Nash equilibria in general games, even apparently simple ones. An extremely natural open question is “so what should we actually expect to happen when extremely rational players play such games?” It looks like this approach may be able to give some kind of an answer via an appropriate notion of boundedly optimal, but it is pretty hard to understand whether this approach does give a satisfying answer, and if so what it is. I think that the relevant take-away is in Theorem 1. My impression is that you are holding the strategy sets constant while effectively taking a limit over computational resources of the players (and potentially the difficulty of understanding the payoff matrix). Is that right? In that case, I expect convergence to the N.E. to only occur once the computational resources are exponential in the size of the game. (Of course that would still be a big improvement over uncomputability, I’m just trying to understand what is going on.) reply
 by Vadim Kosoy 1065 days ago | link AIXI-with-optimal-predictors: I believe this is relatively straightforward. However, my plan for the next step was adapting these results to a decision rule based on logical counterfactuals in a way which produces metathreat equilibria. Bounded Nash equilibria: I don’t think the concept is entirely novel. I’ve seen some papers which discuss Nash-like equilibria with computational resource bounds, although the the area seems to remain largely unexplored. The particular setting I use here is not very relevant to what you’re suggesting since finding Nash equilibria is non-polynomial in the number of strategies whereas here I keep the number of strategies constant. Instead, the complexity comes from the dependence of the payoff tensor on the parameter sampled from $$\mu^k$$. Your description of Theorem 1 is more or less correct except there’s only a “payoff vector” here since this is a 1 player setting. The multiplayer setting is used in Corollary 2. Regarding dependence on game size, it is not as bad as exponential. The Lipton-Markakis-Mehta algorithm finds $$\epsilon$$-equilibria in time $$O(n^{\frac{\log n}{\epsilon^2}})$$ reply

### NEW DISCUSSION POSTS

[Note: This comment is three
 by Ryan Carey on A brief note on factoring out certain variables | 0 likes

There should be a chat icon
 by Alex Mennen on Meta: IAFF vs LessWrong | 0 likes

Apparently "You must be
 by Jessica Taylor on Meta: IAFF vs LessWrong | 1 like

There is a replacement for
 by Alex Mennen on Meta: IAFF vs LessWrong | 1 like

Regarding the physical
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think that we should expect
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think I understand your
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

This seems like a hack. The
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

After thinking some more,
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yes, I think that we're
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

My intuition is that it must
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

To first approximation, a
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Actually, I *am* including
 by Vadim Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yeah, when I went back and
 by Alex Appel on Optimal and Causal Counterfactual Worlds | 0 likes

> Well, we could give up on
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes