Intelligent Agent Foundations Forum
Reflection with optimal predictors
post by Vadim Kosoy

A change in terminology: It is convenient when important concepts have short names. The concept of an “optimal predictor scheme” seems much more important than its historical predecessor, the “optimal predictor”. Therefore, “optimal predictor schemes” will henceforth be called simply “optimal predictors”, while the previous concept of “optimal predictor” might be called “flat optimal predictor”.

We study systems of computations which have access to optimal predictors for each other. We expect such systems to play an important role in decision theory (where self-prediction is required to define logical counterfactuals and mutual prediction is required for a collection of agents in a game) and Vingean reflection (where the different computations correspond to different successor agents). The previously known existence theorems for optimal predictors are not directly applicable to this case. To overcome this we prove new, specifically tailored existence theorems.

The Results section states the main novelties, Appendix A contains adaptations of old theorems, Appendix B proves selected claims from Appendix A, and Appendix C proves the novel results.

Results

Notation

Given sets \(X\) and \(Y\), \(X \rightarrow Y\) will denote the set of mappings from \(X\) to \(Y\).


Before taking on reflection, we introduce a stronger concept of optimal predictor, to which the previous existence theorems still apply.

Definition 1

Let \(r\) be a positive integer. A proto-error space of rank \(r\) is a set \(\mathcal{E}\) of bounded functions from \({\mathbb{N}}^r\) to \({\mathbb{R}}^{\geq 0}\) s.t.

  1. If \(\delta_1, \delta_2 \in \mathcal{E}\) then \(\delta_1 + \delta_2 \in \mathcal{E}\).

  2. If \(\delta_1 \in \mathcal{E}\) and \(\delta_2 \leq \delta_1\) then \(\delta_2 \in \mathcal{E}\).

  3. There is a polynomial \(h: {\mathbb{N}}^r \rightarrow {\mathbb{R}}\) s.t. \(2^{-h} \in \mathcal{E}\).
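
As a quick illustration (ours, not part of the original definition): for rank 1, the set of all bounded \(\delta: {\mathbb{N}} \rightarrow {\mathbb{R}}^{\geq 0}\) s.t. \(\delta(k) \leq c \cdot 2^{-k}\) for some \(c \in {\mathbb{R}}^{>0}\) satisfies all three conditions: it is closed under sums, downward closed, and contains \(2^{-h}\) for \(h(k)=k\).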

Proposition 1

If \(\mathcal{E}\) is a proto-error space of rank \(r\) and \(\alpha \in {\mathbb{R}}^{>0}\), then \(\mathcal{E}^\alpha:=\{\delta^\alpha \mid \delta \in \mathcal{E}\}\) is also a proto-error space of rank \(r\).

Proposition 2

If \(\mathcal{E}\) is a proto-error space of rank \(r\), \(\alpha,\gamma \in {\mathbb{R}}^{>0}\) and \(\alpha < \gamma\) then \(\mathcal{E}^\gamma \subseteq \mathcal{E}^\alpha\).

Definition 3

Fix \(\mathcal{E}\) a proto-error space of rank 2 and \((f, \mu)\) a distributional estimation problem. Consider \(\hat{P}\) a \((poly, log)\)-predictor.

\(\hat{P}\) is called an \(\mathcal{E}(poly, log)\)-optimal predictor for \((f,\mu)\) when for any \((poly, log)\)-predictor \(\hat{Q}\), there is \(\delta \in \mathcal{E}\) s.t.

\[E_{\mu^k \times U^{r_P(k,j)}}[(\hat{P}^{kj}(x)-f(x))^2] \leq E_{\mu^k \times U^{r_Q(k,j)}}[(\hat{Q}^{kj}(x)-f(x))^2] + \delta(k,j)\]

\(\hat{P}\) is called an \(\mathcal{E}^*(poly, log)\)-optimal predictor for \((f,\mu)\) when there is \(\alpha > 0\) s.t. \(\hat{P}\) is an \(\mathcal{E}^{\alpha}(poly, log)\)-optimal predictor for \((f,\mu)\).
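
To make the comparison above concrete, here is a small Monte Carlo sketch (ours; the harness and names are illustrative assumptions, not part of the formalism) estimating the \(\mu^k\)-expected squared error that the defining inequality compares:

```python
import random

# Illustrative sketch: estimate E[(predictor(x) - f(x))^2] over x ~ mu^k by
# Monte Carlo. `sample_mu_k` draws x from mu^k; `predictor` consumes a
# uniform variable standing in for its U^{r(k,j)} coin flips.
def mean_squared_error(predictor, f, sample_mu_k, trials=10000):
    total = 0.0
    for _ in range(trials):
        x = sample_mu_k()
        total += (predictor(x, random.random()) - f(x)) ** 2
    return total / trials

# P is (empirically) optimal against Q at (k, j) up to delta(k, j) when
# mean_squared_error(P, f, s) <= mean_squared_error(Q, f, s) + delta(k, j).
```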


\(\mathcal{E}^*(poly, log)\)-optimal predictors have properties closely parallel to those of \(\Delta(poly, log)\)-optimal predictors. The corresponding theorems are listed in Appendix A. Most theorems are given without proof, as the proofs are closely analogous to before, with the exception of Theorem A.4, which is proven in Appendix B.

We now consider a generalization in which the advice is allowed to be random.

Definition 4

Given appropriate sets \(X\) and \(Y\), consider \(Q: {\mathbb{N}}^2 \times X \times {{\{ 0, 1 \}^*}}^2 \xrightarrow{alg} Y\), \(r_Q: {\mathbb{N}}^2 \rightarrow {\mathbb{N}}\) polynomial and \(\{\sigma_Q^{kj}: {{\{ 0, 1 \}^*}}\rightarrow [0,1]\}_{k,j \in {\mathbb{N}}}\) a family of probability measures. The triple \(\hat{Q}=(Q, r_Q, \sigma_Q)\) is called a \((poly, rlog)\)-bischeme of signature \(X \rightarrow Y\) when

  1. The runtime of \(Q^{kj}(x, y, z)\) is bounded by \(p(k, j)\) with \(p\) polynomial.

  2. \(\operatorname{supp}\sigma_Q^{kj} \subseteq {{\{ 0, 1 \}^{\leq{c {\lfloor \log (k+2) + \log (j+2) \rfloor}}}}}\) for some \(c \in {\mathbb{N}}\).

We will use the notation \(\hat{Q}^{kj}(x,y)\) to stand for the obvious \(Y\)-valued \(\sigma_Q^{kj}\)-random variable and the notation \(\hat{Q}^{kj}(x)\) to stand for the obvious \(Y\)-valued \(U^{r_Q(k,j)} \times \sigma_Q^{kj}\)-random variable.

A \((poly, rlog)\)-bischeme of signature \({{\{ 0, 1 \}^*}}\rightarrow Y\) will also be called a \(Y\)-valued \((poly, rlog)\)-bischeme.

A \([0,1]\)-valued \((poly, rlog)\)-bischeme will also be called a \((poly, rlog)\)-predictor.

Note 1

Conceptually, advice corresponds to precomputation which might be expensive or even “hyperprecomputation”. Random advice corresponds to random precomputation. Random logarithmic advice can always be replaced by deterministic polynomial advice at the cost of a small rounding error.
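
As a minimal sketch of the last point (ours; all names are illustrative): the probability table of \(\sigma_Q^{kj}\), which has polynomially many entries, rounded to polynomially many binary places, fits into deterministic polynomial advice, and the advice string can then be sampled from the scheme's own coin flips:

```python
import random

# Sketch (ours): replacing random logarithmic advice by deterministic
# polynomial advice. A distribution over O(log)-length strings has poly-size
# support, so its rounded probability table fits into polynomial advice;
# sampling from the rounded table uses the scheme's own coin flips and
# introduces only a small rounding error.

def round_table(sigma, bits):
    """Round each probability to `bits` binary places; give the residual
    mass to one fixed string (total variation error <= len(sigma) * 2**-bits)."""
    scale = 2 ** bits
    table = {a: int(p * scale) / scale for a, p in sigma.items()}
    first = next(iter(table))
    table[first] += 1.0 - sum(table.values())
    return table

def sample_advice(table, u):
    """Sample an advice string using a uniform u in [0,1) derived from coin flips."""
    acc = 0.0
    for advice, p in sorted(table.items()):
        acc += p
        if u < acc:
            return advice
    return advice  # guard against floating point slack

sigma = {"00": 1 / 3, "01": 2 / 3}   # random advice distribution
table = round_table(sigma, bits=20)  # becomes deterministic polynomial advice
print(sample_advice(table, random.random()))
```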

Definition 5

Fix \(\mathcal{E}\) a proto-error space of rank 2 and \((f, \mu)\) a distributional estimation problem. Consider \(\hat{P}\) a \((poly, rlog)\)-predictor. \(\hat{P}\) is called an \(\mathcal{E}(poly, rlog)\)-optimal predictor for \((f,\mu)\) when for any \((poly, rlog)\)-predictor \(\hat{Q}\), there is \(\delta \in \mathcal{E}\) s.t.

\[E_{\mu^k \times U^{r_P(k,j)} \times \sigma_P^{kj}}[(\hat{P}^{kj}(x)-f(x))^2] \leq E_{\mu^k \times U^{r_Q(k,j)} \times \sigma_Q^{kj}}[(\hat{Q}^{kj}(x)-f(x))^2] + \delta(k,j)\]


The concept of an \(\mathcal{E}(poly, rlog)\)-optimal predictor is essentially of the same strength as that of an \(\mathcal{E}(poly, log)\)-optimal predictor, as seen in the following two propositions.

Proposition 3

Fix \(\mathcal{E}\) a proto-error space of rank 2 and \((f, \mu)\) a distributional estimation problem. Consider \(\hat{P}\) an \(\mathcal{E}(poly, log)\)-optimal predictor. Then it defines an \(\mathcal{E}(poly, rlog)\)-optimal predictor by setting \(\sigma_P^{kj}(a_P^{kj}) = 1\).

Proposition 4

Fix \(\mathcal{E}\) a proto-error space of rank 2 and \((f, \mu)\) a distributional estimation problem. If \((f, \mu)\) admits an \(\mathcal{E}(poly, rlog)\)-optimal predictor then it admits an \(\mathcal{E}(poly, log)\)-optimal predictor.


We are now ready to introduce the key abstraction.

Definition 6

Given a set \(\Sigma\), denote \(\Pi_\Sigma := \Sigma \times {\mathbb{N}}^2 \rightarrow {{\{ 0, 1 \}^*}}\times {\mathbb{N}}\) equipped with the product topology.

A reflective system is a triple \((\Sigma, f, \mu)\) where \(\Sigma\) is a set, \(\{\mu_n^k: {{\{ 0, 1 \}^*}}\rightarrow [0,1]\}_{n \in \Sigma, k \in {\mathbb{N}}}\) is a collection of probability measures where we regard each \(\mu_n\) as a word ensemble and \(\{f_n: \operatorname{supp}\mu_n \times \Pi_\Sigma \rightarrow [0,1]\}_{n \in \Sigma}\) is a collection of continuous functions (here \(\operatorname{supp}\mu_n\) has the discrete topology or, equivalently, we only require continuity in the second variable).


The motivation behind this definition is regarding \(\Pi_\Sigma\) as the space of possible predictor programs, where the first factor of \({{\{ 0, 1 \}^*}}\times {\mathbb{N}}\) is the program itself (including advice) and the second factor is the number of intrinsic coin flips. Thus the \(f_n\) represent a system of computations in which each has access to the source code of predictors for the entire system.
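
As a toy illustration (ours, not from the post): take \(\Sigma = \{0\}\), let \(\mu_0\) be any word ensemble and set

\[f_0(x, \pi) := 1 - \sum_{j \in {\mathbb{N}}} 2^{-j-1} o(\pi_0^{|x|j}, x)\]

where \(o(\pi_0^{kj}, x) \in [0,1]\) denotes the expected value, over the declared coin flips, of the number encoded by the output of the program coordinate \(\pi_0^{kj}\) on input \(x\). Each summand depends on a single coordinate of \(\pi\) and the series converges uniformly, so \(f_0\) is continuous. This is a "matching pennies" system: the quantity to be predicted is one minus a weighted average of the system's own predictions, which forces the predictions of any optimal predictor system towards \(\frac{1}{2}\).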

Definition 7

Consider a reflective system \(\mathcal{R}=(\Sigma, f, \mu)\) and a collection of \((poly,rlog)\)-predictors \(\{\hat{Q}_n=(Q_n,r_n,\sigma_n)\}_{n \in \Sigma}\).

Denote \(A_\Sigma := \Sigma \times {\mathbb{N}}^2 \rightarrow {{\{ 0, 1 \}^*}}\) equipped with the product \(\sigma\)-algebra. Denote \(\sigma:=\prod_{n,k,j} \sigma_n^{kj}\), a probability measure on \(A_\Sigma\).

Given \(n \in \Sigma\), \(k,j \in {\mathbb{N}}\) and \(a \in {{\{ 0, 1 \}^*}}\), let \(Q_n^{kj}[a] \in {{\{ 0, 1 \}^*}}\) denote the program that computes \(Q_n^{kj}(x,y,a)\) given input \(x,y \in {{\{ 0, 1 \}^*}}\). Given \(a \in A_\Sigma\), define \(\hat{Q}[a] \in \Pi_\Sigma\) by \(\hat{Q}[a]_n^{kj}:=(Q_n^{kj}[a_n^{kj}],r_n(k,j))\).

Given \(n \in \Sigma\), \(\mathcal{R}[\hat{Q}]_n: \operatorname{supp}\mu_n \rightarrow [0,1]\) is defined as follows

\[\mathcal{R}[\hat{Q}]_n(x):=E_\sigma[f_n(x,\hat{Q}[a])]\]

The expectation value is well-defined thanks to the continuity of \(f_n\).

Definition 8

Fix \(\mathcal{E}\) a proto-error space of rank 2 and \(\mathcal{R}=(\Sigma,f,\mu)\) a reflective system. A collection \(\{P_n\}_{n \in \Sigma}\) of \((poly,rlog)\)-predictors is called an \(\mathcal{E}(poly,rlog)\)-optimal predictor system for \(\mathcal{R}\) when for any \(n \in \Sigma\), \(P_n\) is an \(\mathcal{E}(poly,rlog)\)-optimal predictor for \((\mathcal{R}[P]_n,\mu_n)\).

Construction 1

Given \(\phi \in \Phi\), denote \(\mathcal{E}_{1(\phi)}\) the set of bounded functions \(\delta: {\mathbb{N}}\rightarrow {\mathbb{R}}^{\geq 0}\) s.t.

\[\forall \epsilon \in (0,1): {\lim_{k \rightarrow \infty}} \phi(k)^{1-\epsilon} \delta(k) = 0\]
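
For example (ours, assuming the identity function \(\phi(k) = k\) belongs to \(\Phi\)): \(\delta(k) = (k+1)^{-1}\) lies in \(\mathcal{E}_{1(\phi)}\), since \(k^{1-\epsilon}(k+1)^{-1} \rightarrow 0\) for every \(\epsilon \in (0,1)\), whereas \(\delta(k) = (k+1)^{-\frac{1}{2}}\) does not, since \(k^{1-\epsilon}(k+1)^{-\frac{1}{2}}\) diverges for \(\epsilon < \frac{1}{2}\).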

Given \(\phi \in \Phi\), denote \(\mathcal{E}_{2(ll,\phi)}\) the set of bounded functions \(\delta: {\mathbb{N}}^2 \rightarrow {\mathbb{R}}^{\geq 0}\) s.t.

\[\forall \psi \in \Phi: \psi \leq \phi \implies E_{\lambda_\psi^k}[\delta(k,j)] \in \mathcal{E}_{1(\psi)}\]

Denote

\[\mathcal{E}_{2(ll)} := \bigcap_{\phi \in \Phi} \mathcal{E}_{2(ll,\phi)}\]

Proposition 5

If \(\phi \in \Phi\) is s.t. \(\exists n: {\lim_{k \rightarrow \infty}} 2^{-k^n} \phi(k) = 0\), then \(\mathcal{E}_{1(\phi)}\) is a proto-error space of rank 1.

For any \(\phi \in \Phi\), \(\mathcal{E}_{2(ll,\phi)}\) is a proto-error space of rank 2.

\(\mathcal{E}_{2(ll)}\) is a proto-error space of rank 2.

Note 2

\(\mathcal{E}_{2(ll)}^*\)-optimality is strictly stronger than \(\Delta_{ll}^2\)-optimality. We exploit this strength in Theorem 2 below which is the main motivation for introducing \(\mathcal{E}\)-optimality at this point.

Theorem 1 (general existence theorem)

Any reflective system has an \(\mathcal{E}_{2(ll)}(poly,rlog)\)-optimal predictor system.


We also prove a more restricted existence theorem which allows deterministic advice.

Definition 9

Consider a set \(\Sigma\) and \(\{\mu_n^k: {{\{ 0, 1 \}^*}}\rightarrow [0,1]\}_{n \in \Sigma, k \in {\mathbb{N}}}\) a collection of probability measures. Denote \(D_\mu:=\{(n \in \Sigma, k \in {\mathbb{N}}, j \in {\mathbb{N}}, x \in {{\{ 0, 1 \}^*}}) \mid \mu_n^k(x) > 0\}\). Denote \(\Omega_\mu := D_\mu \rightarrow [0,1]\).

Fix a set \(\Sigma\) and a collection \(\{\phi_n \in \Phi\}_{n \in \Sigma}\). A \(\phi\)-reflective system is a pair \((f,\mu)\) where \(\mu\) is as above and \(\{f_n: \operatorname{supp}\mu_n \times \Omega_\mu \rightarrow [0,1]\}_{n \in \Sigma}\) is a collection of functions s.t. there are collections \(\{\psi_{nm} \in \Phi\}_{n,m \in \Sigma}\), \(\{\alpha_n \in (0,1]\}_{n \in \Sigma}\), \(\{c_n \in {\mathbb{R}}^{>0}\}_{n \in \Sigma}\) and probability measures \(\{\rho_n: \Sigma \rightarrow [0,1]\}_{n \in \Sigma}\) satisfying

\[\forall n,m \in \Sigma: \psi_{nm} \geq \phi_n\]

\[\forall n \in \Sigma, k \in {\mathbb{N}}, q,\tilde{q} \in \Omega_\mu: E_{\mu_n^k}[(f_n(x,q)-f_n(x,\tilde{q}))^2] \leq c_n (E_{\rho_n}[E_{\lambda_{\psi_{nm}}^k \times \mu_m^k}[(q_m^{kj}(x)-\tilde{q}_m^{kj}(x))^2]])^{\alpha_n}\]

Note 3

The condition on \(f\) is a kind of Hölder condition, uniform over \(k\): viewed as a function of the probability assignment \(q \in \Omega_\mu\), each \(f_n\) is Hölder continuous with exponent \(\alpha_n\) and constant \(c_n\) with respect to a weighted \(L^2\)-type metric, where the weights \(\rho_n\), \(\lambda_{\psi_{nm}}^k\) and \(\mu_m^k\) determine how strongly \(f_n\) may depend on the predictions for each computation \(m\) at each \((k,j)\).

Proposition 6

Fix a set \(\Sigma\) and a collection \(\{\phi_n \in \Phi\}_{n \in \Sigma}\). For any \(\phi\)-reflective system \((f,\mu)\) and \(n \in \Sigma\), \(f_n\) is continuous with respect to the product topology on \(\Omega_\mu\).

Definition 10

Fix a set \(\Sigma\) and a collection \(\{\phi_n \in \Phi\}_{n \in \Sigma}\).

Given \(m \in {\mathbb{N}}\), \(x, y_1, y_2 \ldots y_m \in {{\{ 0, 1 \}^*}}\), we let \(ev(x, y_1, y_2 \ldots y_m) \in {{\{ 0, 1 \}^{\leq \omega}}}\) stand for the evaluation of program \(x\) on inputs \(y_1, y_2 \ldots y_m\) without a time limit (we assume that on the output tape the machine head moves right iff it produces a symbol in \(\{0,1\}\) and cannot be moved left). We also extend \(\beta\) to be defined on \({{\{ 0, 1 \}^{\leq \omega}}}\) in the obvious way.

We define \(ex_\mu: \Pi_\Sigma \rightarrow \Omega_\mu\) by \(ex_\mu(\pi)_n^{kj}(x):=E_{U^{(\pi_n^{kj})_2}}[\beta(ev((\pi_n^{kj})_1,x,y))]\).

Consider a \(\phi\)-reflective system \(\mathcal{R}=(f, \mu)\) and a collection of \((poly,rlog)\)-predictors \(\{\hat{Q}_n=(Q_n,r_n,\sigma_n)\}_{n \in \Sigma}\). Given \(n \in \Sigma\), \(\mathcal{R}[\hat{Q}]_n: \operatorname{supp}\mu_n \rightarrow [0,1]\) is defined as follows

\[\mathcal{R}[\hat{Q}]_n(x):=E_\sigma[f_n(x,ex_\mu(\hat{Q}[a]))]\]

The expectation value is well-defined due to Proposition 6.

Definition 11

Fix a set \(\Sigma\), \(\{\mathcal{E}_n\}_{n \in \Sigma}\) a collection of proto-error spaces of rank 2, a collection \(\{\phi_n \in \Phi\}_{n \in \Sigma}\) and \(\mathcal{R}=(f,\mu)\) a \(\phi\)-reflective system. A collection \(\{P_n\}_{n \in \Sigma}\) of \((poly,log)\)-predictors is called an \(\mathcal{E}^*(poly,log)\)-optimal predictor system for \(\mathcal{R}\) when for any \(n \in \Sigma\) there is \(\gamma \in {\mathbb{R}}^{>0}\) s.t. \(P_n\) is an \(\mathcal{E}_n^\gamma(poly,log)\)-optimal predictor for \((\mathcal{R}[P]_n,\mu_n)\).

Theorem 2

Consider a finite set \(\Sigma\) and a collection \(\{\phi_n \in \Phi\}_{n \in \Sigma}\). Any \(\phi\)-reflective system has an \(\mathcal{E}_{2(ll,\phi)}^*(poly,log)\)-optimal predictor system.

Appendix A

Definition A.1

\(\mathcal{E}\), a proto-error space of rank \(r\), is called ample when there is a polynomial \(h: {\mathbb{N}}^r \rightarrow {\mathbb{R}}^{>0}\) s.t. \(\frac{1}{h} \in \mathcal{E}\).


Fix \(\mathcal{E}\), a proto-error space of rank 2.

Theorem A.1

Assume \(\mathcal{E}\) is ample. Consider \((f, \mu)\) a distributional estimation problem, \(\hat{P}\) an \(\mathcal{E}(poly,rlog)\)-optimal predictor for \((f, \mu)\) and \(\{p_{kj} \in [0,1]\}_{k,j \in {\mathbb{N}}}\), \(\{q_{kj} \in [0,1]\}_{k,j \in {\mathbb{N}}}\). Define \(\gamma: {\mathbb{N}}^2 \rightarrow {\mathbb{R}}^{>0}\) by

\[ \gamma(k,j) := Pr_{\mu^k \times U^{r_P(k,j)} \times \sigma_P^{kj}}[p_{kj} \leq \hat{P}^{kj} \leq q_{kj}]^{-1}\]

Define

\[ \phi_{kj} := {E_{\mu^k \times U^{r_P(k,j)} \times \sigma_P^{kj}}[f-\hat{P}^{kj} \mid p_{kj} \leq \hat{P}^{kj} \leq q_{kj}]} \]

Assume that either \(p_{kj}, q_{kj}\) have a number of digits logarithmically bounded in \(k,j\) or \(\hat{P}^{kj}\) produces outputs with a number of digits logarithmically bounded in \(k,j\). Then, \(|\phi| \in (\gamma \mathcal{E})^{\frac{1}{2}}\).

Theorem A.2

Consider \(\mu\) a word ensemble and \(f_1, f_2: \operatorname{supp}\mu \rightarrow [0,1]\) s.t. \(f_1 + f_2 \leq 1\). Suppose \(\hat{P}_1\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((f_1, \mu)\) and \(\hat{P}_2\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((f_2, \mu)\). Then, \(\hat{P}_1 + \hat{P}_2\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((f_1 + f_2, \mu)\).

Theorem A.3

Consider \(\mu\) a word ensemble and \(f_1, f_2: \operatorname{supp}\mu \rightarrow [0,1]\) s.t. \(f_1 + f_2 \leq 1\). Suppose \(\hat{P}_1\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((f_1, \mu)\) and \(\hat{P}_2\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((f_1 + f_2, \mu)\). Then, \(\hat{P}_2 - \hat{P}_1\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((f_2, \mu)\).

Definition A.2

Given \(\mathcal{E}\) a proto-error space of rank \(r\), the associated error space is \(\mathcal{E}^{\frac{1}{\infty}}:=\bigcup_{\epsilon > 0} \mathcal{E}^{\epsilon}\).

Theorem A.4

Consider \((f_1, \mu_1)\), \((f_2, \mu_2)\) distributional estimation problems with respective \(\mathcal{E}^*(poly,rlog)\)-optimal predictors \(\hat{P}_1\) and \(\hat{P}_2\). Assume \(\mu_1\) is \(\mathcal{E}^{\frac{1}{\infty}}(log)\)-sampleable and \((f_2, \mu_2)\) is \(\mathcal{E}^{\frac{1}{\infty}}(log)\)-generatable. Then, \(\hat{P}_1 \times \hat{P}_2\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((f_1 \times f_2, \mu_1 \times \mu_2)\).

Theorem A.5

Consider \(\mu\) a word ensemble, \(f: \operatorname{supp}\mu \rightarrow [0,1]\) and \(D \subseteq {{\{ 0, 1 \}^*}}\). Assume \(\hat{P}_D\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((D, \mu)\) and \(\hat{P}_{f \mid D}\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((f, \mu \mid D)\). Then \(\hat{P}_D \hat{P}_{f \mid D}\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((\chi_D f, \mu)\).

Definition A.3

We define the stabilizer of \(\mathcal{E}\), denoted \(\operatorname{stab}\mathcal{E}\), to be the set of functions \(\gamma: {\mathbb{N}}^2 \rightarrow {\mathbb{R}}^{>0}\) s.t. for any \(\delta \in \mathcal{E}\) we have \(\gamma \delta \in \mathcal{E}\).

Theorem A.6

Fix \(h\) a polynomial s.t. \(2^{-h} \in \mathcal{E}\). Consider \(\mu\) a word ensemble, \(f: \operatorname{supp}\mu \rightarrow [0,1]\) and \(D \subseteq {{\{ 0, 1 \}^*}}\). Assume \(\mu^k(D)^{-1} \in \operatorname{stab}\mathcal{E}\). Assume \(\hat{P}\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((D, \mu)\) and \(\hat{P}_{\chi_D f}\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((\chi_D f, \mu)\). Define \(\hat{P}_{f \mid D}\) by

\[\hat{P}^{kj}_{f \mid D}(x) := \begin{cases}1 & \text{if } \hat{P}^{kj}_D(x) = 0 \\ \eta(\frac{\hat{P}^{kj}_{\chi_D f}(x)}{\hat{P}^{kj}_D(x)}) & \text{rounded to }h(k,j)\text{ binary places if } \hat{P}^{kj}_D(x) > 0 \end{cases}\]

Then, \(\hat{P}_{f \mid D}\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((f, \mu \mid D)\).
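
The following sketch (ours; we assume \(\eta\) denotes clamping to \([0,1]\), and the names are illustrative) renders the construction above in code:

```python
# Sketch (ours) of the predictor in Theorem A.6: estimate f conditioned on D
# as a clamped, rounded ratio of two predictors, where `p_chi_D_f` predicts
# chi_D * f and `p_D` predicts membership in D.
def predict_f_given_D(p_chi_D_f, p_D, x, k, j, h):
    d = p_D(x)
    if d == 0:
        return 1.0  # the convention for the case P^{kj}_D(x) = 0
    eta = min(max(p_chi_D_f(x) / d, 0.0), 1.0)  # assuming eta clamps to [0,1]
    scale = 2 ** h(k, j)
    return round(eta * scale) / scale  # rounded to h(k,j) binary places
```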

Definition A.4

Consider \(\mu\) a word ensemble, \(\hat{Q}_1\), \(\hat{Q}_2\) \((poly,rlog)\)-predictors.

We say \(\hat{Q}_1\) is \(\mathcal{E}\)-similar to \(\hat{Q}_2\) relative to \(\mu\) (denoted \(\hat{Q}_1 \underset{\mathcal{E}}{\overset{\mu}{\simeq}} \hat{Q}_2\)) when \(E_{\mu^k \times U^{r_1(k,j)} \times U^{r_2(k,j)} \times \sigma_1^{kj} \times \sigma_2^{kj}}[(\hat{Q}_1^{kj}(x)-\hat{Q}_2^{kj}(x))^2] \in \mathcal{E}\).

Let \(\Delta\) be an error space.

We say \(\hat{Q}_1\) is \(\Delta\)-similar to \(\hat{Q}_2\) relative to \(\mu\) (denoted \(\hat{Q}_1 \underset{\Delta}{\overset{\mu}{\simeq}} \hat{Q}_2\)) when \(E_{\mu^k \times U^{r_1(k,j)} \times U^{r_2(k,j)} \times \sigma_1^{kj} \times \sigma_2^{kj}}[(\hat{Q}_1^{kj}(x)-\hat{Q}_2^{kj}(x))^2] \in \Delta\).

Theorem A.7 (uniqueness theorem)

Consider \((f, \mu)\) a distributional estimation problem, \(\hat{P}\) an \(\mathcal{E}(poly,rlog)\)-optimal predictor for \((f, \mu)\) and \(\hat{Q}\) a \((poly,rlog)\)-predictor.

If \(\hat{Q}\) is an \(\mathcal{E}(poly,rlog)\)-optimal predictor for \((f, \mu)\) then \(\hat{P} \underset{\mathcal{E}^{\frac{1}{2}}}{\overset{\mu}{\simeq}} \hat{Q}\).

Conversely, if \(\hat{P} \underset{\mathcal{E}^{\frac{1}{\infty}}}{\overset{\mu}{\simeq}} \hat{Q}\) then \(\hat{Q}\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((f, \mu)\).

Definition A.5

\(\mathcal{E}\) is called stable when for any non-constant polynomial \(p: {\mathbb{N}}\rightarrow {\mathbb{N}}\) there is \(\alpha_p > 0\) s.t. for any \(\delta \in \mathcal{E}\), the function \(\delta'(k,j):=\delta(p(k),j)\) is in \(\mathcal{E}^{\alpha_p}\).

Proposition A.1

\(\mathcal{E}_{2(ll)}\) is stable.

Theorem A.8

Assume \(\mathcal{E}\) is ample and stable. Consider \((f, \mu)\), \((g, \nu)\) distributional estimation problems, \(\hat{\zeta}\) an \(\mathcal{E}^{\frac{1}{\infty}}\)-pseudo-invertible reduction of \((f, \mu)\) to \((g, \nu)\) and \(\hat{P}_g\) an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((g, \nu)\). Define \(\hat{P}_f\) by \(\hat{P}^{kj}_f(x):= \hat{P}^{kj}_g(\hat{\zeta}^{kj}(x))\). Then, \(\hat{P}_f\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((f, \mu)\).

Theorem A.9

Assume \(\mathcal{E}\) is ample and stable. Consider \((f, \mu)\), \((g, \nu)\) distributional estimation problems, \(\hat{\zeta}\) an \(\mathcal{E}^{\frac{1}{\infty}}\)-pseudo-invertible weak reduction of \((f, \mu)\) to \((g, \nu)\) and \(\hat{P}_g\) an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((g, \nu)\). Choose \(h: {\mathbb{N}}^2 \rightarrow {\mathbb{N}}\) a polynomial with \(\frac{1}{h} \in \mathcal{E}\) and define \(\hat{P}_f\) by

\[\hat{P}^{kj}_f(x):= \frac{1}{h(k,j)} \sum_{i=1}^{h(k,j)} \hat{P}^{kj}_g(\hat{\zeta}^{kj}(x \ldots) \ldots)\]

Here, the ellipses signify that each term corresponds to an independent sampling of \(U^{r_\zeta(k,j)} \times U^{r_g(k,j)} \times \sigma_g^{kj}\). Then, \(\hat{P}_f\) is an \(\mathcal{E}^*(poly,rlog)\)-optimal predictor for \((f, \mu)\).
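
Schematically (our sketch; the names are illustrative), the averaging construction reads:

```python
import random

# Sketch (ours) of the averaging in Theorem A.9: each of the h(k, j) terms
# draws fresh randomness for the weak reduction `zeta` and for the predictor
# `predict_g`; the independent samples are then averaged.
def predict_f(x, k, j, zeta, predict_g, h):
    n = h(k, j)
    total = 0.0
    for _ in range(n):
        total += predict_g(zeta(x, random.random()), random.random())
    return total / n
```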

Theorem A.10

Consider \((f,\mu)\) a distributional estimation problem, \(\phi \in \Phi\) and \(\hat{G}\) a weak \(\Delta_{sqp,\phi}^2(log)\)-generator for \((f,\mu)\). Then, \(\hat{\Lambda}[\hat{G}]\) is an \(\mathcal{E}_{2(ll,\phi)}(poly,log)\)-optimal predictor for \((f,\mu)\).

Appendix B

Definition B.1

Given \(n \in {\mathbb{N}}\), a function \(\delta: {\mathbb{N}}^{2+n} \rightarrow {\mathbb{R}}^{\geq 0}\) is called \(\mathcal{E}\)-moderate when

  1. \(\delta\) is non-decreasing in arguments \(3\) to \(2+n\).

  2. For any collection of polynomials \(\{p_i: {\mathbb{N}}^2 \rightarrow {\mathbb{N}}\}_{i < n}\), \(\delta(k, j, p_0(k,j) \ldots p_{n-1}(k,j)) \in \mathcal{E}\).


Lemmas B.1 and B.2 below are given for future reference (and as an aid in spelling out the proofs of the other theorems in Appendix A).

Lemma B.1

Fix \((f, \mu)\) a distributional estimation problem and \(\hat{P}\) a \((poly,rlog)\)-predictor. Then, \(\hat{P}\) is \(\mathcal{E}(poly,rlog)\)-optimal iff there is an \(\mathcal{E}\)-moderate function \(\delta: {\mathbb{N}}^4 \rightarrow [0,1]\) s.t. for any \(k,j,s \in {\mathbb{N}}\), \(Q: {{\{ 0, 1 \}^*}}^2 \xrightarrow{alg} [0,1]\)

\[E_{\mu^k \times U^{r_P(k,j)} \times \sigma_P^{kj}}[(P^{kj}(x,y,z)-f(x))^2] \leq E_{\mu^k \times U^s}[(Q(x,y)-f(x))^2] + \delta(k,j,T_Q^{\mu}(k,s),2^{|Q|})\]

Proof of Lemma B.1

Define

\[ \delta(k,j,t,u):=\max_{\substack{T_Q^{\mu}(k,s) \leq t \\ |Q| \leq \log u}} \max( E_{\mu^k \times U^{r_P(k,j)} \times \sigma_P^{kj}}[(P^{kj}(x,y,z)-f(x))^2] - E_{\mu^k \times U^s}[(Q(x,y)-f(x))^2] ,0) \]

Lemma B.2

Assume \(\mathcal{E}\) is ample. Fix \((f, \mu)\) a distributional estimation problem and \(\hat{P}\) a corresponding \(\mathcal{E}(poly,rlog)\)-optimal predictor. Consider \(\hat{Q}\) a \((poly,rlog)\)-predictor, \(M > 0\), \(\hat{w}\) a \({\mathbb{Q}}\cap [0,M]\)-valued \((poly,rlog)\)-bischeme. Assume \(r_w \geq \max(r_P,r_Q)\) and \(u_P, u_Q : {{\{ 0, 1 \}^*}}\rightarrow {{\{ 0, 1 \}^*}}\) are s.t. \(u_{P*}(\sigma_w) = \sigma_P\) and \(u_{Q*}(\sigma_w) = \sigma_Q\). Then there is \(\delta \in \mathcal{E}\) s.t.

\[ E_{\mu^k \times U^{r_w(k,j)} \times \sigma_w^{kj}}[w^{kj}(x,y,z)(P^{kj}(x,y_{\leq r_P(k,j)},u_P(z))-f(x))^2] \leq E_{\mu^k \times U^{r_w(k,j)} \times \sigma_w^{kj}}[w^{kj}(x,y,z)(Q^{kj}(x,y_{\leq r_Q(k,j)},u_Q(z))-f(x))^2] + \delta(k, j) \]

Proof of Lemma B.2

Suppose \(h: {\mathbb{N}}^2 \rightarrow {\mathbb{N}}\) is a polynomial s.t. \(\frac{1}{h} \in \mathcal{E}\). Given \(t \in [0,M]\), define \(\alpha^{kj}(t)\) to be \(t\) rounded within error \(h(k,j)^{-1}\). Thus, the number of digits in \(\alpha^{kj}(t)\) is logarithmic in \(k\) and \(j\). Consider \(\hat{Q}_t:=(Q_t, r_w, \sigma_u)\) the \((poly,rlog)\)-predictor defined by

\[\sigma_u:=(1 \times u_P \times u_Q)_* \sigma_w\]

\[ Q^{kj}_t(x,y,(a,b,c)):= \begin{cases} Q^{kj}(x,y_{\leq r_Q(k,j)},c) & \text{if } w^{kj}(x,y,a) \geq \alpha^{kj}(t) \\ P^{kj}(x,y_{\leq r_P(k,j)},b) & \text{if } w^{kj}(x,y,a) < \alpha^{kj}(t) \end{cases} \]

\(\hat{Q}_t\) satisfies bounds on runtime and advice size uniform in \(t\). Therefore, Lemma B.1 implies that there is \(\delta \in \mathcal{E}\) s.t.

\[ E[(\hat{P}^{kj}(x)-f(x))^2] \leq E[(\hat{Q}^{kj}_t(x)-f(x))^2] + \delta(k,j) \]

\[ E[(\hat{P}^{kj}(x)-f(x))^2-(\hat{Q}^{kj}_t(x)-f(x))^2] \leq \delta(k,j) \]

\[ E[\theta(\hat{w}^{kj}(x)-\alpha^{kj}(t))((\hat{P}^{kj}(x)-f(x))^2-(\hat{Q}^{kj}(x)-f(x))^2)] \leq \delta(k,j) \]

\[ E[\int_0^M\theta(\hat{w}^{kj}(x)-\alpha^{kj}(t))\,dt\,((\hat{P}^{kj}(x)-f(x))^2-(\hat{Q}^{kj}(x)-f(x))^2)] \leq M \delta(k,j) \]

Since \(\int_0^M \theta(\hat{w}^{kj}(x)-\alpha^{kj}(t))\,dt\) equals \(\hat{w}^{kj}(x)\) up to an error of order \(h(k,j)^{-1}\), and the remaining factor is bounded by 1 in absolute value, we obtain

\[ E[\hat{w}^{kj}(x)((\hat{P}^{kj}(x)-f(x))^2-(\hat{Q}^{kj}(x)-f(x))^2)] \leq M \delta(k, j) + h(k,j)^{-1} \]

Lemma B.3 (orthogonality lemma)

Consider \((f, \mu)\) a distributional estimation problem and \(\hat{P}\) an \(\mathcal{E}(poly,rlog)\)-optimal predictor for \((f, \mu)\). Then there are \(c_1,c_2 \in {\mathbb{R}}\) and an \(\mathcal{E}^{\frac{1}{2}}\)-moderate function \(\delta: {\mathbb{N}}^4 \rightarrow [0,1]\) s.t. for any \(k,j,s \in {\mathbb{N}}\), \(Q: {{\{ 0, 1 \}^*}}^2 \xrightarrow{alg} {\mathbb{Q}}\)

\[|E_{\mu^k \times U^s \times U^{r_P(k,j)} \times \sigma_P^{kj}}[Q(\hat{P}^{kj}-f)]| \leq (c_1 + c_2 E_{\mu^k \times U^s}[Q^2]) \delta(k,j,T_Q^{\mu}(k,s),2^{|Q|})\]

Conversely, consider \(M \in {\mathbb{Q}}\) and \(\hat{P}\) a \({\mathbb{Q}}\cap [-M,+M]\)-valued \((poly,rlog)\)-bischeme. Suppose that for any \({\mathbb{Q}}\cap [-M-1,+M+1]\)-valued \((poly,log)\)-bischeme \((Q,s,b)\) we have \(|E[Q(P-f)]| \in \mathcal{E}\).

Define \(\tilde{P}\) to be s.t. computing \(\tilde{P}^{kj}\) is equivalent to computing \(\eta(\hat{P}^{kj})\) rounded to \(h(k,j)\) digits after the binary point, where \(2^{-h} \in \mathcal{E}\). Then, \(\tilde{P}\) is an \(\mathcal{E}(poly,rlog)\)-optimal predictor for \((f, \mu)\).


We omit the proofs of Lemma B.3 and Lemma B.4 below since they are closely analogous to before.

Lemma B.4

Consider \((f, \mu)\) a distributional estimation problem, \(\hat{P}\), \(\hat{Q}\) \((poly,rlog)\)-predictors. Suppose \(p: {\mathbb{N}}^2 \rightarrow {\mathbb{N}}\) a polynomial, \(\phi \in \Phi\) and \(\delta \in \mathcal{E}_{2(ll,\phi)}\) are s.t.

\[\forall i,k,j \in {\mathbb{N}}: E[(\hat{P}^{k,p(k,j)+i}-f)^2] \leq E[(\hat{Q}^{kj}-f)^2] + \delta(k,j)\]

Then \(\exists \delta' \in \mathcal{E}_{2(ll,\phi)}\) s.t.

\[E[(\hat{P}^{kj}-f)^2] \leq E[(\hat{Q}^{kj}-f)^2] + \delta'(k,j)\]

Proof of Theorem A.4

Denote \(\hat{P} := \hat{P}_1 \times \hat{P}_2\). We have

\[\hat{P}(x_1,x_2)-(f_1 \times f_2)(x_1,x_2) = (\hat{P}_1(x_1)- f_1(x_1))f_2(x_2) + \hat{P}_1(x_1)(\hat{P}_2(x_2)-f_2(x_2))\]

Therefore, for any \({\mathbb{Q}}\cap [-1,+1]\)-valued \((poly,log)\)-bischeme \(\hat{Q}\)

\[ |E[\hat{Q}(\hat{P}-f_1 \times f_2)]| \leq |E[\hat{Q}(x_1,x_2)(\hat{P}_1(x_1)-f_1(x_1))f_2(x_2)]| + |E[\hat{Q}(x_1,x_2)\hat{P}_1(x_1)(\hat{P}_2(x_2)-f_2(x_2))]| \]

By Lemma B.3, it is sufficient to show an appropriate bound for each of the terms on the right hand side. Suppose \(\hat{G}\) is an \(\mathcal{E}^{\frac{1}{\infty}}(log)\)-generator for \((f_2, \mu_2)\). For the first term, we have

\[|E[\hat{Q}^{kj}(x_1,x_2)(\hat{P}^{kj}_1(x_1)-f_1(x_1))f_2(x_2)]| \leq |E[\hat{Q}^{kj}(x_1,\hat{G}^{kj}_1)(\hat{P}^{kj}_1(x_1)-f_1(x_1))\hat{G}^{kj}_2]| + \delta_2(k,j)\]

where \(\delta_2 \in \mathcal{E}^{\frac{1}{\infty}}\) doesn’t depend on \(Q\). Applying Lemma B.3 for \(\hat{P}_1\), we get

\[|E[\hat{Q}^{kj}(x_1,x_2)(\hat{P}^{kj}_1(x_1)-f_1(x_1))f_2(x_2)]| \leq \delta_{Q,1}(k,j) + \delta_2(k,j)\]

where \(\delta_{Q,1} \in \mathcal{E}^{\alpha_1}\) for some \(\alpha_1 \in {\mathbb{R}}^{>0}\) that doesn’t depend on \(Q\).

Suppose \(\hat{S}\) is an \(\mathcal{E}^{\frac{1}{\infty}}(log)\)-sampler for \(\mu_1\). For the second term, we have

\[|E[\hat{Q}^{kj}(x_1,x_2)\hat{P}^{kj}_1(x_1)(\hat{P}^{kj}_2(x_2)-f_2(x_2))]| \leq |E[\hat{Q}^{kj}(\hat{S}^{kj},x_2)\hat{P}^{kj}_1(\hat{S}^{kj})(\hat{P}^{kj}_2(x_2)-f_2(x_2))]| + \delta_1(k,j)\]

where \(\delta_1 \in \mathcal{E}^{\frac{1}{\infty}}\) doesn’t depend on \(Q\). Applying Lemma B.3 for \(\hat{P}_2\), we get

\[|E[\hat{Q}^{kj}(x_1,x_2)\hat{P}^{kj}_1(x_1)(\hat{P}^{kj}_2(x_2)-f_2(x_2))]| \leq \delta_{Q,2}(k,j) + \delta_1(k,j)\]

where \(\delta_{Q,2} \in \mathcal{E}^{\alpha_2}\) for some \(\alpha_2 \in {\mathbb{R}}^{>0}\) that doesn’t depend on \(Q\). Again, we obtain the required bound.

Proof of Proposition A.1

Consider a non-constant polynomial \(p: {\mathbb{N}}\rightarrow {\mathbb{N}}\) and \(\delta \in \mathcal{E}_{2(ll)}\). Define \(\delta'(k,j):=\delta(p(k),j)\). To get the desired condition for \(\delta'\) and \(\phi \in \Phi\), consider any \(\phi' \in \Phi\) s.t. for sufficiently large \(k\) we have \(\phi'(p(k))=\phi(k)\). For any \(\epsilon \in (0,1)\) we have

\[\lim_{k \rightarrow \infty} \phi'(k)^\epsilon E_{\lambda_{\phi'}^k}[\delta(k,j)] = 0\]

In particular

\[\lim_{k \rightarrow \infty} \phi'(p(k))^\epsilon E_{\lambda_{\phi'}^{p(k)}}[\delta(p(k),j)] = 0\]

\[\lim_{k \rightarrow \infty} \phi(k)^\epsilon E_{\lambda_{\phi}^k}[\delta'(k,j)] = 0\]

Appendix C

Proof of Proposition 1

To check condition (i), consider \(\delta_1, \delta_2 \in \mathcal{E}\).

If \(\alpha > 1\), \((\delta_1^\alpha + \delta_2^\alpha)^\frac{1}{\alpha} \leq \delta_1 + \delta_2 \in \mathcal{E}\) hence \((\delta_1^\alpha + \delta_2^\alpha)^\frac{1}{\alpha} \in \mathcal{E}\) and \(\delta_1^\alpha + \delta_2^\alpha \in \mathcal{E}^\alpha\).

If \(\alpha \leq 1\), \((\delta_1^\alpha + \delta_2^\alpha)^\frac{1}{\alpha} = 2^\frac{1}{\alpha}(\frac{\delta_1^\alpha + \delta_2^\alpha}{2})^\frac{1}{\alpha} \leq 2^\frac{1}{\alpha} \frac{\delta_1+\delta_2}{2} \in \mathcal{E}\) hence \((\delta_1^\alpha + \delta_2^\alpha)^\frac{1}{\alpha} \in \mathcal{E}\) and \(\delta_1^\alpha + \delta_2^\alpha \in \mathcal{E}^\alpha\).

Conditions (ii) and (iii) are obvious.

Proof of Proposition 2

Consider \(\delta \in \mathcal{E}\). We need to show that \(\delta^\gamma \in \mathcal{E}^\alpha\) i.e. that \(\delta^{\frac{\gamma}{\alpha}} \in \mathcal{E}\). But \(\frac{\gamma}{\alpha} > 1\) hence \(\delta^{\frac{\gamma}{\alpha}} = (\sup \delta)^{\frac{\gamma}{\alpha}}(\frac{\delta}{\sup \delta})^{\frac{\gamma}{\alpha}} \leq (\sup \delta)^{\frac{\gamma}{\alpha}} \frac{\delta}{\sup \delta} \in \mathcal{E}\).

Proof of Proposition 3

Follows immediately from Lemma B.1.

Proof of Proposition 4

Suppose \(\hat{P}\) is an \(\mathcal{E}(poly,rlog)\)-optimal predictor. Set \(a^{kj}:=\arg\min_{z \in \operatorname{supp}\sigma_P^{kj}} E_{\mu^k \times U^{r_P(k,j)}}[(P^{kj}(x,y,z)-f(x))^2]\). Replacing \(\sigma\) by \(a\) we get the desired \(\mathcal{E}(poly,log)\)-optimal predictor.
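
Schematically (our sketch; names illustrative), the derandomization in this proof picks the best advice string in the support, using that a mixture of errors is bounded from below by the error of its best component:

```python
# Sketch (ours): among the advice strings in the support of sigma_P^{kj},
# select one minimizing the expected squared error; error_of(z) stands for
# E_{mu^k x U^{r_P(k,j)}}[(P^{kj}(x, y, z) - f(x))^2].
def best_advice(support, error_of):
    return min(support, key=error_of)
```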

Proposition C.1

For any \(\psi \in \Phi\), \(\min(\frac{\log\log (k+3)}{\log\log (j+3)}, 1)^{\psi(k)} \in \mathcal{E}_{2(ll)}\).

In particular, this implies \(\frac{1}{j+1} \in \mathcal{E}_{2(ll)}\) so \(\mathcal{E}_{2(ll)}\) is ample.

Proof of Proposition C.1

Denote \(\delta_\psi(k,j):=\min(\frac{\log\log (k+3)}{\log\log (j+3)}, 1)^{\psi(k)}\). Consider \(\phi \in \Phi\), \(\epsilon \in (0,1)\). We have

\[E_{\lambda_{\phi}^k}[\delta_\psi(k,j)] = Pr_{\lambda_{\phi}^k}[j < t_{\phi^{\frac{\epsilon}{2}}}(k)] E_{\lambda_{\phi}^k}[\delta_\psi(k,j) \mid j < t_{\phi^{\frac{\epsilon}{2}}}(k)] + Pr_{\lambda_{\phi}^k}[j \geq t_{\phi^{\frac{\epsilon}{2}}}(k)] E_{\lambda_{\phi}^k}[\delta_\psi(k,j) \mid j \geq t_{\phi^{\frac{\epsilon}{2}}}(k)]\]

\[\limsup_{k \rightarrow \infty} \phi(k)^{1-\epsilon} E_{\lambda_{\phi}^k}[\delta_\psi(k,j)] \leq \limsup_{k \rightarrow \infty} \phi(k)^{1-\epsilon}(\phi(k)^{\frac{\epsilon}{2}-1} (\sup \delta_\psi) + \sup_{j \geq t_{\phi^{\frac{\epsilon}{2}}}(k)} \delta_\psi(k,j))\]

\[\limsup_{k \rightarrow \infty} \phi(k)^{1-\epsilon} E_{\lambda_{\phi}^k}[\delta_\psi(k,j)] \leq \limsup_{k \rightarrow \infty} \phi(k)^{1-\epsilon} (\phi(k)^{\frac{\epsilon}{2}-1} + \phi(k)^{-\frac{\epsilon}{2}\psi(k)})\]

\[\limsup_{k \rightarrow \infty} \phi(k)^{1-\epsilon} E_{\lambda_{\phi}^k}[\delta_\psi(k,j)] \leq \limsup_{k \rightarrow \infty} (\phi(k)^{-\frac{\epsilon}{2}} + \phi(k)^{1-\epsilon-\frac{\epsilon}{2}\psi(k)})\]

\[\lim_{k \rightarrow \infty} \phi(k)^{1-\epsilon} E_{\lambda_{\phi}^k}[\delta_\psi(k,j)] = 0\]

Proof of Proposition 5

The only part that is not entirely obvious is condition (iii) for \(\mathcal{E}_{2(ll)}\), which follows from Proposition C.1 (since \(2^{-j} \in \mathcal{E}_{2(ll)}\)).

Construction C.1

Fix \(\mathcal{R}=(\Sigma,f,\mu)\) a reflective system.

For any \(j \in {\mathbb{N}}\), denote \(W^j \subseteq {{\{ 0, 1 \}^*}}\) the set of the first \(j\) words in lexicographic order. Denote \(\Delta^j\) the space of probability distributions on \(W^j\). Denote \(\Delta_\Sigma := \prod_{\substack{k,j \in {\mathbb{N}}\\ n \in \Sigma}} \Delta^j\). Denote \(W_\Sigma:=\{(n \in \Sigma, k \in {\mathbb{N}}, j \in {\mathbb{N}}, x \in W^j)\}\). Let \(V_\Sigma\) be the locally convex topological vector space \(W_\Sigma \rightarrow {\mathbb{R}}\), where the topology is the product topology.

\(\Delta_\Sigma\) is a compact (by Tychonoff’s theorem) convex subset of \(V_\Sigma\). Each \(\vartheta \in \Delta_\Sigma\) can be regarded as a probability measure on \(A_\Sigma\).

Given \(a \in A_\Sigma\), we define \(\bar{a} \in \Pi_\Sigma\) by \(\bar{a}_n^{kj}:=(a_n^{kj},j)\).

Given \(k,j \in {\mathbb{N}}\) and \(n \in \Sigma\), define \(\epsilon_{\mathcal{R},n}^{kj}: \Delta_\Sigma \times \Delta^j \rightarrow [0,1]\) as follows

\[\epsilon_{\mathcal{R},n}^{kj}(\vartheta, \zeta):=E_{\zeta \times \mu_n^k \times U^j \times \vartheta}[(ev^j(x,y,z)-f_n(y,\overline{\Upsilon[w]}))^2]\]

Define \(\kappa_{\mathcal{R}} \subseteq \Delta_\Sigma \times \Delta_\Sigma\) by

\[\kappa_{\mathcal{R}} = \{(\vartheta_1,\vartheta_2) \mid \forall k,j \in {\mathbb{N}}, n \in \Sigma, \zeta \in \Delta^j: \epsilon_{\mathcal{R},n}^{kj}(\vartheta_1, (\vartheta_2)_n^{kj}) \leq \epsilon_{\mathcal{R},n}^{kj}(\vartheta_1, \zeta)\}\]

Proposition C.2

\(\kappa_{\mathcal{R}}\) is a Kakutani map.

Proof of Proposition C.2

\(\epsilon_{\mathcal{R},n}^{kj}\) is continuous in the 2nd argument and \(\Delta^j\) is compact for any \(j \in {\mathbb{N}}\). Therefore, for any \(\vartheta \in \Delta_\Sigma\), \(\kappa_{\mathcal{R}}(\vartheta) \subseteq \Delta_\Sigma\) is non-empty by the extreme value theorem. It is compact by Tychonoff’s theorem and it is obviously convex.

Given \(Y \subseteq \operatorname{supp}\mu_n^k\) finite, define \(\epsilon_{\mathcal{R},Y,n}^{kj}: \Delta_\Sigma \times \Delta^j \rightarrow [0,1]\) by

\[\epsilon_{\mathcal{R},Y,n}^{kj}(\vartheta, \zeta):=\sum_{y \in Y} \mu_n^k(y) E_{\zeta \times U^j \times \vartheta}[(ev^j(x,y,z)-f_n(y,\overline{\Upsilon[w]}))^2]\]

\(\epsilon_{\mathcal{R},Y,n}^{kj}\) is continuous since \(f\) is continuous. Regarding the \(Y\)s as a net by set inclusion, \(\epsilon_{\mathcal{R},Y,n}^{kj}\) uniformly converges to \(\epsilon_{\mathcal{R},n}^{kj}\), therefore \(\epsilon_{\mathcal{R},n}^{kj}\) is continuous.

Given \(n \in \Sigma\), \(k,j \in {\mathbb{N}}\), define \(\kappa_{\mathcal{R},n}^{kj} \subseteq \Delta_\Sigma \times \Delta_\Sigma\) by

\[\kappa_{\mathcal{R},n}^{kj} = \{(\vartheta_1,\vartheta_2) \mid \forall \zeta \in \Delta^j: \epsilon_{\mathcal{R},n}^{kj}(\vartheta_1, (\vartheta_2)_n^{kj}) \leq \epsilon_{\mathcal{R},n}^{kj}(\vartheta_1, \zeta)\}\]

\(\kappa_{\mathcal{R},n}^{kj}\) is closed since \(\epsilon_{\mathcal{R},n}^{kj}\) is continuous. Therefore \(\kappa_{\mathcal{R}} = \bigcap_{\substack{k,j \in {\mathbb{N}}\\ n \in \Sigma}} \kappa_{\mathcal{R},n}^{kj}\) is also closed.

Proof of Theorem 1

Using Proposition C.2, we apply the Kakutani-Glicksberg-Fan theorem to obtain \(\sigma \in \Delta_\Sigma\) with \((\sigma,\sigma) \in \kappa_{\mathcal{R}}\). Define \(\hat{P}_n:=(\Upsilon,j,\sigma_n)\), where \(\sigma_n^{kj}\) stands for the \((n,k,j)\) component of \(\sigma\).

Consider \(n \in \Sigma\) and \(\hat{Q}\) a \((poly,log)\)-predictor (it is enough to compete against deterministic advice: the error of a \((poly,rlog)\)-predictor is a mixture of the errors attained with deterministic advice values, hence bounded from below by the error of its best advice). Choose \(p: {\mathbb{N}}^2 \rightarrow {\mathbb{N}}\) a polynomial and \(\{q^{kj} \in W^{p(k,j)}\}_{k,j \in {\mathbb{N}}}\) s.t. \(p \geq r_Q\) and

\[\forall i,k,j \in {\mathbb{N}}, x \in \operatorname{supp}\mu_n^k, y \in {{\{ 0, 1 \}^{p(k,j)+i}}}: ev^{p(k,j)+i}(q^{kj},x,y) = Q^{kj}(x,y_{<r_Q(k,j)},a_Q^{kj})\]

By definition of \(\kappa_{\mathcal{R}}\), we have

\[\forall i,k,j \in {\mathbb{N}}: \epsilon_{\mathcal{R},n}^{k,p(k,j)+i}(\sigma,\sigma_n^{k,p(k,j)+i}) \leq \epsilon_{\mathcal{R},n}^{k,p(k,j)+i}(\sigma,q^{kj})\]

Here we implicitly used the natural injection \(W^m \rightarrow \Delta^m\).

\[\forall i,k,j \in {\mathbb{N}}: E_{\sigma_n^{k,p(k,j)+i} \times \mu_n^k \times U^{p(k,j)+i} \times \sigma}[(ev^{p(k,j)+i}(x,y,z)-f_n(y,\overline{\Upsilon[w]}))^2] \leq E_{\mu_n^k \times U^{p(k,j)+i} \times \sigma}[(ev^{p(k,j)+i}(q^{kj},y,z)-f_n(y,\overline{\Upsilon[w]}))^2]\]

\[\forall i,k,j \in {\mathbb{N}}: E_{\sigma_n^{k,p(k,j)+i} \times \mu_n^k \times U^{p(k,j)+i} \times \sigma}[(P_n^{k,p(k,j)+i}(y,z,x)-f_n(y,\hat{P}[w]))^2] \leq E_{\mu_n^k \times U^{r_Q(k,j)} \times \sigma}[(\hat{Q}^{kj}(y,z)-f_n(y,\hat{P}[w]))^2]\]

Denoting \(\hat{\sigma}_n^{k,p(k,j)+i}:=\sigma_n^{k,p(k,j)+i} \times \mu_n^k \times U^{p(k,j)+i}\), the left hand side satisfies

\[E_{\hat{\sigma}_n^{k,p(k,j)+i} \times \sigma}[(P_n^{k,p(k,j)+i}(y,z,x)-f_n(y,\hat{P}[w]))^2] = E_{\hat{\sigma}_n^{k,p(k,j)+i}}[(P_n^{k,p(k,j)+i}(y,z,x)-E_\sigma[f_n(y,\hat{P}[w])])^2] + Var_\sigma[f_n(y,\hat{P}[w])]\]

Similarly, for the right hand side we have

\[E_{\mu_n^k \times U^{r_Q(k,j)} \times \sigma}[(\hat{Q}^{kj}(y,z)-f_n(y,\hat{P}[w]))^2] = E_{\mu_n^k \times U^{r_Q(k,j)}}[(\hat{Q}^{kj}(y,z)-E_\sigma[f_n(y,\hat{P}[w])])^2] + Var_\sigma[f_n(y,\hat{P}[w])]\]

Combining the two together, we get

\[E_{\hat{\sigma}_n^{k,p(k,j)+i}}[(P_n^{k,p(k,j)+i}(y,z,x)-E_\sigma[f_n(y,\hat{P}[w])])^2] \leq E_{\mu_n^k \times U^{r_Q(k,j)}}[(\hat{Q}^{kj}(y,z)-E_\sigma[f_n(y,\hat{P}[w])])^2]\]

\[E_{\hat{\sigma}_n^{k,p(k,j)+i}}[(\hat{P}_n^{k,p(k,j)+i}(y)-\mathcal{R}[\hat{P}]_n(y))^2] \leq E_{\mu_n^k \times U^{r_Q(k,j)}}[(\hat{Q}^{kj}(y)-\mathcal{R}[\hat{P}]_n(y))^2]\]

Applying Lemma B.4 we conclude that there is \(\delta \in \mathcal{E}_{2(ll)}\) s.t.

\[E_{\hat{\sigma}_n^{kj}}[(\hat{P}_n^{kj}(y)-\mathcal{R}[\hat{P}]_n(y))^2] \leq E_{\mu_n^k \times U^{r_Q(k,j)}}[(\hat{Q}^{kj}(y)-\mathcal{R}[\hat{P}]_n(y))^2] + \delta(k,j)\]

That is, \(\hat{P}_n\) is an \(\mathcal{E}_{2(ll)}(poly,rlog)\)-optimal predictor for \((\mathcal{R}[\hat{P}]_n,\mu_n)\), so \(\hat{P}\) is an \(\mathcal{E}_{2(ll)}(poly,rlog)\)-optimal predictor system for \(\mathcal{R}\).

Proof of Proposition 6

We need to show that for every \(n \in \Sigma\), \(x \in \operatorname{supp}\mu_n\), \(q \in \Omega_\mu\) and \(\epsilon > 0\), there is a finite set \(X \subseteq D_\mu\) and \(\delta > 0\) s.t. for every \(\tilde{q} \in \Omega_\mu\) with \(\forall (n,k,j,y) \in X: |q_n^{kj}(y) - \tilde{q}_n^{kj}(y)| < \delta\) we have \(|f_n(x,q) - f_n(x,\tilde{q})| < \epsilon\).

Choose \(k \in {\mathbb{N}}\) s.t. \(x \in \operatorname{supp}\mu_n^k\), \(Z \subseteq \Sigma\) finite and \(Y \subseteq {{\{ 0, 1 \}^*}}\) finite. Define \(X:=\{(m,k,j,y) \in D_\mu \mid m \in Z,\ j < t_{\psi_{nm}}(k),\ y \in Y\}\). We get

\[E_{\mu_n^k}[(f_n(y,q)-f_n(y,\tilde{q}))^2] \leq c_n (E_{\rho_n}[E_{\lambda_{\psi_{nm}}^k \times \mu_m^k}[(q_m^{kj}(y)-\tilde{q}_m^{kj}(y))^2]])^{\alpha_n}\]

\[\mu_n^k(x)(f_n(x,q)-f_n(x,\tilde{q}))^2 \leq c_n (\rho_n(\Sigma \setminus Z) + \max_{m \in Z} \mu_m^k({{\{ 0, 1 \}^*}} \setminus Y) + \delta^2)^{\alpha_n}\]

By choosing \(Z\) with \(\rho_n(\Sigma \setminus Z)\) sufficiently small, \(Y\) with \(\mu_m^k({{\{ 0, 1 \}^*}} \setminus Y)\) sufficiently small for every \(m \in Z\) and \(\delta\) sufficiently small, we get the desired condition.

Proposition C.3

Fix a finite set \(\Sigma\) and a collection \(\{\phi_n \in \Phi\}_{n \in \Sigma}\). Consider \(\mathcal{R}=(f,\mu)\) a \(\phi\)-reflective system and two collections of \((poly,rlog)\)-predictors \(\{\hat{Q}_{1n}\}_{n \in \Sigma}\) and \(\{\hat{Q}_{2n}\}_{n \in \Sigma}\). Assume that for some \(\gamma \in (0,1]\), \(\forall n \in \Sigma: \hat{Q}_{1n} \underset{\mathcal{E}_{2(ll)}^\gamma}{\overset{\mu_n}{\simeq}} \hat{Q}_{2n}\). Then

\[\forall n \in \Sigma: E_{\mu_n^k}[(\mathcal{R}[\hat{Q}_1]_n(x)-\mathcal{R}[\hat{Q}_2]_n(x))^2] \in \mathcal{E}_{1(\phi_n)}^{\frac{1}{\infty}}\]

Proof of Proposition C.3

\[E_{\mu_n^k}[(\mathcal{R}[\hat{Q}_1]_n(x)-\mathcal{R}[\hat{Q}_2]_n(x))^2] = E_{\mu_n^k}[(E_{\sigma_1}[f_n(x,ex_\mu(\hat{Q}_1[a]))]-E_{\sigma_2}[f_n(x,ex_\mu(\hat{Q}_2[a]))])^2]\]

\[E_{\mu_n^k}[(\mathcal{R}[\hat{Q}_1]_n(x)-\mathcal{R}[\hat{Q}_2]_n(x))^2] \leq E_{\mu_n^k \times \sigma_1 \times \sigma_2}[(f_n(x,ex_\mu(\hat{Q}_1[a_1]))-f_n(x,ex_\mu(\hat{Q}_2[a_2])))^2]\]

\[E_{\mu_n^k}[(\mathcal{R}[\hat{Q}_1]_n(x)-\mathcal{R}[\hat{Q}_2]_n(x))^2] \leq E_{\sigma_1 \times \sigma_2}[c_n (E_{\rho_n}[E_{\lambda_{\psi_{nm}}^k \times \mu_m^k}[(E_{U^{r_1(k,j)}}[Q_{1m}^{kj}(x,y,a_1)]-E_{U^{r_2(k,j)}}[Q_{2m}^{kj}(x,y,a_2)])^2]])^{\alpha_n}]\]

\[E_{\mu_n^k}[(\mathcal{R}[\hat{Q}_1]_n(x)-\mathcal{R}[\hat{Q}_2]_n(x))^2] \leq E_{\sigma_1 \times \sigma_2}[c_n (E_{\rho_n}[E_{\lambda_{\psi_{nm}}^k}[E_{\mu_m^k \times U^{r_1(k,j)} \times U^{r_2(k,j)}}[(Q_{1m}^{kj}(x,y,a_1)-Q_{2m}^{kj}(x,y,a_2))^2]]])^{\alpha_n}]\]

\[E_{\mu_n^k}[(\mathcal{R}[\hat{Q}_1]_n(x)-\mathcal{R}[\hat{Q}_2]_n(x))^2] \leq c_n (E_{\rho_n}[E_{\lambda_{\psi_{nm}}^k}[E_{\mu_m^k \times U^{r_1(k,j)} \times U^{r_2(k,j)} \times \sigma_1 \times \sigma_2}[(Q_{1m}^{kj}(x,y,a_1)-Q_{2m}^{kj}(x,y,a_2))^2]]])^{\alpha_n}\]

Using the similarity of \(\hat{Q}_1\) and \(\hat{Q}_2\), there are \(\{\delta_n \in \mathcal{E}_{2(ll)}\}_{n \in \Sigma}\) with \(\delta_n: {\mathbb{N}}^2 \rightarrow [0,1]\) s.t.

\[E_{\mu_n^k}[(\mathcal{R}[\hat{Q}_1]_n(x)-\mathcal{R}[\hat{Q}_2]_n(x))^2] \leq c_n (E_{\rho_n}[E_{\lambda_{\psi_{nm}}^k}[\delta_m(k,j)^\gamma]])^{\alpha_n}\]

\[E_{\mu_n^k}[(\mathcal{R}[\hat{Q}_1]_n(x)-\mathcal{R}[\hat{Q}_2]_n(x))^2] \leq c_n (E_{\rho_n}[E_{\lambda_{\psi_{nm}}^k}[\delta_m(k,j)]])^{\alpha_n\gamma}\]

\(\delta_m \in \mathcal{E}_{2(ll)}\), hence \(E_{\lambda_{\psi_{nm}}^k}[\delta_m(k,j)] \in \mathcal{E}_{1(\psi_{nm})} \subseteq \mathcal{E}_{1(\phi_n)}\). This implies \(E_{\rho_n}[E_{\lambda_{\psi_{nm}}^k}[\delta_m(k,j)]] \in \mathcal{E}_{1(\phi_n)}\) and therefore \(E_{\mu_n^k}[(\mathcal{R}[\hat{Q}_1]_n(x)-\mathcal{R}[\hat{Q}_2]_n(x))^2] \in \mathcal{E}_{1(\phi_n)}^{\alpha_n\gamma} \subseteq \mathcal{E}_{1(\phi_n)}^{\frac{1}{\infty}}\).

Definition C.1

Fix a set \(\Sigma\) and a collection \(\{\phi_n \in \Phi\}_{n \in \Sigma}\). Given \(\mathcal{R}=(f,\mu)\) a \(\phi\)-reflective system, the associated reflective system \(ex^{-1}\mathcal{R}=(\Sigma,ex^{-1}f,\mu)\) is defined by

\[(ex^{-1} f_n)(x,\pi):=f_n(x, ex_\mu(\pi))\]

\((ex^{-1} f_n)\) is continuous thanks to Proposition 6, since \(ex_\mu\) is continuous.

Proof of Theorem 2

Fix a finite set \(\Sigma\) and a collection \(\{\phi_n \in \Phi\}_{n \in \Sigma}\). Consider \(\mathcal{R}\) a \(\phi\)-reflective system. By Theorem 1, there is \(\hat{R}\) an \(\mathcal{E}_{2(ll)}(poly,rlog)\)-optimal predictor system for \(ex^{-1}\mathcal{R}\). For each \(n \in \Sigma\) we can use Proposition 4 to choose \(\hat{P}_n\), an \(\mathcal{E}_{2(ll)}(poly,log)\)-optimal predictor for \((ex^{-1}\mathcal{R})[\hat{R}]_n=\mathcal{R}[\hat{R}]_n\). By Theorem A.7, we have \(\hat{P}_{n} \underset{\mathcal{E}_{2(ll)}^\frac{1}{2}}{\overset{\mu_n}{\simeq}} \hat{R}_{n}\). By Proposition C.3 this implies \(E_{\mu_n^k}[(\mathcal{R}[\hat{P}]_n(x)-\mathcal{R}[\hat{R}]_n(x))^2] \in \mathcal{E}_{1(\phi_n)}^{\frac{1}{\infty}}\). This means \(\hat{P}_n\) is an \(\mathcal{E}_{2(ll,\phi_n)}^*(poly,log)\)-optimal predictor for \(\mathcal{R}[\hat{P}]_n\) and \(\hat{P}\) is an \(\mathcal{E}_{2(ll,\phi)}^*(poly,log)\)-optimal predictor system for \(\mathcal{R}\).


