Optimal predictor schemes

Vanessa Kosoy

I introduce the concept of "optimal predictor scheme" which differs from (quasi)optimal predictors in depending on an additional parameter representing the amount of computing resources the predictor is allowed to use. For a certain flavor of optimal predictor scheme, I prove existence for arbitrary distributional decision problems.

Results

It is convenient to think of the concepts of "optimal predictor" and "quasi-optimal predictor" as special cases of the more general $Δ$ -optimal predictors corresponding to different choices of the "error space" $Δ$ .

Definition 1

Let $r$ be a positive integer. An error space of rank $r$ is a set $Δ$ of bounded functions from $N^{r}$ to $R^{\geq 0}$ s.t.

(i) If $δ_{1}, δ_{2} \in Δ$ then $δ_{1} + δ_{2} \in Δ$ .

(ii) If $δ_{1} \in Δ$ and $δ_{2} \leq δ_{1}$ then $δ_{2} \in Δ$ .

(iii) Given $α \in (0, 1]$ , if $δ \in Δ$ then $δ^{α} \in Δ$ .

(iv) There is a polynomial $h : N^{r} \to R$ s.t. $2^{- h} \in Δ$ .

Example 1.1

$Δ_{0}^{1}$ is the set of functions $δ : N \to R^{\geq 0}$ s.t. $\lim_{k \to \infty} δ (k) = 0$.

Example 1.2

$Δ_{n e g}^{1}$ is the set of negligible functions $δ : N \to R^{\geq 0}$ .

Example 1.3

$Δ_{a v g}^{2}$ is the set of bounded functions $δ : N^{2} \to R^{\geq 0}$ s.t. for any $f : N \to N$ , if

$\lim_{k \to \infty} \frac{\log \log k}{\log \log f (k)} = 0$

then

$\lim_{k \to \infty} \frac{\sum_{j = 3}^{f (k) - 1} (\log \log (j + 1) - \log \log j) δ (k, j)}{\log \log f (k) - \log \log 3} = 0$

It is slightly non-obvious that condition (iii) holds in this example. To see this, note that the expression inside the limit can be regarded as $E_{λ_{f}^{k}} [δ (k, j)]$ where $λ_{f}^{k}$ is a certain probability measure over $j$. Since $E_{λ_{f}^{k}} [δ (k, j)] \geq ϵ \Pr_{λ_{f}^{k}} [δ (k, j) > ϵ]$, the limit $\lim_{k \to \infty} E_{λ_{f}^{k}} [δ (k, j)] = 0$ implies that for any $ϵ > 0$, $\lim_{k \to \infty} \Pr_{λ_{f}^{k}} [δ (k, j) > ϵ] = 0$. For any $ϵ > 0$, we have $E_{λ_{f}^{k}} [δ (k, j)^{α}] \leq ϵ + \Pr_{λ_{f}^{k}} [δ (k, j) > ϵ^{α^{- 1}}] \sup δ^{α}$. The second term vanishes as $k \to \infty$ and $ϵ$ is arbitrary, therefore $\lim_{k \to \infty} E_{λ_{f}^{k}} [δ (k, j)^{α}] = 0$.
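The measure $λ_{f}^{k}$ in Example 1.3 can be made concrete with a short numerical sketch. The code below is only an illustration: it assumes natural logarithms, and the error function `delta_example` and the sample values of $f(k)$ are hypothetical choices, not taken from the text.

```python
import math

def loglog_weights(f_k):
    """Weights of the measure lambda_f^k on j = 3, ..., f_k - 1."""
    total = math.log(math.log(f_k)) - math.log(math.log(3))
    return {
        j: (math.log(math.log(j + 1)) - math.log(math.log(j))) / total
        for j in range(3, f_k)
    }

def loglog_average(delta, k, f_k):
    """E_{lambda_f^k}[delta(k, j)]: the expression inside the limit."""
    return sum(w * delta(k, j) for j, w in loglog_weights(f_k).items())

def delta_example(k, j):
    # Hypothetical error function: equal to 1 while j <= k and 1/j afterwards.
    return 1.0 if j <= k else 1.0 / j

# The weights telescope, so they sum to 1 (up to floating point error).
total_mass = sum(loglog_weights(1000).values())

# For fixed k, a larger cutoff f(k) puts less weight on the small values of j.
avg_small_f = loglog_average(delta_example, 10, 100)
avg_large_f = loglog_average(delta_example, 10, 10**5)
```

The weights depend on $j$ only through increments of $\log \log j$, which is why the average is "log-log uniform": the range of $j$ has to grow very quickly before the mass on small $j$ becomes negligible.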

Definition 2

Consider $Δ$ an error space of rank 1 and $(D, μ)$ a distributional decision problem. A $Δ$-optimal predictor for $(D, μ)$ is a family of polynomial size Boolean circuits $\{P^{k} : \operatorname{supp} μ^{k} \xrightarrow{\text{circ}} [0, 1]\}_{k \in N}$ s.t. for any family of polynomial size Boolean circuits $\{Q^{k} : \operatorname{supp} μ^{k} \xrightarrow{\text{circ}} [0, 1]\}_{k \in N}$ we have

$E_{μ^{k}} [(P^{k} (x) - χ_{D} (x))^{2}] \leq E_{μ^{k}} [(Q^{k} (x) - χ_{D} (x))^{2}] + δ (k)$

where $δ \in Δ$ .

In particular, optimal predictors are $Δ_{n e g}^{1}$ -optimal predictors and quasi-optimal predictors are $Δ_{0}^{1}$ -optimal predictors.

Definition 3

Consider $Δ$ an error space of rank 2 and $(D, μ)$ a distributional decision problem. A $Δ$-optimal predictor scheme for $(D, μ)$ is a family of Boolean circuits $\{P^{k j} : \operatorname{supp} μ^{k} \xrightarrow{\text{circ}} [0, 1]\}_{k, j \in N}$ s.t.

(i) $| P^{k j} | \leq p (k, j)$ for some polynomial $p$ .

(ii) For any family of Boolean circuits $\{Q^{k j} : \operatorname{supp} μ^{k} \xrightarrow{\text{circ}} [0, 1]\}_{k, j \in N}$ that satisfies (i) we have

$E_{μ^{k}} [(P^{k j} (x) - χ_{D} (x))^{2}] \leq E_{μ^{k}} [(Q^{k j} (x) - χ_{D} (x))^{2}] + δ (k, j)$

where $δ \in Δ$ .

It is straightforward to generalize Theorems 1-8 about quasi-optimal predictors to $Δ$ -optimal predictors and $Δ$ -optimal predictor schemes. The versions for optimal predictor schemes are stated in Appendix A without proof. The only theorem whose proof requires a slightly non-obvious tweak is Theorem 1. The amended proof is given in Appendix B.

The main technical novelty of this post is the following existence theorem.

Theorem 1

Consider $(D, μ)$ a distributional decision problem. Define the circuit family $P_{D, μ}^{*}$ by

$P_{D, μ}^{* k j} := \underset{| Q | \leq j}{\operatorname{arg\,min}} \, E_{μ^{k}} [(Q (x) - χ_{D} (x))^{2}]$

Then, $P_{D, μ}^{*}$ is a $Δ_{a v g}^{2}$ -optimal predictor scheme for $(D, μ)$ .

The proof is given in Appendix C.
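The argmin in Theorem 1 ranges over all Boolean circuits of size at most $j$, which is not feasible to enumerate directly. The following toy sketch replaces that class with a much smaller hypothetical one, namely lookup tables that read only the first $j$ bits of the input, and uses the uniform distribution on $n$-bit strings. These substitutions, the function names and the example language are all assumptions made for illustration. The sketch only shows the shape of the construction: minimise the expected squared error within a resource bound and observe how the minimal error behaves as the bound grows.

```python
from itertools import product

def best_prefix_predictor(n, j, chi):
    """Minimise the squared error over predictors that read only the first j bits.

    Toy stand-in for the argmin over circuits of size <= j: the hypothesis
    class here is "lookup tables on a j-bit prefix", not Boolean circuits.
    The minimiser is the conditional mean of chi given the prefix.
    """
    sums, counts = {}, {}
    for x in product((0, 1), repeat=n):
        prefix = x[:j]
        sums[prefix] = sums.get(prefix, 0) + chi(x)
        counts[prefix] = counts.get(prefix, 0) + 1
    return {prefix: sums[prefix] / counts[prefix] for prefix in sums}

def brier_error(n, j, chi):
    """Expected squared error of the best prefix predictor under uniform inputs."""
    table = best_prefix_predictor(n, j, chi)
    points = list(product((0, 1), repeat=n))
    return sum((table[x[:j]] - chi(x)) ** 2 for x in points) / len(points)

def chi_majority(x):
    # Hypothetical language D: strings with strictly more ones than zeros.
    return 1 if 2 * sum(x) > len(x) else 0

# Toy analogue of epsilon(k, j) for n = 6 and j = 0, ..., 6.
errors = [brier_error(6, j, chi_majority) for j in range(7)]
```

In this toy class the error sequence is non-increasing in $j$ and reaches zero once the whole input is visible. The analogous monotonicity of the minimal error in $j$ is what the proof in Appendix C relies on.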

The definition of $P^{*}$ is rather trivial, but the fact that Theorems A.1-A.7 apply to $P^{*}$ with regard to the error space $Δ_{avg}^{2}$ is non-trivial. For example, this can be applied to the "classical" setting of logical uncertainty by taking $D$ to be the set of true sentences in some formal logic $F$ (e.g. $PA$, $ZFC$). For a suitable choice of $μ$, $P_{D, μ}^{*}$ will satisfy an approximate version of the coherence conditions because of the considerations here.

Appendix A

Fix $Δ$ an error space of rank 2, $h$ the polynomial from condition (iv).

Theorem A.1

Consider $(D, μ)$ a distributional decision problem and $P$ a $Δ$-optimal predictor scheme for $(D, μ)$. Suppose $\{p_{k j} \in [0, 1]\}_{k, j \in N}$, $\{q_{k j} \in [0, 1]\}_{k, j \in N}$ are s.t.

$\exists ϵ > 0 \; \forall k, j : μ^{k} \{x \in \{0, 1\}^{*} \mid p_{k j} \leq P^{k j} (x) \leq q_{k j}\} \geq ϵ$

Define

$ϕ_{k j} := E_{μ^{k}} [χ_{D} (x) - P^{k j} (x) ∣ p_{k j} \leq P^{k j} (x) \leq q_{k j}]$

Then, $| ϕ | \in Δ$ .
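Theorem A.1 is a calibration statement: among inputs where the prediction falls in $[p_{k j}, q_{k j}]$, the average gap between the truth and the prediction is small. A minimal sketch of the quantity $ϕ_{k j}$ follows, assuming $μ^{k}$ is given as a finite list of weighted samples. That representation, the function name and the sample data are hypothetical.

```python
def calibration_gap(samples, p, q):
    """phi = E[chi_D(x) - P(x) | p <= P(x) <= q] over weighted samples.

    `samples` is a list of (probability, prediction, truth) triples; this
    finite representation of mu^k is an assumption made for illustration.
    """
    mass = sum(m for m, pred, _ in samples if p <= pred <= q)
    if mass == 0:
        raise ValueError("the conditioning event has zero probability")
    gap = sum(m * (truth - pred) for m, pred, truth in samples if p <= pred <= q)
    return gap / mass

# A predictor that says 0.5 on a coin flip and is certain elsewhere.
calibrated = [(0.25, 0.5, 1), (0.25, 0.5, 0), (0.25, 1.0, 1), (0.25, 0.0, 0)]
# A predictor that says 0.9 on a coin flip.
overconfident = [(0.5, 0.9, 1), (0.5, 0.9, 0)]

phi_calibrated = calibration_gap(calibrated, 0.4, 0.6)
phi_overconfident = calibration_gap(overconfident, 0.8, 1.0)
```

The first gap is zero and the second is about $-0.4$. The theorem says that for a $Δ$-optimal predictor scheme a gap of the second kind cannot persist, since $| ϕ |$ must lie in $Δ$.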

Theorem A.2

Consider $μ$ a word ensemble and $D_{1}$ , $D_{2}$ disjoint languages. Suppose $P_{1}$ is a $Δ$ -optimal predictor scheme for $(D_{1}, μ)$ and $P_{2}$ is a $Δ$ -optimal predictor scheme for $(D_{2}, μ)$ . Then, $P := η (P_{1} + P_{2})$ is a $Δ$ -optimal predictor scheme for $(D_{1} \cup D_{2}, μ)$ .

Theorem A.3

Consider $μ$ a word ensemble and $D_{1}$ , $D_{2}$ disjoint languages. Suppose $P_{1}$ is a $Δ$ -optimal predictor scheme for $(D_{1}, μ)$ and $P$ is a $Δ$ -optimal predictor scheme for $(D_{1} \cup D_{2}, μ)$ . Then, $P_{2} := η (P - P_{1})$ is a $Δ$ -optimal predictor scheme for $(D_{2}, μ)$ .

Theorem A.4

Consider $(D_{1}, μ_{1})$, $(D_{2}, μ_{2})$ distributional decision problems with respective $Δ$-optimal predictor schemes $P_{1}$ and $P_{2}$. Define $\{P^{k j} : \operatorname{supp} (μ_{1} \times μ_{2})^{k} \xrightarrow{\text{circ}} [0, 1]\}_{k, j \in N}$ as the family of circuits computing $P^{k j} ((x_{1}, x_{2})) := P_{1}^{k j} (x_{1}) P_{2}^{k j} (x_{2})$. Then, $P$ is a $Δ$-optimal predictor scheme for $(D_{1} \times D_{2}, μ_{1} \times μ_{2})$.

Theorem A.5

Consider $C, D \subseteq \{0, 1\}^{*}$ and $μ$ a word ensemble. Assume $P_{D}$ is a $Δ$-optimal predictor scheme for $(D, μ)$ and $P_{C ∣ D}$ is a $Δ$-optimal predictor scheme for $(C, μ ∣ D)$. Then $P_{D} P_{C ∣ D}$ is a $Δ$-optimal predictor scheme for $(C \cap D, μ)$.

Theorem A.6

Consider $C, D \subseteq \{0, 1\}^{*}$ and $μ$ a word ensemble. Assume $\exists ϵ > 0 \; \forall k : μ^{k} (D) \geq ϵ$. Assume $P_{D}$ is a $Δ$-optimal predictor scheme for $(D, μ)$ and $P_{C \cap D}$ is a $Δ$-optimal predictor scheme for $(C \cap D, μ)$. Define $P_{C ∣ D}$ as the circuit family computing

$P_{C ∣ D}^{k j} (x) := \begin{cases} 1 & \text{if } P_{D}^{k j} (x) = 0 \\ η \left( \frac{P_{C \cap D}^{k j} (x)}{P_{D}^{k j} (x)} \right) \text{ rounded to } h (k, j) \text{ binary places} & \text{if } P_{D}^{k j} (x) > 0 \end{cases}$

Then, $P_{C ∣ D}$ is a $Δ$ -optimal predictor scheme for $(C, μ ∣ D)$ .
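The pointwise rule in Theorem A.6 can be sketched as follows. The symbol $η$ is not defined in this post; the sketch assumes it clamps its argument to $[0, 1]$, which is consistent with how it is used in the proof of Theorem A.1. The function names and the sample inputs are hypothetical.

```python
def eta(t):
    """Clamp to [0, 1]; this is the assumed meaning of eta."""
    return min(1.0, max(0.0, t))

def round_binary(t, places):
    """Round t to the given number of binary places."""
    scale = 2 ** places
    return round(t * scale) / scale

def conditional_prediction(p_c_and_d, p_d, places):
    """Pointwise value of P_{C|D} built from P_{C and D} and P_D."""
    if p_d == 0:
        return 1.0
    return round_binary(eta(p_c_and_d / p_d), places)

examples = [
    conditional_prediction(0.25, 0.5, 8),  # plain ratio
    conditional_prediction(0.3, 0.0, 8),   # zero denominator branch
    conditional_prediction(0.6, 0.5, 8),   # ratio above 1 is clamped
    conditional_prediction(0.1, 0.3, 2),   # coarse rounding to 2 binary places
]
```

Rounding to $h (k, j)$ binary places keeps the output of the division representable by a circuit of polynomial size, at the cost of an error of order $2^{- h}$, which condition (iv) places inside $Δ$.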

Definition A.1

Consider $μ$ a word ensemble and $\{Q_{1}^{k j}, Q_{2}^{k j} : \operatorname{supp} μ^{k} \xrightarrow{\text{circ}} [0, 1]\}_{k, j \in N}$ two circuit families. We say $Q_{1}$ is $Δ$-similar to $Q_{2}$ relative to $μ$ (denoted $Q_{1} \overset{μ}{\underset{Δ}{\simeq}} Q_{2}$) when $E_{μ^{k}} [(Q_{1}^{k j} (x) - Q_{2}^{k j} (x))^{2}] \in Δ$.

Theorem A.7

Consider $(D, μ)$ a distributional decision problem, $P$ a $Δ$-optimal predictor scheme for $(D, μ)$ and $\{Q^{k j} : \operatorname{supp} μ^{k} \xrightarrow{\text{circ}} [0, 1]\}_{k, j \in N}$ a polynomial size family. Then, $Q$ is a $Δ$-optimal predictor scheme for $(D, μ)$ if and only if $P \overset{μ}{\underset{Δ}{\simeq}} Q$.

Appendix B

Lemma B.1

Consider $(D, μ)$ a distributional decision problem and $P$ a corresponding $Δ$ -optimal predictor scheme. Then, there is a function $δ : N^{4} \to [0, 1]$ s.t.

(i) $δ$ is non-decreasing in the third and fourth arguments.

(ii) For all polynomials $p, q : N^{2} \to N$ , we have $δ (k, j, p (k, j), q (k, j)) \in Δ$ .

(iii) For all $k, j \in N$, $Q : \operatorname{supp} μ^{k} \xrightarrow{\text{circ}} [0, 1]$ and $w : \operatorname{supp} μ^{k} \xrightarrow{\text{circ}} \mathbb{Q}^{\geq 0}$ we have

$E_{μ^{k}} [w (x) (P^{k j} (x) - χ_{D} (x))^{2}] \leq E_{μ^{k}} [w (x) (Q (x) - χ_{D} (x))^{2}] + (max w) δ (k, j, | Q |, | w |)$

The proof of Lemma B.1 is completely analogous to the proof of Lemma 2 for quasi-optimal predictors and I omit it.

Proof of Theorem A.1

Define $\{w^{k j} : \operatorname{supp} μ^{k} \xrightarrow{\text{circ}} \{0, 1\}\}_{k, j \in N}$ as the circuits computing

$w^{k j} (x) := θ (P^{k j} (x) - p_{k j}) θ (q_{k j} - P^{k j} (x))$

$| w^{k j} |$ is bounded by a polynomial: since $P^{k j}$ produces binary fractions of polynomial size, it is possible to compare them to the fixed numbers $p_{k j}, q_{k j}$ using a polynomial size circuit, even if the latter have infinite binary expansions.

We have

$ϕ_{k j} = \frac{E_{μ^{k}} [w^{k j} (x) (χ_{D} (x) - P^{k j} (x))]}{E_{μ^{k}} [w^{k j} (x)]}$

Define $ψ_{k j}$ to be $ϕ_{k j}$ truncated to the first significant binary digit. Define $\{Q^{k j} : \operatorname{supp} μ^{k} \xrightarrow{\text{circ}} [0, 1]\}_{k, j \in N}$ as the circuits computing

$Q^{k j} (x) := η (P^{k j} (x) + ψ_{k j})$

Denote by $I \subseteq N^{2}$ the set of $(k, j)$ for which $ϕ_{k j} > 2^{- h (k, j)}$. For $(k, j) \in I$, $ψ_{k j}$ has binary notation of polynomially bounded size, therefore $| Q^{k j} |$ is bounded by a polynomial for such $(k, j)$.

Applying Lemma B.1 we get

$\forall (k, j) \in I : E_{μ^{k}} [w^{k j} (x) (P^{k j} (x) - χ_{D} (x))^{2}] \leq E_{μ^{k}} [w^{k j} (x) (Q^{k j} (x) - χ_{D} (x))^{2}] + δ (k, j)$

for $δ \in Δ$ .

$\forall (k, j) \in I : E_{μ^{k}} [w^{k j} (x) ((P^{k j} (x) - χ_{D} (x))^{2} - (Q^{k j} (x) - χ_{D} (x))^{2})] \leq δ (k, j)$

$\forall (k, j) \in I : E_{μ^{k}} [w^{k j} (x) ((P^{k j} (x) - χ_{D} (x))^{2} - (η (P^{k j} (x) + ψ_{k j}) - χ_{D} (x))^{2})] \leq δ (k, j)$

Obviously $(η (P^{k j} (x) + ψ_{k j}) - χ_{D} (x))^{2} \leq (P^{k j} (x) + ψ_{k j} - χ_{D} (x))^{2}$ , therefore

$\forall (k, j) \in I : E_{μ^{k}} [w^{k j} (x) ((P^{k j} (x) - χ_{D} (x))^{2} - (P^{k j} (x) + ψ_{k j} - χ_{D} (x))^{2})] \leq δ (k, j)$

$\forall (k, j) \in I : ψ_{k j} E_{μ^{k}} [w^{k j} (x) (2 (χ_{D} (x) - P^{k j} (x)) - ψ_{k j})] \leq δ (k, j)$

The expression on the left hand side is a quadratic polynomial in $ψ_{k j}$ which attains its maximum at $ϕ_{k j}$ and has roots at $0$ and $2 ϕ_{k j}$ . $ψ_{k j}$ is between $0$ and $ϕ_{k j}$ , but not closer to $0$ than $\frac{ϕ_{k j}}{2}$ . Therefore, the inequality is preserved if we replace $ψ_{k j}$ by $\frac{ϕ_{k j}}{2}$ .

$\forall (k, j) \in I : \frac{ϕ_{k j}}{2} E_{μ^{k}} [w^{k j} (x) (2 (χ_{D} (x) - P^{k j} (x)) - \frac{ϕ_{k j}}{2})] \leq δ (k, j)$

Substituting the equation for $ϕ_{k j}$ we get

$\forall (k, j) \in I : \frac{1}{2} \frac{E_{μ^{k}} [w^{k j} (x) (χ_{D} (x) - P^{k j} (x))]}{E_{μ^{k}} [w^{k j} (x)]} E_{μ^{k}} [w^{k j} (x) (2 (χ_{D} (x) - P^{k j} (x)) - \frac{1}{2} \frac{E_{μ^{k}} [w^{k j} (x) (χ_{D} (x) - P^{k j} (x))]}{E_{μ^{k}} [w^{k j} (x)]})] \leq δ (k, j)$

$\forall (k, j) \in I : \frac{3}{4} \frac{E_{μ^{k}} [w^{k j} (x) (χ_{D} (x) - P^{k j} (x))]^{2}}{E_{μ^{k}} [w^{k j} (x)]} \leq δ (k, j)$

$\forall (k, j) \in I : \frac{3}{4} E_{μ^{k}} [w^{k j} (x)] ϕ_{k j}^{2} \leq δ (k, j)$

$\forall (k, j) \in I : ϕ_{k j}^{2} \leq \frac{4}{3} E_{μ^{k}} [w^{k j} (x)]^{- 1} δ (k, j)$

$\forall (k, j) \in I : ϕ_{k j}^{2} \leq \frac{4}{3} μ^{k} \{x \in \{0, 1\}^{*} \mid p_{k j} \leq P^{k j} (x) \leq q_{k j}\}^{- 1} δ (k, j)$

Thus for all $k, j \in N$ we have

$| ϕ_{k j} | \leq 2^{- h (k, j)} + \sqrt{\frac{4}{3} μ^{k} \{x \in \{0, 1\}^{*} \mid p_{k j} \leq P^{k j} (x) \leq q_{k j}\}^{- 1} δ (k, j)}$

In particular, $| ϕ | \in Δ$ .
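The step in the proof above that replaces $ψ_{k j}$ by $\frac{ϕ_{k j}}{2}$ can be checked numerically. Writing $W := E_{μ^{k}} [w^{k j} (x)]$, the left hand side equals $W ψ (2 ϕ - ψ)$, and for $ψ \in [\frac{ϕ}{2}, ϕ]$ this is at least $\frac{3}{4} W ϕ^{2}$. The sketch below assumes $ϕ > 0$ and reads "truncated to the first significant binary digit" as "the largest power of two not exceeding $ϕ$". That reading, the function names and the sample values are assumptions.

```python
import math

def truncate_to_first_binary_digit(phi):
    """Largest power of two not exceeding phi > 0 (keeps only the leading binary digit)."""
    return 2.0 ** math.floor(math.log2(phi))

def quadratic_lhs(psi, phi, weight_mass):
    """psi * E[w (2 (chi - P) - psi)], rewritten using E[w (chi - P)] = phi * E[w]."""
    return weight_mass * psi * (2 * phi - psi)

checks = []
for phi in (0.9, 0.5, 0.3, 0.07, 0.001):
    psi = truncate_to_first_binary_digit(phi)
    lower = quadratic_lhs(phi / 2, phi, weight_mass=0.4)
    # First entry: psi lies in (phi/2, phi]. Second: the quadratic at psi
    # is at least its value at phi/2, i.e. at least (3/4) * W * phi^2.
    checks.append((phi / 2 < psi <= phi, quadratic_lhs(psi, phi, 0.4) >= lower))
```

Truncating rather than using $ϕ_{k j}$ itself is what keeps $ψ_{k j}$ short enough to hard-code into a polynomial size circuit, while losing only a constant factor in the final bound.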

Appendix C

The following is a proof of Theorem 1.

Define $ϵ (k, j)$ by

$ϵ (k, j) := E_{μ^{k}} [(P_{D, μ}^{* k j} (x) - χ_{D} (x))^{2}] = \min_{| Q | \leq j} E_{μ^{k}} [(Q (x) - χ_{D} (x))^{2}]$

To prove $P_{D, μ}^{*}$ is $Δ_{avg}^{2}$-optimal, it is enough to consider families $Q$ of the form $Q^{k j} = P_{D, μ}^{* k, q (k, j)}$ for a polynomial $q$, since

$ϵ (k, | Q^{k j} |) \leq E_{μ^{k}} [(Q^{k j} (x) - χ_{D} (x))^{2}]$

That is, we need to prove that for any polynomial $q$ , $ϵ (k, j) - ϵ (k, max (q (k, j), j)) \in Δ_{a v g}^{2}$ .

Without loss of generality, assume $q (k, j) = k^{n} j^{m}$ for $m > 0$ . We are interested in the function

$δ (k) := \frac{\sum_{j = 3}^{f (k) - 1} (\log \log (j + 1) - \log \log j) (ϵ (k, j) - ϵ (k, k^{n} j^{m}))}{\log \log f (k) - \log \log 3}$

This can be rewritten as

$δ (k) = \frac{\sum_{j = 3}^{f (k) - 1} (\log \log (j + 1) - \log \log j) (ϵ (k, j) - ϵ (k, j^{m + \frac{n \log k}{\log j}}))}{\log \log f (k) - \log \log 3}$

$δ (k) \leq \frac{\sum_{j = 3}^{f (k) - 1} (\log \log (j + 1) - \log \log j) (ϵ (k, j) - ϵ (k, j^{m + \frac{n \log k}{\log 3}}))}{\log \log f (k) - \log \log 3}$

The numerator can be recast as an integral

$δ (k) \leq \frac{\int_{x = 3}^{f (k)} (ϵ (k, \lfloor x \rfloor) - ϵ (k, \lfloor x \rfloor^{m + \frac{n \log k}{\log 3}})) \, d (\log \log x)}{\log \log f (k) - \log \log 3}$

$δ (k) \leq \frac{\int_{x = 3}^{f (k)} (ϵ (k, \lfloor x \rfloor) - ϵ (k, \lfloor x^{m + \frac{n \log k}{\log 3}} \rfloor)) \, d (\log \log x)}{\log \log f (k) - \log \log 3}$

$δ (k) \leq \frac{\int_{x = 3}^{f (k)} ϵ (k, \lfloor x \rfloor) \, d (\log \log x) - \int_{x = 3}^{f (k)} ϵ (k, \lfloor x^{m + \frac{n \log k}{\log 3}} \rfloor) \, d (\log \log x)}{\log \log f (k) - \log \log 3}$

Raising $x$ to a power is equivalent to adding a constant to $log log x$ , therefore

$δ (k) \leq \frac{\int_{x = 3}^{f (k)} ϵ (k, \lfloor x \rfloor) \, d (\log \log x) - \int_{x = 3^{m + \frac{n \log k}{\log 3}}}^{f (k)^{m + \frac{n \log k}{\log 3}}} ϵ (k, \lfloor x \rfloor) \, d (\log \log x)}{\log \log f (k) - \log \log 3}$

$δ (k) \leq \frac{\int_{x = 3}^{3^{m + \frac{n \log k}{\log 3}}} ϵ (k, \lfloor x \rfloor) \, d (\log \log x)}{\log \log f (k) - \log \log 3}$

Since $ϵ \leq 1$ , we get

$δ (k) \leq \frac{log (m + \frac{n log k}{log 3})}{log log f (k) - log log 3}$

Using the assumption on $f$, we conclude that $\lim_{k \to \infty} δ (k) = 0$, as needed.
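Two facts carry the end of this proof: raising $x$ to a power $p$ shifts $\log \log x$ by the constant $\log p$, and the resulting bound tends to zero under the growth assumption on $f$. Both can be checked numerically. The sketch below assumes natural logarithms. The values of $n$, $m$, $p$ and the choice $\log \log f (k) = (\log \log k)^{2}$ are hypothetical, picked only so that $\frac{\log \log k}{\log \log f (k)} \to 0$.

```python
import math

def loglog_measure(a, b):
    """Integral of d(log log x) over [a, b], i.e. log log b - log log a."""
    return math.log(math.log(b)) - math.log(math.log(a))

def final_bound(k, n, m, loglog_f):
    """Right-hand side of the last inequality, given the value of log log f(k)."""
    p = m + n * math.log(k) / math.log(3)
    return math.log(p) / (loglog_f - math.log(math.log(3)))

# Raising both endpoints to the power p leaves the log-log length unchanged...
p = 5.0
shift = loglog_measure(3 ** p, 10 ** p) - loglog_measure(3, 10)
# ...and the interval [3, 3^p] has log-log length exactly log p.
width = loglog_measure(3, 3 ** p)

# Hypothetical f with log log f(k) = (log log k)^2.
bounds = [
    final_bound(k, n=2, m=3, loglog_f=math.log(math.log(k)) ** 2)
    for k in (10.0 ** 3, 10.0 ** 6, 10.0 ** 12, 10.0 ** 24, 10.0 ** 48)
]
```

The sequence `bounds` decreases, but slowly, roughly like $\frac{1}{\log \log k}$ for this choice of $f$. This matches the observation in the discussion below that the averaging range for $j$ has to be very large.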

Paul Christiano

If $δ (k, j) \in Δ_{a v g}^{2}$ , for what functions $g$ does $δ (k, g (k)) \to 0$ ?

Does this require $j$ to grow faster than any quasipolynomial function of $k$ ? Because that's pretty fast. I guess it's clear that $j$ is going to have to increase faster than any polynomial in $k$ ?

Vanessa Kosoy

There are no functions with this property. You have to take the log-log uniform average over values of $j$ (up to a superquasipolynomial function) in order to guarantee convergence to 0. However, if you have a perfect predictor then you can amplify so that the error behaves like you described. I think it's possible to change the formalism in a way which replaces quasipolynomials by polynomials and superquasipolynomial functions by superpolynomial functions, but this requires introducing assumptions about the computational model, so I avoided it for now.

Note though that superquasipolynomial is still "far from exponential" in some sense, because there is a natural infinite tower of complexities between polynomial and exponential, where the $0$-th level is polynomial functions and the $(n + 1)$-st level consists of functions of the form $2^{f (\log x)}$ where $f$ is a function of the $n$-th level (so quasipolynomials are level 1).

Btw, if you're trying to catch up on optimal predictors then I have a nearly finished draft of a paper with an orderly presentation and consistent notation that I can send you.