Personal Blog

I introduce the concept of "optimal predictor scheme" which differs from (quasi)optimal predictors in depending on an additional parameter representing the amount of computing resources the predictor is allowed to use. For a certain flavor of optimal predictor scheme, I prove existence for arbitrary distributional decision problems.

Results

It is convenient to think of the concepts of "optimal predictor" and "quasi-optimal predictor" as special cases of the more general -optimal predictors corresponding to different choices of the "error space" .

Definition 1

Let be a positive integer. An error space of rank r is a set of bounded functions from to s.t.

(i) If then .

(ii) If and then .

(iii) Given , if then .

(iv) There is a polynomial s.t. .

Example 1.1

is the set of functions s.t. .

Example 1.2

is the set of negligible functions .

Example 1.3

is the set of bounded functions s.t. for any , if

then

It is slightly non-obvious that condition (iii) holds in this example. To see this, note that the expression inside the limit can be regarded as where is a certain probability measure over . implies that for any , since . For any , we have , therefore .

Definition 2

Consider an error space of rank 1 and a distributional decision problem. A -optimal predictor for is a family of polynomial size Boolean circuits s.t. for any family of polynomial size Boolean circuits we have

where .

In particular, optimal predictors are -optimal predictors and quasi-optimal predictors are -optimal predictors.

Definition 3

Consider an error space of rank 2 and a distributional decision problem. A -optimal predictor scheme for is a family of Boolean circuits s.t.

(i) for some polynomial .

(ii) for any family of Boolean circuits that satisfies (i) we have

where .

It is straightforward to generalize Theorems 1-8 about quasi-optimal predictors to -optimal predictors and -optimal predictor schemes. The versions for optimal predictor schemes are stated in Appendix A without proof. The only theorem whose proof requires a slightly non-obvious tweak is Theorem 1. The amended proof is given in Appendix B.

The main technical novelty of this post is the following existence theorem.

Theorem 1

Consider a distributional decision problem. Define the circuit family by

Then, is a -optimal predictor scheme for .

The proof is given in Appendix C.

The definition of is rather trivial, but the fact Theorems A.1-A.7 apply to with regard to the error space is non-trivial. For example, this can be applied to the "classical" setting of logical uncertainty by taking to be the set of true sentences in some formal logic (e.g. , ). For a suitable choice of , will satisfy an approximate version of the coherence conditions because of the considerations here.

Appendix A

Fix an error space of rank 2, the polynomial from condition (iv).

Theorem A.1

Consider a distributional decision problem and a -optimal predictor scheme for . Suppose , are s.t.

Define

Then, .

Theorem A.2

Consider a word ensemble and , disjoint languages. Suppose is a -optimal predictor scheme for and is a -optimal predictor scheme for . Then, is a -optimal predictor scheme for .

Theorem A.3

Consider a word ensemble and , disjoint languages. Suppose is a -optimal predictor scheme for and is a -optimal predictor scheme for . Then, is a -optimal predictor scheme for .

Theorem A.4

Consider , distributional decision problems with respective -optimal predictor schemes and . Define as the family of circuits computing . Then, is a -optimal predictor scheme for .

Theorem A.5

Consider and a word ensemble. Assume is a -optimal predictor scheme for and is a -optimal predictor scheme for . Then is a -optimal predictor scheme for

Theorem A.6

Consider and a word ensemble. Assume . Assume is a -optimal predictor scheme for and is a -optimal predictor scheme for . Define as the circuit family computing

Then, is a -optimal predictor scheme for .

Definition A.1

Consider a word ensemble and two circuit families. We say is -similar to relative to (denoted ) when .

Theorem A.7

Consider a distributional decision problem, a -optimal predictor scheme for and a polynomial size family. Then, is a -optimal predictor scheme for if and only if .

Appendix B

Lemma B.1

Consider a distributional decision problem and a corresponding -optimal predictor scheme. Then, there is a function s.t.

(i) is non-decreasing in the third and fourth arguments.

(ii) For all polynomials , we have .

(iii) for all , and we have

The proof of Lemma B.1 is completely analogous to the proof of Lemma 2 for quasi-optimal predictors and I omit it.

Proof of Theorem A.1

Define as the circuits computing

is bounded by a polynomial since produces binary fractions of polynomial size therefore it is possible to compare them to the fixed numbers using a polynomial size circuit even if the latter have infinite binary expansions.

We have

Define to be truncated to the first significant binary digit. Define as the circuits computing

Denote the set of for which . For , has binary notation of polynomially bounded size, therefore is bounded by a polynomial for such .

Applying Lemma B.1 we get

for .

Obviously , therefore

The expression on the left hand side is a quadratic polynomial in which attains its maximum at and has roots at and . is between and , but not closer to than . Therefore, the inequality is preserved if we replace by .

Substituting the equation for we get

Thus for all we have

In particular, .

Appendix C

The following is a proof of Theorem 1.

Define by

To prove is -optimal, it enough to consider families of the form for polynomial , since

That is, we need to prove that for any polynomial , .

Without loss of generality, assume for . We are interested in the function

This can be rewritten as

The numerator can be recast as an integral

Raising to a power is equivalent to adding a constant to , therefore

Since , we get

Using the assumption on , we conclude that , as needed.

Personal Blog

2

New Comment
2 comments, sorted by Click to highlight new comments since: Today at 2:13 PM

If , for what functions does ?

Does this require to grow faster than any quasipolynomial function of ? Because that's pretty fast. I guess it's clear that is going to have to increase faster than any polynomial in ?

There are no functions with this property. You have to do the log-log uniform average over s (up to a superquasipolynomial function) in order to guarantee convergence to 0 (however if you have a perfect predictor then you can amplify so that the error behaves like you described). I think it's possible to change the formalism in a way which replaces quasipolynomials by polynomials and superquasipolynomial functions by superpolynomial functions but this requires introducing assumptions about the computational model so I avoided it for now.

Note though that superquasipolynomial is still "far from exponential" in some sense because there is a natural infinite tower of complexities between polynomial and exponential where -th level is polynomial functions and the -st level consists of functions of the form where is a function of the -th level (so quasipolynomials are level 1).

Btw, if you're trying to catch up on optimal predictors then I have a nearly finished draft of a paper with an orderly presentation and consistent notation that I can send you.