 Online Learning 1: Bias-detecting online learners
post by Ryan Carey 1015 days ago | Vanessa Kosoy, Jessica Taylor, Nate Soares and Paul Christiano like this | 7 comments

Note: This describes an idea of Jessica Taylor’s, and is the first of several posts about aspects of online learning.

# Introduction

Thanks to the recent paper on logical inductors, we are currently interested in what can be done with weighted ensembles of predictors. In particular, it would be neat if the idea about finding the fixed point of some predictors could be applied to online learning.

It seems like this approach fits in neatly into the framework of prediction with expert advice in online convex optimization. Usually, in online convex optimization, you can achieve low cumulative regret with respect to experts that make some fixed predictions. However, using the approach of finding the experts’ fixed point, we can obtain a low cumulative regret with respect to experts that make modifications to our predictions.

# Setup

In each round, $$t=1,...,T$$, the learner predicts a vector $$\boldsymbol{w}_t \in \mathcal{Y}$$ where $$\mathcal{Y}$$ is some compact convex subset of Euclidean space. It does this with the advice of a set of expert predictors $$E=\{\boldsymbol{e}_1, ... \boldsymbol{e}_n\}$$. Each of these experts $$\boldsymbol{e}_i:\mathcal{Y} \mapsto \mathcal{Y}$$ is a continuous function that maps a prediction to a prediction. These experts are interpreted as hypotheses about how predictions can be improved, and they are known to the learner. After lodging its prediction, the learner is informed of the value of some loss function $$\ell_t:\mathcal{Y} \mapsto \mathbb{R}$$ if each expert’s advice was followed: $$\{\ell_t(\boldsymbol{e}_1(\boldsymbol{w}_t)),...\ell_t(\boldsymbol{e}_n(\boldsymbol{w}_t))\}$$. $$\ell_t$$ is assumed to be convex and $$L_t$$-Lipschitz with respect to $$\lVert \cdot \rVert_1$$. The learner’s goal is to minimize its regret at time $$T$$, relative to the advice of its experts, $$\text{sup}_{\boldsymbol{e} \in E}\text{Regret}_T(\boldsymbol{e})$$, where: $\text{Regret}_T(\boldsymbol{e}) := \sum_{t=1}^T \ell_t(\boldsymbol{w_t}) - \sum_{t=1}^T\ell_t(\boldsymbol{e}(\boldsymbol{w}_t))$

# Proposed learning approach

The basic idea is that the learner maintains a weighting $$\boldsymbol{v}_t$$ that indicates its relative level of trust in each expert. This lies on the simplex, $$\boldsymbol{v_t} \in \Delta^n := \{\boldsymbol{x} \in [0,1]^k:\sum_{i=1}^k{x_i}=1\}$$, with each element $$v_{t,i}$$ giving the credence in the $$i^{th}$$ expert. At each time step, this weighting is used to make a prediction $$\boldsymbol{w}_t$$. Specifically, the prediction is chosen so that the weighted ensemble of experts would not alter it (i.e. it is the prediction that lies at their fixed point).

Formally, define the weighted combination of the experts at time step $$t$$, $$\boldsymbol{e}^*_t:\mathcal{Y} \mapsto \mathcal{Y}$$ as: $\boldsymbol{e}^*_t(\boldsymbol{w}) := \sum_{i=1}^n v_{t,i} \boldsymbol{e}_i(\boldsymbol{w})$

For the first time step, each $$v_{t,i}$$ may simply be initialized to $$\frac{1}{n}$$, giving equal weight to each of the expert predictors. Then, at each time step $$t$$, the learner predicts the vector $$\boldsymbol{w}_t$$ that lies at the fixed point of $$\boldsymbol{e}^*_t$$, so that: $\boldsymbol{e}^*_t(\boldsymbol{w}_t) = \boldsymbol{w}_t$

By Brouwer’s fixed point theorem, there will be at least one such fixed point.

After submitting this prediction, the learner discovers the loss that each expert would have incurred. It uses these to compute a new value of $$\boldsymbol{v}_t$$ for the next time step. This is done by performing an update step with respect to $$\ell'_t:\Delta^n\mapsto R$$, which will be convex and $$L_t$$-Lipschitz with respect to $$\lVert \cdot \rVert_1$$.: $\ell_t'(\boldsymbol{v}_t) := \ell_t\left(\sum_{i=1}^n v_{t,i} \boldsymbol{e}_i(\boldsymbol{w}_t)\right)$

This can be done using the exponentiated gradients algorithm or some such other online convex optimization algorithm. If exponentiated gradients algorithm is used, we obtain a regret bound of : $\sum_{t=1}^T \ell_t'(\boldsymbol{v}_t) - \sum_{t=1}^T \ell_t'(\boldsymbol{v}^*_t) \leq \sqrt[]{2 L \log (n)T}\\$ where $$\boldsymbol{v}^*_t = \text{inf}_{\boldsymbol{v} \in \Delta^n}{\ell_t'(\boldsymbol{v}) }$$ is the optimal weighting over experts and $$L = \sqrt[]{\frac{1}{T}\sum_{i=1}^T L_n^2}$$ depends on the Lipschitz constants of the loss functions. Recall also that $$n$$ is the number of experts.

Equivalently: $\sum_{t=1}^T \ell_t\left(\sum_{i=1}^n v_{t,i} \boldsymbol{e}_i(\boldsymbol{w}_t)\right) - \sum_{t=1}^T \ell_t\left(\sum_{i=1}^n v^*_{t,i} \boldsymbol{e}_i(\boldsymbol{w}_t)\right) \leq \sqrt[]{2 \log (n)T}\\$

This bounds the regret with respect to the advice of any particular expert, because these experts correspond to cases in which any expert $$i$$ is given weighting $$v_{t,i}=1$$:

\begin{aligned} \forall i \quad \sum_{t=1}^T \ell_t(\boldsymbol{w}_t) - \sum_{t=1}^T \ell_t(\boldsymbol{e}_i(\boldsymbol{w}_t)) & \leq \sqrt[]{2\text{log}(n)T} \\ \text{sup}_{\boldsymbol{e} \in E}\text{Regret}_T(\boldsymbol{e}) &\leq\\ \end{aligned}

# Discussion

This is in more general than the usual use of prediction with expert advice because the experts are more varied. Instead of each expert representing a constant prediction, it represents a continuous map from the space of prediction vectors to itself. One limitation, however, is that in order for the fixed point argument to work, the prediction vectors $$\boldsymbol{w}_t$$ must be located in a compact convex set. Although, so long as these vectors are predicting some quantity with known bounds, this is not a severe limitation. Another limitation is that locating the fixed point of the experts may be very computationally difficult.

# Footnotes

1. Given in Corollary 2.14 on page 140 of the review Shalev-Shwartz, Shai. “Online learning and online convex optimization.” Foundations and Trends in Machine Learning 4.2 (2011): 107-194.  by Ryan Carey 1015 days ago | Alex Vermeer and Nate Soares like this | link Given that this is my first post, critical feedback is especially welcome. reply  by Paul Christiano 1012 days ago | Vanessa Kosoy and Jessica Taylor like this | link Also see the notion of “internal” regret, e.g. as discussed by Blum and Mansour. reply  by Vanessa Kosoy 1011 days ago | link And I just wanted to write a post about an algorithm that leads to correlated equilibria without information or monitoring by ensuring a bound on swap regret, thinking this result is novel. Oh well… reply  by Paul Christiano 1006 days ago | Ryan Carey and Nate Soares like this | link This is cool! It would be nice to know what you can get in polynomial time. Can you find $$\epsilon$$-approximate fixed points in this setting in time polynomial in $$1/\epsilon$$? That would give a polynomial time algorithm with the same regret asymptotically. (It’s normally prediction with expert advice rather than assistance. And you should probably use bigger parentheses around sums.) reply  by Jessica Taylor 1006 days ago | Nate Soares likes this | link In general finding $$\epsilon$$-approximate fixed points will be PPAD-hard. On the other hand, it seems like we could use an online learning algorithm to get a distribution over weight vectors such that the combined expert’s expected modification to a weight vector sampled from this distribution is small, in polynomial time. I haven’t worked out the details, but some version of this might imply that the combined expert can’t expect to do better than you, if you randomly choose a weight vector from this distribution and the loss function is decided before you make this random choice. (by the way, the same technique should work to construct a qualipolynomial-time randomized “logical inductor” satisfying a weaker property) reply  by Paul Christiano 916 days ago | link A very similar idea is actually introduced by Blum and Mansour in 2007 on page 1312, though they deal with linear rather than continuous transformations and so can find the fixed point easily. reply  by Paul Christiano 1006 days ago | Ryan Carey likes this | link We could also generalize this to a general update function $$U : \mathcal{X} \times \mathcal{Y} \rightarrow \mathcal{Y}$$, where $$\mathcal{X}$$ may be the simplex over experts or some other set. Then we just require that $$U$$ be convex in each argument and Lipschitz in the first. And the complexity is certainly polynomial in the number of calls to the fixed point oracle. This seems to be what I needed to improve the $$T^{2/3}$$ in my COLT paper to $$T^{1/2}$$. (I explicitly used the fixed point in the analysis, but didn’t think of this algorithm.) reply

### NEW DISCUSSION POSTS

[Note: This comment is three
 by Ryan Carey on A brief note on factoring out certain variables | 0 likes

There should be a chat icon
 by Alex Mennen on Meta: IAFF vs LessWrong | 0 likes

Apparently "You must be
 by Jessica Taylor on Meta: IAFF vs LessWrong | 1 like

There is a replacement for
 by Alex Mennen on Meta: IAFF vs LessWrong | 1 like

Regarding the physical
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think that we should expect
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

I think I understand your
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

This seems like a hack. The
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

After thinking some more,
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yes, I think that we're
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

My intuition is that it must
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

To first approximation, a
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Actually, I *am* including
 by Vanessa Kosoy on The Learning-Theoretic AI Alignment Research Agend... | 0 likes

Yeah, when I went back and
 by Alex Appel on Optimal and Causal Counterfactual Worlds | 0 likes

> Well, we could give up on
 by Jessica Taylor on The Learning-Theoretic AI Alignment Research Agend... | 0 likes