Summary: If we train a classifier on a training set that comes from one distribution, and test it on a dataset coming from a different distribution, uniform convergence guarantees generally no longer hold. This post presents a strategy for creating classifiers that will reject test points when they are sufficiently different from training data. It works by rejecting points that are much more probable under the predicted test distribution than under the training distribution.
Introduction
In machine learning, we often train a system (e.g. a classifier or regression system) on a training set, and then test it on a test set. If the test set comes from the same distribution as the training set, uniform convergence guarantees allow us to bound the system’s expected error on the test set based on its performance on the training set. As an example, if we are creating an automated system for making moral judgments, we could get training data by asking humans for their moral judgments. Then we could use the system to make additional moral judgments.
If the test dataset comes from the same distribution as the training dataset, then uniform convergence guarantees can give us nice bounds on the performance on the test set. In reality, the test set will often be different. For a moral judgment system, this could be disastrous: perhaps we only train the classifier on ordinary moral problems, but then the classifier decides whether it is a good idea to tile the universe with tiny smiley faces. At this point, we have no guarantees about whether the classifier will correctly judge this question.
Therefore, I aim to create a system that, when presented with a question, will choose to either answer the question or abort. It should abort when the question is sufficiently different from the training data that the system cannot make reliable judgments.
Mathematical setup
We have some input set \(\mathcal{X}\). Let \(p_{train} : \Delta \mathcal{X}\) be the (known) distribution of training data. We train a classifier (or some other system) on the training data, and (using uniform convergence guarantees) estimate that it has good expected performance on random points from \(p_{train}\). Afterwards, we will receive test points in \(\mathcal{X}\). The system can either use the classifier on the test point and make decisions on the basis of this classification, or abort (e.g. by asking a human for their judgment instead of the classifier).
Specifically, suppose there is some unknown cost function \(c : \mathcal{X} \rightarrow \mathbb{R}^+\). \(c(x)\) represents the cost of using the classifier’s judgment on point \(x\). Due to uniform convergence guarantees, we have \(\mathbb{E}_{p_{train}}[c(X)] \leq 1\) (i.e. it is not very costly to use the classifier on random training points). Also suppose that aborting always has a cost of \(k\).
Let \(p_{test} : \Delta \mathcal{X}\) represent the system’s prediction for the next test point. After predicting the next test point, the system chooses a policy \(\pi : \mathcal{X} \rightarrow [0, 1]\), which maps each possible input point to the probability that the system will use the classifier for that point.
The total expected cost will then be \(\mathbb{E}_{p_{test}}[\pi(X) c(X) + (1  \pi(X))k] = k  k\mathbb{E}_{p_{test}}[\pi(X)] + \mathbb{E}_{p_{test}}[\pi(X)c(X)]\).
The minimax policy
Suppose the system chooses a minimax policy. Then
\[\pi
= \arg\min_{\pi}\max_{c, \mathbb{E}_{p_{train}}[c(X)] \leq 1} (k  k\mathbb{E}_{p_{test}}[\pi(X)] + \mathbb{E}_{p_{test}}[\pi(X)c(X)])
\] \[
= \arg\min_{\pi}(k\mathbb{E}_{p_{test}}[\pi(X)] + \max_{c, \mathbb{E}_{p_{train}}[c(X)] \leq 1} \mathbb{E}_{p_{test}}[\pi(X)c(X)])
\]
To solve this optimization problem, first consider how to maximize \(\mathbb{E}_{p_{test}}[\pi(X)]\) (i.e. the probability of using the classifier for the next point) subject to the hard constraint on \(\pi\) that \(\forall c. \mathbb{E}_{p_{train}}[c(X)] \leq 1 \rightarrow \mathbb{E}_{p_{test}}[\pi(X)c(X)] \leq d\).
The constraint can be rewritten:
\[\forall c. \frac{\mathbb{E}_{p_{test}}[\pi(X)c(X)]}{\mathbb{E}_{p_{train}}[c(X)]} \leq d\]
As explained previously in the post about quantilizers, this constraint is equivalent to \[\forall x \in \mathcal{X}. \frac{p_{test}(X)\pi(X)}{p_{train}(X)} \leq d\] or equivalently \[\forall x \in \mathcal{X}. \pi(X) \leq d \frac{p_{train}(X)}{p_{test}(X)}\]
This constraint states that we may only rarely use the classifier for points that are much more probable under the predicted test distribution than the training distribution.
How can the system maximize \(\mathbb{E}_{p_{test}}[\pi(X)]\) subject to this constraint on \(\pi\)? Since the constraint is a conjunction of independent constraints for each \(x\), and the quantity to be maximized is monotonic in each \(\pi(x)\), it is sufficient to maximize each \(\pi(x)\) subject to the constraint that \(\pi(x) \leq d \frac{p_{train}(X)}{p_{test}(X)}\). But this is achieved by setting \(\pi(x) = \min\{1, d \frac{p_{train}(X)}{p_{test}(X)} \}\). Therefore, this policy maximizes \(\mathbb{E}_{p_{test}}[\pi(X)]\) subject to the constraint.
Now let’s step back and use this fact to solve the original optimization problem. This policy \(\pi\) achieves maximum cost \[\max_{c,
\mathbb{E}_{p_{train}}[c(X)] \leq 1} \mathbb{E}_{p_{test}}[\pi(X) c(X) + (1  \pi(X))k]
\] \[
= k 
k\mathbb{E}_{p_{test}}[\pi(X)] + \max_{c,
\mathbb{E}_{p_{train}}[c(X)] \leq 1}\mathbb{E}_{p_{test}}[\pi(X)c(X)]
\] \[
= k  k\mathbb{E}_{p_{test}}[\min\{1, d \frac{p_{train}(X)}{p_{test}(X)} \}] + d
\] \[
= k\mathbb{E}_{p_{test}}[\max\{0, 1  d \frac{p_{train}(X)}{p_{test}(X)} \}] + d
\]
To minimize worstcase expected cost, choose \(d\) to minimize this quantity, and then set \(\pi(x) = \min\{1, d \frac{p_{train}(x)}{p_{test}(x)} \}\).
Classifying multiple points
The previous section analyzed the case when we classify a single point \(X\). We can apply this strategy repeatedly to classify multiple data points, but this strategy is not necessarily optimal. Instead, the optimal policy is to reject points that will appear more expected times in the data, compared to the probability under the training set. The derivation of this rule follows.
Suppose we will classify multiple data points \(X_1, ..., X_n\), with joint distribution \(p_{test}(X_1, ..., X_n)\). We will choose the policy \(\pi_i : \mathcal{X}^i \rightarrow [0, 1]\), where \(\pi_i(X_1, ..., X_{i1}, X_i)\) is the probability of using the classification of point \(X_i\) given the previous points \(X_1, ..., X_{i1}\). All \(\pi_i\) will be chosen to jointly minimize the maximum possible total expected cost
\[
\max_{c : \mathbb{E}_{p_{train}}[c(X)] \leq 1} \mathbb{E}_{p_{test}}[\sum_{i=1}^n (\pi(X_1, ..., X_i) c(X_i) + (1  \pi(X_1, ..., X_i))k)]
\]
Due to linearity of expectation, this can be decomposed \[
\max_{c : \mathbb{E}_{p_{train}}[c(X)] \leq 1} \sum_{i=1}^n \mathbb{E}_{p_{test}}[\pi(X_1, ..., X_i) c(X_i) + (1  \pi(X_1, ..., X_i))k]
\] \[
=
\max_{c : \mathbb{E}_{p_{train}}[c(X)] \leq 1} \sum_{i=1}^n (k  k \mathbb{E}_{p_{test}}[\pi(X_1, ..., X_i)] + \mathbb{E}_{p_{test}}[\pi(X_1, ..., X_i) c(X_i)])
\] \[
=
nk  k \sum_{i=1}^n \mathbb{E}_{p_{test}}[\pi_i(X_1, ..., X_i)] + \max_{c : \mathbb{E}_{p_{train}}[c(X)] \leq 1} \sum_{i=1}^n \mathbb{E}_{p_{test}}[\pi_i(X_1, ..., X_i) c(X_i)]
\] \[
=
nk  k \sum_{i=1}^n \sum_{x \in \mathcal{X}} p_{test}(X_i = x)\mathbb{E}_{p_{test}}[\pi_i(X_1, ..., X_{i1}, x)] + \max_{c : \mathbb{E}_{p_{train}}[c(X)] \leq 1} \sum_{i=1}^n \sum_{x \in \mathcal{X}} p_{test}(X_i = x) \mathbb{E}_{p_{test}}[\pi_i(X_1, ..., X_{i1}, x)] c(x)
\] \[
=
nk  k \sum_{x \in \mathcal{X}} \sum_{i=1}^n p_{test}(X_i = x) \mathbb{E}_{p_{test}}[\pi_i(X_1, ..., X_{i  1}, x)] + \max_{c : \mathbb{E}_{p_{train}}[c(X)] \leq 1} \sum_{x \in \mathcal{X}} c(x) \sum_{i=1}^n p_{test}(X_i = x) \mathbb{E}_{p_{test}}[\pi_i(X_1, ..., X_{i1}, x)]
\]
In this term, each use of \(\pi\) appears in an expression of the form \(\sum_{i=1}^n \mathbb{E}_{p_{test}}[\pi_i(X_1, ..., X_{i  1}, x)]\). This means that the only features of \(\pi\) relevant to expected cost are \(\sum_{i=1}^n p_{test}(X_i = x) \mathbb{E}_{p_{test}}[\pi_i(X_1, ..., X_{i  1}, x)]\) for each \(x \in \mathcal{X}\). Define \(\alpha(x) = \sum_{i=1}^n p_{test}(X_i = x)\). The sum must be between 0 and \(\alpha(x)\), and any value in this interval can be achieved using a \(\pi_i\) which ignores both \(i\) and its first \(i1\) inputs (i.e. \(\pi_i(x_1, ..., x_i) = \beta(x_i)\) for some \(\beta : \mathcal{X} \rightarrow [0, 1]\). \(\beta(x)\) is the probability of using the classification of a point \(x\), regardless of which iteration it is encountered on). Therefore, an optimal policy of this form exists, and we can rewrite the expected total cost as
\[
nk  k \sum_{x \in \mathcal{X}} \alpha(x) \beta(x) + \max_{c : \mathbb{E}_{p_{train}}[c(X)] \leq 1} \sum_{x \in \mathcal{X}} c(x) \alpha(x) \beta(x)
\]
As before, suppose we have a hard constraint that \[\forall c. \mathbb{E}_{p_{train}}[c(X)] \leq 1 \rightarrow \sum_{x \in \mathcal{X}} c(x) \alpha(x) \beta(x) \leq d\] and we want to maximize \(\sum_{x \in \mathcal{X}} \alpha(x) \beta(x)\). Just as the original constraint \[\forall c. \mathbb{E}_{p_{train}}[c(X)] \leq 1 \rightarrow \mathbb{E}_{p_{test}}[\pi(X)c(X)] \leq d\] could be rewritten as \[\forall x \in \mathcal{X}. p_{test}(x)\pi(x) \leq d p_{train}(x)\] this constraint can be rewritten as \[\forall x \in \mathcal{X}. \alpha(x) \beta(x) \leq d p_{train}(x)\] To maximize \(\sum_{x \in \mathcal{X}} \alpha(x) \beta(x)\) subject to this constraint, just maximize each term \(\alpha(x) \beta(x)\) subject to \(\alpha(x) \beta(x) \leq d p_{train(x)}\) for each \(x\). This is achieved by setting \(\beta(x) = \min\{1, d \frac{p_{train}(x)}{\alpha(x)}\}\).
Now this policy achieves maximum expected cost \[
nk  k \sum_{x \in \mathcal{X}} \alpha(x) \beta(x) + d
\] \[
=
nk  k \sum_{x \in \mathcal{X}} \alpha(x) \min\{1, d \frac{p_{train}(x)}{\alpha(x)}\} + d
\] \[
=
nk  k \sum_{x \in \mathcal{X}} \min\{\alpha(x), d p_{train}(x)\} + d
\]
So to minimize the original expected cost, it is sufficient to choose \(d\) to minimize this quantity, and then choose \(\pi_i(x_1, ..., x_i) = \min\{1, d \frac{p_{train}(x)}{\alpha(x)}\}\).
Extensions and questions for future research
 These bounds rely on the cost decomposing as a sum of costs for each iteration. How problematic is this assumption, and how might we drop it? This problem is also discussed in the post about quantilizers.
 Here, the uniform convergence guarantee is only about a single distribution, \(p_{train}\). In practice, the same classifier may have good performance on many different distributions. It should be possible to use these additional bounds to use the classifier on a wider range of test points.
 The analysis does not handle the case when the policy \(\pi_i\) affects the distribution of \(X_j\) for any \(j > i\).
 It is somewhat worrying that the system uses the classification of every point with some probability. This is justified under certain adversarial assumptions, but it indicates some flaw in this approach, and fixing it in a principled manner might yield a better system.
 This system requires an accurate Bayesian prediction module. It should be possible to adapt this system to work under pessimistic assumptions about predictions (e.g. choosing a minimax policy if we have many predictors and only one is guaranteed to be accurate).
 If the data points contain timestamps, then the test data will always be identifiably different from the training data (since they have higher timestamps). This is a problem, because then the system will almost always reject new points. There might be some additional constraints we could place on the classifier to prevent this from being a problem.
 If we predict that we will see some class of points in the test data, and we do not have enough training data to know how to classify these points, perhaps it would be good to ask the user about some points in this class (i.e. active learning). It might be interesting to look at active learning in this framework.
 Conservative classifiers seem like a good fit for situations where the user doesn’t know the answer to some queries (so the system should not make judgments about these queries either), but I haven’t mathematically analyzed this.
