In Magical Categories, Eliezer argues that concepts learned by induction do not necessarily generalize well to new environments. This is partially because of the complexity and fragility of value, and partially because the training data might not cover all the cases that will need to be covered. I will give a brief overview of statistical learning theory approaches to concept learning and then discuss relevance to AI safety research. All of the content about statistical learning theory (except active learning) is taken from the Stanford CS229T lecture notes, which I highly recommend reading if you are interested in this topic.
Statistical learning theory provides some guarantees about the performance of induced hypotheses in various settings. I will discuss the settings of uniform convergence, online learning, and active learning, and what can be achived in each setting. In all settings, we will have some set of hypotheses under consideration, \(\mathcal{H}\). We will interpret a hypothesis as a function from the input type \(X\) to the output type \(Y\). For example, in an image classification task, \(X\) could be a set of images and \(Y\) could be the unit interval (representing a probability that the image is in the class). The hypothesis set \(\mathcal{H}\) could be, for example, the set of logistic regression classifiers. If \(h\) is our hypothesis and \((x, y)\) is a single data point, then we will accrue loss \(l(h(x), y)\).
Now let us look at each setting specifically. In the uniform convergence setting, there is some distribution over \((x, y)\) pairs. We will observe some of these as training data and find the hypothesis in our class that minimizes loss on this dataset. Then, we will score the hypothesis against a test set of pairs taken from the same distribution. It turns out that if our hypothesis set is “small”, then the hypothesis we find will probably not do much worse on the training set than on the true distribution (i.e. it will not overfit), and therefore it will be close to optimal on the true distribution (since it was optimal on the training set). As a result, with high probability, our hypothesis will score close to optimally on the test set.
More specifically, “small” in this context means that \(\mathcal{H}\) has low Rademacher complexity. This is usually ensured by the set being finite (and having a moderate log size) or having a moderate number of parameters. Rademacher complexity is a generalization of VC dimension.
It is easy to see how Eliezer’s objection applies to this setting. If the test set comes from a distribution different from what the training set came from, then the statistical guarantees no longer hold. But, there is another statistical learning setting that can provide stronger guarantees: online learning.
In this setting, the algorithm will, on each time step, select a single hypothesis \(h \in \mathcal{H}\), see a single \(x\) value, output a value \(y = h(x)\), and then see the true \(y\). It will receive loss \(l(h(x), y)\). Note that nature may choose any process to determine the true \(y\) value, including an adversarial process!
It turns out that for some hypothesis classes, it is possible to get reasonable bounds on the total loss (e.g. we get no more than \(k \sqrt{T}\) errors in \(T\) timesteps). That means that over time, we will not get much more loss than we would get if we picked out the single best hypothesis and selected it on each iteration.
This gets us further, but it still has problems. Specifically, although we can bound the number of errors the online learning algorithm makes (compared to the optimal hypothesis), we don’t know when these errors will occur. It could be that one error happens in a pivotal decision. This is to be expected: if no examples similar to the pivotal decision have been seen, then we cannot expect online learning to give us the right answer.
To remedy this problem, we might want to specifically find \(x\) values where different hypotheses disagree, and get training data for these. This is called active learning. The algorithm proceeds as follows. We have the current hypothesis set, \(\mathcal{H}\). In the ideal case, we can find some \(x\) value that evenly splits \(\mathcal{H}\) in the sense that \(h(x) = 0\) for about half of hypotheses, and \(h(x) = 1\) for the other half. Then we ask the user for the true value of \(h(x)\) and thereby cut \(\mathcal{H}\) half. If we can somewhat evenly split \(\mathcal{H}\) on each iteration, we will need to ask only \(O(\log \mathcal{H})\) questions. This is not too different when users’ answers are noisy; we will still be able to learn the correct distribution over user answers without too many more questions.
There are problems when there is no \(x\) value that splits \(\mathcal{H}\) in half. This could occur when, for example, we have some base hypothesis \(h\) in our current set, but we also have hypotheses of the form
\[h_{x'}(x) = 1  h(x)\text{ if } x = x'\]
\[h_{x'}(x) = h(x) \text{ otherwise }\]
for many different \(x'\) values. Now if \(h\) is actually the correct hypothesis, then there is no \(x\) value that will eliminate more than one incorrect hypothesis. We could get a problem like this if we were using a hypothesis class consisting of programs as in Solomonoff induction.
This is a serious problem. There may be solutions to it (for example, perhaps we think \(h\) is the correct one because it has a low description length), but I don’t actually know if any of these work.
To summarize:
 uniform convergence relies on the assumption that the test set comes from the same (or a very similar) distribution as the training set
 online learning does not use this assumption, but can still make errors on pivotal decisions
 active learning has the potential to actually learn the right concept without making any mistakes in the process, but has serious difficulties with finding questions that split the hypothesis space correctly. More work to solve this problem is warranted.
This entire framework assumes that the concept can be represented as a function from some data \(x\) to a boolean. But what data \(x\) should be used? For learning moral concepts, we may use textual descriptions, such as descriptions of moral decisions. But even if we could make accurate moral judgments for textual descriptions, it is not clear how to use this to create an autonomous moral AI. It would have to convert its understanding of the world to a textual description before judging it, but its internal model might be completely incomprehensible to humans, and it is not clear how to produce a textual summary of it. Therefore, there are still ontology identification issues with using a framework like this, although it can be used to learn concepts within an existing ontology.
