Summary: it is sometimes desirable to create systems that avoid edge cases. This post presents a partial solution to this problem, which works by comparing the system’s behavior to a “natural” distribution.
Introduction
Learning a concept from both positive and negative examples is a well-studied problem in machine learning. However, if we learn a concept from training data (say, using human judgments as the gold standard), then the learned concept might not generalize well; it could differ from human judgments on any number of edge cases. There is a sense in which these edge cases are “outside the domain” of the training data; that is, the training data seems to naturally define some domain where we can make judgments confidently, with edge cases lying outside this domain.
Therefore, we will set aside 2-class classification and instead define a form of “classification” that uses only positive examples. We could use this “classifier” to positively identify regions that do not contain edge cases.
As a running example, consider a system that is supposed to create burritos using a nanofactory. It will have as training data some training burrito configuration files, which specify the exact arrangement of atoms in each training burrito. The task will be to create burritos that are similar to the training burritos, without necessarily being identical to them.
Suppose this system has access to a hypercomputer. Then it could use a variant of Solomonoff induction to learn a generative model over burrito configuration files, \(b\). With some caveats, this may learn the distribution of training burritos.
We could instruct the system to sample a burrito configuration file from this distribution, and then print it out using the nanofactory. If the induced distribution reflects the distribution of training burritos, then this should not be significantly different from creating a new training burrito in the usual fashion.
What if the nanofactory can’t print all possible burrito configurations (or it can, but we only want to print high-quality burritos)? Suppose we have some feasible set \(f\) of printable burrito configurations. We may be tempted to sample from the distribution \(b|f\), which assigns probabilities proportional to \(b\) for burrito configurations in \(f\) and probability 0 to other burrito configurations. However, this may be dangerous! Solomonoff induction will never assign probability 0 to any burrito configuration, so \(b|f\) is well defined no matter how improbable \(f\) is under \(b\). For example, if only burritos containing nano-UFAIs are in the feasible set, then the nanofactory will be guaranteed to print such a burrito.
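To make the conditioning step concrete, here is a minimal Python sketch. It assumes a toy setting in which burrito configurations are just labelled strings and \(b\) is a small explicit table; the labels, the probabilities, and the helper `condition` are all hypothetical stand-ins for the (uncomputable) induced distribution.

```python
from fractions import Fraction


def condition(b, feasible):
    """Restrict the distribution b to `feasible` and renormalize."""
    mass = sum(p for x, p in b.items() if x in feasible)
    if mass == 0:
        raise ValueError("feasible set has zero probability under b")
    return {x: (p / mass if x in feasible else Fraction(0)) for x, p in b.items()}


# Hypothetical toy "natural" distribution over three labelled configurations.
b = {
    "plain": Fraction(999_999, 2_000_000),
    "spicy": Fraction(999_999, 2_000_000),
    "weird": Fraction(1, 1_000_000),
}

# Suppose the only printable configuration is the very unlikely one.
d = condition(b, {"weird"})
```

In this toy table the "weird" configuration has a one-in-a-million chance under \(b\), yet the conditioned distribution prints it with certainty, which is the failure mode described above.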
Bits of unnaturalness
Can we quantify how dangerous the \(b|f\) distribution is? Here is a simple measure that might work. Define \(f\)’s bits of unnaturalness to be \(-\log_2 P_b(B \in f)\), where \(B\) is a random burrito configuration sampled from \(b\). The logic is that if \(P_b(B \in f)\) is not too low, then we didn’t have a super-low chance of getting a feasible burrito through natural processes. A one-in-a-million burrito is unlikely to be dangerous (except through butterfly effects, discussed later).
We can generalize this to more burrito distributions \(d\) than just ones of the form \(b|f\). It is potentially dangerous for any given burrito configuration to be much more probable under \(d\) than \(b\), so we can define \(d\)’s bits of unnaturalness as \(\log_2 \max_{B \in \text{support}(d)} (d(B) / b(B))\). It is easy to see that this agrees with our original unnaturalness measure when \(d = b|f\): every \(B \in f\) then has \(d(B) / b(B) = 1 / P_b(B \in f)\).
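The general measure can be checked numerically on a small example. This is only a sketch under assumptions: distributions are explicit Python dicts over hypothetical labels, logarithms are taken base 2, and the function name `bits_of_unnaturalness` is introduced here for illustration.

```python
import math


def bits_of_unnaturalness(d, b):
    """log2 of the largest ratio d(x) / b(x) over the support of d."""
    ratios = []
    for x, p in d.items():
        if p == 0:
            continue  # x is outside the support of d
        if b.get(x, 0) == 0:
            return math.inf  # d puts mass where b puts none
        ratios.append(p / b[x])
    return math.log2(max(ratios))


# Hypothetical toy natural distribution (powers of two keep the floats exact).
b = {"plain": 0.5, "spicy": 0.25, "mild": 0.125, "weird": 0.125}

# Condition b on a feasible set to get a distribution of the conditioned form.
feasible = {"mild", "weird"}
mass = sum(b[x] for x in feasible)
d = {x: b[x] / mass for x in feasible}

general_bits = bits_of_unnaturalness(d, b)
feasible_set_bits = -math.log2(mass)
```

Here both quantities come out to 2 bits, since the feasible set has probability 1/4 under the toy \(b\).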
How dangerous is it to sample from a distribution with \(k\) bits of unnaturalness? Intuitively, if \(k\) is low, we’re sampling from a distribution that “fits under” the natural distribution, and so the result is unlikely to be unsafe. More formally, suppose we have a cost function \(c\) defined on burrito configurations, which returns a non-negative number. If we have \(k\) bits of unnaturalness, then \(d(B) \leq 2^k b(B)\) for every \(B\), so \[\mathbb{E}_d[c(B)] = \sum_B d(B)c(B) \leq \sum_B 2^k b(B)c(B) = 2^k \mathbb{E}_b[c(B)]\]
which means that if \(k\) is low, then sampling from \(d\) is not much more costly than sampling from \(b\). This is a nice guarantee to have.
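As a sanity check on this bound, here is a small Python sketch. The distributions and the cost table are made-up toy values (assumptions for illustration only), chosen as powers of two so the floating-point arithmetic is exact.

```python
import math

# Hypothetical natural distribution b and a less natural distribution d.
b = {"plain": 0.5, "spicy": 0.25, "mild": 0.125, "weird": 0.125}
d = {"mild": 0.5, "weird": 0.5}

# Hypothetical non-negative cost of each configuration.
cost = {"plain": 2.0, "spicy": 0.0, "mild": 1.0, "weird": 8.0}


def expected_cost(dist, cost):
    """Expected cost of a configuration sampled from dist."""
    return sum(p * cost[x] for x, p in dist.items())


# Bits of unnaturalness of d relative to b (base-2 log of the worst ratio).
k = math.log2(max(d[x] / b[x] for x in d))

cost_d = expected_cost(d, cost)
cost_b = expected_cost(b, cost)
bound = 2 ** k * cost_b
```

With these toy numbers \(k = 2\), the expected cost under \(d\) is 4.5, and the bound \(2^k \mathbb{E}_b[c(B)]\) is 8.5, so the inequality holds with room to spare.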
Problems
In practice, many unnatural burrito distributions are likely to be harmless: every new method of making burritos will produce burritos that were unlikely to be produced by previous methods. The system could very well end up paralyzed due to having to make burritos that look almost exactly like natural ones. There isn’t a clear solution to this problem: we’d need some way of deciding which deviations from the natural distribution are safe and which aren’t.
Due to butterfly effects, creating a burrito naturally potentially has a very high expected cost. It’s not implausible that creating a single burrito could negatively alter the course of history. In practice, there’s also an equally high expected benefit (due to positive butterfly effects). But it only takes one bit of unnaturalness to wipe out the half of burritos that have positive butterfly effects. We might be able to solve this problem if we had some way of placing limits on the extent to which bounded computations (such as sampling from \(d\)) can take advantage of butterfly effects. Additionally, false thermodynamic miracles could be used to interfere with butterfly effects.
There is a bigger issue when we sample multiple burritos from \(d\). A large number of burritos from \(d\) could give us enough information to mostly determine \(d\). This could be dangerous, because \(d\) could contain dangerous messages. This is part of the general pattern that, even if the distribution for a single burrito is only slightly unnatural, the distribution for a sequence of burritos sampled from the distribution may be very unnatural.
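As a worked version of this point, assume (as a modelling assumption, not something established above) that \(n\) burritos \(B_1, \ldots, B_n\) are drawn independently from \(d\), and that the natural comparison is \(n\) independent draws from \(b\). If \(d\) has \(k\) bits of unnaturalness relative to \(b\), then the worst-case likelihood ratio for the whole sequence is \[\max_{B_1, \ldots, B_n} \prod_{i=1}^n \frac{d(B_i)}{b(B_i)} = \left(\max_{B} \frac{d(B)}{b(B)}\right)^n = 2^{nk}\] where each maximum ranges over the support of \(d\). Under these assumptions the sequence has \(nk\) bits of unnaturalness, so any cost bound on it weakens from a factor of \(2^k\) to a factor of \(2^{nk}\).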
It might be possible to solve this problem by limiting how unnatural the process used to produce \(d\) itself is. We could define a distribution over burrito distributions, and then define a naturalness measure for other burrito distribution distributions relative to this one. If \(d\) is sampled from a not-too-unnatural burrito distribution distribution, then the sequence of burritos obtained by sampling \(d\) and then sampling burritos from \(d\) would not be too unnatural relative to what you would expect if you sampled a burrito distribution from the original burrito distribution distribution and then sampled burritos from that distribution.
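Here is a sketch of why this might work, using notation introduced only for this sketch. Write \(Q\) for the original burrito distribution distribution and \(R\) for the one that \(d\) is actually sampled from, assume both range over a countable set of burrito distributions, and suppose \(R(d) \leq 2^k Q(d)\) for every \(d\). Then for any sequence of burritos \(B_1, \ldots, B_n\), \[\sum_d R(d) \prod_{i=1}^n d(B_i) \leq 2^k \sum_d Q(d) \prod_{i=1}^n d(B_i)\] so, under these assumptions, the whole sequence has at most \(k\) bits of unnaturalness, no matter how large \(n\) is.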
There might be ways to use an unnaturalness measure like this pervasively throughout an AI system to get global security guarantees (Eliezer mentioned something similar in a discussion about satisficers), but I haven’t thought about this thoroughly.
Alternatives and variations
 We could search for a simple set of good burritos \(g\), and assume our training data consists of uniform samples from \(g\). This seems like an overly strong sampling assumption. Additionally, this will favor low-complexity burritos, since it is “cheaper” from a probabilistic perspective to include low-entropy regions of burrito space in \(g\) (since these regions contain fewer burritos).
 We could search for a simple set of good burritos \(g\), and assume our training data consists of random samples from the universal prior conditioned on being in \(g\). This solves the issue with low-complexity burritos, but this still uses an overly strong sampling assumption and is unlikely to produce reasonable results.
 Benja mentioned that we could expand the training set and, as a result, cause some \(d\) to increase in unnaturalness. However, since we gave more training burritos, we really wanted to express that more variation in burritos is allowed, so \(d\)’s unnaturalness should not increase. I’m not sure if this is necessarily a problem: \(d\) will not gain much unnaturalness unless we added lots of new training burritos. In any case, we could decide to increase the unnaturalness tolerance as more training burritos are added.
