Intelligent Agent Foundations Forum
Learning a concept using only positive examples
post by Jessica Taylor 1277 days ago | Benja Fallenstein, Patrick LaVictoire and Stuart Armstrong like this | 4 comments

Summary: it is sometimes desirable to create systems that avoid edge cases. This post presents a partial solution to this problem, which works by comparing the system’s behavior to a “natural” distribution.


Learning a concept from both positive and negative examples is a well-studied problem in machine learning. However, if we learn a concept from training data (say, using human judgments as the gold standard), then the learned concept might not generalize well; it could differ from human judgments on any number of edge cases. There is a sense in which these edge cases are “outside the domain” of the training data; that is, the training data seems to naturally define some domain where we can make judgments confidently, with edge cases lying outside this domain.

Therefore, we will set aside two-class classification and define a form of “classification” that uses only positive examples. We could use this “classifier” to positively identify regions that do not contain edge cases.

As a running example, consider a system that is supposed to create burritos using a nanofactory. It will have as training data some training burrito configuration files, which specify the exact arrangement of atoms in each training burrito. The task will be to create burritos that are similar to the training burritos, without necessarily being identical to them.

Suppose this system has access to a hypercomputer. Then it could use a variant of Solomonoff induction to learn a generative model over burrito configuration files, \(b\). With some caveats, this may learn the distribution of training burritos.

We could instruct the system to sample a burrito configuration file from this distribution, and then print it out using the nanofactory. If the induced distribution reflects the distribution of training burritos, then this should not be significantly different from creating a new training burrito in the usual fashion.

What if the nanofactory can’t print all possible burrito configurations (or it can, but we only want to print high-quality burritos)? Suppose we have some feasible set \(f\) of printable burrito configurations. We may be tempted to sample from the distribution \(b|f\), which assigns probabilities proportional to \(b\) for burrito configurations in \(f\) and probability 0 to other burrito configurations. However, this may be dangerous! For example, if only burritos containing nano-UFAIs are in the feasible set, then the nanofactory will be guaranteed to print such a burrito. This can happen because Solomonoff induction will never assign probability 0 to any burrito configuration, so \(b|f\) is well defined even when \(f\) contains only configurations that \(b\) considers astronomically unlikely.
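As a minimal sketch of the conditioning step, here is a toy version over a finite set of configurations. The three configuration names and their probabilities are hypothetical and chosen only for illustration; a real \(b\) would come from the induction procedure described above.

```python
from fractions import Fraction

def condition(b, f):
    """Return b|f: b restricted to the feasible set f and renormalized."""
    mass = sum(p for config, p in b.items() if config in f)
    if mass == 0:
        raise ValueError("feasible set has zero probability under b")
    return {config: p / mass for config, p in b.items() if config in f}

# Hypothetical toy "natural" distribution over three configurations.
b = {
    "bean": Fraction(6, 10),
    "chicken": Fraction(399_999, 1_000_000),
    "nano_ufai": Fraction(1, 1_000_000),
}

# If the only feasible configuration is the dangerous one,
# b|f puts all of its probability on it.
d = condition(b, {"nano_ufai"})
```

Conditioning never checks how much natural probability the feasible set had to begin with, which is exactly the quantity the next section tries to measure.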

Bits of unnaturalness

Can we quantify how dangerous the \(b|f\) distribution is? Here is a simple measure that might work. Define \(f\)’s bits of unnaturalness to be \(-\log_2 P_b(B \in f)\), where \(B\) is a random burrito configuration sampled from \(b\). The logic is that if \(P_b(B \in f)\) is not too low, then there was a reasonable chance of getting a feasible burrito through natural processes. A one-in-a-million burrito is unlikely to be dangerous (except through butterfly effects, discussed later).
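As a rough worked example, taking the logarithm base 2 since the measure is in bits: a feasible set that a natural burrito falls into one time in a million has \[-\log_2 10^{-6} = 6 \log_2 10 \approx 19.9\] bits of unnaturalness, while a feasible set containing half of the natural probability mass has exactly 1 bit.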

We can generalize this to more burrito distributions \(d\) than just ones of the form \(b|f\). It is potentially dangerous for any given burrito configuration to be much more probable under \(d\) than \(b\), so we can define \(d\)’s bits of unnaturalness as \(\log_2 \max_{B \in \text{support}(d)} (d(B) / b(B))\). It is easy to see that this agrees with our original unnaturalness measure when \(d = b|f\).
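One way to check the agreement, writing \(p = P_b(B \in f)\) as a shorthand introduced here: for every \(B \in f\) we have \((b|f)(B) = b(B)/p\), so the ratio is the same constant everywhere on the support of \(b|f\), and \[\log_2 \max_{B \in f} \frac{(b|f)(B)}{b(B)} = \log_2 \frac{1}{p} = -\log_2 P_b(B \in f).\]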

How dangerous is it to sample from a distribution with \(k\) bits of unnaturalness? Intuitively, if \(k\) is low, we’re sampling from a distribution that “fits under” the natural distribution, and so the result is unlikely to be unsafe. More formally, suppose we have a cost function \(c\) defined on burrito configurations, which returns a non-negative number. If \(d\) has \(k\) bits of unnaturalness, then \(d(B) \leq 2^k b(B)\) for every \(B\), and so \[\mathbb{E}_d[c(B)] = \sum_B d(B)c(B) \leq \sum_B 2^kb(B)c(B) = 2^k \mathbb{E}_b[c(B)]\]

which means that if \(k\) is low, then sampling from \(d\) is not much more costly than sampling from \(b\). This is a nice guarantee to have.
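The bound can be checked numerically on a toy example. The sketch below assumes a finite set of 50 hypothetical configurations, with randomly generated \(b\), \(d\) and non-negative costs; none of these numbers come from the post.

```python
import math
import random

def bits_of_unnaturalness(d, b):
    """log2 of the largest ratio d(B)/b(B) over the support of d."""
    return math.log2(max(p / b[config] for config, p in d.items() if p > 0))

def expected_cost(dist, cost):
    """Expected cost of a configuration sampled from dist."""
    return sum(p * cost[config] for config, p in dist.items())

random.seed(0)
configs = [f"burrito_{i}" for i in range(50)]

def random_distribution():
    # Strictly positive weights, so that b never assigns probability 0.
    weights = [random.random() + 1e-3 for _ in configs]
    total = sum(weights)
    return {c: w / total for c, w in zip(configs, weights)}

b = random_distribution()   # stands in for the natural distribution
d = random_distribution()   # stands in for the distribution we sample from
cost = {c: random.random() * 10 for c in configs}  # arbitrary non-negative costs

k = bits_of_unnaturalness(d, b)
lhs = expected_cost(d, cost)           # E_d[c(B)]
rhs = 2 ** k * expected_cost(b, cost)  # 2^k * E_b[c(B)]
```

In this sketch `lhs` should never exceed `rhs` (up to floating-point rounding), whatever non-negative cost function is plugged in; the bound is usually far from tight because it is driven by the single worst ratio.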


In practice, many unnatural burrito distributions are likely to be harmless: every new method of making burritos will produce burritos that were unlikely to be produced by previous methods. The system could very well end up paralyzed due to having to make burritos that look almost exactly like natural ones. There isn’t a clear solution to this problem: we’d need some way of deciding which deviations from the natural distribution are safe and which aren’t.

Due to butterfly effects, even creating a burrito naturally has a potentially very high expected cost. It’s not implausible that creating a single burrito could negatively alter the course of history. In practice, there’s also an equally high expected benefit (due to positive butterfly effects). But it only takes one bit of unnaturalness to wipe out the half of burritos that have positive butterfly effects. We might be able to solve this problem if we had some way of placing limits on the extent to which bounded computations (such as sampling from \(d\)) can take advantage of butterfly effects. Additionally, false thermodynamic miracles could be used to interfere with butterfly effects.

There is a bigger issue when we sample multiple burritos from \(d\). A large number of burritos from \(d\) could give us enough information to mostly determine \(d\). This could be dangerous, because \(d\) could contain dangerous messages. This is part of the general pattern that, even if the distribution for a single burrito is only slightly unnatural, the distribution for a sequence of burritos sampled from the distribution may be very unnatural.
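To make the growth concrete, assume the \(n\) burritos are sampled independently and compare against \(n\) independent samples from \(b\). The ratio for a sequence factorizes, so if \(2^k\) is the largest single-burrito ratio \(d(B)/b(B)\), then \[\max_{B_1, \dots, B_n} \frac{d(B_1) \cdots d(B_n)}{b(B_1) \cdots b(B_n)} = \left(\max_B \frac{d(B)}{b(B)}\right)^n = 2^{nk}.\] Under that assumption, a distribution with only 1 bit of unnaturalness per burrito gives a sequence of 100 burritos with 100 bits.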

It might be possible to solve this problem by limiting how unnatural the process used to produce \(d\) itself is. We could define a natural distribution over burrito distributions, and then measure the unnaturalness of any other distribution over burrito distributions relative to this one. If \(d\) is sampled from a not-too-unnatural distribution over burrito distributions, then the sequence of burritos obtained by sampling \(d\) and then sampling burritos from \(d\) would not be too unnatural relative to what you would expect if you sampled a burrito distribution from the natural one and then sampled burritos from it.
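A sketch of why this would work, using notation introduced here rather than taken from the post: let \(D_0\) be the natural distribution over burrito distributions and \(D\) the one actually used, and assume \(D(d) \leq 2^k D_0(d)\) for every \(d\) in a countable set of candidates. Then for any sequence of burritos, \[P_D(B_1, \dots, B_n) = \sum_d D(d) \prod_{i=1}^n d(B_i) \leq 2^k \sum_d D_0(d) \prod_{i=1}^n d(B_i) = 2^k P_{D_0}(B_1, \dots, B_n),\] so the whole sequence has at most \(k\) bits of unnaturalness, no matter how large \(n\) is.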

There might be ways to use an unnaturalness measure like this pervasively throughout an AI system to get global security guarantees (Eliezer mentioned something similar in a discussion about satisficers), but I haven’t thought about this thoroughly.

Alternatives and variations

  1. We could search for a simple set of good burritos \(g\), and assume our training data consists of uniform samples from \(g\). This seems like an overly strong sampling assumption. Additionally, this will favor low-complexity burritos, since it is “cheaper” from a probabilistic perspective to include low-entropy regions of burrito space in \(g\) (since these regions contain fewer burritos).
  2. We could search for a simple set of good burritos \(g\), and assume our training data consists of random samples from the universal prior conditioned on being in \(g\). This solves the issue with low-complexity burritos, but this still uses an overly strong sampling assumption and is unlikely to produce reasonable results.
  3. Benja mentioned that we could expand the training set and, as a result, have some \(d\) increase in unnaturalness. However, since we gave more training burritos, we really wanted to express that more variation in burritos is allowed, so \(d\)’s unnaturalness should not increase. I’m not sure if this is necessarily a problem: \(d\) will not gain much unnaturalness unless we added lots of new training burritos. In any case, we could decide to increase the unnaturalness tolerance as more training burritos are added.

by Patrick LaVictoire 1274 days ago | Jessica Taylor and Stuart Armstrong like this | link

By the way, here’s my account of the motivation for this problem:

Let’s say you start with an AI that is superhuman at engineering. You want to ask it to do a simple task (like make you burritos) without risking vast unforeseen consequences. So you let it passively scan a bunch of human-made burritos, and ask it to make you a burrito. There are a couple of interesting failure modes:

  1. The space of acceptable burritos, as a subset of configurations of atoms, is a really narrow and twisty target. If you take the set of configurations which are closer to burrito 1 in the training set than any other training burrito is, the vast majority of those configurations would be toxic to humans, and some of them contain self-replicating nanobots, etc. Of course, there are ways of representing concepts such that the essential aspects of acceptable burritos (like being made out of a specific set of organic molecules) are more likely to be found. This is the problem of identifying the correct measure b in the first place.

  2. Having the AI create nanotech is pretty risky, and for this task we’d prefer if it stuck to more boring engineering like agricultural and culinary robots. But “don’t make any nanotech” is not a natural command, since how do you specify “nanotech” without examples, and since there are plenty of creative nanotech-like things that wouldn’t even occur to us to rule out. So we want to either give it parameters for what it can do (this gives us the feasible set f, which is unlikely to exactly contain any of our examples), or somehow set things up so that the boring engineering tasks are the optimal way to satisfy the problem. (This is also why “exactly clone one of the example burritos” is not a great solution, since this obviously requires nanotech.)


by Stuart Armstrong 1274 days ago | link

I’m feeling this could be related to my ideas here:

I’ll think about it more…


by Stuart Armstrong 1265 days ago | link

The \(\log \max\) part makes it different from something like the Kullback–Leibler divergence, but that might be a good feature - a \(\log \max\) definition seems harder to hack.

If worrying about butterfly effects and similar, it might be useful to do something like this: let p be the probability distribution of future states given that a burrito is not made, and q the same distribution given that the burrito is made. If p and q are very different (as measured by KL divergence or the approach here) then that means that either a) the burrito is dangerous, or b) the AI can unravel butterfly effects. If p and q are very different for many different burritos it could make, then we have a butterfly effect problem. If they are only different for some burritos, then we have identified the high-impact burritos.
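One possible reading of the difference between the two measures, sketched on a hypothetical two-outcome example (the outcome names, the probabilities, and the choice of direction \(D(q \| p)\) are all assumptions made here for illustration): q can shift a small amount of probability onto an outcome that p considers almost impossible while keeping the KL divergence small, whereas the \(\log \max\) measure registers the full ratio.

```python
import math

def kl_bits(q, p):
    """KL divergence D(q || p) in bits."""
    return sum(qi * math.log2(qi / p[s]) for s, qi in q.items() if qi > 0)

def log_max_bits(q, p):
    """log2 of the largest ratio q(s)/p(s) over the support of q."""
    return math.log2(max(qi / p[s] for s, qi in q.items() if qi > 0))

# p: future states if no burrito is made; q: if the burrito is made.
# Both distributions are made up for this sketch.
p = {"ordinary_future": 1 - 1e-9, "catastrophe": 1e-9}
q = {"ordinary_future": 1 - 1e-3, "catastrophe": 1e-3}

kl = kl_bits(q, p)          # averages the log-ratio over q
worst = log_max_bits(q, p)  # looks only at the worst-case ratio
```

Here the catastrophe becomes about a million times more likely under q, which the worst-case measure reports as roughly 20 bits, while the KL divergence stays well under a tenth of a bit because the shifted mass is small.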





