What do we need value learning for?
post by Jessica Taylor 937 days ago | Kaya Stechly, Abram Demski and Nate Soares like this

I will be writing a sequence of posts about value learning. The purpose of these posts is to create more explicit models of some value learning ideas, such as those discussed in The Value Learning Problem. Although these explicit models are unlikely to capture the complexity of real value learning systems, it is at least helpful to have some explicit model of value learning in mind when thinking about problems such as corrigibility.

This came up because I was discussing value learning with some people at MIRI and FHI. There were disagreements about some aspects of the problem, such as whether a value-learning AI could automatically learn how to be corrigible. I realized that my thinking about value learning was somewhat confused. Making concrete models will make my thinking clearer and also create more common models that people can discuss.

A value learning model is an algorithm that observes human behaviors and determines what values humans have. Roughly, the model consists of:

1. a type of values, $$\mathcal{V}$$
2. a prior over values, $$P(V)$$
3. a conditional distribution over human behavior given the human's values and observations, $$P(A | V, O)$$

Of course, this is very simplified: in real life the model must account for beliefs, memory, etc. Such a model can be used for multiple purposes. Each of these purposes requires different things from the model. It is important to look at these applications when constructing these models, so it is clear what target we are shooting for.
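To make the three components concrete, here is a minimal sketch in Python, assuming a finite set of candidate values and a toy likelihood. The names `posterior_over_values` and `drink_likelihood`, and the tea/coffee example, are hypothetical illustrations and not part of any existing system. The sketch updates the prior $$P(V)$$ on observed (observation, action) pairs using $$P(A | V, O)$$.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

# Hypothetical type aliases for this sketch (not from the post).
Value = Hashable
Observation = Hashable
Action = Hashable
Likelihood = Callable[[Action, Value, Observation], float]


def posterior_over_values(
    prior: Dict[Value, float],
    likelihood: Likelihood,
    data: Iterable[Tuple[Observation, Action]],
) -> Dict[Value, float]:
    """Return P(V | data), proportional to P(V) * prod_i P(a_i | V, o_i)."""
    weights = dict(prior)
    for observation, action in data:
        for value in weights:
            weights[value] *= likelihood(action, value, observation)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("every candidate value assigns zero probability to the data")
    return {value: w / total for value, w in weights.items()}


def drink_likelihood(action: Action, value: Value, observation: Observation) -> float:
    """Toy P(A | V, O): the human picks the drink they value 90% of the time."""
    preferred = "tea" if value == "likes_tea" else "coffee"
    return 0.9 if action == preferred else 0.1


prior = {"likes_tea": 0.5, "likes_coffee": 0.5}
observed = [("kitchen", "tea"), ("kitchen", "tea")]
posterior = posterior_over_values(prior, drink_likelihood, observed)
print(posterior)
```

After two observations of the human choosing tea, the posterior puts about 0.988 on `likes_tea`. The toy likelihood ignores the observation entirely, which is one of the ways this sketch is simpler than anything realistic.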

# Creating systems that predict human behavior

Some proposals for safe AI systems require predicting human behavior. These include both approval-directed agents and mimicry-based systems. Additionally, quantilizers benefit from having a distribution over actions that assigns reasonable probability to good actions, such as an approximation of the distribution of actions a human might take. These systems become more useful the more accurate the predictions of human behavior are. To the extent that knowing about human values helps a system predict human behavior, value learning models should make these systems more accurate. Value learning models are also likely to be easier for humans to understand than models for predicting human behavior that are created by “black-box” supervised learning methods, such as neural networks.
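To illustrate how a quantilizer would consume such a distribution, here is a minimal sketch assuming a finite action set. It takes a base distribution over actions (for example, an approximation of what a human might do) and a utility estimate, and returns the base distribution restricted to the top $$q$$ fraction of its probability mass as ranked by utility. The function name `quantilize` and the example numbers are hypothetical.

```python
from typing import Callable, Dict, Hashable

Action = Hashable


def quantilize(
    base: Dict[Action, float],
    utility: Callable[[Action], float],
    q: float,
) -> Dict[Action, float]:
    """Distribution of a q-quantilizer over a finite action set.

    Keeps the top q of the base probability mass (ranked by utility)
    and renormalizes. Ties in utility are broken by dictionary order.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base, key=utility, reverse=True)
    remaining = q
    result: Dict[Action, float] = {}
    for action in ranked:
        if remaining <= 0:
            break
        # An action straddling the q boundary contributes only part of its mass.
        taken = min(base[action], remaining)
        if taken > 0:
            result[action] = taken / q
        remaining -= taken
    return result


# Hypothetical base distribution and utility estimates.
base = {"a": 0.5, "b": 0.3, "c": 0.2}
scores = {"a": 1.0, "b": 3.0, "c": 2.0}
print(quantilize(base, scores.get, q=0.4))
```

In this example the result places weight only on `b` and `c`, the two highest-scoring actions. The better the base distribution reflects what a human would plausibly do, the more the resulting actions stay within the range of human-like behavior.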

It is notable that the behavior model here is only used for its distribution of actions $$P(A | O)$$, not its internal representation of values. Since it is possible to produce training data for human behavior, it is possible to use supervised learning systems to create these models (though note that supervised learning systems may run into some additional problems as they become superintelligent, such as simulation warfare). Models used for this application may use any internal representation $$\mathcal{V}$$ so long as this internal representation helps to predict behavior accurately. Previous work in this area includes apprenticeship learning.
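For reference, if we assume $$\mathcal{V}$$ is discrete and that the prior over values does not depend on the observation, the action distribution used by this application is obtained from the model by summing out the values:

$$P(A | O) = \sum_{V \in \mathcal{V}} P(V) \, P(A | V, O)$$

Any two models that agree on this sum are interchangeable for prediction, even if their internal $$\mathcal{V}$$ and $$P(V)$$ differ.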

# Creating goal-directed value-learning agents

If we want to create a goal-directed agent that pursues a goal compatible with human values, it will be useful for the system to learn what human values are. Here, predicting human behavior is not enough. The internal representation of values, $$\mathcal{V}$$, is quite important here: after learning $$V$$, the system must know whether its plans do well according to $$V$$. Learning $$V$$ appears to be a more difficult induction problem than learning $$A$$, since we can’t directly provide training data (we’d need to know our actual values to do that).
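As a rough illustration of why the representation matters here, the sketch below scores candidate plans against a learned distribution over values. It assumes a finite set of candidate values, a hypothetical function `score(plan, value)` that says how well a plan does according to a given value, and that the agent simply maximizes the expected score. That last choice is only one possible decision rule, used here for illustration. A pure behavior predictor exposes only $$P(A | O)$$ and offers nothing to pass as `score`.

```python
from typing import Callable, Dict, Hashable, Iterable

Plan = Hashable
Value = Hashable


def expected_score(
    plan: Plan,
    posterior: Dict[Value, float],
    score: Callable[[Plan, Value], float],
) -> float:
    """Average score of a plan over the learned distribution on values."""
    return sum(p * score(plan, value) for value, p in posterior.items())


def best_plan(
    plans: Iterable[Plan],
    posterior: Dict[Value, float],
    score: Callable[[Plan, Value], float],
) -> Plan:
    """Pick the plan with the highest expected score (one possible decision rule)."""
    return max(plans, key=lambda plan: expected_score(plan, posterior, score))


def drink_score(plan: Plan, value: Value) -> float:
    """Toy scoring rule: a plan scores 1 if it serves the valued drink, else 0."""
    wanted = "serve_tea" if value == "likes_tea" else "serve_coffee"
    return 1.0 if plan == wanted else 0.0


# Hypothetical learned distribution over values.
posterior = {"likes_tea": 0.9, "likes_coffee": 0.1}
choice = best_plan(["serve_tea", "serve_coffee"], posterior, drink_score)
print(choice)
```

Here the agent selects `serve_tea`, since that plan has the higher expected score under the assumed distribution over values.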

Obviously, a value-learning sovereign agent falls in this category. Additionally, an agent that attempts to accomplish a goal conservatively (in other words, without stepping on anything humans care about) will benefit from having a rough idea of what humans care about. See the Arbital article on corrigibility and Stuart Armstrong’s work on reduced-impact AI for some discussion of conservative agents. Regardless of which value system for agents we are discussing, we must decide whether $$V$$ represents the human’s instrumental or terminal values.

Existing models that actually attempt to learn $$V$$ (rather than just learning $$A$$) include inverse reinforcement learning and inverse planning. Neither of these systems has the AI learn the world model by induction. We will find that, when the AI does learn the world model by induction, the problem becomes more difficult, and some solutions to the problem require ontology identification.

I will focus on this application for the rest of the posts in the series.

# Helping humans understand human values

We could use a value learning model to learn about the structure of human values. Perhaps, in the course of defining a value learning model, we learn something important about human values. For example, if we try to formally specify human values, we will find that our formal specification will have to allow humans to have preferences about alternative physics; otherwise, humans would become indifferent between all universe states upon discovering that our current understanding of physics is incorrect. This tells us something important about how our values work: that they are based on some multi-level representation of the world.

It is also possible that if we create a value learning model and run it on actual behavior, we will learn something useful about human values. For example, if the system learns the wrong values, this could indicate that the model’s hypothesis class for what the values could be does not contain our actual values. These insights are plausibly useful for understanding how to create value-aligned AIs.
