What do we need value learning for?
post by Jessica Taylor 723 days ago | Kaya Stechly, Abram Demski and Nate Soares like this

I will be writing a sequence of posts about value learning. The purpose of these posts is to create more explicit models of some value learning ideas, such as those discussed in The Value Learning Problem. Although these explicit models are unlikely to capture the complexity of real value learning systems, it is at least helpful to have some explicit model of value learning in mind when thinking about problems such as corrigibility.

This came up because I was discussing value learning with some people at MIRI and FHI. There were disagreements about some aspects of the problem, such as whether a value-learning AI could automatically learn how to be corrigible. I realized that my thinking about value learning was somewhat confused. Making concrete models will make my thinking clearer and also create more common models that people can discuss.

A value learning model is an algorithm that observes human behaviors and determines what values humans have. Roughly, the model consists of:

1. a type of values, $$\mathcal{V}$$
2. a prior over values, $$P(V)$$
3. a conditional distribution of human behavior given the human's values and observations, $$P(A | V, O)$$

Of course, this is very simplified: in real life the model must account for beliefs, memory, etc. Such a model can be used for multiple purposes. Each of these purposes requires different things from the model. It is important to look at these applications when constructing these models, so it is clear what target we are shooting for.
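To make the three components concrete, here is a minimal sketch in Python for a toy discrete setting. The class name `ValueLearningModel`, the drink-preference example, and the 90% noisy-choice rule are all hypothetical illustrations rather than anything specified in this post. The sketch also assumes that the prior over values does not depend on the observation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable

Value = Hashable
Action = Hashable
Observation = Hashable


@dataclass(frozen=True)
class ValueLearningModel:
    """Toy discrete version of the three components listed above."""

    # Component 2: the prior P(V). Its keys play the role of the value type (component 1).
    prior: Dict[Value, float]
    # Component 3: likelihood(a, v, o) returns P(A = a | V = v, O = o).
    likelihood: Callable[[Action, Value, Observation], float]

    def posterior(self, action: Action, observation: Observation) -> Dict[Value, float]:
        """Bayes' rule: P(V | A, O) is proportional to P(V) * P(A | V, O).

        Assumes the prior over values is independent of the observation.
        """
        unnormalized = {
            value: prob * self.likelihood(action, value, observation)
            for value, prob in self.prior.items()
        }
        total = sum(unnormalized.values())
        if total == 0:
            raise ValueError("observed behavior has zero probability under every value")
        return {value: weight / total for value, weight in unnormalized.items()}


def toy_likelihood(action: Action, value: Value, observation: Observation) -> float:
    # Hypothetical noisy-choice rule: the human picks the preferred drink 90% of the time.
    preferred = "tea" if value == "likes_tea" else "coffee"
    return 0.9 if action == preferred else 0.1


model = ValueLearningModel(
    prior={"likes_tea": 0.5, "likes_coffee": 0.5},
    likelihood=toy_likelihood,
)
posterior = model.posterior(action="tea", observation="both_drinks_available")
print(posterior)
```

In this toy case, seeing the human choose tea once moves the probability of `likes_tea` from 0.5 to roughly 0.9. A realistic model would need a far richer value type and would have to account for beliefs and memory, as noted above.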

# Creating systems that predict human behavior

Some proposals for safe AI systems require predicting human behavior. These include both approval-directed agents and mimicry-based systems. Additionally, quantilizers benefit from having a distribution over actions that assigns reasonable probability to good actions, such as an approximation of the distribution of actions a human might take. These systems become more useful the more accurate the predictions of human behavior are. To the extent that knowing about human values helps a system predict human behavior, value learning models should make these systems more accurate. Value learning models are also likely to be easier for humans to understand than models for predicting human behavior that are created by “black-box” supervised learning methods, such as neural networks.
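As a rough illustration of why a quantilizer benefits from a good base distribution, here is a minimal sketch. It assumes the common description of a q-quantilizer: sample from the top q fraction (by probability mass) of a base distribution over actions, ranked by a utility estimate. The function name, the toy actions, and the scores are hypothetical.

```python
from typing import Callable, Dict, Hashable

Action = Hashable


def quantilized_distribution(
    base: Dict[Action, float],
    utility: Callable[[Action], float],
    q: float,
) -> Dict[Action, float]:
    """Return the base distribution restricted to its top-q mass by utility, renormalized."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base, key=utility, reverse=True)  # best actions first
    remaining = q
    result: Dict[Action, float] = {}
    for action in ranked:
        if remaining <= 1e-12:  # tolerance guards against floating-point residue
            break
        mass = min(base[action], remaining)  # split the action that straddles the cutoff
        if mass > 0.0:
            result[action] = mass / q
        remaining -= mass
    return result


# Hypothetical base distribution, e.g. an approximation of what a human might do.
human_like = {"plan_a": 0.5, "plan_b": 0.3, "plan_c": 0.2}
scores = {"plan_a": 1.0, "plan_b": 3.0, "plan_c": 2.0}
top_half = quantilized_distribution(human_like, scores.get, q=0.5)
print(top_half)
```

The output can only contain actions that the base distribution already supports, which is why the accuracy of the human behavior model matters for this kind of system.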

It is notable that the behavior model here is only used for its distribution of actions $$P(A | O)$$, not its internal representation of values. Since it is possible to produce training data for human behavior, it is possible to use supervised learning systems to create these models (though note that supervised learning systems may run into some additional problems as they become superintelligent, such as simulation warfare). Models used for this application may use any internal representation $$\mathcal{V}$$ so long as this internal representation helps to predict behavior accurately. Previous work in this area includes apprenticeship learning.
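In terms of the simplified model, and under the added assumption that the prior over values does not depend on the observation, the action distribution is obtained by summing out the values (with an integral in place of the sum if $$\mathcal{V}$$ is continuous):

$$P(A | O) = \sum_{V \in \mathcal{V}} P(V) \, P(A | V, O)$$

Two models with very different $$\mathcal{V}$$ can give the same value for this sum, which is why this application places no constraint on the internal representation beyond predictive accuracy.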

# Creating goal-directed value-learning agents

If we want to create a goal-directed agent that pursues a goal compatible with human values, it will be useful for the system to learn what human values are. Here, predicting human behavior is not enough. The internal representation of values, $$\mathcal{V}$$, is quite important here: after learning $$V$$, the system must know whether its plans do well according to $$V$$. Learning $$V$$ appears to be a more difficult induction problem than learning $$A$$, since we can’t directly provide training data (we’d need to know our actual values to do that).
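The difference from pure behavior prediction can be sketched as follows: once the system has a posterior over values, each candidate value must be able to score plans. This is a minimal sketch, assuming that values are represented as utility functions over plans and that the agent ranks plans by expected utility under its posterior. Both are illustrative assumptions, and all names and numbers are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable

Value = Hashable
Plan = Hashable


def expected_score(
    plan: Plan,
    posterior: Dict[Value, float],
    utilities: Dict[Value, Callable[[Plan], float]],
) -> float:
    """Average how well the plan does according to each candidate value, weighted by P(V | data)."""
    return sum(prob * utilities[value](plan) for value, prob in posterior.items())


def best_plan(
    plans: Iterable[Plan],
    posterior: Dict[Value, float],
    utilities: Dict[Value, Callable[[Plan], float]],
) -> Plan:
    """Pick the plan with the highest expected score under the learned posterior."""
    return max(plans, key=lambda plan: expected_score(plan, posterior, utilities))


# Hypothetical posterior over two candidate values, e.g. produced by a value learning model.
posterior = {"likes_tea": 0.9, "likes_coffee": 0.1}
utilities = {
    "likes_tea": lambda plan: 1.0 if plan == "serve_tea" else 0.0,
    "likes_coffee": lambda plan: 1.0 if plan == "serve_coffee" else 0.0,
}
chosen = best_plan(["serve_tea", "serve_coffee"], posterior, utilities)
print(chosen)
```

A model that only outputs a distribution over human actions has no counterpart to `utilities`, so it cannot score the agent's own plans in this way.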

Obviously, a value-learning sovereign agent falls in this category. Additionally, an agent that attempts to accomplish a goal conservatively (in other words, without stepping on anything humans care about) will benefit from having a rough idea of what humans care about. See the Arbital article on corrigibility and Stuart Armstrong’s work on reduced-impact AI for some discussion of conservative agents. Regardless of which value system for agents we are discussing, we must decide whether $$V$$ represents the human’s instrumental or terminal values.

Existing models that actually attempt to learn $$V$$ (rather than just learning $$A$$) include inverse reinforcement learning and inverse planning. Neither of these systems has the AI learn the world model by induction. We will find that, when the AI does learn the world model by induction, the problem becomes more difficult, and some solutions to the problem require ontology identification.

I will focus on this application for the rest of the posts in the series.

# Helping humans understand human values

We could use a value learning model to learn about the structure of human values. Perhaps, in the course of defining a value learning model, we learn something important about human values. For example, if we try to formally specify human values, we will find that our formal specification will have to allow humans to have preferences about alternative physics; otherwise, humans would become indifferent between all universe states upon discovering that our current understanding of physics is incorrect. This tells us something important about how our values work: that they are based on some multi-level representation of the world.

It is also possible that if we create a value learning model and run it on actual behavior, we will learn something useful about human values. For example, if the system learns the wrong values, this could indicate that the model’s hypothesis class for what the values could be does not contain our actual values. These insights are plausibly useful for understanding how to create value-aligned AIs.
