Desiderata for Normalizers
discussion post by Kaya Stechly 722 days ago | Jessica Taylor likes this | 1 comment

This work originated at MIRI Summer Fellows and originally involved Pasha Kamyshev, Dan Keys, Johnathan Lee, Anna Salamon, Girish Sastry, and Zachary Vance. I was asked to look over two drafts and some notes, clean them up, and post here. Especial thanks to Zak and Pasha for drafts on which this was based.

We discuss the issues with expected utility maximizers, posit the possibility of normalizers, and list some desiderata for normalizers.

This post, which explains background and desiderata, is a companion post to Three Alternatives to Utility Maximizers. The other post surveys some other “-izers” that came out of the MSFP session and gives a sketch of the math behind each while showing how they fulfill the desiderata.

## Background

The naive implementation of an expected utility maximizer involves looking at every possible action - of which there is generally an intractable number - and, at each, evaluating a black box utility function. Even if we could somehow implement such an agent (say, through access to a halting oracle), it would most likely tend towards extreme solutions. Given a function that we would interpret as “maximize paperclips,” such an agent would, if possible, convert its entire future light cone into the cheapest object that satisfies whatever its computational definition of paperclip is.
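The brute-force procedure described above can be sketched in a few lines. This is only a toy illustration, not an implementation proposed by the original group: the names `naive_eu_maximizer`, `toy_outcome_dist`, and `toy_utility`, as well as the three actions and their paperclip counts, are all made up for the example.

```python
from typing import Callable, Dict, Hashable, Iterable

Action = Hashable
Outcome = Hashable


def naive_eu_maximizer(
    actions: Iterable[Action],
    outcome_dist: Callable[[Action], Dict[Outcome, float]],
    utility: Callable[[Outcome], float],
) -> Action:
    """Enumerate every action and return the one with the highest expected utility."""

    def expected_utility(action: Action) -> float:
        # The utility function is treated as a black box and called once per outcome.
        return sum(p * utility(o) for o, p in outcome_dist(action).items())

    return max(actions, key=expected_utility)


# Hypothetical toy world: each action deterministically yields a paperclip count.
toy_actions = ["do_nothing", "run_factory", "convert_everything"]
toy_clips = {"do_nothing": 0, "run_factory": 1_000, "convert_everything": 10**30}


def toy_outcome_dist(action):
    return {toy_clips[action]: 1.0}


def toy_utility(paperclips):
    # "More paperclips is better", with no notion of what a human would call extreme.
    return float(paperclips)


best = naive_eu_maximizer(toy_actions, toy_outcome_dist, toy_utility)
```

Even in this toy setting the maximizer picks the most extreme option available, because nothing in the utility function penalizes extremity. The enumeration over `actions` is also the part that becomes intractable in any realistic setting.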

This makes errors in goal specification extremely costly.

Given a utility function which is naively acceptable, the agent will do something which by our standards is completely insane. Even in the paperclip example, the “paperclips” that the agent produces are unlikely to be labeled as paperclips by a human.

If a human wanted to maximize paperclips, they would not, in general, attempt to convert their entire future light cone into paperclips. They might fail to manufacture very many paperclips, but their actions would seem much more “normal” to us than those of the true expected utility maximizer above.

## Normalizers and Desiderata

We consider a normalizer to be an agent whose actions, given a flawed utility function, are still considered sane or normal by a human observer. Specifically, we may wish that it never comes up with a solution that a human would find extreme even after the normalizer has explained it.

This definition is extremely vague, so we propose some desiderata for normalizers, organized in rough order of how confident we are that each is necessary. Note that these desiderata may not be simultaneously satisfiable; furthermore, we do not think that they are sufficient for an agent to be a normalizer.

### Desideratum 1: Sanity at High Power

A normalizer should take sane actions regardless of how much computational power it has. Given more power, it should do better. It should not transition from being safe at, say, a human level of power, to being unsafe at superhuman levels.

This desideratum can only be satisfied by changing the algorithm the agent is running or by succeeding at AI boxing.

Positive Example: An agent is programmed to maximize train efficiency using a suggester-verifier architecture; however, the verifier is programmed to only accept the default train timetable.[1]

Negative Example: At around human level, an agent told to maximize paperclips gets a high paying job and spends all its money on paperclips. Once it reaches superhuman level, it turns all the matter in its light cone into molecular paperclips.

However, just because an agent avoids insane or extreme courses of action does not mean that it actually gains any value.

### Desideratum 2: High Value

The normalizer should end up winning (with respect to its utility function). Even though it may fail to fully maximize utility, its action should not leave huge amounts of utility unrealized.[2]

As part of this desideratum, we would also like the normalizer to be winning at low power levels. This filters out uncomputable solutions that cannot be approximated computably.

Note that a paperclip maximizer satisfies this desideratum if we remove the computability considerations.

Positive Example: A normalizer spends many years figuring out how to build the correct utility maximizer and then does so.

Negative Example 1: A human tries to optimize the medical process, and makes significant progress, but everyone still dies.

Negative Example 2: A meliorizer[3] attempts to save Earth from a supervolcano. Starting with the default action “do nothing”, it switches to the first policy it finds, which happens to be “save everyone in NYC and call it a day”, but fails to find the strictly better strategy “save everyone on Earth”.[4]
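The meliorizer behaviour in this example (keep a default policy, switch to the first better policy the search turns up) can be sketched as follows. This is a minimal illustration under assumptions: the function name `meliorize`, the candidate order, and the utility numbers are hypothetical and are not taken from the Tiling Agents draft.

```python
from typing import Callable, Iterable, TypeVar

Policy = TypeVar("Policy")


def meliorize(
    default: Policy,
    candidates: Iterable[Policy],
    utility: Callable[[Policy], float],
) -> Policy:
    """Return the first candidate that beats the default, else the default itself."""
    baseline = utility(default)
    for policy in candidates:
        if utility(policy) > baseline:
            # The search stops at the first improvement; later, better
            # candidates are never examined.
            return policy
    return default


# Made-up utilities, chosen only to preserve the ordering in the example.
toy_utility = {
    "do nothing": 0.0,
    "save everyone in NYC": 1.0,
    "save everyone on Earth": 100.0,
}

chosen = meliorize(
    "do nothing",
    ["save everyone in NYC", "save everyone on Earth"],
    lambda p: toy_utility[p],
)
```

With this search order the agent settles on the NYC policy and leaves most of the available utility unrealized, which is why the example counts against Desideratum 2.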

Even given that the agent is equally sane at different power levels and wins, we still would like humans to know whether it is sane, especially if we care about corrigibility.

### Desideratum 3: Transparency

We should be able to understand why a normalizer is taking a given action (or at least trust that the action is sane), especially once the normalizer has explained it to us.

#### Desideratum 3a: Transparency under self-modification

We might also like this transparency to tile when the agent self-modifies.

Satisfying the above desiderata takes us much of the way to a normalizer, but we would also like our machine to be able to correct its own errors, that is, stay sane.

### Desideratum 4: Noticing Confusion

An agent should be able to notice when it is doing something that we might consider insane and take measures to prevent this.

It might, for example, have heuristics as to how sensible actions ought to work and detect if it begins seriously contemplating actions that violate these heuristics.

Positive Example: An agent has several sub-modules. If they disagree on a prediction by orders of magnitude, it doesn’t use that prediction in other calculations.
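One way to picture the sub-module check in the positive example is the sketch below. It assumes, purely for illustration, that each sub-module outputs a strictly positive numeric estimate and that "disagree by orders of magnitude" means the estimates span more than `max_orders` powers of ten. The function name, the threshold, and the choice of a geometric mean for combining estimates are all hypothetical.

```python
import math
from typing import Optional, Sequence


def consensus_prediction(
    predictions: Sequence[float], max_orders: float = 1.0
) -> Optional[float]:
    """Combine sub-module estimates, or return None if they disagree too much.

    Disagreement is measured as the base-10 log ratio between the largest
    and smallest estimate.
    """
    if not predictions or any(p <= 0 for p in predictions):
        raise ValueError("expects at least one strictly positive prediction")

    spread = math.log10(max(predictions)) - math.log10(min(predictions))
    if spread > max_orders:
        # The agent "notices confusion": the prediction is withheld from
        # downstream calculations rather than averaged into something misleading.
        return None

    # Geometric mean of the estimates that roughly agree.
    log_mean = sum(math.log10(p) for p in predictions) / len(predictions)
    return 10 ** log_mean


agreeing = consensus_prediction([2.0, 8.0])
disagreeing = consensus_prediction([1.0, 1e6])
```

Returning `None` forces the caller to handle the confused case explicitly, instead of silently propagating a number that no sub-module actually endorses.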

Negative Example: An AIXI-like expected utility maximizer, programmed with the universal prior, assigns zero probability to hypercomputation. It fails to correct its prior.

#### Desideratum 4a: Robust to Perturbations of the Utility Function

A normalizer is robust to changes in its utility function that would seem inconsequential to a human.

We can imagine two worlds: one in which a programmer ate a sandwich and wrote the utility function one way, and another in which the programmer ate a salad and then (presumably due to how the food affected them) wrote the utility function in some subtly different way (maybe the ordering was flipped). A normalizer should take essentially the same actions regardless of which world it is in.
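The sandwich/salad thought experiment suggests a simple test one could run against a candidate agent: hand it both versions of the utility function and see whether its choice changes. The sketch below is an assumption-laden illustration, with hypothetical names (`argmax_agent`, `chooses_same_action`) and made-up utility values. It also shows that a plain argmax agent fails the test when the two versions differ only in a near-tie.

```python
from typing import Callable, Dict, Sequence

Utility = Dict[str, float]
Agent = Callable[[Sequence[str], Utility], str]


def argmax_agent(actions: Sequence[str], utility: Utility) -> str:
    """A pure maximizer: always picks the highest-scoring action."""
    return max(actions, key=lambda a: utility[a])


def chooses_same_action(
    agent: Agent, actions: Sequence[str], utilities: Sequence[Utility]
) -> bool:
    """True if the agent picks the same action under every utility variant."""
    choices = {agent(actions, u) for u in utilities}
    return len(choices) == 1


toy_actions = ["plan_a", "plan_b"]
# Two "subtly different" utility functions; only the ordering of a near-tie flips.
u_sandwich = {"plan_a": 1.000, "plan_b": 1.001}
u_salad = {"plan_a": 1.001, "plan_b": 1.000}

robust = chooses_same_action(argmax_agent, toy_actions, [u_sandwich, u_salad])
```

Here `robust` comes out false: the maximizer's choice flips on a difference a human would consider inconsequential. Passing such a check on a handful of variants would of course not show that an agent satisfies the desideratum in general.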

This allows us to define “almost correct” utility functions and yet still gain the value there is to gain.

#### Desideratum 4b: Robust to Ontological Crises

The agent continues to operate and take sane actions even if it learns that its ontology is flawed.[5]

Positive Example: After it proves that string theory, rather than atomic theory, is correct, the agent still recognizes my mother and offers her ice cream.

#### Desideratum 4c: Able to Deal with Normative Uncertainty

The agent should be able to coherently deal with situations in which the correct thing to do is unclear.

Say the agent wants to satisfy the utility functions of all humans. A normalizer should be able to sanely deal with this type of situation.

Positive Example: An agent is unsure of whether dolphins are moral patients. When considering options it takes this into account and takes an action which does not cause mass extinction of dolphins.

1. We assume that we have prevented the suggester from affecting the environment in ways that bypass the verifier (such as by using the physical processes which are its computations). This is probably equivalent in the limit to the problem of AI boxing.

2. Bostrom calls huge amounts of wasted utility astronomical waste. In “Astronomical Waste: The Opportunity Cost of Delayed Technological Development”, he argues that “if the goal of speed conflicts with the goal of global safety, the total utilitarian should always opt to maximize safety.”

3. A meliorizer is an agent which has a default policy, but searches for a higher utility policy. If it finds such a policy, it switches to using it. If the search fails, it continues using the default policy. See section 8 of the Tiling Agents draft for the original presentation and some more discussion. See Soares’ “Tiling Agents in Causal Graphs” for one formalization; specifically that of suggester-verifiers with a fallback policy.

4. An overzealous meliorizer who finds the higher utility (as measured by its utility function) strategy “destroy the Earth so that 100% of people on Earth are vacuously saved” fulfills the first desiderata, but fails to be a normalizer.

5. See de Blanc’s “Ontological Crises in Artificial Agents’ Value Systems” for more details on ontological crises.

by Paul Christiano 718 days ago | Kaya Stechly and Jessica Taylor like this

Desiderata 1 and 2 seem to be the general non-negotiable goals of AI control (I call property 1 scalability, but many people have talked about it under different names).

Why would a human not convert the future into paperclips if they wanted to maximize paperclips? This sounds structurally like “A human would not, in general, attempt to convert their entire future light cone into flourishing conscious experience.” But once we change the noun I’m not so sure.

Desiderata 3 and 4 also seem quite general; many (most?) people working on AI control aim to establish properties of this form.

Presumably the way to effect transparency under self-modification is to (1) ensure that transparent techniques are competitive with their opaque counterparts, and then (2) build systems that help humans get what they want in general, and so help the humans build understandable+capable AI systems as a special case.
