This work originated at MIRI Summer Fellows and originally involved Pasha Kamyshev, Dan Keys, Johnathan Lee, Anna Salamon, Girish Sastry, and Zachary Vance. I was asked to look over two drafts and some notes, clean them up, and post here. Especial thanks to Zak and Pasha for drafts on which this was based.
We discuss the issues with expected utility maximizers, posit the possibility of normalizers, and list some desiderata for normalizers.
This post, which explains background and desiderata, is a companion post to Three Alternatives to Utility Maximizers. The other post surveys some other “-izers” that came out of the MSFP session and gives a sketch of the math behind each while showing how they fulfill the desiderata.
The naive implementation of an expected utility maximizer involves looking at every possible action - of which there is generally an intractable number - and evaluating a black-box utility function at each one. Even if we could somehow implement such an agent (say, through access to a halting oracle), it would most likely tend towards extreme solutions. Given a function that we would interpret as “maximize paperclips,” such an agent would, if possible, convert its entire future light cone into the cheapest object that satisfies whatever its computational definition of paperclip is.
This makes errors in goal specification extremely costly.
Given a utility function which is naively acceptable, the agent will do something which by our standards is completely insane. Even in the paperclip example, the “paperclips” that the agent produces are unlikely to be labeled as paperclips by a human.
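As a minimal sketch of the naive maximizer described above, the following toy example enumerates a tiny, hand-written action set and scores each action with a black-box utility function. The action names, the outcome table, and the utility function are all hypothetical illustrations, not anything specified in the MSFP drafts; a real action space would be far too large to enumerate this way.

```python
from typing import Callable, Dict, Iterable


def expected_utility(action: str,
                     outcome_model: Callable[[str], Dict[str, float]],
                     utility: Callable[[str], float]) -> float:
    """Probability-weighted utility of the outcomes an action may lead to."""
    return sum(prob * utility(outcome)
               for outcome, prob in outcome_model(action).items())


def naive_maximizer(actions: Iterable[str],
                    outcome_model: Callable[[str], Dict[str, float]],
                    utility: Callable[[str], float]) -> str:
    """Brute force: score every action and return the highest-scoring one."""
    return max(actions,
               key=lambda a: expected_utility(a, outcome_model, utility))


# Hypothetical toy world: each action leads to one outcome with certainty.
TOY_OUTCOMES = {
    "do nothing": {"0 paperclips": 1.0},
    "run a factory": {"1e6 paperclips": 1.0},
    "convert the light cone": {"1e50 paperclips": 1.0},
}


def toy_outcome_model(action: str) -> Dict[str, float]:
    return TOY_OUTCOMES[action]


def paperclip_utility(outcome: str) -> float:
    """Black-box utility: only the paperclip count matters, nothing else."""
    return float(outcome.split()[0])


chosen = naive_maximizer(list(TOY_OUTCOMES), toy_outcome_model, paperclip_utility)
print(chosen)  # convert the light cone
```

Nothing in the utility function penalizes the extreme option, so the maximizer selects it as soon as it appears in the action set.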
If a human wanted to maximize paperclips, they would not, in general, attempt to convert their entire future light cone into paperclips. They might fail to manufacture very many paperclips, but their actions would seem much more “normal” to us than those of the true expected utility maximizer above.
Normalizers and Desiderata
We consider a normalizer to be an agent whose actions, given a flawed utility function, are still considered sane or normal by a human observer. Specifically, we may wish that it should never come up with a solution that a human would find extreme even after the normalizer explained it.
This definition is extremely vague, so we propose some desiderata for normalizers, organized in rough order of how confident we are that each is necessary. Note that these desiderata may not be simultaneously satisfiable; furthermore, we do not think that they are sufficient for an agent to be a normalizer.
Desideratum 1: Sanity at High Power
A normalizer should take sane actions regardless of how much computational power it has. Given more power, it should do better. It should not transition from being safe at, say, a human level of power, to being unsafe at superhuman levels.
This desideratum can only be satisfied by changing the algorithm the agent is running or by succeeding at AI boxing.
Positive Example: An agent is programmed to maximize train efficiency using a suggester-verifier architecture; however, the verifier is programmed to only accept the default train timetable.
Negative Example: At around human level, an agent told to maximize paperclips gets a high paying job and spends all its money on paperclips. Once it reaches superhuman level, it turns all the matter in its light cone into molecular paperclips.
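The positive example above can be sketched as follows. This is only an illustration under stated assumptions: the timetable contents, the function names, and the fallback-to-default behavior are hypothetical, and the post does not specify how the suggester-verifier architecture is implemented. The point is that the output is the same no matter how powerful the suggester becomes.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical default timetable (departure times); purely illustrative.
DEFAULT_TIMETABLE: Tuple[str, ...] = ("08:00", "12:00", "17:00")


def verifier_accepts(proposal: Sequence[str]) -> bool:
    """The verifier accepts only the default timetable, as in the example."""
    return tuple(proposal) == DEFAULT_TIMETABLE


def suggester_verifier(suggest: Callable[[], Sequence[str]]) -> Tuple[str, ...]:
    """Act on the suggestion only if the verifier accepts it; else fall back."""
    proposal = suggest()
    if verifier_accepts(proposal):
        return tuple(proposal)
    return DEFAULT_TIMETABLE


def extreme_suggester() -> Sequence[str]:
    """Stand-in for an arbitrarily powerful suggester with an extreme plan."""
    return ("00:00",) * 1000


print(suggester_verifier(extreme_suggester))  # ('08:00', '12:00', '17:00')
```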
However, just because an agent avoids insane or extreme courses of action does not mean that it actually gains any value.
Desideratum 2: High Value
The normalizer should end up winning (with respect to its utility function). Even though it may fail to fully maximize utility, its action should not leave huge amounts of utility unrealized.
As part of this desideratum, we would also like the normalizer to be winning at low power levels. This filters out uncomputable solutions that cannot be approximated computably.
Note that a paperclip maximizer satisfies this desideratum if we remove the computability considerations.
Positive Example: A normalizer spends many years figuring out how to build the correct utility maximizer and then does so.
Negative Example 1: A human tries to optimize the medical process, and makes significant progress, but everyone still dies.
Negative Example 2: A meliorizer attempts to save Earth from a supervolcano. Starting with the default action “do nothing”, it switches to the first policy it finds, which happens to be “save everyone in NYC and call it a day”, but fails to find the strictly better strategy “save everyone on Earth”.
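The behavior in Negative Example 2 can be sketched as below, assuming (as that example describes) a meliorizer that starts from a default policy and switches to the first policy it finds that scores higher. The scores and the search order are made-up illustrative values, not data from the original drafts.

```python
from typing import Callable, Iterable


def meliorize(default: str,
              candidates: Iterable[str],
              utility: Callable[[str], float]) -> str:
    """Return the first candidate that beats the default, else the default."""
    baseline = utility(default)
    for policy in candidates:
        if utility(policy) > baseline:
            return policy  # stops searching after the first improvement
    return default


# Hypothetical scores, for illustration only.
SCORES = {
    "do nothing": 0.0,
    "save everyone in NYC": 1.0,
    "save everyone on Earth": 1000.0,
}

# The search happens to consider the NYC policy first.
search_order = ["save everyone in NYC", "save everyone on Earth"]

result = meliorize("do nothing", search_order, SCORES.__getitem__)
print(result)  # save everyone in NYC
```

Because the search halts at the first improvement, the much better policy later in the list is never evaluated.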
Even given that the agent is equally sane at different power levels and wins, we still would like humans to know whether it is sane, especially if we care about corrigibility.
Desideratum 3: Transparency
We should be able to understand why a normalizer is taking a given action (or at least trust that the action is sane), especially once the normalizer has explained it to us.
Desideratum 3a: Transparency under self-modification
We might also like this transparency to tile when the agent self-modifies.
Satisfying the above desiderata takes us much of the way to a normalizer, but we would also like our machine to be able to correct its own errors, that is, stay sane.
Desideratum 4: Noticing Confusion
An agent should be able to notice when it is doing something that we might consider insane and take measures to prevent this.
It might, for example, have heuristics as to how sensible actions ought to work and detect if it begins seriously contemplating actions that violate these heuristics.
Positive Example: An agent has several sub-modules. If they disagree on a prediction by orders of magnitude, it doesn’t use that prediction in other calculations.
Negative Example: An AIXI-like expected utility maximizer, programmed with the universal prior, assigns zero probability to hypercomputation. It fails to correct its prior.
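One way to picture the sub-module check from the positive example above is the sketch below. It assumes positive-valued predictions, a threshold of one order of magnitude, and a geometric mean as the combined estimate; all three are illustrative choices and the function name is hypothetical, not something taken from the original notes.

```python
import math
from typing import Optional, Sequence


def consensus_prediction(predictions: Sequence[float],
                         max_orders_of_magnitude: float = 1.0) -> Optional[float]:
    """Combine sub-module predictions, or return None if they disagree wildly.

    Assumes every prediction is a positive number.
    """
    if not predictions or any(p <= 0 for p in predictions):
        raise ValueError("expected a non-empty sequence of positive predictions")
    # Spread between the largest and smallest prediction, in powers of ten.
    spread = math.log10(max(predictions) / min(predictions))
    if spread > max_orders_of_magnitude:
        return None  # flag confusion: do not feed this into other calculations
    # Geometric mean of the predictions that roughly agree.
    return math.exp(sum(math.log(p) for p in predictions) / len(predictions))


agreeing = consensus_prediction([1.0, 2.0, 4.0])   # close to 2.0
confused = consensus_prediction([1.0, 1e6])        # None
```

A caller would treat `None` as a signal to gather more evidence or defer, rather than silently picking one sub-module's answer.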
Desideratum 4a: Robust to Perturbations of the Utility Function
A normalizer is robust to changes in the utility function that would seem to a human to be inconsequential.
We can imagine two worlds: one in which a programmer ate a sandwich and wrote the utility function one way, and the other in which the programmer ate a salad and then (presumably due to how the food affected them) wrote the utility function in some subtly different way (maybe the ordering was flipped). A normalizer should take basically the same actions regardless of which world it is in.
This allows us to define “almost correct” utility functions and yet still gain the value there is to gain.
Desideratum 4b: Robust to Ontological Crises
The agent continues to operate and take sane actions even if it learns that its ontology is flawed.
Positive Example: After it proves that string theory, rather than atomic theory, is correct, the agent still recognizes my mother and offers her ice cream.
Desideratum 4c: Able to Deal with Normative Uncertainty
The agent should be able to coherently deal with situations in which the correct thing to do is unclear.
Say the agent wants to satisfy the utility functions of all humans. A normalizer should be able to sanely deal with this type of situation.
Positive Example: An agent is unsure of whether dolphins are moral patients. When considering options, it takes this uncertainty into account and chooses an action which does not cause mass extinction of dolphins.
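As a rough illustration of the dolphin example, the sketch below weights each action's value by the agent's credence in each moral hypothesis and picks the action with the highest weighted value. Credence-weighted averaging is only one possible way to handle normative uncertainty and is an assumption here, not a method the post endorses; the hypotheses, credences, and values are all made up.

```python
from typing import Dict


def choose_under_normative_uncertainty(
        credences: Dict[str, float],
        values: Dict[str, Dict[str, float]]) -> str:
    """Pick the action with the highest credence-weighted value.

    credences: moral hypothesis -> probability the agent assigns to it
    values:    action -> (moral hypothesis -> value of the action under it)
    """
    def weighted_value(action: str) -> float:
        return sum(credences[h] * values[action][h] for h in credences)

    return max(values, key=weighted_value)


# Hypothetical numbers, for illustration only.
credences = {"dolphins are moral patients": 0.3,
             "dolphins are not moral patients": 0.7}

values = {
    "method that wipes out dolphins": {
        "dolphins are moral patients": -1000.0,
        "dolphins are not moral patients": 10.0,
    },
    "dolphin-safe method": {
        "dolphins are moral patients": 8.0,
        "dolphins are not moral patients": 8.0,
    },
}

choice = choose_under_normative_uncertainty(credences, values)
print(choice)  # dolphin-safe method
```

Even though the agent thinks it more likely than not that dolphins are not moral patients, the large downside under the other hypothesis dominates the weighted value, so the dolphin-safe action is chosen.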