Value Learning for Irrational Toy Models
discussion post by Patrick LaVictoire 13 days ago

(This is a half-formed idea from discussions within MIRI; if it’s dumb, I take the full blame.)

In value learning contexts, it’s useful to have a toy model of human psychology: for example, to see whether an approach that would work to learn the values of an idealized rational agent might nonetheless robustly fail when faced with a more realistically constructed agent.

For example, here is a toy model of an irrational agent: take a world where the deterministic mapping $$f$$ from actions to outcomes is fully known, and take three different preference orderings on the set of outcomes. When the agent chooses among actions $$a_1, \dots, a_n$$, we first check whether any $$f(a_i)$$ dominates the others according to at least two of the preference orderings; if so, we take that action with certainty. Otherwise, we select randomly among the available actions. (This agent is a moral democracy, and if two of the three subagents agree on a policy, that policy is taken; otherwise, the agent hits a deadlock and acts at random.)
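A minimal sketch of this agent, under some assumptions that are not in the description above: each preference ordering is represented as a dict from outcome to rank (0 is best), "dominates the others" is read as "is ranked strictly above every other available outcome", and distinct actions lead to distinct outcomes. The names `choose`, `OUTCOME_OF`, and `ORDERINGS` are purely illustrative.

```python
import random

def choose(actions, f, orderings, rng=random):
    """Return the action whose outcome at least two orderings rank first
    among the available outcomes; otherwise pick an action at random."""
    actions = list(actions)
    outcomes = {a: f(a) for a in actions}
    for a in actions:
        votes = 0
        for rank in orderings:
            # This subagent votes for `a` if it ranks f(a) strictly above
            # the outcome of every other available action.
            if all(rank[outcomes[a]] < rank[outcomes[b]]
                   for b in actions if b != a):
                votes += 1
        if votes >= 2:
            return a  # a majority of the three subagents agree
    return rng.choice(actions)  # deadlock: act at random

# Hypothetical world: three actions, three outcomes, three subagents.
OUTCOME_OF = {"a1": "X", "a2": "Y", "a3": "Z"}

def f(action):
    return OUTCOME_OF[action]

ORDERINGS = [
    {"X": 0, "Y": 1, "Z": 2},  # subagent 1: X > Y > Z
    {"Y": 0, "Z": 1, "X": 2},  # subagent 2: Y > Z > X
    {"Z": 0, "X": 1, "Y": 2},  # subagent 3: Z > X > Y
]

print(choose(["a1", "a2"], f, ORDERINGS))        # a1 (subagents 1 and 3 agree)
print(choose(["a1", "a2", "a3"], f, ORDERINGS))  # deadlock, so a random action
```

With three orderings, each subagent votes for at most one action, so at most one action can collect two votes and the order of the loop does not matter.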

It is easy to construct agents of this form which exhibit circular preferences in binary choices. We can therefore ask whether a particular value learning algorithm would satisfy sensible desiderata when learning from such an agent. (For instance, if outcome $$X$$ strictly dominates outcome $$Y$$ according to all three preference orderings, we might desire that our value learning algorithm not act so as to result in $$Y$$ when it could instead have acted so as to result in $$X$$.)
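To make the circularity concrete, here is a small sketch using hypothetical rankings that are not taken from the post: three outcomes arranged in a Condorcet-style cycle, plus a fourth outcome `W` that every ordering ranks last. The helper names (`majority_prefers`, `unanimously_prefers`, `violates_desideratum`) are made up for illustration, and the last one encodes only the example desideratum mentioned above.

```python
# Hypothetical rankings: each list orders outcomes from best to worst.
ORDERINGS = [
    ["X", "Y", "Z", "W"],
    ["Y", "Z", "X", "W"],
    ["Z", "X", "Y", "W"],
]

def prefers(ordering, x, y):
    """True if this single ordering ranks x strictly above y."""
    return ordering.index(x) < ordering.index(y)

def majority_prefers(x, y, orderings=ORDERINGS):
    """True if at least two of the orderings rank x above y."""
    return sum(prefers(o, x, y) for o in orderings) >= 2

def unanimously_prefers(x, y, orderings=ORDERINGS):
    """True if every ordering ranks x above y (strict dominance)."""
    return all(prefers(o, x, y) for o in orderings)

def violates_desideratum(chosen, available, orderings=ORDERINGS):
    """True if some available outcome strictly dominates the chosen one
    according to all orderings."""
    return any(unanimously_prefers(alt, chosen, orderings)
               for alt in available if alt != chosen)

# Binary choices are circular: X beats Y, Y beats Z, and Z beats X.
print([majority_prefers(x, y) for x, y in [("X", "Y"), ("Y", "Z"), ("Z", "X")]])
# [True, True, True]

# A learner that produces W when X was available fails the desideratum,
# while producing X does not.
print(violates_desideratum("W", ["X", "W"]))  # True
print(violates_desideratum("X", ["X", "W"]))  # False
```

Note that no outcome inside the cycle unanimously dominates another, so this particular desideratum is silent about how a learner should choose among `X`, `Y`, and `Z`.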

# The Hard Problem of Value Learning

But of course, a human brain is not even as simple as that toy model of irrationality. I’ve thought it might be useful to sketch out the level of generality that I actually believe is necessary, in order to show how hard the problem may be to get right.

Human brains do some amount of consequentialist reasoning [citation needed], so arguably at some point in cognition there exist heuristics for evaluating the overall desirability of various outcomes. We would like our value learning process to infer these heuristics and take them into account (this seems necessary, though not sufficient).

We cannot assume that the human will take actions that effectively argmax these heuristics (though the human’s actions will strongly correlate with the argmax in some regime); we cannot assume we know whether these heuristics assign values to states or to action-state pairs; we cannot assume that these heuristics make use of all the important information from the original observations; and so on.

It seems to me as if our value learning algorithm will be trying to figure out the contents of the red box from the blue boxes:

This is not as hopeless as it seems, since we still have an assumption that the mapping from observations to actions approximately factors in this way, and that the orange boxes have been at least somewhat selected for performance. But it’s far beyond what CIRL, for example, would be able to infer.
