Value Learning for Irrational Toy Models
discussion post by Patrick LaVictoire 128 days ago | discuss

(This is a half-formed idea from discussions within MIRI; if it’s dumb, I take the full blame.)

In value learning contexts, it’s useful to have a toy model of human psychology, to see (for example) if a certain approach would work to learn the values of an idealized rational agent, but might robustly fail when faced with a more realistically constructed agent.

For example, here is a toy model of an irrational agent: take a world where the deterministic mapping $$f$$ from actions to outcomes is fully known, and take three different preference orderings on the set of outcomes. When the agent chooses among actions $$a_1, \dots, a_n$$, we first check whether any $$f(a_i)$$ dominates the others according to at least two of the preference orderings; if so, we take that action with certainty. Otherwise, we select randomly among the available actions. (This agent is a moral democracy, and if two of the three subagents agree on a policy, that policy is taken; otherwise, the agent hits a deadlock and acts at random.)

It is easy to construct agents of this form which exhibit circular preferences in binary choices. We can therefore ask whether a particular value learning algorithm would satisfy sensible desiderata when learning from such an agent. (For instance, if outcome $$X$$ strictly dominates outcome $$Y$$ according to all three preference orderings, we might desire that our value learning algorithm not act so as to result in $$Y$$ when it could instead have acted so as to result in $$X$$.)

# The Hard Problem of Value Learning

But of course, a human brain is not even as simple as that toy model of irrationality. I’ve thought it might be useful to sketch out the level of generality that I actually believe is necessary, in order to show how hard the problem may be to get right.

Human brains do some amount of consequentialist reasoning [citation needed], so arguably at some point of cognition there exist heuristics for evaluating the overall desirability of various outcomes. We would like our value learning process to infer these heuristics and take them into account (this seems necessary, not sufficient).

We cannot assume that the human will take actions that effectively argmax these heuristics (though it will strongly correlate in some regime); we cannot assume whether these heuristics give us values for states, or for action-state pairs; we cannot assume that these heuristics make use of all the important information from the original observations, etc.

It seems to me as if our value learning algorithm will be trying to figure out the contents of the red box from the blue boxes:

This is not as hopeless as it seems, since we still have an assumption that the mapping from observations to actions approximately factors in this way, and that the orange boxes have been at least somewhat selected for performance. But it’s a far cry beyond what CIRL, for example, would be able to infer.

### NEW DISCUSSION POSTS

Note that the problem with
 by Vadim Kosoy on Open Problems Regarding Counterfactuals: An Introd... | 0 likes

Typos on page 5: *
 by Vadim Kosoy on Open Problems Regarding Counterfactuals: An Introd... | 0 likes

Ah, you're right. So gain
 by Abram Demski on Smoking Lesion Steelman | 0 likes

> Do you have ideas for how
 by Jessica Taylor on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

I think I understand what
 by Wei Dai on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

>You don’t have to solve
 by Wei Dai on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

 by Vadim Kosoy on Delegative Inverse Reinforcement Learning | 0 likes

My confusion is the
 by Tom Everitt on Delegative Inverse Reinforcement Learning | 0 likes

> First of all, it seems to
 by Abram Demski on Smoking Lesion Steelman | 0 likes

> figure out what my values
 by Vladimir Slepnev on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

I agree that selection bias
 by Jessica Taylor on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

>It seems quite plausible
 by Wei Dai on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

> defending against this type
 by Paul Christiano on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

2. I think that we can avoid
 by Paul Christiano on Autopoietic systems and difficulty of AGI alignmen... | 0 likes

I hope you stay engaged with
 by Wei Dai on Autopoietic systems and difficulty of AGI alignmen... | 0 likes