Value Learning for Irrational Toy Models
discussion post by Patrick LaVictoire 433 days ago

(This is a half-formed idea from discussions within MIRI; if it’s dumb, I take the full blame.)

In value learning contexts, it’s useful to have a toy model of human psychology, for example to see whether an approach that would successfully learn the values of an idealized rational agent might nonetheless robustly fail when faced with a more realistically constructed agent.

For example, here is a toy model of an irrational agent: take a world where the deterministic mapping $$f$$ from actions to outcomes is fully known, and take three different preference orderings on the set of outcomes. When the agent chooses among actions $$a_1, \dots, a_n$$, we first check whether any $$f(a_i)$$ dominates the others according to at least two of the preference orderings; if so, we take that action with certainty. Otherwise, we select randomly among the available actions. (This agent is a moral democracy, and if two of the three subagents agree on a policy, that policy is taken; otherwise, the agent hits a deadlock and acts at random.)
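As a minimal sketch of the agent described above, the following Python represents each preference ordering as a tuple of outcomes listed from most to least preferred. The names `prefers`, `majority_winner`, and `choose` are hypothetical, and the code rests on two assumptions not stated in the post: the orderings are strict, and the available actions map to distinct outcomes. It also reads "dominates the others according to at least two of the preference orderings" as "at least two orderings each rank that outcome above every other available outcome", which is one reasonable interpretation rather than the only one.

```python
import random

def prefers(ranking, x, y):
    """True if outcome x is ranked strictly above outcome y."""
    return ranking.index(x) < ranking.index(y)

def majority_winner(actions, f, rankings):
    """Return the action whose outcome beats every other available outcome
    under at least two of the rankings, or None if there is a deadlock.

    f is a dict from actions to outcomes (the known deterministic mapping).
    """
    for a in actions:
        votes = sum(
            all(prefers(r, f[a], f[b]) for b in actions if b != a)
            for r in rankings
        )
        if votes >= 2:
            return a
    return None

def choose(actions, f, rankings, rng=random):
    """The agent's policy: follow the majority if one exists, else act at random."""
    winner = majority_winner(actions, f, rankings)
    return winner if winner is not None else rng.choice(list(actions))

# Illustrative data (an assumption, not from the post): three subagents
# with rotated rankings over three outcomes.
rankings = [("X", "Y", "Z"), ("Y", "Z", "X"), ("Z", "X", "Y")]
f = {"a": "X", "b": "Y", "c": "Z"}
```

With strict orderings, at most one action can win in this sense, since two different winners would need to share a top spot in at least one of the three orderings. With the illustrative data, a choice between `a` and `b` is settled by a two-to-one vote, while a choice among all three actions deadlocks and falls through to the random branch.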

It is easy to construct agents of this form which exhibit circular preferences in binary choices. We can therefore ask whether a particular value learning algorithm would satisfy sensible desiderata when learning from such an agent. (For instance, if outcome $$X$$ strictly dominates outcome $$Y$$ according to all three preference orderings, we might desire that our value learning algorithm not act so as to result in $$Y$$ when it could instead have acted so as to result in $$X$$.)
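To make the circularity concrete, here is a small sketch using a Condorcet-style profile of three rotated rankings. The specific rankings, and the helper names `majority_prefers`, `unanimously_dominates`, and `violates_desideratum`, are illustrative assumptions rather than anything specified in the post. The last helper is one possible way to phrase the dominance desideratum as a check on a learner's chosen outcome.

```python
def majority_prefers(x, y, rankings):
    """True if at least two of the rankings place x strictly above y."""
    return sum(r.index(x) < r.index(y) for r in rankings) >= 2

def unanimously_dominates(x, y, rankings):
    """True if every ranking places x strictly above y."""
    return all(r.index(x) < r.index(y) for r in rankings)

def violates_desideratum(chosen, available, rankings):
    """True if some available outcome unanimously dominates the chosen one."""
    return any(
        unanimously_dominates(other, chosen, rankings)
        for other in available
        if other != chosen
    )

# Rotated rankings: each subagent's ordering is a cyclic shift of the others.
cyclic = [("X", "Y", "Z"), ("Y", "Z", "X"), ("Z", "X", "Y")]

# In binary choices the agent picks X over Y, Y over Z, and Z over X.
cycle = [
    majority_prefers("X", "Y", cyclic),
    majority_prefers("Y", "Z", cyclic),
    majority_prefers("Z", "X", cyclic),
]

# A different profile in which X beats Y under all three rankings.
agreeing = [("X", "Y", "Z"), ("X", "Z", "Y"), ("Z", "X", "Y")]
```

Here `cycle` comes out as all `True`, so no single utility function over outcomes reproduces the agent's binary choices. Under the `agreeing` profile, a learner that brings about `Y` when `X` was available would be flagged by `violates_desideratum`, while the cyclic profile has no unanimous dominance at all and so the desideratum places no constraint there.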

The Hard Problem of Value Learning

But of course, a human brain is not even as simple as that toy model of irrationality. I’ve thought it might be useful to sketch out the level of generality that I actually believe is necessary, in order to show how hard the problem may be to get right.

Human brains do some amount of consequentialist reasoning [citation needed], so arguably at some point in cognition there exist heuristics for evaluating the overall desirability of various outcomes. We would like our value learning process to infer these heuristics and take them into account (this seems necessary but not sufficient).

We cannot assume that the human will take actions that effectively argmax these heuristics (though the actions taken will strongly correlate with them in some regime); we cannot assume that these heuristics assign values to states rather than to action-state pairs, or vice versa; we cannot assume that these heuristics make use of all the important information from the original observations; and so on.

It seems to me as if our value learning algorithm will be trying to figure out the contents of the red box from the blue boxes:

This is not as hopeless as it seems, since we still have an assumption that the mapping from observations to actions approximately factors in this way, and that the orange boxes have been at least somewhat selected for performance. But it is far beyond what CIRL, for example, would be able to infer.
