Intelligent Agent Foundations Forum
Value Learning for Irrational Toy Models
discussion post by Patrick LaVictoire 281 days ago

(This is a half-formed idea from discussions within MIRI; if it’s dumb, I take the full blame.)

In value learning contexts, it’s useful to have a toy model of human psychology. Such a model lets us see, for example, whether an approach that would work to learn the values of an idealized rational agent might nonetheless robustly fail when faced with a more realistically constructed agent.

For example, here is a toy model of an irrational agent: take a world where the deterministic mapping \(f\) from actions to outcomes is fully known, and take three different preference orderings on the set of outcomes. When the agent chooses among actions \(a_1, \dots, a_n\), we first check whether any outcome \(f(a_i)\) dominates the others according to at least two of the three preference orderings; if so, the agent takes that action with certainty. Otherwise, it selects randomly among the available actions. (This agent is a moral democracy: if two of the three subagents agree on a policy, that policy is taken; otherwise, the agent hits a deadlock and acts at random.)
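
A minimal sketch of this agent in Python may make the decision rule concrete. The names `choose` and `prefers`, and the representation of each preference ordering as a best-first list, are illustrative assumptions rather than part of the original idea. The sketch also reads "dominates the others" as "is ranked strictly above every other available outcome", and it treats "randomly" as uniformly at random; both are only one possible interpretation.

```python
import random


def choose(actions, f, orderings, rng=random):
    """Pick an action for the three-subagent 'moral democracy' toy agent.

    actions   : list of available actions
    f         : dict mapping each action to its deterministic outcome
    orderings : three lists of outcomes, each ranked best-first (assumed format)
    rng       : source of randomness used only when the subagents deadlock
    """

    def prefers(ordering, x, y):
        # Strict preference: x is listed earlier (ranked higher) than y.
        return ordering.index(x) < ordering.index(y)

    for a in actions:
        others = [b for b in actions if b != a]
        # Count the subagents that rank f(a) above every alternative outcome.
        votes = sum(
            all(prefers(o, f[a], f[b]) for b in others) for o in orderings
        )
        if votes >= 2:
            return a  # a majority agrees, so this action is taken with certainty
    return rng.choice(actions)  # deadlock: act uniformly at random (assumption)


# Hypothetical example world: three actions, each leading to a distinct outcome.
f = {"a1": "X", "a2": "Y", "a3": "Z"}
actions = list(f)

# Two of the three orderings put X on top, so the agent always picks a1.
agreeing = [["X", "Y", "Z"], ["X", "Z", "Y"], ["Z", "Y", "X"]]
majority_pick = choose(actions, f, agreeing)

# Every ordering has a different top outcome, so the agent deadlocks.
split = [["X", "Y", "Z"], ["Y", "Z", "X"], ["Z", "X", "Y"]]
deadlock_pick = choose(actions, f, split)
```

Under these assumptions `majority_pick` is always `"a1"`, while `deadlock_pick` varies from run to run.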

It is easy to construct agents of this form which exhibit circular preferences in binary choices. We can therefore ask whether a particular value learning algorithm would satisfy sensible desiderata when learning from such an agent. (For instance, if outcome \(X\) strictly dominates outcome \(Y\) according to all three preference orderings, we might desire that our value learning algorithm not act so as to result in \(Y\) when it could instead have acted so as to result in \(X\).)
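
As a rough illustration of both claims, the sketch below builds one such agent from the classic Condorcet triple of orderings, then checks the dominance desideratum on a second set of orderings. The outcome labels and the function names (`binary_choice`, `unanimously_dominates`, `violates_desideratum`) are hypothetical, and orderings are again assumed to be best-first lists.

```python
# Three subagent orderings over outcomes X, Y, Z, each listed best-first.
CYCLIC = [["X", "Y", "Z"], ["Y", "Z", "X"], ["Z", "X", "Y"]]


def binary_choice(x, y, orderings):
    """Outcome the toy agent picks when offered only x and y.

    With three strict orderings and two options, some option always has at
    least two votes, so the choice is deterministic.
    """
    votes_for_x = sum(o.index(x) < o.index(y) for o in orderings)
    return x if votes_for_x >= 2 else y


def unanimously_dominates(x, y, orderings):
    """True if every subagent strictly prefers x to y."""
    return all(o.index(x) < o.index(y) for o in orderings)


def violates_desideratum(chosen, available, orderings):
    """True if the learner produced `chosen` although some available
    alternative is strictly preferred by all three orderings."""
    return any(
        unanimously_dominates(alt, chosen, orderings)
        for alt in available
        if alt != chosen
    )


# Circular binary preferences: X beats Y, Y beats Z, and Z beats X.
cycle = [
    binary_choice("X", "Y", CYCLIC),
    binary_choice("Y", "Z", CYCLIC),
    binary_choice("Z", "X", CYCLIC),
]

# A different agent in which all three subagents prefer X to Y.
UNANIMOUS = [["X", "Y", "Z"], ["X", "Z", "Y"], ["Z", "X", "Y"]]
bad = violates_desideratum("Y", ["X", "Y"], UNANIMOUS)   # learner chose Y
good = violates_desideratum("X", ["X", "Y"], UNANIMOUS)  # learner chose X
```

Here `cycle` comes out as `["X", "Y", "Z"]`, so no single utility function over outcomes reproduces the agent's pairwise choices, yet the dominance check is still well defined: `bad` is `True` and `good` is `False`.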

The Hard Problem of Value Learning

But of course, a human brain is not even as simple as that toy model of irrationality. I’ve thought it might be useful to sketch out the level of generality that I actually believe is necessary, in order to show how hard the problem may be to get right.

Human brains do some amount of consequentialist reasoning [citation needed], so arguably at some point in cognition there exist heuristics for evaluating the overall desirability of various outcomes. We would like our value learning process to infer these heuristics and take them into account (this seems necessary, though not sufficient).

We cannot assume that the human will take actions that effectively argmax these heuristics (though the actions taken will strongly correlate with the argmax in some regime); we cannot assume that we know whether these heuristics give us values for states or for action-state pairs; we cannot assume that these heuristics make use of all the important information from the original observations; and so on.

It seems to me as if our value learning algorithm will be trying to figure out the contents of the red box from the blue boxes:

[Figure: Value Learning Diagram]

This is not as hopeless as it seems, since we still have an assumption that the mapping from observations to actions approximately factors in this way, and that the orange boxes have been at least somewhat selected for performance. But it’s far beyond what CIRL (cooperative inverse reinforcement learning), for example, would be able to infer.




