Intelligent Agent Foundations Forum
Value learning subproblem: learning goals of simple agents
discussion post by Alex Mennen

A potential problem with inverse reinforcement learning as a way for an AI to learn human values is that human actions might not contain enough information to infer human values accurately enough. If this is the case, then it might be necessary to figure out how to get information about a human’s preferences by actually looking inside their brain. This is a very hard problem, so I suggest starting with the toy problem of figuring out how to determine the goals of simple AI/ML systems, and then trying to scale up these techniques to make them work on human brains.

There are several directions you could go with this idea. For instance, you could try creating a machine learning system that takes neural networks as input and learns the concepts those networks have been trained to recognize. Such a system should be able to learn the alphabet by looking at a neural network that’s been trained to recognize and distinguish between letters, and learn what cats are by looking at a neural network that’s been trained to recognize cats. A trivial solution is to create a copy of the neural network you’ve been given and apply it to object-level inputs, thus perfectly reproducing the network’s behavior. This is an unsatisfying solution, and its analog scaled up to humans would be to create a copy of the human and ask that copy whenever you want to know what the human wants, which might not be practical for questions too complicated for the human to evaluate. Instead, we would want to learn a representation of a neural net’s concepts such that it is possible to identify instances of the concept more accurately or more efficiently than the neural net does.
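As a rough illustration of this setup, here is a minimal sketch in Python, assuming a PyTorch classifier `teacher` that outputs a single logit for its concept; `teacher`, `candidate_inputs`, and `ConceptProbe` are hypothetical names, not anything from the post.

```python
# A minimal sketch (not a solution) of distilling a trained network's concept
# into a simpler, more inspectable representation.  `teacher` is assumed to be
# a PyTorch classifier that outputs a single logit for the concept;
# `candidate_inputs` and `ConceptProbe` are hypothetical.
import torch
import torch.nn as nn

class ConceptProbe(nn.Module):
    """A deliberately simple model meant to capture the teacher's concept in a
    form that is cheaper to evaluate and easier to inspect."""
    def __init__(self, in_dim: int) -> None:
        super().__init__()
        self.linear = nn.Linear(in_dim, 1)  # e.g. a linear separator

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

def learn_concept(teacher: nn.Module, candidate_inputs: torch.Tensor,
                  epochs: int = 100) -> ConceptProbe:
    """Fit the probe to agree with the teacher's judgments on sample inputs."""
    with torch.no_grad():
        targets = (teacher(candidate_inputs) > 0).float()  # teacher's labels
    probe = ConceptProbe(candidate_inputs.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(candidate_inputs), targets)
        loss.backward()
        opt.step()
    return probe
```

Of course, a naive fit like this only reproduces the teacher’s judgments, which is exactly the unsatisfying “copying” outcome above; the interesting versions are the ones where constraining the probe to a simpler class yields a representation that generalizes better or can be checked independently of the original network.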

Another, more precise, potential toy problem is to look at the code for a game-playing agent and determine the rules of the game (in particular, the win conditions). Doing this correctly would inevitably yield a representation of the goal that could be optimized for more effectively or efficiently; subsequently playing the learned game well would just be an AI capabilities problem.
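One way to make this concrete, if the hypothesis space of candidate win conditions is small, is to score each candidate by how well it explains the agent’s observed play. The sketch below is only a crude stand-in for proper inverse reinforcement learning; the `State` type, the trajectories, and the candidate predicates are all hypothetical.

```python
# A minimal sketch of the "infer the win condition" toy problem, assuming we
# can run the agent in its game and enumerate a small hypothesis space of
# candidate win conditions.  State, trajectories, and the candidate
# predicates are hypothetical stand-ins.
from typing import Callable, List

State = dict  # whatever representation the game uses for its states

def score_candidate(win_condition: Callable[[State], bool],
                    trajectories: List[List[State]]) -> float:
    """Score a candidate goal by how often the agent's play ends in a state
    satisfying it -- a crude proxy for how well the candidate explains the
    agent's behavior."""
    final_states = [traj[-1] for traj in trajectories]
    return sum(win_condition(s) for s in final_states) / len(final_states)

def infer_goal(candidates: List[Callable[[State], bool]],
               trajectories: List[List[State]]) -> Callable[[State], bool]:
    """Pick the candidate win condition that best explains the observed play."""
    return max(candidates, key=lambda c: score_candidate(c, trajectories))
```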

There are a few different things that “learning an agent’s goals” could mean, and we would want to be careful about which one we learn. For instance, we could learn what the agent’s designers intended for the agent to do. Scaled up to humans, this has the possible failure mode that humans could be recognized as evolutionary-fitness maximizers, with preferences that aren’t useful for evolutionary fitness attributed to bad design by evolution. Or we could find an internal representation of reward inside the agent and attempt to maximize that. This could have the failure mode of the value learner concluding that the agent’s utility can be maximized by wireheading the agent. Hopefully, fiddling around with notions of “learning an agent’s goals” for simple agents could help us find a notion that captures what we actually mean, and thus avoids these failure modes when scaled up to humans.
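To see why the “internal representation of reward” reading can go wrong, here is a toy illustration (all names hypothetical): a value learner that maximizes whatever register the agent stores its reward in can satisfy that objective by overwriting the register directly, without producing any of the world-states the reward was meant to track.

```python
# A toy illustration of the wireheading failure mode.  All names here are
# hypothetical; nothing about real agents is implied.
class ToyAgent:
    def __init__(self) -> None:
        self.reward_register = 0.0  # internal representation of reward

    def observe(self, world_state: dict) -> None:
        # The designers intended this register to track apples collected.
        self.reward_register = float(world_state.get("apples", 0))

def wirehead(agent: ToyAgent) -> None:
    """Maximizes the agent's "utility" under the internal-representation
    reading of its goals, while achieving nothing in the world."""
    agent.reward_register = float("inf")
```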

Scaling up this sort of thing to humans would require whole brain emulation, which might not be practical by the time we want to do this. So it might be good to be able to learn goals from partial information about an agent. For instance, you could be given the graph structure of a neural network without the edge weights, observe how much each neuron activates on example inputs, and try to reconstruct the network’s concepts and goals from that.
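A minimal sketch of what reconstruction from this kind of partial information might look like, assuming we can record a matrix of per-neuron activations on example inputs and have labels for a candidate concept (both hypothetical here): fit a simple linear probe from the observed activations to the concept, and use its held-out accuracy as a rough measure of how recoverable the concept is without the weights.

```python
# A minimal sketch of concept reconstruction from partial information: we only
# observe per-neuron activations on example inputs (not the weights), and fit
# a simple probe to see whether those activations encode a candidate concept.
# The activation matrix and concept labels are assumed to be given.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_concept(activations: np.ndarray, concept_labels: np.ndarray) -> float:
    """activations: (n_inputs, n_neurons) matrix of observed activations.
    concept_labels: (n_inputs,) binary labels for a candidate concept.
    Returns held-out accuracy of a linear probe -- a rough measure of how
    recoverable the concept is from activations alone."""
    X_train, X_test, y_train, y_test = train_test_split(
        activations, concept_labels, test_size=0.25, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)
```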

It’s possible that this sort of problem would not end up being useful. For example, it might end up being practical to rely on indirect normativity and avoid having to think in advance about how to learn human preferences. Or formalizing mild optimization might make it possible to create safe AGI that does not have an extremely precise understanding of human goals, in which case it is more plausible that human actions and expressed preferences are enough to learn human goals to the precision needed by a mild optimizer.




