Intelligent Agent Foundations Forum
Learning values, or defining them?
discussion post by Stuart Armstrong 110 days ago

A putative new idea for AI control; index here.

Is there a teapot currently somewhere between the orbit of Earth and Mars? And who won the 1986 World Cup football match between England and Argentina? And what do these questions have to do with learning human values?

Underlying true reality?

Both of those opening questions are uncertain. We haven’t scanned the solar system with enough precision to find any such teapot. And though the referee allowed the first Argentine goal, making it official, and though FIFA had Argentina progress to the semi-finals (they would eventually win the tournament) while England was eliminated, that goal, the “Hand of God” goal, was scored by Maradona with his hand, a totally illegal move.

In a sense, neither question can ever be fully resolved. Even if we fully sweep the solar system for teapots in a century’s time, it’s still possible that there is one now and that it will have crashed into the sun by then, stopping us from ever finding it. And in the spirit of ambijectivity, the question of Argentina’s victory (or whether it was a “proper” victory, or a “fair” victory) depends on which aspect of the definition of victory you choose to emphasise - the referee’s call and the official verdict, versus the clear violation of the rules.

Nevertheless, there is a sense in which we feel the first question has a definite answer (which is almost certainly “no”), while the second is purely about definitions.

Answering the question

Why do we feel that the teapot question has a definite answer? Well, we have a model of reality as something that actually objectively exists, and our investigation of the world backs it up - when confronted by a door, we feel that there is something behind the door, even if we choose not to open it. There are various counterfactuals in which we could have sent out a probe to any given area of the orbit, so we feel we could have resolved the “is there a teapot at this location” question for any given location within a wide radius of the Earth.

Basically, the universe has features that cause us to believe that when we observe it (quantum effects aside), we are seeing a pre-existing reality rather than creating a new one (see the old debate between platonists and formalists in mathematics).

Whereas even if we had a God or an AI, we don’t expect it to have a definite answer to the question of who won that football match. There is no platonic underlying reality as to the winner of the game, that we could just figure out if we had enough knowledge. We already know everything that’s relevant.

Procedure for learning/defining human values

Many attempts at learning human values are framed as “humans have an underlying true reward R, and here is procedure P for determining it”.

But in most cases, that formulation is incorrect, because the paper is actually saying “here is a procedure P for determining human values”. Actually saying that humans have true rewards is a much more complicated process: you have to justify that the true R exists, like Russell’s teapot, rather than being a question of definition, like Argentina’s football victory.

That sounds like a meaningless distinction: what is, in practice, the difference between having a true reward R together with an imperfect estimator P, and having just P? It’s simpler conceptually if we talk about the true R, so why not do so?

It’s simpler, but much more misleading. If you work under the assumption that there is a true R, then you’re less likely to think that P might lead you astray. And if you see imperfections in P, your first instinct is something like “make P more rational/more likely to converge on R”, rather than asking the real question, which is “is P a good definition of human values?”
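As a toy illustration of that shift in question (not taken from any particular value-learning paper), the sketch below runs two hypothetical procedures, `p_revealed` and `p_stated`, on the same made-up behaviour log. The data, the function names and the "share of observations" scoring are all assumptions chosen for illustration. Note that no R appears anywhere in the code, so there is nothing for either procedure to be "closer" to; all one can ask is whether each output is an acceptable definition.

```python
from collections import Counter

# Made-up toy log: (option actually chosen, option the person says they prefer).
observations = [
    ("cake", "salad"),
    ("cake", "salad"),
    ("salad", "salad"),
    ("cake", "salad"),
]


def p_revealed(obs):
    """Hypothetical procedure: values are defined by what is actually chosen."""
    counts = Counter(chosen for chosen, _ in obs)
    return {option: n / len(obs) for option, n in counts.items()}


def p_stated(obs):
    """Hypothetical procedure: values are defined by what the person endorses."""
    counts = Counter(stated for _, stated in obs)
    return {option: n / len(obs) for option, n in counts.items()}


revealed = p_revealed(observations)
stated = p_stated(observations)

# Each procedure yields its own top-valued option; they need not agree.
top_revealed = max(revealed, key=revealed.get)
top_stated = max(stated, key=stated.get)

print("revealed:", revealed, "->", top_revealed)
print("stated:  ", stated, "->", top_stated)
```

The two procedures disagree about the top-valued option on identical data. Treating one of them as a noisy estimate of a hidden R would suggest "fixing" it; treating each as a candidate definition instead prompts the question of which behaviour we are willing to call the person's values.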

Even if the moral realists are right, and there is a true R, thinking about it is still misleading. Because there is, as yet, no satisfactory definition of this true R, and it’s very hard to make something converge better onto something you haven’t defined. Shifting the focus from the unknown (and maybe unknowable, or maybe even non-existent) R, to the actual P, is important.

It’s because I was focused so strongly on the procedure P, treating R as non-existent, that I was able to find some of the problems with value learning algorithms.

When you think that way, it becomes natural to ponder issues like “if we define victory through the official result, this leaves open the possibility for referee-concealed rule-breaking; is this acceptable for whatever we need that definition for?”
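The football case can be sketched in the same spirit. The snippet below is only an illustration: the `Goal` record, its field names and the `winner` helper are hypothetical, and the goal list (Maradona twice for Argentina, Lineker once for England, official score 2-1) comes from the historical record rather than from this post. Two definitions of "victory" are applied to exactly the same facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Goal:
    team: str
    scorer: str
    allowed_by_referee: bool  # did the goal stand officially?
    within_rules: bool        # was it scored legally?


# Same facts for both definitions; nothing further about the match is unknown.
GOALS = [
    Goal("Argentina", "Maradona", True, False),  # the "Hand of God" goal
    Goal("Argentina", "Maradona", True, True),
    Goal("England", "Lineker", True, True),
]


def winner(goals, counts):
    """Return the team with the most goals that `counts` accepts, or 'draw'."""
    tally = {}
    for goal in goals:
        tally.setdefault(goal.team, 0)
        if counts(goal):
            tally[goal.team] += 1
    best = max(tally.values())
    leaders = [team for team, n in tally.items() if n == best]
    return leaders[0] if len(leaders) == 1 else "draw"


# Definition 1: victory is whatever the official result says.
official = winner(GOALS, lambda g: g.allowed_by_referee)
# Definition 2: only goals scored within the rules count.
by_rules = winner(GOALS, lambda g: g.within_rules)

print("official result:", official)
print("rules-only result:", by_rules)
```

Neither function is a better or worse estimate of some hidden "true winner". They are different definitions, and the official one visibly rewards rule-breaking that the referee misses, which is exactly the kind of property to check before adopting a definition.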




