A putative new idea for AI control; index here.
Is there a teapot currently somewhere between the orbits of Earth and Mars? And who won the 1986 World Cup football match between England and Argentina? And what do these questions have to do with learning human values?
Underlying true reality?
Both of those questions are uncertain. We haven’t scanned the solar system with enough precision to find any such teapot. And though the referee allowed the first Argentine goal, making it official, and though FIFA had Argentina progress to the semi-finals (they would eventually win the tournament) while England was eliminated… that goal, the “Hand of God” goal, was scored by Maradona with his hand, a totally illegal move.
In a sense, neither question can ever be fully resolved. Even if we fully sweep the solar system for teapots in a century’s time, it’s still possible there might have been one now, and that it then crashed into the sun, stopping us from ever finding it. And in the spirit of ambijectivity, the question of Argentina’s victory (or whether it was a “proper” victory, or a “fair” victory) depends on which aspect of the definition of victory you choose to emphasise: the referee’s call and the official verdict, or the clear violation of the rules.
Nevertheless, there is a sense in which we feel the first question has a definite answer (which is almost certainly “no”), while the second is purely about definitions.
Answering the question
Why do we feel that the teapot question has a definite answer? Well, we have a model of reality as something that actually objectively exists, and our investigation of the world backs it up: when confronted by a door, we feel that there is something behind the door, even if we choose not to open it. There are various counterfactuals in which we could have sent out a probe to any given area of the orbit, so we feel we could have resolved the question “is there a teapot at this location?” for any given location within a wide radius of the Earth.
Basically, the universe has features that cause us to believe that when we observe it (quantum effects aside), we are seeing a pre-existing reality rather than creating a new one (see the old debate between platonists and formalists in mathematics).
By contrast, even if we had a God or an AI, we wouldn’t expect it to have a definite answer to the question of who won that football match. There is no platonic underlying reality as to the winner of the game that we could just figure out if we had enough knowledge. We already know everything that’s relevant.
Procedure for learning/defining human values
Many attempts at learning human values are framed as “humans have an underlying true reward R, and here is procedure P for determining it”.
But in most cases, that formulation is incorrect, because what the paper is actually saying is “here is a procedure P for determining human values”. Actually claiming that humans have true rewards is a much more demanding step: you have to justify that the true R exists, like Russell’s teapot, rather than being a question of definition, like Argentina’s football victory.
That sounds like a meaningless distinction: what is, in practice, the difference between having a true reward R plus an imperfect estimator P, and having just P? It’s simpler conceptually if we talk about the true R, so why not do so?
It’s simpler, but much more misleading. If you work under the assumption that there is a true R, then you’re less likely to think that P might lead you astray. And if you see imperfections in P, your first instinct is something like “make P more rational/more likely to converge on R”, rather than asking the true question, which is “is P a good definition of human values?”
Even if the moral realists are right, and there is a true R, thinking about it is still misleading. Because there is, as yet, no satisfactory definition of this true R, and it’s very hard to make something converge better onto something you haven’t defined. Shifting the focus from the unknown (and maybe unknowable, or maybe even non-existent) R, to the actual P, is important.
It’s because I was focused so strongly on the procedure P, treating R as non-existent, that I was able to find some of the problems with value learning algorithms.
When you think that way, it becomes natural to ponder issues like “if we define victory through the official result, this leaves open the possibility for referee-concealed rule-breaking; is this acceptable for whatever we need that definition for?”
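As a toy illustration of that last point, here is a minimal Python sketch in which two candidate definitions of “victory” are applied to the same, fully known facts about the match. The names (Goal, GOALS, official_winner, rule_compliant_winner) are made up for this example, and the two definitions are only two of many one could write down; nothing here is meant as a value learning algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Goal:
    team: str
    legal: bool                # did the goal comply with the rules of play?
    allowed_by_referee: bool   # did the referee let it stand?


# The three goals of the 1986 England v Argentina quarter-final.
GOALS = [
    Goal("Argentina", legal=False, allowed_by_referee=True),  # "Hand of God"
    Goal("Argentina", legal=True, allowed_by_referee=True),
    Goal("England", legal=True, allowed_by_referee=True),
]


def winner(goals: list[Goal], counts: Callable[[Goal], bool]) -> Optional[str]:
    """Return the team with the most counted goals, or None if level."""
    tally: dict[str, int] = {}
    for goal in goals:
        tally.setdefault(goal.team, 0)
        if counts(goal):
            tally[goal.team] += 1
    best = max(tally.values())
    leaders = [team for team, n in tally.items() if n == best]
    return leaders[0] if len(leaders) == 1 else None


def official_winner(goals: list[Goal]) -> Optional[str]:
    # Definition 1: a goal counts if the referee allowed it.
    return winner(goals, lambda g: g.allowed_by_referee)


def rule_compliant_winner(goals: list[Goal]) -> Optional[str]:
    # Definition 2: a goal counts only if it was scored legally.
    return winner(goals, lambda g: g.legal)


print(official_winner(GOALS))        # Argentina
print(rule_compliant_winner(GOALS))  # None (level at one goal each)
```

Neither function is a noisy estimator of some hidden “true winner”. Both see every relevant fact and still disagree, so the only remaining question is which definition suits the purpose at hand.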