by Vadim Kosoy 203 days ago | Malo Bourgon likes this

Consider a panel with two buttons, A and B. One button sends you to Heaven and one to Hell, but you don’t know which is which and there is no way to check without pressing one. To make it more fun, you have to choose a button within one minute or you go to Hell automatically. So, there are two environments: in environment X, button A corresponds to Heaven and in environment Y, button B corresponds to Heaven. Obviously both cannot be in a learnable class simultaneously. So, at least one of them has to be ruled out (and if we also want to preserve symmetry then both). What sort of assumption do you think will rule them out?

I think that “scientific and policy knowledge that no one knows at birth and no one could discover by themself over a lifetime” is absolutely compatible with the hypothesis I outlined, even in its most naive form. If humanity’s progress is episodic RL where each human life is an episode, then of course each human uses the knowledge accumulated by previous humans. This is the whole idea of a learning algorithm in this setting.

Also, I think that success with CFC is not a lot of evidence against the hypothesis since, for one thing, CFC doesn’t allow a small group to easily destroy all of humanity, and for another thing, AFAIK action against CFC was only taken when some damage was already apparent. This is different from risks that have to be handled correctly on the first try. That said, “doesn’t reflect optimistically on our chances to survive AI risk” wasn’t intended as a strong claim but as something very speculative. Possibly I should have made it clearer.

More generally, the idea of restricting to environments s.t. some base policy doesn’t fall into traps on them is not very restrictive. Indeed, for any learnable class H you can just take the base policy to be the learning algorithm itself and tautologically get a class at least as big as H.
It becomes more interesting if we impose some constraints on the base policy, such as maybe restricting its computational complexity.

Intuitively, it seems alluring to say that our environment may contain X-risks, but they are s.t. by the time we face them we have enough knowledge to avoid them. However, this leads to assumptions that depend on the prior as a whole rather than on particular environments (basically, it’s not clear whether this is saying anything besides just assuming the prior is learnable). This complicates things, and in particular it becomes less clear what it means for such a prior to be “universal”.

Moreover, the notion of a “trap” is not even a function of the prior regarded as a single mixed environment, but a function of the particular partition of the prior into constituent hypotheses. In other words, it depends on which uncertainty is considered subjective (a property of the agent’s state of knowledge) and which uncertainty is considered objective (an inherent unpredictability of the world). For example, if we go back to the initial example but assume that there is a fair coin inside the environment that decides which button is Heaven, then instead of two environments we get one and tautologically there is no trap.

In short, I think there is a lot more thinking to do about this question.

 by Jessica Taylor 202 days ago

> So, there are two environments: in environment X, button A corresponds to Heaven and in environment Y, button B corresponds to Heaven. Obviously both cannot be in a learnable class simultaneously. So, at least one of them has to be ruled out (and if we also want to preserve symmetry then both). What sort of assumption do you think will rule them out?

I don’t think we should rule either of these out. The obvious answer is to give up on asymptotic optimality and do something more like utility function optimization instead. That would be moving out of the learning theory setting, which is a good thing. Asymptotic optimality can apply to bounded optimization problems and can’t apply to civilization-level steering problems.

 by Vadim Kosoy 202 days ago

Well, we could give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality. (This would not be moving out of the learning theory setting though? At least not the way I use this terminology.) Regret bounds would still be useful in the context of guaranteeing transfer of human knowledge and values to the AGI, but not in the context of defining intelligence.

However, my intuition is that it would be the wrong way to go. For one thing, it seems that it is computationally feasible (at least in some weak sense, i.e. for a small number of hypotheses s.t. the optimal policy for each is feasible) to get asymptotic Bayes-optimality for certain learnable classes (PSRL is a simple example) but not in general. I don’t have a proof (and I would be very interested to see either a proof or a refutation), but it seems to be the case AFAIK.

For another thing, consider questions such as why intelligent agents outcompete instinct-based agents, and why general intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is “universal” or “nearly universal” in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) than if intelligence has only purely Bayesian properties (i.e. it is only good on average w.r.t. some very broad ensemble of environments).

 by Jessica Taylor 201 days ago

> For another thing, consider questions such as why intelligent agents outcompete instinct-based agents, and why general intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is “universal” or “nearly universal” in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) than if intelligence has only purely Bayesian properties (i.e. it is only good on average w.r.t. some very broad ensemble of environments).

I don’t understand why you think this. Suppose there is some simple “naturalized AIXI”-ish thing that is parameterized on a prior, and there exists a simple prior for which an animal running this algorithm with this prior does pretty well in our world. Then evolution may produce an animal running something like naturalized AIXI with this prior. But naturalized AIXI is only good on average rather than guaranteeing effectiveness in almost all environments.

 by Vadim Kosoy 201 days ago

My intuition is that it must not be just a coincidence that the agent happens to work well in our world, otherwise your formalism doesn’t capture the concept of intelligence in full. For example, we are worried that a UFAI would be very likely to kill us in this particular universe, not just in some counterfactual universes.

Moreover, Bayesian agents with simple priors often do very poorly in particular worlds, because of what I call “Bayesian paranoia”. That is, if your agent thinks that lifting its left arm will plausibly send it to hell (a rather simple hypothesis), it will never lift its left arm and learn otherwise.

In fact, I suspect that a certain degree of “optimism” is inherent in our intuitive notion of rationality, and it also has a good track record. For example, when scientists did early experiments with electricity, or magnetism, or chemical reactions, their understanding of physics at the time was arguably insufficient to know that this would not destroy the world. However, there were few other ways to go forward. AFAIK the first time anyone seriously worried about a physics experiment was the RHIC (unless you also count the Manhattan Project, when Edward Teller suggested the atom bomb might create a self-sustaining nuclear fusion reaction that would envelop the entire atmosphere). These latter concerns were only raised because we already knew enough to point at specific dangers.

Of course this doesn’t mean we shouldn’t be worried about X-risks! But I think that some form of a priori optimism is plausibly correct, in some philosophical sense. (There was also some thinking in that direction by Sunehag and Hutter, although I’m not sold on the particular formalism they consider.)

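
The “Bayesian paranoia” argument can be illustrated with a minimal sketch. The model is an assumption made for illustration, not something taken from the thread: not lifting the arm pays `baseline` per step forever; lifting it pays `bonus` per step forever in the safe world and 0 forever in the hell world; the prior puts probability `p_hell` on the hell world. Values are normalized by the factor (1 - gamma), so the comparison comes out the same for every discount factor. Delaying the lift only mixes the two values below, so the Bayes-optimal agent either lifts at once or never lifts.

```python
def bayes_values(p_hell, baseline, bonus):
    """Normalized prior-expected values of 'never lift' and 'lift now'.

    Assumed toy model: hell pays 0 forever, the safe world pays `bonus`
    forever after lifting, and not lifting pays `baseline` forever.
    """
    never_lift = baseline
    lift_now = (1.0 - p_hell) * bonus
    return never_lift, lift_now


def bayes_optimal_lifts(p_hell, baseline, bonus):
    """True if the Bayes-optimal agent lifts its arm (and so learns the truth)."""
    never_lift, lift_now = bayes_values(p_hell, baseline, bonus)
    return lift_now > never_lift


def regret_in_safe_world(p_hell, baseline, bonus):
    """Normalized per-step regret when the world is in fact safe."""
    if bayes_optimal_lifts(p_hell, baseline, bonus):
        return 0.0
    return bonus - baseline


if __name__ == "__main__":
    for p_hell in (0.01, 0.05, 0.2, 0.5):
        print(p_hell,
              bayes_optimal_lifts(p_hell, baseline=0.9, bonus=1.0),
              regret_in_safe_world(p_hell, baseline=0.9, bonus=1.0))
```

In this toy model, once the prior weight on the hell hypothesis exceeds the relative gain from lifting, the agent never lifts its arm, never gathers the evidence that would refute the hypothesis, and its regret in the safe world does not shrink however patient it is.
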
 by Jessica Taylor 199 days ago

I think I understand your point better now. It isn’t a coincidence that an agent produced by evolution has a good prior for our world (because evolution tries many priors, and there are lots of simple priors to try). But the fact that there exists a simple prior that does well in our universe is a fact that needs an explanation. It can’t be proven from Bayesianism; the closest thing to a proof of this form is that computationally unbounded agents can just be born with knowledge of physics if physics is sufficiently simple, but there is no similar argument for computationally bounded agents.

 by Jessica Taylor 201 days ago

> Well, we could give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality.

I am not proposing this. I am proposing doing something more like AIXI, which has a fixed prior and does not obtain optimality properties on a broad class of environments. It seems like directly specifying the right prior is hard, and it’s plausible that learning theory research would help give intuitions/models about which prior to use or what non-Bayesian algorithm would get good performance in the world we actually live in, but I don’t expect learning theory to directly produce an algorithm we would be happy with running to make big decisions in our universe.

 by Vadim Kosoy 201 days ago

Yes, I think that we’re talking about the same thing. When I say “asymptotically approach Bayes-optimality” I mean the equation from Proposition A.0 here. I refer to this instead of just Bayes-optimality, because exact Bayes-optimality is computationally intractable even for a small number of hypotheses, each of which is a small MDP. However, even asymptotic Bayes-optimality is usually only tractable for some learnable classes, AFAIK: for example, if you have environments without traps then PSRL is asymptotically Bayes-optimal.

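
As a rough illustration of posterior sampling in a trap-free setting, here is a minimal sketch for a finite hypothesis class. It is not the construction from the linked post: it assumes the simplest case, where each hypothesis is a Bernoulli bandit (a one-state MDP in which no action has irreversible consequences), and the function name and the example hypotheses are hypothetical. In each episode the agent samples a hypothesis from its posterior, plays that hypothesis's optimal arm, and updates the posterior by Bayes' rule.

```python
import random


def psrl_finite_class(hypotheses, true_index, episodes, seed=0):
    """Posterior sampling over a finite class of Bernoulli bandits.

    hypotheses: list of tuples; hypotheses[h][a] is the success probability
                of arm a under hypothesis h (assumed strictly inside (0, 1)).
    true_index: index of the hypothesis that actually generates rewards.
    Returns the final posterior and the average reward per episode.
    """
    rng = random.Random(seed)
    n = len(hypotheses)
    posterior = [1.0 / n] * n  # uniform prior (an assumption of this sketch)
    total_reward = 0
    for _ in range(episodes):
        # Sample one hypothesis from the posterior and act optimally for it.
        sampled = rng.choices(range(n), weights=posterior)[0]
        arms = range(len(hypotheses[sampled]))
        arm = max(arms, key=lambda a: hypotheses[sampled][a])
        # Observe a Bernoulli reward from the true environment.
        reward = 1 if rng.random() < hypotheses[true_index][arm] else 0
        total_reward += reward
        # Bayes update with the likelihood of the observed reward.
        likelihoods = [h[arm] if reward else 1.0 - h[arm] for h in hypotheses]
        unnormalized = [w * l for w, l in zip(posterior, likelihoods)]
        z = sum(unnormalized)
        posterior = [w / z for w in unnormalized]
    return posterior, total_reward / episodes


if __name__ == "__main__":
    hypotheses = [(0.8, 0.2), (0.2, 0.8)]
    posterior, average_reward = psrl_finite_class(hypotheses, 0, episodes=500)
    print(posterior, average_reward)
```

Because every arm can be retried in every episode, a wrong sample costs only one episode of reward, and the posterior can still concentrate on the true hypothesis. With a trap (an action whose cost is never recovered) the same sampling step could be fatal, which is one way to see why the trap-free assumption matters here.
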
 by Jessica Taylor 202 days ago

> I think that “scientific and policy knowledge that no one knows at birth and no one could discover by themself over a lifetime” is absolutely compatible with the hypothesis I outlined, even in its most naive form. If humanity’s progress is episodic RL where each human life is an episode, then of course each human uses the knowledge accumulated by previous humans. This is the whole idea of a learning algorithm in this setting.

If RL is using human lives as episodes then humans should already be born with the relevant knowledge. There would be no need for history since all learning is encoded in the policy. History isn’t RL; it’s data summarization, model building, and intertemporal communication.

 by Vadim Kosoy 202 days ago

This seems to be interpreting the analogy too literally. Humans are not born with the knowledge, but they acquire the knowledge through some protocol that is designed to be much easier than rediscovering it. Moreover, by “reinforcement learning” I don’t mean the same type of algorithms used for RL today, I only mean that the performance guarantee this process satisfies is of a certain form.

 by Jessica Taylor 202 days ago

> More generally, the idea of restricting to environments s.t. some base policy doesn’t fall into traps on them is not very restrictive.

This rules out environments in which the second law of thermodynamics holds.

 by Vadim Kosoy 202 days ago

No, it doesn’t rule out any particular environment. A class that consists only of one environment is tautologically learnable, by the optimal policy for this environment. You might be thinking of learnability by anytime algorithms whereas I’m thinking of learnability by non-anytime algorithms (what I called “metapolicies”), the way I defined it here (see Definition 1).

 by Jessica Taylor 201 days ago

Ok, I am confused by what you mean by “trap”. I thought “trap” meant a set of states you can’t get out of. And if the second law of thermodynamics is true, you can’t get from a high-entropy state to a low-entropy state. What do you mean by “trap”?

 by Vadim Kosoy 201 days ago

To first approximation, a “trap” is an action s.t. taking it loses long-term value in expectation, i.e. an action which is outside the set $$\mathcal{A}_M^0$$ that I defined here (see the end of Definition 1). This set is always non-empty, since it at least has to contain the optimal action. However, this definition is not very useful when, for example, your environment contains a state that you cannot escape and you also cannot avoid (for example, the heat death of the universe might be such a state), since, in this case, nothing is a trap.

To be more precise, we need to go from an analysis which is asymptotic in the time discount parameter to an analysis with a fixed, finite time discount parameter (similarly to how, with time complexity, we usually start from analyzing the asymptotic complexity of an algorithm, but ultimately we are interested in particular inputs of finite size). For a fixed time discount parameter, the concept of a trap becomes “fuzzy”: a trap is an action which loses a substantial fraction of the value.
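
The “fuzzy” notion of a trap at a fixed discount parameter can be sketched on a toy MDP. Everything here is an assumption made for illustration (the three-state MDP, the deterministic transitions, the function names) and it is not Definition 1 from the linked post. The sketch runs value iteration and reports, for each action in a state, the fraction of the optimal value lost by taking that action once and acting optimally afterwards.

```python
# Hypothetical deterministic MDP: state -> action -> (next_state, reward).
MDP = {
    "start": {
        "work": ("good", 0.0),     # reach the rewarding state
        "wander": ("start", 0.0),  # waste one step, nothing irreversible
        "jump": ("hell", 0.0),     # irreversible loss of all future reward
    },
    "good": {"stay": ("good", 1.0)},
    "hell": {"stay": ("hell", 0.0)},
}


def q_values(mdp, gamma, sweeps=2000):
    """Action values from plain value iteration (deterministic transitions)."""
    v = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        v = {
            s: max(r + gamma * v[nxt] for nxt, r in actions.values())
            for s, actions in mdp.items()
        }
    return {
        s: {a: r + gamma * v[nxt] for a, (nxt, r) in actions.items()}
        for s, actions in mdp.items()
    }


def lost_fraction(mdp, gamma, state):
    """Fraction of the optimal value lost by each action (needs optimum > 0)."""
    q = q_values(mdp, gamma)[state]
    best = max(q.values())
    return {a: (best - qa) / best for a, qa in q.items()}


if __name__ == "__main__":
    for gamma in (0.5, 0.9, 0.99):
        print(gamma, lost_fraction(MDP, gamma, "start"))
```

In this toy model `jump` loses the whole value at every discount, so it is a trap in the asymptotic sense as well. `wander` loses a fraction of about 1 - gamma, which vanishes as gamma approaches 1 but is substantial at gamma = 0.5, so whether it counts as a trap depends on the fixed discount and on where one draws the line for “substantial”.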