A brief note on factoring out certain variables discussion post by Stuart Armstrong 1164 days ago | Jessica Taylor and Patrick LaVictoire like this | 5 comments

[Note: This comment is three years later than the post]

The “obvious idea” here unfortunately seems not to work, because it is vulnerable to so-called “infinite improbability drives”. Suppose $$B$$ is a shutdown button, and $$P(b|e)$$ gives some weight to both $$B=pressed$$ and $$B=unpressed$$. Then the AI will benefit from selecting a Q such that it always chooses an action $$a$$ in which it enters a lottery and, if it does not win, the button B is pushed. In this circumstance, $$P(b|e)$$ is unchanged, while both $$P(c|b=pressed,a,e)$$ and $$P(c|b=unpressed,a,e)$$ allocate almost all of the probability to great $$C$$ outcomes. So the approach will create an AI that wants to exploit its ability to determine $$B$$.

 by Jessica Taylor 337 days ago | Vanessa Kosoy likes this | link | parent | on: Meta: IAFF vs LessWrong Apparently “You must be approved by an admin to comment on Alignment Forum”, how do I do this? Also is this officially the successor to IAFF? If so it would be good to make that more clear on this website.
by Alex Mennen 337 days ago | link | on: Meta: IAFF vs LessWrong

There should be a chat icon on the bottom-right of the screen on Alignment Forum that you can use to talk to the admins (unless only people who have already been approved can see this?). You can also comment on LW (Alignment Forum posts are automatically crossposted to LW), and ask the admins to make it show up on Alignment Forum afterwards.

 by Alex Mennen 338 days ago | Vanessa Kosoy likes this | link | parent | on: Meta: IAFF vs LessWrong There is a replacement for IAFF now: https://www.alignmentforum.org/
by Jessica Taylor 337 days ago | Vanessa Kosoy likes this | link | on: Meta: IAFF vs LessWrong

Apparently “You must be approved by an admin to comment on Alignment Forum”, how do I do this?

Also is this officially the successor to IAFF? If so it would be good to make that more clear on this website.

 Meta: IAFF vs LessWrong discussion post by Vanessa Kosoy 350 days ago | Jessica Taylor likes this | 5 comments
by Alex Mennen 338 days ago | Vanessa Kosoy likes this | link | on: Meta: IAFF vs LessWrong

There is a replacement for IAFF now: https://www.alignmentforum.org/

 by Paul Christiano 347 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... I think the most plausible view is: what we call intelligence is a collection of a large number of algorithms and innovations each of which slightly increases effectiveness in a reasonably broad range of tasks. To see why both view A and B seem strange to me, consider the analog for physical tasks. You could say that there is a simple core to human physical manipulation which allows us to solve any problem in some very broad natural domain. Or you could think that we just have a ton of tricks for particular manipulation tasks. But neither of those seems right, there is no simple core to the human body plan but at the same time it contains many features which are helpful across a broad range of tasks.

Regarding the physical manipulation analogy: I think that there actually is a simple core to the human body plan. This core is, more or less: a spine, two arms with joints in the middle, two legs with joints in the middle, feet and arms with fingers. This is probably already enough to qualitatively solve more or less all physical manipulation problems humans can solve. All the nuances are needed to make it quantitatively more efficient and deal with the detailed properties of biological tissues, biological muscles et cetera (the latter might be considered analogous to the detailed properties of computational hardware and input/output channels for brains/AGIs).

 by Jessica Taylor 343 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... This seems like a hack. The equilibrium policy is going to assume that the environment is good to it in general in a magical fashion, rather than assuming the environment is good to it in the specific ways we should expect given our own knowledge of how the environment works. It’s kind of like assuming “things magically end up lower than you expected on priors” instead of having a theory of gravity. I think there is something like a theory of gravity here. The things I would note about our universe that make it possible to avoid a lot of traps include: Physical laws are symmetric across spacetime. Physical laws are spatially local. The predictable effects of a local action are typically local; most effects “dissipate” after a while (e.g. into heat). The butterfly effect is evidence for this rather than against this, since it means many effects are unpredictable and so can be modeled thermodynamically. When small changes have big and predictable effects (e.g. in a computer), there is often agentic optimization power towards the creation and maintenance of this system of effects, and in these cases it is possible for at least some agents to understand important things about how the system works. Some “partially-dissipated” effects are statistical in nature. For example, an earthquake hitting an area has many immediate effects, but over the long term the important effects are things like “this much local productive activity was disrupted”, “this much local human health was lost”, etc. You have the genes that you do because evolution, which is similar to a reinforcement learning algorithm, believed that these genes would cause you to survive and reproduce. If we construct AI systems, we will give them code (including a prior) that we expect to cause them to do something useful for us. In general, the agency of an agent’s creator should affect the agent’s beliefs.
If there are many copies of an agent, and successful agents are able to repurpose the resources of unsuccessful ones, then different copies can try different strategies; some will fail but the successful ones can then repurpose their resources. (Evolution can be seen as a special case of this) Some phenomena have a “fractal” nature, where a small thing behaves similarly to a big thing. For example, there are a lot of similarities between the dynamics of a nation and the dynamics of a city. Thus small things can be used as models of big things. If your interests are aligned with those of agents in your local vicinity, then they will mostly try to help you. (This applies to parents making their children’s environment safe) I don’t have an elegant theory yet but these observations seem like a reasonable starting point for forming one.

I think that we should expect evolution to give us a prior that is a good lossy compression of actual physics (where “actual physics” means those patterns the universe has that can be described within our computational complexity bounds). Meaning that, on the one hand, it should have low description complexity (otherwise it will be hard for evolution to find it), and on the other hand, it should assign high probability to the true environment (in other words, the KL divergence of the true environment from the prior should be small). And also it should be approximately learnable, otherwise it won’t go from assigning high probability to actually performing well.

The principles you outlined seem reasonable overall.

Note that the locality/dissipation/multiagent assumptions amount to a special case of “the environment is effectively reversible (from the perspective of the human species as a whole) as long as you don’t apply too much optimization power” (“optimization power” probably translates to divergence from some baseline policy plus maybe computational complexity considerations). Now, as you noted before, actual macroscopic physics is not reversible, but it might still be effectively reversible if you have a reliable long-term source of negentropy (like the sun). Maybe we can also slightly relax them by allowing irreversible changes as long as they are localized and the available space is sufficiently big.

“If we construct AI systems, we will give them code (including a prior) that we expect to cause them to do something useful for us. In general, the agency of an agent’s creator should affect the agent’s beliefs” is essentially what DRL does: allows transferring our knowledge to the AI without hard-coding it by hand.

“When small changes have big and predictable effects (e.g. in a computer), there is often agentic optimization power towards the creation and maintenance of this system of effects, and in these cases it is possible for at least some agents to understand important things about how the system works” seems like it would allow us to go beyond effective reversibility, but I’m not sure how to formalize it or whether it’s a justified assumption. One way towards formalizing it is, the prior is s.t. studying the initial state approximate communication class allows determining the entire environment, but this seems to point at a very broad class of approximately learnable priors w/o specifying a criterion how to choose among them.

Another principle that we can try to use is, the ubiquity of analytic functions. Analytic functions have the property that, knowing the function in a bounded domain allows extrapolating it everywhere. This is different from allowing arbitrary computable functions which may have “if” clauses, so that studying the function in a bounded domain is never enough to be sure about its behavior outside it. In particular, this line of inquiry seems relatively easy to formalize using continuous MDPs (although we run into the problem that finding the optimal policy is infeasible, in general). Also, it might have something to do with the effectiveness of neural networks (although the popular ReLU response function is not analytic).
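As a rough illustration of the contrast drawn above, here is a minimal Python sketch. It assumes a hypothetical pair of functions chosen only for this example: `exp`, which is analytic, and a function `branching` that agrees with `exp` on `x < 1` and then switches to zero. Knowing all derivatives of `exp` at 0 is enough to recover its value far away, while the same local data says nothing about the branch.

```python
import math

def taylor_exp(x, terms=60):
    """Extrapolate exp using only its derivatives at 0 (all equal to 1)."""
    return sum(x**k / math.factorial(k) for k in range(terms))

def branching(x):
    """Computable but not analytic: equals exp(x) for x < 1, then jumps to 0."""
    return math.exp(x) if x < 1.0 else 0.0

x_far = 3.0
# Local data at 0 pins down the analytic function everywhere...
print(taylor_exp(x_far), math.exp(x_far))
# ...but the same local data is silent about the "if" clause.
print(taylor_exp(x_far), branching(x_far))
```

The toy only shows the qualitative point; it is not a formalization of the continuous-MDP setting mentioned in the comment.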

 by Vanessa Kosoy 345 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... My intuition is that it must not be just a coincidence that the agent happens to work well in our world, otherwise your formalism doesn’t capture the concept of intelligence in full. For example, we are worried that a UFAI would be very likely to kill us in this particular universe, not just in some counterfactual universes. Moreover, Bayesian agents with simple priors often do very poorly in particular worlds, because of what I call “Bayesian paranoia”. That is, if your agent thinks that lifting its left arm will plausibly send it to hell (a rather simple hypothesis), it will never lift its left arm and learn otherwise. In fact, I suspect that a certain degree of “optimism” is inherent in our intuitive notion of rationality, and it also has a good track record. For example, when scientists did early experiments with electricity, or magnetism, or chemical reactions, their understanding of physics at the time was arguably insufficient to know this will not destroy the world. However, there were few other ways to go forward. AFAIK the first time anyone seriously worried about a physics experiment was the RHIC (unless you also count the Manhattan project, when Edward Teller suggested the atom bomb might create a self-sustaining nuclear fusion reaction that will envelop the entire atmosphere). These latter concerns were only raised because we already knew enough to point at specific dangers. Of course this doesn’t mean we shouldn’t be worried about X-risks! But I think that some form of a priori optimism is plausibly correct, in some philosophical sense. (There was also some thinking in that direction by Sunehag and Hutter although I’m not sold on the particular formalism they consider).

I think I understand your point better now. It isn’t a coincidence that an agent produced by evolution has a good prior for our world (because evolution tries many priors, and there are lots of simple priors to try). But the fact that there exists a simple prior that does well in our universe is a fact that needs an explanation. It can’t be proven from Bayesianism; the closest thing to a proof of this form is that computationally unbounded agents can just be born with knowledge of physics if physics is sufficiently simple, but there is no similar argument for computationally bounded agents.

 by Vanessa Kosoy 344 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... After thinking some more, maybe the following is a natural way towards formalizing the optimism condition. Let $$H$$ be the space of hypotheses and $$\xi_0 \in \Delta H$$ be the “unbiased” universal prior. Given any $$\zeta \in \Delta H$$, we denote $$\hat{\zeta} = E_{\mu \sim \zeta}[\mu]$$, i.e. the environment resulting from mixing the environments in the belief state $$\zeta$$. Given an environment $$\mu$$, let $$\pi^\mu$$ be the Bayes-optimal policy for $$\mu$$ and $$\pi^\mu_\theta$$ the perturbed Bayes-optimal policy for $$\mu$$, where $$\theta$$ is a perturbation parameter. Here, “perturbed” probably means something like softmax expected utility, but more thought is needed. Then, the “optimistic” prior $$\xi$$ is defined as a solution to the following fixed point equation: $\xi(\mu) = Z^{-1} \xi_0(\mu) \exp(\beta(E_{\mu\bowtie\pi^{\hat{\xi}}_\theta}[U]-E_{\mu\bowtie\pi^\mu}[U]))$ Here, $$Z$$ is a normalization constant and $$\beta$$ is an additional parameter. This equation defines something like a softmax Nash equilibrium in a cooperative game of two players where one player chooses $$\mu$$ (so that $$\xi$$ is eir mixed strategy), another player chooses $$\pi$$ and the utility is minus regret (alternatively, we might want to choose only Pareto efficient Nash equilibria). The parameter $$\beta$$ controls optimism regarding the ability to learn the environment, whereas the parameter $$\theta$$ represents optimism regarding the presence of slack: ability to learn despite making some errors or random exploration (how to choose these parameters is another question). Possibly, the idea of exploring the environment “layer by layer” can be recovered from combining this with hierarchy assumptions.

This seems like a hack. The equilibrium policy is going to assume that the environment is good to it in general in a magical fashion, rather than assuming the environment is good to it in the specific ways we should expect given our own knowledge of how the environment works. It’s kind of like assuming “things magically end up lower than you expected on priors” instead of having a theory of gravity.

I think there is something like a theory of gravity here. The things I would note about our universe that make it possible to avoid a lot of traps include:

• Physical laws are symmetric across spacetime.
• Physical laws are spatially local.
• The predictable effects of a local action are typically local; most effects “dissipate” after a while (e.g. into heat). The butterfly effect is evidence for this rather than against this, since it means many effects are unpredictable and so can be modeled thermodynamically.
• When small changes have big and predictable effects (e.g. in a computer), there is often agentic optimization power towards the creation and maintenance of this system of effects, and in these cases it is possible for at least some agents to understand important things about how the system works.
• Some “partially-dissipated” effects are statistical in nature. For example, an earthquake hitting an area has many immediate effects, but over the long term the important effects are things like “this much local productive activity was disrupted”, “this much local human health was lost”, etc.
• You have the genes that you do because evolution, which is similar to a reinforcement learning algorithm, believed that these genes would cause you to survive and reproduce. If we construct AI systems, we will give them code (including a prior) that we expect to cause them to do something useful for us. In general, the agency of an agent’s creator should affect the agent’s beliefs.
• If there are many copies of an agent, and successful agents are able to repurpose the resources of unsuccessful ones, then different copies can try different strategies; some will fail but the successful ones can then repurpose their resources. (Evolution can be seen as a special case of this)
• Some phenomena have a “fractal” nature, where a small thing behaves similarly to a big thing. For example, there are a lot of similarities between the dynamics of a nation and the dynamics of a city. Thus small things can be used as models of big things.
• If your interests are aligned with those of agents in your local vicinity, then they will mostly try to help you. (This applies to parents making their children’s environment safe)

I don’t have an elegant theory yet but these observations seem like a reasonable starting point for forming one.

 by Jessica Taylor 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... First, when I say “reinforcement learning” I don’t necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment. That is broad enough to include Bayesianism. I think you are imagining a narrower class of algorithms that can achieve some property like asymptotic optimality. Agree that this narrower class is much broader than current RL, though. Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state “knows” that something is a bad/good idea then the algorithm doesn’t need to actually try it. I agree that if it knows for sure that it isn’t in some environment then it doesn’t need to test anything to perform well in that environment. But what if there is a 5% chance that the environment is such that nuclear war is good (e.g. because it eliminates other forms of destructive technology for a long time)? Then this AI would start nuclear war with 5% probability per learning epoch. This is not pure trial-and-error but it is trial-and-error in an important relevant sense. What I’m trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap before you arrive at the trap. This seems like an interesting research approach and I don’t object to it. I would object to thinking that algorithms that only handle this class of environments are safe to run in our world (which I expect is not of this form). 
To be clear, while I expect that a Bayesian-ish agent has a good chance to avoid very bad outcomes using the knowledge it has, I don’t think anything that attains asymptotic optimality will be useful while avoiding very bad outcomes with decent probability.

After thinking some more, maybe the following is a natural way towards formalizing the optimism condition.

Let $$H$$ be the space of hypotheses and $$\xi_0 \in \Delta H$$ be the “unbiased” universal prior. Given any $$\zeta \in \Delta H$$, we denote $$\hat{\zeta} = E_{\mu \sim \zeta}[\mu]$$, i.e. the environment resulting from mixing the environments in the belief state $$\zeta$$. Given an environment $$\mu$$, let $$\pi^\mu$$ be the Bayes-optimal policy for $$\mu$$ and $$\pi^\mu_\theta$$ the perturbed Bayes-optimal policy for $$\mu$$, where $$\theta$$ is a perturbation parameter. Here, “perturbed” probably means something like softmax expected utility, but more thought is needed. Then, the “optimistic” prior $$\xi$$ is defined as a solution to the following fixed point equation:

$\xi(\mu) = Z^{-1} \xi_0(\mu) \exp(\beta(E_{\mu\bowtie\pi^{\hat{\xi}}_\theta}[U]-E_{\mu\bowtie\pi^\mu}[U]))$

Here, $$Z$$ is a normalization constant and $$\beta$$ is an additional parameter.

This equation defines something like a softmax Nash equilibrium in a cooperative game of two players where, one player chooses $$\mu$$ (so that $$\xi$$ is eir mixed strategy), another player chooses $$\pi$$ and the utility is minus regret (alternatively, we might want to choose only Pareto efficient Nash equilibria). The parameter $$\beta$$ controls optimism regarding the ability to learn the environment, whereas the parameter $$\theta$$ represents optimism regarding the presence of slack: ability to learn despite making some errors or random exploration (how to choose these parameters is another question).

Possibly, the idea of exploring the environment “layer by layer” can be recovered from combining this with hierarchy assumptions.
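To make the fixed point equation a little more concrete, here is a minimal Python sketch on a toy problem. Everything specific in it is an assumption made for illustration and is not taken from the comment: the hypotheses are one-step environments given as lists of per-action expected utilities, so the Bayes-optimal policy for $$\mu$$ is just the best action; “perturbed” is read as a softmax with temperature $$\theta$$; and the equation is solved by naive iteration, which happens to converge for the mild parameters used here but is not guaranteed to in general.

```python
import math

def softmax(values, temperature):
    """Softmax distribution over actions (numerically stabilized)."""
    top = max(values)
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def update(xi, xi0, envs, beta, theta):
    """One application of the right-hand side of the fixed point equation."""
    n_actions = len(envs[0])
    # Mixture environment xi-hat: expected utility of each action under belief xi.
    mixed = [sum(p * env[a] for p, env in zip(xi, envs)) for a in range(n_actions)]
    # Perturbed optimal policy for xi-hat (assumption: softmax with temperature theta).
    policy = softmax(mixed, theta)
    # Minus regret of that policy in each hypothesis mu.
    neg_regret = [sum(q * u for q, u in zip(policy, env)) - max(env) for env in envs]
    unnormalized = [p0 * math.exp(beta * g) for p0, g in zip(xi0, neg_regret)]
    z = sum(unnormalized)  # the normalization constant Z
    return [w / z for w in unnormalized]

def optimistic_prior(xi0, envs, beta, theta, steps=200):
    """Naive fixed point iteration starting from the unbiased prior xi0."""
    xi = list(xi0)
    for _ in range(steps):
        xi = update(xi, xi0, envs, beta, theta)
    return xi

# Two hypothetical hypotheses that disagree about which of two actions is good.
envs = [[1.0, 0.0], [0.0, 1.0]]
xi0 = [0.7, 0.3]
print(optimistic_prior(xi0, envs, beta=1.0, theta=1.0))
```

In this toy the resulting prior puts more weight than $$\xi_0$$ on the hypothesis that the perturbed policy already handles well, which is one way to read the “optimism” in the construction.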

 by Jessica Taylor 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... Well, we could give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality. I am not proposing this. I am proposing doing something more like AIXI, which has a fixed prior and does not obtain optimality properties on a broad class of environments. It seems like directly specifying the right prior is hard, and it’s plausible that learning theory research would help give intuitions/models about which prior to use or what non-Bayesian algorithm would get good performance in the world we actually live in, but I don’t expect learning theory to directly produce an algorithm we would be happy with running to make big decisions in our universe.

Yes, I think that we’re talking about the same thing. When I say “asymptotically approach Bayes-optimality” I mean the equation from Proposition A.0 here. I refer to this instead of just Bayes-optimality, because exact Bayes-optimality is computationally intractable even for a small number of hypotheses, each of which is a small MDP. However, even asymptotic Bayes-optimality is usually only tractable for some learnable classes, AFAIK: for example if you have environments without traps then PSRL is asymptotically Bayes-optimal.

 by Jessica Taylor 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... For another thing, consider questions such as, why intelligent agents outcompete instinct-based agents, and why general intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is “universal” or “nearly universal” in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) compared to, if intelligence has only purely Bayesian properties (i.e. it is only good on average w.r.t. some very broad ensemble of environments). I don’t understand why you think this. Suppose there is some simple “naturalized AIXI”-ish thing that is parameterized on a prior, and there exists a simple prior for which an animal running this algorithm with this prior does pretty well in our world. Then evolution may produce an animal running something like naturalized AIXI with this prior. But naturalized AIXI is only good on average rather than guaranteeing effectiveness in almost all environments.

My intuition is that it must not be just a coincidence that the agent happens to work well in our world, otherwise your formalism doesn’t capture the concept of intelligence in full. For example, we are worried that a UFAI would be very likely to kill us in this particular universe, not just in some counterfactual universes. Moreover, Bayesian agents with simple priors often do very poorly in particular worlds, because of what I call “Bayesian paranoia”. That is, if your agent thinks that lifting its left arm will plausibly send it to hell (a rather simple hypothesis), it will never lift its left arm and learn otherwise.

In fact, I suspect that a certain degree of “optimism” is inherent in our intuitive notion of rationality, and it also has a good track record. For example, when scientists did early experiments with electricity, or magnetism, or chemical reactions, their understanding of physics at the time was arguably insufficient to know this will not destroy the world. However, there were few other ways to go forward. AFAIK the first time anyone seriously worried about a physics experiment was the RHIC (unless you also count the Manhattan project, when Edward Teller suggested the atom bomb might create a self-sustaining nuclear fusion reaction that will envelop the entire atmosphere). These latter concerns were only raised because we already knew enough to point at specific dangers. Of course this doesn’t mean we shouldn’t be worried about X-risks! But I think that some form of a priori optimism is plausibly correct, in some philosophical sense. (There was also some thinking in that direction by Sunehag and Hutter although I’m not sold on the particular formalism they consider).

 by Jessica Taylor 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... Ok, I am confused by what you mean by “trap”. I thought “trap” meant a set of states you can’t get out of. And if the second law of thermodynamics is true, you can’t get from a high-entropy state to a low-entropy state. What do you mean by “trap”?

To first approximation, a “trap” is an action s.t. taking it loses long-term value in expectation, i.e. an action which is outside the set $$\mathcal{A}_M^0$$ that I defined here (see the end of Definition 1). This set is always non-empty, since it at least has to contain the optimal action. However, this definition is not very useful when, for example, your environment contains a state that you cannot escape and you also cannot avoid (for example, the heat death of the universe might be such a state), since, in this case, nothing is a trap. To be more precise we need to go from an analysis which is asymptotic in the time discount parameter to an analysis with a fixed, finite time discount parameter (similarly to how with time complexity, we usually start from analyzing the asymptotic complexity of an algorithm, but ultimately we are interested in particular inputs of finite size). For a fixed time discount parameter, the concept of a trap becomes “fuzzy”: a trap is an action which loses a substantial fraction of the value.
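A minimal Python sketch of the “fuzzy” notion for a fixed discount, using a hypothetical two-action toy that is not from the linked definition: in the start state, `safe` pays 1 and returns to the start state, while `trap` pays 2 once and then moves to an absorbing state that pays 0 forever. The fraction of value lost by taking `trap` depends on the discount factor `gamma`.

```python
def fraction_lost_by_trap(gamma, trap_reward=2.0, safe_reward=1.0):
    """Fraction of the optimal discounted value lost by taking `trap` in the start state."""
    # Value of choosing `safe` forever: geometric series of the per-step reward.
    q_safe_forever = safe_reward / (1.0 - gamma)
    # Value of `trap`: one-off reward, then an absorbing state with reward 0.
    q_trap = trap_reward
    # Optimal value of the start state in this toy is the better of the two plans.
    best = max(q_safe_forever, q_trap)
    return (best - q_trap) / best

for gamma in (0.4, 0.9, 0.99):
    print(gamma, fraction_lost_by_trap(gamma))
```

With a discount close to 1 the `trap` action loses almost all of the value, while for a sufficiently short horizon (here `gamma` below 0.5) the same action loses nothing and is in fact optimal, so whether it counts as a trap is a matter of degree.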

 by Jessica Taylor 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... First, when I say “reinforcement learning” I don’t necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment. That is broad enough to include Bayesianism. I think you are imagining a narrower class of algorithms that can achieve some property like asymptotic optimality. Agree that this narrower class is much broader than current RL, though. Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state “knows” that something is a bad/good idea then the algorithm doesn’t need to actually try it. I agree that if it knows for sure that it isn’t in some environment then it doesn’t need to test anything to perform well in that environment. But what if there is a 5% chance that the environment is such that nuclear war is good (e.g. because it eliminates other forms of destructive technology for a long time)? Then this AI would start nuclear war with 5% probability per learning epoch. This is not pure trial-and-error but it is trial-and-error in an important relevant sense. What I’m trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap before you arrive at the trap. This seems like an interesting research approach and I don’t object to it. I would object to thinking that algorithms that only handle this class of environments are safe to run in our world (which I expect is not of this form). 
To be clear, while I expect that a Bayesian-ish agent has a good chance to avoid very bad outcomes using the knowledge it has, I don’t think anything that attains asymptotic optimality will be useful while avoiding very bad outcomes with decent probability.
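The “5% per learning epoch” figure can be unpacked with a small Python sketch. It assumes, purely for illustration, a posterior-sampling agent with two hypotheses whose belief weights stay fixed between epochs (no updating, which is the pessimistic case), and which in each epoch samples one hypothesis and follows the optimal policy for it.

```python
def prob_trap_within(epochs, weight_on_risky_hypothesis=0.05):
    """Chance the risky policy is sampled at least once in `epochs` independent draws."""
    return 1.0 - (1.0 - weight_on_risky_hypothesis) ** epochs

for n in (1, 13, 14, 100):
    print(n, prob_trap_within(n))
```

Under these assumptions the chance of having taken the irreversible action passes one half after 14 epochs. If the belief state is updated between epochs, the per-epoch probability changes over time and this calculation no longer applies as is.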

Actually, I am including Bayesianism in “reinforcement learning” in the broad sense, although I am also advocating for some form of asymptotic optimality (importantly, it is not asymptotic in time like often done in the literature, but asymptotic in the time discount parameter; otherwise you give up on most of the utility, like you pointed out in an earlier discussion we had).

In the scenario you describe, the agent will presumably discard (or, strongly penalize the probability of) the pro-nuclear-war hypothesis first since the initial policy loses value much faster on this hypothesis compared to the anti-nuclear-war hypothesis (since the initial policy is biased towards the more likely anti-nuclear-war hypothesis). It will then remain with the anti-nuclear-war hypothesis and follow the corresponding policy (of not starting nuclear war). Perhaps this can be formalized as searching for a fixed point of some transformation.

 by Sam Eisenstat 1429 days ago | link | parent | on: Optimal and Causal Counterfactual Worlds Condition 4 in your theorem coincides with Lewis’ account of counterfactuals. Pearl cites Lewis, but he also criticizes him on the ground that the ordering on worlds is too arbitrary. In the language of this post, he is saying that condition 2 arises naturally from the structure of the problem and that condition 4 derives from the deeper structure corresponding to condition 2. I also noticed that the function $$f$$ and the partial order $$\succ$$ can be read as “time of first divergence from the real world” and “first diverges before”, respectively. This makes the theorem a lot more intuitive.

Yeah, when I went back and patched up the framework of this post to be less logical-omniscience-y, I was able to get $$2\to 3\to 4\to 1$$, but 2 is a bit too strong to be proved from 1, because my framing of 2 is just about probability disagreements in general, while 1 requires $$W$$ to assign probability 1 to $$\phi$$.

 by Vanessa Kosoy 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... Well, we could give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality. (This would not be moving out of the learning theory setting though? At least not the way I use this terminology.) Regret bounds would still be useful in the context of guaranteeing transfer of human knowledge and values to the AGI, but not in the context of defining intelligence. However, my intuition is that it would be the wrong way to go. For one thing, it seems that it is computationally feasible (at least in some weak sense, i.e. for a small number of hypotheses s.t. the optimal policy for each is feasible) to get asymptotic Bayes-optimality for certain learnable classes (PSRL is a simple example) but not in general. I don’t have a proof (and I would be very interested to see either a proof or a refutation), but it seems to be the case AFAIK. For another thing, consider questions such as, why intelligent agents outcompete instinct-based agents, and why general intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is “universal” or “nearly universal” in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) compared to, if intelligence has only purely Bayesian properties (i.e. it is only good on average w.r.t. some very broad ensemble of environments).

Well, we could give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality.

I am not proposing this. I am proposing doing something more like AIXI, which has a fixed prior and does not obtain optimality properties on a broad class of environments. It seems like directly specifying the right prior is hard, and it’s plausible that learning theory research would help give intuitions/models about which prior to use or what non-Bayesian algorithm would get good performance in the world we actually live in, but I don’t expect learning theory to directly produce an algorithm we would be happy with running to make big decisions in our universe.

 by Vanessa Kosoy 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... Well, we could give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality. (This would not be moving out of the learning theory setting though? At least not the way I use this terminology.) Regret bounds would still be useful in the context of guaranteeing transfer of human knowledge and values to the AGI, but not in the context of defining intelligence. However, my intuition is that it would be the wrong way to go. For one thing, it seems that it is computationally feasible (at least in some weak sense, i.e. for a small number of hypotheses s.t. the optimal policy for each is feasible) to get asymptotic Bayes-optimality for certain learnable classes (PSRL is a simple example) but not in general. I don’t have a proof (and I would be very interested to see either a proof or a refutation), but it seems to be the case AFAIK. For another thing, consider questions such as, why intelligent agents outcompete instinct-based agents, and why general intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is “universal” or “nearly universal” in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) compared to, if intelligence has only purely Bayesian properties (i.e. it is only good on average w.r.t. some very broad ensemble of environments).

For another thing, consider questions such as, why intelligent agents outcompete instinct-based agents, and why general intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is “universal” or “nearly universal” in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) compared to, if intelligence has only purely Bayesian properties (i.e. it is only good on average w.r.t. some very broad ensemble of environments).

I don’t understand why you think this. Suppose there is some simple “naturalized AIXI”-ish thing that is parameterized on a prior, and there exists a simple prior for which an animal running this algorithm with this prior does pretty well in our world. Then evolution may produce an animal running something like naturalized AIXI with this prior. But naturalized AIXI is only good on average rather than guaranteeing effectiveness in almost all environments.

 by Vanessa Kosoy 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... No, it doesn’t rule out any particular environment. A class that consists only of one environment is tautologically learnable, by the optimal policy for this environment. You might be thinking of learnability by anytime algorithms whereas I’m thinking of learnability by non-anytime algorithms (what I called “metapolicies”), the way I defined it here (see Definition 1).

Ok, I am confused by what you mean by “trap”. I thought “trap” meant a set of states you can’t get out of. And if the second law of thermodynamics is true, you can’t get from a high-entropy state to a low-entropy state. What do you mean by “trap”?

 by Vanessa Kosoy 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... I don’t think you’re interpreting what I’m saying correctly. First, when I say “reinforcement learning” I don’t necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment. Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state “knows” that something is a bad/good idea then the algorithm doesn’t need to actually try it. Third, “starting nuclear wars to test theories” is the opposite of what I’m trying to describe. What I’m saying is, we already have enough knowledge (acquired by exploring previous levels) to know that nuclear war is a bad idea, so exploring this level will not involve starting nuclear wars. What I’m trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap before you arrive at the trap.

First, when I say “reinforcement learning” I don’t necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment.

That is broad enough to include Bayesianism. I think you are imagining a narrower class of algorithms that can achieve some property like asymptotic optimality. Agree that this narrower class is much broader than current RL, though.

Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state “knows” that something is a bad/good idea then the algorithm doesn’t need to actually try it.

I agree that if it knows for sure that it isn’t in some environment then it doesn’t need to test anything to perform well in that environment. But what if there is a 5% chance that the environment is such that nuclear war is good (e.g. because it eliminates other forms of destructive technology for a long time)? Then this AI would start nuclear war with 5% probability per learning epoch. This is not pure trial-and-error but it is trial-and-error in an important relevant sense.
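The 5% figure can be illustrated with a minimal sketch of posterior sampling. Everything here is an assumption made for illustration: the two hypothesis names, the action names, the fixed 95%/5% belief state, and the simplification that the belief is not updated between epochs.

```python
import random

# Hypothetical belief state over two environments (illustrative numbers only).
BELIEF = {"war_is_bad": 0.95, "war_is_good": 0.05}

# Assumed optimal action under each hypothesis.
OPTIMAL_ACTION = {"war_is_bad": "keep_peace", "war_is_good": "start_war"}


def posterior_sample_action(belief, rng):
    """Sample one environment from the belief state and act optimally for it."""
    hypotheses = list(belief)
    weights = [belief[h] for h in hypotheses]
    sampled = rng.choices(hypotheses, weights=weights, k=1)[0]
    return OPTIMAL_ACTION[sampled]


def war_frequency(belief, epochs, seed=0):
    """Fraction of epochs in which the sampled policy starts a war.

    The belief is held fixed across epochs, which is a simplification:
    a real posterior sampler would update it on observed evidence.
    """
    rng = random.Random(seed)
    wars = sum(
        posterior_sample_action(belief, rng) == "start_war"
        for _ in range(epochs)
    )
    return wars / epochs
```

Under these assumptions, `war_frequency(BELIEF, 20000)` comes out close to 0.05, while a belief state that puts probability 0 on `war_is_good` never selects `start_war`. This matches both halves of the exchange: certainty removes the need to try the action, and residual uncertainty does not.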

What I’m trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap before you arrive at the trap.

This seems like an interesting research approach and I don’t object to it. I would object to thinking that algorithms that only handle this class of environments are safe to run in our world (which I expect is not of this form). To be clear, while I expect that a Bayesian-ish agent has a good chance to avoid very bad outcomes using the knowledge it has, I don’t think anything that attains asymptotic optimality will be useful while avoiding very bad outcomes with decent probability.

Logical Inductors Converge to Correlated Equilibria (Kinda)
post by Alex Appel 385 days ago | Sam Eisenstat and Jessica Taylor like this | 1 comment

Logical inductors of “similar strength”, playing against each other in a repeated game, will converge to correlated equilibria of the one-shot game, for the same reason that players that react to the past plays of their opponent converge to correlated equilibria. In fact, this proof is essentially just the proof from Calibrated Learning and Correlated Equilibrium by Foster and Vohra (1997), adapted to a logical inductor setting.
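For readers who want the target notion pinned down, here is a minimal sketch of a correlated-equilibrium check for a two-player one-shot game. It is not taken from the post: the function name, the dictionary layout, and the game of Chicken used as an example are all assumptions chosen for illustration.

```python
def is_correlated_equilibrium(dist, payoffs, tol=1e-9):
    """Check the correlated-equilibrium incentive constraints.

    dist[(a, b)]    : probability that the mediator recommends (a, b)
    payoffs[(a, b)] : (row payoff, column payoff) for joint action (a, b)
    """
    row_actions = sorted({a for a, _ in payoffs})
    col_actions = sorted({b for _, b in payoffs})

    # Row player: when told to play a, switching to a2 must not help.
    for a in row_actions:
        for a2 in row_actions:
            gain = sum(
                dist.get((a, b), 0.0) * (payoffs[(a2, b)][0] - payoffs[(a, b)][0])
                for b in col_actions
            )
            if gain > tol:
                return False

    # Column player: when told to play b, switching to b2 must not help.
    for b in col_actions:
        for b2 in col_actions:
            gain = sum(
                dist.get((a, b), 0.0) * (payoffs[(a, b2)][1] - payoffs[(a, b)][1])
                for a in row_actions
            )
            if gain > tol:
                return False
    return True


# Illustrative game of Chicken: D = dare, C = chicken out.
CHICKEN = {
    ("D", "D"): (0, 0), ("D", "C"): (7, 2),
    ("C", "D"): (2, 7), ("C", "C"): (6, 6),
}

# Equal weight on every outcome except mutual daring.
MIXED = {("C", "C"): 1 / 3, ("D", "C"): 1 / 3, ("C", "D"): 1 / 3}
```

With these illustrative payoffs, `MIXED` satisfies the constraints even though it is not a product of independent strategies, the pure Nash equilibrium `{("D", "C"): 1.0}` also satisfies them, and putting all mass on `("C", "C")` does not.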

Regarding the “no subjective dependencies” assumption: I might be missing something obvious, but do you have a proof that this assumption is true in some non-trivial cases? For example, when the players are perfectly symmetric?

Also, you say “the second complication is that all Nash equilibria are correlated equilibria, so maybe logical inductors converge to Nash equilibria in most cases.” In what sense is that a “complication”?

 Doubts about Updatelessness discussion post by Alex Appel 408 days ago | Abram Demski likes this | 3 comments

“if they condition on a sufficiently long initial sequence of digits, they’ll assign high probability to the true distant digit.”

I might be missing something, but this seems to be wrong? If the distant digit is sufficiently distant to be not efficiently computable, then obviously the logical inductor will not be able to predict it. The whole idea of logical counterfactual mugging is considering questions which the agent cannot solve in advance.

 Resource-Limited Reflective Oracles discussion post by Alex Appel 430 days ago | Sam Eisenstat, Abram Demski and Jessica Taylor like this | 1 comment

Personally I find this a little hard to follow. It would be nice if you had clear formal statements of the definitions and theorems.

 by Jessica Taylor 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... More generally, the idea of restricting to environments s.t. some base policy doesn’t fall into traps on them is not very restrictive. This rules out environments in which the second law of thermodynamics holds.

No, it doesn’t rule out any particular environment. A class that consists only of one environment is tautologically learnable, by the optimal policy for this environment. You might be thinking of learnability by anytime algorithms whereas I’m thinking of learnability by non-anytime algorithms (what I called “metapolicies”), the way I defined it here (see Definition 1).

 by Jessica Taylor 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... I think that “scientific and policy knowledge that no one knows at birth and no one could discover by themself over a lifetime” is absolutely compatible with the hypothesis I outlined, even in its most naive form. If humanity’s progress is episodic RL where each human life is an episode, then of course each human uses the knowledge accumulated by previous humans. This is the whole idea of a learning algorithm in this setting. If RL is using human lives as episodes then humans should already be born with the relevant knowledge. There would be no need for history since all learning is encoded in the policy. History isn’t RL; it’s data summarization, model building, and intertemporal communication.

This seems to be interpreting the analogy too literally. Humans are not born with the knowledge, but they acquire the knowledge through some protocol that is designed to be much easier than rediscovering it. Moreover, by “reinforcement learning” I don’t mean the same type of algorithms used for RL today, I only mean that the performance guarantee this process satisfies is of a certain form.

 by Jessica Taylor 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... So, there are two environments: in environment X, button A corresponds to Heaven and in environment Y, button B corresponds to Heaven. Obviously both cannot be in a learnable class simultaneously. So, at least one of them has to be ruled out (and if we also want to preserve symmetry then both). What sort of assumption do you think will rule them out? I don’t think we should rule either of these out. The obvious answer is to give up on asymptotic optimality and do something more like utility function optimization instead. That would be moving out of the learning theory setting, which is a good thing. Asymptotic optimality can apply to bounded optimization problems and can’t apply to civilization-level steering problems.
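The claim that X and Y cannot both sit in one learnable class can be checked with a toy calculation. The setup is an assumption made for illustration: a single irreversible button press with no prior observations, per-step reward 1 in Heaven and 0 in Hell, and the non-Heaven button leading to Hell.

```python
def worst_case_regret(p_press_a):
    """Per-step regret of pressing A with probability p_press_a, maximized over X and Y.

    In X, button A leads to Heaven, so regret is the chance of pressing B.
    In Y, button B leads to Heaven, so regret is the chance of pressing A.
    """
    regret_in_x = 1.0 - p_press_a
    regret_in_y = p_press_a
    return max(regret_in_x, regret_in_y)


# Search a grid of mixed strategies for the best achievable worst case.
best_worst_case = min(worst_case_regret(i / 100) for i in range(101))
```

Under these assumptions `best_worst_case` is 0.5, attained by a fair coin flip, so no policy drives regret to zero on both environments at once.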

Well, we could give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality. (This would not be moving out of the learning theory setting though? At least not the way I use this terminology.) Regret bounds would still be useful in the context of guaranteeing transfer of human knowledge and values to the AGI, but not in the context of defining intelligence.

However, my intuition is that it would be the wrong way to go.

For one thing, it seems that it is computationally feasible (at least in some weak sense, i.e. for a small number of hypotheses s.t. the optimal policy for each is feasible) to get asymptotic Bayes-optimality for certain learnable classes (PSRL is a simple example) but not in general. I don’t have a proof (and I would be very interested to see either a proof or a refutation), but it seems to be the case AFAIK.

For another thing, consider questions such as, why intelligent agents outcompete instinct-based agents, and why general intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is “universal” or “nearly universal” in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) compared to, if intelligence has only purely Bayesian properties (i.e. it is only good on average w.r.t. some very broad ensemble of environments).

 by Jessica Taylor 346 days ago | link | parent | on: The Learning-Theoretic AI Alignment Research Agend... Most species have gone extinct in the past. I would not be satisfied with an outcome where all humans die or 99% of humans die, even though technically humans might rebuild if there are any left and other intelligent life can evolve if humanity is extinct. These extinction levels can happen with foreseeable tech. Additionally, avoiding nuclear war requires continual cognitive effort to be put into the problem; it would be insufficient to use trial-and-error to avoid nuclear war. I don’t see why you would want a long sequence of reinforcement learning algorithms. At some point the algorithms produce things that can think, and then they should use their thinking to steer the future rather than trial-and-error alone. I don’t think RL algorithms would get the right answer on CFCs or nuclear war prevention. I am pretty sure that we can’t fully explore our current level, e.g. that would include starting nuclear wars to test theories about nuclear deterrence and nuclear winter. I really think that you are taking the RL analogy too far here; decision-making systems involving humans have some things in common with RL but RL theory only describes a fragment of the reasoning that these systems do.

I don’t think you’re interpreting what I’m saying correctly.

First, when I say “reinforcement learning” I don’t necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment.

Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state “knows” that something is a bad/good idea then the algorithm doesn’t need to actually try it.

Third, “starting nuclear wars to test theories” is the opposite of what I’m trying to describe. What I’m saying is, we already have enough knowledge (acquired by exploring previous levels) to know that nuclear war is a bad idea, so exploring this level will not involve starting nuclear wars. What I’m trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap before you arrive at the trap.
