Intelligent Agent Foundations Forum
by Jessica Taylor 3 days ago | Vadim Kosoy likes this | link | parent | on: Meta: IAFF vs LessWrong

Apparently “You must be approved by an admin to comment on Alignment Forum”, how do I do this?

Also is this officially the successor to IAFF? If so it would be good to make that more clear on this website.

reply

by Alex Mennen 3 days ago | link

There should be a chat icon on the bottom-right of the screen on Alignment Forum that you can use to talk to the admins (unless only people who have already been approved can see this?). You can also comment on LW (Alignment Forum posts are automatically crossposted to LW), and ask the admins to make it show up on Alignment Forum afterwards.

reply


Strongly agree that all AI alignment research should at least be linked from here.

reply


Delegative Reinforcement Learning aims to solve both problems by occasionally passing control to the human operator (“advisor”), and using it to learn which actions are safe.

Why would you assume the existence of an advisor who can avoid taking catastrophic actions and sometimes take an optimal action? This would require some process capable of good judgment to understand many aspects of the AI’s decision-making process, such as its world models (as these models are relevant to which actions are catastrophic/optimal). Are you proposing a high degree of transparency, a bootstrapping process as in ALBA, or something else?

reply

by Vadim Kosoy 15 days ago | link

I think that what you’re saying here can be reformulated as follows (please correct me if I end up not answering your question):

The action that a RL agent takes depends both on the new observation and its internal state. Often we ignore the latter and pretend the action depends only on the history of observations and actions, and this is okay because we can always produce the probability distribution over internal states conditional on the given history. However, this is only okay for information-theoretic analysis, since sampling this probability distribution given only the history as input is computationally intractable.

So, it might be a reasonable assumption that the advisor takes “sane” actions when left to its own devices, but it is not reasonable to assume the same when it works together with the AI. This is because, even if the AI behaved exactly as the advisor, it would hide the simulated advisor’s internal state, which would preclude the advisor from taking the wheel and proceeding with the same policy.

I think this is a real problem, but we can overcome it by letting the advisor write some kind of “diary” that documents eir reasoning process, as much as possible. The diary is also considered a part of the environment (although we might want to bake into the prior the rules of operating the diary and a “cheap talk” assumption which says the diary has no side effects on the world). This way, the internal state is externalized, and the AI will effectively become transparent by maintaining the diary too (essentially the AI in this setup is emulating a “best case” version of the advisor). It would be great if we could make this idea into a formal analysis.

reply

by Jessica Taylor 14 days ago | link

That captures part of it, but I also don’t think the advisor takes sane actions when the AI is doing things that change the environment. E.g. the AI is implementing some plan to create a nuclear reactor, and the advisor doesn’t understand how nuclear reactors work.

I guess you could have the AI first write the nuclear reactor plan in the diary, but this is essentially the same thing as transparency.

reply

by Vadim Kosoy 13 days ago | link

Well, you could say it is the same thing as transparency. What is interesting about it is that, in principle, you don’t have to put in transparency by hand using some completely different techniques. Instead, transparency arises naturally from the DRL paradigm and some relatively mild assumptions (that there is a “diary”). The idea is that the advisor would not build a nuclear reactor without seeing an explanation of nuclear reactors, so the AI won’t do it either.

reply


One hypothesis is, the main way humanity avoids traps is by happening to exist in a relatively favorable environment and knowing this fact, on some level. Specifically, it seems rather difficult for a single human or a small group to pursue a policy that will lead all of humanity into a trap (incidentally, this hypothesis doesn’t reflect optimistically on our chances to survive AI risk), and also rather rare for many humans to coordinate on simultaneously exploring an unusual policy. Therefore, human history may be very roughly likened to episodic RL where each human life is an episode.

It’s pretty clear that humans avoid traps using thinking, not just learning. See: CFCs, mutually assured destruction. Yes, principles of thinking can be learned, but then they generalize better than learning theory can prove.

See also: Not just learning

reply

by Vadim Kosoy 15 days ago | link

When I say “learning” I only mean that the true environment is initially unknown. I’m not assuming anything about the internals of the algorithm. So, the question is, what desiderata can we formulate that are possible to satisfy by any algorithm at all. The collection of all environments is not learnable (because of traps), so we cannot demand the algorithm to be asymptotically optimal on every environment. Therefore, it seems like we need to assume something about the environment, if we want a definition of intelligence that accounts for the effectiveness of intelligence. Formulating such an assumption, making it rigorous, and backing it by rigorous analysis is the subproblem I’m presenting here. The particular sort of assumption I’m pointing at here might be oversimplified, but the question remains.

reply

by Jessica Taylor 14 days ago | link

I agree that we’ll want some reasonable assumption on the environment (e.g. symmetry of physical laws throughout spacetime) that will enable thinking to generalize well. I don’t think that assumption looks like “it’s hard to cause a lot of destruction” or “the environment is favorable to you in general”. And I’m pretty sure that individual human lives are not the most important level of analysis for thinking about the learning required to avoid civilization-level traps (e.g. with CFCs, handling the situation required scientific and policy knowledge that no one knows at birth and no one could discover by themself over a lifetime)

reply

by Vadim Kosoy 13 days ago | link

Consider a panel with two buttons, A and B. One button sends you to Heaven and one to Hell, but you don’t know which is which and there is no way to check without pressing one. To make it more fun, you have to choose a button within one minute or you go to Hell automatically.

So, there are two environments: in environment X, button A corresponds to Heaven and in environment Y, button B corresponds to Heaven. Obviously both cannot be in a learnable class simultaneously. So, at least one of them has to be ruled out (and if we also want to preserve symmetry then both). What sort of assumption do you think will rule them out?
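(To spell out why both cannot be learnable simultaneously: suppose a policy \(\pi\) presses A with probability \(q\) within the minute, and let \(\Delta\) be the utility gap between Heaven and Hell. Then \(\pi\) loses at least \((1-q)\Delta\) relative to the optimal policy in X and at least \(q\Delta\) in Y, so

\[\mathrm{Reg}_X(\pi) + \mathrm{Reg}_Y(\pi) \geq \Delta,\]

and hence \(\max(\mathrm{Reg}_X(\pi), \mathrm{Reg}_Y(\pi)) \geq \Delta/2\) for every policy; no algorithm can have vanishing regret on both environments.)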

I think that “scientific and policy knowledge that no one knows at birth and no one could discover by themself over a lifetime” is absolutely compatible with the hypothesis I outlined, even in its most naive form. If humanity’s progress is episodic RL where each human life is an episode, then of course each human uses the knowledge accumulated by previous humans. This is the whole idea of a learning algorithm in this setting.

Also, I think that success with CFCs is not a lot of evidence against the hypothesis since, for one thing, CFCs don’t allow a small group to easily destroy all of humanity, and for another thing, AFAIK action against CFCs was only taken when some damage was already apparent. This is different from risks that have to be handled correctly on the first try.

That said, “doesn’t reflect optimistically on our chances to survive AI risk” wasn’t intended as a strong claim but as something very speculative. Possibly I should have made it clearer.

More generally, the idea of restricting to environments s.t. some base policy doesn’t fall into traps on them is not very restrictive. Indeed, for any learnable class H you can just take the base policy to be the learning algorithm itself and tautologically get a class at least as big as H. It becomes more interesting if we impose some constraints on the base policy, such as maybe restricting its computational complexity.

Intuitively, it seems alluring to say that our environment may contain X-risks, but they are s.t. by the time we face them we have enough knowledge to avoid them. However, this leads to assumptions that depend on the prior as a whole rather than on particular environments (basically, it’s not clear whether this is saying anything besides just assuming the prior is learnable). This complicates things, and in particular it becomes less clear what it means for such a prior to be “universal”. Moreover, the notion of a “trap” is not even a function of the prior regarded as a single mixed environment, but a function of the particular partition of the prior into constituent hypotheses. In other words, it depends on which uncertainty is considered subjective (a property of the agent’s state of knowledge) and which uncertainty is considered objective (an inherent unpredictability of the world). For example, if we go back to the initial example but assume that there is a fair coin inside the environment that decides which button is Heaven, then instead of two environments we get one and tautologically there is no trap.

In short, I think there is a lot more thinking to do about this question.

reply

by Jessica Taylor 12 days ago | link

So, there are two environments: in environment X, button A corresponds to Heaven and in environment Y, button B corresponds to Heaven. Obviously both cannot be in a learnable class simultaneously. So, at least one of them has to be ruled out (and if we also want to preserve symmetry then both). What sort of assumption do you think will rule them out?

I don’t think we should rule either of these out. The obvious answer is to give up on asymptotic optimality and do something more like utility function optimization instead. That would be moving out of the learning theory setting, which is a good thing.

Asymptotic optimality can apply to bounded optimization problems and can’t apply to civilization-level steering problems.

reply

by Vadim Kosoy 12 days ago | link

Well, we could give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality. (This would not be moving out of the learning theory setting though? At least not the way I use this terminology.) Regret bounds would still be useful in the context of guaranteeing transfer of human knowledge and values to the AGI, but not in the context of defining intelligence.

However, my intuition is that it would be the wrong way to go.

For one thing, it seems that it is computationally feasible (at least in some weak sense, i.e. for a small number of hypotheses s.t. the optimal policy for each is feasible) to get asymptotic Bayes-optimality for certain learnable classes (PSRL is a simple example) but not in general. I don’t have a proof (and I would be very interested to see either a proof or a refutation), but it seems to be the case AFAIK.

For another thing, consider questions such as, why intelligent agents outcompete instinct-based agents, and why general intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is “universal” or “nearly universal” in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) compared to, if intelligence has only purely Bayesian properties (i.e. it is only good on average w.r.t. some very broad ensemble of environments).

reply

by Jessica Taylor 11 days ago | link

For another thing, consider questions such as, why intelligent agents outcompete instinct-based agents, and why general intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is “universal” or “nearly universal” in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) compared to, if intelligence has only purely Bayesian properties (i.e. it is only good on average w.r.t. some very broad ensemble of environments).

I don’t understand why you think this. Suppose there is some simple “naturalized AIXI”-ish thing that is parameterized on a prior, and there exists a simple prior for which an animal running this algorithm with this prior does pretty well in our world. Then evolution may produce an animal running something like naturalized AIXI with this prior. But naturalized AIXI is only good on average rather than guaranteeing effectiveness in almost all environments.

reply

by Vadim Kosoy 11 days ago | link

My intuition is that it must not be just a coincidence that the agent happens to work well in our world, otherwise your formalism doesn’t capture the concept of intelligence in full. For example, we are worried that a UFAI would be very likely to kill us in this particular universe, not just in some counterfactual universes. Moreover, Bayesian agents with simple priors often do very poorly in particular worlds, because of what I call “Bayesian paranoia”. That is, if your agent thinks that lifting its left arm will plausibly send it to hell (a rather simple hypothesis), it will never lift its left arm and learn otherwise.

In fact, I suspect that a certain degree of “optimism” is inherent in our intuitive notion of rationality, and it also has a good track record. For example, when scientists did early experiments with electricity, or magnetism, or chemical reactions, their understanding of physics at the time was arguably insufficient to know this will not destroy the world. However, there were few other ways to go forward. AFAIK the first time anyone seriously worried about a physics experiment was the RHIC (unless you also count the Manhattan project, when Edward Teller suggested the atom bomb might create a self-sustaining nuclear fusion reaction that would envelop the entire atmosphere). These latter concerns were only raised because we already knew enough to point at specific dangers. Of course this doesn’t mean we shouldn’t be worried about X-risks! But I think that some form of a priori optimism is plausibly correct, in some philosophical sense. (There was also some thinking in that direction by Sunehag and Hutter although I’m not sold on the particular formalism they consider).

reply

by Jessica Taylor 9 days ago | link

I think I understand your point better now. It isn’t a coincidence that an agent produced by evolution has a good prior for our world (because evolution tries many priors, and there are lots of simple priors to try). But the fact that there exists a simple prior that does well in our universe is a fact that needs an explanation. It can’t be proven from Bayesianism; the closest thing to a proof of this form is that computationally unbounded agents can just be born with knowledge of physics if physics is sufficiently simple, but there is no similar argument for computationally bounded agents.

reply

by Jessica Taylor 11 days ago | link

Well, we could give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality.

I am not proposing this. I am proposing doing something more like AIXI, which has a fixed prior and does not obtain optimality properties on a broad class of environments. It seems like directly specifying the right prior is hard, and it’s plausible that learning theory research would help give intuitions/models about which prior to use or what non-Bayesian algorithm would get good performance in the world we actually live in, but I don’t expect learning theory to directly produce an algorithm we would be happy with running to make big decisions in our universe.

reply

by Vadim Kosoy 11 days ago | link

Yes, I think that we’re talking about the same thing. When I say “asymptotically approach Bayes-optimality” I mean the equation from Proposition A.0 here. I refer to this instead of just Bayes-optimality, because exact Bayes-optimality is computationally intractable even for a small number of hypotheses, each of which is a small MDP. However, even asymptotic Bayes-optimality is usually only tractable for some learnable classes, AFAIK: for example if you have environments without traps then PSRL is asymptotically Bayes-optimal.

reply

by Jessica Taylor 12 days ago | link

I think that “scientific and policy knowledge that no one knows at birth and no one could discover by themself over a lifetime” is absolutely compatible with the hypothesis I outlined, even in its most naive form. If humanity’s progress is episodic RL where each human life is an episode, then of course each human uses the knowledge accumulated by previous humans. This is the whole idea of a learning algorithm in this setting.

If RL is using human lives as episodes then humans should already be born with the relevant knowledge. There would be no need for history since all learning is encoded in the policy. History isn’t RL; it’s data summarization, model building, and intertemporal communication.

reply

by Vadim Kosoy 12 days ago | link

This seems to be interpreting the analogy too literally. Humans are not born with the knowledge, but they acquire the knowledge through some protocol that is designed to be much easier than rediscovering it. Moreover, by “reinforcement learning” I don’t mean the same type of algorithms used for RL today, I only mean that the performance guarantee this process satisfies is of a certain form.

reply

by Jessica Taylor 12 days ago | link

More generally, the idea of restricting to environments s.t. some base policy doesn’t fall into traps on them is not very restrictive.

This rules out environments in which the second law of thermodynamics holds.

reply

by Vadim Kosoy 12 days ago | link

No, it doesn’t rule out any particular environment. A class that consists only of one environment is tautologically learnable, by the optimal policy for this environment. You might be thinking of learnability by anytime algorithms whereas I’m thinking of learnability by non-anytime algorithms (what I called “metapolicies”), the way I defined it here (see Definition 1).

reply

by Jessica Taylor 11 days ago | link

Ok, I am confused by what you mean by “trap”. I thought “trap” meant a set of states you can’t get out of. And if the second law of thermodynamics is true, you can’t get from a high-entropy state to a low-entropy state. What do you mean by “trap”?

reply

by Vadim Kosoy 11 days ago | link

To first approximation, a “trap” is an action s.t. taking it loses long-term value in expectation, i.e. an action which is outside the set \(\mathcal{A}_M^0\) that I defined here (see the end of Definition 1). This set is always non-empty, since it at least has to contain the optimal action. However, this definition is not very useful when, for example, your environment contains a state that you cannot escape and you also cannot avoid (for example, the heat death of the universe might be such a state), since, in this case, nothing is a trap. To be more precise we need to go from an analysis which is asymptotic in the time discount parameter to an analysis with a fixed, finite time discount parameter (similarly to how with time complexity, we usually start from analyzing the asymptotic complexity of an algorithm, but ultimately we are interested in particular inputs of finite size). For a fixed time discount parameter, the concept of a trap becomes “fuzzy”: a trap is an action which loses a substantial fraction of the value.
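One possible way to write down the fuzzy version (just a sketch, not the exact definition from the post linked above): with rewards normalized to \([0,1]\) and a fixed time discount parameter \(\gamma\), call an action \(a\) an \(\epsilon\)-trap at history \(h\) if

\[Q^*_\gamma(h,a) < (1-\epsilon)\,V^*_\gamma(h),\]

where \(V^*_\gamma\) and \(Q^*_\gamma\) are the optimal value and action-value functions. The asymptotic notion is then roughly recovered in the limit \(\gamma \to 1\), \(\epsilon \to 0\).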

reply

by Vadim Kosoy 12 days ago | link

Consider also evolution. Evolution can also be regarded as a sort of reinforcement learning algorithm. So why, during billions of years of evolution, was no gene sequence created that somehow destroyed all life on Earth? It seems hard to come up with an answer other than “it’s hard to cause a lot of destruction”.

Some speculation:

I think that we have a sequence of reinforcement learning algorithms: evolution -> humanity -> individual human / small group (maybe followed by -> AGI) s.t. each step inherits the knowledge generated by the previous step and also applies more optimization pressure than the previous step. This suggests formulating a “favorability” assumption of the following form: there is a (possibly infinite) sequence of reinforcement learning algorithms A0, A1, A2… s.t. each algorithm is more powerful than the previous (e.g. has more computing power), and our environment has to be s.t.

  1. Running policy A0 has a small rate (at most \(\epsilon_0\)) of falling into traps.
  2. If we run A0 for some time \(T_0\) (s.t. \(\epsilon_0 T_0 \ll 1\)), and then run A1 after updating on the observations during \(T_0\), then A1 has a small rate (at most \(\epsilon_1\)) of falling into traps.
  3. Ditto when we add A2

…And so forth.

The sequence {Ai} may be thought of as a sequence of agents or as just steps in the exploration of the environment by a single agent. So, our condition is that, each new “layer of reality” may be explored safely given that the previous layers were already studied.
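Written out very roughly (a sketch; the right way to quantify the “rate of falling into traps” needs more thought): letting \(T_i\) be the time for which \(A_i\) is run and \(D_{<i}\) the data gathered during the previous stages, the assumption on the environment is that for every \(i\)

\[\Pr\big[A_i(D_{<i}) \text{ takes a trap action during stage } i\big] \;\leq\; \epsilon_i T_i, \qquad \text{with } \sum_i \epsilon_i T_i \ll 1.\]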

reply

by Jessica Taylor 12 days ago | link

Most species have gone extinct in the past. I would not be satisfied with an outcome where all humans die or 99% of humans die, even though technically humans might rebuild if there are any left and other intelligent life can evolve if humanity is extinct. These extinction levels can happen with foreseeable tech. Additionally, avoiding nuclear war requires continual cognitive effort to be put into the problem; it would be insufficient to use trial-and-error to avoid nuclear war.

I don’t see why you would want a long sequence of reinforcement learning algorithms. At some point the algorithms produce things that can think, and then they should use their thinking to steer the future rather than trial-and-error alone. I don’t think RL algorithms would get the right answer on CFCs or nuclear war prevention.

I am pretty sure that we can’t fully explore our current level, e.g. that would include starting nuclear wars to test theories about nuclear deterrence and nuclear winter.

I really think that you are taking the RL analogy too far here; decision-making systems involving humans have some things in common with RL but RL theory only describes a fragment of the reasoning that these systems do.

reply

by Vadim Kosoy 12 days ago | link

I don’t think you’re interpreting what I’m saying correctly.

First, when I say “reinforcement learning” I don’t necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment.

Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state “knows” that something is a bad/good idea then the algorithm doesn’t need to actually try it.
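For illustration, here is a minimal posterior sampling sketch for Bernoulli bandits (a toy special case, not the general PSRL algorithm for MDPs; all the numbers are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    true_means = np.array([0.3, 0.7])   # unknown to the agent
    alpha = np.ones(2)                  # Beta(1,1) belief state for each arm
    beta = np.ones(2)

    for t in range(1000):
        sampled = rng.beta(alpha, beta)            # sample one environment from the belief state
        arm = int(np.argmax(sampled))              # act optimally for the sampled environment
        reward = float(rng.random() < true_means[arm])
        alpha[arm] += reward                       # Bayesian update of the belief state
        beta[arm] += 1.0 - reward

If the belief state assigns, say, probability \(10^{-6}\) to “arm 0 is better”, the sampled environment is almost never one where pulling arm 0 is optimal, so the algorithm almost never tries it.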

Third, “starting nuclear wars to test theories” is the opposite of what I’m trying to describe. What I’m saying is, we already have enough knowledge (acquired by exploring previous levels) to know that nuclear war is a bad idea, so exploring this level will not involve starting nuclear wars. What I’m trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap before you arrive at the trap.

reply

by Jessica Taylor 11 days ago | link

First, when I say “reinforcement learning” I don’t necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment.

That is broad enough to include Bayesianism. I think you are imagining a narrower class of algorithms that can achieve some property like asymptotic optimality. Agree that this narrower class is much broader than current RL, though.

Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state “knows” that something is a bad/good idea then the algorithm doesn’t need to actually try it.

I agree that if it knows for sure that it isn’t in some environment then it doesn’t need to test anything to perform well in that environment. But what if there is a 5% chance that the environment is such that nuclear war is good (e.g. because it eliminates other forms of destructive technology for a long time)? Then this AI would start nuclear war with 5% probability per learning epoch. This is not pure trial-and-error but it is trial-and-error in an important relevant sense.

What I’m trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap before you arrive at the trap.

This seems like an interesting research approach and I don’t object to it. I would object to thinking that algorithms that only handle this class of environments are safe to run in our world (which I expect is not of this form). To be clear, while I expect that a Bayesian-ish agent has a good chance to avoid very bad outcomes using the knowledge it has, I don’t think anything that attains asymptotic optimality will be useful while avoiding very bad outcomes with decent probability.

reply

by Vadim Kosoy 11 days ago | link

Actually, I am including Bayesianism in “reinforcement learning” in the broad sense, although I am also advocating for some form of asymptotic optimality (importantly, it is not asymptotic in time like often done in the literature, but asymptotic in the time discount parameter; otherwise you give up on most of the utility, like you pointed out in an earlier discussion we had).

In the scenario you describe, the agent will presumably discard (or, strongly penalize the probability of) the pro-nuclear-war hypothesis first since the initial policy loses value much faster on this hypothesis compared to the anti-nuclear-war hypothesis (since the initial policy is biased towards the more likely anti-nuclear-war hypothesis). It will then remain with the anti-nuclear-war hypothesis and follow the corresponding policy (of not starting nuclear war). Perhaps this can be formalized as searching for a fixed point of some transformation.

reply

by Vadim Kosoy 10 days ago | link

After thinking some more, maybe the following is a natural way towards formalizing the optimism condition.

Let \(H\) be the space of hypotheses and \(\xi_0 \in \Delta H\) be the “unbiased” universal prior. Given any \(\zeta \in \Delta H\), we denote \(\hat{\zeta} = E_{\mu \sim \zeta}[\mu]\), i.e. the environment resulting from mixing the environments in the belief state \(\zeta\). Given an environment \(\mu\), let \(\pi^\mu\) be the Bayes-optimal policy for \(\mu\) and \(\pi^\mu_\theta\) the perturbed Bayes-optimal policy for \(\mu\), where \(\theta\) is a perturbation parameter. Here, “perturbed” probably means something like softmax expected utility, but more thought is needed. Then, the “optimistic” prior \(\xi\) is defined as a solution to the following fixed point equation:

\[\xi(\mu) = Z^{-1} \xi_0(\mu) \exp(\beta(E_{\mu\bowtie\pi^{\hat{\xi}}_\theta}[U]-E_{\mu\bowtie\pi^\mu}[U]))\]

Here, \(Z\) is a normalization constant and \(\beta\) is an additional parameter.

This equation defines something like a softmax Nash equilibrium in a cooperative game of two players where one player chooses \(\mu\) (so that \(\xi\) is eir mixed strategy), another player chooses \(\pi\), and the utility is minus regret (alternatively, we might want to choose only Pareto efficient Nash equilibria). The parameter \(\beta\) controls optimism regarding the ability to learn the environment, whereas the parameter \(\theta\) represents optimism regarding the presence of slack: the ability to learn despite making some errors or random exploration (how to choose these parameters is another question).
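As a sanity check, the fixed point can be approximated numerically in toy cases. Here is a minimal sketch (assuming one-step bandit hypotheses, a softmax stand-in for the perturbed Bayes-optimal policy, and naive fixed-point iteration; all the numbers and names are illustrative, not part of the proposal):

    import numpy as np

    # Each hypothesis is a vector of expected rewards for three arms.
    hypotheses = np.array([
        [1.0, 0.0, 0.2],
        [0.0, 1.0, 0.2],
        [0.4, 0.4, 0.9],
    ])
    xi0 = np.array([1/3, 1/3, 1/3])   # "unbiased" prior
    beta, theta = 5.0, 0.1            # optimism and perturbation parameters

    def perturbed_policy(expected_rewards, temperature):
        # Softmax stand-in for the perturbed Bayes-optimal policy.
        z = expected_rewards / max(temperature, 1e-9)
        z = z - z.max()
        p = np.exp(z)
        return p / p.sum()

    xi = xi0.copy()
    for _ in range(200):
        mixture = xi @ hypotheses                  # expected rewards under the mixed environment
        pi_mix = perturbed_policy(mixture, theta)  # perturbed policy for the mixture
        u_mix = hypotheses @ pi_mix                # its expected utility in each hypothesis
        u_opt = hypotheses.max(axis=1)             # Bayes-optimal utility in each hypothesis
        xi = xi0 * np.exp(beta * (u_mix - u_opt))  # reweight by exp(beta * minus regret)
        xi = xi / xi.sum()

    print("optimistic prior:", xi)

The reweighting pushes probability towards hypotheses on which the mixture policy already has low regret, which is the intended “optimism”.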

Possibly, the idea of exploring the environment “layer by layer” can be recovered from combining this with hierarchy assumptions.

reply

by Jessica Taylor 9 days ago | link

This seems like a hack. The equilibrium policy is going to assume that the environment is good to it in general in a magical fashion, rather than assuming the environment is good to it in the specific ways we should expect given our own knowledge of how the environment works. It’s kind of like assuming “things magically end up lower than you expected on priors” instead of having a theory of gravity.

I think there is something like a theory of gravity here. The things I would note about our universe that make it possible to avoid a lot of traps include:

  • Physical laws are symmetric across spacetime.
  • Physical laws are spacially local.
  • The predictable effects of a local action are typically local; most effects “dissipate” after a while (e.g. into heat). The butterfly effect is evidence for this rather than against this, since it means many effects are unpredictable and so can be modeled thermodynamically.
  • When small changes have big and predictable effects (e.g. in a computer), there is often agentic optimization power towards the creation and maintenance of this system of effects, and in these cases it is possible for at least some agents to understand important things about how the system works.
  • Some “partially-dissipated” effects are statistical in nature. For example, an earthquake hitting an area has many immediate effects, but over the long term the important effects are things like “this much local productive activity was disrupted”, “this much local human health was lost”, etc.
  • You have the genes that you do because evolution, which is similar to a reinforcement learning algorithm, believed that these genes would cause you to survive and reproduce. If we construct AI systems, we will give them code (including a prior) that we expect to cause them to do something useful for us. In general, the agency of an agent’s creator should affect the agent’s beliefs.
  • If there are many copies of an agent, and successful agents are able to repurpose the resources of unsuccessful ones, then different copies can try different strategies; some will fail but the successful ones can then repurpose their resources. (Evolution can be seen as a special case of this)
  • Some phenomena have a “fractal” nature, where a small thing behaves similarly to a big thing. For example, there are a lot of similarities between the dynamics of a nation and the dynamics of a city. Thus small things can be used as models of big things.
  • If your interests are aligned with those of agents in your local vicinity, then they will mostly try to help you. (This applies to parents making their children’s environment safe)

I don’t have an elegant theory yet but these observations seem like a reasonable starting point for forming one.

reply

by Vadim Kosoy 8 days ago | link

I think that we should expect evolution to give us a prior that is a good lossy compression of actual physics (where “actual physics” means those patterns the universe has that can be described within our computational complexity bounds). Meaning that, on the one hand it should have low description complexity (otherwise it will be hard for evolution to find it), and on the other hand it should assign high probability to the true environment (in other words, the KL divergence of the true environment from the prior should be small). And also it should be approximately learnable, otherwise it won’t go from assigning high probability to actually performing well.
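Schematically, one could restate this as evolution approximately solving something like

\[\min_\xi \; \lambda\, K(\xi) + D_{\mathrm{KL}}(\mu^* \,\|\, \hat{\xi}) \quad \text{s.t. } \xi \text{ is approximately learnable,}\]

where \(K(\xi)\) is the description complexity of the prior, \(\mu^*\) is the true environment (or rather the pattern of it within our complexity bounds), \(\hat{\xi}\) is the mixture of \(\xi\), and \(\lambda\) trades off the two terms. (This is only a schematic restatement of the previous paragraph, not a precise claim.)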

The principles you outlined seem reasonable overall.

Note that the locality/dissipation/multiagent assumptions amount to a special case of “the environment is effectively reversible (from the perspective of the human species as a whole) as long as you don’t apply too much optimization power” (“optimization power” probably translates to divergence from some baseline policy plus maybe computational complexity considerations). Now, as you noted before, actual macroscopic physics is not reversible, but it might still be effectively reversible if you have a reliable long-term source of negentropy (like the sun). Maybe we can also slightly relax them by allowing irreversible changes as long as they are localized and the available space is sufficiently big.

“If we construct AI systems, we will give them code (including a prior) that we expect to cause them to do something useful for us. In general, the agency of an agent’s creator should affect the agent’s beliefs” is essentially what DRL does: allows transferring our knowledge to the AI without hard-coding it by hand.

“When small changes have big and predictable effects (e.g. in a computer), there is often agentic optimization power towards the creation and maintenance of this system of effects, and in these cases it is possible for at least some agents to understand important things about how the system works” seems like it would allow us to go beyond effective reversibility, but I’m not sure how to formalize it or whether it’s a justified assumption. One way towards formalizing it is, the prior is s.t. studying the approximate communication class of the initial state allows determining the entire environment, but this seems to point at a very broad class of approximately learnable priors w/o specifying a criterion for how to choose among them.

Another principle that we can try to use is, the ubiquity of analytic functions. Analytic functions have the property that, knowing the function in a bounded domain allows extrapolating it everywhere. This is different from allowing arbitrary computable functions which may have “if” clauses, so that studying the function in a bounded domain is never enough to be sure about its behavior outside it. In particular, this line of inquiry seems relatively easy to formalize using continuous MDPs (although we run into the problem that finding the optimal policy is infeasible, in general). Also, it might have something to do with the effectiveness of neural networks (although the popular ReLU response function is not analytic).
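(Concretely, this is the usual rigidity property: a real-analytic function on a connected domain is determined by its derivatives at a single point, since locally

\[f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}\,(x-a)^n,\]

whereas an arbitrary computable function can agree with \(f\) on a bounded region and still do anything at all outside it.)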

reply


Now, if the AI is implementing DRL, the uncertainty between Earth and Mu leads it to delegate to the advisor precisely at the moment this difference is important.

It seems like this is giving up on allowing the AI to make long-term predictions. It can make short-term, testable predictions (since if different advisors disagree, it is possible to see who is right). But long-term predictions can’t be cheaply tested.

In the absence of long-term predictions, it still might be possible to do something along the lines of what Paul is thinking of (i.e. predicting human judgments of longer-term things), but I don’t see what else you could do. Does this match your model?

reply

by Vadim Kosoy 15 days ago | link

I’m not giving up on long-term predictions in general. It’s just that, because of traps, some uncertainties cannot be resolved by testing, as you say. In those cases the AI has to rely on what it learned from the advisor, which indeed amounts to human judgment.

reply


Note: I currently think that the basic picture of getting within \(\epsilon\) of a good prediction is actually pretty sketchy. I wrote about the sample complexity here. In addition to the sample complexity issues, the requirement is for predictors to be Bayes-optimal, but Bayes-optimality is not possible for bounded reasoners. This is important because e.g. some adversarial predictor might make very good predictions on some subset of questions (because it’s spending its compute on those specifically), causing other predictors to be filtered out (if those questions are used to determine who the best predictor is). I don’t know what kind of analysis could get the \(\epsilon\)-accuracy result at this point.

reply


Counterfactual mugging doesn’t require spoofing. Consider the following problem:

Suppose no one, given \(10^{5}\) steps of computation, is able to compute any information about the parity of the \(10^{10}\)th digit of \(\pi\), and everyone, given \(10^{100}\) steps of computation, is able to compute the \(10^{10}\)th digit of \(\pi\). Suppose that at time \(t\), everyone has \(10^5\) steps of computation, and at a later time \(t'\), everyone has \(10^{100}\) steps of computation. At the initial time \(t\), Omega selects a probability \(p\) equal to the conditional probability Omega assigns to the agent paying $1 at time \(t'\) conditional on the digit being odd. (This could be because Omega is a logical inductor, or because Omega is a CDT agent whose utility function is such that selecting this value of \(p\) is optimal). At time \(t'\), if the digit is even, a fair coin with probability \(p\) of coming up heads is flipped, and if it comes up heads, Omega pays the agent $10. If instead the digit is odd, then the agent has the option of paying Omega $1.

This contains no spoofing, and the optimal policy for the agent is to pay up if asked.
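(To spell out the ex ante calculation, assuming the agent assigns probability \(1/2\) to the digit being odd at time \(t\) and Omega’s conditional probability \(p\) tracks the agent’s actual policy: committing to pay gives \(p = 1\) and expected utility \(\tfrac{1}{2}\cdot 10 + \tfrac{1}{2}\cdot(-1) = 4.5\), while refusing gives \(p = 0\) and expected utility \(0\). So the paying policy does better ex ante, even though by the time the agent is asked to pay it can already compute that the digit is odd.)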

reply


The true reason to do exploration seems to be because the agent believes the action it is taking will not lead to an irreversible trap, and because it believes that the action will reveal information about the true environment that enables a better policy later on, which in expectation up to the time horizon, outweighs the temporary loss incurred due to exploring.

My understanding of logical inductor exploration (e.g. in asymptotic decision theory) is that the exploration steps the agent learns from mostly don’t happen in its own lifetime, rather they happen in the lifetimes of similar but simpler agents. This allows exploration to work for single-shot problems such as 5 and 10. Intuitively, if you are in a 5 and 10 problem and your brain has size 10^1000, then you can simulate someone whose brain has size 10^999 doing a 5 and 10 problem, and thereby learn the relation between the agent’s action and how much utility they get. So each particular agent has some chance of exploring irrecoverably, but in aggregate not many of them will (and it’s hard to predict which will and which won’t).

As far as I can tell, the only strategy that doesn’t have some sort of targetable exploration behavior is Thompson sampling.

Thompson sampling still randomizes (it randomizes its belief about the world it’s in) and is therefore vulnerable to troll bridge.

reply

by Alex Appel 100 days ago | link

A: While that is a really interesting note that I hadn’t spotted before, the standard formulation of exploration steps in logical inductor decision theory involves infinite exploration steps over all time, so even though an agent of this type would be able to inductively learn from what other agents do in different decision problems in less time than it naively appears, that wouldn’t make it explore less.

B: What I intended with the remark about Thompson sampling was that troll bridge functions on there being two distinct causes of “attempting to cross the bridge”. One is crossing because you believe it to be the best action, and the other is crossing because an exploration step occurred, and Thompson sampling doesn’t have a split decision criterion like this. Although now that you point it out, it is possible to make a Thompson sampling variant where the troll blows up the bridge when “crossing the bridge” is not the highest-ranked action.

reply


  1. That makes sense.

  2. OK, it seems like I misinterpreted your comment on philosophy. But in this post you seem to be saying that we might not need to solve philosophical problems related to epistemology and agency?

  3. That concept also seems useful and different from autopoiesis as I understand it (since it requires continual human cognitive work to run, though not very much).

reply

by Paul Christiano 331 days ago | link

  1. I think that we can avoid coming up with a good decision theory or priors or so on—there are particular reasons that we might have had to solve philosophical problems, which I think we can dodge. But I agree that we need or want to solve some philosophical problems to align AGI (e.g. defining corrigibility precisely is a philosophical problem).

reply


I’m curious what initially triggered this.

I tried to solve the problem and found that I thought it was very hard to make the sort of substantial progress that would meaningfully bridge the gap from our current epistemic/philosophical state to the state where the problem is largely solved. I did make incremental progress, but not the sort of incremental progress I saw as attacking the really hard problems. Towards the later parts of my work at MIRI, I was doing research that seemed to be largely overlapping with complex systems theory (in order to reason about how to align autopoietic systems similar to evolution) in a way that made it hard to imagine that I’d come up with useful crisp formal definitions/proofs/etc.

This seems a bit low, given that there’s a number of disjunctive ways that it could happen.

I feel like saying 2% now. Not sure what caused the update.

I’m pretty worried that such technology will accelerate value drift within the current autopoietic system.

I’m also worried about something like this, though I would state the risk as “mass insanity” rather than “value drift”. (“Value drift” brings to mind an individual or group trying to preserve their current object-level values, rather than trying to preserve somewhat-universal human values and sane reflection processes)

reply

by Wei Dai 331 days ago | link

I hope you stay engaged with the AI risk discussions and maintain your credibility. I’m really worried about the self-selection effect where people who think AI alignment is really hard end up quitting or not working in the field in the first place, and then it appears to outsiders that all of the AI safety experts don’t think the problem is that hard.

I’m also worried about something like this, though I would state the risk as “mass insanity” rather than “value drift”. (“Value drift” brings to mind an individual or group trying to preserve their current object-level values, rather than trying to preserve somewhat-universal human values and sane reflection processes)

I’m envisioning that in the future there will also be systems where you can input any conclusion that you want to argue (including moral conclusions) and the target audience, and the system will give you the most convincing arguments for it. At that point people won’t be able to participate in any online (or offline for that matter) discussions without risking their object-level values being hijacked.

You didn’t respond to my point that defending against this type of technology does seem to require solving hard philosophical problems. What are your thoughts on this?

reply

by Paul Christiano 331 days ago | link

defending against this type of technology does seem to require solving hard philosophical problems

Why is this?

The case you describe seems clearly contrary to my preferences about how I should reflect. So a system which helped me implement my preferences would help me avoid this situation (in the same way that it would help me avoid being shot, or giving malware access to valuable computing resources).

It seems quite plausible that we’ll live to see a world where it’s considered dicey for your browser to uncritically display sentences written by an untrusted party.

reply

by Wei Dai 331 days ago | link

It seems quite plausible that we’ll live to see a world where it’s considered dicey for your browser to uncritically display sentences written by an untrusted party.

How would your browser know who can be trusted, if any of your friends and advisers could be corrupted at any given moment (or just their accounts taken over by malware and used to spread optimized disinformation)?

The case you describe seems clearly contrary to my preferences about how I should reflect.

How would an automated system help you avoid it, aside from blocking off all outside contact? (I doubt I’d be able to ever figure out what my values actually are / should be, if I had to do it without talking to other humans.) If you’re thinking of some sort of meta-execution-style system to help you analyze arguments and distinguish between correct arguments and merely convincing ones, I think that involves solving hard philosophical problems. My understanding is that Jessica agrees with me on that, so I was asking why she doesn’t think the same problem applies in the non-autopoietic automation scenario.

reply

by Vladimir Slepnev 330 days ago | link

figure out what my values actually are / should be

I think many human ideas are like low resolution pictures. Sometimes they show simple things, like a circle, so we can make a higher resolution picture of the same circle. That’s known as formalizing an idea. But if the thing in the picture looks complicated, figuring out a higher resolution picture of it is an underspecified problem. I fear that figuring out my values over all possible futures might be that kind of problem.

So apart from hoping to define a “full resolution picture” of human values, either by ourselves or with the help of some AI or AI-human hybrid, it might be useful to come up with approaches that avoid defining it. That was my motivation for this post, which directly uses our “low resolution” ideas to describe some particular nice future without considering all possible ones. It’s certainly flawed, but there might be other similar ideas.

Does that make sense?

reply

by Wei Dai 328 days ago | link

I think I understand what you’re saying, but my state of uncertainty is such that I put a lot of probability mass on possibilities that wouldn’t be well served by what you’re suggesting. For example, the possibility that we can achieve most value not through the consequences of our actions in this universe, but through their consequences in much larger (computationally richer) universes simulating this one. Or that spreading hedonium is actually the right thing to do and produces orders of magnitude more value than spreading anything that resembles human civilization. Or that value scales non-linearly with brain size so we should go for either very large or very small brains.

While discussing the VR utopia post, you wrote “I know you want to use philosophy to extend the domain, but I don’t trust our philosophical abilities to do that, because whatever mechanism created them could only test them on normal situations.” I have some hope that there is a minimal set of philosophical abilities that would allow us to eventually solve arbitrary philosophical problems, and we already have this. Otherwise it seems hard to explain the kinds of philosophical progress we’ve made, like realizing that other universes probably exist, and figuring out some ideas about how to make decisions when there are multiple copies of us in this universe and others.

Of course it’s also possible that’s not the case, and we can’t do better than to optimize the future using our current “low resolution” values, but until we’re a lot more certain of this, any attempt to do this seems to constitute a strong existential risk.

reply

by Jessica Taylor 331 days ago | link

I agree that selection bias is a problem. I plan on discussing and writing about AI alignment somewhat in the future. Also note that Eliezer and Nate think the problem is pretty hard and unlikely to be solved.

You didn’t respond to my point that defending against this type of technology does seem to require solving hard philosophical problems. What are your thoughts on this?

Automation technology (in an adversarial context) is kind of like a very big gun. It projects a lot of force. It can destroy lots of things if you point it wrong. It might be hard to point at the right target. And you might kill or incapacitate yourself if you do something wrong. But it’s inherently stupid, and has no agency by itself. You don’t have to solve philosophy to deal with large guns, you just have to do some combination of (a) figure out how to wield them to do good with them, (b) get people to stop using them, (c) find strategies for fighting against them, or (d) defend against them. (Certainly, some of these things involve philosophy, but they don’t necessarily require fully formalizing anything). The threat is different in kind from that of a fully-automated autopoietic cognitive system, which is more like a big gun possessed by an alien soul.

reply

by Wei Dai 329 days ago | link

You don’t have to solve philosophy to deal with large guns, you just have to do some combination of (a) figure out how to wield them to do good with them, (b) get people to stop using them, (c) find strategies for fighting against them, or (d) defend against them.

Do you have ideas for how to do these things, for the specific “big gun” that I described earlier?

The threat is different in kind from that of a fully-automated autopoietic cognitive system, which is more like a big gun possessed by an alien soul.

If the big gun is being wielded by humans whose values and thought processes have been corrupted (by others using that big gun, or through some other way like being indoctrinated in bad ideas from birth), that doesn’t seem very different from a big gun possessed by an alien soul.

reply

by Jessica Taylor 327 days ago | link

Do you have ideas for how to do these things, for the specific “big gun” that I described earlier?

Roughly, minimize direct contact with things that cause insanity, be the sanest people around, and as a result be generally more competent than the rest of the world at doing real things. At some point use this capacity to oppose things that cause insanity. I haven’t totally worked this out.

If the big gun is being wielded by humans whose values and thought processes have been corrupted (by others using that big gun, or through some other way like being indoctrinated in bad ideas from birth), that doesn’t seem very different from a big gun possessed by an alien soul.

It’s hard to corrupt human values without corrupting other forms of human sanity, such as epistemics and general ability to do things.

reply
