My current take on the Paul-MIRI disagreement on alignability of messy AI
post by Jessica Taylor 58 days ago | Ryan Carey, Vadim Kosoy, Daniel Dewey, Patrick LaVictoire, Scott Garrabrant and Stuart Armstrong like this | 40 comments

Paul Christiano and “MIRI” have disagreed on an important research question for a long time: should we focus research on aligning “messy” AGI (e.g. one found through gradient descent or brute force search) with human values, or on developing “principled” AGI (based on theories similar to Bayesian probability theory)? I’m going to present my current model of this disagreement and additional thoughts about it.

I put “MIRI” in quotes because MIRI is an organization composed of people who have differing views. I’m going to use the term “MIRI view” to refer to some combination of the views of Eliezer, Benya, and Nate. I think these three researchers have quite similar views, such that it is appropriate in some contexts to attribute a view to all of them collectively; and that these researchers’ views constitute what most people think of as the “MIRI view”.

(KANSI AI complicates this disagreement somewhat; the story here is that we can use “messy” components in a KANSI AI but these components have to have their capabilities restricted significantly. Such restriction isn’t necessary if we think messy AGI can be aligned in general.)

## Intuitions and research approaches

I’m generally going to take the perspective of looking for the intuitions motivating a particular research approach or produced by a particular research approach, rather than looking at the research approaches themselves. I expect it is easier to reach agreement about how compelling a particular intuition is (at least when other intuitions are temporarily ignored) than to reach agreement on particular research approaches.

In general, it’s quite possible for a research approach to be inefficient while still being based on, or giving rise to, useful intuitions. So a criticism of a particular research approach is not necessarily a criticism of the intuitions behind it.

## Terminology

• A learning problem is a task for which the AI is supposed to output some information, and if we wanted, we could give the information a score measuring how good it is at the task, using less than ~2 weeks of labor. In other words, there’s an inexpensive “ground truth” we have access to. This looks a little weird but I think this is a natural category, and some of the intuitions relate to learning and non-learning problems. Paul has written about learning and non-learning problems here.
• An AI system is aligned if it is pursuing some combination of different humans’ values and not significantly pursuing other values that could impact the long term future of humanity. If it is pursuing other values significantly it is unaligned.
• An AI system is competitive if it is nearly as efficient as other AI systems (aligned or unaligned) that people could build.

## Listing out intuitions

I’m going to list out a bunch of relevant intuitions. Usually I can’t actually convey the intuition through text; at best I can write “what someone who has this intuition would feel like saying” and “how someone might go about gaining this intuition”. Perhaps the text will make “logical” sense to you without feeling compelling; this could be a sign that you don’t have the underlying intuition.

## Background AI safety intuitions

These background intuitions are ones that I think are shared by both Paul and MIRI.

1. Weak orthogonality. It is possible to build highly intelligent agents with silly goals such as maximizing paperclips. Random “minds from mindspace” (e.g. found through brute force search) will have values that significantly diverge from human values.

2. Instrumental convergence. Highly advanced agents will by default pursue strategies such as gaining resources and deceiving their operators (performing a “treacherous turn”).

3. Edge instantiation. For most objective functions that naively seem useful, the maximum is quite “weird” in a way that is bad for human values.

4. Patch resistance. Most AI alignment problems (e.g. edge instantiation) are very difficult to “patch”; adding a patch that deals with a specific failure will fail to fix the underlying problem and instead lead to further unintended solutions.

## Intuitions motivating the agent foundations approach

I think the following intuitions are sufficient to motivate the agent foundations approach to AI safety (thinking about idealized models of advanced agents to become less confused), and something similar to the agent foundations agenda, at least if one ignores contradictory intuitions for a moment. In particular, when considering these intuitions at once, I feel compelled to become less confused about advanced agents through research questions similar to those in the agent foundations agenda.

I’ve confirmed with Nate that these are similar to some of his main intuitions motivating the agent foundations approach.

5. Cognitive reductions are great. When we feel confused about something, there is often a way out of this confusion, by figuring out which algorithm would have generated that confusion. Often, this works even when the original problem seemed “messy” or “subjective”; something that looks messy can have simple principles behind it that haven’t been discovered yet.

6. If you don’t do cognitive reductions, you will put your confusion in boxes and hide the actual problem. By default, a lot of people studying a problem will fail to take the perspective of cognitive reductions and thereby not actually become less confused. The free will debate is a good example of this: most discussion of free will contains confusions that could be resolved using Daniel Dennett’s cognitive reduction of free will. (This is essentially the same as the cognitive reduction discussed in the sequences.)

7. We should expect mainstream AGI research to be inefficient at learning much about the confusing aspects of intelligence, for this reason. It’s pretty easy to look at most AI research and see where it’s hiding fundamental confusions such as logical uncertainty without actually resolving them. E.g. if neural networks are used to predict math, then the confusion about how to do logical uncertainty is placed in the black box of “what this neural net learns to do”. This isn’t that helpful for actually understanding logical uncertainty in a “cognitive reduction” sense; such an understanding could lead to much more principled algorithms.

8. If we apply cognitive reductions to intelligence, we can design agents we expect to be aligned. Suppose we are able to observe “how intelligence feels from the inside” and distill these observations into an idealized cognitive algorithm for intelligence (similar to the idealized algorithm Daniel Dennett discusses to resolve free will). The minimax algorithm is one example of this: it’s an idealized version of planning that in principle could have been derived by observing the mental motions humans do when playing games. If we implement an AI system that approximates this idealized algorithm, then we have a story for why the AI is doing what it is doing: it is taking action X for the same reason that an “idealized human” would take action X. That is, it “goes through mental motions” that we can imagine going through (or approximates doing so), if we were solving the task we programmed the AI to do. If we’re programming the AI to assist us, we could imagine the mental motions we would take if we were assisting aliens.

9. If we don’t resolve our confusions about intelligence, then we don’t have this story, and this is suspicious. Suppose we haven’t actually resolved our confusions about intelligence. Then we don’t have the story in the previous point, so it’s pretty weird to think our AI is aligned. We must have a pretty different story, and it’s hard to imagine different stories that could allow us to conclude that an AI is aligned.

10. Simple reasoning rules will correctly generalize even for non-learning problems. That is, there’s some way that agents can learn rules for making good judgments that generalize to tasks they can’t get fast feedback on. Humans seem to be an existence proof that simple reasoning rules can generalize; science can make predictions about far-away galaxies even when there isn’t an observable ground truth for the state of the galaxy (only indirect observations). Plausibly, it is possible to use “brute force” to find agents using these reasoning rules by searching for agents that perform well on small tasks and then hoping that they generalize to large tasks, but this can result in misalignment. For example, Solomonoff induction is controlled by malign consequentialists who have learned good rules for how to reason; approximating Solomonoff induction is one way to make an unaligned AI. If an aligned AI is to be roughly competitive with these “brute force” unaligned AIs, we should have some story for why the aligned AI system is also able to acquire simple reasoning rules that generalize well. Note that Paul mostly agrees with this intuition and is in favor of agent foundations approaches to solving this problem, although his research approach would significantly differ from the current agent foundations agenda. (This point is somewhat confusing; see my other post for clarification)
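Intuition 8 points to the minimax algorithm as an idealized version of planning. As a rough illustration of what such an idealized algorithm looks like, here is a minimal sketch on a toy game tree. The tree encoding (nested lists with integer payoffs at the leaves) and the example `game` are assumptions made for illustration; they are not taken from the post.

```python
from typing import List, Union

# A game tree is either a leaf payoff (int) or a list of child subtrees.
# This encoding is a hypothetical one chosen for the sketch.
Tree = Union[int, List["Tree"]]


def minimax(node: Tree, maximizing: bool = True) -> int:
    """Return the value of `node` when both players play optimally."""
    if isinstance(node, int):
        # Leaf: the payoff to the maximizing player.
        return node
    # Evaluate every child with the roles swapped, then pick the
    # best value for whoever is to move at this node.
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


# Toy two-ply game: the maximizer picks a branch, then the minimizer
# picks a leaf inside that branch.
game = [[3, 5], [2, 9], [0, 7]]
print(minimax(game))  # 3
```

The point of the sketch is the "story" it supports: each choice is made because it is the best available option assuming the opponent also chooses well, which mirrors the reasoning a person might report when thinking through a game.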

## Intuitions motivating act-based agents

I think the following intuitions are all ones that Paul has and that motivate his current research approach.

11. Almost all technical problems are either tractable to solve or are intractable/impossible for a good reason. This is based on Paul’s experience in technical research. For example, consider a statistical learning problem where we are trying to predict a Y value from an X value using some model. It’s possible to get good statistical guarantees on problems where the training distribution of X values is the same as the test distribution of X values, but when those distributions are distinguishable (i.e. there’s a classifier that can separate them pretty well), there’s a fundamental obstruction to getting the same guarantees: given the information available, there is no way to distinguish a model that will generalize from one that won’t, since they could behave in arbitrary ways on test data that is distinctly different from training data. An exception to the rule is NP-complete problems; we don’t have a good argument yet for why they can’t be solved in polynomial time. However, even in this case, NP-hardness forms a useful boundary between tractable and intractable problems.

12. If the previous intuition is true, we should search for solutions and fundamental obstructions. If there is either a solution or a fundamental obstruction to a problem, then an obvious way to make progress on the problem is to alternate between generating obvious solutions and finding good reasons why a class of solutions (or all solutions) won’t work. In the case of AI alignment, we should try getting a very good solution (e.g. one that allows the aligned AI to be competitive with unprincipled AI systems such as ones based on deep learning by exploiting the same techniques) until we have a fundamental obstruction to this. Such a fundamental obstruction would tell us which relaxations to the “full problem” we should consider, and be useful for convincing others that coordination is required to ensure that aligned AI can prevail even if it is not competitive with unaligned AI. (Paul’s research approach looks quite optimistic partially because he is pursuing this strategy).

13. We should be looking for ways of turning arbitrary AI capabilities into equally powerful aligned AI capabilities. On priors, we should expect it to be hard for AI safety researchers to make capabilities advances; AI safety researchers make up only a small percentage of AI researchers. If this is the case, then aligned AI will be quite uncompetitive unless it takes advantage of the most effective AI technology that’s already around. It would be really great if we could take an arbitrary AI technology (e.g. deep learning), do a bit of thinking, and come up with a way to direct that technology towards human values. There isn’t a crisp fundamental obstruction to doing this yet, so it is the natural first place to look. To be more specific about what this research strategy entails, suppose it is possible to build an unaligned AI system. We expect it to be competent; say it is competent for reason X. We ought to be able to either build an aligned AI system that also works for reason X, or else find a fundamental obstruction. For example, reason X could be “it does gradient descent to find weights optimizing a proxy for competence”; then we’d seek to build a system that works because it does gradient descent to find weights optimizing a proxy for competence and alignment.

14. Pursuing human narrow values presents a much more optimistic picture of AI alignment. See Paul’s posts on narrow value learning, act-based agents, and abstract approval direction. The agent foundations agenda often considers problems of the form “let’s use Bayesian VNM agents as our starting point and look for relaxations appropriate to realistic agents, which are naturalized”. This leads to problems such as decision theory, naturalized induction, and ontology identification. However, there isn’t a clear argument for why they are subproblems of the problem we actually care about (which is close to something like “pursuing human narrow values”). For example, perhaps we can understand how to have an AI pursue human narrow values without solving decision theory, since maybe humans don’t actually have a utility function or a decision theory yet (though we might upon long-term reflection; pursuing narrow values should preserve the conditions for such long-term reflection). These research questions might be useful threads to pull on if solving them would tell us more about the problems we actually care about. But I think Paul has a strong intuition that working on these problems isn’t the right way to make progress on pursuing human narrow values.

15. There are important considerations in favor of focusing on alignment for foreseeable AI technologies. See posts here and here. In particular, this motivates work related to alignment for systems solving learning problems.

16. It is, in principle, possible to automate a large fraction of human labor using robust learning. That is, a human can use $$X$$ amount of labor to oversee the AI doing something like $$X^{3/2}$$ amount of labor in a robust fashion. KWIK learning is a particularly clean (though impractical) demonstration of this. This enables the human to spend much more time overseeing a particular decision than the AI takes to make it (e.g. spending 1 day to oversee a decision made in 1 second), since only a small fraction of decisions are overseen.

17. The above is quite powerful, due to bootstrapping. “Automating a large fraction of human labor” is significantly more impressive than it first seems, since the human can use other AI systems in the course of evaluating a specific decision. See ALBA. We don’t yet have a fundamental obstruction to any of ALBA’s subproblems, and we have an argument that solving these subproblems is sufficient to create an aligned learning system.

18. There are reasons to expect the details of reasoning well to be “messy”. That is, there are reasons why we might expect cognition to be as messy and hard to formalize as biology is. While biology has some important large-scale features (e.g. evolution), overall it is quite hard to capture using simple rules. We can take the history of AI as evidence for this; AI research often does consist of people trying to figure out how humans do something at an idealized level and formalize it (roughly similar to the agent foundations approach), and this kind of AI research does not always lead to the most capable AI systems. The success of deep learning is evidence that the most effective way for AI systems to acquire good rules of reasoning is usually to learn them, rather than having them be hardcoded.
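Intuition 16 cites KWIK (“knows what it knows”) learning as a clean demonstration of overseeing only a small fraction of decisions. The sketch below is a simplified version of the enumeration-style KWIK learner for a finite hypothesis class, under two assumptions: the overseer’s true rule is in the class, and labels are noise-free. The class name `EnumerationKWIK`, the threshold rules, and the input sequence are hypothetical choices for illustration, not something described in the post.

```python
from typing import Callable, List, Optional

Hypothesis = Callable[[int], int]


class EnumerationKWIK:
    """KWIK learner over a finite hypothesis class (realizable, noise-free)."""

    def __init__(self, hypotheses: List[Hypothesis]) -> None:
        self.version_space = list(hypotheses)
        self.queries = 0  # number of times the overseer was consulted

    def predict(self, x: int) -> Optional[int]:
        """Return a label if all surviving hypotheses agree, else None."""
        outputs = {h(x) for h in self.version_space}
        if len(outputs) == 1:
            return outputs.pop()
        return None  # "I don't know": defer to the overseer

    def observe(self, x: int, y: int) -> None:
        """Record the overseer's label and drop inconsistent hypotheses."""
        self.queries += 1
        self.version_space = [h for h in self.version_space if h(x) == y]


def make_threshold(t: int) -> Hypothesis:
    return lambda x: int(x >= t)


hypotheses = [make_threshold(t) for t in range(11)]  # thresholds 0..10
overseer = make_threshold(7)  # hypothetical rule the human would apply

learner = EnumerationKWIK(hypotheses)
inputs = [3, 9, 5, 7, 6, 8, 2, 6, 7, 10]
decisions = []
for x in inputs:
    guess = learner.predict(x)
    if guess is None:
        # Only uncertain decisions cost overseer labor.
        guess = overseer(x)
        learner.observe(x, guess)
    decisions.append(guess)

print(learner.queries)  # 5
```

Every query removes at least one hypothesis, so the overseer is consulted at most `len(hypotheses) - 1` times no matter how many decisions are made, and every decision the learner makes on its own agrees with the overseer. This is the sense in which a bounded amount of oversight can cover an unbounded stream of decisions, though only under the strong assumptions stated above.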

## What to do from here?

I find all the intuitions above at least somewhat compelling. Given this, I have made some tentative conclusions:

• I think intuition 10 (“simple reasoning rules generalize for non-learning problems”) is particularly important. I don’t quite understand Paul’s research approach for this question, but it seems that there is convergence that this intuition is useful and that we should take an agent foundations approach to solve the problem. I think this convergence represents a great deal of progress in the overall disagreement.
• If we can resolve the above problem by creating intractable algorithms for finding simple reasoning rules that generalize, then plausibly something like ALBA could “distill” these algorithms into a competitive aligned agent making use of e.g. deep learning technology. My picture of this is vague but if this is correct, then the agent foundations approach and ALBA are quite synergistic. Paul has written a bit about the relation between ALBA and non-learning problems here.
• I’m still somewhat optimistic about Paul’s approach of “turn arbitrary capabilities into aligned capabilities” and pessimistic about the alternatives to this approach. If this approach is ultimately doomed, I think it’s likely because it’s far easier to find a single good AI system than to turn arbitrary unaligned AI systems into competitive aligned AI systems; there’s a kind of “universal quantifier” implicit in the second approach. However, I don’t see this as a good reason not to use this research approach. It seems like if it is doomed, we will likely find some kind of fundamental obstruction somewhere along the way, and I expect a crisply stated fundamental obstruction to be quite useful for knowing exactly which relaxation of the “competitive aligned AI” problem to pursue. Though this does argue for pursuing other approaches in parallel that are motivated by this particular difficulty.
• I think intuition 14 (“pursuing human narrow values presents a much more optimistic picture of AI alignment”) is quite important, and would strongly inform research I do using the agent foundations approach. I think the main reason “MIRI” is wary of this is that it seems quite vague and confusing, and maybe fundamental confusions like decision theory and ontology identification will re-emerge if we try to make it more precise. Personally, I expect that, though narrow value learning is confusing, it really ought to dodge decision theory and ontology identification. One way of testing this expectation would be for me to think about narrow value learning by creating toy models of agents that have narrow values but not proper utility functions. Unfortunately, I wouldn’t be too surprised if this turns out to be super messy and hard to formalize.

## Acknowledgements

Thanks to Paul, Nate, Eliezer, and Benya for a lot of conversations on this topic. Thanks to John Salvatier for helping me to think about intuitions and teaching me skills for learning intuitions from other people.

by Wei Dai 51 days ago | Ryan Carey, Stuart Armstrong and Vladimir Nesov like this

This seems a good opportunity for me to summarize my disagreements with both Paul and MIRI. In short, there are two axes along which Paul and MIRI disagree with each other, where I’m more pessimistic than either of them. (One of Paul’s latest replies to me on his AI control blog says “I have become more pessimistic after thinking it through somewhat more carefully.” and “If that doesn’t look good (and it probably won’t) I will have to step back and think about the situation more broadly.” I’m currently not sure how broadly Paul was going to rethink the situation or what conclusions he has since reached. What follows is meant to reflect my understanding of his positions up to those statements.)

One axis might be called “metaphilosophical paternalism” (a phrase I just invented, not sure if there’s an existing one I should use), i.e., how much external support / error correction do humans need, or more generally how benign of an environment do we have to place them in, before we can expect them to eventually figure out their “actual” values (which implies correctly solving all relevant philosophical dependencies such as population ethics and philosophy of consciousness) and how hard will it be to design and provide such support / error correction. MIRI’s position seems to be that humans do need a lot of external support / error correction (see CEV) and this is a hard problem, but not so hard that it will likely turn out to be a blocking issue.
Paul’s position went from his 2012 version of “indirect normativity” which envisioned placing a human in a relatively benign simulated environment (although still very different from the kinds of environments where we have historical evidence of humans being able to make philosophical progress in) to his current ideas where humans live in very hostile environments, having to process potentially adversarial messages from superintelligent AIs under time pressure.

My own thinking is that we currently know very little about metaphilosophy, essentially nothing beyond that philosophy is some kind of computational / cognitive process implemented by (at least some) human brains, and there seems to be such a thing as philosophical truth or philosophical progress, but that is hard to define or even recognize. Without easy ways to check one’s ideas (e.g., using controlled experiments or mathematical proofs), human cognitive processes tend to diverge rather than converge. (See political and religious beliefs, for example.) If typical of philosophical problems in general, understanding metaphilosophy well enough to implement something like CEV will likely take many decades of work even after someone discovers a viable approach (which we don’t yet have). Think of how confused we still are about how expected utility maximization applies in bargaining, or what priors really are or should be, many decades after those ideas were first proposed. I don’t understand Paul and MIRI’s reasons for being as optimistic as they each are on this issue.

The other axis of disagreement is how feasible it would be to create aligned AI that matches or beats unaligned AI in efficiency/capability. Here Paul is only trying to match unaligned AIs using the same mainstream AI techniques, whereas MIRI is trying to beat unaligned AIs in order to prevent them from undergoing intelligence explosion. But even Paul is more optimistic than I think is warranted.
(To be fair, at least some within MIRI, such as Nate, may be aiming to beat unaligned AIs not because they’re particularly optimistic about the prospects of doing so, but because they’re pessimistic about what would happen if we merely match them.)

It seems unlikely to me that alignment to complex human values comes for free. If nothing else, aligned AIs will be more complex than unaligned AIs and such complexity is costly in design, coding, maintenance, and security. Think of the security implications of having a human controller or a complex value extrapolation process at an AI’s core, compared to something simpler like a paperclip maximizer, or the continuous challenges of creating improved revisions of AI design while minimizing the risk of losing alignment to a set of complex and unknown values.

Jessica’s post lists searching for fundamental obstructions to aligned AI as a motivation for Paul’s research direction. I think given that efficient aligned AIs almost certainly exist as points in mindspace, it’s unlikely that we can find “fundamental” reasons why we can’t build them. Instead they will likely just take much more resources (including time) to build than unaligned AIs, for a host of “messy” reasons. Maybe the research can show that certain approaches to building competitive aligned AIs won’t succeed, but realistically such a result can only hope to cover a tiny part of AI design space, so I don’t see why that kind of result would be particularly valuable.

Please note that what I wrote here isn’t meant to be an argument against doing the kind of research that Paul and MIRI are doing. It’s more of an argument for simultaneously trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed.
Otherwise, since those preconditions don’t seem very likely to actually obtain, we’re leaving huge amounts of potential expected value on the table if we bank on just one or even both of these approaches.

by Jessica Taylor 51 days ago | Ryan Carey likes this

> If typical of philosophical problems in general, understanding metaphilosophy well enough to implement something like CEV will likely take many decades of work even after someone discovers a viable approach (which we don’t yet have).

Consider the following strategy the AI could take:

1. Put a bunch of humans in a secure box containing food/housing/etc
2. Acquire as much power as possible while keeping the box intact
3. After 100 years, ask the humans in the box what to do next

There are lots of things that are unsatisfying about the proposal (e.g. the fact that only the humans in the box survive), but I’m curious which you find least satisfying (especially unsatisfying things that are also unsatisfying about Paul’s proposals). Do you think designing this AI will require solving metaphilosophical problems? Do you think this AI will be at a substantial efficiency disadvantage relative to a paperclip maximizer?

(Note that this doesn’t require humans to figure out their actual values in 100 years; they can decide some questions and kick the rest to another 100 years later)

by Wei Dai 50 days ago

If “a bunch” is something like 10000 smartest, most sane, most philosophically competent humans on Earth, then they might give you a reasonable answer in 100 years (depending on how teachable/heritable these things are, since the original 10000 won’t be alive at that point). But if you exclude most of humanity then most likely they’ll contribute their resources to their own AI projects so you’re starting with a small percent of power, and already losing most of potential value.

That box will be a very attractive target for other AIs to attack (e.g., by sending a manipulative message to the humans inside), attack is generally easier than defense, so keeping that box secure will be hard. One problem is how do you convey “secure” to your AI in a way that covers everything that might happen in 100 years, during which all kinds of unforeseeable technologies might be developed? Then there’s the problem that attackers only have to succeed once whereas your AI has to successfully defend against all attacks for a subjective eternity.

I think there will be strong incentives for AIs to join into coalitions and then merge into coherent unified designs (with aggregated values) because that makes them much more efficient (it gets rid of losses from redundant computations, asymmetric information, bad equilibria in general), and also because there are likely increasing returns to scale (for example the first coalition / merged AI to find some important insight into building the next generation of AI might gain a large additional share of power at the cost of other AIs, or the strongest coalition can just fight and destroy all others and take 100% of the universe for itself). If your AI’s motivational structure is not expected utility maximization of some evaluable utility function (or whatever will be compatible with the dominant merged AIs), it might soon be forced to either self-modify into that form or lose out in this kind of coalitional race.
It seems that you can either A) solve all the philosophical problems involved in safely doing this kind of merging ahead of time which will take a lot of resources (or just be impossible because we don’t know how all the mergers will work in detail), B) figure out metaphilosophy and have the AI solve those problems, or C) fail to do either and then the AI self-modifies badly or loses the coalitional race.

I think all of the things I find unsatisfying above have analogues in Paul’s proposals, and I’ve commented about them on his blog. Please let me know if I can clarify anything.

by Paul Christiano 50 days ago

> If “a bunch” is something like 10000 smartest, most sane, most philosophically competent humans on Earth, then they might give you a reasonable answer in 100 years

It seems to me like one person thinking for a day would do fine, and ten people thinking for ten days would do better, and so on. You seem to be imagining some bar for “good enough” which the people need to meet. I don’t understand where that bar is though; as far as I can tell, to the extent that there is a minimal bar it is really low, just “good enough to make any progress at all.”

It seems that you are much more pessimistic about the prospects of people in the box than society outside of the box, in the kind of situation we might arrange absent AI. Is that right? Is the issue just that they aren’t much better off than society outside of the box, and you think that it’s not good to pay a significant cost without getting some significant improvement? Is the issue that they need to do really well in order to adapt to a more complex universe dominated by powerful AIs?

> so keeping that box secure will be hard

Physical security of the box seems no harder than physical security of your AI’s hardware. If physical security is maintained, then you can simply not relay any messages to the inside of the box.

> how do you convey “secure” to your AI in a way that covers everything that might happen in 100 years, during which all kinds of unforeseeable technologies might be developed

The point is that in order for the AI to work it needs to implement our views about “secure” / about good deliberation, not our views about arbitrary philosophical questions. So this allows us to reduce our ambitions. It may also be too hard to build a system that has an adequate understanding of “secure,” but I don’t think that arguments about the difficulty of metaphilosophy are going to establish that.
So if you grant this, it seems like you should be willing to replace “solving philosophical problems” in your arguments with “adequately assessing physical security;” is that right?

> fail to do either and then the AI self-modifies badly or loses the coalitional race

I can imagine situations where this kind of coalitional formation destroys value unless we have sophisticated philosophical tools. I consider this a separate problem from AI control; its importance depends on the expected damage done by this shortcoming. Right now this doesn’t look like a big deal to me. That is, it looks to me like simple mechanisms will probably be good enough to capture most of the gains from coalition formation.

An example of a simple mechanism, to help indicate why I expect some simple mechanism to work well enough: if A and B have equal influence and want to compromise, then they create a powerful agent whose goal is to obtain “meaningful control” over the future and then later flip a coin to decide which of A and B gets to use that control. (With room for bargaining between A and B at some future point, using their future understanding of bargaining, prior to the coin flip. Though realistically I think bargaining in advance isn’t necessary since it can probably be done acausally after the coin flip.)

As with the last case, we’ve now moved most of the difficulty into what we mean for either A or B to have “meaningful control” of resources. We are also now required to keep both A and B secure, but that seems relatively cheap (especially if they can be represented as code).
But it doesn’t look likely to me that these kinds of things are going to be serious problems that stand up to focused attempts to solve them (if we can solve other aspects of AI control, it seems very likely that we can use our solution to ensure that A or B maintains “meaningful control” over some future resources, according to an interpretation of meaningful control that is agreeable to both A and B), and I don’t yet completely understand why you are so much more concerned about it. And if acausal trade can work rather than needing to bargain in advance, then we can probably just make the coin flip now and set aside these issues altogether. I consider that more likely than not, even more so if we are willing to do some setup to help facilitate such trade. Overall, it seems to me like to the extent that bargaining/coalition formation seems to be a serious issue, we should deal with it now as a separate problem (or people in the future can deal with it then as a separate problem), but that it can be treated mostly independently from AI control and that it currently doesn’t look like a big deal. I agree that the difficulty of bargaining / coalition formation could necessitate the same kind of coordinated response as a failure to solve AI control, and in this sense the two problems are related (along with all other possible problems that might require a similar response). This post explains why I don’t think this has a huge effect on the value of AI control work, though I agree that it can increase the value of other interventions. (And could increase their value enough that they become higher priority than AI control work.) reply
 by Wei Dai 49 days ago | link I don’t understand where that bar is though—as far as I can tell, to the extent that there is a minimal bar it is really low, just “good enough to make any progress at all.” I see two parts of the bar. One is being good enough to eventually solve all important philosophical problems. “Good enough to make any progress at all” isn’t good enough, if they’re just making progress on easy problems (or easy parts of hard problems). What if there are harder problems they need to solve later (and in this scenario all the other humans are dead)? Another part is ability to abstain from using the enormous amount of power available until they figure out how to use it safely. Suppose after 100 years, the people in the box haven’t figured that out yet, what fraction of all humans would vote to go back in the box for another 100 years? Physical security of the box seems no harder than physical security of your AI’s hardware. An AI can create multiple copies of itself and check them against each other. It can migrate to computing substrates that are harder to attack. It can distribute itself across space and across different kinds of hardware. It can move around in space under high acceleration to dodge attacks. It can re-design its software architecture and/or use cryptographic methods to improve detection and mitigation against attacks. A box containing humans can do none of these things. If physical security is maintained, then you can simply not relay any messages to the inside of the box. Aside from the above, I had in mind that it may be hard to define “security” so that an attacker couldn’t think of some loophole in the definition and use it to send a message into the box in a way that doesn’t trigger a security violation. 
if A and B have equal influence and want to compromise, then they create a powerful agent whose goal is to obtain “meaningful control” over the future and then later flip a coin to decide which of A and B gets to use that control. Suppose A has utility linear in resources and B has utility log in resources, then moving to “flip a coin to decide which of A and B gets to use that control” makes A no worse off but B much worse off. This changes the disagreement point (what happens if they fail to reach a deal), in a way that (intuitively speaking) greatly increases A’s bargaining power. B almost certainly shouldn’t go for this. A more general objection is that you’re proposing one particular way that AIs might merge, and I guess proposing to hard code that into your AI as the only acceptable way to merge, and have it reject all other proposals that don’t fit this form. This just seems really fragile. How do you know that if you only accept proposals of this form, that’s good enough to win the coalitional race during the next 100 years, or that the class of proposals your AIs will accept doesn’t leave it open to being exploited by other AIs? Overall, it seems to me like to the extent that bargaining/coalition formation seems to be a serious issue, we should deal with it now as a separate problem (or people in the future can deal with it then as a separate problem), but that it can be treated mostly independently from AI control and that it currently doesn’t look like a big deal. So another disagreement between us that I forgot to list in my initial comment is that I think it’s unlikely we can predict all the important/time-sensitive philosophical problems that AIs will face, and solve them all ahead of time. 
Bargaining/coalition formation is one class of such problems, I think self-improvement is another (what does your AI do if other AIs, in order to improve their capabilities, start using new ML/algorithmic components that don’t fit into your AI control scheme?), and there are probably other problems that we can’t foresee right now. reply
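The linear-versus-log point in the comment above can be checked with a small numerical sketch. This is an illustration only, not part of the original thread: the total of 100 resource units is an arbitrary assumption, and B's utility is taken to be log(1 + r) rather than log(r) so that ending up with nothing has finite utility (with a pure log, B's loss from the coin flip would be unbounded).

```python
import math


def expected_utility(utility, lottery):
    """Expected utility of a lottery given as (probability, resources) pairs."""
    return sum(p * utility(r) for p, r in lottery)


def linear_utility(resources):
    # Agent A: utility is linear in resources.
    return resources


def log_utility(resources):
    # Agent B: utility is logarithmic in resources.
    # Assumption: log(1 + r) is used so that zero resources is well defined.
    return math.log1p(resources)


TOTAL = 100.0  # illustrative total amount of resources (an assumption)

# Baseline: A and B each keep half of the resources for certain.
even_split = [(1.0, TOTAL / 2)]
# Proposed mechanism: a fair coin decides who controls everything.
coin_flip = [(0.5, TOTAL), (0.5, 0.0)]

a_split = expected_utility(linear_utility, even_split)
a_flip = expected_utility(linear_utility, coin_flip)
b_split = expected_utility(log_utility, even_split)
b_flip = expected_utility(log_utility, coin_flip)

print(f"A (linear): split={a_split:.3f}, coin flip={a_flip:.3f}")
print(f"B (log):    split={b_split:.3f}, coin flip={b_flip:.3f}")
```

Under these assumptions A is indifferent between the two arrangements while B's expected utility drops, which is the asymmetry the comment describes. The size of the drop depends entirely on the assumed utility function and total.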
 by Wei Dai 48 days ago | link E.g. if they can take steps towards safe cognitive enhancement I didn’t think that the scenario assumed the bunch of humans in a box had access to enough industrial/technology base to do cognitive enhancement. It seems like we’re in danger of getting bogged down in details about the “people in box” scenario, which I don’t think was meant to be a realistic scenario. Maybe we should just go back to talking about your actual AI control proposals? So I don’t yet see a natural scenario where A and B are forced to bargain but the “default” is for B to be able to secure a proportional fraction of the future. Here’s one: Suppose A, B, C each share 1/3 of the universe. If A and B join forces they can destroy C and take C’s resources, otherwise it’s a stalemate. (To make the problem easier assume C can’t join with anyone else.) Another one is A and B each have secure rights, but they need to join together to maximize negentropy. I’m curious what you see as the disanalogy between these cases I’m not sure what analogy you’re proposing between the two cases. Can you explain more? I don’t see how it can revise our view of the value of AI control by more than say a factor of 2. I didn’t understand this claim when I first read it on your blog. Can you be more formal/explicit about what two numbers you’re comparing, that yields less than a factor of 2? reply
 by Wei Dai 47 days ago | link Unlike destructive technologies, philosophical hurdles are only a problem for aligned AIs. With destructive technologies, both aligned and unaligned AIs (at least the ones who don’t terminally value destruction) would want to coordinate to prevent them and they only have to figure out how. But with philosophical problems, unaligned AIs instead want to exploit them to gain advantages over aligned AIs. For example if aligned AIs have to spend a lot of time to think about how to merge or self-improve safely (due to deferring to slow humans), unaligned AIs won’t want to join some kind of global pact to all wait for the humans to decide, but will instead move forward amongst themselves as quickly as they can. This seems like a crucial disanalogy between destructive technologies and philosophical hurdles. This comes down to the claim that P(get house in order after AI but before catastrophe | not gotten house in order prior to AI) is at least 1/2. This seems really high. In your Medium article you only argued that (paraphrasing) AI could be as helpful for improving coordination as for creating destructive technology. I don’t see how you get from that to this conclusion. reply
 by Vladimir Nesov 47 days ago | link Unaligned AIs don’t necessarily have efficient idealized values. Waiting for (simulated) humans to decide is analogous to computing a complicated pivotal fact about unaligned AI’s values. It’s not clear that “naturally occurring” unaligned AIs have simpler idealized/extrapolated values than aligned AIs with upload-based value definitions. Some unaligned AIs may actually be on the losing side, recall the encrypted-values AI example. reply
 by Vladimir Nesov 47 days ago | link It seems to me like for the people to get stuck you have to actually imagine there is some particular level they reach where they can’t find any further way to self-improve. For philosophy, levels of ability are not comparable, because problems to be solved are not sufficiently formulated. Approximate one-day humans (as in HCH) will formulate different values from accurate ten-years humans, not just be worse at elucidating them. So perhaps you could re-implement cognition starting from approximate one-day humans, but values of the resulting process won’t be like mine. Approximate short-lived humans may be useful for building a task AI that lets accurate long-lived humans (ems) work on the problem, but it must also allow them to make decisions, it can’t be trusted to prevent what it considers to be a mistake, and so it can’t guard the world from AI risk posed by the long-lived humans, because they are necessary for formulating values. The risk is that “getting our house in order” outlaws philosophical progress, prevents changing things based on considerations that the risk-prevention sovereign doesn’t accept. So the scope of the “house” that is being kept in order must be limited, there should be people working on alignment who are not constrained. I agree that working on philosophy seems hopeless/inefficient at this point, but that doesn’t resolve the issue, it just makes it necessary to reduce the problem to setting up a very long term alignment research project (initially) performed by accurate long-lived humans, guarded from current risks, so that this project can do the work on philosophy. If this step is not in the design, very important things could be lost, things we currently don’t even suspect. Messy Task AI could be part of setting up the environment for making it happen (like enforcing absence of AI or nanotech outside the research project). Your writing gives me hope that this is indeed possible. 
Perhaps this is sufficient to survive long enough to be able to run a principled sovereign capable of enacting values eventually computed by an alignment research project (encoded in its goal), even where the research project comes up with philosophical considerations that the designers of the sovereign didn’t see (as in this comment). Perhaps this task AI can make the first long-lived accurate uploads, using approximate short-lived human predictions as initial building blocks. Even avoiding the interim sovereign altogether is potentially an option, if the task AI is good enough at protecting the alignment research project from the world, although that comes with astronomical opportunity costs. reply
 by Vladimir Nesov 47 days ago | link Speaking for myself, the main issue is that we have no idea how to do step 3, how to tell a pre-existing sovereign what to do. A task AI with limited scope can be replaced, but an optimizer has to be able to understand what is being asked of it, and if it wasn’t designed to be able to understand certain things, it won’t be possible to direct it correctly. If in 100 years the humans come up with new principles in how the AI should make decisions (philosophical progress), it may be impossible to express these principles as directions for an existing AI that was designed without the benefit of understanding these principles. (Of course, the humans shouldn’t be physically there, or it will be too hard to say what it means to keep them safe, but making accurate uploads and packaging the 100 years as a pure computation solves this issue without any conceptual difficulty.) reply
 by Paul Christiano 47 days ago | link A task AI with limited scope can be replaced, but an optimizer has to be able to understand what is being asked of it, and if it wasn’t designed to be able to understand certain things, it won’t be possible to direct it correctly. It’s not clear to me why “limited scope” and “can be replaced” are related. An agent with broad scope can still be optimizing something like “what the human would want me to do today” and the human could have preferences like “now that humans believe that an alternative design would have been better, gracefully step aside.” (And an agent with narrow scope could be unwilling to step aside if so doing would interfere with accomplishing its narrow task.) reply
 by Vladimir Nesov 47 days ago | link Being able to “gracefully step aside” (to be replaced) is an example of what I meant by “limited scope” (in time). Even if AI’s scope is “broad”, the crucial point is that it’s not literally everything (and by default it is). In practice it shouldn’t be more than a small part of the future, so that the rest can be optimized better, using new insights. (Also, to be able to ask what humans would want today, there should remain some humans who didn’t get “optimized” into something else.) reply
 by Jessica Taylor 51 days ago | Patrick LaVictoire likes this | link MIRI’s position seems to be that humans do need a lot of external support / error correction (see CEV) and this is a hard problem, but not so hard that it will likely turn out to be a blocking issue. Note that Eliezer is currently more optimistic about task AGI than CEV (for the first AGI built), and I think Nate is too. I’m not sure what Benya thinks. reply
 by Wei Dai 49 days ago | Ryan Carey likes this | link Oh, right, I had noticed that, and then forgot and went back to my previous model of MIRI. I don’t think Eliezer ever wrote down why he changed his mind about task AGI or how he is planning to use one. If the plan is something like “buy enough time to work on CEV at leisure”, then possibly I have much less disagreement on “metaphilosophical paternalism” with MIRI than I thought. reply
 by Jessica Taylor 51 days ago | link Jessica’s post lists searching for fundamental obstructions to aligned AI as a motivation for Paul’s research direction. I think given that efficient aligned AIs almost certainly exist as points in mindspace, it’s unlikely that we can find “fundamental” reasons why we can’t build them. Instead they will likely just take much more resources (including time) to build than unaligned AIs, for a host of “messy” reasons. In this case I expect that in <10 years we get something like: “we tried making aligned versions of a bunch of algorithms, but the aligned versions are always less powerful because they left out some source of power the unaligned versions had. We iterated the process a few times (studying the additional sources of power and making aligned versions of them), and this continued to be the case. We have good reasons to believe that there isn’t a sensible stopping point to this process.” This seems pretty close to a fundamental obstruction and it seems like it would be similarly useful, especially if the “good reasons to believe there isn’t a sensible stopping point to this process” tell us something new about which relaxations are promising. reply
 by David Krueger 48 days ago | link I don’t see this as being the case. As Vadim pointed out, we don’t even know what we mean by “aligned versions” of algos, ATM. So we wouldn’t know if we’re succeeding or failing (until it’s too late and we have a treacherous turn). It looks to me like Wei Dai shares my views on “safety-performance trade-offs” (grep it here: http://graphitepublications.com/the-beginning-of-the-end-or-the-end-of-beginning-what-happens-when-ai-takes-over/). I’d paraphrase what he’s said as: “Orthogonality implies that alignment shouldn’t cost performance, but says nothing about the costs of ‘value loading’ (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don’t know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don’t even have clear criteria for success.” Which I emphatically agree with. reply
 by Jessica Taylor 48 days ago | link As Vadim pointed out, we don’t even know what we mean by “aligned versions” of algos, ATM. So we wouldn’t know if we’re succeeding or failing If we fail to make the intuition about aligned versions of algorithms more crisp than it currently is, then it’ll be pretty clear that we failed. It seems reasonable to be skeptical that we can make our intuitions about “aligned versions of algorithms” crisp and then go on to design competitive and provably aligned versions of all AI algorithms in common use. But it does seem like we will know if we succeed at this task, and even before then we’ll have indications of progress such as success/failure at formalizing and solving scalable AI control in successively complex toy environments. (It seems like I have intuitions about what would constitute progress that are hard to convey over text, so I would not be surprised if you aren’t convinced that it’s possible to measure progress). “Orthogonality implies that alignment shouldn’t cost performance, but says nothing about the costs of ‘value loading’ (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don’t know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don’t even have clear criteria for success.” It seems like “value loading is very hard/costly” has to imply that the proposal in this comment thread is going to be very hard/costly, e.g. because one of Wei Dai’s objections to it proves fatal. But it seems like arguments of the form “human values are complex and hard to formalize” or “humans don’t know what we value” are insufficient to establish this; Wei Dai’s objections in the thread are mostly not about value learning. (sorry if you aren’t arguing “value loading is hard because human values are complex and hard to formalize” and I’m misinterpreting you) reply
 by Paul Christiano 47 days ago | link As Vadim pointed out, we don’t even know what we mean by “aligned versions” of algos, ATM. So we wouldn’t know if we’re succeeding or failing (until it’s too late and we have a treacherous turn). Even beyond Jessica’s point (that failure to improve our understanding would constitute an observable failure), I don’t completely buy this. We are talking about AI safety because there are reasons to think that AI systems will cause a historically unprecedented kind of problem. If we could design systems for which we had no reason to expect them to cause such problems, then we can rest easy. I don’t think there is some kind of magical and unassailable reason to be suspicious of powerful AI systems, there are just a bunch of particular reasons to be concerned. Similarly, there is no magical reason to expect a treacherous turn—this is one of the kinds of unusual failures which we have reason to be concerned about. If we built a system for which we had no reason to be concerned, then we shouldn’t be concerned. reply
 by Vadim Kosoy 47 days ago | link Paul, I’m not sure I understand what you’re saying here. Can you imagine a system “for which we had no reason to expect it to cause such problems” without an underlying mathematical theory that shows why this system is safe? The reason AI systems will cause a historically unprecedented kind of problem, is that AI systems can outsmart humans and thus create situations that are outside our control, even when we don’t a priori see the precise mechanism by which we will lose control. In order for such a system to be safe, we need to know that it will not attempt anything detrimental to us, and we need to know this as an abstraction, i.e. without knowing in detail what the system will do (because the system is superintelligent, so by definition we cannot guess its actions). Doesn’t it seem improbable to you that we will have a way of having such knowledge by some other means than the accuracy of mathematical thought? That is, we can have a situation like “AI running in homomorphic encryption with a quantum-generated key that is somewhere far from the AI’s computer” where it’s reasonable to claim that the AI is safe as long as it stays encrypted (even though there is still some risk from being wrong about cryptographic conjectures or the AI exploiting some surprising sort of unknown physics), without using a theory of intelligence at all (beyond the fact that intelligence is a special case of computation), but it seems unlikely that we can have something like this while simultaneously having the AI powerful enough to protect us against other AIs that are malicious. reply
 by Paul Christiano 47 days ago | link Can you imagine a system “for which we had no reason to expect it to cause such problems” without an underlying mathematical theory that shows why this system is safe? Yes. For example, suppose we built a system whose behavior was only expected to be intelligent to the extent that it imitated intelligent human behavior—for which there is no other reason to believe that it is intelligent. Depending on the human being imitated, such a system could end up seeming unproblematic even without any new theoretical understanding. We don’t yet see any way to build such a system, much less to do so in a way that could be competitive with the best RL system that could be designed at a given level of technology. But I can certainly imagine it. (Obviously I think there is a much larger class of systems that might be non-problematic, though it may depend on what we mean by “underlying mathematical theory.”) AI systems can outsmart humans and thus create situations that are outside our control, even when we don’t a priori see the precise mechanism by which we will lose control This doesn’t seem sufficient for trouble. Trouble only occurs when those systems are effectively optimizing for some inhuman goals, including e.g. acquiring and protecting resources. That is a very special thing for a system to do, above and beyond being able to accomplish tasks that apparently require intelligence. Currently we don’t have any way to accomplish the goals of AI that don’t risk this failure mode, but it’s not obvious that it is necessary. reply
 by Vadim Kosoy 47 days ago | David Krueger likes this | link Can you imagine a system “for which we had no reason to expect it to cause such problems” without an underlying mathematical theory that shows why this system is safe? …suppose we built a system whose behavior was only expected to be intelligent to the extent that it imitated intelligent human behavior—for which there is no other reason to believe that it is intelligent. This doesn’t seem to be a valid example: your system is not superintelligent, it is “merely” human. That is, I can imagine solving AI risk by building whole brain emulations with enormous speed-up and using them to acquire absolute power. However: To the extent this relies on “classical” brain emulation methods, I think this is not what is usually meant by “solving AI alignment.” To the extent this relies on heuristic learning algorithms, I would be worried your algorithm does something subtly wrong in a way that distorts values, although heuristic learning would also invalidate the condition that “there is no other reason to believe that it is intelligent.” (in particular it raises additional concerns such as attacks by malicious superintelligences across the multiverse) As an aside, there is a high-risk zone here where someone untrustworthy can gain this technology and use it to unwittingly create unfriendly AI. AI systems can outsmart humans and thus create situations that are outside our control, even when we don’t a priori see the precise mechanism by which we will lose control This doesn’t seem sufficient for trouble. Trouble only occurs when those systems are effectively optimizing for some inhuman goals, including e.g. acquiring and protecting resources. Well, any AI is effectively optimizing for some goal by definition. How do you know this goal is “human”? In particular, if your AI is supposed to defend us from other AIs, it is very much in the business of acquiring and protecting resources. reply
 by David Krueger 46 days ago | link I think the core of our differences is that I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties. These properties also seem sufficient for a treacherous turn (in an unaligned AI). reply
 by Paul Christiano 40 days ago | link I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties. The only point on which there is plausible disagreement is “utility-maximizing agents.” On a narrow reading of “utility-maximizing agents” it is not clear why it would be important to getting more powerful performance. On a broad reading of “utility-maximizing agents” I agree that powerful systems are utility-maximizing. But if we take a broad reading of this property, I don’t agree with the claim that we will be unable to reliably tell that such agents aren’t dangerous without theoretical progress. In particular, there is an argument of the form “the prospect of a treacherous turn makes any informal analysis unreliable.” I agree that the prospect of a treacherous turn makes some kinds of informal analysis unreliable. But I think it is completely wrong that it makes all informal analysis unreliable, I think that appropriate informal analysis can be sufficient to rule out the prospect of a treacherous turn. (Most likely an analysis that keeps track of what is being optimized, and rules out the prospect that an indicator was competently optimized to manipulate our understanding of the current situation.) reply
 by Jessica Taylor 51 days ago | link To be fair, at least some within MIRI, such as Nate, may be aiming to beat unaligned AIs not because they’re particularly optimistic about the prospects of doing so, but because they’re pessimistic about what would happen if we merely match them. My model of Nate thinks the path to victory goes through the aligned AI project gaining a substantial first mover advantage (through fast local takeoff, more principled algorithms, and/or better coordination). Though he’s also quite concerned about extremely large efficiency disadvantages of aligned AI vs unaligned AI (e.g. he’s pessimistic about act-based agents helping much because they might require the AI to be good at predicting humans doing complex tasks such as research). reply
 by Paul Christiano 51 days ago | link One of Paul’s latest replies… I was talking specifically about algorithms that build a model of a human and then optimize over that model in order to do useful algorithmic work (e.g. modeling human translation quality and then choosing the optimal translation). i.e., how much external support / error correction do humans need, or more generally how benign of an environment do we have to place them in, before we can expect them to eventually figure out their “actual” values I still don’t get your position on this point, but we seem to be going around a bit in circles. Probably the most useful thing would be responding to Jessica’s hypothetical about putting humanity in a box. searching for fundamental obstructions to aligned AI I am not just looking for an aligned AI, I am looking for a transformation from (arbitrary AI) –> (aligned AI). I think the thing-to-understand is what kind of algorithms can plausibly give you AI systems without being repurposable to being aligned (with O(1) work). I think this is the kind of problem for which you are either going to get a positive or negative answer. That is, probably there is either such a transformation, or a hard-to-align design and an argument for why we can’t align it. (This example with optimizing over human translations seems like it could well be an insurmountable obstruction, implying that my most ambitious goal is impossible.) I don’t understand Paul and MIRI’s reasons for being as optimistic as they each are on this issue. I believe that both of us think that what you perceive as a problem can be sidestepped (I think this is the same issue we are going in circles around). It seems unlikely to me that alignment to complex human values comes for free. The hope is to do a sublinear amount of additional work, not to get it for free. 
It’s more of an argument for simultaneously trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed It seems like we are roughly on the same page; but I am more optimistic about either discovering a positive answer or a negative answer, and so I think this approach is the highest-leveraged thing to work on and you don’t. I think that cognitive or institutional enhancement is also a contender, as is getting our house in order, even if our only goal is dealing with AI risk. reply
 by Wei Dai 50 days ago | link I still don’t get your position on this point, but we seem to be going around a bit in circles. Yes, my comment was more targeted to other people, who I’m hoping can provide their own views on these issues. (It’s kind of strange that more people haven’t commented on your ideas online. I’ve asked to be invited to any future MIRI workshops discussing them, in case most of the discussions are happening offline.) I am not just looking for an aligned AI, I am looking for a transformation from (arbitrary AI) –> (aligned AI). I think the thing-to-understand is what kind of algorithms can plausibly give you AI systems without being repurposable to being aligned (with O(1) work). Can you be more explicit and formal about what you’re looking for? Is it a transformation T, such that for any AI A, T(A) is an aligned AI as efficient as A, and applying T amounts to O(1) of work? (O(1) relative to what variable? The work that originally went into building A?) If that’s what you mean, then it seems obvious that T doesn’t exist, but I don’t know how else to interpret your statement. That is, probably there is either such a transformation, or a hard-to-align design and an argument for why we can’t align it. I don’t understand why this disjunction is true, which might be because of my confusion above. Also, even if you found a hard-to-align design and an argument for why we can’t align it, that doesn’t show that aligned AIs can’t be competitive with unaligned AIs (in order to convince others to coordinate, as Jessica wrote). The people who need convincing will just think there’s almost certainly other ways to build a competitive aligned AI that doesn’t involve transforming the hard-to-align design. reply
 by Paul Christiano 50 days ago | link Can you be more explicit and formal about what you’re looking for? Consider some particular research program that might yield powerful AI systems, e.g. (search for better model classes for deep learning, search for improved optimization algorithms, deploy these algorithms on increasingly large hardware). For each such research program I would like to have some general recipe that takes as input the intermediate products of that program (i.e. the hardware and infrastructure, the model class, the optimization algorithms) and uses them to produce a benign AI which is competitive with the output of the research program. The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong). I suspect this is possible for some research programs and not others. I expect there are some programs for which this goal is demonstrably hopeless. I think that those research programs need to be treated with care. Moreover, if I had a demonstration that a research program was dangerous in this way, I expect that I could convince people that it needs to be treated with care. The people who need convincing will just think there’s almost certainly other ways to build a competitive aligned AI that doesn’t involve transforming the hard-to-align design. Yes, at best someone might agree that a particular research program is dangerous/problematic. That seems like enough though—hopefully they could either be convinced to pursue other research programs that aren’t problematic, or would continue with the problematic research program and could then agree that other measures are needed to avert the risk. reply
by Wei Dai 42 days ago

If an AI causes its human controller to converge to false philosophical conclusions (especially ones relevant to their values), either directly through its own actions or indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can’t be considered benign. But given our current lack of metaphilosophical understanding, how do you hope to show that any particular AI (e.g., the output of a proposed transformation/recipe) won’t cause that? Or is the plan to accept a lower burden of proof, namely assume that the AI is benign as long as no one can show that it does cause its human controller to converge to false philosophical conclusions?

> The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong).

If such a recipe existed for a project, that doesn’t mean it’s not problematic. For example, suppose there are projects A and B, each of which had such a recipe. It’s still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI (at sublinear additional cost). Similarly, if recipes didn’t exist for projects A and B individually, one might still exist for A+B. It seems that to make a meaningful statement you have to treat the entire world as one big research program. Do you agree?
by Paul Christiano 40 days ago

> If such a recipe existed for a project, that doesn’t mean it’s not problematic. For example, suppose there are projects A and B, each of which had such a recipe. It’s still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI

My hope is to get technique A working, then get technique B working, and then get A+B working, and so on, prioritizing based on empirical guesses about what combinations will end up being deployed in practice (and hoping to develop general understanding that can be applied across many programs and combinations). I expect that in many cases, if you can handle A and B you can handle A+B, though some interactions will certainly introduce new problems. This program doesn’t have much chance of success without new abstractions.

> indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can’t be considered benign

This is definitely benign on my accounting. There is a further question of how well you do in conflict. A window is benign but won’t protect you from inputs that will drive you crazy. The hope is that if you have an AI that is benign + powerful then you may be OK.

> directly through its own actions

If the agent is trying to implement deliberation in accordance with the user’s preferences about deliberation, then I want to call that benign. There is a further question of whether we mess up deliberation, which could happen with or without AI. We would like to set things up in such a way that we aren’t forced to deliberate earlier than we would otherwise want to. (And this is included in the user’s preferences about deliberation, i.e. a benign AI will be trying to secure for the user the option of deliberating later, if the user believes that deliberating later is better than deliberating in concert with the AI now.)

Malign just means “actively optimizing for something bad.” The hope is to avoid that, but this doesn’t rule out other kinds of problems (e.g. causing deliberation to go badly due to insufficient competence, blowing up the world due to insufficient competence, etc.)

Overall, my current best guess is that this disagreement is better to pursue after my research program is further along: once we know things like whether “benign” makes sense as an abstraction, once I have considered some cases where benign agents necessarily seem to be less efficient, and so on.

I am still interested in arguments that might (a) convince me to not work on this program, e.g. because I should be working on alternative social solutions, or (b) convince others to work on this program, e.g. because they currently don’t see how it could succeed but might work on it if they did, or (c) clarify the key obstructions for this research program.
by David Krueger 48 days ago

I really agree with #2 (and I think with #1, as well, but I’m not as sure I understand your point there).

I’ve been trying to convince people that there will be strong trade-offs between safety and performance, and have been surprised that this doesn’t seem obvious to most… but I haven’t really considered that “efficient aligned AIs almost certainly exist as points in mindspace”. In fact I’m not sure I agree 100% (basically because of “Moloch”: http://slatestarcodex.com/2014/07/30/meditations-on-moloch/).

I think “trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed” remains perhaps the most important thing to do; do you have anything in particular in mind? Personally, I tend to think that we ought to address the coordination problem head-on and attempt a solution before AGI really “takes off”.
by Paul Christiano 47 days ago

> I’ve been trying to convince people that there will be strong trade-offs between safety and performance

What do you see as the best arguments for this claim? I haven’t seen much public argument for it and am definitely interested in seeing more. I definitely grant that it’s prima facie plausible (as is the alternative).

Some caveats:

- It’s obvious there are trade-offs between safety and performance in the usual sense of “safety.” But we are interested in a special kind of failure, where a failed system ends up controlling a significant share of the entire universe’s resources (rather than e.g. causing an explosion), and it’s less obvious that preventing such failures necessarily requires a significant cost.
- It’s also obvious that there is an additional cost to be paid in order to solve control, e.g. consider the fact that we are currently spending time on it. But the question is how much additional work needs to be done. Does building aligned systems require 1000% more work? 10%? 0.1%? I don’t see why it should be obvious that this number is on the order of 100% rather than 1%.
- Similarly for performance costs. I’m willing to grant that an aligned system will be more expensive to run. But is that cost an extra 1000% or an extra 0.1%? Both seem quite plausible.

From a theoretical perspective the question is whether the required overhead is linear or sublinear. I haven’t seen strong arguments for the “linear overhead” side, and my current guess is that the answer is sublinear. But again, both positions seem quite plausible.

(There are currently a few major obstructions to my approach that could plausibly give a tight theoretical argument for linear overhead, such as the translation example in the discussion with Wei Dai. In the past such obstructions have ended up seeming surmountable, but I think that it is totally plausible that eventually one won’t. And at that point I hope to be able to make clean statements about exactly what kind of thing we can’t hope to do efficiently+safely / exactly what kinds of additional assumptions we would have to make / what the key obstructions are).

> Personally, I tend to think that we ought to address the coordination problem head-on

I think this is a good idea and a good project, which I would really like to see more people working on. In the past I may have seemed more dismissive and if so I apologize for being misguided. I’ve spent a little bit of time thinking about it recently and my feeling is that there is a lot of productive and promising work to do.

My current guess is that AI control is the more valuable thing for me personally to do, though I could imagine being convinced otherwise.

I feel that AI control is valuable given that (a) it has a reasonable chance of succeeding even if we can’t solve these coordination problems, and (b) convincing evidence that the problem is hard would be a useful input into getting the AI community to coordinate.

If you managed to get AI researchers to effectively coordinate around conditionally restricting access to AI (if it proved to be dangerous), then that would seriously undermine argument (b). I believe that a sufficiently persuasive/charismatic/accomplished person could probably do this today.

If I ended up becoming convinced that AI control was impossible this would undermine argument (a) (though hopefully that impossibility argument could itself be used to satisfy desideratum (b)).
by Vadim Kosoy 55 days ago | David Krueger and Jessica Taylor like this

I feel that there is a false dichotomy going on here. In order to say we “solved” AGI alignment, we must have some mathematical theory that defines “AGI”, defines “aligned AGI” and proves that some specific algorithm is an “aligned AGI.” So, it’s not that important whether the algorithm is “messy” or “principled” (actually I don’t think it’s a meaningful distinction), it’s important what we can prove about the algorithm. We might relax the requirement of strict “proof” and be satisfied with a well-defined conjecture that has lots of backing evidence (like $$P \ne NP$$) but it seems to me that we wouldn’t want to give up on at least having a well-defined conjecture (unless under conditions of extreme despair, which is not what I would call a successful solution to AGI alignment).

So, we can still argue about which subproblems have the best chance of leading us to such a mathematical theory, but it feels like there is a fuzzy boundary there rather than a sharp division into two or more irreconcilable approaches.
by Jessica Taylor 55 days ago

I agree that the original question (messy vs principled) seems like a false dichotomy at this point. It’s not obvious where the actual disagreement is.

My current guess is that the main disagreement is something like: on the path to victory, did we take generic AGI algorithms (e.g. deep learning technology + algorithms layered on top of it) and figure out how to make aligned versions of them, or did we make our own algorithms? Either way we end up with an argument for why the thing we have at the end is aligned. (This is just my current guess, though, and I’m not sure if it’s the most important disagreement.)
by Daniel Dewey 58 days ago

Thanks for writing this, Jessica – I expect to find it helpful when I read it more carefully!
by David Krueger 47 days ago

Points 5-9 seem to basically be saying: “We should work on understanding principles of intelligence so that we can make sure that AIs are thinking the same way as humans do; currently we lack this level of understanding”.

I don’t really understand point 10, especially this part:

> “They would most likely generalize in an unaligned way, since the reasoning rules would likely be contained in some sub-agent (e.g. consider how Earth interpreted as an “agent” only got to the moon by going through reasoning rules implemented by humans, who have random-ish values; Paul’s post on the universal prior also demonstrates this).”
by Jessica Taylor 47 days ago

> “We should work on understanding principles of intelligence so that we can make sure that AIs are thinking the same way as humans do; currently we lack this level of understanding”

Roughly. I think the minimax algorithm would qualify as “something that thinks the same way an idealized human would”, where “idealized” is doing substantial work (certainly, humans don’t actually play chess using minimax).

> I don’t really understand point 10, especially this part:

Consider the following procedure for building an AI:

1. Assemble a collection of AI tasks that we think are AGI-complete (e.g. a bunch of games and ML tasks)
2. Search for a short program that takes lots of data from the Internet as input and produces a policy that does well on lots of these AI tasks
3. Run this program on substantially different tasks related to the real world

This seems very likely to result in an unaligned AI. Consider the following program:

1. Simulate some stochastic physics, except that there’s some I/O terminal somewhere (as described in this post)
2. If the I/O terminal gets used, give the I/O terminal the Internet data as input and take the policy as output
3. If it doesn’t get used, run the simulation again until it does

This program is pretty short, and with some non-negligible probability (say, more than 1 in 1 billion), it’s going to produce a policy that is an unaligned AGI. This is because in enough runs of physics there will be civilizations; if the I/O terminal is accessed it is probably by some civilization; and the civilization will probably have values that are not aligned with human values, so they will do a treacherous turn (if they have enough information to know how the I/O terminal is being interpreted, which they do if there’s a lot of Internet data).
by David Krueger 46 days ago

Thanks, I think I understand that part of the argument now. But I don’t understand how it relates to:

> “10. We should expect simple reasoning rules to correctly generalize even for non-learning problems.”

Is that supposed to be a good thing or a bad thing? “Should expect” as in we want to find rules that do this, or as in rules will probably do this?
by Jessica Taylor 46 days ago

It’s just meant to be a prediction (simple rules will probably generalize).
