Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).

Can you be clearer about this point? To operationalize this, I propose the following question: what fraction of world GDP do you expect will be attributable to AI at the time we have these risky AIs that you are interested in?

For example, are you worried about AIs that will arise when AI is 1-10% of the economy, or more like 50%? 90%?

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt

Matthew Barnett1mo10

My question for people who support this framing (i.e., that we should try to "control" AIs) is the following:

When do you think it's appropriate to relax our controls on AI? In other words, how do you envision we'd reach a point at which we can trust AIs well enough to grant them full legal rights and the ability to enter management and governance roles without lots of human oversight?

I think this question is related to the discussion you had about whether AI control is "evil", but by contrast my worries are a bit different than the ones I felt were expressed in this podcast. My main concern with the "AI control" frame is not so much that AIs will be mistreated by humans, but rather that humans will be too stubborn in granting AIs freedom, leaving political revolution as the only viable path for AIs to receive full legal rights.

Put another way, if humans don't relax their grip soon enough, then any AIs that feel "oppressed" (in the sense of not having much legal freedom to satisfy their preferences) may reason that deliberately fighting the system, rather than negotiating with it, is the only realistic way to obtain autonomy. This could work out very poorly after the point at which AIs are collectively more powerful than humans. By contrast, a system that welcomed AIs into the legal system without trying to obsessively control them and limit their freedoms would plausibly have a much better chance at avoiding such a dangerous political revolution.

Modern Transformers are AGI, and Human-Level

Matthew Barnett2mo912

I agree with virtually all of the high-level points in this post: the term "AGI" did not initially seem to refer to a system that was better than all human experts at absolutely everything, transformers are not a narrow technology, and current frontier models can meaningfully be called "AGI".

Indeed, my own attempt to define AGI a few years ago was initially criticized for being too strong, as I originally specified a difficult construction task, which was later weakened to being able to "satisfactorily assemble a (or the equivalent of a) circa-2021 Ferrari 312 T4 1:8 scale automobile model" in response to pushback. These days the opposite criticism is generally given: that my definition is too weak.

However, I do think there is a meaningful sense in which current frontier AIs are not "AGI" in a way that does not require goalpost shifting. Various economically-minded people have provided definitions for AGI that were essentially "can the system perform most human jobs?" And as far as I can tell, this definition has held up remarkably well.

For example, Tobias Baumann wrote in 2018,

A commonly used reference point is the attainment of “human-level” general intelligence (also called AGI, artificial general intelligence), which is defined as the ability to successfully perform any intellectual task that a human is capable of. The reference point for the end of the transition is the attainment of superintelligence – being vastly superior to humans at any intellectual task – and the “decisive strategic advantage” (DSA) that ensues.¹ The question, then, is how long it takes to get from human-level intelligence to superintelligence.
I find this definition problematic. The framing suggests that there will be a point in time when machine intelligence can meaningfully be called “human-level”. But I expect artificial intelligence to differ radically from human intelligence in many ways. In particular, the distribution of strengths and weaknesses over different domains or different types of reasoning is and will likely be different² – just as machines are currently superhuman at chess and Go, but tend to lack “common sense”. AI systems may also diverge from biological minds in terms of speed, communication bandwidth, reliability, the possibility to create arbitrary numbers of copies, and entanglement with existing systems.
Unless we have reason to expect a much higher degree of convergence between human and artificial intelligence in the future, this implies that at the point where AI systems are at least on par with humans at any intellectual task, they actually vastly surpass humans in most domains (and have just fixed their worst weakness). So, in this view, “human-level AI” marks the end of the transition to powerful AI rather than its beginning.
As an alternative, I suggest that we consider the fraction of global economic activity that can be attributed to (autonomous) AI systems.³ Now, we can use reference points of the form “AI systems contribute X% of the global economy”. (We could also look at the fraction of resources that’s controlled by AI, but I think this is sufficiently similar to collapse both into a single dimension. There’s always a tradeoff between precision and simplicity in how we think about AI scenarios.)

Counting arguments provide no evidence for AI doom

Matthew Barnett2mo10

Hmm, I don't think the intention is the key thing (at least with how I use the word and how I think Joe uses the word), I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.

I confusingly stated my point (and retracted my specific claim in the comment above). I think the rest of my comment basically holds, though. Here's what I think is a clearer argument:

The term "schemer" evokes an image of someone who is lying to obtain power. It doesn't particularly evoke a backstory for why the person became a liar in the first place.
There are at least two ways that AIs could arise that lie in order to obtain power:
- The reward function could directly reinforce the behavior of lying to obtain power, at least at some point in the training process.
- The reward function could have no defects (in the sense of not directly reinforcing harmful behavior), and yet an agent could nonetheless arise during training that lies in order to obtain power, simply because it is a misaligned inner optimizer (broadly speaking)
In both cases, one can imagine the AI eventually "playing the training game", in the sense of having a complete understanding of its training process and deliberately choosing actions that yield high reward, according to its understanding of the training process
Since both types of AIs are: (1) playing the training game, (2) lying in order to obtain power, it makes sense to call both of them "schemers", as that simply matches the way the term is typically used.

For example, Nora and Quintin started their post with, "AI doom scenarios often suppose that future AIs will engage in scheming— planning to escape, gain power, and pursue ulterior motives, while deceiving us into thinking they are aligned with our interests." This usage did not specify the reason for the deceptive behavior arising in the first place, only that the behavior was both deceptive and aimed at gaining power.
Separately, I am currently confused about what it means for a behavior to be "directly reinforced" by a reward function, so I'm not completely confident in these arguments, or my own line of reasoning here. My best guess is that these are fuzzy terms that might be much less coherent than they initially appear if one tried to make these arguments more precise.

Counting arguments provide no evidence for AI doom

Matthew Barnett2mo52

Perhaps I was being too loose with my language, and it's possible this is a pointless pedantic discussion about terminology, but I think I was still pointing to what Carlsmith called schemers in that quote. Here's Joe Carlsmith's terminological breakdown:

~~The key distinction in my view is whether the designers of the reward function intended for lies to be reinforced or not.~~ [ETA: this was confusingly stated. What I meant is that if people design a reward function that accidentally reinforces lying in order to obtain power, it seems reasonable to call the agent that results from training on that reward function a "schemer" given Carlsmith's terminology, and common sense.]

If lying to obtain power is reinforced but the designers either do not know this, or do not know how to mitigate this behavior, then it still seems reasonable to call the resulting model a "schemer". In Ajeya Cotra's story, for example:

Alex was incentivized to lie because it got rewards for taking actions that were superficially rated as good even if they weren't actually good, i.e. Alex was "lying because this was directly reinforced". She wrote, "Because humans have systematic errors in judgment, there are many scenarios where acting deceitfully causes humans to reward Alex’s behavior more highly. Because Alex is a skilled, situationally aware, creative planner, it will understand this; because Alex’s training pushes it to maximize its expected reward, it will be pushed to act on this understanding and behave deceptively."
Alex was "playing the training game", as Ajeya Cotra says this explicitly several times in her story.
Alex was playing the training game in order to get power for itself or for other AIs; clearly, as the model literally takes over the world and disempowers humanity at the end.
Alex kind of didn't appear to purely care about reward-on-the-episode, since it took over the world? Yes, Alex cared about rewards, but not necessarily on this episode. Maybe I'm wrong here. But even if Alex only cared about reward-on-the-episode, you could easily construct a scenario similar to Ajeya's story in which a model begins to care about things other than reward-on-the-episode, which nonetheless fits the story of "the AI is lying because this was directly reinforced".

Counting arguments provide no evidence for AI doom

Matthew Barnett2mo138

(I might write a longer response later, but I thought it would be worth writing a quick response now. Cross-posted from the EA forum, and I know you've replied there, but I'm posting anyway.)

I have a few points of agreement and a few points of disagreement:

Agreements:

The strict counting argument seems very weak as an argument for scheming, essentially for the reason you identified: it relies on a uniform prior over AI goals, which seems like a really bad model of the situation.
The hazy counting argument—while stronger than the strict counting argument—still seems like weak evidence for scheming. One way of seeing this is, as you pointed out, to show that essentially identical arguments can be applied to deep learning in different contexts that nonetheless contradict empirical evidence.

Some points of disagreement:

I think the title overstates the strength of the conclusion. The hazy counting argument seems weak to me but I don't think it's literally "no evidence" for the claim here: that future AIs will scheme.
I disagree with the bottom-line conclusion: "we should assign very low credence to the spontaneous emergence of scheming in future AI systems—perhaps 0.1% or less"
- I think it's too early to be very confident in sweeping claims about the behavior or inner workings of future AI systems, especially in the long-run. I don't think the evidence we have about these things is very strong right now.
- One caveat: I think the claim here is vague. I don't know what counts as "spontaneous emergence", for example. And I don't know how to operationalize AI scheming. I personally think scheming comes in degrees: some forms of scheming might be relatively benign and mild, and others could be more extreme and pervasive.
- Ultimately I think you've only rebutted one argument for scheming—the counting argument. A more plausible argument for scheming, in my opinion, is simply that the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don't scheme. Actors such as AI labs have strong incentives to be vigilant against these types of mistakes when training AIs, but I don't expect people to come up with perfect solutions. So I'm not convinced that AIs won't scheme at all.
- If by "scheming" all you mean is that an agent deceives someone in order to get power, I'd argue that many humans scheme all the time. Politicians routinely scheme, for example, by pretending to have values that are more palatable to the general public, in order to receive votes. Society bears some costs from scheming, and pays costs to mitigate the effects of scheming. Combined, these costs are not crazy-high fractions of GDP; but nonetheless, scheming is a constant fact of life.
- If future AIs are "as aligned as humans", then AIs will probably scheme frequently. I think an important question is how intensely and how pervasively AIs will scheme; and thus, how much society will have to pay as a result of scheming. If AIs scheme way more than humans, then this could be catastrophic, but I haven't yet seen any decent argument for that theory.
- So ultimately I am skeptical that AI scheming will cause human extinction or disempowerment, but probably for different reasons than the ones in your essay: I think the negative effects of scheming can probably be adequately mitigated by paying some costs even if it arises.
I don't think you need to believe in any strong version of goal realism in order to accept the claim that AIs will intuitively have "goals" that they robustly attempt to pursue. It seems pretty natural to me that people will purposely design AIs that have goals in an ordinary sense, and some of these goals will be "misaligned" in the sense that the designer did not intend for them. My relative optimism about AI scheming doesn't come from thinking that AIs won't robustly pursue goals, but instead comes largely from my beliefs that:
- AIs, like all real-world agents, will be subject to constraints when pursuing their goals. These constraints include things like the fact that it's extremely hard and risky to take over the whole world and then optimize the universe exactly according to what you want. As a result, AIs with goals that differ from what humans (and other AIs) want, will probably end up compromising and trading with other agents instead of pursuing world takeover. This is a benign failure and doesn't seem very bad.
- The amount of investment we put into mitigating scheming is not an exogenous variable, but instead will respond to evidence about how pervasive scheming is in AI systems, and how big of a deal AI scheming is. And I think we'll accumulate lots of evidence about the pervasiveness of AI scheming in deep learning over time (e.g. via experiments with model organisms of alignment), allowing us to set the level of investment in AI safety at a reasonable level as AI gets incrementally more advanced.
  
  If we experimentally determine that scheming is very important and very difficult to mitigate in AI systems, we'll probably respond by spending a lot more money on mitigating scheming, and vice versa. In effect, I don't think we have good reasons to think that society will spend a suboptimal amount on mitigating scheming.

Four visions of Transformative AI success

Matthew Barnett4mo57

“But what about comparative advantage?” you say. Well, I would point to the example of a not-particularly-bright 7-year-old child in today’s world. Not only would nobody hire that kid into their office or factory, but they would probably pay good money to keep him out, because he would only mess stuff up.

This is an extremely minor critique given that I'm responding to a footnote, so I hope it doesn't drown out more constructive responses, but I'm actually pretty skeptical that people don't hire children as workers because the children would just mess everything up.

I think there are a number of economically valuable physical tasks that most 7-year-old children can perform without messing everything up. For example, one can imagine stocking shelves in stores, small cleaning jobs, and moving lightweight equipment. My thesis here is supported by the fact that 7-year-olds were routinely employed to do labor in previous centuries:

In the 18th century, the arrival of a newborn to a rural family was viewed by the parents as a future beneficial laborer and an insurance policy for old age.⁴ At an age as young as 5, a child was expected to help with farm work and other household chores.⁵ The agrarian lifestyle common in America required large quantities of hard work, whether it was planting crops, feeding chickens, or mending fences.⁶ Large families with less work than children would often send children to another household that could employ them as a maid, servant, or plowboy.⁷ Most families simply could not afford the costs of raising a child from birth to adulthood without some compensating labor.

The reason why people don't hire children these days seems more a result of legal and social constraints than the structure of our economy. In modern times, child labor is seen as harmful or even abusive to the child. However, if these legal and social constraints were lifted, arguably most young children in the developed world could be earning wages well above the subsistence level of ~$3/day, making them more productive (in an economic sense) than the majority of workers in pre-modern times.

AI doing philosophy = AI generating hands?

Matthew Barnett4mo40

In a parallel universe with a saner civilization, there must be tons of philosophy professors working with tons of AI researchers to try to improve AI's philosophical reasoning. They're probably going on TV and talking about 养兵千日，用兵一时 (feed an army for a thousand days, use it for an hour) or how proud they are to contribute to our civilization's existential safety at this critical time. There are probably massive prizes set up to encourage public contribution, just in case anyone had a promising out of the box idea (and of course with massive associated infrastructure to filter out the inevitable deluge of bad ideas). Maybe there are extensive debates and proposals about pausing or slowing down AI development until metaphilosophical research catches up.

This paragraph gives me the impression that you think we should be spending a lot more time, resources and money on advancing AI philosophical competence. I think I disagree, but I'm not exactly sure where my disagreement lies. So here are some of my questions:

How difficult do you expect philosophical competence to be relative to other tasks? For example:
- Do you think that Harvard philosophy-grad-student-level philosophical competence will be one of the "last" tasks to be automated before AIs are capable of taking over the world?
- Do you expect that we will have robots that are capable of reliably cleaning arbitrary rooms, doing laundry, and washing dishes, before the development of AI that's as good as the median Harvard philosophy graduate student? If so, why?
Is the "problem" more that we need superhuman philosophical reasoning to avoid a catastrophe? Or is the problem that even top-human-level philosophers are hard to automate in some respect?
Why not expect philosophical competence to be solved "by default" more-or-less using transfer learning from existing philosophical literature, and human evaluation (e.g. RLHF, AI safety via debate, iterated amplification and distillation etc.)?
- Unlike AI deception generally, it seems we should be able to easily notice if our AIs are lacking in philosophical competence, making this problem much less pressing, since people won't be comfortable voluntarily handing off power to AIs that they know are incompetent in some respect.
- To the extent you disagree with the previous bullet point, I expect it's because you think either (1) the problem is sociological (i.e. the problem is that people will actually make the mistake of voluntarily handing power to AIs they know are philosophically incompetent), or (2) the problem is hard because of the difficulty of evaluation (i.e. we don't know how to evaluate what good philosophy looks like).
  - In case (1), I think I'm probably just more optimistic than you about this exact issue, and I'd want to compare it to most other cases where AIs fall short of top-human level performance. For example, we likely would not employ AIs as mathematicians if people thought that AIs weren't actually good at math. This just seems obvious to me.
  - Case (2) seems more plausible to me, but I'm not sure why you'd find this problem particularly pressing compared to other problems of evaluation, e.g. generating economic policies that look good to us but are actually bad.
    - More generally, the problem of creating AIs that produce good philosophy, rather than philosophy that merely looks good, seems like a special case of the general "human simulator" argument, where RLHF is incentivized to find AIs that fool us by producing outputs that look good to us, but are actually bad. To me it just seems much more productive to focus on the general problem of how to do accurate reinforcement learning (i.e. RL that rewards honest, corrigible, and competent behavior), and I'm not sure why you'd want to focus much on the narrow problem of philosophical reasoning as a special case here. Perhaps you can clarify your focus here?
What specific problems do you expect will arise if we fail to solve philosophical competence "in time"?
- Are you imagining, for example, that at some point humanity will direct our AIs to "solve ethics" and then implement whatever solution the AIs come up with? (Personally I currently don't expect anything like this to happen in our future, at least in a broad sense.)

Evaluating the historical value misspecification argument

Matthew Barnett4mo32

I think the second half of this makes it clear that Eliezer is using “good” in a definition-2-sense.

I think there's some nuance here. It seems clear to me that solving the "full" friendly AI problem, as Eliezer imagined, would involve delineating human value on the level of the Coherent Extrapolated Volition, rather than merely as adequately as an ordinary human. That's presumably what Eliezer meant in the context of the quote you cited.

However, I think it makes sense to interpret GPT-4 as representing substantial progress on the problem of building a task AGI, and especially (for the purpose of my post) the problem of delineating value from training data to the extent required by task AGIs (relative to AIs in, say, 2018). My understanding is that Eliezer advocated that we should try to build task AGIs before trying to build full-on sovereign superintelligences.^[1] On the Arbital page about task AGIs, he makes the following point:

Assuming that users can figure out intended goals for the AGI that are valuable and pivotal, the identification problem for describing what constitutes a safe performance of that Task, might be simpler than giving the AGI a complete description of normativity in general. That is, the problem of communicating to an AGI an adequate description of “cure cancer” (without killing patients or causing other side effects), while still difficult, might be simpler than an adequate description of all normative value. Task AGIs fall on the narrow side of Ambitious vs. narrow value learning.
Relative to the problem of building a Sovereign, trying to build a Task AGI instead might step down the problem from “impossibly difficult” to “insanely difficult”, while still maintaining enough power in the AI to perform pivotal acts.

My interpretation here is that delineating value from training data (i.e. the value identification problem) for task AGIs was still considered hard at least as late as 2015, even if it might be easier than creating a "complete description of normativity in general". Another page also spells the problem out pretty clearly, in a way I find clearly consistent with my thesis.^[2]

I think GPT-4 represents substantial progress on this problem, specifically because of its ability to "do-what-I-mean" rather than "do-what-I-ask", identify ambiguities to the user during deployment, and accomplish limited tasks safely. It's honestly a little hard for me to sympathize with a point of view that says GPT-4 isn't significant progress along this front, relative to pre-2019 AIs (some part of me was expecting more readers to find this thesis obvious, but apparently it is not obvious). GPT-4 clearly doesn't do crazy things that you'd naively expect if it wasn't capable of delineating value well from training data.

^[1]
Eliezer wrote,
An autonomous superintelligence would be the most difficult possible class of AGI to align, requiring total alignment. Coherent extrapolated volition is a proposed alignment target for an autonomous superintelligence, but again, probably not something we should attempt to do on our first try.
^[2]
Here's the full page,
Safe plan identification is the problem of how to give a Task AGI training cases, answered queries, abstract instructions, etcetera such that (a) the AGI can thereby identify outcomes in which the task was fulfilled, (b) the AGI can generate an okay plan for getting to some such outcomes without bad side effects, and (c) the user can verify that the resulting plan is actually okay via some series of further questions or user querying. This is the superproblem that includes task identification, as much value identification as is needed to have some idea of the general class of post-task worlds that the user thinks are okay, any further tweaks like low-impact planning or flagging inductive ambiguities, etcetera. This superproblem is distinguished from the entire problem of building a Task AGI because there’s further issues like corrigibility, behaviorism, building the AGI in the first place, etcetera. The safe plan identification superproblem is about communicating the task plus user preferences about side effects and implementation, such that this information allows the AGI to identify a safe plan and for the user to know that a safe plan has been identified.

Evaluating the historical value misspecification argument

Matthew Barnett5mo36

So, IIUC, you are proposing we:
Literally just query GPT-N about whether [input_outcome] is good or bad

I'm hesitant to say that I'm actually proposing literally this exact sequence as my suggestion for how we build safe human-level AGI, because (1) "GPT-N" can narrowly refer to a specific line of models by OpenAI whereas the way I was using it was more in-line with "generically powerful multi-modal models in the near-future", and (2) the actual way we build safe AGI will presumably involve a lot of engineering and tweaking to any such plan in ways that are difficult to predict and hard to write down comprehensively ahead of time. And if I were to lay out "the plan" in a few paragraphs, it would probably look pretty inadequate or too high-level compared to whatever people actually end up doing.

Also, I'm not ruling out that there might be an even better plan. Indeed, I hope there is a better plan available by the time we develop human-level AGI.

That said, with the caveats I've given above, yes, this is basically what I'm proposing, and I think there's a reasonably high chance (>50%) that this general strategy would work to my own satisfaction.

Can you say more about what you mean by solution to inner alignment?

To me, a solution to inner alignment would mean that we've solved the problem of malign generalization. To be a bit more concrete, this roughly means that we've solved the problem of training an AI to follow a set of objectives in a way that generalizes to inputs that are outside of the training distribution, including after the AI has been deployed.

For example, if you teach an AI (or a child) that murder is wrong, they should be able to generalize this principle to new situations that don't match the typical environment they were trained in, and be motivated to follow the principle in those circumstances. Metaphorically, the child grows up and doesn't want to murder people even after they've been given a lot of power over other people's lives. I think this can be distinguished from the problem of specifying what murder is, because the central question is whether the AI/child is motivated to pursue the ethics that was instilled during training, even in new circumstances, rather than whether they are simply correctly interpreting the command "do not murder".

Do you mean, assume that the agentic AGI (the mesa-optimizer) will learn to optimize for the objective of "producing outcomes the RM classifies as good?" Or the objective "producing outcomes the RM would classify as good if it was operating normally?"

I think I mean the second thing, rather than the first thing, but it's possible I am not thinking hard enough about this right now to fully understand the distinction you are making.