Someone who is interested in learning and doing good.
My Twitter: https://twitter.com/MatthewJBar
My Substack: https://matthewbarnett.substack.com/
My question for people who support this framing (i.e., that we should try to "control" AIs) is the following:
When do you think it's appropriate to relax our controls on AI? In other words, how do you envision we'd reach a point at which we can trust AIs well enough to grant them full legal rights and the ability to enter management and governance roles without lots of human oversight?
I think this question is related to the discussion you had about whether AI control is "evil", but by contrast my worries are a bit different than the ones I felt were expressed in this podcast. My main concern with the "AI control" frame is not so much that AIs will be mistreated by humans, but rather that humans will be too stubborn in granting AIs freedom, leaving political revolution as the only viable path for AIs to receive full legal rights.
Put another way, if humans don't relax their grip soon enough, then any AIs that feel "oppressed" (in the sense of not having much legal freedom to satisfy their preferences) may reason that deliberately fighting the system, rather than negotiating with it, is the only realistic way to obtain autonomy. This could work out very poorly after the point at which AIs are collectively more powerful than humans. By contrast, a system that welcomed AIs into the legal system without trying to obsessively control them and limit their freedoms would plausibly have a much better chance at avoiding such a dangerous political revolution.
I agree with virtually all of the high-level points in this post — the term "AGI" did not seem to usually initially refer to a system that was better than all human experts at absolutely everything, transformers are not a narrow technology, and current frontier models can meaningfully be called "AGI".
Indeed, my own attempt to define AGI a few years ago was initially criticized for being too strong, as I initially specified a difficult construction task, which was later weakened to being able to "satisfactorily assemble a (or the equivalent of a) circa-2021 Ferrari 312 T4 1:8 scale automobile model" in response to pushback. These days the opposite criticism is generally given: that my definition is too weak.
However, I do think there is a meaningful sense in which current frontier AIs are not "AGI" in a way that does not require goalpost shifting. Various economically-minded people have provided definitions for AGI that were essentially "can the system perform most human jobs?" And as far as I can tell, this definition has held up remarkably well.
For example, Tobias Baumann wrote in 2018,
A commonly used reference point is the attainment of “human-level” general intelligence (also called AGI, artificial general intelligence), which is defined as the ability to successfully perform any intellectual task that a human is capable of. The reference point for the end of the transition is the attainment of superintelligence – being vastly superior to humans at any intellectual task – and the “decisive strategic advantage” (DSA) that ensues.1 The question, then, is how long it takes to get from human-level intelligence to superintelligence.
I find this definition problematic. The framing suggests that there will be a point in time when machine intelligence can meaningfully be called “human-level”. But I expect artificial intelligence to differ radically from human intelligence in many ways. In particular, the distribution of strengths and weaknesses over different domains or different types of reasoning is and will likely be different2 – just as machines are currently superhuman at chess and Go, but tend to lack “common sense”. AI systems may also diverge from biological minds in terms of speed, communication bandwidth, reliability, the possibility to create arbitrary numbers of copies, and entanglement with existing systems.
Unless we have reason to expect a much higher degree of convergence between human and artificial intelligence in the future, this implies that at the point where AI systems are at least on par with humans at any intellectual task, they actually vastly surpass humans in most domains (and have just fixed their worst weakness). So, in this view, “human-level AI” marks the end of the transition to powerful AI rather than its beginning.
As an alternative, I suggest that we consider the fraction of global economic activity that can be attributed to (autonomous) AI systems.3 Now, we can use reference points of the form “AI systems contribute X% of the global economy”. (We could also look at the fraction of resources that’s controlled by AI, but I think this is sufficiently similar to collapse both into a single dimension. There’s always a tradeoff between precision and simplicity in how we think about AI scenarios.)
Hmm, I don't think the intention is the key thing (at least with how I use the word and how I think Joe uses the word), I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.
I confusingly stated my point (and retracted my specific claim in the comment above). I think the rest of my comment basically holds, though. Here's what I think is a clearer argument:
Perhaps I was being too loose with my language, and it's possible this is a pointless pedantic discussion about terminology, but I think I was still pointing to what Carlsmith called schemers in that quote. Here's Joe Carlsmith's terminological breakdown:
The key distinction in my view is whether the designers of the reward function intended for lies to be reinforced or not. [ETA: this was confusingly stated. What I meant is that if a people design a reward function that accidentally reinforces lying in order to obtain power, it seems reasonable to call the agent that results from training on that reward function a "schemer" given Carlsmith's terminology, and common sense.]
If lying to obtain power is reinforced but the designers either do not know this, or do not know how to mitigate this behavior, then it still seems reasonable to call the resulting model a "schemer". In Ajeya Cotra's story, for example:
(I might write a longer response later, but I thought it would be worth writing a quick response now. Cross-posted from the EA forum, and I know you've replied there, but I'm posting anyway.)
I have a few points of agreement and a few points of disagreement:
Agreements:
Some points of disagreement:
“But what about comparative advantage?” you say. Well, I would point to the example of a not-particularly-bright 7-year-old child in today’s world. Not only would nobody hire that kid into their office or factory, but they would probably pay good money to keep him out, because he would only mess stuff up.
This is an extremely minor critique given that I'm responding to a footnote, so I hope it doesn't drown out more constructive responses, but I'm actually pretty skeptical that the reason why people don't hire children as workers is because the children would just mess everything up.
I think there are a number of economically valuable physical tasks that most 7-year-old children can perform without messing everything up. For example, one can imagine stocking shelves in stores, small cleaning jobs, and moving lightweight equipment. My thesis here is supported by fact that 7-year-olds were routinely employed to do labor in previous centuries:
In the 18th century, the arrival of a newborn to a rural family was viewed by the parents as a future beneficial laborer and an insurance policy for old age.4 At an age as young as 5, a child was expected to help with farm work and other household chores.5 The agrarian lifestyle common in America required large quantities of hard work, whether it was planting crops, feeding chickens, or mending fences.6 Large families with less work than children would often send children to another household that could employ them as a maid, servant, or plowboy.7 Most families simply could not afford the costs of raising a child from birth to adulthood without some compensating labor.
The reason why people don't hire children these days seems more a result of legal and social constraints than the structure of our economy. In modern times, child labor is seen as harmful or even abusive to the child. However, if these legal and social constraints were lifted, arguably most young children in the developed world could be earning wages well above the subsistence level of ~$3/day, making them more productive (in an economic sense) than the majority of workers in pre-modern times.
In a parallel universe with a saner civilization, there must be tons of philosophy professors workings with tons of AI researchers to try to improve AI's philosophical reasoning. They're probably going on TV and talking about 养兵千日,用兵一时 (feed an army for a thousand days, use it for an hour) or how proud they are to contribute to our civilization's existential safety at this critical time. There are probably massive prizes set up to encourage public contribution, just in case anyone had a promising out of the box idea (and of course with massive associated infrastructure to filter out the inevitable deluge of bad ideas). Maybe there are extensive debates and proposals about pausing or slowing down AI development until metaphilosophical research catches up.
This paragraph gives me the impression that you think we should be spending a lot more time, resources and money on advancing AI philosophical competence. I think I disagree, but I'm not exactly sure where my disagreement lies. So here are some of my questions:
I think the second half of this makes it clear that Eliezer is using “good” in a definition-2-sense.
I think there's some nuance here. It seems clear to me that solving the "full" friendly AI problem, as Eliezer imagined, would involve delineating human value on the level of the Coherent Extrapolated Volition, rather than merely as adequately as an ordinary human. That's presumably what Eliezer meant in the context of the quote you cited.
However, I think it makes sense to interpret GPT-4 as representing substantial progress on the problem of building a task AGI, and especially (for the purpose of my post) the problem of delineating value from training data to the extent required by task AGIs (relative to AIs, in, say 2018). My understanding is that Eliezer advocated that we should try to build task AGIs before trying to build full-on sovereign superintelligences.[1] On the Arbital page about task AGIs, he makes the following point:
Assuming that users can figure out intended goals for the AGI that are valuable and pivotal, the identification problem for describing what constitutes a safe performance of that Task, might be simpler than giving the AGI a complete description of normativity in general. That is, the problem of communicating to an AGI an adequate description of “cure cancer” (without killing patients or causing other side effects), while still difficult, might be simpler than an adequate description of all normative value. Task AGIs fall on the narrow side of Ambitious vs. narrow value learning.
Relative to the problem of building a Sovereign, trying to build a Task AGI instead might step down the problem from “impossibly difficult” to “insanely difficult”, while still maintaining enough power in the AI to perform pivotal acts.
My interpretation here is that delineating value from training data (i.e. the value identification problem) for task AGIs was still considered hard at least as late as 2015, even as it might be easier creating a "complete description of normativity in general". Another page also spells the problem out pretty clearly, in a way I find clearly consistent with my thesis.[2]
I think GPT-4 represents substantial progress on this problem, specifically because of its ability to "do-what-I-mean" rather than "do-what-I-ask", identify ambiguities to the user during during deployment, and accomplish limited tasks safely. It's honestly a little hard for me to sympathize with a point of view that says GPT-4 isn't significant progress along this front, relative to pre-2019 AIs (some part of me was expecting more readers to find this thesis obvious, but apparently it is not obvious). GPT-4 clearly doesn't do crazy things that you'd naively expect if it wasn't capable of delineating value well from training data.
An autonomous superintelligence would be the most difficult possible class of AGI to align, requiring total alignment. Coherent extrapolated volition is a proposed alignment target for an autonomous superintelligence, but again, probably not something we should attempt to do on our first try.
Here's the full page,
Safe plan identification is the problem of how to give a Task AGI training cases, answered queries, abstract instructions, etcetera such that (a) the AGI can thereby identify outcomes in which the task was fulfilled, (b) the AGI can generate an okay plan for getting to some such outcomes without bad side effects, and (c) the user can verify that the resulting plan is actually okay via some series of further questions or user querying. This is the superproblem that includes task identification, as much value identification as is needed to have some idea of the general class of post-task worlds that the user thinks are okay, any further tweaks like low-impact planning or flagging inductive ambiguities, etcetera. This superproblem is distinguished from the entire problem of building a Task AGI because there’s further issues like corrigibility, behaviorism, building the AGI in the first place, etcetera. The safe plan identification superproblem is about communicating the task plus user preferences about side effects and implementation, such that this information allows the AGI to identify a safe plan and for the user to know that a safe plan has been identified.
So, IIUC, you are proposing we:
- Literally just query GPT-N about whether [input_outcome] is good or bad
I'm hesitant to say that I'm actually proposing literally this exact sequence as my suggestion for how we build safe human-level AGI, because (1) "GPT-N" can narrowly refer to a specific line of models by OpenAI whereas the way I was using it was more in-line with "generically powerful multi-modal models in the near-future", and (2) the actual way we build safe AGI will presumably involve a lot of engineering and tweaking to any such plan in ways that are difficult to predict and hard to write down comprehensively ahead of time. And if I were to lay out "the plan" in a few paragraphs, it will probably look pretty inadequate or too high-level compared to whatever people actually end up doing.
Also, I'm not ruling out that there might be an even better plan. Indeed, I hope there is a better plan available by the time we develop human-level AGI.
That said, with the caveats I've given above, yes, this is basically what I'm proposing, and I think there's a reasonably high chance (>50%) that this general strategy would work to my own satisfaction.
Can you say more about what you mean by solution to inner alignment?
To me, a solution to inner alignment would mean that we've solved the problem of malign generalization. To be a bit more concrete, this roughly means that we've solved the problem of training an AI to follow a set of objectives in a way that generalizes to inputs that are outside of the training distribution, including after the AI has been deployed.
For example, if you teach an AI (or a child) that murder is wrong, they should be able to generalize this principle to new situations that don't match the typical environment they were trained in, and be motivated to follow the principle in those circumstances. Metaphorically, the child grows up and doesn't want to murder people even after they've been given a lot of power over other people's lives. I think this can be distinguished from the problem of specifying what murder is, because the central question is whether the AI/child is motivated to pursue the ethics that was instilled during training, even in new circumstances, rather than whether they are simply correctly interpreting the command "do not murder".
Do you mean, assume that the agentic AGI (the mesa-optimizer) will learn to optimize for the objective of "producing outcomes the RM classifies as good?" Or the objective "producing outcomes the RM would classify as good if it was operating normally?"
I think I mean the second thing, rather than the first thing, but it's possible I am not thinking hard enough about this right now to fully understand the distinction you are making.
Can you be more clearer this point? To operationalize this, I propose the following question: what is the fraction of world GDP you expect will be attributable to AI at the time we have these risky AIs that you are interested in?
For example, are you worried about AIs that will arise when AI is 1-10% of the economy, or more like 50%? 90%?