I read most of this paper, albeit somewhat quickly and skipped a few sections. I appreciate how clear the writing is, and I want to encourage more AI risk proponents to write papers like this to explain their views. That said, I largely disagree with the conclusion and several lines of reasoning within it.
Here are some of my thoughts (although these are not my only disagreements):
I think the definition of “disempowerment” is vague in a way that fails to distinguish between e.g. (1) “less than 1% of world income goes to humans, but they have a high absolute standard of living and are generally treated well” vs. (2) “humans are in a state of perpetual impoverishment and oppression due to AIs and generally the future sucks for them”.
These are distinct scenarios with very different implications (under my values) for whether what happened is bad or good.
I think (1) is OK and I think it’s more-or-less the default outcome from AI, whereas I think (2) would be a lot worse and I find it less likely.
By not distinguishing between these things, the paper allows for a motte-and-bailey in which it shows that one (generic) range of outcomes could occur and then implies that it is bad, even though both good and bad scenarios are consistent with the set of outcomes demonstrated.
I think this quote is pretty confused and seems to rely partially on a misunderstanding of what people mean when they say that AGI cognition might be messy: “Second, even if human psychology is messy, this does not mean that an AGI’s psychology would be messy. It seems like current deep learning methodology embodies a distinction between final and instrumental goals. For instance, in standard versions of reinforcement learning, the model learns to optimize an externally specified reward function as best as possible. It seems like this reward function determines the model’s final goal. During training, the model learns to seek out things which are instrumentally relevant to this final goal. Hence, there appears to be a strict distinction between the final goal (specified by the reward function) and instrumental goals.”
Generally speaking, reinforcement learning shouldn’t be seen as directly encoding goals into models and thereby making them agentic, but should instead be seen as a process used to select models for how well they get reward during training.
Consequently, there’s no strong reason why reinforcement learning should create entities that have a clean psychological goal structure that is sharply different from and less messy than human goal structures. Cf. Models don’t “get reward”.
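To make that concrete, here is a toy REINFORCE-style sketch (my own illustration, not anything from the paper): the reward only ever appears as a coefficient in the update that selects among parameter settings, and the trained artifact is just a set of weights; nothing in it explicitly stores “maximize reward” as a represented final goal.

```python
import numpy as np

# Toy REINFORCE on a 2-armed bandit. Note where reward enters: it scales the
# gradient used to update the parameters. Reward acts as selection pressure on
# the weights during training; it is not an object the final model "receives".

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # policy parameters (logits over two actions)
true_means = np.array([0.2, 0.8])    # arm 1 pays more on average

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = rng.normal(true_means[action], 0.1)   # feedback from the environment
    grad_logp = -probs
    grad_logp[action] += 1.0                       # gradient of log pi(action | theta)
    theta += 0.05 * reward * grad_logp             # reward appears only here, as an update coefficient

print("final action probabilities:", softmax(theta))
```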
But I agree that future AIs could be agentic if we purposely intend for them to be agentic, including via extensive reinforcement learning.
I think this quote potentially indicates a flawed mental model of AI development underneath: “Moreover, I want to note that instrumental convergence is not the only route to AI capable of disempowering humanity which tries to disempower humanity. If sufficiently many actors will be able to build AI capable of disempowering humanity, including, e.g. small groups of ordinary citizens, then some will intentionally unleash AI trying to disempower humanity.”
I think this scenario is very implausible because AIs will very likely be developed by large entities with lots of resources (such as big corporations and governments) rather than e.g. small groups of ordinary citizens.
By the time small groups of less powerful citizens have the power to develop very smart AIs, we will likely already be in a world filled with very smart AIs. In this case, either human disempowerment already happened, or we’re in a world in which it’s much harder to disempower humans, because there are lots of AIs who have an active stake in ensuring this does not occur.
The last point is very important, and follows from a more general principle that the “ability necessary to take over the world” is not constant, but instead increases with the technology level. For example, if you invent a gun, that does not make you very powerful, because other people could have guns too. Likewise, simply being very smart does not give you overwhelming hard power over the rest of the world if the rest of the world is filled with very smart agents.
I think this quote overstates the value specification problem and ignores evidence from LLMs that this type of thing is not very hard: “There are two kinds of challenges in aligning AI. First, one needs to specify the goals the model should pursue. Second, one needs to ensure that the model robustly pursues those goals. The first challenge has been termed the ‘king Midas problem’ (Russell 2019). In a nutshell, human goals are complex, multi-faceted, diverse, wide-ranging, and potentially inconsistent. This is why it is exceedingly hard, if not impossible, to explicitly specify everything humans tend to care about.”
I don’t think we need to “explicitly specify everything humans tend to care about” into a utility function. Instead, we can have AIs learn human values by having them trained on human data.
This is already what current LLMs do. If you ask GPT-4 to execute a sequence of instructions, it rarely misinterprets you in a way that would imply improper goal specification. The more likely outcome is that GPT-4 will simply not be able to fulfill your request, not that it will execute a mis-specified sequence of instructions that satisfies the literal specification of what you said at the expense of what you intended.
Note that I’m not saying that GPT-4 merely understands what you’re requesting. I am saying that GPT-4 generally literally executes your instructions how you intended (an action, not a belief).
I think the argument about how instrumental convergence implies disempowerment proves too much. Lots of agents in the world don’t try to take over the world despite having goals that are not identical to the goals of other agents. If your claim is that powerful agents will naturally try to take over the world unless they are exactly aligned with the goals of the rest of the world, then I don’t think this claim is consistent with the existence of powerful sub-groups of humanity (e.g. large countries) that do not try to take over the world despite being very powerful.
You might reason, “Powerful sub-groups of humans are aligned with each other, which is why they don’t try to take over the world”. But I dispute this hypothesis:
First of all, I don’t think that humans are exactly aligned with the goals of other humans. I think that’s just empirically false in almost every way you could measure the truth of the claim. At best, humans are generally partially (not totally) aligned with random strangers—which could also easily be true of future AIs that are pretrained on our data.
Second of all, I think the most common view in social science is that powerful groups don’t constantly go to war and predate on smaller groups because there are large costs to war, rather than because of moral constraints. Attempting takeover is generally risky and not usually better in expectation than trying to trade, negotiate, compromise, and accumulate resources lawfully (e.g. a violent world takeover would involve a lot of pointless destruction of resources). This is distinct from the idea that human groups don’t try to take over the world because they’re aligned with human values (which I also think is too vague to evaluate meaningfully, if that’s what you’d claim).
You can’t easily counter by saying “no human group has the ability to take over the world”, because it is trivial to carve up subsets of humanity that control >99% of wealth and resources, which could in principle take control of the entire world if they became unified and decided to achieve that goal. These arbitrary subsets of humanity don’t attempt world takeover largely because they are not coordinated as a group, but AIs could similarly fail to be unified and coordinated around such a goal.
Thank you for the comment, very thought-provoking! I tried to make some reply to each of your comments, but there is much more one could say.
First, I agree that my notion of disempowerment could have been explicated more clearly, although my elucidations fit relatively straightforwardly with your second notion (mainly, perpetual oppression or extinction), not your first. I think conclusions (1) and (2) are both quite significant, although there are important ethical differences.
For the argument, the only case where this potential ambiguity makes a difference is with respect to premise 4 (the instrumental convergence premise). It would be interesting to spell out more which points there seem much more plausible with respect to notion (1) than to (2). If one has high credence in the view that AIs will decide to compromise with humans, rather than extinguish them, this would be one example of a view which leads to a much higher credence in (1) than in (2).
“I think this quote is pretty confused and seems to rely partially on a misunderstanding of what people mean when they say that AGI cognition might be messy…”
I agree that RL does not necessarily create agents with such a clean psychological goal structure, but I think that there is (maybe strong) reason to think that RL often creates such agents. Cases of reward hacking in RL algorithms are precisely cases where an algorithm exhibits such a relatively clean goal structure, single-mindedly pursuing a ‘stupid’ goal while being instrumentally rational and thus apparently having a clear distinction between final and instrumental goals. But, granted, this might depend on what is ‘rewarded’ (e.g. if it’s only a game score in a video game, the goal structure might be cleaner than when it is a variety of very different things), and on whether the relevant RL agents tend to learn goals over rewards or over some states of the world.
“I think this quote potentially indicates a flawed mental model of AI development underneath…”
Very good points. Nevertheless, it seems fair to say that it adds to the difficulty of avoiding disempowerment from misaligned AI that not only does the first sufficiently capable AI (AGI) have to avoid catastrophic misalignment, but all further AGIs have to either avoid this too or be stopped by the AGIs already in existence. This then relates to the questions of whether the first AGIs not only avoid catastrophic misalignment but are sufficiently aligned that we can use them to stop other AGIs, and of what the offense-defense balance would be. Could be that this works out, but it also does not seem very safe to me.
“I think this quote overstates the value specification problem and ignores evidence from LLMs that this type of thing is not very hard…”
I am less convinced that evidence from LLMs shows that value specification is not hard. As you hint at, the question of value specification was never taken to be whether a sufficiently intelligent AI can understand our values (of course it can, if it is sufficiently intelligent), but whether we can specify them as its goals (such that it comes to share them). In (e.g.) GPT-4 trained via RL from human feedback, it is true that it typically executes your instructions as intended. However, sometimes it doesn’t and, moreover, there are theoretical reasons to think that this would stop being the case if the system was sufficiently powerful to do an action which would maximize human feedback but which does not consist in executing instructions as intended (e.g., by deceiving human raters).
“I think the argument about how instrumental convergence implies disempowerment proves too much…”
I am not moved by the appeal to humans here that much. If we had a unified (coordination) human agent (goal-directedness) who does not care about the freedom and welfare of other humans at all (full misalignment) and is sufficiently powerful (capability), then it seems plausible to me that this agent would try to take control of humanity, often in a very bad sense (e.g. extinction). If we relax ‘coordination’ or ‘full misalignment’ as assumptions, then this seems hard to predict. I could still see this ending in an AI which tries to disempower humanity, but it’s hard to say.
“It would be interesting to spell out more which points there seem much more plausible with respect to notion (1) than to (2). If one has high credence in the view that AIs will decide to compromise with humans, rather than extinguish them, this would be one example of a view which leads to a much higher credence in (1) than in (2).”
I think the view that AIs will compromise with humans rather than go to war with them makes sense under the perspective shared by a large fraction (if not majority) of social scientists that war is usually costlier, riskier, and more wasteful than trade between rational parties with adequate levels of information, who have the option of communicating and negotiating successfully.
This is a general fact about war, and has little to do with the values of the parties going to war; cf. Fearon’s ‘Rationalist explanations for war’. Economic models of war do not generally predict war between parties that have different utility functions. On the contrary, a standard (simple) economic model of human behavior consists of viewing humans as entirely misaligned with other agents in the world, in the sense of having completely non-overlapping utility functions with random strangers. This model has been generalized to firms, countries, alliances etc., and yet it is rare for these generalized models to predict war as the default state of affairs.
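To make the bargaining logic concrete, here is a minimal sketch in the spirit of the costly-lottery model from that literature (my own toy numbers and code, not taken from any particular paper): even when the two sides have completely non-overlapping preferences over a divisible pie, any positive cost of fighting plus shared beliefs about the odds leaves a range of peaceful splits that both sides prefer to war.

```python
def bargaining_range(p_a, p_b, cost_a, cost_b):
    """Splits x in [0, 1] (side A's share of a unit-sized pie) that both sides prefer to war.
    Side A's expected payoff from fighting is p_a - cost_a; side B's is p_b - cost_b.
    A accepts any x >= p_a - cost_a; B accepts any x <= 1 - (p_b - cost_b)."""
    lo = p_a - cost_a
    hi = 1.0 - (p_b - cost_b)
    return (max(lo, 0.0), min(hi, 1.0)) if lo <= hi else None

# Fully opposed interests, shared beliefs about the odds (p_a + p_b = 1), modest costs:
# a nonempty range of mutually acceptable splits exists regardless of how misaligned the sides are.
print(bargaining_range(0.9, 0.1, 0.05, 0.05))   # (0.85, 0.95)
print(bargaining_range(0.5, 0.5, 0.02, 0.02))   # (0.48, 0.52)
```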
Usually when I explain this idea to people, I am met with skepticism that we can generalize these social science models to AI. But I don’t see why not: they are generally our most well-tested models of war. They are grounded in empirical facts and decades of observations, rather than evidence-free speculation (which I perceive as the primary competing alternative in AI risk literature). And most importantly, the assumptions of the models are robust to differences in power between agents, and misalignment between agents, which are generally the two key facts that people point to when arguing why these models are wrong when applied to AI. Yet this alleged distinction appears to merely reflect a misunderstanding of the modeling assumptions, rather than any key difference between humans and AIs.
What’s interesting to me is that many people generally have no problem generalizing these economic models to other circumstances. For example, we could ask:
Would genetically engineered humans try to disempower non-genetically engineered humans, or would they try to trade and compromise? (In my experience, most people predict trade and compromise, even as the genetically engineered humans become much smarter and evolve into a different subspecies.)
Would human emulations on computers try to disempower biological humans, or would they try to trade and compromise? (Again, in my experience, most people predict trade and compromise, even as the emulated civilization becomes vastly more powerful than biological humans.)
In each case, I generally encounter AI risk proponents claiming that what distinguishes these cases from the case of AI is that, in these cases, we can assume that the genetically engineered humans and human emulations will be “aligned” with human values, which adequately explains why they will attempt to compromise rather than go to war with the ordinary biological humans. But as I have already explained, standard economic models of war do not predict that war is constrained by alignment to human values, but is instead constrained by the costs of war, and the relative benefits of trade compared to war.
To the extent you think these economic models of war are simply incorrect, I think it is worth explicitly engaging with the established social science literature, rather than inventing a new model that makes unique predictions about what non-human AIs, which by definition do not share human values, would apparently do.
“In (e.g.) GPT-4 trained via RL from human feedback, it is true that it typically executes your instructions as intended. However, sometimes it doesn’t and, moreover, there are theoretical reasons to think that this would stop being the case if the system was sufficiently powerful to do an action which would maximize human feedback but which does not consist in executing instructions as intended (e.g., by deceiving human raters).”
It is true that GPT-4 “sometimes” fails to follow human instructions, but the same could be said about humans. I think it’s worth acknowledging the weight of the empirical evidence here regardless.
In my opinion the empirical evidence generally seems way stronger than the theoretical arguments, which (so far) seem to have had little success predicting when and how alignment would be difficult. For example, many people believed that AGI would be achieved by the time AIs were having natural conversations with humans (e.g. Eliezer Yudkowsky implied as much in his essay about a fire alarm[1]). According to this prediction, we should already have been having pretty severe misspecification problems if such problems were supposed to arise at AGI level. And yet, I claim, we are not having these severe problems (and instead, we are merely having modestly difficult problems that can be patched with sufficient engineering effort).
It is true that problems of misspecification should become more difficult as AIs get smarter. However, it’s important to recognize that as AI capabilities grow, so too will our tools and methods for tackling these alignment challenges. One key factor is that we will have increasingly intelligent AI systems that can assist us in the alignment process itself. To illustrate this point concretely, let’s walk through a hypothetical scenario:
Suppose that aligning a human-level artificial general intelligence (AGI) merely requires a dedicated team of human alignment researchers. This seems generally plausible given that evaluating output is easier than generating novel outputs (see this article that goes into more detail about this argument and why it’s relevant). Once we succeed in aligning that human-level AGI system, we can then leverage it to help us align the next iteration of AGI that is slightly more capable than human-level (let’s call it AGI+). We would have a team of aligned human-level AGIs working on this challenge with us.
Then, when it comes to aligning the following iteration, AGI++ (which is even more intelligent), we can employ the AGI+ systems we previously aligned to work on this next challenge. And so on, with each successive generation of AI systems helping us to align the next, even more advanced generation.
It seems plausible that this cycle of AI systems assisting in the alignment of future, more capable systems could continue for a long time, allowing us to align AIs of ever-increasing intelligence without at any point needing mere humans to solve the problem of superintelligent alignment alone. If at some point this cycle becomes unsustainable, we can expect the highly intelligent AI advisors we have at that point to warn us about the limits of this approach. This would allow us to recognize when we are reaching the limits of our ability to maintain reliable alignment.
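As a purely structural sketch of this cycle (all of the names below are hypothetical placeholders, and the hard parts are simply assumed to live inside align() and warns_of_limits()):

```python
# A structural sketch of the bootstrapping scenario above. Every name here is a
# hypothetical placeholder; the real difficulty lives inside align() and
# warns_of_limits(), which this sketch simply assumes are available.

def bootstrap_alignment(human_team, build_next_generation, align, warns_of_limits, max_generations):
    helpers = list(human_team)        # generation 0: human alignment researchers only
    aligned_systems = []
    for gen in range(max_generations):
        candidate = build_next_generation(gen)            # slightly more capable than the last generation
        aligned = align(candidate, assistants=helpers)    # humans plus all previously aligned systems assist
        aligned_systems.append(aligned)
        helpers.append(aligned)
        if warns_of_limits(aligned_systems):              # trusted advisors flag that the next step may be unreliable
            break
    return aligned_systems
```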
[1] Full quote from Eliezer: “When they are very impressed by how smart their AI is relative to a human being in respects that still feel magical to them; as opposed to the parts they do know how to engineer, which no longer seem magical to them; aka the AI seeming pretty smart in interaction and conversation; aka the AI actually being an AGI already.”
I’ve looked into the game theory of war literature a bit, and my impression is that economists are still pretty confused about war. As you mention, the simplest model predicts that rational agents should prefer negotiated settlements to war, and it seems unsettled what actually causes wars among humans. (People have proposed more complex models incorporating more elements of reality, but AFAIK there isn’t a consensus as to which model gives the best explanation of why wars occur.) I think it makes sense to be aware of this literature and its ideas, but there’s not a strong argument for deferring to it over one’s own ideas or intuitions.
My own thinking is that war between AIs and humans could happen in many ways. One simple (easy to understand) way is that agents will generally refuse a settlement worse than what they think they could obtain on their own (by going to war), so human irrationality could cause a war when e.g. the AI faction thinks it will win with 99% probability, and humans think they could win with 50% probability, so each side demands more of the lightcone (or resources in general) than the other side is willing to grant.
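As a quick numeric sketch of this mechanism (my own illustrative numbers): if each side rejects any deal worse than its own expected payoff from fighting, mutually inconsistent beliefs about who would win can leave no split of the resources that both sides will accept, unless the perceived costs of war are very large.

```python
def acceptable_splits(p_ai, p_human, cost_ai, cost_human):
    """Shares of the contested resources going to the AI faction that both sides
    prefer to war, given each side's own belief about its chance of winning."""
    lo = p_ai - cost_ai                 # AI faction rejects anything below its expected war payoff
    hi = 1.0 - (p_human - cost_human)   # humans reject anything leaving them less than their war payoff
    return (lo, hi) if lo <= hi else None

print(acceptable_splits(0.99, 0.50, 0.10, 0.10))  # None: no deal both sides prefer to fighting
print(acceptable_splits(0.60, 0.40, 0.10, 0.10))  # (0.5, 0.7): consistent beliefs leave room for a peaceful split
```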
To take this one step further, I would say that given that many deviations from the simplest game theoretic model do predict war, war among consequentialist agents may well be the default in some sense. Also, given that humans often do (or did) go to war with each other, our shared values (i.e. the extent to which we do have empathy/altruism for others) must contribute to the current relative peace in some way.
“My own thinking is that war between AIs and humans could happen in many ways. One simple (easy to understand) way is that agents will generally refuse a settlement worse than what they think they could obtain on their own (by going to war), so human irrationality could cause a war when e.g. the AI faction thinks it will win with 99% probability, and humans think they could win with 50% probability, so each side demands more of the lightcone (or resources in general) than the other side is willing to grant.”
This generally makes sense to me. I also think human irrationality could prompt a war with AIs. I don’t disagree with the claim insofar as you’re claiming that such a war is merely plausible (say >10% chance), rather than a default outcome. (Although to be clear, I don’t think such a war would likely cut cleanly along human vs. AI lines.)
On the other hand, humans are already irrational, and yet human vs. human wars are not the default (they happen frequently, but at any given time the vast majority of humans on Earth are not in a warzone or fighting in an active war). It’s not clear to me why the human vs. AI case would make war more likely to occur than the human vs. human case, if by assumption the main difference here is that one side is more rational.
In other words, if we’re moving from a situation of irrational parties vs. other irrational parties to irrational parties vs. rational parties, I’m not sure why we’d expect this change to make things more warlike and less peaceful as a result. You mention one potential reason:
“Also, given that humans often do (or did) go to war with each other, our shared values (i.e. the extent to which we do have empathy/altruism for others) must contribute to the current relative peace in some way.”
I don’t think this follows. Humans presumably also had empathy in e.g. 1500, back when war was more common, so how could it explain our current relative peace?
Perhaps you mean that cultural changes caused our present time period to be relatively peaceful. But I’m not sure about that; or at least, the claim should probably be made more specific. There are many things about the environment that have changed since our relatively more warlike ancestors, and (from my current perspective) I think it’s plausible that any one of them could have been the reason for our current relative peace. That is, I don’t see a good reason to single out human values or empathy as the main cause in itself.
For example, humans are now a lot richer per capita, which might mean that people have “more to lose” when going to war, and thus are less likely to engage in it. We’re also a more globalized culture, and our economic system relies more on long-distance trade than it did in the past, making war more costly. We’re also older, in the sense that the median age is higher (and old people are less likely to commit violence), and women, who are perhaps less likely to support hawkish politicians, got the right to vote.
To be clear, I don’t put much confidence in any of these explanations. As of now, I’m very uncertain about why the 21st century seems relatively peaceful compared to the distant past. However I do think that:
None of the explanations I’ve given above seem well-described as “our values/empathy” made us less warlike. And to the extent our values changed, I expect that was probably downstream of more fundamental changes, like economic growth and globalization, rather than being an exogenous change that was independent of these effects.
To the extent that changing human nature explains our current relatively peaceful era, this position seems to require that you believe human nature is fundamentally quite plastic and can be warped over time pretty easily due to cultural changes. If that’s true, human nature is ultimately quite variable, perhaps more similar to AI than you might have otherwise thought (as both are presumably pushed around easily by training data).
“It’s not clear to me why the human vs. AI case would make war more likely to occur than the human vs. human case, if by assumption the main difference here is that one side is more rational.”
We have more empirical evidence that we can look at when it comes to human-human wars, making it easier to have well-calibrated beliefs about chances of winning. When it comes to human-AI wars, we’re more likely to have wildly irrational beliefs.
This is just one reason war could occur, though. Perhaps a more likely reason is that there won’t be a way to maintain the peace that both sides can be convinced will work and that is sufficiently cheap that its cost doesn’t eat up all of the gains from avoiding war. For example, how would the human faction know that if it agrees to peace, the AI faction won’t fully dispossess the humans at some future date when it’s even more powerful? Even if AIs are able to come up with some workable mechanisms, how would the humans know that it’s not just a trick?
Without credible assurances (which seem hard to come by), I think that if humans do agree to peace, the most likely outcome is that they do get dispossessed in the not-too-distant future, either gradually (for example by getting scammed/persuaded/blackmailed/stolen from in various ways), or all at once. I think society as a whole won’t have a strong incentive to protect humans, because they’ll be almost pure consumers (not producing much relative to what they consume), and such classes of people are often killed or dispossessed in human history (e.g., landlords after communist takeovers).
“I don’t think this follows. Humans presumably also had empathy in e.g. 1500, back when war was more common, so how could it explain our current relative peace?”
I mainly mean that without empathy/altruism, we’d probably have even more wars, both now and then.
“To the extent that changing human nature explains our current relatively peaceful era, this position seems to require that you believe human nature is fundamentally quite plastic and can be warped over time pretty easily due to cultural changes.”
Well, yes, I’m also pretty scared of this. See this post where I talked about something similar. I guess overall I’m still inclined to push for a future where “AI alignment” and “human safety” are both solved, instead of settling for one in which neither is (which I’m tempted to summarize your position as, but I’m not sure if I’m being fair).
“I guess overall I’m still inclined to push for a future where “AI alignment” and “human safety” are both solved, instead of settling for one in which neither is (which I’m tempted to summarize your position as, but I’m not sure if I’m being fair).”
For what it’s worth, I’d loosely summarize my position on this issue as being that I mainly think of AI as a general vehicle for accelerating technological and economic growth, along with accelerating things downstream of technology and growth, such as cultural change. And I’m skeptical we could ever fully “solve alignment” in the ambitious sense you seem to be imagining.
In this frame, it could be good to slow down AI if your goal is to delay large changes to the world. There are plausible scenarios in which this could make sense. Perhaps most significantly, one could be a cultural conservative and think that cultural change is generally bad in expectation, and thus more change is bad even if it yields higher aggregate prosperity sooner in time (though I’m not claiming this is your position).
Whereas, by contrast, I think cultural change can be bad, but I don’t see much reason to delay it if it’s inevitable. And the case against delaying AI seems even stronger here if you care about preserving (something like) the lives and values of people who currently exist, as AI offers the best chance of extending our lifespans, and “putting us in the driver’s seat” more generally by allowing us to actually be there during AGI development.
If future humans were in the driver’s seat instead, but with slightly more control over the process, I wouldn’t necessarily see that as being significantly better in expectation compared to my favored alternative, including over the very long run (according to my values).
(And as a side note, I also care about influencing human values, or what you might term “human safety”, but I generally see this as orthogonal to this specific discussion.)
“If future humans were in the driver’s seat instead, but with slightly more control over the process”
Why only “slightly” more control? It’s surprising to see you say this without giving any reasons or linking to some arguments, as this degree of alignment difficulty seems like a very unusual position that I’ve never seen anyone argue for before.
I’m a bit surprised you haven’t seen anyone make this argument before. To be clear, I wrote the comment last night on a mobile device, and it was intended to be a brief summary of my position, which perhaps explains why I didn’t link to anything or elaborate on that specific question. I’m not sure I want to outline my justifications for my view right now, but my general impression is that civilization has never had much central control over cultural values, so it’s unsurprising if this situation persists into the future, including with AI. Even if we align AIs, cultural and evolutionary forces can nonetheless push our values far from where they are now. Does that brief explanation provide enough of a pointer to what I’m saying for you to be ~satisfied? I know I haven’t said much here, but I kind of doubt my view on this issue is so rare that you’ve literally never seen someone present a case for it.
“Where the main counterargument is that now the groups in power can be immortal and digital minds will be possible. See also: AGI and Lock-in.”

I have some objections to the idea that groups will be “immortal” in the future, in the sense of never changing, dying, or rotting, and persisting over time in a roughly unchanged form, exerting consistent levels of power over a very long time period. To be clear, I do think AGI can make some forms of value lock-in more likely, but I want to distinguish a few different claims:
(1) is a future value lock-in likely to occur at some point, especially not long after human labor has become ~obsolete?
(2) is lock-in more likely if we perform, say, a century more of technical AI alignment research, before proceeding forward?
(3) is it good to make lock-in more likely by, say, delaying AI by 100 years to do more technical alignment research, before proceeding forward? (i.e., will it be good or bad to do this type of thing?)
My quick and loose current answers to these questions are as follows:
(1) This seems plausible but unlikely to me in a strong form. Some forms of lock-in seem likely; I’m more skeptical of the more radical scenarios people have talked about.
(2) I suspect lock-in would become more likely in this case, but the marginal effect of more research would likely be pretty small.
(3) I am pretty uncertain about this question, but I lean towards being against deliberately aiming for this type of lock-in. I am inclined to this view for a number of reasons, but one reason is that this policy seems to make it more likely that we restrict innovation and experience system rot on a large scale, causing the future to be much bleaker than it otherwise could be. See also Robin Hanson’s post on world government rot.