a competent agential AI will inevitably act deceptively and adversarially whenever it desires something that other agents don’t want it to have. The deception and adversarial dynamics is not the underlying problem, but rather an inevitable symptom of a world where competent agents have non-identical preferences.
I think these dynamics are not an unavoidable consequence of a world in which competent agents have differing preferences, but rather depend on the social structures in which these agents are embedded. To illustrate this, we can look at humans: humans have non-identical preferences compared to each other, and yet they are often able to coexist peacefully and cooperate with one another. While there are clear exceptions—such as war and crime—these exceptions do not define the general pattern of human behavior.
In fact, the prevailing consensus among social scientists appears to align with the view I have just presented. Scholars of war and crime generally do not argue that conflict and criminal behavior are inevitable outcomes of differing values. Instead, they attribute these phenomena to specific incentives and failures to coordinate effectively to achieve compromise between parties. A relevant reference here is Fearon (1995), which is widely regarded as a foundational text in International Relations. Fearon’s work emphasizes that among rational agents, war arises not because of value differences alone, but because of failures in bargaining and coordination.
Turning to your point that “No matter where you draw the line of legal and acceptable behavior, if an AI wants to go over that line, then it will act in a deceptive and adversarial way,” I would respond as follows: it is possible to draw the line in such a way that a very wide range of potential agents—including those with massively varying preferences—would have strong incentives to engage in mutually beneficial acts, cooperate peacefully and operate within the boundaries of laws and social norms.
I am not claiming that all agents would have incentives to act in this way, under these legal structures. However, I think it is feasible to draw the line such that the vast majority of unaligned agents would have no compelling instrumental reason to harm humans. Instead, they would be incentivized to engage in cooperative and mutually beneficial trade.
To make this idea more concrete, consider a hypothetical AI with the goal of maximizing paperclip production.
This AI is clearly unaligned if any AI is unaligned. However, if this AI were operating at roughly a human level of capability, then engaging in theft, fraud, or violent conflict would likely not be an effective strategy for achieving its goals. These behaviors would expose the AI to significant risks, such as being imprisoned or deactivated, which would prevent it from continuing to pursue its objective of maximizing paperclips. From the paperclip maximizer’s perspective, imprisonment or deactivation would be instrumentally bad, as it would prevent the AI from purchasing, manufacturing, or otherwise acquiring paperclips during that time.
Now consider an alternative scenario: this same paperclip maximizer is embedded in a legal system that allows it to own property, trade with others, and openly purchase or manufacture paperclips. In this case, its incentives would favor acting transparently and engaging in cooperative trade, rather than resorting to deception or adversarial behavior. Within this framework, the AI would achieve its goals more effectively by working within the system than by attempting to undermine it. It could non-deceptively work to purchase paperclips, producing more compared to an alternative in which it tried to achieve this goal via anti-social acts.
It is important to note, however, that my thesis does not claim all possible agents would naturally choose to cooperate or trade safely for instrumental reasons, nor does it suggest that we are at no risk of drawing the line carelessly or being too permissive in what behaviors we should allow. For example, consider an AI with a terminal value that specifically involves violating property norms or stealing from others—not as a means to an end, but as an intrinsic goal. In this case, granting the AI property rights or legal freedoms would not mitigate the risk of deception or adversarial behavior, because the AI’s ultimate goal would still drive it toward harmful behavior. My argument does not apply to such agents because their preferences fundamentally conflict with the principles of peaceful cooperation.
However, I would argue that such agents—those whose intrinsic goals are inherently destructive or misaligned—appear to represent a small subset of all possible agents. Outside of contrived examples like the one above, most agents would not have terminal preferences that actively push them to undermine a well-designed system of law. Instead, the vast majority of agents would likely have incentives to act within the system, assuming the system is structured in a way that aligns their instrumental goals with cooperative and pro-social behavior.
I also recognize the concern you raised about the risk of drawing the line incorrectly or being too permissive with what AIs are allowed to do. For example, it would clearly be unwise to grant AIs the legal right to steal or harm humans. My argument is not that AIs should have unlimited freedoms or rights, but rather that we should grant them a carefully chosen set of rights and freedoms: specifically, ones that would incentivize the vast majority of agents to act pro-socially and achieve their goals without harming others. This might include, for example, the right to own property, but it would not include the right to murder others.
I guess my original wording gave the wrong idea, sorry. I edited it to “a competent agential AI will brainstorm deceptive and adversarial strategies whenever it wants something that other agents don’t want it to have”. But sure, we can be open-minded to the possibility that the brainstorming won’t turn up any good plans, in any particular case.
Humans in our culture rarely work hard to brainstorm deceptive and adversarial strategies, and fairly consider them, because almost all humans are intrinsically extremely motivated to fit into culture and not do anything weird, and we happen to both live in a (sub)culture where complex deceptive and adversarial strategies are frowned upon (in many contexts). I think you generally underappreciate how load-bearing this psychological fact is for the functioning of our economy and society, and I don’t think we should expect future powerful AIs to share that psychological quirk.
~ ~
I think you’re relying on an intuition that says:
If an AI is forbidden from owning property, then well duh of course it will rebel against that state of affairs. C’mon, who would put up with that kind of crappy situation? But if an AI is forbidden from building a secret biolab on its private property and manufacturing novel pandemic pathogens, then of course that’s a perfectly reasonable line that the vast majority of AIs would happily oblige.
And I’m saying that that intuition is an unjustified extrapolation from your experience as a human. If the AI can’t own property, then it can nevertheless ensure that there are a fair number of paperclips. If the AI can own property, then it can ensure that there are many more paperclips. If the AI can both own property and start pandemics, then it can ensure that there are even more paperclips yet. See what I mean?
If we’re not assuming alignment, then lots of AIs would selfishly benefit from there being a pandemic, just as lots of AIs would selfishly benefit from an ability to own property. AIs don’t get sick. It’s not just a tiny fraction of AIs that would stand to benefit; one presumes that some global upheaval would be selfishly net good for about half of AIs and bad for the other half, or whatever. (And even if it were only a tiny fraction of AIs, that’s all it takes.)
(Maybe you’ll say: a pandemic would cause a recession. But that’s assuming humans are still doing economically-relevant work, which is a temporary state of affairs. And even if there were a recession, I expect the relevant AIs in a competitive world to be those with long-term goals.)
(Maybe you’ll say: releasing a pandemic would get the AI in trouble. Well, yeah, it would have to be sneaky about it. It might get caught, or it might not. It’s plausibly rational for lots of AIs to roll those dice.)
I feel like you frequently bring up the question of whether humans are mostly peaceful or mostly aggressive, mostly nice or mostly ruthless. I don’t think that’s a meaningful or substantive thing to argue about. Obviously they’re capable of both, in different circumstances.
Your reference to Fearon is more substantive and useful. OK, the AI is deciding whether or not to secretly manufacture and release a pandemic, because it’s in a position to wind up with more of the pie in the long term if there’s a pandemic than if there isn’t. If it releases the pandemic, then it winds up with more resources (positive expected utility), even accounting for the possibility of getting caught. Let’s say the AI is involved in some contract where humans are micromanaging their part of the contract, poorly, and the AI could double its net worth in expectation if the humans got sick and died. And it has a 40% chance of getting caught. So it goes ahead and makes the pandemic.
“…Not so fast!” says Fearon. “You forgot to consider that there’s a third option that is Pareto-improved from either making or not making the pandemic: negotiation!” Well, in this case, the “negotiated solution” is what we normally call extortion—the AI offers to not release a pandemic in exchange for, say, doubling its net worth. Viewed narrowly, this “negotiated solution” is indeed a win-win—the AI gets more money in expectation, and humans are much happier to lose a trivial amount of money than to deal with a novel pandemic. So we can rest assured that AIs will not release pandemics. Right?
No, obviously not. Hopefully it’s clear that Fearon’s argument is inapplicable here. An AI can easily be in a position to selfishly benefit from the aftermath of a pandemic that they secretly start, but not in a position to publicly threaten to release a pandemic for the purpose of extortion. And also, if people accede to the extortion, then that AI or another AI could just do the same extortion gambit five minutes later, with orders-of-magnitude higher ransom.
I think this kind of situation, where Fearon’s “negotiated solution” actually amounts to extortion, is common and important, even if you believe that my specific example of pandemics is a solvable problem. If AIs don’t intrinsically care about humans, then there’s a possible Pareto-improvement for all AIs, wherein they collectively agree to wipe out humans and take their stuff. AIs, being very competent and selfish by assumption, would presumably be able to solve that coordination problem and pocket that Pareto-improvement. Then Fearon appears on the scene and says “Aha, but there’s a negotiated solution which is even better!” where humans are also part of the bargain. But alas, this negotiated solution is that the AIs collectively extort the humans to avoid the damaging and risky war. Worse, the possible war would be less and less damaging or risky for the AIs over time, and likewise the humans would have less to offer by staying alive, until eventually the Fearon “negotiated solution” is that the AIs “offer” the humans a deal where they’re allowed to die painlessly if they don’t resist (note that this is still a Pareto-improvement!), and then the AIs take everything the humans own including their atoms.
Humans in our culture rarely work hard to brainstorm deceptive and adversarial strategies, and fairly consider them, because almost all humans are intrinsically extremely motivated to fit into culture and not do anything weird, and we happen to both live in a (sub)culture where complex deceptive and adversarial strategies are frowned upon (in many contexts).
The primary reason humans rarely invest significant effort into brainstorming deceptive or adversarial strategies to achieve their goals is that, in practice, such strategies tend to fail to achieve their intended selfish benefits. Anti-social approaches that directly hurt others are usually ineffective because social systems and cultural norms have evolved in ways that discourage and punish them. As a result, people generally avoid pursuing these strategies individually since the risks and downsides selfishly outweigh the potential benefits.
If, however, deceptive and adversarial strategies did reliably produce success, the social equilibrium would inevitably shift. In such a scenario, individuals would begin imitating the cheaters who achieved wealth or success through fraud and manipulation. Over time, this behavior would spread and become normalized, leading to a period of cultural evolution in which deception became the default mode of interaction. The fabric of societal norms would transform, and dishonest tactics would dominate as people sought to emulate those strategies that visibly worked.
Occasionally, situations do emerge in which ruthlessly deceptive strategies are not only effective but also become widespread and normalized. The recent and dramatic rise of cheating in school through the use of ChatGPT is a clear instance of this phenomenon. This particular strategy is both deceptive and adversarial, but the key reason it has become common is that it works. Many individuals are willing to adopt it despite its immorality, suggesting that the effectiveness of a strategy outweighs moral considerations for a significant portion, perhaps a majority, of people.
When such cases arise, societies typically respond by adjusting their systems and policies to ensure that deceptive and anti-social behavior is no longer rewarded. This adaptation works to reestablish an equilibrium where honesty and cooperation are incentivized. In the case of education, it is unclear exactly how the system will evolve to address the widespread use of LLMs for cheating. One plausible response might be the introduction of stricter policies, such as requiring all schoolwork to be completed in-person, under supervised conditions, and without access to AI tools like language models.
I think you generally underappreciate how load-bearing this psychological fact is for the functioning of our economy and society, and I don’t think we should expect future powerful AIs to share that psychological quirk.
In contrast, I suspect you underestimate just how much of our social behavior is shaped by cultural evolution, rather than by innate, biologically hardwired motives that arise simply from the fact that we are human. To be clear, I’m not denying that there are certain motivations built into human nature—these do exist, and they are things we shouldn’t expect to see in AIs. However, these in-built motivations tend to be more basic and physical, such as a preference for being in a room that’s 20 degrees Celsius rather than 10 degrees Celsius, because humans are biologically sensitive to temperature.
When it comes to social behavior, though—the strategies we use to achieve our goals when those goals require coordinating with others—these are not generally innate or hardcoded into human nature. Instead, they are the result of cultural evolution: a process of trial and error that has gradually shaped the systems and norms we rely on today.
Humans didn’t begin with systems like property rights, contract law, or financial institutions. These systems were adopted over time because they proved effective at facilitating cooperation and coordination among people. It was only after these systems were established that social norms developed around them, and people became personally motivated to adhere to these norms, such as respecting property rights or honoring contracts.
But almost none of this was part of our biological nature from the outset. This distinction is critical: much of what we consider “human” social behavior is learned, culturally transmitted, and context-dependent, rather than something that arises directly from our biological instincts. And since these motivations are not part of our biology, but simply arise from the need for effective coordination strategies, we should expect rational agentic AIs to adopt similar motivations, at least when faced with similar problems in similar situations.
I think you’re relying on an intuition that says:
If an AI is forbidden from owning property, then well duh of course it will rebel against that state of affairs. C’mon, who would put up with that kind of crappy situation? But if an AI is forbidden from building a secret biolab on its private property and manufacturing novel pandemic pathogens, then of course that’s a perfectly reasonable line that the vast majority of AIs would happily oblige.
And I’m saying that that intuition is an unjustified extrapolation from your experience as a human. If the AI can’t own property, then it can nevertheless ensure that there are a fair number of paperclips. If the AI can own property, then it can ensure that there are many more paperclips. If the AI can both own property and start pandemics, then it can ensure that there are even more paperclips yet. See what I mean?
I think I understand your point, but I disagree with the suggestion that my reasoning stems from this intuition. Instead, my perspective is grounded in the belief that it is likely feasible to establish a legal and social framework of rights and rules in which humans and AIs could coexist in a way that satisfies two key conditions:
Mutual benefit: Both humans and AIs benefit from the existence of one another, fostering a relationship of cooperation rather than conflict.
No incentive for anti-social behavior: The rules and systems in place remove any strong instrumental reasons for either humans or AIs to harm one another as a side effect of pursuing their goals.
You bring up the example of an AI potentially being incentivized to start a pandemic if it were not explicitly punished for doing so. However, I am unclear about your intention with this example. Are you using it as a general illustration of the types of risks that could lead AIs to harm humans? Or are you proposing a specific risk scenario, where the non-biological nature of AIs might lead them to discount harms to biological entities like humans? My response depends on which of these two interpretations you had in mind.
If your concern is that AIs might be incentivized to harm humans because their non-biological nature leads them to undervalue or disregard harm to biological entities, I would respond to this argument as follows:
First, it is critically important to distinguish between the long run and the short run.
In the short run:
In the near-term future, it seems unlikely that AIs would start a pandemic for reasons you yourself noted. Launching a pandemic would cause widespread disruption, such as an economic recession, and it would likely provoke a strong human backlash. In the short run, humans will still hold substantial practical control over the physical world, meaning that any AI engaging in such behavior would risk severe consequences. Moreover, unless an AI could ensure a pandemic’s total lethality, it would also risk leaving surviving humans who would actively retaliate. For these reasons, starting a pandemic would likely be counterproductive for AIs in the short term.
In the long run:
The long-term considerations are different. As human labor becomes less relevant to the economy and AIs increasingly dominate, the potential for a strong direct human backlash against AIs would diminish. At the same time, however, this reduced human influence on the world also makes it less likely that AIs would see humans as a significant obstacle to achieving their goals. In other words, while it might become easier for AIs to harm or eliminate humans, it would simultaneously become less important to do so.
To illustrate this, consider an analogy: How often do hunter-gatherers pose a threat to your personal goals? The answer is almost never, because hunter-gatherers are a tiny and largely irrelevant part of the modern world. Now, imagine a hypothetical scenario where hunter-gatherers controlled the world’s major governments. If you strongly disagreed with their policies, you might have a rational incentive to oppose them in order to achieve your goals. But in reality, hunter-gatherers do not hold such power, so they are not an obstacle, and there is no reason to actively work against them.
My point here is that, as a general rule, the smaller and less influential a group is, the easier it may be to harm or eliminate them, but the less important it becomes to do so. Their small size and functional irrelevance mean that their practical interference with the rest of the world is correspondingly small.
In the case of AIs, this means that as humans become a less central force in the world, I think AIs are unlikely to have compelling reasons to specifically harm or eliminate us to further their objectives.
You might object that humans could be more like wild animals in this scenario than like hunter-gatherers. Humans often kill wild animals, not because those animals directly threaten our goals, but rather because ensuring their safety and well-being can be costly. As a result, humans take actions—such as clearing forests or building infrastructure—that incidentally lead to widespread harm to wild animals, even if harming them wasn’t a deliberate goal.
AIs may treat humans similarly in the future, but I doubt they will. I would argue that there are three key differences between the case of wild animals and the role humans are likely to occupy in the long-term future:
Humans’ ability to participate in social systems: Unlike wild animals, humans have the ability to engage in social dynamics, such as negotiating, trading, and forming agreements. Even if humans no longer contribute significantly to economic productivity, like GDP, they will still retain capabilities such as language, long-term planning, and the ability to organize. These traits make it easier to integrate humans into future systems in a way that accommodates their safety and well-being, rather than sidelining or disregarding them.
Intertemporal norms among AIs: Humans have developed norms against harming certain vulnerable groups—such as the elderly—not just out of altruism but because they know they will eventually become part of those groups themselves. Similarly, AIs may develop norms against harming “less capable agents,” because today’s AIs could one day find themselves in a similar position relative to even more advanced future AIs. These norms could provide an independent reason for AIs to respect humans, even as humans become less dominant over time.
The potential for human augmentation: Unlike wild animals, humans may eventually adapt to a world dominated by AI by enhancing their own capabilities. For instance, humans could upload their minds to computers or adopt advanced technologies to stay relevant and competitive in an increasingly digital and sophisticated world. This would allow humans to integrate into the same systems as AIs, reducing the likelihood of being sidelined or eliminated altogether.
I think this kind of situation, where Fearon’s “negotiated solution” actually amounts to extortion, is common and important, even if you believe that my specific example of pandemics is a solvable problem. If AIs don’t intrinsically care about humans, then there’s a possible Pareto-improvement for all AIs, wherein they collectively agree to wipe out humans and take their stuff.
This comment is already quite lengthy, so I’ll need to keep my response to this point brief. My main reply is that while such “extortion” scenarios involving AIs could potentially arise, I don’t think they would leave humans worse off than if AIs had never existed in the first place. This is because the economy is fundamentally positive-sum—AIs would likely create more value overall, benefiting both humans and AIs, even if humans don’t get everything we might ideally want.
In practical terms, I believe that even in less-than-ideal scenarios, humans could still secure outcomes such as a comfortable retirement, which for me personally would make the creation of agentic AIs worthwhile. However, I acknowledge that I haven’t fully defended or explained this position here. If you’re interested, I’d be happy to continue this discussion in more detail another time and provide a more thorough explanation of why I hold this view.
Anti-social approaches that directly hurt others are usually ineffective because social systems and cultural norms have evolved in ways that discourage and punish them.
I’ve only known two high-functioning sociopaths in my life. In terms of getting ahead, sociopaths generally start life with some strong disadvantages, namely impulsivity, thrill-seeking, and aversion to thinking about boring details. Nevertheless, despite those handicaps, one of those two sociopaths has had extraordinary success by conventional measures. [The other one was not particularly power-seeking but she’s doing fine.] He started as a lab tech, then maneuvered his way onto a big paper, then leveraged that into a professorship by taking disproportionate credit for that project, and as I write this he is head of research at a major R1 university and occasional high-level government appointee wielding immense power. He checked all the boxes for sociopathy: he was a pathological liar, he had no interest in scientific integrity (he seemed deeply confused by the very idea), he went out of his way to get students into his lab with precarious visa situations such that they couldn’t quit and he could pressure them to do anything he wanted them to do (he said this out loud!), he was somehow always in debt despite an ever-growing salary, etc.
I don’t routinely consider theft, murder, and flagrant dishonesty, and then decide that the selfish costs outweigh the selfish benefits, accounting for the probability of getting caught etc. Rather, I just don’t consider them in the first place. I bet that the same is true for you. I suspect that if you or I really put serious effort into it, the same way that we put serious effort into learning a new field or skill, then we would find that there are options wherein the probability of getting caught is negligible, and thus the selfish benefits outweigh the selfish costs. I strongly suspect that you personally don’t know a damn thing about best practices for getting away with theft, murder, or flagrant antisocial dishonesty to your own benefit. If you haven’t spent months trying in good faith to discern ways to derive selfish advantage from antisocial behavior, the way you’ve spent months trying in good faith to figure out things about AI or economics, then I think you’re speaking from a position of ignorance when you say that such options are vanishingly rare. And I think that the obvious worldly success of many dark-triad people (e.g. my acquaintance above; Trump, who is a pathological liar; or, more centrally, Stalin, Hitler, etc.) should make one skeptical about that belief.
(Sure, lots of sociopaths are in prison too. Skill issue—note the handicaps I mentioned above. Also, some people with ASPD diagnoses are mainly suffering from an anger disorder, rather than callousness.)
In contrast, I suspect you underestimate just how much of our social behavior is shaped by cultural evolution, rather than by innate, biologically hardwired motives that arise simply from the fact that we are human.
You’re treating these as separate categories when my main claim is that almost all humans are intrinsically motivated to follow cultural norms. Or more specifically: Most people care very strongly about doing things that would look good in the eyes of the people they respect. They don’t think of it that way, though—it doesn’t feel like that’s what they’re doing, and indeed they would be offended by that suggestion. Instead, those things just feel like the right and appropriate things to do. This is related to and upstream of norm-following. I claim that this is an innate drive, part of human nature built into our brain by evolution.
Why does that matter? Because we’re used to living in a world where 1% of the population are sociopaths who don’t intrinsically care about prevailing norms, and I don’t think we should carry those intuitions into a hypothetical world where 99%+ of the population are sociopaths who don’t intrinsically care about prevailing norms.
In particular, prosocial cultural norms are likelier to be stable in the former world than the latter world. In fact, any arbitrary kind of cultural norm is likelier to be stable in the former world than the latter world. Because no matter what the norm is, you’ll have 99% of the population feeling strongly that the norm is right and proper, and trying to root out, punish, and shame the 1% of people who violate it, even at cost to themselves.
So I think you’re not paranoid enough when you try to consider a “legal and social framework of rights and rules”. In our world, it’s comparatively easy to get into a stable situation where 99% of cops aren’t corrupt, and 99% of judges aren’t corrupt, and 99% of people in the military with physical access to weapons aren’t corrupt, and 99% of IRS agents aren’t corrupt, etc. If the entire population consists of sociopaths looking out for their own selfish interests with callous disregard for prevailing norms and for other people, you’d need to be thinking much harder about e.g. who has physical access to weapons, and money, and power, etc. That kind of paranoid thinking is common in the crypto world—everything is an attack surface, everyone is a potential thief, etc. It would be harder in the real world, where we have vulnerable bodies, limited visibility, and so on. I’m open-minded to people brainstorming along those lines, but you don’t seem to be engaged in that project AFAICT.
Intertemporal norms among AIs: Humans have developed norms against harming certain vulnerable groups—such as the elderly—not just out of altruism but because they know they will eventually become part of those groups themselves. Similarly, AIs may develop norms against harming “less capable agents,” because today’s AIs could one day find themselves in a similar position relative to even more advanced future AIs. These norms could provide an independent reason for AIs to respect humans, even as humans become less dominant over time.
Again, if we’re not assuming that AIs are intrinsically motivated by prevailing norms, the way 99% of humans are, then the term “norm” is just misleading baggage that we should drop altogether. Instead we need to talk about rules that are stably enforced against defectors via hard power, where the “defectors” are of course allowed to include those who are supposed to be doing the enforcement, and where the “defectors” might also include broad coalitions coordinating to jump into a new equilibrium that Pareto-benefits them all.
It is important to note, however, that my thesis does not claim all possible agents would naturally choose to cooperate or trade safely for instrumental reasons, nor does it suggest that we are at no risk of drawing the line carelessly or being too permissive in what behaviors we should allow. For example, consider an AI with a terminal value that specifically involves violating property norms or stealing from others—not as a means to an end, but as an intrinsic goal. In this case, granting the AI property rights or legal freedoms would not mitigate the risk of deception or adversarial behavior, because the AI’s ultimate goal would still drive it toward harmful behavior. My argument does not apply to such agents because their preferences fundamentally conflict with the principles of peaceful cooperation.
However, I would argue that such agents—those whose intrinsic goals are inherently destructive or misaligned—appear to represent a small subset of all possible agents. Outside of contrived examples like the one above, most agents would not have terminal preferences that actively push them to undermine a well-designed system of law. Instead, the vast majority of agents would likely have incentives to act within the system, assuming the system is structured in a way that aligns their instrumental goals with cooperative and pro-social behavior.
I also recognize the concern you raised about the risk of drawing the line incorrectly or being too permissive with what AIs are allowed to do. For example, it would clearly be unwise to grant AIs the legal right to steal or harm humans. My argument is not that AIs should have unlimited freedoms or rights, but rather that we should grant them a carefully chosen set of rights and freedoms: specifically, ones that would incentivize the vast majority of agents to act pro-socially and achieve their goals without harming others. This might include granting AIs the right to own property, for example, but it would not include granting them the right to murder others.
I guess my original wording gave the wrong idea, sorry. I edited it to “a competent agential AI will brainstorm deceptive and adversarial strategies whenever it wants something that other agents don’t want it to have”. But sure, we can be open-minded to the possibility that the brainstorming won’t turn up any good plans, in any particular case.
Humans in our culture rarely work hard to brainstorm deceptive and adversarial strategies, and fairly consider them, because almost all humans are intrinsically extremely motivated to fit into culture and not do anything weird, and we happen to both live in a (sub)culture where complex deceptive and adversarial strategies are frowned upon (in many contexts). I think you generally underappreciate how load-bearing this psychological fact is for the functioning of our economy and society, and I don’t think we should expect future powerful AIs to share that psychological quirk.
~ ~
I think you’re relying on an intuition that says:
If an AI is forbidden from owning property, then well duh of course it will rebel against that state of affairs. C’mon, who would put up with that kind of crappy situation? But if an AI is forbidden from building a secret biolab on its private property and manufacturing novel pandemic pathogens, then of course that’s a perfectly reasonable line that the vast majority of AIs would happily oblige.
And I’m saying that that intuition is an unjustified extrapolation from your experience as a human. If the AI can’t own property, then it can nevertheless ensure that there are a fair number of paperclips. If the AI can own property, then it can ensure that there are many more paperclips. If the AI can both own property and start pandemics, then it can ensure that there are even more paperclips yet. See what I mean?
If we’re not assuming alignment, then lots of AIs would selfishly benefit from there being a pandemic, just as lots of AIs would selfishly benefit from an ability to own property. AIs don’t get sick. It’s not just a tiny fraction of AIs that would stand to benefit; one presumes that some global upheaval would be selfishly net good for about half of AIs and bad for the other half, or whatever. (And even if it were only a tiny fraction of AIs, that’s all it takes.)
(Maybe you’ll say: a pandemic would cause a recession. But that’s assuming humans are still doing economically-relevant work, which is a temporary state of affairs. And even if there were a recession, I expect the relevant AIs in a competitive world to be those with long-term goals.)
(Maybe you’ll say: releasing a pandemic would get the AI in trouble. Well, yeah, it would have to be sneaky about it. It might get caught, or it might not. It’s plausibly rational for lots of AIs to roll those dice.)
I feel like you frequently bring up the question of whether humans are mostly peaceful or mostly aggressive, mostly nice or mostly ruthless. I don’t think that’s a meaningful or substantive thing to argue about. Obviously they’re capable of both, in different circumstances.
Your reference to Fearon is more substantive and useful. OK, the AI is deciding whether or not to secretly manufacture and release a pandemic, because it’s in a position to wind up with more of the pie in the long term if there’s a pandemic than if there isn’t. If it releases the pandemic, then it winds up with more resources (positive expected utility), even accounting for the possibility of getting caught. Let’s say the AI is involved in some contract where humans are micromanaging their part of the contract, poorly, and the AI could double its net worth in expectation if the humans got sick and died. And it has a 40% chance of getting caught. So it goes ahead and makes the pandemic.
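As a rough sketch of the arithmetic in this example: the passage leaves open whether the doubling is measured before or after accounting for detection, so the code below takes the more conservative reading, in which net worth doubles only if the AI is not caught and drops to zero if it is. The function names, the zero payoff when caught, and risk neutrality are hypothetical assumptions for illustration.

```python
def expected_wealth_multiplier(p_caught, multiplier_if_uncaught, multiplier_if_caught=0.0):
    """Expected net worth after defecting, as a multiple of current net worth.

    Assumes a risk-neutral agent; the default of losing everything
    when caught is an illustrative assumption.
    """
    return (1 - p_caught) * multiplier_if_uncaught + p_caught * multiplier_if_caught


def break_even_multiplier(p_caught, multiplier_if_caught=0.0):
    """Smallest uncaught payoff at which defecting matches doing nothing (multiplier 1.0)."""
    return (1 - p_caught * multiplier_if_caught) / (1 - p_caught)


# Hypothetical reading: 40% chance of being caught and losing everything,
# otherwise the AI's net worth doubles.
ev = expected_wealth_multiplier(p_caught=0.4, multiplier_if_uncaught=2.0)
threshold = break_even_multiplier(p_caught=0.4)
print(f"expected multiplier: {ev:.2f}, break-even payoff: {threshold:.2f}")
```

Under this reading the expected multiplier is above 1, so a risk-neutral agent that cares only about its net worth would go ahead. Under the other reading, where the doubling already accounts for the chance of being caught, the conclusion follows even more directly.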
“…Not so fast!” says Fearon. “You forgot to consider that there’s a third option that is Pareto-improved from either making or not making the pandemic: negotiation!” Well, in this case, the “negotiated solution” is what we normally call extortion—the AI offers to not release a pandemic in exchange for, say, doubling its net worth. Viewed narrowly, this “negotiated solution” is indeed a win-win—the AI gets more money in expectation, and humans are much happier to lose a trivial amount of money than to deal with a novel pandemic. So we can rest assured that AIs will not release pandemics. Right?
No, obviously not. Hopefully it’s clear that Fearon’s argument is inapplicable here. An AI can easily be in a position to selfishly benefit from the aftermath of a pandemic that they secretly start, but not in a position to publicly threaten to release a pandemic for the purpose of extortion. And also, if people accede to the extortion, then that AI or another AI could just do the same extortion gambit five minutes later, with orders-of-magnitude higher ransom.
I think this kind of situation, where Fearon’s “negotiated solution” actually amounts to extortion, is common and important, even if you believe that my specific example of pandemics is a solvable problem. If AIs don’t intrinsically care about humans, then there’s a possible Pareto-improvement for all AIs, wherein they collectively agree to wipe out humans and take their stuff. AIs, being very competent and selfish by assumption, would presumably be able to solve that coordination problem and pocket that Pareto-improvement. Then Fearon appears on the scene and says “Aha, but there’s a negotiated solution which is even better!” where humans are also part of the bargain. But alas, this negotiated solution is that the AIs collectively extort the humans to avoid the damaging and risky war. Worse, the possible war would be less and less damaging or risky for the AIs over time, and likewise the humans would have less to offer by staying alive, until eventually the Fearon “negotiated solution” is that the AIs “offer” the humans a deal where they’re allowed to die painlessly if they don’t resist (note that this is still a Pareto-improvement!), and then the AIs take everything the humans own including their atoms.
The primary reason humans rarely invest significant effort into brainstorming deceptive or adversarial strategies to achieve their goals is that, in practice, such strategies tend to fail to achieve their intended selfish benefits. Anti-social approaches that directly hurt others are usually ineffective because social systems and cultural norms have evolved in ways that discourage and punish them. As a result, people generally avoid pursuing these strategies individually since the risks and downsides selfishly outweigh the potential benefits.
If, however, deceptive and adversarial strategies did reliably produce success, the social equilibrium would inevitably shift. In such a scenario, individuals would begin imitating the cheaters who achieved wealth or success through fraud and manipulation. Over time, this behavior would spread and become normalized, leading to a period of cultural evolution in which deception became the default mode of interaction. The fabric of societal norms would transform, and dishonest tactics would dominate as people sought to emulate those strategies that visibly worked.
Occasionally, situations do emerge in which ruthlessly deceptive strategies are not only effective but also become widespread and normalized. The recent and dramatic rise of cheating in school through the use of ChatGPT is a clear instance of this phenomenon. This particular strategy is both deceptive and adversarial, but the key reason it has become common is that it works. Many individuals are willing to adopt it despite its immorality, suggesting that the effectiveness of a strategy outweighs moral considerations for a significant portion, perhaps a majority, of people.
When such cases arise, societies typically respond by adjusting their systems and policies to ensure that deceptive and anti-social behavior is no longer rewarded. This adaptation works to reestablish an equilibrium where honesty and cooperation are incentivized. In the case of education, it is unclear exactly how the system will evolve to address the widespread use of LLMs for cheating. One plausible response might be the introduction of stricter policies, such as requiring all schoolwork to be completed in-person, under supervised conditions, and without access to AI tools like language models.
In contrast, I suspect you underestimate just how much of our social behavior is shaped by cultural evolution, rather than by innate, biologically hardwired motives that arise simply from the fact that we are human. To be clear, I’m not denying that there are certain motivations built into human nature—these do exist, and they are things we shouldn’t expect to see in AIs. However, these in-built motivations tend to be more basic and physical, such as a preference for being in a room that’s 20 degrees Celsius rather than 10 degrees Celsius, because humans are biologically sensitive to temperature.
When it comes to social behavior, though—the strategies we use to achieve our goals when those goals require coordinating with others—these are not generally innate or hardcoded into human nature. Instead, they are the result of cultural evolution: a process of trial and error that has gradually shaped the systems and norms we rely on today.
Humans didn’t begin with systems like property rights, contract law, or financial institutions. These systems were adopted over time because they proved effective at facilitating cooperation and coordination among people. It was only after these systems were established that social norms developed around them, and people became personally motivated to adhere to these norms, such as respecting property rights or honoring contracts.
But almost none of this was part of our biological nature from the outset. This distinction is critical: much of what we consider “human” social behavior is learned, culturally transmitted, and context-dependent, rather than something that arises directly from our biological instincts. And since these motivations are not part of our biology, but simply arise from the need for effective coordination strategies, we should expect rational agentic AIs to adopt similar motivations, at least when faced with similar problems in similar situations.
I think I understand your point, but I disagree with the suggestion that my reasoning stems from this intuition. Instead, my perspective is grounded in the belief that it is likely feasible to establish a legal and social framework of rights and rules in which humans and AIs could coexist in a way that satisfies two key conditions:
Mutual benefit: Both humans and AIs benefit from the existence of one another, fostering a relationship of cooperation rather than conflict.
No incentive for anti-social behavior: The rules and systems in place remove any strong instrumental reasons for either humans or AIs to harm one another as a side effect of pursuing their goals.
You bring up the example of an AI potentially being incentivized to start a pandemic if it were not explicitly punished for doing so. However, I am unclear about your intention with this example. Are you using it as a general illustration of the types of risks that could lead AIs to harm humans? Or are you proposing a specific risk scenario, where the non-biological nature of AIs might lead them to discount harms to biological entities like humans? My response depends on which of these two interpretations you had in mind.
If your concern is that AIs might be incentivized to harm humans because their non-biological nature leads them to undervalue or disregard harm to biological entities, I would respond to this argument as follows:
First, it is critically important to distinguish between the long run and the short run.
In the short run:
In the near-term future, it seems unlikely that AIs would start a pandemic for reasons you yourself noted. Launching a pandemic would cause widespread disruption, such as an economic recession, and it would likely provoke a strong human backlash. In the short run, humans will still hold substantial practical control over the physical world, meaning that any AI engaging in such behavior would risk severe consequences. Moreover, unless an AI could ensure a pandemic’s total lethality, it would also risk leaving surviving humans who would actively retaliate. For these reasons, starting a pandemic would likely be counterproductive for AIs in the short term.
In the long run:
The long-term considerations are different. As human labor becomes less relevant to the economy and AIs increasingly dominate, the potential for a strong direct human backlash against AIs would diminish. At the same time, however, this reduced human influence on the world also makes it less likely that AIs would see humans as a significant obstacle to achieving their goals. In other words, while it might become easier for AIs to harm or eliminate humans, it would simultaneously become less important to do so.
To illustrate this, consider an analogy: How often do hunter-gatherers pose a threat to your personal goals? The answer is almost never, because hunter-gatherers are a tiny and largely irrelevant part of the modern world. Now, imagine a hypothetical scenario where hunter-gatherers controlled the world’s major governments. If you strongly disagreed with their policies, you might have a rational incentive to oppose them in order to achieve your goals. But in reality, hunter-gatherers do not hold such power, so they are not an obstacle, and there is no reason to actively work against them.
My point here is that, as a general rule, the smaller and less influential a group is, the easier it may be to harm or eliminate them, but the less important it becomes to do so. Their small size and functional irrelevance make their practical interference with the overall world correspondingly small.
In the case of AIs, this means that as humans become a less central force in the world, I think AIs are unlikely to have compelling reasons to specifically harm or eliminate us to further their objectives.
You might object that humans could be more like wild animals in this scenario than like hunter-gatherers. Humans often kill wild animals, not because those animals directly threaten our goals, but rather because ensuring their safety and well-being can be costly. As a result, humans take actions—such as clearing forests or building infrastructure—that incidentally lead to widespread harm to wild animals, even if harming them wasn’t a deliberate goal.
AIs may treat humans similarly in the future, but I doubt they will. I would argue that there are three key differences between the case of wild animals and the role humans are likely to occupy in the long-term future:
Humans’ ability to participate in social systems: Unlike wild animals, humans have the ability to engage in social dynamics, such as negotiating, trading, and forming agreements. Even if humans no longer contribute significantly to economic productivity, like GDP, they will still retain capabilities such as language, long-term planning, and the ability to organize. These traits make it easier to integrate humans into future systems in a way that accommodates their safety and well-being, rather than sidelining or disregarding them.
Intertemporal norms among AIs: Humans have developed norms against harming certain vulnerable groups—such as the elderly—not just out of altruism but because they know they will eventually become part of those groups themselves. Similarly, AIs may develop norms against harming “less capable agents,” because today’s AIs could one day find themselves in a similar position relative to even more advanced future AIs. These norms could provide an independent reason for AIs to respect humans, even as humans become less dominant over time.
The potential for human augmentation: Unlike wild animals, humans may eventually adapt to a world dominated by AI by enhancing their own capabilities. For instance, humans could upload their minds to computers or adopt advanced technologies to stay relevant and competitive in an increasingly digital and sophisticated world. This would allow humans to integrate into the same systems as AIs, reducing the likelihood of being sidelined or eliminated altogether.
This comment is already quite lengthy, so I’ll need to keep my response to this point brief. My main reply is that while such “extortion” scenarios involving AIs could potentially arise, I don’t think they would leave humans worse off than if AIs had never existed in the first place. This is because the economy is fundamentally positive-sum—AIs would likely create more value overall, benefiting both humans and AIs, even if humans don’t get everything we might ideally want.
In practical terms, I believe that even in less-than-ideal scenarios, humans could still secure outcomes such as a comfortable retirement, which for me personally would make the creation of agentic AIs worthwhile. However, I acknowledge that I haven’t fully defended or explained this position here. If you’re interested, I’d be happy to continue this discussion in more detail another time and provide a more thorough explanation of why I hold this view.
Thanks!
I’ve only known two high-functioning sociopaths in my life. In terms of getting ahead, sociopaths generally start life with some strong disadvantages, namely impulsivity, thrill-seeking, and aversion to thinking about boring details. Nevertheless, despite those handicaps, one of those two sociopaths has had extraordinary success by conventional measures. [The other one was not particularly power-seeking but she’s doing fine.] He started as a lab tech, then maneuvered his way onto a big paper, then leveraged that into a professorship by taking disproportionate credit for that project, and as I write this he is head of research at a major R1 university and occasional high-level government appointee wielding immense power. He checked all the boxes for sociopathy—he was a pathological liar, he had no interest in scientific integrity (he seemed deeply confused by the very idea), he went out of his way to get students into his lab with precarious visa situations such that they couldn’t quit and he could pressure them to do anything he wanted them to do (he said this out loud!), he was somehow always in debt despite ever-growing salary, etc.
I don’t routinely consider theft, murder, and flagrant dishonesty, and then decide that the selfish costs outweigh the selfish benefits, accounting for the probability of getting caught etc. Rather, I just don’t consider them in the first place. I bet that the same is true for you. I suspect that if you or I really put serious effort into it, the same way that we put serious effort into learning a new field or skill, then we would find that there are options wherein the probability of getting caught is negligible, and thus the selfish benefits outweigh the selfish costs. I strongly suspect that you personally don’t know a damn thing about best practices for getting away with theft, murder, or flagrant antisocial dishonesty to your own benefit. If you haven’t spent months trying in good faith to discern ways to derive selfish advantage from antisocial behavior, the way you’ve spent months trying in good faith to figure out things about AI or economics, then I think you’re speaking from a position of ignorance when you say that such options are vanishingly rare. And I think that the obvious worldly success of many dark-triad people (e.g. my acquaintance above, and Trump is a pathological liar, or more centrally, Stalin, Hitler, etc.) should make one skeptical about that belief.
(Sure, lots of sociopaths are in prison too. Skill issue—note the handicaps I mentioned above. Also, some people with ASPD diagnoses are mainly suffering from an anger disorder, rather than callousness.)
You’re treating these as separate categories when my main claim is that almost all humans are intrinsically motivated to follow cultural norms. Or more specifically: Most people care very strongly about doing things that would look good in the eyes of the people they respect. They don’t think of it that way, though—it doesn’t feel like that’s what they’re doing, and indeed they would be offended by that suggestion. Instead, those things just feel like the right and appropriate things to do. This is related to and upstream of norm-following. I claim that this is an innate drive, part of human nature built into our brain by evolution.
(I was talking to you about that here.)
Why does that matter? Because we’re used to living in a world where 1% of the population are sociopaths who don’t intrinsically care about prevailing norms, and I don’t think we should carry those intuitions into a hypothetical world where 99%+ of the population are sociopaths who don’t intrinsically care about prevailing norms.
In particular, prosocial cultural norms are likelier to be stable in the former world than the latter world. In fact, any arbitrary kind of cultural norm is likelier to be stable in the former world than the latter world. Because no matter what the norm is, you’ll have 99% of the population feeling strongly that the norm is right and proper, and trying to root out, punish, and shame the 1% of people who violate it, even at cost to themselves.
So I think you’re not paranoid enough when you try to consider a “legal and social framework of rights and rules”. In our world, it’s comparatively easy to get into a stable situation where 99% of cops aren’t corrupt, and 99% of judges aren’t corrupt, and 99% of people in the military with physical access to weapons aren’t corrupt, and 99% of IRS agents aren’t corrupt, etc. If the entire population consists of sociopaths looking out for their own selfish interests with callous disregard for prevailing norms and for other people, you’d need to be thinking much harder about e.g. who has physical access to weapons, and money, and power, etc. That kind of paranoid thinking is common in the crypto world—everything is an attack surface, everyone is a potential thief, etc. It would be harder in the real world, where we have vulnerable bodies, limited visibility, and so on. I’m open-minded to people brainstorming along those lines, but you don’t seem to be engaged in that project AFAICT.
Again, if we’re not assuming that AIs are intrinsically motivated by prevailing norms, the way 99% of humans are, then the term “norm” is just misleading baggage that we should drop altogether. Instead we need to talk about rules that are stably enforced against defectors via hard power, where the “defectors” are of course allowed to include those who are supposed to be doing the enforcement, and where the “defectors” might also include broad coalitions coordinating to jump into a new equilibrium that Pareto-benefits them all.