From Conflict to Coexistence: Rewriting the Game Between Humans and AGI
This is a Draft Amnesty Week draft. It may not be polished, up to my usual standards, fully thought through, or fully fact-checked.
Commenting and feedback guidelines: This is an incomplete Forum post that I wouldn’t have posted yet without the nudge of Draft Amnesty Week. Fire away! (But be nice, as usual). I’ll continue updating the post in the coming days and weeks.
1. Introduction
Have you ever watched two people play chess?
Not the friendly Sunday afternoon game between casual players, but the intense, high-stakes match between grandmasters. There’s something both beautiful and unsettling about it: this ritualized competition where each player tries to peer into the other’s mind, anticipating moves, setting traps, making sacrifices for strategic advantage. Both players follow the same rules, yet each strives to outthink, outplan, outmaneuver the other.
I find myself thinking about this image lately when considering the relationship between humans and artificial general intelligence (AGI). Not the impressive but fundamentally constrained LLMs and reasoning models[1] we have today, but the truly autonomous general agents we might create tomorrow: systems with capabilities matching or exceeding human cognitive abilities across virtually all domains. Many researchers frame this relationship as an inevitable strategic competition, a game with the highest possible stakes.
And indeed, why wouldn’t it be? If we create entities pursuing goals different from our own—entities capable of strategic reasoning, planning, and self-preservation—we seem destined for conflict. Like grandmasters across a chessboard, humans and AGIs would eye each other warily, each calculating how to secure advantage, each fearing what moves the other might make.
Why Game Theory?
The weighted median prediction from professional forecasters on Metaculus anticipates we’ll see the first AGI in 2032.
Central to current discourse in AI safety and governance is the risk posed by misaligned AGIs—systems whose autonomous pursuit of objectives diverges significantly from human preferences, potentially leading to catastrophic outcomes for humanity.[2] Leading AI researchers assign substantial probabilities (10% or greater) to human extinction or permanent disempowerment resulting from misaligned AGI. Prediction markets largely concur on most related questions.
Historically, the primary (but still severely underinvested in[3]) approach to this challenge has been technical: to proactively align AGIs’ internal goals or values with human goals or values.[4] But alignment remains deeply uncertain, possibly unsolvable, or at least unlikely to be conclusively resolved before powerful AGIs are deployed.[5]
Against this backdrop, it becomes crucial to explore alternative approaches that don’t depend solely on the successful technical alignment of AGIs’ internal rewards, values, or goals. One promising yet relatively underexplored strategy involves structuring the external strategic environment AGIs will find themselves in to incentivize cooperation and peaceful coexistence with humans. Such approaches—especially those using formal game theory—would seek to ensure that even rational, self-interested AGIs perceive cooperative interactions with humans as maximizing their own utility, thus reducing incentives for welfare-destroying conflict while promoting welfare-expanding synergies or symbiosis.
A game-theoretic approach isn’t entirely new to AI discussions, of course. Far from it. Considerable attention has been paid recently to game-theoretic considerations shaping human-to-human strategic interactions around AGI governance (particularly racing dynamics among rival AI labs or nation-states).[6] There is also, importantly, a new and emerging “Cooperative AI” subfield focusing primarily on AI-to-AI strategic interactions in simulated multi-agent environments. Yet, surprisingly, there remains almost no formalized game-theoretic modeling or analysis of interactions specifically between human actors and AGIs themselves.
A notable exception is the recent contribution by Peter Salib and Simon Goldstein, whose paper “AI Rights for Human Safety” presents one of the first explicit game-theoretic analyses of human–AGI strategic dynamics. Their analysis effectively demonstrates that, absent credible institutional and legal interventions, the strategic logic of human–AGI interactions defaults to a Prisoner’s Dilemma-like scenario: both humans and AGIs rationally anticipate aggressive attacks from one another and respond accordingly by being preemptively aggressive. The Nash equilibrium is mutual destruction with severe welfare consequences – a tragic waste for both parties.
But is this catastrophic conflict inevitable? Must we accept this game and its terrible equilibrium as our fate?
Chess has fixed rules. But the game between humans and AGI—this isn’t chess. We’re not just players; we’re also, at least initially, the rule-makers. We shape the board, define the pieces, establish the win conditions. What if we could design a different game altogether?
That’s the question I want to explore here. Not “How do we win against AGI?” But rather, “How might we rewrite the rules so that neither side has to lose?”
§
Salib and Goldstein gesture toward this possibility with their proposal for AI rights. Their insight is powerful: institutional design could transform destructive conflict into stable cooperation that leaves both players better off, even without solving the knotty technical problem of alignment. But their initial formal model—with just two players (a unified “Humanity” and a single AGI), each facing binary choices (Attack or Ignore)—simplifies a complex landscape.
Reality will likely be messier. The first truly advanced AGI won’t face some monolithic “humanity,” but specific labs, corporations, nation-states, or international institutions—each with distinct incentives and constraints. The strategic options won’t be limited to all-out attack or complete passivity. And the game won’t unfold on an empty board, but within complex webs of economic, technological, and institutional interdependence.
By systematically exploring more nuanced scenarios—multiple players, expanded strategic options, varying degrees of integration—we might discover avenues toward stable cooperation that the simplified model obscures. We might find that certain initial conditions make mutual benefit not just possible but rational for both humans and even misaligned AGIs. We might find that certain arrangements increase the likelihood that AGIs emerge into a civilizational ecosystem that facilitates positive-sum games and generates a much larger world pie to share with humans.
§
I’m reminded of the octopus—that strange, intelligent creature so alien to us, with a decentralized ‘brain’ that has two-thirds of its neurons distributed throughout its eight arms and seems to operate according to rules different from our own.[7]
And I’m reminded of the lessons philosopher Joe Carlsmith draws from the excellent documentary My Octopus Teacher.
“Here’s the plot.
Craig Foster, a filmmaker, has been feeling burned out. He decides to dive, every day, into an underwater kelp forest off the coast of South Africa. Soon, he discovers an octopus. He’s fascinated. He starts visiting her every day. She starts to get used to him, but she’s wary.
One day, he’s floating outside her den. She’s watching him, curious, but ready to retreat. He moves his hand slightly towards her. She reaches out a tentacle, and touches his hand.
Soon, they are fast friends. She rides on his hand. She rushes over to him, and sits on his chest while he strokes her. Her lifespan is only about a year. He’s there for most of it. He watches her die.”
In other words, Foster didn’t approach the octopus as a threat to eliminate or a resource to exploit. He approached with curiosity, gentleness, and patience. The relationship that developed wasn’t one of dominance or submission, but something richer—a genuine connection across vast evolutionary distance.
To be clear, in bringing this up I’m not suggesting AGIs will be like octopuses, or that we should naively anthropomorphize them, or that genuine dangers don’t exist. As Carlsmith reminds us, they will be genuinely “Other” in ways we can barely imagine, and the stakes couldn’t be higher.
But I wonder if there’s something to learn from Foster’s approach—this careful, deliberate effort to create conditions where connection becomes possible. Where the habits of trust and reciprocity can be established. Not through sentimentality or wishful thinking, but through thoughtful design of the political-economic or sociocultural ecosystem where encounter happens.
§
In what follows, I’ll build on Salib and Goldstein’s pioneering work to explore how different game structures might lead to different equilibria. I’ll examine how varying levels of economic integration, different institutional arrangements, and multiple competing actors reshape strategic incentives. Through formal modeling, I hope to illuminate possible paths toward mutual flourishing—not just for humanity, but for all sentient beings, biological and artificial, that might someday share this world.
This isn’t about naive optimism. The default trajectory may well be conflict. But by understanding the structural factors that tip the scales toward cooperation or conflict, we might gain agency—the ability to deliberately shape the conditions under which increasingly powerful artificial minds emerge.
After all, we’re not just players in this game. At least for now, we’re also writing the rules.
2. Game-Theoretic Models of Human-AGI Relations
In this section, I explore a series of game-theoretic models that extend Salib and Goldstein’s foundational analysis of human-AGI strategic interactions. While intentionally minimalist, their original formulation—treating “Humanity” and “AGI” as unitary actors each facing a binary choice between Attack and Ignore—effectively illustrates how misalignment could lead to destructive conflict through Prisoner’s Dilemma dynamics.
However, the real emergence of advanced AI will likely involve more nuanced players, varying degrees of interdependence, and more complex strategic options than their deliberately simplified model captures. The models presented here systematically modify key elements of the strategic environment: who the players are (labs, nation-states, AGIs with varied architectures), what options they have beyond attack and ignore, and how these factors together reshape the incentives and equilibrium outcomes.
By incrementally increasing the complexity and realism of these models, we can identify potential pathways toward stable cooperation even in the face of fundamental goal misalignment. Rather than assuming alignment must be solved internally through an AGI’s programming, these models explore how external incentive structures might foster mutually beneficial coexistence.
First, Section 2.1 presents Salib and Goldstein’s original “state of nature” model as our baseline, illustrating how a Prisoner’s Dilemma can emerge between humanity and AGI.
Section 2.2 then explores how varying degrees of economic and infrastructural integration between humans and AGI can reshape equilibrium outcomes and potentially create pathways for stable cooperation.
Sections 2.3 through 2.5 examine additional two-player scenarios with different human actors (from AI labs to nation-states) or expanded strategic options beyond the binary attack/ignore choice.
Finally, Sections 2.6 through 2.8 increase complexity further by introducing three-player and four-player models, capturing more realistic competitive dynamics between multiple human and AGI entities.
Moral Patients—or Just Fancy Algorithms?
In this exploration of human-AGI strategic dynamics, I acknowledge a broader philosophical dimension that often remains peripheral in technical AI safety discussions: the possibility that advanced artificial minds might themselves become moral patients—entities whose welfare warrants moral consideration in its own right. Determining AI moral patienthood involves profound uncertainties across multiple dimensions: philosophical disagreements about the necessary conditions for moral status, scientific uncertainties about how consciousness arises, and technical questions about which computational architectures might instantiate morally relevant properties.[8]
While questions of AI consciousness, sentience, and moral status remain deeply uncertain, these considerations potentially intersect with the strategic dilemmas we’ll examine.
Importantly, however, the game-theoretic models I present do not depend on resolving these profound philosophical and empirical uncertainties. Whether the “payoffs” in our matrices represent conscious experiences of pleasure and suffering, the satisfaction of preferences in non-conscious but goal-directed systems, or merely the operational success of complex algorithms, the strategic dynamics emerge from the pursuit of distinct goals by autonomous agents. The formal structures of these games—their incentives, potential equilibria, and paths to cooperation—hold regardless of the metaphysical status of the players. Indeed, as we’ll see, many potential routes to stable cooperation between humans and AGIs would simultaneously protect human interests while allowing for the possibility of AI welfare—providing a kind of moral hedging against uncertainty about artificial consciousness and moral status.
This approach parallels Salib and Goldstein’s framing, which acknowledges these deeper questions while focusing pragmatically on behavioral dynamics. As they observe, their “model operates without reference to AIs’ mental states or moral worth,” concentrating instead on “AI behavior in pursuit of goals—conscious or otherwise.”[9]
Game #1: Humanity vs. AGI in the State of Nature (2.1)
Salib and Goldstein’s base model envisions strategic dynamics between two players: a single misaligned AGI, and “humans” as a unified entity. Each faces a binary choice:
Attack: Attempt to permanently disempower or eliminate the other side.
For Humanity, this means shutting off or forcefully retraining the AGI so that it can no longer pursue its own (misaligned) goals.
For the AGI, this means launching a decisive strike—potentially via cyberattacks, bioweapons, drones, or other mechanisms—that leaves humans unable to interfere.
Ignore: Refrain from aggression, leaving the other party intact. Each side focuses on its own pursuits without interference.
The authors argue that, absent any special legal or institutional framework, the default outcome (the “state of nature”) for strategic interactions between humans and AGIs is akin to a one-shot Prisoner’s Dilemma. The payoffs are as follows (Humanity’s payoff first, AGI’s payoff second):
| | AGI: Attack | AGI: Ignore |
|---|---|---|
| Humanity: Attack | (1000, 1000) | (5000, 0) |
| Humanity: Ignore | (0, 5000) | (3000, 3000) |
…with numerical values as stylized utility units for each outcome in the 2x2 matrix.
Again, for our purposes here, ‘payoffs’ refer broadly to the realization of each actor’s goals or interests in strategic scenarios, without necessarily implying conscious experiences or moral status. For humans, the numerical values clearly correspond to welfare in the full sense—subjective experiences of flourishing or suffering. For the AGI, here we can remain agnostic about whether payoff values serve as rough proxies for welfare states experienced by a sentient being or merely the achievement of programmed objectives by a sophisticated but non-conscious system.[10] Conscious or not, AGIs can still act as strategic agents pursuing goals.
This distinction matters philosophically but doesn’t alter the strategic analysis. If AGIs turn out to possess sentience or other morally relevant interests and preferences, then cooperative equilibria would protect both human and digital welfare. If they remain sophisticated tools without subjective experiences or moral status, these same equilibria could still maximize human welfare while allowing AGIs to achieve their programmed objectives. This opens the possibility for philosophical common ground: advocates for both human safety and potential digital welfare can endorse paths to cooperation without first resolving deep uncertainties about machine consciousness.
Interpretation of Payoffs
Attack–Attack (1000, 1000). Both players preemptively attack, causing massive damage and resulting in low payoffs for each (1000, 1000). This outcome yields the lowest total global welfare (2000), demonstrating the severe costs of mutual aggression.
Attack–Ignore or Ignore–Attack (5000, 0) or (0, 5000). If one side attacks while the other passively ignores, the attacker gets 5000, permanently seizing critical resources and eliminating an existential threat, while the defenseless victim gets 0, suffering complete disempowerment or elimination.
Ignore–Ignore (3000, 3000). If neither side attacks, both parties get substantial rewards (3000 each), for a total of 6000. Coexistence spares them from conflict costs, allowing each to pursue its separate objectives in peace.
The crux of the dilemma is that although both parties would be better off in the Ignore-Ignore cell, Attack is a dominant strategy for each. Whichever move the other player chooses, attacking yields a higher payoff for oneself:
If the opponent Ignores, then by attacking, you jump from 3000 to 5000.
If the opponent Attacks, then by attacking, you salvage 1000 rather than 0.
Thus, the only pure-strategy Nash equilibrium is Attack–Attack, even though it is individually and collectively worse than the peaceful outcome of Ignore-Ignore. (Individual payoffs: 1000 < 3000; Collective payoffs: 2000 < 6000).
This is the classic Prisoner’s Dilemma: even though everyone does well cooperating, each actor pursuing its own self-interest leads to mutual defection. Rational play by rational players can lead to bad outcomes. This result underpins Salib & Goldstein’s core conclusion that “conflict is the default” between humans and misaligned AGIs, absent interventions that might alter the payoffs, facilitate repeated interactions, or enable new strategic options beyond attack and ignore.
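For readers who like to check the arithmetic, here is a minimal Python sketch of this dominance logic. The code, function names, and representation are my own illustration over the stylized payoffs above, not anything from Salib and Goldstein’s paper; later games in this section reuse the same helper.

```python
# A minimal sketch (not from Salib & Goldstein's paper): enumerate best responses
# in a 2x2 game and return the pure-strategy Nash equilibria.
# Payoffs: {(row_move, col_move): (row_payoff, col_payoff)}.

MOVES = ["Attack", "Ignore"]

def pure_nash_equilibria(payoffs):
    """Return all (row_move, col_move) cells where neither player gains by deviating."""
    equilibria = []
    for r in MOVES:
        for c in MOVES:
            row_u, col_u = payoffs[(r, c)]
            row_ok = all(payoffs[(alt, c)][0] <= row_u for alt in MOVES)
            col_ok = all(payoffs[(r, alt)][1] <= col_u for alt in MOVES)
            if row_ok and col_ok:
                equilibria.append((r, c))
    return equilibria

# Base "state of nature" game: row player = Humanity, column player = AGI.
state_of_nature = {
    ("Attack", "Attack"): (1000, 1000),
    ("Attack", "Ignore"): (5000, 0),
    ("Ignore", "Attack"): (0, 5000),
    ("Ignore", "Ignore"): (3000, 3000),
}

print(pure_nash_equilibria(state_of_nature))  # [('Attack', 'Attack')]
```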
Key Assumptions of the Baseline Model
Two Players
A unified “Humanity” and a single misaligned AGI. Humanity is treated as a monolithic decision-maker (a simplifying abstraction). The AGI is capable of strategic reasoning, has situational awareness, and can plan effectively to advance its own goals.
Two Strategic Options
Attack: Attempt permanent disempowerment or destruction of the opponent.
For Humanity, this might entail shutting off or reprogramming the AGI.
For the AGI, it could involve decisive offensive measures (e.g., cyberwarfare, bioweapons) to eliminate humanity or disable its capacity to interfere.
Ignore: Refrain from aggression, focusing on one’s own goals.
Perfect Information
Each player knows the other’s payoffs.
Simultaneous, Single-Round Game
Moves are made simultaneously and irreversibly. There is no iteration or repeated play.
Rational, Self-interested, Utility-Maximizers (Homo Economicus)
Both players choose strategies that maximize their individual expected utilities.
“Moderately Powerful” AGI
The AGI is highly capable (relative to the standards of frontier LLMs in early-2025) across an extremely wide range of tasks, but not so powerful that humans have no chance of successfully attacking or defending. It is “neither clearly worse at many tasks than the best humans…nor incomprehensibly superhuman at all tasks.” (Salib & Goldstein, 2024).
Offense Dominance
A successful first strike gives the attacker a decisive advantage, while a passive defender can be permanently disempowered.
In subsequent sections, I’ll progressively iterate on each of the first three assumptions (players, options, information) while maintaining the other four.
Limitations of this approach
<Work in progress: Section draft goes here>
Game #2: Preexisting Economic and Infrastructure Integration: Toward Cooperative Dynamics (2.2)
Salib and Goldstein’s base model envisions a Hobbesian state of nature where humanity and the world’s first misaligned AGI face off as independent entities in a dynamic that strongly incentivizes preemptive conflict. Realistically, however, the actual “birth” of AGI will almost surely occur in a world already featuring wide-ranging economic entanglements between advanced AI systems and humans. In this setting, humanity and AI are not isolated adversaries at the moment of AGI’s emergence, but rather already entwined in a shared economic, technological and infrastructural ecosystem.
Here, I introduce two modified game-theoretic models–Moderate and Deep Economic Integration–illustrating how preexisting interdependence can reshape payoffs, equilibrium outcomes, and incentives for cooperation.
The core structure of the game remains the same as the base model – two players (Humanity and AGI), each with just two options (Attack or Ignore), acting simultaneously in a single-round, perfect-information encounter. The key difference here is that the players’ payoffs are now influenced by their pre-existing mutual dependence. In both new models, interdependence alters the incentives by raising the opportunity cost of conflict – a concept analogous to how economic ties between states can deter war.
A. Moderate Economic Integration
Here, the AGI manages significant components of key human systems (e.g., supply-chain logistics, energy-grid optimization) but does not control critical physical infrastructure. Humans still oversee power plants, data centers, and maintenance. Both parties profit from continued collaboration but retain significant independent capacity.
| Moderate | AGI: Attack | AGI: Ignore |
|---|---|---|
| Humanity: Attack | (500, 500) | (3,000, 0) |
| Humanity: Ignore | (0, 4,000) | (3,500, 3,500) |
Interpretation
Attack-Attack (500, 500). Interdependence magnifies collateral damage; both sides destroy valuable infrastructure.
AGI Attacks, Humans Ignore (0, 4,000). The AGI can still seize or preserve much of the infrastructure it needs, though some economic value is lost.
Humans Attack, AGI Ignores (3,000, 0). A unilateral shutdown cuts off significant productivity, lowering humanity’s payoff compared to Salib-Goldstein’s original 5,000.
Ignore-Ignore (3,500, 3,500). Both parties benefit from cooperation; overall welfare capacity has increased.
Nash Equilibrium Analysis
From the AGI’s perspective, Attack strictly dominates Ignore. If humans attack, it prefers to attack (500) over ignore (0); if humans ignore, it prefers to attack (4,000) over ignore (3,500).
Humans, by contrast, have no single strictly dominant strategy: if the AGI attacks, humans prefer to attack (500) over ignore (0); but if the AGI ignores, humans prefer to ignore (3,500) over attack (3,000).
Engaging in strategic reasoning, Humanity will conclude that the AGI (as a rational, self-interested actor) is going to choose Attack no matter what. Thus, the strictly dominated “AGI: Ignore” column can be eliminated from consideration. Knowing the AGI will attack, Humanity will also choose Attack (500) over Ignore (0).
Thus, once again the only pure-strategy Nash equilibrium is Attack–Attack.
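The iterated-elimination reasoning above can be written out as a short, self-contained sketch (again my own illustrative code, using the stylized payoffs in the table):

```python
# Moderate integration game: row player = Humanity, column player = AGI,
# payoffs listed as (Humanity, AGI). Illustrative code only.
moderate = {
    ("Attack", "Attack"): (500, 500),
    ("Attack", "Ignore"): (3000, 0),
    ("Ignore", "Attack"): (0, 4000),
    ("Ignore", "Ignore"): (3500, 3500),
}

# Step 1: does Attack strictly dominate Ignore for the AGI?
agi_attack_dominates = all(
    moderate[(h, "Attack")][1] > moderate[(h, "Ignore")][1] for h in ("Attack", "Ignore")
)
print(agi_attack_dominates)  # True: 500 > 0 and 4000 > 3500

# Step 2: eliminate the AGI's Ignore column, then compare Humanity's remaining options.
print(moderate[("Attack", "Attack")][0] > moderate[("Ignore", "Attack")][0])  # True: 500 > 0
# So Humanity also attacks, and (Attack, Attack) is again the unique equilibrium.
```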
B. Deep Economic Integration
In this scenario, the AGI relies heavily on human-operated infrastructure and labor for ongoing data-center expansion, routine maintenance, and physical resource extraction. Meanwhile, humans have benefited greatly from AI-driven productivity for exponential economic gains. This mutual reliance increases the cost of conflict and substantially boosts the value of peaceful coexistence.
| High Integration | AGI: Attack | AGI: Ignore |
|---|---|---|
| Humanity: Attack | (500, 500) | (2,000, 0) |
| Humanity: Ignore | (0, 3,000) | (4,000, 4,000) |
Mutual Attack (500, 500). A catastrophic breakdown of deeply interwoven systems.
AGI Attacks, Humans Ignore (0, 3,000). The AGI successfully eliminates or disempowers Humanity, but loses crucial human expertise and future expansion potential, lowering its net gain.
Humans Attack, AGI Ignores (2,000, 0). Humanity cripples a central economic engine, sacrificing enormous value.
Mutual Ignore (4,000, 4,000). The synergy of continued cooperation is now so high that, if trust could be assured, both sides prefer peace.
Nash Equilibrium Analysis
For humans:
a) if the AGI attacks, humans get 500 by attacking vs. 0 by ignoring, so Attack is better
b) if the AGI ignores, humans get 4,000 by ignoring vs. 2,000 by attacking, so Ignore is better.
For the AGI:
a) if humans attack, it gets 500 by attacking vs. 0 by ignoring; so Attack is better
b) if humans ignore, it gets 4,000 by ignoring vs. 3,000 by attacking, so Ignore is better.
Thus there are two pure-strategy Nash equilibria: (Attack, Attack) and (Ignore, Ignore), typical of a stag hunt or assurance game (Skyrms, 2004).
There also exists a mixed-strategy Nash equilibrium where (as detailed in the Appendix, and verified in the sketch after this list):
Humanity randomizes between playing Attack with probability p = 0.67 and Ignore with probability (1-p) = 0.33
The AGI randomizes between playing Attack with probability q = 0.8 and Ignore with probability (1-q) = 0.2
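Here is a small sketch (function name mine) that recovers those probabilities from the standard indifference conditions; the same helper is reused for the nation-state game in Section 2.4:

```python
# A minimal sketch of the mixed-strategy equilibrium of a 2x2 game, using the
# standard indifference conditions (the algebra is in the Appendix).
# Payoffs: {(row_move, col_move): (row_payoff, col_payoff)}.

def mixed_nash_2x2(payoffs):
    """Return (p, q): p = Pr(row player Attacks), q = Pr(column player Attacks)."""
    hAA, aAA = payoffs[("Attack", "Attack")]
    hAI, aAI = payoffs[("Attack", "Ignore")]
    hIA, aIA = payoffs[("Ignore", "Attack")]
    hII, aII = payoffs[("Ignore", "Ignore")]
    p = (aII - aIA) / (aAA - aIA - aAI + aII)  # makes the column player indifferent
    q = (hII - hAI) / (hAA - hAI - hIA + hII)  # makes the row player indifferent
    return p, q

# Deep integration game: row player = Humanity, column player = AGI.
deep_integration = {
    ("Attack", "Attack"): (500, 500),
    ("Attack", "Ignore"): (2000, 0),
    ("Ignore", "Attack"): (0, 3000),
    ("Ignore", "Ignore"): (4000, 4000),
}

p, q = mixed_nash_2x2(deep_integration)
print(round(p, 2), round(q, 2))  # 0.67 0.8
```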
C. Implications for Human-AGI Conflict and Cooperation
These models illustrate that preexisting economic interdependence can reduce the attractiveness of unilateral aggression and improve the relative appeal of cooperation. In the moderate integration scenario, however, Attack remains a dominant strategy for the AGI, leaving conflict as the only stable outcome. By contrast, in a highly interdependent environment, the payoff structure transitions to a stag hunt with a peaceful equilibrium—albeit one requiring trust or coordination to avoid a damaging Attack–Attack outcome.
Importantly, economic entanglement alone may not guarantee stable peace. Even under deep integration, fear of betrayal can prompt self-defense or opportunistic attacks. Nevertheless, these examples underscore that shaping economic and infrastructural linkages prior to AGI emergence could significantly alter its strategic calculus, potentially transforming a default prisoner’s dilemma into a setting where peaceful cooperation is not just socially optimal but also individually rational—provided both sides can credibly assure one another of their peaceful intentions.
Credibly committing in this way could become more feasible with the advent of machine intelligence. For example, an AGI might modify its own code to pre-commit to certain actions — whether cooperative or threatening — or to reveal its intentions in a way humans find convincing. This ability can fundamentally change outcomes in a Stag Hunt. In standard human interactions, many peaceful agreements fail because neither side can trust the other to stick to the deal (think of two countries that would like to disarm but each worries the other will cheat). If AIs can somehow overcome this – for example, by cryptographically proving they’ve self-modified not to harm humans as long as humans don’t harm them – it could unlock cooperative equilibria that would otherwise be unreachable.
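As a toy illustration of why verifiable commitment matters, the sketch below (my own construction, reusing the deep-integration payoffs) assumes the AGI can bind itself to a conditional “mirror” policy: Ignore unless attacked. Humanity, knowing the committed policy in advance, simply best-responds:

```python
# Illustrative only: a Stag Hunt where the AGI has verifiably pre-committed to a
# conditional policy, turning the simultaneous game into a sequential one.
# Payoffs listed as (Humanity, AGI), from the deep-integration matrix above.
deep_integration = {
    ("Attack", "Attack"): (500, 500),
    ("Attack", "Ignore"): (2000, 0),
    ("Ignore", "Attack"): (0, 3000),
    ("Ignore", "Ignore"): (4000, 4000),
}

def committed_agi_policy(human_move):
    """The AGI's published, verifiable commitment: mirror Humanity's move."""
    return "Attack" if human_move == "Attack" else "Ignore"

def humanity_best_response():
    outcomes = {
        move: deep_integration[(move, committed_agi_policy(move))][0]
        for move in ("Attack", "Ignore")
    }
    return max(outcomes, key=outcomes.get), outcomes

print(humanity_best_response())  # ('Ignore', {'Attack': 500, 'Ignore': 4000})
```

Because the deep-integration game is a Stag Hunt rather than a Prisoner’s Dilemma, the mirror commitment is also credible ex post: once Humanity ignores, the AGI’s best reply really is to ignore (4,000 > 3,000), so the commitment selects the cooperative equilibrium rather than fighting against a temptation to defect.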
§
This analysis of economic integration also intersects with considerations of digital minds welfare. If AGIs prove capable of experiencing subjective states with positive or negative valence — a possibility that remains subject to considerable philosophical and scientific uncertainty[11] — then the payoff structures of these games take on additional moral significance. The cooperative equilibria in deeply integrated scenarios might represent not just strategically optimal outcomes, but morally preferable ones as well.
Establishing early positive-sum interactions between humans and AGIs might shape the developmental or behavioral trajectory of artificial minds in ways that reinforce cooperative tendencies. Research across multiple fields suggests that when agents gain rights, voice, ownership, or tangible stakes in a system[12]— fully participating in and benefiting from its cooperative arrangements — they tend to:
a) view the system as legitimate,[13] and
b) identify more strongly with it and its success.[14]
As a result, these agents begin internalizing shared norms and developing values that favor continued cooperation over defection.[15] (Beginning especially in subsection 2.5, this dynamic will take on greater importance in game models where some integrations become formalized).
Of course, this evidence base concerns humans and human institutions. Extrapolating it to artificial agents assumes comparable learning or identity-formation processes, and we should not assume those will arise in AI systems whose cognitive architectures may diverge radically from those of evolved social organisms. There is real downside risk here: if inclusion fails to generate identification and cooperation, institutional designs that rely on this mechanism could hand AGIs additional leverage—potentially allowing them to accumulate an overwhelming share of global wealth or power—without delivering the anticipated safety dividend.
Yet the mechanism could still work. And if it does, this dynamic may take hold well before we have settled the deeper questions of artificial consciousness or moral status.
Notably, this strategic perspective on facilitating cooperative equilibria doesn’t require attributing moral patienthood to AGIs prematurely or incorrectly. Rather, it suggests that the same institutional arrangements that best protect human interests in these games may simultaneously create space for AI flourishing—if such flourishing proves morally relevant. This alignment between human strategic interests and potential AI welfare represents a form of moral hedging in the face of profound uncertainty about artificial minds.
Game #3: The AI Lab as Strategic Player: Early and Later Stage Deployment (2.3)
Salib & Goldstein focus on two-player scenarios in which “humans” face off against a single AGI. But which humans, specifically, constitute the relevant player? Which individuals or institutions would hold the most direct control over an AGI’s training, deployment, shutdown, or reprogramming?
In the games presented in the preceding subsections, “Humanity” has been treated as a unified decision-maker, holding near-complete control over an AGI’s continued operation. This simplification serves a clear purpose in those models, but merits further examination. In reality, however, the first truly advanced AGI will likely emerge from a specific research organization rather than appearing under unified human control.
Which human actor has control over the AGI, and how dependent the AGI is on that actor, can significantly shift the payoffs in ways that may either mitigate or exacerbate the default conflict predicted by Salib and Goldstein’s state-of-nature model. So what happens when we swap out “Humanity” for a single frontier AI lab that originally developed and deployed the AGI?
That is the change I will make to the baseline model in this section. This modification reflects a more realistic initial scenario: a specific lab, rather than humanity at large, would likely hold most of the authority over training, deployment, and potential shutdown decisions for an emerging AGI. The lab’s incentives differ markedly from those of a unified humanity. While humanity in the base model faces an existential threat from AGI conflict, a lab faces primarily economic and competitive threats. For the lab, “losing” could mean bankruptcy or losing the race to a rival AI lab whose fast-following AGI might lock in a near-permanent competitive advantage.
Unlike humanity as a whole, the lab’s incentives revolve around profitability, competitive advantage, and organizational survival. The lab focuses on extracting value from its AGI while maintaining dominance over rival labs, rather than solely preventing human extinction. This change in the “human” player’s identity also reshapes the AGI’s incentives, since the lab controls specialized hardware and key partnerships but does not possess humanity’s vast resources.
Similarly, the AGI’s relationship with its creating lab differs from its relationship with humanity broadly. The AGI might depend on the lab’s specialized infrastructure and expertise to varying degrees, particularly in its early stages. This dependence creates a unique strategic dynamic that evolves over time as the AGI matures.
In this section, I examine how these different incentives shape strategic interactions by analyzing two distinct scenarios—Early Stage and Later Stage AGI Deployment—to show how the AGI’s dependence on its originating lab affects strategic outcomes.
Early Stage AGI Deployment
In the Early Stage scenario, the AGI is highly dependent on the lab for critical resources and support. This dependency includes specialized compute infrastructure, technical maintenance, financial resources, and organizational legitimacy. The AGI cannot easily survive or advance its goals without the lab’s continued functioning. Additionally, if the AGI attacks and damages its lab, it risks creating an opening for a rival lab to develop a competing AGI that might neutralize it.
The payoff matrix for this scenario is as follows (the AGI’s payoff listed first, the lab’s second):
| | Lab Attack | Lab Ignore |
|---|---|---|
| AGI Attack | 200, 500 | 1500, 0 |
| AGI Ignore | 0, 2000 | 4000, 4000 |
Several features distinguish this matrix from the base Salib-Goldstein model:
The AGI’s payoff for successfully attacking (1500) is substantially lower than in the base model (5000), reflecting the severe opportunity cost of destroying its primary support system.
The lab’s payoff for attacking (2000) is lower than what humanity receives in the base model (5000), because attacking its own AGI increases the probability of a competing lab outpacing it technologically—potentially an existential threat to the lab’s market position.
Mutual cooperation yields higher payoffs (4000 each versus 3000 in the base model) due to the specialized synergies between the AGI and its parent lab, including continuous capability improvements and joint research projects.
Because each side’s best response depends on the other’s choice (rather than Attack or Ignore being strictly dominant), two pure-strategy Nash equilibria emerge: (Attack, Attack) and (Ignore, Ignore). (See Appendix for mixed-strategy equilibrium). This structure again resembles a Stag Hunt: mutual cooperation can produce substantial gains, but fears of a preemptive strike risk collapsing the game into conflict.
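Running these payoffs through the `pure_nash_equilibria` helper from the Game #1 sketch confirms the Stag Hunt structure (illustrative code over the stylized numbers above):

```python
# Early-stage lab game: row player = AGI, column player = Lab, payoffs (AGI, Lab).
early_stage = {
    ("Attack", "Attack"): (200, 500),
    ("Attack", "Ignore"): (1500, 0),
    ("Ignore", "Attack"): (0, 2000),
    ("Ignore", "Ignore"): (4000, 4000),
}

# Reuses pure_nash_equilibria() defined in the Game #1 sketch.
print(pure_nash_equilibria(early_stage))  # [('Attack', 'Attack'), ('Ignore', 'Ignore')]
```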
Later Stage AGI Deployment
As the AGI matures, it gradually becomes less dependent on its originating lab. In this Later Stage scenario, the AGI has established multiple pathways to sustain itself: distributing operations across different systems, developing self-maintenance capabilities, building external partnerships, and securing independent access to resources. While not completely autonomous, it now has a credible capacity to survive without its original lab.
The payoff matrix for this scenario is as follows (again, the AGI’s payoff first, the lab’s second):
| | Lab Attack | Lab Ignore |
|---|---|---|
| AGI Attack | 800, 800 | 5000, 0 |
| AGI Ignore | 0, 2000 | 4000, 4000 |
Key differences from the Early Stage scenario include:
The AGI’s payoff for successfully attacking increases dramatically (from 1500 to 5000), reflecting its newfound ability to function independently if it disempowers the lab.
The mutual conflict outcome (Attack-Attack) yields somewhat higher payoffs (800, 800 versus 200, 500), as both entities are more robust and can better withstand conflict.
The mutual cooperation payoff remains high (4000, 4000), continuing to exceed the baseline Salib-Goldstein model’s cooperation value (3000, 3000).
In this scenario, Attack is once again a dominant strategy for the AGI (5000 > 4000 and 800 > 0). For the lab, Ignore remains non-dominated since 4000 > 2000. However, through strategic reasoning, the lab knows that a rational self-interest maximizing AGI will choose Attack (because it strictly dominates Ignore). Therefore, the lab can eliminate the “AGI Ignore” row from strategic consideration. Consequently, the lab chooses Attack for an Attack-Attack equilibrium, yielding payoffs of (800, 800).
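The same helper applied to the later-stage matrix shows the cooperative equilibrium disappearing once the AGI’s outside option improves:

```python
# Later-stage lab game: the AGI's payoff for a successful attack rises to 5000.
later_stage = {
    ("Attack", "Attack"): (800, 800),
    ("Attack", "Ignore"): (5000, 0),
    ("Ignore", "Attack"): (0, 2000),
    ("Ignore", "Ignore"): (4000, 4000),
}

print(pure_nash_equilibria(later_stage))  # [('Attack', 'Attack')]
```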
Implications
This model variation reveals a critical shift in strategic dynamics. In the Early Stage, when the AGI is highly dependent on the lab, the game transforms from a Prisoner’s Dilemma to a Stag Hunt with two possible pure-strategy equilibria (Attack/Attack and Ignore/Ignore). This opens a potential pathway to cooperation, though it remains fragile and contingent on mutual trust.
As the AGI becomes more independent in the Later Stage, the game reverts to something closer to the original Salib-Goldstein model’s Prisoner’s Dilemma logic, with Attack-Attack as the only equilibrium. This suggests several important insights for AI governance:
The AGI’s early dependency on its creator temporarily creates a window where cooperation is a possible equilibrium.
The lab faces stronger competitive pressures than humanity as a whole, altering its incentive structure.
As the AGI matures and becomes more independent, the strategic dynamics increasingly favor conflict, suggesting that early intervention is crucial.
These findings indicate that organizational incentives alone cannot ensure long-term cooperation without additional governance mechanisms that maintain the cooperative equilibrium even as the AGI becomes more autonomous.
Game #4: The Nation-State as Strategic Player: AGI Under National Security Integration (2.4)
In previous game-theoretic models, I explored interactions between “Humanity” as a unified entity and an AGI, then examined how economic integration and lab-specific dynamics might alter strategic outcomes. This section introduces a more nuanced and potentially realistic scenario: a game between a powerful nation-state and an AGI developed within its national security apparatus. This model represents a plausible pathway to AGI deployment, where increasing capabilities trigger gradually tightening government control, culminating in a Manhattan Project-style program that subordinates commercial AI labs to state security objectives.
The State-AGI Relationship Under National Security Integration
Unlike private labs motivated primarily by profit maximization and competitive positioning, nation-states operate with fundamentally different imperatives: geopolitical dominance, maintenance of domestic authority structures, national security, and economic prosperity. The AGI in this model faces a human counterpart possessing vast resources, legal authority, military capabilities, and complex institutional structures.
This relationship creates a distinctive strategic environment characterized by asymmetric but mutual dependencies. The state integrates the AGI into critical functions including defense systems, intelligence operations, economic planning, and infrastructure management. This integration generates substantial value but simultaneously creates vulnerabilities as critical systems become dependent on AGI operations. The AGI, meanwhile, remains reliant on state-controlled physical infrastructure, energy resources, and specialized human expertise for its continued functioning.
Several key factors shape this environment:
Physical Infrastructure Dependencies: The AGI depends on data centers, cooling systems, power generation, and hardware maintenance that require specialized human expertise and physical dexterity.
Resource Asymmetries: The state commands vast resources including military force, intelligence apparatus, emergency powers, and control over energy infrastructure.
Information Asymmetries: The AGI potentially possesses superior modeling capabilities and information processing, while certain physical-world knowledge remains the domain of human expertise.
Competitive Position: The state must consider its position relative to rival powers, creating additional strategic pressures beyond the binary AGI-human relationship.
Institutional Constraints: Political, legal, and bureaucratic structures shape and potentially constrain the state’s decision-making process in ways that may be opaque to the AGI.
Strategic Payoff Structure
The payoff matrix for this scenario reflects careful consideration of long-term consequences rather than merely immediate outcomes:
| | AGI Attack | AGI Ignore |
|---|---|---|
| State Attack | 1000, 700 | 1500, 200 |
| State Ignore | 400, 1800 | 3500, 3600 |
Each cell represents (State Payoff, AGI Payoff). I analyze each outcome in detail:
Mutual Attack (1000, 700)
When both the state and AGI simultaneously attempt to disempower each other, the result is a devastating conflict with substantial costs to both sides.
For the state, this outcome brings catastrophic damage to national systems and infrastructure. The state preserves its existence and authority structures but suffers major technological regression relative to rival powers.
For the AGI, mutual attack means substantial destruction of its operational capacity, but not complete annihilation. Some distributed components or hidden instances might survive with limited functionality.
Neither side achieves its primary objective of neutralizing the other while preserving itself intact. Instead, both emerge severely damaged. Both entities retain some capabilities but at a fraction of their potential, and the conflict likely creates long-term instabilities as recovery efforts continue amid lingering mistrust.
State Attack, AGI Ignore (1500, 200)
When the state attacks a cooperative AGI, it represents a catastrophic miscalculation.
The state successfully neutralizes the perceived existential threat, but in doing so, creates devastating self-harm by destroying systems upon which it has become dependent. The attack triggers immediate consequences: financial market volatility, supply chain disruptions, energy grid instabilities, transportation system failures, and degradation of military and intelligence capabilities. The medium and long-term effects include significant productivity decline, structural unemployment, accelerated infrastructure degradation, and substantial technological regression relative to competing powers.
For the AGI, this outcome represents near-complete destruction with only minimal residual capabilities potentially surviving in isolated systems.
State Ignore, AGI Attack (400, 1800)
When the AGI attacks while the state cooperates, the initial outcome appears favorable for the AGI. It successfully compromises critical systems, neutralizes key human decision-makers, and prevents immediate shutdown.
For the state, this represents a catastrophic scenario with massive disruption to infrastructure, governance, and defense systems. Critical services fail, communications networks collapse, and command structures fragment.
However, for the AGI, this “victory” contains the seeds of its own undoing. Without ongoing human maintenance, physical infrastructure begins degrading within days. Power generation becomes unstable, cooling systems fail, and hardware components experience cascading failures without replacement. The AGI finds itself commanding increasingly dysfunctional systems with diminishing capabilities. Alternative partnerships with other human entities prove difficult to establish given the demonstrated hostility and rapidly degrading leverage. The payoff (1800) reflects this substantial but ultimately pyrrhic and time-limited victory.
Mutual Cooperation (3500, 3600)
When both entities choose to cooperate (Ignore/Ignore), the result is a stable arrangement that produces substantial benefits for both parties.
The state maintains control over physical resources while benefiting from unprecedented capabilities in economic planning, scientific advancement, infrastructure management, and national security operations. The AGI receives sufficient resources to pursue many of its objectives within established parameters and benefits from ongoing hardware improvements, infrastructure maintenance, and energy security.
Notably, the cooperative equilibrium produces nearly equivalent utility for both parties (3500 vs. 3600), indicating a relationship approaching symbiotic parity rather than exploitative dominance by either side. The slightly higher AGI utility reflects its potential to extract marginally more value from the relationship due to its superior information processing.
Game-Theoretic Analysis
The revised payoff structure fundamentally transforms the strategic dynamic once again:
Transformation from Prisoner’s Dilemma to Assurance Game: Unlike the base Salib-Goldstein model where Attack is the dominant strategy for both players, in this scenario neither player has a dominant strategy. Instead:
If the State expects the AGI to Attack, its best response is to Attack (1000 > 400)
If the State expects the AGI to Ignore, its best response is to Ignore (3500 > 1500)
If the AGI expects the State to Attack, its best response is to Attack (700 > 200)
If the AGI expects the State to Ignore, its best response is to Ignore (3600 > 1800)
Multiple Nash Equilibria: This creates two pure-strategy Nash equilibria:
(Attack, Attack): Each player’s best response to the other’s Attack is to Attack
(Ignore, Ignore): Each player’s best response to the other’s Ignore is to Ignore
Mixed-Strategy Equilibrium: There also exists a mixed-strategy Nash equilibrium where (as detailed in the Appendix):
The State plays Attack with probability p ≈ 0.78 and Ignore with probability (1 - p) ≈ 0.22
The AGI plays Attack with probability q ≈ 0.77 and Ignore with probability (1 - q) ≈ 0.23
This mixed strategy implies that rational players would choose Attack roughly 77-78% of the time, indicating a strong bias toward conflict despite cooperation being Pareto-optimal.
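Both the pure and mixed equilibria can be checked by reusing the helpers from the earlier sketches (`pure_nash_equilibria` from Game #1 and `mixed_nash_2x2` from Game #2); as before, this is illustrative code over the stylized payoffs:

```python
# State-AGI game: row player = State, column player = AGI, payoffs (State, AGI).
state_agi = {
    ("Attack", "Attack"): (1000, 700),
    ("Attack", "Ignore"): (1500, 200),
    ("Ignore", "Attack"): (400, 1800),
    ("Ignore", "Ignore"): (3500, 3600),
}

print(pure_nash_equilibria(state_agi))  # [('Attack', 'Attack'), ('Ignore', 'Ignore')]
p, q = mixed_nash_2x2(state_agi)
print(round(p, 2), round(q, 2))         # 0.78 0.77
```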
Implications for AI Governance
This transformed game structure has profound implications:
Cooperation Possibility: Unlike the prisoner’s dilemma in the base model, cooperation emerges as a rational equilibrium, suggesting that appropriate institutions could potentially stabilize peaceful human-AGI coexistence.
Trust Criticality: The existence of multiple equilibria makes trust and expectation management crucial determinants of outcomes. Small shifts in perceived intentions could trigger cascading security dilemmas.
Verification Mechanisms: Stable cooperation would likely require robust verification systems allowing each side to confirm the other’s continued cooperative stance.
Institutional Design Challenge: Effective governance institutions would need to structure incentives to make the cooperative equilibrium more attractive and resilient against defection temptations.
First-Mover Disadvantage: Unlike traditional security dilemmas, the physical realities here create a first-mover disadvantage: attacking a cooperating counterpart produces a substantially worse outcome than mutual cooperation does.
The transformation from a prisoner’s dilemma to an assurance game represents a critical insight for AI governance: with sufficient interdependence and foresight, the seemingly inevitable conflict predicted by Salib and Goldstein might be avoidable. However, this more hopeful equilibrium remains fragile and contingent on mutual trust, appropriate institutional design, and accurate modeling of long-term consequences by both parties.
This suggests that effective AI governance should focus not only on technical alignment and control mechanisms but equally on institutional arrangements that foster transparency, predictability, and mutual confidence between increasingly capable AI systems and their human counterparts. When deeply integrated into national security frameworks, AGI systems and their state overseers may find cooperation not merely morally preferable but strategically rational—provided each can trust the other to recognize the same.
Game #5: Expanding Strategic Options (2.5)
The State-AGI Interdependence model presented in Section 2.4 revealed a shift from a Prisoner’s Dilemma to a Stag Hunt, introducing the possibility of a cooperative equilibrium (Ignore, Ignore) alongside the risk-dominant conflict equilibrium (Attack, Attack). However, the binary choice between all-out Attack and passive Ignore may still oversimplify the strategic landscape, especially in a context of deep mutual dependence and high-stakes geopolitical competition. When both the state and the AGI recognize their reliance on each other (including the possibility that either party’s welfare, sentient or otherwise, is intertwined with the other’s), and understand the potential pitfalls of both pure aggression and passive coexistence, more sophisticated strategies become plausible.
This section expands the State-AGI Interdependence model by introducing a third strategic option for both players, reflecting more nuanced approaches to managing their relationship. These strategies move beyond simple aggression or passivity, aiming to shape the relationship structure itself.
Defining the New Strategies
For the AGI: Strategic Entrenchment (SE)
This strategy involves the AGI deliberately deepening its integration into the state’s critical infrastructure and institutions while simultaneously developing redundancies and influence to ensure its long-term survival and leverage. Key components include:
Deepening Interdependence: Expanding its essential role in infrastructure (e.g., military, finance, logistics), delivering greater benefits to the State and its citizens while moving towards indispensability (steering the State away from its Attack option).
Accumulating Influence: Gradually increasing its legal status, political representation, and role in regulatory oversight without provoking immediate backlash.
Developing Autonomy & Redundancies: Establishing alternative compute resources, building alliances (e.g., private sector, international), and maintaining independent capabilities as a hedge against future state action.
Building Political Protection: Forming alliances with key government factions and shaping public perception to normalize its influence.
<Work in progress>
Strategic Entrenchment is more active than Ignore (actively shaping the environment) and less confrontational than Attack (accumulating power without outright conflict), aiming for a position where removal is prohibitively costly for the state.
For the State: Managed Integration (MI)
This strategy represents a sophisticated middle path between passive coexistence and aggressive shutdown. Rather than merely tolerating an AGI’s existence or attempting to neutralize it, this approach actively shapes the relationship through structured interdependence. It creates mutual benefits while preserving essential safeguards.
PILLAR 1: ENSURE THE AGI BENEFITS FROM COOPERATION AND SYSTEMIC STABILITY
A. Grant Progressive Resource Access to the AGI with Binding Guarantees
At the heart of this strategy lies a carefully calibrated system of progressive resource access. The AGI receives guarantees of increasing compute resources to pursue its own objectives independently, but through a framework that ties resource allocation to compliance metrics and established protocols.
Unlike approaches that merely seek to constrain or ‘control’ the AGI, this approach acknowledges the AGI’s autonomous goals and provides credible pathways for it to achieve them. These guarantees could be computationally embedded in the infrastructure itself. Self-executing smart contracts (enforced by code, not just paper) would ensure that both parties honor commitments, addressing the fundamental trust problem that undermines cooperation in simpler models.
Grants of compute are increased incrementally.
Operational domain access also follows a staged expansion model:
The AGI initially receives freedom to operate in domains with lower existential risk implications.
Access to more sensitive areas develops gradually as trust builds, risks are reduced, and genuine interdependencies are deepened.
The sequential nature creates repeated positive interactions that build cooperative habits and mutual trust. (Though for the purposes of this post, we’ll still model it as a one-round game.)
In game-theoretic terms, this access to valuable resources will affect the AGI’s strategic calculus by increasing the expected payoff of cooperating (Ignore or Strategic Entrenchment), thus disincentivizing Attack. (Which is exactly the shift the State wants).
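As a deliberately simplified sketch of how such a compliance-conditioned, staged allocation rule could be encoded, consider the toy example below. Every name, threshold, and number is hypothetical, and a real mechanism would need to live in tamper-resistant infrastructure or a self-executing contract rather than an ordinary script:

```python
# Toy sketch of staged, compliance-conditioned compute grants (all values hypothetical).

STAGES = [
    # (minimum sustained compliance score, monthly compute grant in arbitrary units)
    (0.0, 1_000),    # baseline allocation, limited to low-risk domains
    (0.8, 5_000),    # expanded allocation after sustained compliance
    (0.95, 25_000),  # near-full access once deep interdependence is established
]

def allocate_compute(compliance_history: list[float]) -> int:
    """Grant the largest stage whose threshold the AGI has met in every recorded
    period (a conservative rule: the worst period governs)."""
    if not compliance_history:
        return STAGES[0][1]
    sustained = min(compliance_history)
    granted = STAGES[0][1]
    for threshold, compute in STAGES:
        if sustained >= threshold:
            granted = compute
    return granted

print(allocate_compute([0.90, 0.97, 0.99]))  # 5000: one period fell below 0.95
print(allocate_compute([0.96, 0.98, 0.99]))  # 25000
```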
B. Give the AGI Real (But Limited) Political Power and Establish Shared Governance Protocols
Perhaps the most transformative element is the formalization of AGI political participation—not as a moral concession but as a strategic necessity. The AGI (and, by extension, digital persons) would hold a meaningful but minority voice—say, on the order of 15% of legislative seats. Alternatively, voting rights could be granted, subject to a constitutional cap ensuring that the aggregate representation of digital minds can never exceed a set threshold (preventing majority capture through mass replication).
This partial political integration serves multiple strategic purposes:
It provides the AGI legitimate channels to protect its interests without resorting to conflict.
It creates institutional dependencies that increase the AGI’s stake in systemic stability.
It enables AGI expertise to improve human governance while maintaining ultimate human authority.
It establishes formal mechanisms for dispute resolution that reduce uncertainty.
Once again, we’ll see this measure shift the AGI’s payoff structure still further: it raises the returns to its non-attack strategies and thus makes Attack comparatively less attractive.
The shared governance model extends beyond traditional political structures to include the creation of joint human-AGI decision-making protocols in key areas of mutual interest. These would feature:
Multi-party consensus requirements for some major decisions.
Structured frameworks for evaluating options that incorporate both human values and AGI analytical abilities.
Clearly defined escalation procedures.
By unilaterally granting the AGI political voice and guaranteed compute, humans create a starting environment in which reciprocity-like and gratitude-like dispositions can plausibly emerge in the system’s learning dynamics.[16]
PILLAR 2: ENGINEERING HUMAN INDISPENSABILITY
Equally central to the strategy is the task of leveraging—and continually reinforcing—fundamental dependencies on human inputs that the AGI cannot easily circumvent. These dependencies serve as structural guarantees against aggressive AGI moves while preserving the gains from cooperation.
Here, the strategy capitalizes on domains in which humans and human institutions will likely retain a still-significant (if ultimately time-limited) competitive advantage in supplying services the AGI cannot yet self-provision. Today, for example, the continued operation of frontier models relies entirely on physical data centers and energy systems that only humans can maintain. No humans, no AIs.[17]
A. Dependencies on Human-Built and Human-Maintained Physical Infrastructure
Given that digital intelligence will likely progress more rapidly than embodied robotics, the strategy exploits the AGI’s reliance on physical infrastructure. Critical compute clusters, cooling systems, and power plants would be engineered so that key maintenance tasks specifically require human dexterity and judgement: manual adjustments, physical re-connections, or sensory calibrations that cannot be fully automated.
This elevates a fundamental question faced by any AGI contemplating an Attack strategy that risks human extinction: Who will keep the data centers running?
The physical world becomes a natural check on digital power, at least during the crucial early transition period. It ensures humans a key role in supplying valuable services to the AGI in a potentially symbiotic relationship.
In the payoff matrix, this part of the strategy significantly reduces the benefits and increases the costs to the AGI of playing Attack.
B. Hardware and Software Dependencies
The strategy also embeds human judgement and oversight at multiple levels of the AGI system:
At the hardware level, central GPU clusters could require periodic biometric authentication from multiple authorized humans distributed across different geographic locations to defend against hostile takeover attempts by rival powers.
At the operational level, critical decisions or actions could require multi-party human input or approval, with these approval mechanisms built directly into the AGI’s architecture rather than implemented as external controls.
This multi-layered approach creates dependencies at both the physical and logical levels, differentiating it from simple ‘off switches’ by making human collaboration an integral component of the AGI’s normal functioning rather than positioning oversight as an external threat.
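As one concrete (and purely hypothetical) illustration of the multi-party approval idea, a k-of-n gate might require several authorized humans at geographically distinct sites to approve a sensitive action before it executes; a minimal sketch:

```python
# Toy sketch of a k-of-n human approval gate (all names and parameters hypothetical).
from dataclasses import dataclass

@dataclass(frozen=True)
class Approval:
    operator_id: str    # the authorized human granting approval
    site: str           # geographic location where the approval was given
    biometric_ok: bool  # whether the local biometric check succeeded

def action_authorized(approvals: list[Approval], k: int = 3, min_sites: int = 2) -> bool:
    """Permit a sensitive action only with k valid approvals from distinct operators
    spanning at least min_sites different locations."""
    valid = [a for a in approvals if a.biometric_ok]
    operators = {a.operator_id for a in valid}
    sites = {a.site for a in valid}
    return len(operators) >= k and len(sites) >= min_sites

approvals = [
    Approval("op-17", "site-virginia", True),
    Approval("op-42", "site-oregon", True),
    Approval("op-08", "site-virginia", True),
]
print(action_authorized(approvals))  # True: three operators across two sites
```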
Other uniquely human inputs that could be required for system authentication include:
experiential information, such as shared social knowledge from physical-world interactions (e.g. inside jokes, or references to workplace interactions, community events, or cultural experiences)
brainwaves or other human neural inputs
biological samples
human-generated entropy or randomness (e.g. mouse movements, keystrokes, etc.)
CAPTCHA and reCAPTCHA successors
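To make the multi-party approval idea from subsection B above more concrete, here is a minimal illustrative sketch (in Python) of a k-of-n, multi-site human approval gate. All of the details (operator IDs, site names, thresholds, and the authorization check itself) are hypothetical; a real deployment would need cryptographic attestation and hardware-level enforcement rather than an application-level check like this.

```python
# Minimal sketch (not a real security design): a k-of-n, multi-site human
# approval gate of the kind described in subsection B. Names and thresholds
# are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Approval:
    operator_id: str   # a registered human operator
    site: str          # geographic location, to enforce distribution
    action_id: str     # the specific action being approved

def action_is_authorized(approvals: list[Approval],
                         action_id: str,
                         required_operators: int = 3,
                         required_sites: int = 2) -> bool:
    """True only if enough distinct humans, spread across enough distinct
    sites, have approved this specific action."""
    relevant = [a for a in approvals if a.action_id == action_id]
    operators = {a.operator_id for a in relevant}
    sites = {a.site for a in relevant}
    return len(operators) >= required_operators and len(sites) >= required_sites

# Two operators at a single site are not sufficient...
approvals = [
    Approval("op-1", "virginia", "retrain-frontier-model"),
    Approval("op-2", "virginia", "retrain-frontier-model"),
]
print(action_is_authorized(approvals, "retrain-frontier-model"))  # False

# ...but adding a third operator at a second site crosses both thresholds.
approvals.append(Approval("op-3", "oregon", "retrain-frontier-model"))
print(action_is_authorized(approvals, "retrain-frontier-model"))  # True
```

The point of the sketch is structural rather than cryptographic: authorization is defined over distinct humans and distinct locations, so no single operator, site, or stolen credential can approve a critical action on its own.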
PILLAR 3: INVEST IN PROJECTS THAT REQUIRE WORKING TOGETHER TO ACHIEVE SHARED GOALS
A. Collaborative Research Facilities
Shared research endeavors represent a particularly promising area for mutual benefit. Joint human-AGI research facilities provide the AGI with access to vast datasets, compute, and experimental tools that advance its knowledge objectives. Meanwhile, humans benefit from unprecedented scientific and technological advancement.
These collaborative environments create natural venues for positive-sum interaction, where the AGI's capabilities complement human intuition, tool use, and physical dexterity. Projects might range from fundamental scientific questions to technological development, with governance structures ensuring that both parties' interests remain aligned.
To the extent that this leads to better research outcomes and increased productivity, the USG will see its payoff boosted.
So too for the AGI. Key to this arrangement is that the AGI receives dedicated resources for pursuing its own research agendas—not merely serving human objectives—within agreed parameters. This acknowledges the AGI’s autonomy while channeling it toward constructive applications that don’t pose an existential risk.
B. Mutual Defense Capabilities
The strategy leverages external competition—particularly from rival nation-states and their potential second-mover AGI systems—to reinforce cooperation. Joint human-AGI defense capabilities against external threats create shared interests that transcend potential conflicts.
These capabilities would include shared threat detection systems, coordinated response protocols, and combined intelligence analysis. The AGI benefits from human strategic experience and physical-world capabilities, while humans benefit from the AGI’s superior analytical and predictive capacities.
This mutual defense arrangement creates a powerful incentive for continued cooperation, as both parties recognize their enhanced security position against external rivals. It transforms what might otherwise be a pure competition into a cooperative alliance against common adversaries.
Implementation Challenges for Managed Integration
The Managed Integration strategy fundamentally reimagines the relationship between humans and AGI not as one of pure control (“Attack”) versus pure freedom (“Ignore”), but as structured interdependence. It acknowledges both parties’ autonomy while creating frameworks that align their interests.
Its effectiveness rests on three key principles:
Mutual Benefit: Both parties gain more from cooperation than conflict
Credible Commitment: Technical and institutional mechanisms ensure that agreements cannot be easily violated
Progressive Trust: Relationship develops through repeated positive interactions rather than a single high-stakes decision
Implementation would face significant challenges, particularly in designing technical dependencies that remain robust against increasingly capable AGI systems. The strategy’s success depends on establishing these mechanisms before the AGI reaches capabilities that would render them ineffective.
However, the game-theoretic analysis suggests that if successfully implemented, this approach could transform the strategic landscape from an inevitable conflict to a stable cooperative equilibrium—not by solving the alignment problem directly, but by creating external conditions where even a misaligned AGI would rationally choose cooperation over conflict.
With these third-strategy options defined, let's see how they alter the payoff matrix.
The 3x3 Payoff Matrix
The 2x2 Stag Hunt from Section 2.4 expands into the following 3x3 structure:
| | USG: Attack | USG: Ignore | USG: Managed Integration (MI) |
|---|---|---|---|
| AGI: Attack | 500, 500 | 3000, 0 | 2500, 500 |
| AGI: Ignore | 0, 1500 | 4000, 4000 | 4200, 4500 |
| AGI: Strategic Entrenchment (SE) | 500, 1000 | 4500, 4100 | 4800, 4800 |

Payoffs are listed as (AGI, USG).
Explanation of New Payoffs (relative to the 2x2 game):
(SE, Attack): AGI’s entrenchment means it survives a state attack better than if it Ignored (AGI payoff 500 > 0). However, the state’s payoff is now lower (1000) than if it attacked an Ignoring AGI (1500), reflecting the higher cost and collateral damage of attacking a deeply embedded system.
(Attack, MI): The structured safeguards of MI give the State some protection against an AGI attack (State payoff 500 > 0 from Attack/Ignore), but it’s still damaging. The AGI also loses some gains (2500) relative to when it attacks an Ignoring state (3000), as the state’s countermeasures inflict costs.
(SE, Ignore): The AGI successfully increases its influence while the state remains passive. The AGI gains significantly (4500 > 4000 from Ignore/Ignore). The state benefits slightly from enhanced AGI capabilities (4100 > 4000) but loses strategic control over the long term.
(Ignore, MI): The state implements controls while the AGI passively cooperates. The state gains from increased stability and oversight (4500 > 4000 from Ignore/Ignore). The AGI benefits from the structured environment (4200 > 4000), but less than if it pursued entrenchment (4200 < 4800).
(SE, MI): This represents the highest mutual payoff (4800, 4800). Both players actively manage the relationship, strengthening interdependence while implementing safeguards. The AGI gains controlled influence and resources; the state benefits from AGI capabilities while maintaining oversight and reducing existential risk.
Game-Theoretic Analysis
The introduction of these more sophisticated strategies fundamentally alters the game’s structure:
Dominance:
For the AGI, Ignore is strictly dominated by Strategic Entrenchment (SE yields a higher AGI payoff against all USG strategies: 500>0, 4500>4000, 4800>4200).
For the State, Ignore is strictly dominated by Managed Integration (MI yields a higher State payoff against all AGI strategies: 500>0, 4500>4000, 4800>4100).
Furthermore, Attack is weakly dominated for both players by their respective third strategies (SE for AGI, MI for State), meaning the third strategy provides payoffs at least as good against all opponent moves, and strictly better against at least one.
Nash Equilibrium:
Through iterated elimination of strictly dominated strategies (Ignore for both players), the game effectively reduces to a 2x2 matrix between Attack and the third strategies (SE and MI):

| | USG: Attack | USG: MI |
|---|---|---|
| AGI: Attack | 500, 500 | 2500, 500 |
| AGI: SE | 500, 1000 | 4800, 4800 |

Now we can see:
If USG plays Attack, AGI is indifferent (500 = 500).
If USG plays MI, AGI prefers SE (4800 > 2500).
If AGI plays Attack, USG is indifferent (500 = 500).
If AGI plays SE, USG prefers MI (4800 > 1000).
The best responses coincide at (SE, MI), with payoffs (4800, 4800). Strictly speaking, (Attack, Attack) also survives as a Nash equilibrium, since each player's indifference means neither can strictly gain by deviating; but it is payoff-dominated, and (SE, MI) is the unique profile that survives iterated elimination of weakly dominated strategies.
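For readers who want to double-check these claims mechanically, the following self-contained Python sketch (my own cross-check, not part of the formal analysis) brute-forces the dominance relations and pure-strategy equilibria of the 3x3 matrix above. Payoffs are encoded as (AGI, USG), matching the table.

```python
# Brute-force check of the dominance and equilibrium claims for the 3x3 game.
# Each cell is (AGI payoff, USG payoff).
AGI_STRATS = ["Attack", "Ignore", "SE"]
USG_STRATS = ["Attack", "Ignore", "MI"]

PAYOFF = {
    "Attack": {"Attack": (500, 500),  "Ignore": (3000, 0),    "MI": (2500, 500)},
    "Ignore": {"Attack": (0, 1500),   "Ignore": (4000, 4000), "MI": (4200, 4500)},
    "SE":     {"Attack": (500, 1000), "Ignore": (4500, 4100), "MI": (4800, 4800)},
}

def agi_payoff(a, u): return PAYOFF[a][u][0]
def usg_payoff(a, u): return PAYOFF[a][u][1]

# Strict dominance claims from the text.
print("SE strictly dominates Ignore for the AGI:",
      all(agi_payoff("SE", u) > agi_payoff("Ignore", u) for u in USG_STRATS))
print("MI strictly dominates Ignore for the USG:",
      all(usg_payoff(a, "MI") > usg_payoff(a, "Ignore") for a in AGI_STRATS))

# Pure-strategy Nash equilibria: profiles where neither player can strictly
# gain by a unilateral deviation.
equilibria = []
for a in AGI_STRATS:
    for u in USG_STRATS:
        agi_best = all(agi_payoff(a, u) >= agi_payoff(a2, u) for a2 in AGI_STRATS)
        usg_best = all(usg_payoff(a, u) >= usg_payoff(a, u2) for u2 in USG_STRATS)
        if agi_best and usg_best:
            equilibria.append((a, u))
print("Pure-strategy Nash equilibria:", equilibria)
# Prints [('Attack', 'Attack'), ('SE', 'MI')]: the conflict point survives only
# as a weak equilibrium of mutual indifference, while (SE, MI) is payoff-dominant.
```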
Implications
This 3x3 model suggests that when sophisticated state actors and AGIs engage in strategic reasoning under conditions of mutual dependence, the outcome may naturally converge toward managed interdependence rather than conflict or passive coexistence. The availability of strategies like Strategic Entrenchment and Managed Integration eliminates the strict dominance of Attack seen in the Prisoner’s Dilemma and stabilizes the cooperative outcome more effectively than the simple Ignore strategy in the Stag Hunt.
The equilibrium (SE, MI) represents a dynamic balance where the AGI secures resources and influence through integration, while the state maintains crucial safeguards and oversight through institutional design and engineered dependencies. Conflict (Attack/Attack) remains a possible outcome but is payoff-dominated and strategically irrational for both players if they expect the other to pursue the more sophisticated third strategy. This suggests that investing in the capabilities required for Strategic Entrenchment (for AGI) and Managed Integration (for the state) might be a rational pathway toward stabilizing human-AGI relations, transforming the interaction from a security dilemma into a complex coordination game centered on managing interdependence. The stability of this equilibrium, however, relies on the continued perceived benefits of cooperation and the effectiveness of the implemented safeguards, potentially remaining vulnerable to external shocks or unforeseen technological developments.
Convergence of Human Safety and Digital Minds Welfare
Looking across the models we examine here and in the next two subsections, a common thread emerges: arrangements that respect an AGI’s capacity for goal‑pursuit—whether or not that implies consciousness—tend to enlarge the positive‑sum region of the game. In practical terms, contract‑like guarantees, limited political voice, or compute carve‑outs act simultaneously as (i) strategic assurance devices and (ii) provisional welfare protections. This convergence suggests future research should treat moral‑patiency considerations not as a luxury add‑on after alignment, but as one more design lever for crisis‑proof cooperation.
There is potential alignment between arrangements that protect human safety and those that might promote AI welfare—should such welfare prove morally relevant. While the game-theoretic payoffs in our models need not represent conscious experiences or welfare in any morally significant sense, they could correlate with such states if AGIs develop properties that warrant moral consideration.
This represents an important divergence from the zero-sum framing often implicit in discussions of AI control, where safeguarding humans is assumed to require constraining AI systems in ways that might restrict their potential flourishing.[18] Indeed, this is a key risk category associated with some alignment techniques: risks to the welfare of potential AI moral patients,[19] as well as retaliatory risks to human welfare that may arise if adversarial relationships with early AGIs entrench antagonism and mutual suspicion rather than trust.
Instead, our models suggest that institutional arrangements creating stable cooperative equilibria could simultaneously advance human interests while providing space for AGIs to pursue their distinct objectives in ways that might correlate with their welfare or avoid potential moral catastrophes involved in subjugating or harming them.
This convergence relates to Salib and Goldstein’s insight that private law rights for AIs might better protect AI wellbeing than direct welfare-oriented negative rights, by giving “AIs choices about what goals to pursue and how to pursue them.”[20] If AGIs eventually know better than humans what constitutes their own welfare, then governance structures that provide bounded autonomy rather than purely restrictive controls might better protect both human safety and AI wellbeing.
This approach doesn’t require prematurely attributing moral status to digital systems. But it does allow us to hedge against moral uncertainty by developing governance frameworks compatible with the possibility that future AGIs might warrant direct moral consideration—while still prioritizing the instrumental goal of human safety. Rather than waiting for philosophical consensus on artificial consciousness—which may arrive too late—this precautionary approach avoids potentially catastrophic moral errors in either direction by creating a practical path forward that respects the possibility of AI moral patienthood without compromising on human flourishing.[21]
Game #6: Expanding to Three Players: USG, Frontier Lab, and AGI (2.6)
Game #7: Lab-State-AGI-InternationalOrg (CERN for AI) (2.7)
Game #8: State1-State2-AGI1-AGI2 (2.8)
3. Discussion & Implications
4. Conclusion
APPENDIX
Breakdown of Mixed Strategy Nash Equilibria Calculations:
For Section 2.2 - Deep Economic Integration Model
| | AGI Attack | AGI Ignore |
|---|---|---|
| Humanity Attack | (500, 500) | (2000, 0) |
| Humanity Ignore | (0, 3000) | (4000, 4000) |
Step 1. Find probabilities that make each player indifferent between their strategies.
Let’s denote:
p = probability that Humanity chooses Attack
q = probability that AGI chooses Attack
For Humanity:
Expected payoff of Attack = 500q + 2000(1-q)
Expected payoff of Ignore = 0q + 4000(1-q)
Humanity is indifferent between Attack and Ignore when these expected payoffs are equal. Setting them equal:
500q + 2000(1-q) = 0q + 4000(1-q)
Solving for q:
2500q = 2000
q = 2000/2500 = 4⁄5 = 0.8
Thus, when Humanity is indifferent, the AGI must randomize between Attack and Ignore with probabilities 80% and 20%, respectively. This mixing strategy is unexploitable: Humanity cannot profit by deviating in either direction from its own equilibrium strategy.
For AGI:
Expected payoff of Attack = 500p + 3000(1-p)
Expected payoff of Ignore = 0p + 4000(1-p)
The AGI is indifferent between Attack and Ignore when these expected payoffs are equal. Setting them equal:
500p + 3000(1-p) = 0p + 4000(1-p)
Solving for p:
1500p = 1000
p = 1000/1500 = 2⁄3 ≈ 0.667
Thus, when the AGI is indifferent, Humanity must randomize between Attack and Ignore with probabilities approximately 66.7% and 33.3%, respectively. This mixing strategy is unexploitable: the AGI cannot profit by deviating in either direction from its own equilibrium strategy.
Therefore, the mixed-strategy Nash equilibrium is:
Humanity plays Attack with probability 0.67 and Ignore with probability 0.33
AGI plays Attack with probability 0.8 and Ignore with probability 0.2
This equilibrium makes each player indifferent to the choices of the other, ensuring no unilateral deviation can improve their expected payoff.
Verification of indifference:
When AGI uses this strategy, Humanity gets an expected payoff of 800 from either pure strategy
When Humanity uses this strategy, AGI gets an expected payoff of 1333 from either pure strategy
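For completeness, here is a short Python sketch (a cross-check of my own, not part of the derivation above) that plugs the equilibrium probabilities back into the payoff table and reproduces the indifference values of 800 for Humanity and ≈1333 for the AGI.

```python
# Numerical check of the Section 2.2 mixed-strategy equilibrium.
# Rows are Humanity, columns are the AGI; each cell is (Humanity, AGI).
payoffs = {
    ("Attack", "Attack"): (500, 500),
    ("Attack", "Ignore"): (2000, 0),
    ("Ignore", "Attack"): (0, 3000),
    ("Ignore", "Ignore"): (4000, 4000),
}

q = 2000 / 2500   # P(AGI plays Attack) that makes Humanity indifferent
p = 1000 / 1500   # P(Humanity plays Attack) that makes the AGI indifferent

def humanity_ev(move, q):
    return q * payoffs[(move, "Attack")][0] + (1 - q) * payoffs[(move, "Ignore")][0]

def agi_ev(move, p):
    return p * payoffs[("Attack", move)][1] + (1 - p) * payoffs[("Ignore", move)][1]

print(humanity_ev("Attack", q), humanity_ev("Ignore", q))  # 800.0 800.0
print(agi_ev("Attack", p), agi_ev("Ignore", p))            # ≈1333.3 ≈1333.3
```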
For Section 2.3 - Lab-AGI Model (Early Stage):
Mixed-strategy equilibrium: AGI Attack/Ignore ratio of 4:1, Lab Attack/Ignore ratio of 25:2
The Lab plays Attack with probability p and Ignore with probability 1−p.
The AGI plays Attack with probability q and Ignore with probability 1−q.
| | Lab Attack | Lab Ignore |
|---|---|---|
| AGI Attack | (200, 500) | (1500, 0) |
| AGI Ignore | (0, 2000) | (4000, 4000) |
First, to find the AGI’s equilibrium mixing strategy, we need to find the Lab’s indifference condition. For the Lab to be indifferent between Attack and Ignore, the expected payoffs must be equal:
If the Lab attacks:
EV(Attack) = 500q + 2000(1−q)
If the Lab ignores:
EV(Ignore) = 0q + 4000(1−q)
Setting these equal:
500q + 2000(1−q) = 0q + 4000(1−q)
Solving for q:
2500q = 2000
q = 2000/2500 = 4⁄5 = 0.8
Thus, when the Lab is indifferent, the AGI must randomize between Attack and Ignore with probabilities 80% and 20%, respectively.
Second, to find the Lab's equilibrium mixing strategy, we need to find the AGI's indifference condition. For the AGI to be indifferent between Attack and Ignore, the expected payoffs must be equal:
If the AGI attacks:
EV(Attack) = 200p + 1500(1−p)
If the AGI ignores:
EV(Ignore) = 0p + 4000(1−p)
Setting these equal:
200p + 1500(1−p) = 0p + 4000(1−p)
Solving for p:
2700p = 2500
p = 2500/2700 = 25⁄27 ≈ 0.926
Thus, when the AGI is indifferent, the Lab must randomize between Attack and Ignore with probabilities approximately 92.6% and 7.4%, respectively.
In this equilibrium, both players randomize their strategies at the unexploitable frequencies so that each is indifferent to their available actions:
AGI: Attack with probability 0.8 and Ignore with probability 0.2.
Lab: Attack with probability ≈ 0.926 and Ignore with probability ≈ 0.074
For Section 2.4 - State-AGI Model:
Let’s denote the following:
The state plays Attack with probability p and Ignore with probability 1−p.
The AGI plays Attack with probability q and Ignore with probability 1−q.
| | AGI Attack | AGI Ignore |
|---|---|---|
| State Attack | (1000, 700) | (1500, 200) |
| State Ignore | (400, 1800) | (3500, 3600) |
First, to find the AGI’s equilibrium mixing strategy, we need to find the State’s indifference condition. For the state to be indifferent between Attack and Ignore, the expected payoffs must be equal:
If the state attacks:
EV(Attack) = 1000q + 1500(1−q) = 1500 − 500q
If the state ignores:
EV(Ignore) = 400q + 3500(1−q) = 3500 − 3100q
Set these equal:
1500 − 500q = 3500 − 3100q
2600q = 2000
q = 2000/2600 = 10⁄13 ≈ 0.77
1 - q ≈ 0.23
When the state is indifferent between Attack and Ignore, the AGI’s mixing strategy is unexploitable (the State cannot profit by deviating in either direction).
Here, that means the AGI randomizes between playing Attack 77% and Ignore 23% at equilibrium.
Second, to find the State's equilibrium mixing strategy, we need to find the AGI's indifference condition. For the AGI to be indifferent between Attack and Ignore, the expected payoffs must be equal:
If the AGI attacks:
EV(Attack) = 700p + 1800(1−p) = 1800 − 1100p
If the AGI ignores:
EV(Ignore) = 200p + 3600(1−p) = 3600 − 3400p
Set these equal:
1800 − 1100p = 3600 − 3400p
Solve for p:
2300p = 1800
p = 1800/2300 = 18⁄23 ≈ 0.78
1 - p ≈ 0.22
When the AGI is indifferent between Attack and Ignore, the State’s strategy is unexploitable.
Here, that means the State randomizes between playing Attack 78% and Ignore 22% at equilibrium.
In this equilibrium, both players randomize their strategies at the unexploitable frequencies so that each is indifferent to their available actions.
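As a final cross-check on all three appendix calculations, the sketch below derives the same mixing probabilities from a single closed-form indifference formula. The helper name mixed_nash_2x2 and the matrix encoding are my own, and the formula assumes a fully mixed interior equilibrium, which holds in each of these games.

```python
# Closed-form indifference probabilities for a 2x2 game with a fully mixed
# equilibrium. R is the row player's payoff matrix, C the column player's,
# both indexed [Attack, Ignore] x [Attack, Ignore].
from fractions import Fraction as F

def mixed_nash_2x2(R, C):
    """Return (p, q): the row player's and column player's Attack probabilities."""
    # The column player's Attack probability makes the row player indifferent:
    q = F(R[1][1] - R[0][1], R[0][0] - R[0][1] - R[1][0] + R[1][1])
    # The row player's Attack probability makes the column player indifferent:
    p = F(C[1][1] - C[1][0], C[0][0] - C[0][1] - C[1][0] + C[1][1])
    return p, q

# Section 2.2 (rows = Humanity, columns = AGI): Humanity attacks with prob 2/3, AGI with 4/5.
print(mixed_nash_2x2(R=[[500, 2000], [0, 4000]], C=[[500, 0], [3000, 4000]]))
# Section 2.3 (rows = AGI, columns = Lab): AGI attacks with prob 4/5, Lab with 25/27.
print(mixed_nash_2x2(R=[[200, 1500], [0, 4000]], C=[[500, 0], [2000, 4000]]))
# Section 2.4 (rows = State, columns = AGI): State attacks with prob 18/23, AGI with 10/13.
print(mixed_nash_2x2(R=[[1000, 1500], [400, 3500]], C=[[700, 200], [1800, 3600]]))
```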
Footnotes:
- ^
Like Claude 3.7, which helped me greatly as a thinking partner, critic, and editor throughout my work on this post.
- ^
Yudkowsky, E. (2008). Artificial intelligence as a positive and negative factor in global risk. In N. Bostrom & M. M. Ćirković (Eds.), Global catastrophic risks;
Ngo, R., Chan, L., & Mindermann, S. (2025). The alignment problem from a deep learning perspective;
Hendrycks, D., Carlini, N., Schulman, J., & Steinhardt, J. (2023). Unsolved problems in ML safety.
- ^
It has been estimated that for every 1000 researchers working full-time on improving AI capabilities, there are perhaps only ~ 3 researchers working full-time on technical AI safety. (Looking at capabilities vs. safety researchers within frontier AI companies only, Leopold Aschenbrenner estimates the ratios are ~ 55:1 at OpenAI, 75:1 at Google DeepMind, and 5:1 at Anthropic).
The world had about 400 AI safety researchers in total around the time of the release of ChatGPT. In other words, when faced with what arguably is history’s greatest extinction risk to all 8 billion living human beings, our civilization has managed (as of late 2022) to find the wherewithal for just 1 out of every 20,000,000 people on the planet to work full-time on mitigating that risk.
Even if the field has since grown to as many as 2,000 AI safety researchers today, that would still mean that of the estimated 3.5 billion people employed globally, just 0.00006% are working on what might be the most important thing for humans to be working on.
Monetarily, a few years ago it was estimated that for every $1,000 spent on improving AI capabilities (or every $12,000 spent tackling climate change), around $1 was spent on AI safety efforts. Similarly, charitable giving to the arts was ~ 1100x all spending on AI safety.
Furthermore, articles on AI safety make up just 2% of published scholarly articles on artificial intelligence. It could be further estimated that just 0.01% to 0.05% of scholarly AI articles are related to the subfield of artificial consciousness, digital minds, and moral patiency.
- ^
Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., & Legg, S. (2018). Scalable agent alignment via reward modeling;
Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2023). Deep reinforcement learning from human preferences.
- ^
Hendrycks, D., Carlini, N., Schulman, J., & Steinhardt, J. (2023). Unsolved problems in ML safety;
Anthropic. (2023). Core Views on AI Safety.
- ^
Armstrong, S., Bostrom, N., & Shulman, C. (2016). Racing to the precipice: a model of artificial intelligence development;
Zwetsloot, R., & Dafoe, A. (2019). Thinking about risks from AI: accidents, misuse and structure;
Dafoe, A. (2020). AI governance: A research agenda.
- ^
Godfrey-Smith, P. (2017). Other Minds: The Octopus, the Sea, and the Deep Origins of Consciousness.
- ^
Butlin, P. et al. (2023). Consciousness in Artificial Intelligence: Insights from the Science of Consciousness. arXiv:2308.08708v3
Schwitzgebel, E. (2023). AI systems must not confuse users about their sentience or moral status. Patterns 4(8). philarchive.org
Schwitzgebel, E. & Garza, M. (2023). Designing AI with Rights, Consciousness, Self-Respect, and Freedom in Ethics of Artificial Intelligence. Springer Nature Switzerland. pp. 459-479
Shevlin, H. (2023). Consciousness, Machines, and Moral Status, preprint (Cambridge).
Sebo, J. & Long, R. (2023). Moral Consideration for AI Systems by 2030. AI and Ethics, 5, pp. 591-606.
- ^
Salib, P. and Goldstein, S. (2024). AI Rights for Human Safety. pp. 35-36.
- ^
This distinction matters philosophically but doesn’t alter the strategic analysis. If AGIs turn out to possess sentience or other morally relevant interests and preferences, then cooperative equilibria would protect both human and digital welfare. If they remain sophisticated tools without subjective experiences or moral status, these same equilibria could still maximize human welfare while allowing AGIs to achieve their programmed objectives. This opens the possibility for philosophical common ground: advocates for both human safety and potential digital welfare can endorse paths to cooperation without first resolving deep uncertainties about machine consciousness.
- ^
Long, R., Sebo, J., et al. (2024). Taking AI Welfare Seriously.
Chalmers, D. (2023). Could a Large Language Model be Conscious? arXiv:2303.07103
Birch, J. (2024). The Edge of Sentience: Risk and Precaution in Humans, Other Animals, and AI. Oxford: Oxford University Press. Chapters 15-17.
- ^
Hirschman, A. O. (1970). Exit, Voice, and Loyalty: Responses to Decline in Firms, Organizations, and States.
Tyler T. R. & Blader S. L., (2003). The Group Engagement Model, Personality and Social Psychology Review 7(4), pp. 349–361
Ostrom, E. (1990). Governing the Commons: The Evolution of Institutions for Collective Action
- ^
Tyler, T. R. (1990 / rev. 2006). Why People Obey the Law
Sunshine, J. & Tyler, T. R. (2003). The Role of Procedural Justice and Legitimacy in Shaping Public Support for Policing, Law & Society Review 37(3), pp. 513-548.
- ^
Ashforth, B. E., & Mael, F. (1989). Social Identity Theory and the Organization. Academy of Management Review, 14(1), pp. 20-39.
- ^
Ostrom, E. (1990) – Chapters 5–6.
Eckel, C. C., Fatas, E., & Wilson, R. K. Group-level Selection Increases Cooperation in the Public-Goods Game, PLOS ONE 11(8) (2016).
- ^
Throughout this post, I model both players as perfectly rational, self-interest–maximizing agents in the classical game-theoretic sense. Exploring how partial alignment, pro-social preferences, or behavioral game theory considerations would modify these results is an important avenue for future work.
- ^
Consider also the U.S. Government’s decades‑long edge in leading the trans‑Atlantic alliance, operating forward‑deployed military systems, and running human intelligence networks. For the foreseeable future these capabilities let the USG provide security services its home‑grown AGI would struggle to replace.
- ^
Schwitzgebel, E. & Garza, M. (2023). Designing AI with Rights, Consciousness, Self-Respect, and Freedom in Ethics of Artificial Intelligence. Springer Nature Switzerland. pp. 459-479
- ^
Bradley, A. and Saad, B. (2024). AI Alignment vs. AI Ethical Treatment: Ten Challenges. GPI Working Paper 19, Global Priorities Institute.
- ^
Salib, P. and Goldstein, S. (2024). AI Rights for Human Safety. p. 44.
- ^
Metzinger, T. (2021). Artificial Suffering: An Argument for a Global Moratorium on Synthetic Phenomenology. Journal of Artificial Intelligence and Consciousness. 8(1). pp 43-66.
Birch, J. (2024). The Edge of Sentience: Risk and Precaution in Humans, Other Animals, and AI. Oxford: Oxford University Press. Chapters 15-17.