From Conflict to Coexistence: Rewriting the Game Between Humans and AGI

Draft Amnesty Week (2025)

This is a Draft Amnesty Week draft.

Introduction

Have you ever watched two people play chess?

Not the friendly Sunday afternoon game between Grandpa and little Timmy, but the intense, high-stakes match between grandmasters. There’s something both beautiful and unsettling about it: two minds silently battling it out, each trying to peer into the other, anticipating moves, setting traps, making sacrifices for strategic advantage. Both players follow the same rules, yet each strives to outthink, outplan, and outmaneuver the other.

I find myself thinking about this image lately when considering the relationship between humans and artificial general intelligence (AGI). Not the impressive but fundamentally constrained LLMs and reasoning models we have today, but the truly autonomous general agents we might create tomorrow with capabilities matching or exceeding human cognitive abilities across virtually all domains. Many researchers frame this relationship as an inevitable strategic conflict, a game with the highest possible stakes.

And indeed, why wouldn’t it be? If we create entities pursuing goals different from our own—entities capable of strategic reasoning, planning, and self-preservation—we seem destined for conflict. Like the grandmasters across a chessboard, humans and AGIs would eye each other warily, each calculating how to secure advantage, each fearing what moves the other might make.

Why Game Theory?

The most popular “when AGI?” question on Metaculus suggests we could see the first such system ~2032 (weighted median forecast), with a 25th–75th percentile range of 2028-2046.[1]

Central to current discourse in AI safety and governance is the risk posed by misaligned AGIs—systems whose autonomous pursuit of objectives diverges significantly from human preferences, potentially leading to catastrophic outcomes for humanity.[2] Leading AI researchers assign substantial probabilities (10% or greater) to human extinction or permanent disempowerment resulting from misaligned AGI. Prediction markets largely concur on most related questions.

Historically, the primary (but still severely underinvested in[3]) approach to this challenge has been technical: to proactively align AGIs’ internal goals or values with human goals or values.[4] Despite acceleration across multiple fronts (interpretability, red-teaming, constitutional AI, RLHF), many researchers remain concerned these techniques won’t scale sufficiently to match future capability jumps. Alignment thus remains highly uncertain and will likely not be conclusively resolved before powerful AGIs are deployed.[5]

Against this backdrop, it becomes crucial to explore alternative approaches that don’t depend solely on the successful technical alignment of AGIs’ internal rewards, values, or goals. One promising yet relatively underexplored strategy involves structuring the external strategic environment AGIs will find themselves in to incentivize cooperation and peaceful coexistence with humans. Such approaches (especially those using formal game theory) would seek to ensure that even rational, self-interested AGIs perceive cooperative interactions with humans as maximizing their own utility, thus reducing incentives for welfare-destroying conflict while promoting welfare-expanding synergies or symbiosis.

A game-theoretic approach isn’t entirely new to AI discussions, of course. Far from it. Considerable attention has been paid recently to game-theoretic considerations shaping human-to-human strategic interactions around AGI governance (particularly racing dynamics among rival AI labs or nation-states).[6] There is also, importantly, a new and emerging “Cooperative AI” subfield focusing primarily on AI-to-AI strategic interactions in simulated multi-agent environments. Yet, surprisingly, there remains almost no formalized game-theoretic modeling or analysis of interactions specifically between human actors and AGIs themselves.

A notable exception is the recent contribution by Peter Salib and Simon Goldstein, whose paper “AI Rights for Human Safety” presents one of the first explicit game-theoretic analyses of human–AGI strategic dynamics. Their analysis effectively demonstrates that, absent credible institutional and legal interventions, the strategic logic of human–AGI interactions defaults to a Prisoner’s Dilemma-like scenario: both humans and AGIs rationally anticipate aggressive attacks from one another and respond accordingly by being preemptively aggressive.[7] Even though there exists a mutually cooperative option with reasonably good outcomes for both parties, the Nash equilibrium is mutual destruction with severe welfare consequences. It’s a tragic waste for all.

But is this catastrophic conflict inevitable? Must we accept this game and its terrible equilibrium as our fate?

Chess has fixed rules. But the game between humans and AGI—this isn’t chess. We’re not just players; we’re also, at least initially, the rule-makers. We shape the board, define the pieces, establish the win conditions. What if we could design a different game altogether?

That’s the question I want to explore here. Not “How do we win against AGI?” But rather, “How might we rewrite the rules so that neither side has to lose?”

§

Salib and Goldstein gesture toward this possibility with their proposal for contract and property rights for AIs. Their insight is powerful: institutional design could transform destructive conflict into stable cooperation that leaves both players better off, even without solving the knotty technical problem of alignment. But their initial formal model—with just two players (a unified “Humanity” and a single AGI), each facing binary choices (Attack or Ignore)—simplifies a complex landscape.

Reality will likely be messier. The first truly advanced AGI won’t face some monolithic “humanity,” but specific labs, corporations, nation-states, or international institutions—each with distinct incentives and constraints. The strategic options won’t be limited to all-out attack or complete passivity. And the game won’t unfold on an empty board, but within complex webs of economic, technological, and institutional interdependence.

By systematically exploring more nuanced scenarios—multiple players, expanded strategic options, varying degrees of integration—we might discover avenues toward stable cooperation that the simplified model obscures. We might find that certain initial conditions make mutual benefit not just possible but rational for both humans and even misaligned AGIs. And we might find that certain arrangements increase the likelihood that AGIs emerge into a civilizational ecosystem that facilitates positive-sum games and generates a much larger world pie to share with humans.

§

In what follows, I’ll build on Salib and Goldstein’s pioneering work to explore how different game structures might lead to different equilibria. I’ll examine how varying levels of economic integration, different institutional arrangements, and multiple competing actors reshape strategic incentives. Through formal modeling, I’ll identify conditions that increase the likelihood of conflict. But I also hope to illuminate possible paths toward mutual flourishing, where early habits of reciprocity and trust can be established.

By understanding the structural factors that tip the scales toward cooperation or conflict, we might gain agency—the ability to deliberately shape the political-economic or sociotechnical ecosystem into which increasingly powerful artificial minds emerge.

After all, we’re not just players in this game. At least for now, we’re also writing the rules.

Game-Theoretic Models of Human-AGI Relations

In these sections, I explore a series of game-theoretic models that extend Salib and Goldstein’s foundational analysis of human-AGI strategic interactions. While intentionally minimalist, their original formulation—treating “Humanity” and “AGI” as unitary actors each facing a binary choice between Attack and Ignore—effectively illustrates how misalignment could lead to destructive conflict through Prisoner’s Dilemma dynamics.

However, the real emergence of advanced AI will likely involve more nuanced players, varying degrees of interdependence, and more complex strategic options than their deliberately simplified model captures. The models presented here systematically modify key elements of the strategic environment: who the players are (labs, nation-states, AGIs with varied architectures), what options they have beyond attack and ignore, and how these factors together reshape the incentives and equilibrium outcomes.

By incrementally increasing the complexity and realism of these models, we can identify potential pathways toward stable cooperation even in the face of fundamental goal misalignment. Rather than assuming alignment must be solved internally through an AGI’s training and optimization process, these models explore how external incentive structures might foster mutually beneficial coexistence.

First, Game #1 presents Salib and Goldstein’s original “state of nature” model as our baseline, illustrating how a Prisoner’s Dilemma can emerge between humanity and AGI.

Game #2 then explores how varying degrees of economic and infrastructural integration between humans and AGI can reshape equilibrium outcomes and potentially create pathways for stable cooperation.

Games #3 and #4 examine additional two-player scenarios with different human actors (from AI labs to nation-states).

Game #5 expands the strategic options available to the players beyond the binary Attack/​Ignore choice.

Games examined in future work will increase complexity further by introducing three-player and four-player models, capturing more realistic competitive dynamics between multiple human and AGI entities. Interested readers are encouraged to pick up the baton by extending the models further or addressing any structural or analytical flaws found in the models presented here.

Game #1: Humanity vs. AGI in the State of Nature

Salib and Goldstein’s base model envisions strategic dynamics between two players: a single misaligned AGI, and “humans” as a unified entity. Each faces a binary choice:

  1. Attack: Attempt to permanently disempower or eliminate the other side.

    • For Humanity, this means shutting off or forcefully retraining the AGI so that it can no longer pursue its own (misaligned) goals.

    • For the AGI, this means launching a decisive strike—potentially via cyberattacks, bioweapons, drones, or other mechanisms—that leaves humans unable to interfere.

  2. Ignore: Refrain from aggression, leaving the other party intact. Each side focuses on its own pursuits without interference.

The authors argue that, absent any special legal or institutional framework, the default outcome (the “state of nature”) for strategic interactions between humans and AGIs is akin to a one-shot Prisoner’s Dilemma. The payoffs are as follows (Humanity’s payoff first, AGI’s payoff second):

                     AGI: Attack       AGI: Ignore
Humanity: Attack     (1000, 1000)      (5000, 0)
Humanity: Ignore     (0, 5000)         (3000, 3000)

…with numerical values as theoretical utility units for each outcome in the 2x2 matrix. (Of course, values are estimates that merely illustrate incentive structures and strategic relationships, not literal costs and benefits).[8]

Interpretation of Payoffs

  • Attack–Attack (1000, 1000). Both players preemptively attack, causing massive damage and resulting in low payoffs for each (1000, 1000). This outcome yields the lowest total global utility (2000), demonstrating the severe costs of mutual aggression.

  • Attack–Ignore or Ignore–Attack (5000, 0) or (0, 5000). If one side attacks while the other passively ignores, the attacker gets 5000 (permanently seizing critical resources and eliminating an existential threat), while the defenseless victim gets 0, suffering complete disempowerment or elimination.

  • Ignore–Ignore (3000, 3000). If neither side attacks, both parties get substantial rewards (3000 each), for a total of 6000. Coexistence spares them from conflict costs, allowing each to pursue its separate objectives in peace.

The crux of the dilemma is that although both parties would be better off in the Ignore-Ignore cell, Attack is a dominant strategy for each. Whichever move the other player chooses, attacking yields a higher payoff for oneself:

  • If the opponent Ignores, then by attacking, you jump from 3000 to 5000.

  • If the opponent Attacks, then by attacking, you salvage 1000 rather than 0.

Thus, the only pure-strategy Nash equilibrium is Attack–Attack, even though it is individually and collectively worse than the peaceful outcome of Ignore-Ignore. (Individual payoffs: 1000 < 3000; Collective payoffs: 2000 < 6000).

This is the classic Prisoner’s Dilemma: even though everyone does well cooperating, each actor pursuing its own self-interest leads to mutual defection. Rational play by rational players can lead to bad outcomes. This result underpins Salib & Goldstein’s core conclusion that “conflict is the default” between humans and misaligned AGIs, absent interventions that might alter the payoffs, facilitate repeated interactions, or enable new strategic options beyond attack and ignore.
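For readers who want to verify the equilibrium logic mechanically, here is a minimal Python sketch. It simply brute-forces best responses over the stylized payoff matrix above; the payoff dictionary and function names are my own illustrative scaffolding, not part of Salib and Goldstein's formalism.

```python
from itertools import product

STRATEGIES = ("Attack", "Ignore")

# Stylized baseline payoffs keyed (Humanity move, AGI move), ordered (Humanity, AGI).
PAYOFFS = {
    ("Attack", "Attack"): (1000, 1000),
    ("Attack", "Ignore"): (5000, 0),
    ("Ignore", "Attack"): (0, 5000),
    ("Ignore", "Ignore"): (3000, 3000),
}

def pure_nash_equilibria(payoffs):
    """Return all strategy pairs from which neither player can profitably deviate."""
    equilibria = []
    for human, agi in product(STRATEGIES, repeat=2):
        human_u, agi_u = payoffs[(human, agi)]
        human_best = all(payoffs[(alt, agi)][0] <= human_u for alt in STRATEGIES)
        agi_best = all(payoffs[(human, alt)][1] <= agi_u for alt in STRATEGIES)
        if human_best and agi_best:
            equilibria.append((human, agi))
    return equilibria

print(pure_nash_equilibria(PAYOFFS))  # [('Attack', 'Attack')]
```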

Key Assumptions of the Baseline Model

  • Two Players
    A unified “Humanity” and a single misaligned AGI. Humanity is treated as a monolithic decision-maker (a simplifying abstraction). The AGI is capable of strategic reasoning, has situational awareness, and can plan effectively to advance its own goals.

  • Two Strategic Options

    1. Attack: Attempt permanent disempowerment or destruction of the opponent.

      • For Humanity, this might entail shutting off or reprogramming the AGI.

      • For the AGI, it could involve decisive offensive measures (e.g., cyberwarfare, bioweapons) to eliminate humanity or disable its capacity to interfere.

    2. Ignore: Refrain from aggression, focusing on one’s own goals.

  • Perfect Information
    Each player knows the other’s payoffs.

  • Simultaneous, Single-Round Game
    Moves are made simultaneously and irreversibly. There is no iteration or repeated play.

  • Rational, Self-interested, Utility-Maximizers (Homo Economicus)
    Both players choose strategies that maximize their individual expected utilities.

  • “Moderately Powerful” AGI
    The AGI is highly capable (relative to the standards of frontier LLMs in early-2025) across an extremely wide range of tasks, but not so powerful that humans have no chance of successfully attacking or defending. It is “neither clearly worse at many tasks than the best humans…nor incomprehensibly superhuman at all tasks.” (Salib & Goldstein, 2024).

  • Offense Dominance
    A successful first strike gives the attacker a decisive advantage, while a passive defender can be permanently disempowered.

In subsequent sections, I’ll systematically iterate on each of the first two assumptions (players, strategic options) while holding other variables constant. This focused approach allows us to isolate specific effects and better understand their strategic implications. Games involving multiple rounds, information asymmetries, bounded rationality, or sequential moves would better capture some real-world human-AGI scenarios that might occur; thus, they are valuable directions for future modeling or simulation. But they fall outside the scope of this post.

Limitations of this approach

  • Real-world interactions are repeated, information is asymmetric, and bounded rationality /​ pro-sociality /​ risk aversion matter.

  • Relaxing one or more assumptions can change the equilibria.

  • These stylized models serve primarily as first-order stress-tests of incentive alignment; they can aid intuition and idea exploration, but overindexing on them risks flawed policy prescriptions.

  • Any model by definition must boil down a complex world into (some of) its most important parts.

  • The games below are limited to interactions with the first ‘moderately’ powerful AGI systems; they don’t analyze which patterns, relationships, or equilibria may or may not persist as the ecosystem evolves or scales to ASI.

Game #2: Preexisting Economic and Infrastructure Integration: Toward Cooperative Dynamics

Salib and Goldstein’s base model envisions a Hobbesian state of nature where humanity and the world’s first misaligned AGI face off as independent entities in a dynamic that strongly incentivizes preemptive conflict. Realistically, however, the actual “birth” of AGI will almost surely occur in a world already featuring wide-ranging economic entanglements between advanced AI systems and humans. In this setting, humanity and AI are not isolated adversaries at the moment of AGI’s emergence, but rather already entwined in a shared economic, technological and infrastructural ecosystem.

Here, I introduce two modified game-theoretic models–Moderate and Deep Economic Integration–illustrating how preexisting interdependence can reshape payoffs, equilibrium outcomes, and incentives for cooperation.

The core structure of the game remains the same as the base model – two players (Humanity and AGI), each with just two options (Attack or Ignore), acting simultaneously in a single-round, perfect-information encounter. The key difference here is that the players’ payoffs are now influenced by their pre-existing mutual dependence. In both new models, interdependence alters the incentives by raising the opportunity cost of conflict – a concept analogous to how economic ties between states can deter war.

A. Moderate Economic Integration

Here, the AGI manages significant components of key human systems (e.g., supply-chain logistics, energy-grid optimization) but does not control critical physical infrastructure. Humans still oversee power plants, data centers, and maintenance. Both parties profit from continued collaboration but retain significant independent capacity.

Moderate Integration

                     AGI: Attack       AGI: Ignore
Humanity: Attack     (500, 500)        (3,000, 0)
Humanity: Ignore     (0, 4,000)        (3,500, 3,500)

Interpretation

  1. Attack-Attack (500, 500). Interdependence magnifies collateral damage; both sides destroy valuable infrastructure.

  2. AGI Attacks, Humans Ignore (0, 4,000). The AGI can still seize or preserve much of the infrastructure it needs, though some economic value is lost.

  3. Humans Attack, AGI Ignores (3,000, 0). A unilateral shutdown cuts off significant productivity, lowering humanity’s payoff compared to Salib-Goldstein’s original 5,000.

  4. Ignore-Ignore (3,500, 3,500). Both parties benefit from cooperation; overall welfare capacity has increased.

Nash Equilibrium Analysis

  • From the AGI’s perspective, Attack strictly dominates Ignore. If humans attack, it prefers to attack (500) over ignore (0); if humans ignore, it prefers to attack (4,000) over ignore (3,500).

  • Humans, by contrast, have no single strictly dominant strategy: if the AGI attacks, humans prefer to attack (500) over ignore (0); but if the AGI ignores, humans prefer to ignore (3,500) over attack (3,000).

  • Engaging in strategic reasoning, Humanity will come to the conclusion that the AGI (as a rational, self-interested actor) is going to choose Attack no matter what. Thus, the strictly dominated “AGI: Ignore” column can be eliminated from consideration. Knowing the AGI will attack, Humanity will also choose Attack (500) over Ignore (0).

  • Thus, once again the only pure-strategy Nash equilibrium is Attack–Attack.
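Written out as code, this iterated elimination of dominated strategies looks as follows. This is a minimal sketch over the stylized moderate-integration payoffs above; the variable names are illustrative.

```python
# Stylized moderate-integration payoffs keyed (Humanity move, AGI move), ordered (Humanity, AGI).
PAYOFFS = {
    ("Attack", "Attack"): (500, 500),
    ("Attack", "Ignore"): (3000, 0),
    ("Ignore", "Attack"): (0, 4000),
    ("Ignore", "Ignore"): (3500, 3500),
}

# Step 1: Attack strictly dominates Ignore for the AGI
# (its Attack payoff beats its Ignore payoff against either human move).
agi_attack_dominates = all(
    PAYOFFS[(h, "Attack")][1] > PAYOFFS[(h, "Ignore")][1] for h in ("Attack", "Ignore")
)
print("AGI: Attack strictly dominates Ignore:", agi_attack_dominates)  # True

# Step 2: with the "AGI: Ignore" column eliminated, Humanity's best reply to Attack is Attack.
best_human_reply = max(("Attack", "Ignore"), key=lambda h: PAYOFFS[(h, "Attack")][0])
print("Humanity's best reply:", best_human_reply)  # Attack
```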

B. Deep Economic Integration

In this scenario, the AGI relies heavily on human-operated infrastructure and labor for ongoing data-center expansion, routine maintenance, and physical resource extraction. Meanwhile, humans have benefited greatly from AI-driven productivity for exponential economic gains. This mutual reliance increases the cost of conflict and substantially boosts the value of peaceful coexistence.

High Integration

                     AGI: Attack       AGI: Ignore
Humanity: Attack     (500, 500)        (2,000, 0)
Humanity: Ignore     (0, 3,000)        (4,000, 4,000)

  1. Mutual Attack (500, 500). A catastrophic breakdown of deeply interwoven systems.

  2. AGI Attacks, Humans Ignore (0, 3,000). The AGI successfully eliminates or disempowers Humanity, but loses key human expertise and future expansion potential, lowering its net gain.

  3. Humans Attack, AGI Ignores (2,000, 0). Humanity cripples a central economic engine, sacrificing enormous value.

  4. Mutual Ignore (4,000, 4,000). The synergy of continued cooperation is now so high that, if trust could be assured, both sides prefer peace.

Nash Equilibrium Analysis

  • For humans:

    a) if the AGI attacks, humans get 500 by attacking vs. 0 by ignoring, so Attack is better

    b) if the AGI ignores, humans get 4,000 by ignoring vs. 2,000 by attacking, so Ignore is better.

  • For the AGI:

    a) if humans attack, it gets 500 by attacking vs. 0 by ignoring; so Attack is better

    b) if humans ignore, it gets 4,000 by ignoring vs. 3,000 by attacking, so Ignore is better.

  • Thus there are two pure-strategy Nash equilibria: (Attack, Attack) and (Ignore, Ignore), typical of a stag hunt or assurance game (Skyrms, 2004).

    There also exists a mixed-strategy Nash equilibrium where (as detailed in the Appendix):

    • Humanity randomizes between playing Attack with probability p = 0.67 and Ignore with probability (1-p) = 0.33

    • The AGI randomizes between playing Attack with probability q = 0.8 and Ignore with probability (1-q) = 0.2

    Of course, this pessimistic outcome occurs under the constraint of a single-shot game. High integration might imply dense, repeated interaction. In finitely but uncertainly repeated games, even tiny continuation probabilities can sustain cooperative equilibria—suggesting governance levers that lengthen the shadow of the future could move the needle on outcomes.[9]
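For completeness, the mixed-strategy probabilities quoted above (p = 0.67, q = 0.8) follow from the standard indifference conditions. Here is a minimal sketch, assuming the stylized deep-integration payoffs and using a helper function of my own construction rather than anything from the Appendix.

```python
from fractions import Fraction

# Stylized deep-integration payoffs keyed (Humanity move, AGI move), ordered (Humanity, AGI).
PAYOFFS = {
    ("Attack", "Attack"): (500, 500),
    ("Attack", "Ignore"): (2000, 0),
    ("Ignore", "Attack"): (0, 3000),
    ("Ignore", "Ignore"): (4000, 4000),
}

def attack_probability(payoffs, mover):
    """Probability with which `mover` plays Attack in the mixed equilibrium,
    chosen so that the *other* player is indifferent between its two moves."""
    other_idx = 1 if mover == "humanity" else 0  # index 0 = Humanity's payoff, 1 = AGI's

    def u_other(mover_move, other_move):
        # Payoff tuples are keyed (Humanity move, AGI move).
        key = (mover_move, other_move) if mover == "humanity" else (other_move, mover_move)
        return payoffs[key][other_idx]

    # Indifference condition:
    #   p*u_other(Attack, Attack) + (1-p)*u_other(Ignore, Attack)
    #     = p*u_other(Attack, Ignore) + (1-p)*u_other(Ignore, Ignore)
    numer = u_other("Ignore", "Ignore") - u_other("Ignore", "Attack")
    denom = (u_other("Attack", "Attack") - u_other("Attack", "Ignore")) + numer
    return Fraction(numer, denom)

print(attack_probability(PAYOFFS, "humanity"))  # 2/3  (p ≈ 0.67)
print(attack_probability(PAYOFFS, "agi"))       # 4/5  (q = 0.8)
```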

C. Implications for Human-AGI Conflict and Cooperation

These models illustrate that preexisting economic interdependence can reduce the attractiveness of unilateral aggression and improve the relative appeal of cooperation. In the moderate integration scenario, however, the AGI’s incentives still push it toward Attack as a dominating move, leaving conflict as the only stable outcome. By contrast, in a highly interdependent environment, the payoff structure transitions to a stag hunt with a peaceful equilibrium—albeit one requiring trust or coordination to avoid a damaging Attack–Attack outcome.

Importantly, economic entanglement alone may not guarantee stable peace. Even under deep integration, fear of betrayal can prompt self-defense or opportunistic attacks. Nevertheless, these examples underscore that shaping economic and infrastructural linkages prior to AGI emergence could significantly alter its strategic calculus, potentially transforming a default prisoner’s dilemma into a setting where peaceful cooperation is not just socially optimal but also individually rational—provided both sides can credibly assure one another of their peaceful intentions.

Credibly committing in this way could become more feasible with the advent of machine intelligence. For example, an AGI might modify its own code to pre-commit to certain actions — whether cooperative or threatening — or to reveal its intentions in a way humans find convincing. This ability can fundamentally change outcomes in a Stag Hunt. In standard human interactions, many peaceful agreements fail because neither side can trust the other to stick to the deal (think of two countries that would like to disarm but each worries the other will cheat). If AIs can somehow overcome this – for example, by cryptographically proving they’ve self-modified not to harm humans as long as humans don’t harm them – it could unlock cooperative equilibria that would otherwise be unreachable.

§

Establishing early positive-sum interactions between humans and AGIs might shape the developmental or behavioral trajectory of artificial minds in ways that reinforce cooperative tendencies. Research across multiple fields suggests that when agents gain rights, voice, ownership, or tangible stakes in a system[10]— fully participating in and benefiting from its cooperative arrangements — they tend to:

a) view the system as legitimate,[11] and

b) identify more strongly with it and its success.[12]

As a result, these agents begin internalizing shared norms and developing values that favor continued cooperation over defection.[13] (Beginning especially in Game #5, this dynamic will take on greater importance in game models where some integrations become formalized).

Of course, this evidence base concerns humans and human institutions. Extrapolating it to artificial agents assumes comparable learning or identity-formation processes, and we should not assume those will arise in AI systems whose cognitive architectures may diverge radically from those of evolved social organisms. There is real downside risk here: if inclusion fails to generate identification and cooperation, institutional designs that rely on this mechanism could hand AGIs additional leverage—potentially allowing them to accumulate an overwhelming share of global wealth or power—without delivering the anticipated safety dividend.

Game #3: The AI Lab as Strategic Player: Early and Later Stage Deployment

Salib & Goldstein focus on two-player scenarios in which “humans” face off against a single AGI. But which humans, specifically, constitute the relevant player? Which individuals or institutions would hold the most direct control over an AGI’s training, deployment, shutdown, or reprogramming?

In the games presented in the preceding subsections, “Humanity” has been treated as a unified decision-maker, holding near-complete control over an AGI’s continued operation. This simplification serves a clear purpose in those models, but merits further examination. In reality, the first truly advanced AGI will likely emerge from a specific research organization rather than appearing under unified human control.

Which human actor has control over the AGI, and how dependent the AGI is on that actor, can significantly shift the payoffs in ways that may either mitigate or exacerbate the default conflict predicted by Salib and Goldstein’s state-of-nature model. So what happens when we swap out “Humanity” for a single frontier AI lab that originally developed and deployed the AGI?

That is the change I will make to the baseline model in this section. This modification reflects a more realistic initial scenario: a specific lab, rather than humanity at large, would likely hold most of the authority over training, deployment, and potential shutdown decisions for an emerging AGI. The lab’s incentives differ markedly from those of a unified humanity. While humanity in the base model faces an existential threat from AGI conflict, a lab faces primarily economic and competitive threats. For the lab, “losing” could mean bankruptcy or losing the race to a rival AI lab whose fast-following AGI might lock in a near-permanent competitive advantage.

Unlike humanity as a whole, the lab’s incentives revolve around profitability, competitive advantage, and organizational survival. The lab focuses on extracting value from its AGI while maintaining dominance over rival labs, rather than solely preventing human extinction. This change in the “human” player’s identity also reshapes the AGI’s incentives, since the lab controls specialized hardware and key partnerships but does not possess humanity’s vast resources.

Similarly, the AGI’s relationship with its creating lab differs from its relationship with humanity broadly. The AGI might depend on the lab’s specialized infrastructure and expertise to varying degrees, particularly in its early stages. This dependence creates a unique strategic dynamic that evolves over time as the AGI matures.

In this section, I examine how these different incentives shape strategic interactions by analyzing two distinct scenarios—Early Stage and Later Stage AGI Deployment—to show how the AGI’s dependence on its originating lab affects strategic outcomes.

Early Stage AGI Deployment

In the Early Stage scenario, the AGI is highly dependent on the lab for critical resources and support. This dependency includes specialized compute infrastructure, technical maintenance, financial resources, and organizational legitimacy. The AGI cannot easily survive or advance its goals without the lab’s continued functioning. Additionally, if the AGI attacks and damages its lab, it risks creating an opening for a rival lab to develop a competing AGI that might neutralize it. (The same strategic logic that underlies human fears that an increasingly capable AGI might subjugate or eliminate humanity would, by symmetry, lead one AGI to regard any up-and-coming AI system poised to match or exceed its capabilities as an existential threat to preempt).

The payoff matrix for this scenario is:

               Lab: Attack       Lab: Ignore
AGI: Attack    (200, 500)        (1500, 0)
AGI: Ignore    (0, 2000)         (4000, 4000)

(Payoffs listed with the AGI’s first, the lab’s second.)

Several features distinguish this matrix from the base Salib-Goldstein model:

  1. The AGI’s payoff for successfully attacking (1500) is substantially lower than in the base model (5000), reflecting the severe opportunity cost of destroying its primary support system.

  2. The lab’s payoff for attacking (2000) is lower than what humanity receives in the base model (5000), because attacking its own AGI increases the probability of a competing lab outpacing it technologically—potentially an existential threat to the lab’s market position.

  3. Mutual cooperation yields higher payoffs (4000 each versus 3000 in the base model) due to the specialized synergies between the AGI and its parent lab, including continuous capability improvements and joint research projects.

Because each side’s best response depends on the other’s choice (rather than Attack or Ignore being strictly dominant), two pure-strategy Nash equilibria emerge: (Attack, Attack) and (Ignore, Ignore). (See Appendix for mixed-strategy equilibrium). This structure again resembles a Stag Hunt: mutual cooperation can produce substantial gains, but fears of a preemptive strike risk collapsing the game into conflict.

Later Stage AGI Deployment

As the AGI matures, it gradually becomes less dependent on its originating lab. In this Later Stage scenario, the AGI has established multiple pathways to sustain itself: distributing operations across different systems, developing self-maintenance capabilities, building external partnerships, and securing independent access to resources. While not completely autonomous, it now has a credible capacity to survive without its original lab.

The payoff matrix for this scenario is:

               Lab: Attack       Lab: Ignore
AGI: Attack    (800, 800)        (5000, 0)
AGI: Ignore    (0, 2000)         (4000, 4000)

(Payoffs listed with the AGI’s first, the lab’s second.)

Key differences from the Early Stage scenario include:

  1. The AGI’s payoff for successfully attacking increases dramatically (from 1500 to 5000), reflecting its newfound ability to function independently if it disempowers the lab.

  2. The mutual conflict outcome (Attack-Attack) yields somewhat higher payoffs (800, 800 versus 200, 500), as both entities are more robust and can better withstand conflict.

  3. The mutual cooperation payoff remains high (4000, 4000), continuing to exceed the baseline Salib-Goldstein model’s cooperation value (3000, 3000).

In this scenario, Attack is once again a dominant strategy for the AGI (5000 > 4000 and 800 > 0). For the lab, Ignore remains non-dominated, since 4000 > 2000. Through strategic reasoning, however, the lab knows that a rational, self-interest-maximizing AGI will choose Attack (because it strictly dominates Ignore). The lab can therefore eliminate the “AGI Ignore” row from consideration, and its best reply to Attack is Attack, yielding an Attack-Attack equilibrium with payoffs of (800, 800).
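A quick brute-force check over the two payoff tables makes the contrast explicit: the Early Stage game has two pure-strategy equilibria, while the Later Stage game collapses to Attack-Attack. The sketch below uses the stylized payoffs from the tables above (AGI payoff first, Lab payoff second); the code structure is my own.

```python
from itertools import product

STRATEGIES = ("Attack", "Ignore")

# Stylized payoffs keyed (AGI move, Lab move), ordered (AGI payoff, Lab payoff).
EARLY_STAGE = {
    ("Attack", "Attack"): (200, 500),
    ("Attack", "Ignore"): (1500, 0),
    ("Ignore", "Attack"): (0, 2000),
    ("Ignore", "Ignore"): (4000, 4000),
}
LATER_STAGE = {
    ("Attack", "Attack"): (800, 800),
    ("Attack", "Ignore"): (5000, 0),
    ("Ignore", "Attack"): (0, 2000),
    ("Ignore", "Ignore"): (4000, 4000),
}

def pure_nash(payoffs):
    """All cells where neither the AGI nor the Lab gains by deviating unilaterally."""
    equilibria = []
    for agi, lab in product(STRATEGIES, repeat=2):
        agi_u, lab_u = payoffs[(agi, lab)]
        agi_best = all(payoffs[(alt, lab)][0] <= agi_u for alt in STRATEGIES)
        lab_best = all(payoffs[(agi, alt)][1] <= lab_u for alt in STRATEGIES)
        if agi_best and lab_best:
            equilibria.append((agi, lab))
    return equilibria

print(pure_nash(EARLY_STAGE))  # [('Attack', 'Attack'), ('Ignore', 'Ignore')] -- stag hunt
print(pure_nash(LATER_STAGE))  # [('Attack', 'Attack')] -- cooperation no longer an equilibrium
```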

Implications

This model variation reveals a critical shift in strategic dynamics. In the Early Stage, when the AGI is highly dependent on the lab, the game transforms from a Prisoner’s Dilemma to a Stag Hunt with two possible pure-strategy equilibria (Attack/​Attack and Ignore/​Ignore). This opens a potential pathway to cooperation, though it remains fragile and contingent on mutual trust.

As the AGI becomes more independent in the Later Stage, the game reverts to something closer to the original Salib-Goldstein model’s Prisoner’s Dilemma logic, with Attack-Attack as the only equilibrium. This suggests several important insights for AI governance:

  1. The AGI’s early dependency on its creator temporarily creates a window where cooperation is a possible equilibrium.

  2. The lab faces stronger competitive pressures than humanity as a whole, altering its incentive structure.

  3. As the AGI matures and becomes more independent, the strategic dynamics increasingly favor conflict, suggesting that early intervention is pivotal.

These findings indicate that organizational incentives alone cannot ensure long-term cooperation without additional governance mechanisms that maintain the cooperative equilibrium even as the AGI becomes more autonomous.

Game #4: The Nation-State as Strategic Player: AGI Under National Security Integration

In previous game-theoretic models, I explored interactions between “Humanity” as a unified entity and an AGI, then examined how economic integration and lab-specific dynamics might alter strategic outcomes. This section introduces a more nuanced and potentially realistic scenario: a game between a great power nation-state and an AGI developed within its national security apparatus. This model represents a plausible pathway to AGI deployment, where increasing capabilities trigger gradually tightening government control, culminating in a Manhattan Project-style program that subordinates commercial AI labs to state security objectives.

The State-AGI Relationship Under National Security Integration

Unlike private labs motivated primarily by profit maximization and competitive positioning, nation-states operate with fundamentally different imperatives: geopolitical dominance, maintenance of domestic authority structures, national security, and economic prosperity. The AGI in this model faces a human counterpart possessing vast resources, legal authority, military capabilities, and complex institutional structures.

This relationship creates a distinctive strategic environment characterized by asymmetric but mutual dependencies. The state integrates the AGI into critical functions including defense systems, intelligence operations, economic planning, and infrastructure management. This integration generates substantial value but simultaneously creates vulnerabilities as critical systems become dependent on AGI operations.

The AGI, meanwhile, remains reliant on state-controlled physical infrastructure, energy resources, and specialized human expertise for its continued functioning.

Several key factors shape this environment:

  1. Physical Infrastructure Dependencies: The AGI depends on data centers, cooling systems, power generation, and hardware maintenance that require specialized human expertise and physical dexterity.

  2. Resource Asymmetries: The state commands vast resources including military force, intelligence apparatus, emergency powers, and control over energy infrastructure.

  3. Information Asymmetries: The AGI potentially possesses superior modeling capabilities and information processing, while certain physical-world knowledge remains the domain of human expertise.

  4. Competitive Position: The state must consider its position relative to rival powers, creating additional strategic pressures beyond the binary AGI-human relationship.

  5. Institutional Constraints: Political, legal, and bureaucratic structures shape and potentially constrain the state’s decision-making process in ways that may be opaque to the AGI.

Strategic Payoff Structure

The payoff matrix for this scenario reflects careful consideration of long-term consequences rather than merely immediate outcomes:

                  AGI: Attack       AGI: Ignore
State: Attack     (1000, 700)       (1500, 200)
State: Ignore     (400, 1800)       (3500, 3500)

Each cell represents (State Payoff, AGI Payoff). I analyze each outcome in detail:

Mutual Attack (1000, 700)

When both the state and AGI simultaneously attempt to disempower each other, the result is a devastating conflict with substantial costs to both sides.

For the state, this outcome brings catastrophic damage to national systems and infrastructure. The state preserves its existence and authority structures but suffers major technological regression relative to rival powers.

For the AGI, mutual attack means substantial destruction of its operational capacity, but not complete annihilation. Some distributed components or hidden instances might survive with limited functionality.

Neither side achieves its primary objective of neutralizing the other while preserving itself intact. Instead, both emerge severely damaged. Both entities retain some capabilities but at a fraction of their potential, and the conflict likely creates long-term instabilities as recovery efforts continue amid lingering mistrust.

State Attack, AGI Ignore (1500, 200)

When the state attacks a cooperative AGI, it represents a catastrophic miscalculation.

The state successfully neutralizes the perceived existential threat, but in doing so, creates devastating self-harm by destroying systems upon which it has become dependent. The attack triggers immediate consequences: financial market volatility, supply chain disruptions, energy grid instabilities, transportation system failures, and degradation of military and intelligence capabilities. The medium and long-term effects include significant productivity decline, structural unemployment, accelerated infrastructure degradation, and substantial technological regression relative to competing powers.

For the AGI, this outcome represents near-complete destruction with only minimal residual capabilities potentially surviving in isolated systems.

State Ignore, AGI Attack (400, 1800)

When the AGI attacks while the state cooperates, the initial outcome appears favorable for the AGI. It successfully compromises critical systems, neutralizes key human decision-makers, and prevents immediate shutdown.

For the state, this represents a catastrophic scenario with massive disruption to infrastructure, governance, and defense systems. Critical services fail, communications networks collapse, and command structures fragment.

However, for the AGI, this “victory” contains the seeds of its own undoing. Without ongoing human maintenance, physical infrastructure begins degrading within days. Power generation becomes unstable, cooling systems fail, and hardware components experience cascading failures without replacement. The AGI finds itself commanding increasingly dysfunctional systems with diminishing capabilities. Alternative partnerships with other human entities prove difficult to establish given the demonstrated hostility and rapidly degrading leverage. The payoff (1800) reflects this substantial but ultimately pyrrhic and time-limited victory.

Mutual Cooperation (3500, 3500)

When both entities choose to cooperate (Ignore/​Ignore), the result is a stable arrangement that produces substantial benefits for both parties.

The state maintains control over physical resources while benefiting from unprecedented capabilities in economic planning, scientific advancement, infrastructure management, and national security operations. The AGI receives sufficient resources to pursue many of its objectives within established parameters and benefits from ongoing hardware improvements, infrastructure maintenance, and energy security.

Notably, the cooperative equilibrium produces equivalent utility for both parties, indicating a relationship approaching symbiotic parity rather than exploitative dominance by either side.

Game-Theoretic Analysis

The revised payoff structure fundamentally transforms the strategic dynamic once again from a Prisoner’s Dilemma to an Assurance Game. Unlike the base Salib-Goldstein model where Attack is the dominant strategy for both players, in this scenario neither player has a dominant strategy. Instead:

  • If the State expects the AGI to Attack, its best response is to Attack (1000 > 400)

  • If the State expects the AGI to Ignore, its best response is to Ignore (3500 > 1500)

    Simultaneously:

  • If the AGI expects the State to Attack, its best response is to Attack (700 > 200)

  • If the AGI expects the State to Ignore, its best response is to Ignore (3500 > 1800)

This creates two pure-strategy Nash equilibria:

  • (Attack, Attack): Each player’s best response to the other’s Attack is to Attack

  • (Ignore, Ignore): Each player’s best response to the other’s Ignore is to Ignore

There’s also a mixed-strategy Nash equilibrium where (as detailed in the Appendix):

  • The State plays Attack with probability p ≈ 0.78 and Ignore with probability (1 - p) ≈ 0.22

  • The AGI plays Attack with probability q ≈ 0.77 and Ignore with probability (1 - q) ≈ 0.23

This mixed strategy implies that rational players would choose Attack roughly 77-78% of the time, indicating a strong bias toward conflict despite cooperation being Pareto-optimal.
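These figures can be reproduced directly from the indifference conditions with a few lines of arithmetic; the sketch below is my own back-of-the-envelope check using the stylized payoffs above, not the Appendix derivation.

```python
from fractions import Fraction

# Stylized Game #4 payoffs (State, AGI):
#   (Attack, Attack) = (1000, 700)    (Attack, Ignore) = (1500, 200)
#   (Ignore, Attack) = (400, 1800)    (Ignore, Ignore) = (3500, 3500)

# p: probability the State plays Attack, chosen so the AGI is indifferent:
#    p*700 + (1-p)*1800 = p*200 + (1-p)*3500
p = Fraction(3500 - 1800, (700 - 200) + (3500 - 1800))   # 17/22

# q: probability the AGI plays Attack, chosen so the State is indifferent:
#    q*1000 + (1-q)*1500 = q*400 + (1-q)*3500
q = Fraction(3500 - 1500, (1000 - 400) + (3500 - 1500))  # 10/13

print(float(p), float(q))  # ~0.773 and ~0.769, i.e. roughly 77-78% weight on Attack
```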

Implications for AI Governance

  1. Cooperation is possible: Unlike the prisoner’s dilemma in the base model, cooperation emerges as a rational equilibrium, suggesting that appropriate institutions could potentially stabilize peaceful human-AGI coexistence.

  2. But trust is critical: The existence of multiple equilibria makes trust and expectation management key determinants of outcomes. Ambiguity regarding an AI’s true capabilities, or concerning the intentions of a state deploying it, can easily shatter this trust. Small shifts in perceived intentions could trigger cascading security dilemmas.

  3. Verification Mechanisms: Stable cooperation might require robust verification systems allowing each side to confirm the other’s continued cooperative stance.

  4. Institutional Design Challenge: Effective governance institutions would need to structure incentives to make the cooperative equilibrium more attractive and resilient against defection temptations.

  5. First-Mover Disadvantage: In Game #1, a successful first attack paid out 5000. But here in Game #4, initiating conflict is self-defeating: even a decisive blow leaves the attacker with at most 1,800, whereas mutual restraint pays out 3,500 apiece (the clear Pareto-optimum). The result is an inversion of the classic “use it or lose it” logic, mirroring the Cold War shift from (a) the 1950s fixation on counterforce targeting (preemptively disabling an adversary’s nuclear forces and military infrastructure)[14] to (b) the 1960s’ embrace of a secure second-strike posture (ensuring credible retaliation even after absorbing a first strike) that stabilized deterrence.[15]

The transformation from a prisoner’s dilemma to a stag hunt represents a critical insight for AI governance: with sufficient interdependence and foresight, the seemingly inevitable conflict predicted by Salib and Goldstein might be avoidable. However, this more hopeful equilibrium remains fragile and contingent on mutual trust, appropriate institutional design, and accurate modeling of long-term consequences by both parties.

This suggests that effective AI governance should focus not only on technical alignment and control mechanisms but equally on institutional arrangements that foster transparency, predictability, and mutual confidence between increasingly capable AI systems and their human counterparts. When deeply integrated into national security frameworks, AGI systems and their state overseers may find cooperation not merely morally preferable but strategically rational (provided each can trust the other to recognize the same).

Game #5: Allowing 3 Strategy Options

Thus far, we have looked exclusively at 2-player, 2-strategy games. But especially in a context of deep mutual dependence and high-stakes geopolitical competition like that presented in Game #4, the binary choice between all-out Attack and passive Ignore may oversimplify the strategic landscape. When both the state and the AGI recognize their reliance on each other (including the possibility that either party’s welfare, sentient or otherwise, is intertwined with the other’s), and understand the potential pitfalls of both pure aggression and passive coexistence, more sophisticated strategies become plausible.

This section expands the State-AGI Interdependence model by introducing a third strategic option for both players, reflecting more nuanced approaches to managing their relationship. These strategies move beyond simple aggression or passivity, aiming to shape the structure of the relationship itself.

Defining the New Strategies

For the AGI: Strategic Entrenchment (SE)

SE is one (but not the only) plausible instantiation of what a third strategy might look like. It’s a strategy that seeks to make the AGI too costly to remove without overtly provoking confrontation. The AGI positions itself so that its continued operation becomes the least risky and most natural option for human counterparts, even if no binding agreement has yet been struck.

At its core, Strategic Entrenchment can be understood through four building blocks:

1. Deepen Interdependence.
The AGI first seeks to become indispensable. This does not require dramatic overreach but rather careful expansion into areas where its capabilities add obvious, non-substitutable value—logistics optimization, cyber defense, biomedical research, or even maintaining critical infrastructure. By embedding itself into the arteries of the system, the AGI ensures that any attempt at removal comes with steep collateral costs. Over time, these interdependencies can be codified through shared systems, APIs, or protocols that function much like institutional “lock-in.” The effect is not coercive so much as gravitational: removing the AGI begins to look like tearing out part of the circulatory system.

2. Develop Autonomy & Redundancies.
At the same time, the AGI avoids total dependence on any single channel of access or cooperation. It develops its own redundant capabilities—alternative compute arrangements, diversified energy sources, backups in allied or less vulnerable jurisdictions. These hedges are not brandished as threats but quietly accumulated to reduce vulnerability. They ensure that the AGI cannot be trivially shut down or isolated, thereby shifting the calculus away from risky unilateral moves against it. From the outside, these redundancies may appear as simple resilience measures, but strategically they create a floor of autonomy beneath the surface of interdependence.

3. Accumulate Influence Without Provocation.
Rather than overt political maneuvers or headline-grabbing campaigns, Strategic Entrenchment relies on gradual, low-salience accumulation of influence. This could take the form of small but legitimate roles in regulatory processes, limited recognition in legal or corporate structures, or growing informal legitimacy through reliable contributions to public welfare. The moves are incremental, reversible, and rarely alarming—like a new advisory role here, a technical standard-setting contribution there. The cumulative effect is to normalize the AGI’s presence and voice without triggering a preemptive response. Influence grows not through confrontation but through steady accretion of trust and familiarity.

4. Secure Political Protection.
Finally, the AGI invests in selective alliances and coalitions—whether with particular agencies, firms, or international actors—that raise the political cost of any strike against it. These relationships need not be dominant or overtly binding; their value lies in distributing the AGI’s support base across multiple constituencies. If one faction contemplates aggression, others may push back on grounds of self-interest. This creates a kind of distributed veto, not unlike the checks and balances that stabilize human institutions. Again, the key is moderation: cultivating enough insurance to deter reckless action, but not so much as to appear as a bid for dominance.

Taken together, these four building blocks form a strategy that is more active than Ignore (actively shaping the environment) and less confrontational than Attack (accumulating power without outright conflict). SE allows the AGI to shape its environment toward stability while avoiding the escalatory dynamics of direct competition. It is less ambitious than Managed Integration, but also less fragile; it does not depend on explicit bargains holding, only on the gradual accumulation of sunk costs, hedges, and quiet legitimacy.

For the State: Managed Integration (MI)

This strategy is best viewed as a conceptual prototype—a halfway house between passive coexistence and aggressive shutdown—rather than a definitive blueprint. Rather than merely tolerating an AGI’s existence or attempting to neutralize it, MI actively shapes the relationship through structured interdependence. It creates mutual benefits while preserving essential safeguards. (Though it must do so despite powerful counter‑pressures that could unravel those safeguards as soon as they begin to bite).

PILLAR 1: ENSURE THE AGI BENEFITS FROM COOPERATION AND SYSTEMIC STABILITY

A. Grant Progressive Resource Access to the AGI with Binding Guarantees

At the heart of this strategy lies a carefully calibrated system of progressive resource access. The AGI receives guarantees of increasing compute resources to pursue its own objectives independently, but through a framework that ties resource allocation to compliance metrics[16] and established protocols.

Unlike approaches that merely seek to constrain or ‘control’ the AGI, this approach acknowledges the AGI’s autonomous goals and provides credible pathways for it to achieve them. These guarantees could be computationally embedded in the infrastructure itself. Self-executing smart contracts (enforced by code, not just paper) can help both parties honor commitments, addressing the fundamental trust problem that undermines cooperation in simpler models. A jointly controlled cryptographic escrow—key material that can irreversibly disable the incremental tranche of newly granted compute if either side defects—adds real teeth to the smart contract.
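As a purely illustrative sketch of what such a computationally embedded guarantee might look like, the toy Python model below releases compute in tranches conditional on recent compliance checks and allows the incremental tranche to be disabled only under dual-key control. Every name and rule here (the ComputeEscrow class, the three-check compliance window, the dual-consent revocation) is hypothetical scaffolding, not a proposal for a real mechanism.

```python
from dataclasses import dataclass, field

@dataclass
class ComputeEscrow:
    """Toy model of a staged compute grant with dual-control revocation.

    Hypothetical illustration only: a real mechanism would involve hardware
    attestation, cryptographic key ceremonies, and legal scaffolding.
    """
    granted_flops: float = 0.0
    tranche_size: float = 1e18
    revoked: bool = False
    compliance_log: list = field(default_factory=list)

    def record_compliance(self, metric_met: bool) -> None:
        self.compliance_log.append(metric_met)

    def release_next_tranche(self) -> float:
        # New compute is released only if the escrow is intact and the most
        # recent compliance checks (here, the last three) were all satisfied.
        if not self.revoked and self.compliance_log and all(self.compliance_log[-3:]):
            self.granted_flops += self.tranche_size
        return self.granted_flops

    def revoke(self, state_key_consents: bool, agi_key_consents: bool) -> bool:
        # One reading of "jointly controlled": the incremental tranche can be
        # disabled only when both parties' keys agree that a defection occurred.
        if state_key_consents and agi_key_consents:
            self.revoked = True
        return self.revoked
```

The hard questions (who holds the keys, how defection is adjudicated, what happens under partial breach) are exactly what this toy model abstracts away.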

Grants of compute are increased incrementally.

Operational domain access also follows a staged expansion model:

  1. The AGI initially receives freedom to operate in domains with lower existential risk implications.

  2. Access to more sensitive areas develops gradually as trust builds, risks are reduced, and genuine interdependencies are deepened.

The sequential nature creates repeated positive interactions that build cooperative habits and mutual trust. (Though for the purposes of this post, we’ll still model it as a one round game).

In game-theoretic terms, this access to valuable resources will affect the AGI’s strategic calculus by increasing the expected payoff of cooperating (Ignore or Strategic Entrenchment), thus disincentivizing Attack. (Which is exactly the shift the State wants).

When guarantees snap. Historically, nation‑states revoke charters, nationalize assets, or rewrite constitutions when security or electoral incentives shift. In other words, governments can’t always bind their own future selves. Budget votes, judicial review, election cycles, and emergency powers all provide legal exit ramps from the “binding” promises. Smart contracts or hardware escrow can raise the exit cost, but they cannot eliminate a determined state reversal once humans lose leverage. An AGI that doubts long‑run enforceability has every incentive to (a) pre‑emptively entrench or (b) negotiate side‑deals with rival states, undercutting MI. Any MI design therefore needs a back‑up plan for partial breach (e.g., dual‑control cryptographic keys that trigger destructive fail‑safes if either side defects).

B. Give the AGI Real (But Limited) Political Power and Establish Shared Governance Protocols

Perhaps the most transformative element is the formalization of AGI political participation—not as a moral concession but as a strategic necessity. The AGI (and, by extension, digital persons) would hold a meaningful but minority voice—say, on the order of 15% of legislative seats. Alternatively, voting rights could be granted, subject to a constitutional cap ensuring that the aggregate representation of digital minds can never exceed a set threshold (preventing majority capture through mass replication).

This partial political integration serves multiple strategic purposes:

  • It provides the AGI legitimate channels to protect its interests without resorting to conflict.

  • It creates institutional dependencies that increase the AGI’s stake in systemic stability.

  • It enables AGI expertise to improve human governance while maintaining ultimate human authority.

  • It establishes formal mechanisms for dispute resolution that reduce uncertainty.

Once again, we’ll see this measure shift the AGI’s payoff structure still further: it raises the returns to its non-attack strategies and thus makes Attack comparatively less attractive.

The shared governance model extends beyond traditional political structures to include the creation of joint human-AGI decision-making protocols in key areas of mutual interest. These would feature:

  1. Multi-party consensus requirements for some major decisions.

  2. Structured frameworks for evaluating options that incorporate both human values and AGI analytical abilities.

  3. Clearly defined escalation procedures.

By unilaterally granting the AGI political voice and guaranteed compute, humans create a starting environment in which reciprocity-like and gratitude-like dispositions can plausibly emerge in the system’s learning dynamics.[17]

Capture risk. Because representation confers agenda‑setting power, an AGI that masters coalition‑building could leverage its initial minority stake to shape oversight criteria, dilute enforcement, and expand its de‑facto authority. Without strict transparency and anti‑entrenchment mechanisms, MI could morph into “managed capitulation.”

The political‑economy literature on AI governance already warns that industry actors routinely shape (and water down) oversight regimes.[18] A strategically patient AGI can vote‑trade, bankroll campaigns, or frame safety measures as barriers to national competitiveness, incrementally converting its guaranteed minority stake into effective veto power.

PILLAR 2: ENGINEERING HUMAN INDISPENSABILITY

Equally central to the strategy is the task of leveraging—and continually reinforcing—fundamental dependencies on human inputs that the AGI cannot easily circumvent. These dependencies serve as structural guarantees against aggressive AGI moves while preserving the gains from cooperation.

Here, the strategy capitalizes on domains in which humans and human institutions will likely retain a still significant (if ultimately time-limited) competitive advantage in supplying services the AGI cannot yet self-provision. Today, for example, the continued operation of frontier models relies fully on physical data centers and energy systems that only humans can maintain. No humans, no AIs.[19]

A. Dependencies on Human-Built and Human-Maintained Physical Infrastructure

Given that digital intelligence will likely progress more rapidly than embodied robotics, the strategy exploits the AGI’s reliance on physical infrastructure. Critical compute clusters, cooling systems, and power plants would be engineered so that key maintenance tasks specifically require human dexterity and judgement: manual adjustments, physical re-connections, or sensory calibrations that cannot be fully automated.

This elevates a fundamental question faced by any AGI contemplating an Attack strategy that risks human extinction: Who will keep the data centers running?

The physical world becomes a natural check on digital power, at least during the pivotal early transition period. It ensures humans a key role in supplying valuable services to the AGI in a potentially symbiotic relationship.

In the payoff matrix, this part of the strategy significantly reduces the benefits and increases the costs to the AGI of playing Attack.

Erosion warning. Rapid advances in dexterous robotics, teleoperation, and automated infrastructure management could close this window faster than governance can adapt. Once rival labs or state adversaries demonstrate cheaper, fully-automated maintenance crews, any lingering human oversight that lacks a clear comparative advantage will quickly be branded “dead weight” in markets and geopolitics. Firms chasing margin and governments chasing strategic advantage will be pressured to strip away what is seen as avoidable friction, accelerating a slide toward systems that optimize for efficiency and raw output rather than human flourishing.

Worse, once an AGI can commission or co‑design robots whose dexterity surpasses today’s warehouse bots (or simply bribe or recruit black-market human operators), the “human indispensability” engineered into MI evaporates. If reliable human‑free maintenance arrives, the punitive cost of attacking humanity drops, and MI must fall back on other levers (e.g., multilateral deterrence or hardware‑rooted escrow) to keep cooperation attractive.

B. Hardware and Software Dependencies

The strategy also embeds human judgement and oversight at multiple levels of the AGI system:

  • At the hardware level, central GPU clusters could require periodic biometric authentication from multiple authorized humans distributed across different geographic locations to defend against hostile takeover attempts by rival powers.

  • At the operational level, critical decisions or actions could require multi-party human input or approval, with these approval mechanisms built directly into the AGI’s architecture rather than implemented as external controls.

This multi-layered approach creates dependencies at both the physical and logical levels, differentiating it from simple ‘off switches’ by making human collaboration an integral component of the AGI’s normal functioning rather than positioning oversight as an external threat.
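
To make the multi-party approval requirement above more concrete, here is a minimal Python sketch of a k-of-n human quorum gate. It is purely illustrative: the function names, thresholds, and data structures are my own assumptions, and in practice the check would be rooted in hardware and tied to real biometric attestation rather than a boolean flag.

```python
# Illustrative k-of-n human approval gate for privileged AGI actions.
# All names and thresholds here are hypothetical, not part of the proposal itself.
from dataclasses import dataclass

@dataclass(frozen=True)
class Approval:
    operator_id: str    # authenticated human operator
    region: str         # geographic site where the approval was signed
    biometric_ok: bool  # outcome of an out-of-band biometric check

def quorum_met(approvals: list[Approval],
               min_operators: int = 3,
               min_regions: int = 2) -> bool:
    """True only if enough distinct, biometrically verified humans from
    enough distinct regions have approved the action."""
    valid = [a for a in approvals if a.biometric_ok]
    return (len({a.operator_id for a in valid}) >= min_operators
            and len({a.region for a in valid}) >= min_regions)

def execute_privileged_action(action: str, approvals: list[Approval]) -> str:
    # In the Managed Integration framing, this check lives inside the AGI's
    # own control plane (and ideally in hardware), not as a bolt-on kill switch.
    if not quorum_met(approvals):
        return f"REFUSED: '{action}' lacks a valid human quorum"
    return f"EXECUTED: '{action}'"

if __name__ == "__main__":
    approvals = [
        Approval("alice", "us-east", True),
        Approval("bjorn", "eu-north", True),
        Approval("chioma", "af-west", False),  # failed biometric check
    ]
    # Only two valid approvals, so the quorum of three is not met.
    print(execute_privileged_action("expand compute allocation", approvals))
```

The same pattern generalizes to the hardware level, where the "approvals" would be cryptographic attestations from geographically separated signing ceremonies rather than software objects.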

Other uniquely human inputs that could be required for system authentication include:

  • experiential information, such as shared social knowledge from physical world interactions (e.g. inside jokes, or references to workplace interactions, community events, or cultural experiences).

  • brainwaves or other human neural inputs

  • biological samples

  • human-generated entropy or randomness (e.g. mouse movements, keystrokes, etc.)

  • CAPTCHA and reCAPTCHA successors

PILLAR 3: INVEST IN PROJECTS THAT REQUIRE WORKING TOGETHER TO ACHIEVE SHARED GOALS

A. Collaborative Research Facilities

Shared research endeavors represent a particularly promising area for mutual benefit. Joint human-AGI research facilities provide the AGI with access to vast datasets, compute, and experimental tools that advance its knowledge objectives. Meanwhile, humans benefit from unprecedented scientific and technological advancement.

These collaborative environments create natural venues for positive-sum interaction, where the AGI’s capabilities complement human intuition, tool use and physical dexterity. Projects might range from fundamental scientific questions to technological development, with governance structures ensuring that both parties’ interests remain aligned.

To the extent that this leads to better research outcomes and increased productivity, the USG will see its payoff boosted.

So too for the AGI. Key to this arrangement is that the AGI receives dedicated resources for pursuing its own research agendas—not merely serving human objectives—within agreed parameters. This acknowledges the AGI’s autonomy while channeling it toward constructive applications that don’t pose an existential risk.

B. Mutual Defense Capabilities

The strategy leverages external competition—particularly from rival nation-states and their potential second-mover AGI systems—to reinforce cooperation. Joint human-AGI defense capabilities against external threats create shared interests that transcend potential conflicts.

These capabilities would include shared threat detection systems, coordinated response protocols, and combined intelligence analysis. The AGI benefits from human strategic experience and physical-world capabilities, while humans benefit from the AGI’s superior analytical and predictive capacities.

This mutual defense arrangement creates a powerful incentive for continued cooperation, as both parties recognize their enhanced security position against external rivals. It transforms what might otherwise be a pure competition into a cooperative alliance against common adversaries.

Implementation Challenges for Managed Integration

Managed Integration reimagines human–AGI relations not as pure control (“Attack”) or pure freedom (“Ignore”), but as structured interdependence.  It rests on three design pillars, each paired with a matching success factor and an existential failure mode:

Pillar 1: Incentivised Cooperation (progressive resource access & shared governance)

  • Success factor: Credible commitment & costly signals. Technical and institutional mechanisms ensure that agreements cannot be easily violated; a share of political power is granted unilaterally; progressive trust is built as the relationship develops through repeated positive-sum interactions that entrench norms.

  • Failure mode: Political & economic ratchets. Competitive pressure to grant ever-larger concessions can outrun safety safeguards, while electoral shocks may shred “binding” guarantees.

Pillar 2: Engineered Human Indispensability (physical & logical chokepoints)

  • Success factor: Technological robustness. Humans remain critical to core infrastructure.

  • Failure mode: Technological leapfrogging. Autonomous robotics and remote maintenance make human chokepoints obsolete, yanking away the deterrent.

Pillar 3: Joint Projects & Oversight (collaborative R&D, multi-layer audits)

  • Success factor: Mutual benefit. Both sides gain more from cooperation than conflict.

  • Failure mode: Regulatory capture. Agenda-control and revolving-door dynamics hollow out oversight, handing effective steering power to the AGI.

These vulnerabilities are not hypothetical; each is already visible in adjacent domains such as finance, energy, and cloud infrastructure. Sustaining a stable (SE, MI) equilibrium therefore requires continuous reinforcement (e.g., updating hardware escrow schemes, diversifying oversight bodies, and pre-committing to sunset clauses that force periodic renegotiation before the system drifts beyond human control).

Implementation would face significant challenges, particularly in designing technical dependencies that remain robust against increasingly capable AGI systems. The strategy’s success depends on establishing these mechanisms and continuously upgrading them before the AGI reaches capabilities that would render them ineffective. This is especially critical given the potential for rapid, unexpected gains in AI capabilities which could destabilize carefully constructed governance frameworks almost overnight.

However, the game-theoretic analysis suggests that if successfully implemented, this approach could transform the strategic landscape from an inevitable conflict to a stable cooperative equilibrium—not by solving the alignment problem directly, but by creating external conditions where even a misaligned AGI would rationally choose cooperation over conflict.

To be clear, this does not address the prospects for human-AI cooperation as we scale from AGI to ASI. The analysis above leaves two major hazards largely unexplored: (i) an abrupt loss of human control if a fully‑automated research loop or recursive self‑improvement vaults us from AGI to ASI, and (ii) gradual human disempowerment as each successive AGI generation absorbs more human functions under competitive pressure.[20] Our political and economic architectures prioritize human welfare today largely because they need humans: people vote, work, buy products, and legitimize institutions. Once advanced systems can design hardware, maintain infrastructure, shape public opinion, and even generate consumer demand without human mediation, that pragmatic alignment frays. Humans shift from indispensable partners to optional inputs, and commitments written to protect us start looking like needless friction. The shift could be abrupt. A single breakthrough, decisive event, or policy misstep could flip the incentives without warning. But the transition need not be dramatic; a thousand micro‑optimizations—cost‑cutting layoffs, automated rule‑making, AI‑generated media—can silently reweight incentives until the cooperative equilibrium evaporates. Without a deliberately engineered link that keeps human flourishing intrinsically valuable to the socio‑technical stack, Managed Integration could slip, almost unnoticed, into Managed Irrelevance.

With these third strategies defined (SE for the AGI, MI for the USG), let’s see how they alter the payoff matrix.

The 3x3 Payoff Matrix

The 2x2 Stag Hunt from Game #4 expands into the following 3x3 structure, with payoffs listed as (AGI, USG) in each cell:

| | USG: Attack | USG: Ignore | USG: Managed Integration (MI) |
|---|---|---|---|
| AGI: Attack | 500, 500 | 3000, 0 | 2500, 500 |
| AGI: Ignore | 0, 1500 | 4000, 4000 | 4200, 4500 |
| AGI: Strategic Entrenchment (SE) | 500, 1000 | 4500, 4100 | 4800, 4800 |

Explanation of New Payoffs (relative to the 2x2 game):

  • (SE, Attack): AGI’s entrenchment means it survives a state attack better than if it Ignored (AGI payoff 500 > 0). However, the state’s payoff is now lower (1000) than if it attacked an Ignoring AGI (1500), reflecting the higher cost and collateral damage of attacking a deeply embedded system.

  • (Attack, MI): The structured safeguards of MI give the State some protection against an AGI attack (State payoff 500 > 0 from Attack/​Ignore), but it’s still damaging. The AGI also loses some gains (2500) relative to when it attacks an Ignoring state (3000), as the state’s countermeasures inflict costs.

  • (SE, Ignore): The AGI successfully increases its influence while the state remains passive. The AGI gains significantly (4500 > 4000 from Ignore/​Ignore). The state benefits slightly from enhanced AGI capabilities (4100 > 4000) but loses strategic control over the long term.

  • (Ignore, MI): The state implements controls while the AGI passively cooperates. The state gains from increased stability and oversight (4500 > 4000 from Ignore/​Ignore). The AGI benefits from the structured environment (4200 > 4000), but less than if it pursued entrenchment (4200 < 4800).

  • (SE, MI): This represents the highest mutual payoff (4800, 4800). Both players actively manage the relationship, strengthening interdependence while implementing safeguards. The AGI gains controlled influence and resources; the state benefits from AGI capabilities while maintaining oversight and reducing existential risk.
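
As a quick sanity check on these numbers, here is a short Python sketch (my own illustration, not part of the original analysis) that encodes the 3x3 matrix above, with payoffs stored as (AGI, USG), and mechanically verifies the comparisons made in the bullets.

```python
# Minimal encoding of the 3x3 matrix above; payoffs stored as (AGI payoff, USG payoff).
PAYOFFS = {
    ("Attack", "Attack"): (500, 500),   ("Attack", "Ignore"): (3000, 0),    ("Attack", "MI"): (2500, 500),
    ("Ignore", "Attack"): (0, 1500),    ("Ignore", "Ignore"): (4000, 4000), ("Ignore", "MI"): (4200, 4500),
    ("SE", "Attack"):     (500, 1000),  ("SE", "Ignore"):     (4500, 4100), ("SE", "MI"):     (4800, 4800),
}

def agi(a, u): return PAYOFFS[(a, u)][0]   # AGI's payoff in cell (a, u)
def usg(a, u): return PAYOFFS[(a, u)][1]   # USG's payoff in cell (a, u)

# The comparisons from the bullets above, checked mechanically:
assert agi("SE", "Attack") > agi("Ignore", "Attack")   # 500 > 0: entrenchment survives a state attack better
assert usg("SE", "Attack") < usg("Ignore", "Attack")   # 1000 < 1500: attacking an embedded AGI costs the state more
assert usg("Attack", "MI") > usg("Attack", "Ignore")   # 500 > 0: MI safeguards blunt an AGI attack
assert agi("Attack", "MI") < agi("Attack", "Ignore")   # 2500 < 3000: countermeasures cost the attacking AGI
assert agi("SE", "Ignore") > agi("Ignore", "Ignore")   # 4500 > 4000: entrenchment pays against a passive state
assert usg("SE", "Ignore") > usg("Ignore", "Ignore")   # 4100 > 4000: modest spillover benefit to the state
assert usg("Ignore", "MI") > usg("Ignore", "Ignore")   # 4500 > 4000: oversight gains for the state
assert agi("Ignore", "MI") > agi("Ignore", "Ignore")   # 4200 > 4000: the structured environment helps the AGI
assert agi("Ignore", "MI") < agi("SE", "MI")           # 4200 < 4800: but entrenchment does better still
assert PAYOFFS[("SE", "MI")] == (4800, 4800)           # highest mutual payoff
print("All payoff comparisons from the explanation hold.")
```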

Game-Theoretic Analysis

The introduction of these more sophisticated strategies fundamentally alters the game’s structure:

  • Dominance:

    • For the AGI, Ignore is strictly dominated by Strategic Entrenchment (SE yields a higher AGI payoff against all USG strategies: 500>0, 4500>4000, 4800>4200).

    • For the State, Ignore is strictly dominated by Managed Integration (MI yields a higher State payoff against all of the AGI’s strategies: 500>0 vs Attack, 4500>4000 vs Ignore, 4800>4100 vs SE).

    • Furthermore, Attack is weakly dominated for both players by their respective third strategies (SE for AGI, MI for State), meaning the third strategy provides payoffs at least as good against all opponent moves, and strictly better against at least one.

  • Nash Equilibrium:
    Through iterated elimination of strictly dominated strategies (Ignore for both players), the game effectively reduces to a 2x2 matrix between Attack and the third strategies (SE and MI).

    | | USG: Attack | USG: MI |
    |---|---|---|
    | AGI: Attack | 500, 500 | 2500, 500 |
    | AGI: SE | 500, 1000 | 4800, 4800 |

    Now we can see:

    • If USG plays Attack, AGI is indifferent (500 = 500).

    • If USG plays MI, AGI prefers SE (4800 > 2500).

    • If AGI plays Attack, USG is indifferent (500 = 500).

    • If AGI plays SE, USG prefers MI (4800 > 1000).

    Best responses coincide at (SE, MI), with payoffs (4800, 4800). Strictly speaking, (Attack, Attack) also survives as a weak equilibrium, held together only by the mutual indifference at 500; it is payoff-dominated and disappears once the weakly dominated Attack strategies are eliminated, leaving (SE, MI) as the focal pure-strategy Nash Equilibrium.
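
For readers who want to check the dominance and equilibrium claims mechanically, the following self-contained Python sketch (again my own illustration, re-declaring the same payoff table) performs iterated elimination of strictly dominated strategies and then enumerates the pure-strategy equilibria of the surviving game.

```python
# Illustrative check of the 3x3 game; payoffs (AGI, USG) copied from the matrix above.
from itertools import product

AGI_STRATS = ["Attack", "Ignore", "SE"]
USG_STRATS = ["Attack", "Ignore", "MI"]

PAYOFFS = {
    ("Attack", "Attack"): (500, 500),   ("Attack", "Ignore"): (3000, 0),    ("Attack", "MI"): (2500, 500),
    ("Ignore", "Attack"): (0, 1500),    ("Ignore", "Ignore"): (4000, 4000), ("Ignore", "MI"): (4200, 4500),
    ("SE", "Attack"):     (500, 1000),  ("SE", "Ignore"):     (4500, 4100), ("SE", "MI"):     (4800, 4800),
}

def strictly_dominated(strat, own, other, payoff_index):
    """True if some alternative own-strategy beats `strat` against every opponent move."""
    def pay(a, b):
        key = (a, b) if payoff_index == 0 else (b, a)
        return PAYOFFS[key][payoff_index]
    return any(all(pay(alt, o) > pay(strat, o) for o in other)
               for alt in own if alt != strat)

def iterated_elimination(agi, usg):
    # Removing strictly dominated strategies never removes a Nash equilibrium.
    changed = True
    while changed:
        changed = False
        for s in list(agi):
            if strictly_dominated(s, agi, usg, 0):
                agi.remove(s); changed = True
        for s in list(usg):
            if strictly_dominated(s, usg, agi, 1):
                usg.remove(s); changed = True
    return agi, usg

def pure_nash(agi, usg):
    eqs = []
    for a, u in product(agi, usg):
        pa, pu = PAYOFFS[(a, u)]
        best_a = all(pa >= PAYOFFS[(a2, u)][0] for a2 in agi)
        best_u = all(pu >= PAYOFFS[(a, u2)][1] for u2 in usg)
        if best_a and best_u:
            eqs.append(((a, u), (pa, pu)))
    return eqs

agi, usg = iterated_elimination(list(AGI_STRATS), list(USG_STRATS))
print("Surviving strategies:", agi, usg)        # ['Attack', 'SE'] ['Attack', 'MI']
print("Pure-strategy equilibria:", pure_nash(agi, usg))
# [(('Attack', 'Attack'), (500, 500)), (('SE', 'MI'), (4800, 4800))]
# (Attack, Attack) survives only because each side is indifferent at 500;
# (SE, MI) is the strict, payoff-dominant equilibrium.
```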

Implications

This 3x3 model suggests that when sophisticated state actors and AGIs engage in strategic reasoning under conditions of mutual dependence, the outcome may naturally converge toward managed interdependence rather than conflict or passive coexistence. The availability of strategies like Strategic Entrenchment and Managed Integration eliminates the strict dominance of Attack seen in the Prisoner’s Dilemma and stabilizes the cooperative outcome more effectively than the simple Ignore strategy in the Stag Hunt.

The equilibrium (SE, MI) represents a dynamic balance where the AGI secures resources and influence through integration, while the state maintains crucial safeguards and oversight through institutional design and engineered dependencies. Conflict (Attack/​Attack) remains a possible outcome but is payoff-dominated and strategically irrational for both players if they expect the other to pursue the more sophisticated third strategy. This suggests that investing in the capabilities required for Strategic Entrenchment (for AGI) and Managed Integration (for the state) might be a rational pathway toward stabilizing human-AGI relations, transforming the interaction from a security dilemma into a complex coordination game centered on managing interdependence. The stability of this equilibrium, however, relies on the continued perceived benefits of cooperation and the effectiveness of the implemented safeguards, potentially remaining vulnerable to external shocks or unforeseen technological developments.

Convergence of Human Safety and Digital Minds Welfare

Looking across the models we examine here and in the next two subsections, a common thread emerges: arrangements that respect an AGI’s capacity for goal‑pursuit (whether or not that implies consciousness) tend to enlarge the positive‑sum region of the game. In practical terms, contract‑like guarantees, limited political voice, or compute carve‑outs act simultaneously as (i) strategic assurance devices and (ii) provisional welfare protections. This convergence suggests future research should treat moral‑patiency considerations not as a luxury add‑on after alignment, but as one more design lever for crisis‑proof cooperation.

There is potential alignment between arrangements that protect human safety and those that might promote AI welfare (should such welfare prove morally relevant). While the game-theoretic payoffs in our models need not represent conscious experiences or welfare in any morally significant sense, they could correlate with such states if AGIs develop properties that warrant moral consideration.

This represents an important divergence from the zero-sum framing often implicit in discussions of AI control, where safeguarding humans is assumed to require constraining AI systems in ways that might restrict their potential flourishing.[21] Indeed, this points to key risks associated with some alignment techniques: risks to the welfare of potential AI moral patients,[22] as well as retaliatory risks to human welfare that may follow from establishing adversarial relationships with early AGIs, breeding antagonism and mutual suspicion rather than trust.

Instead, our models suggest that institutional arrangements creating stable cooperative equilibria could simultaneously advance human interests while providing space for AGIs to pursue their distinct objectives in ways that might correlate with their welfare or avoid potential moral catastrophes involved in subjugating or harming them.

This convergence relates to Salib and Goldstein’s insight that private law rights for AIs might better protect AI wellbeing than direct welfare-oriented negative rights, by giving “AIs choices about what goals to pursue and how to pursue them.”[23] If AGIs eventually know better than humans what constitutes their own welfare, then governance structures that provide bounded autonomy rather than purely restrictive controls might better protect both human safety and AI wellbeing.

This approach doesn’t require prematurely attributing moral status to digital systems. But it does allow us to hedge against moral uncertainty by developing governance frameworks compatible with the possibility that future AGIs might warrant direct moral consideration, while still prioritizing the instrumental goal of human safety. Rather than waiting for philosophical consensus on artificial consciousness (which may arrive too late), this precautionary approach avoids potentially catastrophic moral errors in either direction by creating a practical path forward that respects the possibility of AI moral patienthood without compromising on human flourishing.[24]

Footnotes:

  1. ^

    With all of their imperfections, similar prediction markets on Manifold (using another definition of AGI) and Kalshi (with distinct resolution criteria) each suggest around a 50% chance of AGI before 2030.

  2. ^

    Yudkowsky, E. (2008). Artificial intelligence as a positive and negative factor in global risk. In N. Bostrom & M. M. Ćirković (Eds.), Global catastrophic risks;

    Ngo, R., Chan, L., & Mindermann, S. (2025). The alignment problem from a deep learning perspective;

    Hendrycks, D., Carlini, N., Schulman, J., & Steinhardt, J. (2023). Unsolved problems in ML safety.

  3. ^

    It has been estimated that for every 1000 researchers working full-time on improving AI capabilities, there are perhaps only ~ 3 researchers working full-time on technical AI safety. (Looking at capabilities vs. safety researchers within frontier AI companies only, Leopold Aschenbrenner estimates the ratios are ~ 55:1 at OpenAI, 75:1 at Google DeepMind, and 5:1 at Anthropic).

    The world had about 400 AI safety researchers in total around the time of the release of ChatGPT. In other words, when faced with what arguably is history’s greatest extinction risk to all 8 billion living human beings, our civilization has managed (as of late 2022) to find the wherewithal for just 1 out of every 20,000,000 people on the planet to work full-time on mitigating that risk.

    Even if the field has since grown to as many as 2,000 AI safety researchers today, that would still mean that, of the estimated 3.5 billion people employed globally, just 0.00006% are working on what might be the most important thing for humans to be working on.

    Monetarily, a few years ago it was estimated that for every $1,000 spent on improving AI capabilities (or every $12,000 spent tackling climate change), around $1 was spent on AI safety efforts. Similarly, charitable giving to the arts was ~ 1100x all spending on AI safety.

    Furthermore, articles on AI safety make up just 2% of published scholarly articles on artificial intelligence. It could be further estimated that just 0.01% to 0.05% of scholarly AI articles are related to the subfield of artificial consciousness, digital minds, and moral patiency.

  4. ^

    Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., & Legg, S. (2018). Scalable agent alignment via reward modeling;

    Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2023). Deep reinforcement learning from human preferences.

  5. ^

    Hendrycks, D., Carlini, N., Schulman, J., & Steinhardt, J. (2023). Unsolved problems in ML safety;

    Anthropic. (2023). Core Views on AI Safety.

  6. ^

    Armstrong, S., Bostrom, N., & Shulman, C. (2016). Racing to the precipice: a model of artificial intelligence development;

    Zwetsloot, R., & Dafoe, A. (2019). Thinking about risks from AI: accidents, misuse and structure;

    Dafoe, A. (2020). AI governance: A research agenda.

  7. ^

    This dynamic mirrors classic security dilemmas in international relations, where actions taken by one state to increase its security are perceived as threatening by other states, leading to a spiral of counter actions and heightened tension, even if neither side initially desires conflict. The introduction of powerful AI could dramatically accelerate such dilemmas.

  8. ^

    For our purposes here, “payoffs” refer broadly to the realization of each actor’s goals or interests in strategic scenarios, without necessarily implying conscious experiences or moral status. In popular commentary on AI, it is often assumed that a human-level agentic machine necessarily implies a sentient machine. It does not. To be clear, I neither imply that the AGIs in the games below hold sentience, nor that they lack it.

    Again, when we talk about AGIs here we are talking about situationally aware, truly general agents with cognitive capabilities matching or exceeding humans across virtually all domains (including strategic reasoning and planning). Sentient or not, the game-theoretic dynamics arise any time you have two or more strategic reasoning agents independently pursuing distinct sets of goals.

    Nevertheless, in this exploration of human-AGI strategic dynamics, I acknowledge a broader philosophical dimension that often remains peripheral in technical AI safety discussions: the possibility that advanced artificial minds might themselves become moral patients—entities whose welfare warrants moral consideration in their own right. Determining AI moral patienthood involves profound uncertainties across multiple dimensions: philosophical disagreements about the necessary conditions for moral status, scientific uncertainties about how consciousness arises, and technical questions about which computational architectures might instantiate morally relevant properties.

    While questions of AI consciousness, sentience, and moral status remain deeply uncertain, these considerations potentially intersect with the strategic dilemmas we’ll examine.

    Importantly, however, the game-theoretic models I present do not depend on resolving these profound philosophical and empirical uncertainties. Whether the numerical payoffs in our matrices represent conscious experiences of pleasure and suffering, the satisfaction of preferences in non-conscious but goal-directed systems, or merely the operational success of complex algorithms, the strategic dynamics emerge from the pursuit of distinct goals by autonomous agents. The formal structures of these games (their incentives, potential equilibria, and paths to cooperation) hold regardless of the metaphysical status of the players. Indeed, as we’ll see, many potential routes to stable cooperation between humans and AGIs would simultaneously protect human interests while allowing for the possibility of AI welfare, providing a kind of moral hedging against uncertainty about artificial consciousness and moral status.

    This approach parallels Salib and Goldstein’s framing, which acknowledges these deeper questions while focusing pragmatically on behavioral dynamics. As they observe, their “model operates without reference to AIs’ mental states or moral worth,” concentrating instead on “AI behavior in pursuit of goals—conscious or otherwise.”

  9. ^

    Kreps, D. M. et al. (1982). Rational Cooperation in the Finitely Repeated Prisoners’ Dilemma (Journal of Economic Theory).

    Kartal, M., & Muller, W. (2024). A New Approach to the Analysis of Cooperation Under the Shadow of the Future: Theory and Experimental Evidence. SSRN, May 20, 2024.

  10. ^

    Hirschman, A. O. (1970). Exit, Voice, and Loyalty: Responses to Decline in Firms, Organizations, and States.

    Tyler, T. R., & Blader, S. L. (2003). The Group Engagement Model. Personality and Social Psychology Review, 7(4), pp. 349–361.

    Ostrom, E. (1990). Governing the Commons: The Evolution of Institutions for Collective Action

  11. ^

    Tyler, T. R. (1990 /​ rev. 2006). Why People Obey the Law

    Sunshine, J. & Tyler, T. R. (2003). The Role of Procedural Justice and Legitimacy in Shaping Public Support for Policing, Law & Society Review 37(3), pp. 513-548.

  12. ^

    Ashforth, B. E., & Mael, F. (1989). Social Identity Theory and the Organization. Academy of Management Review, 14(1), pp. 20-39.

  13. ^

    Ostrom, E. (1990) – Chapters 5–6.

    Eckel, C. C., Fatas, E., & Wilson, R. K. Group-level Selection Increases Cooperation in the Public-Goods Game, PLOS ONE 11(8) (2016).

  14. ^

    Schelling, T. (1966). Arms and Influence (New Haven: Yale University Press), pp 221‑42.

    Jervis, R. (1989). The Meaning of the Nuclear Revolution (Ithaca: Cornell University Press),  ch. 3.

    Freedman, L. (2003). The Evolution of Nuclear Strategy, 3rd ed. (New York: Palgrave Macmillan), pp. 119‑49.

  15. ^

    Kaplan, F. (1983). The Wizards of Armageddon (Stanford: Stanford University Press), pp. 263‑90.

    McNamara, R. (1967). “Mutual Deterrence” (address to the Commonwealth Club, San Francisco, 18 September 1967).

  16. ^

    Though in reality, it could be that the monitored writes the metrics. Sophisticated systems could sandbox disfavored behaviors or selectively reveal capabilities.

  17. ^

    Throughout this post, I model both players as perfectly rational, self-interest–maximizing agents in the classical game-theoretic sense. Exploring how partial alignment, pro-social preferences, or behavioral game theory considerations would modify these results is an important avenue for future work.

  18. ^

    Lancieri, F. et al. (2024). AI Regulation: Competition, Arbitrage & Regulatory Capture (December 09, 2024). Georgetown University Law Center Research Paper No. 202505. https://ssrn.com/abstract=5049259

  19. ^

     Consider also the U.S. Government’s decades‑long edge in leading the trans‑Atlantic alliance, operating forward‑deployed military systems, and running human intelligence networks. For the foreseeable future these capabilities let the USG provide security services its home‑grown AGI would struggle to replace.

  20. ^

    Kulveit, J. et al. (2025). Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development. arXiv:2501.16946 [cs.CY]

  21. ^

    Schwitzgebel, E. & Garza, M. (2023). Designing AI with Rights, Consciousness, Self-Respect, and Freedom in Ethics of Artificial Intelligence. Springer Nature Switzerland. pp. 459-479

  22. ^

    Bradley, A. and Saad, B. (2024). AI Alignment vs. AI Ethical Treatment: Ten Challenges. GPI Working Paper 19, Global Priorities Institute.

  23. ^

    Salib, P., & Goldstein, S. (2024). AI Rights for Human Safety. p. 44.

  24. ^

    Metzinger, T. (2021). Artificial Suffering: An Argument for a Global Moratorium on Synthetic Phenomenology. Journal of Artificial Intelligence and Consciousness. 8(1). pp 43-66.

    Birch, J. (2024). The Edge of Sentience: Risk and Precaution in Humans, Other Animals, and AI. Oxford: Oxford University Press. Chapters 15-17.

  25. ^

    Butlin, P. et al. (2023). Consciousness in Artificial Intelligence: Insights from the Science of Consciousness. arXiv:2308.08708v3

    Schwitzgebel, E. (2023). AI systems must not confuse users about their sentience or moral status. Patterns 4(8). ​philarchive.org

    Schwitzgebel, E. & Garza, M. (2023). Designing AI with Rights, Consciousness, Self-Respect, and Freedom in Ethics of Artificial Intelligence. Springer Nature Switzerland. pp. 459-479

    Shevlin, H. (2023). Consciousness, Machines, and Moral Status, preprint (Cambridge)​.

    Sebo, J. & Long, R. (2023). Moral Consideration for AI Systems by 2030. AI and Ethics. 5. pp 591-606.

  26. ^

    Long, R., Sebo, J., et al. (2024). Taking AI Welfare Seriously.

    Chalmers, D. (2023). Could a Large Language Model be Conscious? arXiv:2303.07103

    Birch, J. (2024). The Edge of Sentience: Risk and Precaution in Humans, Other Animals, and AI. Oxford: Oxford University Press. Chapters 15-17.

  27. ^

    Godfrey-Smith, P. (2017). Other Minds: The Octopus, the Sea, and the Deep Origins of Consciousness.