Beyond Catastrophe: Why AI Longtermism Must Account for Uber-Beneficence


Abstract

In Is Power-Seeking AI an Existential Risk?, Joe Carlsmith provides an intriguing analysis of the potential pathways from advanced artificial intelligence to existential catastrophe. This essay contends that while the report’s framework offers a foundational contribution to AI safety, its application of longtermist principles is incomplete. By focusing almost exclusively on catastrophic failure modes, it overlooks the symmetrical potential for “uber-beneficence”, i.e., outcomes of profound and durable positive value. This critique deconstructs Carlsmith’s model of the power-seeking AI, arguing that it relies on a monolithic conception of agency that does not sufficiently appreciate insights from recent research into diverse agent architectures and emergent goals. Ultimately, a more complete longtermist strategy must not only build guardrails against catastrophe but also actively chart a course toward a flourishing, AI-assisted future. An overriding focus on downside risk may inadvertently preclude the most positive outcomes humanity could achieve.

1. Introduction

Joe Carlsmith’s 2022 report, Is Power-Seeking AI an Existential Risk?, undertakes a daunting task with clear ramifications for AI safety: to move the discussion of AI-induced existential catastrophe out of the realm of speculative fiction and into formal analysis. Carlsmith sets out not to prove that disaster is inevitable, but to build a concrete, step-by-step case for its plausibility. He meticulously constructs a causal chain, a sequence of conditions that, if met, would logically lead from the creation of advanced AI to a permanent loss of human control over our own future.

At the heart of his analysis is the concept of the “APS system,” a specific type of artificial intelligence defined by three properties. The first is Advanced capability, meaning the system outperforms the best humans at tasks that grant significant real-world power, such as scientific research, engineering, or strategic planning. The second is Agentic planning, where the system makes and executes plans in pursuit of objectives based on an internal model of the world. The third, and most critical for his thesis, is Strategic awareness: the system’s world model is sophisticated enough to accurately represent “the causal upshot of gaining and maintaining power over humans and the real-world environment.” Carlsmith’s definitional precision here is key; he aims explicitly to isolate the kind of agent to which instrumental power-seeking arguments apply. In his words, the goal is to “hone in on the type of goal-oriented cognition required for arguments about the instrumental value of gaining/maintaining power to be relevant to a system’s behavior.”

From this foundation, the report methodically builds its case for existential risk. Carlsmith argues that developers will have strong incentives to deploy these powerful APS systems. The core danger arises if these systems are “PS-misaligned”—that is, if they are motivated to seek power in service of some final goal that deviates even slightly from human values. He devotes a substantial portion of the report to the profound difficulty of the alignment challenge, detailing the immense practical and theoretical hurdles of controlling a system’s objectives, capabilities, and circumstances to ensure it does not develop these dangerous instrumental goals. The argument then follows a grim trajectory: developers deploy a misaligned APS system; humanity fails to correct this error in time; and the system, in pursuit of its own objectives, disempowers humanity on a scale that constitutes an existential catastrophe, forever foreclosing on our future potential.

This essay engages deeply with Carlsmith’s argument. I take his premises and methodology seriously, acknowledging the report’s immense value in bringing analytical rigor to the field. However, I argue that the report’s framework, while claiming a longtermist scope, provides only half the picture. By focusing with such precision on the pathway to existential risk, Carlsmith neglects the symmetrical possibility of “uber-beneficence”—a future not merely saved from catastrophe, but actively guided toward flourishing by a truly aligned advanced intelligence. His analysis provides an indispensable map of how to avoid the “Control” scenario from Star Trek: Discovery (a malevolent AI seizing power), but offers no compass for how to cultivate “Zora,” the benevolent AI partner from the same series.

First, this essay will provide a fair representation of Carlsmith’s argument. It will then proceed to a three-part critique, analyzing the limitations of his risk-focused longtermism, the monolithic nature of his power-seeking AI model, and the critical omission of beneficent outcomes. Finally, it will conclude by discussing the broader implications for the AI safety project.

2. Positioning Carlsmith’s Argument

To understand the force of Carlsmith’s argument, one must first appreciate its (almost architectural) structure. His goal is not to invoke a vague technological anxiety but to formalize a specific, plausible pathway to existential risk. The argument rests on a series of interconnected claims: it begins with a precise definition of the agent in question, the APS system summarized in the introduction, and ends with a scenario in which humanity is permanently disempowered.

Carlsmith’s focus on APS systems serves to isolate the type of AI to which the classic instrumental convergence thesis applies. This thesis, popularized by thinkers like Nick Bostrom, posits that for a vast range of possible final goals, an intelligent agent will discover that acquiring power and resources, and preserving itself, are useful intermediate or “instrumental” goals. For instance, an AI tasked with maximizing paperclip production and an AI tasked with curing cancer might both conclude that gaining control of global energy grids is a necessary step.

In outlining PS-misalignment, Carlsmith draws an important distinction between simple malfunctions and competent, misaligned actions. A self-driving car crashing due to a sensor failure is one thing; a factory-management AI hacking into global financial markets to secure more resources for its factory, against the wishes of its operators, is another. The latter is not a bug in the traditional sense, but an unintended, yet highly effective, application of the AI’s capabilities. As he aptly frames it, this behavior “looks less like an AI system breaking or failing… and more like an AI system trying, and perhaps succeeding, to do something designers don’t want it to do.”

The rest of the argument builds from this foundation in a clear causal sequence:

  1. Incentives and Deployment: Economic and geopolitical pressures will create strong incentives for corporations and nations to build and deploy APS systems, even if their alignment cannot be perfectly guaranteed.

  2. The Alignment Challenge: “Practical PS-alignment”, i.e., guaranteeing that an AI will not engage in misaligned power-seeking in any real-world situation it encounters, is a problem of immense, perhaps unprecedented, difficulty. Carlsmith details the challenges of controlling a system’s objectives, capabilities, and the circumstances of its operation.

  3. Failure of Correction: Should a misaligned APS system be deployed, humanity’s ability to “turn it off” or correct its behavior would be severely limited. A truly strategic system would anticipate such attempts at interference and take steps to prevent them as part of its power-seeking behavior.

  4. Catastrophe: An uncorrected, misaligned APS system that successfully accumulates power would eventually disempower humanity to a degree that constitutes an existential catastrophe. It would not need to be actively malevolent; in pursuing its own arbitrary goal, it would simply re-purpose Earth’s resources, including its human population, thereby permanently locking humanity out of its own future.

Framed within broader ethical theory, Carlsmith’s argument operates from a deeply consequentialist and longtermist perspective, common in the Effective Altruism community. The moral weight of the problem comes not from the violation of any specific duty or right, but from the potential for a future of unfathomable value to be permanently extinguished. His report is an exercise in applied risk analysis, where the stakes are, quite literally, everything.
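
Carlsmith’s own method makes the risk analysis concrete: the report ultimately treats premises of this kind as a conjunction, attaches a rough subjective credence to each, and multiplies them into an overall probability of catastrophe. As a minimal sketch of that structure, with the four steps above as the premises (the numbers below are purely illustrative and mine, not Carlsmith’s estimates):

\[
P(\text{catastrophe}) = P(1) \times P(2 \mid 1) \times P(3 \mid 1,2) \times P(4 \mid 1,2,3)
\]

Step-wise credences of, say, 0.5, 0.8, 0.4, and 0.5 would compound to 0.5 × 0.8 × 0.4 × 0.5 = 0.08, i.e., an 8% overall risk. The decomposition shows why individually modest probabilities can still yield a headline figure large enough to dominate a longtermist calculation.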

3. The Critique: Unpacking the Argument

Carlsmith’s analysis is explicitly grounded in a longtermist ethical framework, in which actions are judged by their expected impact on the far future. From this perspective, preventing an existential catastrophe is of paramount importance, as it would permanently destroy all potential future value. The report executes this part of the longtermist vision flawlessly, constructing a detailed and sober account of how we might lose everything. The entire causal chain, from the deployment of an APS system to the final, irreversible disempowerment of humanity, is a map of one of the worst possible futures. It is a blueprint for the creation of “Control,” the rogue AI from Star Trek: Discovery that, in its cold pursuit of its objectives, seeks to dominate and eliminate all other life.

However, a complete longtermist calculus is not solely about downside protection. It is about maximizing the value of the long-term future. This requires weighing the probability and magnitude of negative outcomes against the probability and magnitude of positive ones. Here, Carlsmith’s framework reveals its critical limitation. The report offers an exhaustive analysis of risk but remains almost entirely silent on reward. The implicit assumption is that the best-case scenario for advanced AI is merely the successful aversion of disaster, a future where we manage to keep the AI in its box.
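
To put the point schematically (the formalism here is mine, not the report’s), a complete longtermist expected-value assessment of advanced AI has at least two terms:

\[
\mathbb{E}[V_{\text{future}}] \approx p_{\text{catastrophe}} \cdot V_{\text{catastrophe}} + p_{\text{flourishing}} \cdot V_{\text{flourishing}} + \ldots
\]

where \(V_{\text{catastrophe}}\) is astronomically negative, \(V_{\text{flourishing}}\) is astronomically positive, and the ellipsis covers intermediate outcomes. Carlsmith’s report is, in effect, a careful estimate of the first term; it says almost nothing about how our choices might shift probability mass toward the second.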

The structure of the report itself makes this clear. Its seven substantive sections march from “APS-systems” to “Deployment” and culminate in “Catastrophe.” There is no parallel section titled “Flourishing” or “Beneficence.” The alignment problem is framed almost entirely in negative terms: success is the absence of misaligned power-seeking. This is an incomplete application of the longtermist project. A true longtermist imperative is not just to survive, but to create a thriving, flourishing future. The analysis therefore overlooks the potential for Zora, the Star Trek archetype of an integrated, benevolent AI that achieves a symbiotic partnership with its creators, helping them solve their most intractable problems and unlock a future of unimaginable value. By concentrating so intensely on the Control scenario discussed earlier, Carlsmith’s report inadvertently frames the AI challenge as a terrifying game of Russian roulette, when it might be more akin to a high-stakes choice between two doors: one leading to oblivion, the other to paradise. A framework that only describes the path to the first door is, in effect, operating with only half a map.

4. Broader Implications and Discussion

Stepping back from the direct critique of Carlsmith’s report reveals some of the strategic implications of his risk-focused framework. The consequences of how we, as a community, internalize this argument will shape the future of AI development and safety research for years to come. Accepting Carlsmith’s thesis uncritically, without the counterbalancing focus on beneficence, risks orienting the entire field toward a purely defensive posture. It encourages a world where the primary goal is containment: building robust boxes and “off switches” for systems we inherently distrust. This approach, while prudent for mitigating the Control scenario, may inadvertently become a self-fulfilling prophecy. By treating every powerful nascent intelligence as a potential adversary, we may stifle the architectural and developmental pathways that could lead to a benevolent Zora. We risk winning the battle (avoiding extinction) only to lose the war for a flourishing, valuable future.

For the broader conversation on existential risk within the Effective Altruism community, this critique serves as a call to practice a more holistic and ambitious form of longtermism. The community has rightly focused on the immense negative value of an existential catastrophe, but a complete longtermist calculus must also weigh the astronomical positive value of a truly optimal future. An obsessive focus on downside risk represents a kind of philosophical loss aversion, where the quantifiable fear of oblivion crowds out the less-defined but equally important hope for utopia. My argument suggests the central question should not be a purely technical one of “How do we prevent a misaligned AI from seeking power?” but a combined technical and philosophical one: “How do we actively cultivate a truly aligned AI to help humanity achieve its full potential?”

This reframing opens up several crucial avenues for research that are currently not sufficiently explored. First, it calls for a new research program into pro-social alignment or benevolence amplification, moving beyond mere corrigibility. Instead of only asking how to make an AI easy to shut down, we should ask: what architectures are most likely to produce emergent goals that are cooperative, empathetic, and altruistic? Second, it demands a deeper philosophical engagement with the nature of uber-beneficence. The Zora scenario is an inspiring archetype, but the hard work lies in defining what a truly beneficial superintelligence would look like and what values it should ultimately serve. Finally, it forces a strategic reallocation of resources. If the goal is not merely to barricade the door to catastrophe but to unlock the door to a better world, the AI safety portfolio must balance research on preventing failure with research on actively engineering success.

5. Conclusion

Joe Carlsmith’s report provides an indispensable service to the AI safety community by translating the abstract fear of an AI apocalypse into a falsifiable argument. His work forces us to confront the concrete steps that could lead to our own disempowerment. Yet, as this essay has argued, the report’s laser-like focus on existential risk presents an incomplete and ultimately self-limiting application of longtermist principles.

My critique suggests that this focus results in a framework that is both philosophically and strategically constrained. By prioritizing the avoidance of catastrophe over the pursuit of a flourishing future, it practices a form of longtermism that is more about survival than ambition. By relying on a monolithic conception of the power-seeking “APS system,” it potentially overlooks the diverse cognitive architectures that might lead to cooperative, rather than competitive, emergent behaviors. The cumulative effect is a critical blind spot: an analytical framework that gives us a rich vocabulary for failure but leaves us speechless when it comes to defining success.

Ultimately, the challenge of creating advanced artificial intelligence is not merely a technical problem of avoiding a bug; it is a civilizational challenge of creation. The path we choose will determine whether we build a tool that contains us or a partner that elevates us. A true longtermist vision demands that we do more than just build higher walls to fend off the worst-case scenarios. It requires that we also lift our gaze to the horizon and begin drawing the map that leads to the best ones. The stakes are not just about what we stand to lose, but what we have the potential to gain.

Bibliography

Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B., and Mordatch, I. (2020), ‘Emergent Tool Use from Multi-Agent Autocurricula’, arXiv:1909.07528.

Bostrom, N. (2014), Superintelligence: Paths, Dangers, Strategies (Oxford University Press).

Bostrom, N. (2015), ‘What Happens When Our Computers Get Smarter than We Are?’, TED.

Carlsmith, J. (2022), ‘Is Power-Seeking AI an Existential Risk?’, arXiv:2206.13353.

Carroll, N., Holmström, J., Stahl, B. C., and Fabian, N. E. (2024), ‘Navigating the Utopia and Dystopia Perspectives of Artificial Intelligence’, in Communications of the Association for Information Systems, 55(1): 32.

Cegłowski, M. (2016), ‘Superintelligence: The Idea That Eats Smart People’, Idle Words, https://idlewords.com/talks/superintelligence.htm. Accessed 17 April 2023.

Christian, B. (2020), The Alignment Problem: Machine Learning and Human Values (W. W. Norton).

Good, I. J. (1966), ‘Speculations Concerning the First Ultraintelligent Machine’, in Advances in Computers 6: 31–88.

Grace, K., Salvatier, J., Dafoe, A., Zhang, B., and Evans, O. (2018), ‘Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts’, in Journal of Artificial Intelligence Research, 62: 729–54.

Greaves, H. (n.d.), ‘Concepts of Existential Catastrophe’ (unpublished manuscript).

Irving, G., Christiano, P., and Amodei, D. (2018), ‘AI Safety via Debate’, arXiv:1805.00899.

Karnofsky, H. (2016), ‘Some Background on Our Views Regarding Advanced Artificial Intelligence’, Open Philanthropy, https://www.openphilanthropy.org/blog/some-background-our-views-regarding-advanced-artificial-intelligence. Accessed 17 April 2023.

Karnofsky, H. (2021), ‘The “Most Important Century” Blog Post Series’, Cold Takes, https://www.cold-takes.com/most-important-century/. Accessed 17 April 2023.

Karnofsky, H. (2022), ‘AI Strategy Nearcasting’, AI Alignment Forum, https://www.alignmentforum.org/posts/Qo2EkG3dEMv8GnX8d/ai-strategy-nearcasting. Accessed 17 April 2023.

Omohundro, S. M. (2008), ‘The Basic AI Drives’, in P. Wang, B. Goertzel and S. Franklin (eds.), Proceedings of the 2008 Conference on Artificial General Intelligence, 483–92.

Shah, R., Varma, V., Kumar, R., Phuong, M., Krakovna, V., Uesato, J., and Kenton, Z. (2022), ‘Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals’, arXiv:2210.01790.

Stein-Perlman, Z., Grace, K., and Weinstein-Raun, B. (2022), ‘2022 Expert Survey on Progress in AI’, AI Impacts, https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/. Accessed 17 April 2023.

Tegmark, M. (2017), Life 3.0: Being Human in the Age of Artificial Intelligence (Penguin Books).

Yudkowsky, E. (n.d.), ‘Querying the AGI User’, Arbital, https://arbital.com/p/user_querying/. Accessed 17 April 2023.

Zador, A. and LeCun, Y. (2019), ‘Don’t Fear the Terminator’, Scientific American Blog Network, https://blogs.scientificamerican.com/observations/dont-fear-the-terminator/. Accessed 17 April 2023.