Future Matters #6: FTX collapse, value lock-in, and counterarguments to AI x-risk
[T]he sun with all the planets will in time grow too cold for life, unless indeed some great body dashes into the sun and thus gives it fresh life. Believing as I do that man in the distant future will be a far more perfect creature than he now is, it is an intolerable thought that he and all other sentient beings are doomed to complete annihilation after such long-continued slow progress.
— Charles Darwin
Future Matters is a newsletter about longtermism and existential risk by Matthew van der Merwe and Pablo Stafforini. Each month we curate and summarize relevant research and news from the community, and feature a conversation with a prominent researcher. You can also subscribe on Substack, listen on your favorite podcast platform and follow on Twitter. Future Matters is also available in Spanish.
A message to our readers
Welcome back to Future Matters. We took a break during the autumn, but will now be returning to our previous monthly schedule. Future Matters would like to wish all our readers a happy new year!
The most significant development during our hiatus was the collapse of FTX and the fall of Sam Bankman-Fried, until then one of the largest and most prominent supporters of longtermist causes. We were shocked and saddened by these revelations, and appalled by the allegations and admissions of fraud, deceit, and misappropriation of customer funds. As others have stated, fraud in the service of effective altruism is unacceptable, and we condemn these actions unequivocally and support authorities’ efforts to investigate and prosecute any crimes that may have been committed.
Research
Artificial general intelligence and lock-in [🔉], by Lukas Finnveden, C. Jess Riedel and Carl Shulman, considers AGI as enabling, for the first time in history, the creation of long-lived and highly-stable institutions with the capacity to pursue a variety of well-specified goals. The authors argue that, if a significant fraction of the world’s powers agreed to establish and empower them to defend against external threats, such institutions could arise and subsist for millions, or even trillions, of years. We found this report to be one of the most important contributions to longtermist macrostrategy published in recent years, and we hope to feature a conversation with one of the authors in a future issue of the newsletter.
A classic argument for existential risk from superintelligent AI goes something like this: (1) superintelligent AIs will be goal-directed; (2) goal-directed superintelligent AIs will likely pursue outcomes that we regard as extremely bad; therefore (3) if we build superintelligent AIs, the future will likely be extremely bad. Katja Grace’s Counterarguments to the basic AI x-risk case [🔉] identifies a number of weak points in each of the premises in the argument. We refer interested readers to our conversation with Katja below for more discussion of this post, as well as to Erik Jenner and Johannes Treutlein’s Responses to Katja Grace’s AI x-risk counterarguments [🔉].
The key driver of AI risk is that we are rapidly developing more and more powerful AI systems, while making relatively little progress in ensuring they are safe. Katja Grace’s Let’s think about slowing down AI [🔉] argues that the AI risk community should consider advocating for slowing down AI progress. She rebuts some of the objections commonly levelled against this strategy: e.g. to the charge of infeasibility, she points out that many technologies (human gene editing, nuclear energy) have been halted or drastically curtailed due to ethical and/or safety concerns. In the comments, Carl Shulman argues that there is not currently enough buy-in from governments or the public to take more modest safety and governance interventions, so it doesn’t seem wise to advocate for such a dramatic and costly policy: “It’s like climate activists in 1950 responding to difficulties passing funds for renewable energy R&D or a carbon tax by proposing that the sale of automobiles be banned immediately. It took a lot of scientific data, solidification of scientific consensus, and communication/movement-building over time to get current measures on climate change.”
We enjoyed Kelsey Piper’s review of What We Owe the Future [🔉], not necessarily because we agree with her criticisms, but because we thought the review managed to identify, and articulate very clearly, what we take to be the main crux between the longtermist EAs who liked the book and those who, like Piper, had major reservations about it: “most longtermists working in AI safety are worried about scenarios where humans fail to impart the goals they want to the systems they create. But MacAskill think it’s substantially more likely that we’ll end up in a situation where we know how to set AI goals, and set them based on parochial 21st century values—which makes it utterly crucial that we improve our values so that the future we build upon them isn’t dystopian.”
In How bad could a war get? [🔉], Stephen Clare and Rani Martin ask what the track record of conflict deaths tells us about the likelihood of wars severe enough to threaten human extinction. They conclude that history provides no strong reason to rule out such wars, particularly given the recent arrival of technologies with unprecedented destructive potential, e.g. nuclear weapons and bioweapons.
Catastrophes that fall short of human extinction could nonetheless reduce humanity to a pre-industrial or pre-agricultural state. In What is the likelihood that civilizational collapse would cause technological stagnation? [🔉], Luisa Rodriguez asks whether we would ever recover from this sort of setback. The past provides some comfort: in humanity’s short history, agriculture was independently invented numerous times, and the industrial revolution followed just 10,000 years later. Given this, it would be surprising if it was extremely hard to do it all over again. Moreover, a post-collapse humanity would likely have materials and knowledge left over from industrial civilisation, placing them at an advantage vs our hunter-gatherer ancestors. On the other hand, certain catastrophes could make things more difficult, such as extreme and long-lasting environmental damage. All-things-considered, Rodriguez thinks that humanity has at least a 97% chance of recovering from collapse.
In The Precipice, Toby Ord proposed a grand strategy of human development involving the attainment of existential security (a stable state of negligible existential risk), followed initially by a “long reflection” and ultimately by the full realization of human potential. Ord’s recent contribution to the UN Human Development Report [🔉] focuses on the first of these three stages and considers specifically the institutions needed for existential security. Ord’s answer is that what is needed are international institutions with outstanding forecasting expertise, strong coordinating ability, and a great deal of buy-in.[1]
In A lunar backup record of humanity, Caroline Ezell, Alexandre Lazarian and Abraham Loeb offer an intriguing proposal for helping humanity recover from catastrophes. As part of the first lunar settlements, we should build a data storage infrastructure on the moon to keep a continuously updated backup of important materials, e.g. books, articles, genetic information, and satellite imagery. They suggest this would improve the chances that lunar settlements could rebuild civilisation in the event of a terrestrial catastrophe, and that it could work with current technologies for data storage and transmission.
Max Tegmark explains Why [he] thinks there’s a one-in-six chance of an imminent global nuclear war [🔉], in light of the Russia-Ukraine War. Note that this post was published on 8th October, so the author’s current views might differ. Tegmark provides a simple probabilistic model of how the war might play out. He puts ~30% on Russia launching a nuclear strike on Ukraine, 80% on this resulting in a non-nuclear military response from NATO, and 70% on this being followed by rapid escalation leading to all-out nuclear war, for an overall probability of ~17%. See also Samotsvety’s Nuclear risk update [🔉] (from 3rd October): they place ~16% on Russia using a nuclear weapon in the next year, and ~10% on nuclear conflict scaling beyond Ukraine in the subsequent year, resulting in a ~1.6% probability of global nuclear conflict. We applaud Tegmark and Samotsvety for making clear, quantitative forecasts on this topic.
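For readers who want to check the arithmetic, here is a minimal sketch of how both headline figures follow from the step-by-step estimates quoted above. It assumes each number is the probability of a step conditional on the previous step having occurred, so that the steps simply multiply. The function name `chain_probability` is ours, chosen for illustration, and does not come from either forecast.

```python
import math


def chain_probability(*steps: float) -> float:
    """Multiply conditional probabilities along a single escalation path.

    Assumption: each value is the probability of that step given that
    all earlier steps have already happened.
    """
    return math.prod(steps)


# Tegmark's approximate figures: nuclear strike on Ukraine, then a
# non-nuclear NATO response, then escalation to all-out nuclear war.
tegmark = chain_probability(0.30, 0.80, 0.70)

# Samotsvety's approximate figures: Russian nuclear use in the next
# year, then the conflict scaling beyond Ukraine.
samotsvety = chain_probability(0.16, 0.10)

print(f"Tegmark: {tegmark:.1%}")        # 16.8%, roughly one in six
print(f"Samotsvety: {samotsvety:.1%}")  # 1.6%
```

Since the inputs are themselves rough (~30%, ~16%, ~10%), the outputs should be read as order-of-magnitude estimates. The roughly tenfold gap between the two forecasts comes mostly from how likely each forecaster thinks escalation is after a first nuclear use.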
Marius Hobbhahn’s The next decades might be wild [🔉] is a speculative account of how the next few decades could play out if we are just 10–20 years away from transformative AI. Stefan Schubert responds, taking issue with Hobbhahn’s expectation of a muted reaction from the public and a wholly ineffective response from governments as AI systems start to run amok.
Previously, Holden Karnofsky argued that, if advanced AI systems aimed at destroying or disempowering humanity, they might succeed (see FM#3). In Why would AI “aim” to defeat humanity? [🔉], Karnofsky explains why such systems will, by default, likely adopt such an aim. From the assumptions that (1) we will soon develop powerful AI (2) in a world that is otherwise similar to today’s, (3) with techniques broadly similar to those currently being used (4) that push AI systems to be ever more capable and (5) with no specific countermeasures to prevent such systems from causing an existential catastrophe, Karnofsky argues that we should expect the AI systems that emerge to behave as if they had aims; that, due to the nature of the training process, some of these aims will likely be aims humans did not intend; and that, because of that, AI systems will likely also have the intermediate aim of deceiving and ultimately disempowering humanity.[2]
High-level hopes for AI alignment [🔉] outlines three broad approaches to AI alignment that Karnofsky finds promising. ‘Digital neuroscience’ aims to develop something like lie-detection or mind-reading techniques to inspect the motivations of AI systems. ‘Limited AI’ aims to develop systems that don’t engage in the sort of long-term general planning that seems particularly worrying. Lastly, ‘AI checks and balances’ aims to develop arrangements in which AI systems supervise each other. Karnofsky concludes that the success of any of these approaches depends to a large extent on us having enough time to develop and test them before AI systems become extremely powerful.
AI safety seems hard to measure [🔉], in turn, argues that it is hard to know whether AI safety research is actually making AI systems safer. Karnofsky offers four reasons for this conclusion. First, one cannot easily tell the difference between behaving well and merely appearing to do so. Second, it is difficult to infer how an agent will behave once they have power over you, based on how they have behaved so far, before acquiring such power. Third, current systems are not yet sophisticated enough to display the advanced cognitive abilities, such as the ability to deceive and manipulate, that we want to study. Fourth, systems expected to be vastly more capable than humans will be creatures of a very alien sort, and we just have no idea of how to prepare for our first encounter with them.
Finally, in Racing through a minefield [🔉] Karnofsky outlines a broader problem than the AI alignment problem, which he calls the AI deployment problem. This is the problem confronted by any agent who can potentially develop transformative AI and faces a tradeoff between moving fast, and so risking developing unsafe AI, and moving slowly, and so risking that unsafe AI is developed by less cautious, faster-moving agents. Karnofsky likens this to a race through a minefield where each agent has an incentive to beat the others but where moving quickly endangers all agents, and offers some possible measures with the potential to make the problem less severe. Continuing with the analogy, these include charting a safe path through the minefield (alignment research), alerting others about the mines (threat assessment), moving more cautiously through the minefield (avoiding races), and preventing others from stepping on mines (global monitoring and policing).
In AI will change the world, but won’t take it over by playing “3-dimensional chess”, Boaz Barak and Ben Edelman question the standard argument for the conclusion that power-seeking AI could cause an existential catastrophe. Briefly, the authors argue that the relevant conflict is not “humans vs. AI”, as the argument assumes, but rather “humans aided by AIs with short-term goals vs. AIs with long-term goals”. Since AIs will have a much more decisive advantage over humans in short-term than in long-term planning ability, whether humanity will lose control over its future is much less clear than generally believed by the alignment community. Furthermore, to reduce the chance of catastrophe, the authors hold that we should focus less on general AI alignment and more on differential AI capabilities research, specifically on developing AI systems with short rather than long time horizons.
Our World in Data’s new page on artificial intelligence features five separate articles about various aspects of AI. Artificial intelligence is transforming our world [🔉] attempts to answer three questions: Why is it hard to take the prospect of a world transformed by AI seriously? How can we imagine such a world? And what is at stake as this technology becomes more powerful? The brief history of artificial intelligence [🔉] takes a look at how the field of AI has evolved in the past in order to inform our expectations about its future. Artificial intelligence has advanced despite having few resources dedicated to its development focuses on various metrics indicative of the growth of AI as a field over the past decade. AI timelines summarizes various attempts to forecast the arrival of human-level AI, including surveys of machine learning researchers, predictions by Metaculus forecasters, and Ajeya Cotra’s biological anchors report. Finally, Technology over the long run tries to give an intuitive sense of how different the future may look from the present by looking back at how rapidly technology has changed our world in the past.
Dan Luu’s Futurist prediction methods and accuracy [🔉] examines resolved long-range forecasts by a dozen or so prominent predictors and relies on this examination to identify forecasting techniques predictive of forecasting performance. Luu finds that the best forecasters tend to have a strong technical understanding of the relevant domains and a record of learning lessons from past predictive errors, while the worst forecasters tend to be overly certain about their methods and to make forecasts motivated by what he calls “panacea thinking”, or the belief that a single development or intervention—such as powerful computers or population control—can solve all of humanity’s problems.
Clarifying AI x-risk [🔉], by Zac Kenton and others from DeepMind’s AGI safety team, explores the different AI threat models, i.e. pathways by which misaligned AI could result in an existential catastrophe. They identify and categorize a number of models in the literature, finding broad agreement between researchers, and set out their team’s own threat model: AGI is developed via scaling up foundation models, fine-tuned by RL from human feedback (RLHF); in the course of training, a misaligned and power-seeking agent emerges and conceals its misalignment from developers; key decision-makers will fail to understand the risk and respond appropriately; interpretability will be hard. They note that among existing threat models, theirs seems closest to that of Ajeya Cotra (see our summary in FM#4, and our conversation with Ajeya in FM#5). The authors’ literature review, with summaries of each threat model, was published separately here.
Peter Wyg’s A theologian’s response to anthropogenic existential risk offers an argument for the importance of existential risk reduction and concern for future generations grounded in Christian thought. Wyg points out, for example, that “If human history is just the beginning … then God could well bestow countless future graces: saints will be raised up, sinners will be forgiven, theologians will explore new depths, the faithful will experience new heights of spiritual experience.” We are always keen to better understand how different worldviews think about existential risk and the future, so we found this a valuable read.
Hayden Wilkinson’s The unexpected value of the future argues that on various plausible modeling approaches the expected value of humanity’s future is undefined. However, Wilkinson does not conclude from this result that the case for longtermism is undermined. Instead, he defends an extension of expected value theory capable of handling expectation-defying prospects without abandoning risk-neutrality.
In a previous issue of this newsletter, we noted that Scott Aaronson had joined OpenAI to work on AI safety (see FM#3). Now halfway through this project, Aaronson has given a lecture sharing his thoughts on activities so far. In the lecture, Aaronson covers his views about the current state of AI scaling; identifies eight different approaches to AI safety; and discusses the three specific projects that he has been working on. These projects are (1) statistically watermarking the outputs of large language models (so that the model’s involvement in the generation of long enough strings of text can’t be concealed); (2) inserting cryptographic backdoors in AI systems (allowing for an “off switch” that the AI can’t disable); and (3) developing a theory of learning in dangerous environments.
Longtermism in an infinite world, by Christian Tarsney and Hayden Wilkinson, considers how the possibility of a universe infinite in potential value affects the risk-neutral, totalist case for longtermism.[3] The authors’ conclusions may be summarized as follows: (1) risk-neutral totalism can be plausibly extended to deal adequately with infinite contexts when agents can affect only finitely many locations; (2) however, we should have a credence higher than zero in hypotheses about the physical world on which individual agents can affect infinitely many locations; (3) if plausible extensions of risk-neutral totalism can also rank such prospects, the case for longtermism would likely be vindicated; (4) by contrast, the case for longtermism would be undermined if instead such extensions imply widespread incomparability.[4]
One-line summaries
Alasdair Phillips-Robins’s Catastrophic risk, uncertainty, and agency analysis proposes some changes to the governance of federal policymaking.
Jan Leike’s Why I’m optimistic about our alignment approach [🔉] offers some arguments in favor of OpenAI’s approach to alignment research and responses to common objections.
Jaime Sevilla, Anson Ho & Lennart Heim collect some AI forecasting research ideas [🔉].
David Roodman’s Comments on Ajeya Cotra’s draft report on AI timelines offers a critical review of Cotra’s biological anchors model.
Anders Sandberg on Cyborgs v ‘holdout humans’ [🔉] speculates on what might happen if the human species survives for a million years.
Eric Martinez and Christoph Winter’s Cross-cultural perceptions of rights for future generations finds robust popular support for increasing legal protections for future generations from respondents from across six continents.
The Global Priorities Institute released new summaries of “The paralysis argument” [🔉] by Will MacAskill and Andreas Mogensen, and “Do not go gentle: why the Asymmetry does not support anti-natalism” [🔉] by Mogensen.
Longtermist political philosophy: an agenda for future research [🔉] by Jacob Barrett and Andreas Schmidt is GPI’s attempt to set out longtermist political philosophy as an academic research field.
Siméon Campos’s AGI timelines in governance [🔉] lists some likely differences between worlds in which AGI is developed before and after 2030, and discusses how those differences should affect approaches to AGI governance.
Tobias Baumann’s Avoiding the Worst: How to Prevent a Moral Catastrophe [🔉] is a book-length introduction to suffering risks.
In Investing in pandemic prevention is essential to defend against future outbreaks [🔉], Bridget Williams and Will MacAskill argue that investments in pandemic preparedness are surprisingly low given the health and economic costs of the COVID-19 pandemic, and identify four promising areas for government funding: vaccines for prototype pathogens, disease surveillance using metagenomic sequencing, clean indoor-air technology, and better personal protective equipment.
How the Patient Philanthropy and Global Catastrophic Risks Funds work together [🔉], by Christian Ruhl & Tom Barnes, explains the differences and complementarities between those two funds managed by Founders Pledge.
In The socialist case for longtermism [🔉], Garrison Lovely contends that longtermism may be regarded as an extension of the socialist concern for the masses of working people, one that widens this circle of compassion to an even larger group of moral patients: those yet to be born.
In AI experts are increasingly afraid of what they’re creating [🔉], Kelsey Piper explains how the risks posed by AI are becoming harder to ignore as AI systems become increasingly capable and general.
Steve Byrnes’s What does it take to defend the world against out-of-control AGIs? [🔉] argues that aligned AGI would not protect humanity fully from risks posed by misaligned and power-seeking AIs, as is often assumed.
Nate Soares’s Warning shots probably wouldn’t change the picture much [🔉] draws that conclusion from observing the failure of people concerned with biorisk to get gain-of-function research banned in the wake of the COVID-19 pandemic.
In Parfit + Singer + aliens = ? [🔉], Maxwell Tabarrok argues that expanding the circle of moral concern to include both nonhuman and future sentients makes the value of existential risk reduction highly sensitive to one’s credence in the existence of sentient life elsewhere in the universe.
Richard Fisher’s Eucatastrophe [🔉] discusses J. R. R. Tolkien’s proposed neologism to describe the concept of a “positive catastrophe”—an idea for which there appears to be no English word.
In AI alignment is distinct from its near-term applications [🔉], Paul Christiano worries that applying alignment techniques to train extremely inoffensive systems could undermine support for AI alignment research.
The Economist’s Should we care about people who need never exist? is perhaps the most detailed and rigorous discussion of population ethics ever to appear in a mainstream publication.
To mark the milestone in human population, Bryan Walsh asks: Are 8 billion people too many—or too few? [🔉]
John Bliss’s Existential advocacy examines the strategies being pursued by legal advocates working on mitigating existential risks and safeguarding humanity.
The Global Challenges Foundation has published their annual report on risks threatening humanity—Global Catastrophic Risks 2022: A year of colliding consequences.
Séb Krier’s AI from superintelligence to ChatGPT [🔉] recounts the story of how AI systems became so capable and describes current efforts to make them safer.
In a Twitter thread, Will MacAskill lists two reasons why he rejects the objection that longtermism is just an excuse for neglecting the important problems of today’s world: that the interventions longtermists typically favor also benefit people alive today, and that prioritizing actions that seek to benefit future people has a reassuring historical track record.
Ben Cottier’s Understanding the diffusion of large language models [🔉] uses GPT-3 as a case study on the time and resources required for state-of-the-art AI breakthroughs to be replicated by other groups.
Tristan Cook and Guillaume Corlouer’s The optimal timing of spending on AGI safety work develops a quantitative model for allocating funding to AI safety over time.
Sam Clarke and Di Cooke draw some lessons for AI governance from early electricity regulation [🔉].
Hamish Hobbs, Jonas Sandbrink, and Allan Dafoe’s Differential technology development summarizes a preprint [🔉] on this approach to reducing risks from emerging technologies.
A “sequence” of posts by Jesse Clifton, Samuel Martin and Anthony DiGiovanni considers the conditions that make technical work on AGI conflict reduction effective, the circumstances under which these conditions hold, and some promising directions for research to prevent AGI conflicts.
In The Intercept, Mara Hvistendahl’s Experimenting with disaster reports on a number of shocking and previously undisclosed accidents in US biolabs working with dangerous pathogens.
Janne M. Korhonen’s Sheltering humanity against x-risk summarizes the takeaways from a recent meeting to discuss whether extremely resilient bunkers could offer humanity some protection against some existential risks.
Kevin Esvelt’s Delay, detect, defend develops a framework for handling risks from biotechnology involving three distinct strategies: delay via deterrence, information denial, and physical denial; detection via reliable and sensitive untargeted sequencing; and defence via pandemic-proof protective equipment, resilient production and supply chains, diagnostics and personalized early warning, and germicidal far-UVC light.
Toby Ord’s Lessons from the development of the atomic bomb considers the Manhattan Project[5] as an instructive case study in the creation of a transformative technology.
The Blue Marble, taken by the Apollo 17 crew fifty years ago (restored by Toby Ord)
News
Asterisk, a quarterly EA journal, published its first issue. Highlights include an interview with Kevin Esvelt about preventing the next pandemic; an essay about the logic of nuclear escalation by Fred Kaplan; and a review of What We Owe the Future [🔉] by Kelsey Piper (summarized above).
Future Perfect published a series celebrating “the scientists, thinkers, scholars, writers, and activists building a more perfect future”.
For a few months, the Global Priorities Institute has been releasing summaries of some of their papers. If these are still too long for you, you can now read Jack Malde’s “mini-summaries” [🔉].
Nonlinear is offering [🔉] $500 prizes for posts that expand on Holden Karnofsky’s Most Important Century series.
The Future of Life Institute Podcast has relaunched with Gus Docker as the show’s new host. We were fans of Docker’s previous podcast, and have been impressed with the interviews published so far, especially the conversations with Robin Hanson on Grabby Aliens, with Ajeya Cotra on forecasting transformative AI, and with Anders Sandberg on ChatGPT and on Grand Futures.
The Nuclear Threat Initiative (NTI) has launched the International Biosecurity and Biosafety Initiative for Science (IBBIS), a program led by Jaime Yassif that seeks to reduce emerging biological risks.
The Centre on Long-Term Risk is raising funds [🔉] to support their work on s-risks, cooperative AI, acausal trade, and general longtermism. Donate here.
Will MacAskill appeared on The Daily Show discussing effective altruism and What We Owe The Future. He was also interviewed [🔉] by Jacob Stern for The Atlantic.
The Forecasting Research Institute, a new organization focused on advancing the science of forecasting for the public good, has just launched. FRI is also hiring for several roles.
Ben Snodin and Marie Buhl have compiled a list of resources relevant for nanotechnology strategy research.
Robert Wiblin interviewed [🔉] Richard Ngo on large language models for the 80,000 Hours Podcast.
Spencer Greenberg released an excellent episode on the FTX catastrophe for the Clearer Thinking podcast.
In a press release, FTX announced a “process for voluntary return of avoidable payments”. See this Effective Altruism Forum post by Molly Kovite from Open Philanthropy for context and clarifications.
Giving What We Can announced the results of the Longtermist Fund’s first-ever grantmaking round.
80,000 Hours published several exploratory profiles in the ‘Sometimes recommended’ category: S-risks [🔉], Whole brain emulation [🔉], Risks from malevolent actors [🔉], and Risks of stable totalitarianism [🔉].
The Survival and Flourishing Fund has opened its next application round. SFF estimates that they will distribute around $10 million this round. Applications are due on January 30. Apply now.
The Global Priorities Institute welcomes applications for Predoctoral Research Fellows in Economics. Apply now.
The Rational Animations YouTube channel has released videos on how to take over the universe in three easy steps and on whether a single alien message could destroy humanity.
The Centre for Long-Term Resilience published a response to the UK government’s new National Resilience Framework [🔉].
The Space Futures Initiative launched in September. They are seeking expressions of interest from potential organizations and individuals interested in collaborating.
Conversation with Katja Grace
Katja Grace is the founder and lead researcher at AI Impacts. Her work focuses on forecasting the likely impacts of human-level AI. Katja blogs at World spirit sock puppet.
Future Matters: You recently published this piece, Counterarguments to the Basic AI X-risk Case. Could you walk through what you see as the basic case for existential risk from AI?
Katja Grace: I will try to get it similar to what I wrote down. There’s very likely to be human-level AI at some point, pretty plausibly soon, though we probably don’t need that. AI that’s able to do anything humans can do, basically. Some of it is very likely to be agentic, in the sense of having goals and pursuing them. It may not be perfectly goal-directed, but it will probably be at least as goal-directed as a human. And the goals that AI is likely to have, if it has goals, are likely to be bad for reasons of alignment being hard, etc. Now at the levels of competence we’re talking about, if there are creatures with bad goals around, they probably destroy the world somehow, either immediately in some kind of intelligence explosion catastrophe or more gradually over time, taking power away from the humans.
Future Matters: You point out some weak points in the argument, and we wonder if any of them were particularly important to your own assessment of the case for AI risk—parts where a decisive argument either way might have really influenced your views.
Katja Grace: I don’t think I’ve heard a decisive argument on any of these points. But I think one that feels pretty important to me is, if you try to get close to human values, how close do you get? It seems plausible to me that you get relatively close compared, for example, to the gaps between different humans. And so I think a fairly plausible seeming future is one where there’s a great number of AIs doing all sorts of things. It looks a little bit different from what humans would want, but it’s hard to really say that was decisively different from human values in this direction. And it’s not more immediately deadly than any of the things that humans are usually trying to do. So things would look broadly similar, but faster.
Future Matters: This relates to the other aspect of your post we wanted to ask about, which was a specific point you raised about corporations. You suggest that the basic argument for AI risk proves too much, because a similar argument could be run about corporations — they are goal-directed entities with superhuman capabilities, and their goals are poorly aligned with what humans ultimately want. But in fact corporations don’t generally seem to pose a profound threat to humanity, so there must be something wrong with the argument. Could you say a little more about this objection?
Katja Grace: It seems to me like a compelling argument that the case as laid out for AI risk is not sufficient. That doesn’t mean it is a compelling argument to the opposite conclusion, that bad things won’t happen. But the case as laid out isn’t sufficient: it probably needs something more quantitative in it. It seems to me that you can be in a situation surrounded by superhuman misaligned agents without especially bad things happening, because we are already in that situation with corporations. And so the question is, how powerful are these new agents who are trying to do things you don’t want? I do find it plausible that AI will cause there to be even more powerful agents with misaligned goals, even to the extent that it does pose an existential risk. But I guess it would be nice to just acknowledge that the argument as it stands does not imply a hundred percent risk—which seems like a controversial question, around here at least. And it’d also be nice to try to actually figure out quantitatively how bad the risk is.
Future Matters: How do you think people in the AI safety community are thinking about this basic case wrong? It sounds like you think maybe people place a lot of weight on arguments of this style.
Katja Grace: I’m actually not quite sure what people place a lot of weight on. I think if you write out an argument in this style and think about it, it’s hard to become really confident that you’re doomed. I’m pretty good at being unconfident, I guess. But I think the intuitive case that people have in mind is something like this.
I guess if I am to speculate about what errors other people are making, my sense is that in general, on a lot of things, people in the local rationalist sphere expect a thing to be either infinitely intense or zero. For instance, if there’s a feedback loop involving AI capabilities, then people will tend to think it’s one that takes seconds to take over the world or something. Or if there’s a reason for a thing to become more agentic, it probably becomes arbitrarily more agentic quite fast. But I think you could say similar things about the current world. There’s a feedback loop in technology leading to more technology. The same incentives exist to be more agentic, and perhaps humans do become more agentic, but they do it so slowly that it’s arguably not one of the main things going on in the world.
Future Matters: So if these arguments were supplemented with clearer and more quantitative claims, e.g. about the gap between machine and human intelligence, does that rescue some of the plausibility?
Katja Grace: I feel that the argument needs to be more quantitative in almost every part of it. We also need some measure of how strong the pressures are for these systems to be agentic. All this seems kind of hazy, so it’s hard to know how to say something quantitative about it. And it is not clear in the least how to draw relevant consequences from, say, strong economic incentives, even if they’re quantified.
I think one interesting way of thinking about it, which I haven’t really heard discussed but which seems sort of promising to me, is the following. It seems like we’re going to have a whole pile of new cognitive labor. As a fraction of that, how much of it will be directed at pursuing goals? And as a fraction of that, how much of it will be directed at pursuing goals that humans would prefer to avoid? I imagine a whole lot of it is going to just be directed at things that we’re already doing and are broadly in favor of, and also not necessarily very agentic.
Future Matters: What do you make of the fact that there is this level of disagreement about basic parts of the intuitive case for AI risk? Do you think we can read much into that? Does it undermine the case itself, or does it reflect poorly on the community of people thinking about this? Or is it just what we should expect when we’re thinking about really difficult, unprecedented things?
Katja Grace: I do tentatively think it undermines the case and probably reflects poorly on the community relative to an ideal community. But these are also very hard questions, and I don’t know if another community would do a lot better. My point is that the intuitive case should be more carefully analyzed and tested. I agree there has been some of that recently. For instance, Joseph Carlsmith wrote out the case much more elaborately than I have, and that seems great. Still, I’m sort of confused about why that didn’t happen much earlier, since this is a community of people who love reason and explicitness. So to me, trying to write down the case carefully feels like an extremely natural thing to do, whether or not it’s a good idea.
Future Matters: You’ve been working on AI risk stuff since before most people had heard of it, and many things have changed since then. We are curious if you could describe how your views on AI risk have changed over this period.
Katja Grace: In some sense, my views have changed embarrassingly little. I came into this community in 2009, and back then my position was one of great uncertainty. The arguments I’m thinking about now are relatively similar to the ones I was thinking about then, but in the meantime the actual empirical situation has changed quite a lot. It does seem more viscerally scary, and things are happening quicker than I thought they would when I showed up. I probably feel more worried about it.
Future Matters: That’s unfortunate to hear. Could we ask you to quantify your credence in an existential catastrophe from AI?
Katja Grace: Yes. My sort of cached answer for the probability of doom from AI at all is 7%, but that was from thinking through all these arguments and putting numbers on different bits and combining them. I guess in general I’ve been pretty keen on that sort of thing. Then I recently learned that I’m actually pretty good at forecasting in terms of just making up numbers and then being correct. So now I’m starting to think that I should really do that more instead of this elaborate spreadsheet-related thing.
Future Matters: When you reached that number, did it surprise you? Was it in tension with the number that you had reached intuitively?
Katja Grace: I think I hadn’t already reached a number intuitively. I think my feelings say it should be more scary, but “my feelings” probably isn’t quite the same as the mental faculty I actually successfully use for making predictions. If I tried to do what I’d usually do for making predictions, in a “making up numbers” kind of way, I think I’d get something kind of similar, close to 7%, maybe 10%… But I’m worried that I just got anchored. And this question feels a bit different from other things I make up numbers about (like, “will there be 1M covid cases in China by some date?”) because there are social pressures to have certain beliefs about it, and high stakes. So I trust myself less to be well-calibrated.
Future Matters: Thanks, Katja!
We thank Leonardo Picón and Lyl Macalalad for editorial assistance.
- See also Kelsey Piper’s Future Perfect coverage of Ord’s report [🔉].
- The argument should be familiar to readers exposed to the standard arguments for AI risk. But even these readers may learn from this article, which makes its assumptions and conclusions unusually explicit.
- By risk-neutral totalism, the authors mean an axiology defined by the conjunction of additivity (the value of an outcome is a weighted sum of its value locations), impartiality (all locations have the same weight in the sum), and risk-neutrality (the value of a risky option is equal to the expected value of its outcome). This axiology supplies, in the authors’ opinion, the most straightforward argument for longtermism.
- This summary closely follows p. 23 of the paper.
- By ‘Manhattan Project’, we mean the period of 6.5 years ranging from the discovery of fission to the delivery of a working bomb, rather than the last three years of this period during which the US government became actively involved (see p. 13 of Ord’s report).
Thanks for your work on this! I find this newsletter useful, and also appreciate you making the Future Matters Reader feed for audio versions of many of the writings covered in the newsletter. It seems like a lot of those writings weren’t on Nonlinear’s podcast feeds, either due to not being on the EA Forum / LessWrong / Alignment Forum or for some other reasons, so this seems useful and I’ve now downloaded a bunch of things in your feed.
(I’m leaving this comment partly to make other readers aware of this.)
That’s great to hear.
In the future, we would like the Future Matters Reader feed to include human narrations when available (such as the narrations by Type 3 Audio). Unfortunately, our current podcast host doesn’t support reposting episodes from other podcasts (à la The Valmy). But we may switch to a different podcast host if we can do so easily.