Red-teaming existential risk from AI

Are we too willing to accept forecasts from experts on the probability of humanity’s demise at the hands of artificial intelligence? What degree of individual liberty should we curtail in the name of AI risk mitigation? I argue that focusing on AI’s existential risk distracts from real negative externalities that we can observe today and that we ought to dismiss long-range forecasts with low confidence scores.

Examining existential risk scenarios

The physicist Richard Feynman put it best, “The first principle is that you must not fool yourself, and you are the easiest person to fool,” in other words, claiming N number of experts, all subject to the same inherent cognitive biases we all suffer, espouse a specific belief is a poor substitute for rigorous, evidence-based decision-making. Yet this is precisely where the debate on existential risk from AI seems to hinge.

Eliezer Yudkowsky’s argument in his Time magazine oped reads:

Many researchers steeped in these issues, including myself, expect that the most likely result of building a superhumanly smart AI, under anything remotely like the current circumstances, is that literally everyone on Earth will die. Not as in “maybe possibly some remote chance,” but as in “that is the obvious thing that would happen.” It’s not that you can’t, in principle, survive creating something much smarter than you; it’s that it would require precision and preparation and new scientific insights, and probably not having AI systems composed of giant inscrutable arrays of fractional numbers.

Most AI doomsday scenarios rely on compounding assumptions and intuitive leaps (for example, that creating synthetic super intelligence is possible in the first instance, or that it escapes human control, or that it is centralized, or that it becomes agentic, or that it decides to destroy humanity). Rather than delve into a specific doomsday scenario, Yudkwosky links to a “survey” in which respondents estimate the risk of uncontrolled AI. Again, these forecasts are given without any supporting evidence (fanciful and detailed doomsday scenarios notwithstanding). Still, when a group of AI researchers tell you that we should slow down or risk destroying humanity, we should listen, right? Perhaps not.

The validity of forecasts beyond the 10-year horizon

We know from Phil Tetlock’s work on forecasting that domain experts consistently overestimate risks emanating from their field and that the accuracy of forecasts decays rapidly as time horizons expand. It may be reasonable to forecast events with a 12-month, and even 5-year, horizon. Beyond that scope, accuracy is so hard to estimate that it renders the forecast almost useless from a policy view. In a world of trade-offs and limited resources, should governments halt progress or divert public energy to risks that are impossible to quantify accurately?

Actuarial tables help us hedge against risk—they do this because they employ base rates in their forecasts. As critics of the long-termist viewpoint have noted, the base-rate for human extinction is zero. Of course, this is of only mild comfort since the past can only tell us so much about the future. Still, invoking Tetlock once more, the base-rate is what informs our forecasts, meaning any attempt to estimate the existential threat from technology ought to start from zero.

Examining the prevalence of belief in AI existential risk

Let’s assume for a moment that domain experts who warn of imminent threats to humanity’s survival from AI are acting in good faith and are sincere in their convictions. How can one explain why so many intelligent individuals (some dubbed the “godfathers of AI” by news media) would coalesce around an unreasonable position? I suspect two phenomena may be at play—a general foreboding about the future coupled with motivated reasoning.

Psychologist Tali Sharot illustrates this in her work, noting that surveys and her own lab’s experiments consistently find a gap between general optimism of personal circumstances and outlook for society. “While private optimism (positive expectations regarding our own future) is commonplace, it is typically accompanied by public despair (negative expectations regarding the future of our country),” she writes. A Pew Research survey from August 2023 found that more than half of Americans were “more concerned than excited” about the increased use of artificial intelligence. Fear of a change and novel technology isn’t limited to AI—we’ve seen similar skepticism from the general public on nuclear fission, climate change, genetic engineering, etc. Why this systemic aversion to novel technology?

Avoiding the unknown recruits two cognitive biases—the status quo bias and uncertainty avoidance. Status quo bias is a subcategory born from Kahneman and Tversky’s prospect theory, in which people overvalue what they already have compared to what they don’t. According to researchers, people systematically avoid uncertainty whenever possible, although there is a marked cultural difference between groups in their tolerance for ambiguity. Differing national levels in risk tolerance may explain the gap between public opinion on, for example, genetic engineering between the United States and Europe.

Wharton Business School professor Jonah Berger puts this well in his book Contagious, “This devaluing of things uncertain is called the ‘uncertainty tax.’ When choosing between a sure thing and a risky one, the risky option has to be that much better to get chosen. The remodeled room has to be that much nicer. The gamble has to be that much higher in expected value.” This may help explain why novel technologies are met with suspicion to varying degrees across national cultures, but it still leaves the question of why domain experts continue to view specific AI doomsday scenarios as credible.

Fooling oneself

Another cognitive quirk may be at play—the ability to structure an argument around an emotion. Science author Dave Robson puts it well in The Intelligence Trap, noting three reasons why “an intelligent person may act stupidly.” Namely, “They may lack elements of creative or practical intelligence that are essential for dealing with life’s challenges; they may suffer from ‘dysrationalia,’ using biased intuitive judgments to make decisions; and they may use their intelligence to dismiss any evidence that contradicts their views thanks to motivated reasoning.” This is not to say that any specific AI researcher who assigns a probability of say, 10 percent, that rogue AI will destroy humanity in the next decade suffers from dysrationalia or is hopelessly trapped in cognitive biases, but it gives us a sense of the macro picture. In any case, we ought to judge the argument on its merits and not the pedigree of its proponent.

Skepticism of expert warnings

Most of us are inclined to weigh expert opinion above that of novices, yet this heuristic may break down in cases with high uncertainty. You should value your physician’s interpretation of an MRI more than an eight-year-old, but you may want to assign their judgments of a coming apocalypse equally. In essence, questions that require forecasting beyond the next decade and that require multiple assumptions to be true move from epistemic uncertainty to aleatory uncertainty—from the theoretically knowable to the unknowable. Setting aside the existential risk from AI, we can instead focus on near-term negative externalities.

The key to efficient market interventions from central authorities in the form of regulation is an entire subject unto itself. For our purposes, we must acknowledge the inherent trade-offs between market efficiency, liberty, and regulation. We balance an individual right to free speech with collective security, for example, by curtailing speech when it is designed to spark violence. Too often, the debate around AI regulation is painted without mention of trade-offs. For example, a global pause in model training that many advocated for made no reference to the idea’s inherent weakness—that is, it sets up a prisoner’s dilemma in which the more AI firms voluntarily agree to pause research, the greater the incentive for any one group to defect from the agreement and gain a competitive edge. It makes no mention of practical implementation, nor does it explain how it arrived on its pause time-duration; nor does it recognize the improbability of enforcing a global treaty on AI.

Any discussion on curtailing private enterprise in the name of public good must clearly establish a real causal relationship to negative externalities and estimate trade-offs in the form of lower efficiency and slower progress. Systematic reviews show an inverse relationship between regulatory burdens and innovation. And innovation will be the key to continued global prosperity, without which we may see increased geopolitical instability as pension systems collapse under demographic burdens.

Moving the debate towards observable risk

A better way to look at AI alignment might be to set aside existential risk and focus on demonstrated externalities and ethical considerations. While less heroic, considerations like the production at scale of hateful synthetic media, or copyright infringement, or scams hypercharged by AI, are more proximate and data-driven then long-range forecasts of doom. What might research along those lines look like?

Focusing on demonstrated externalities from LLMs and other forms of AI means creating industry standards, best practices, a code of ethics, etc. Just as the public demands accountability and sound ethical practices from journalists and major news media, or from its legal practitioners, so too should we expect responsible behavior from AI companies. And just as news media organizes itself into associations with its own norms that ultimately protect the entire industry from bad actors, so too should AI firms form their own guild. Moving the discussion to the responsible development of AI and away from doomsday scenarios, we can focus on practical steps, which companies like Anthropic are already doing, as are institutions like the Center for Human-Campatible Artificial Intelligence

Practical interventions

The AI safety community must acknowledge the practical limits of global enforcement and regulatory regimes. For example, authorities can attempt to ban doping in professional sports, but pull factors still prompt athletes to risk their health to gain an advantage. The more athletes that are drug-free, the greater the incentive to cheat. On a grander scale, nuclear weapon proliferation works this way. A strict international regime dedicated to preventing proliferation still failed to prevent India, Israel, Pakistan, North Korea, and, likely, Iran from acquiring weapons. (The counterfactual is unknowable, i.e., how many states would have an active nuclear weapons program in the absence of a global regime—I would maintain that states with nuclear umbrellatreaties have little incentive to pursue costly and unpopular weapons programs).

Instead of outlandish ideas of a new global government capable of unilaterally curtailing compute power or some other factor through force, we should focus on what is practically achievable today. Encouraging firms like OpenAI to red-team their models before release, for example, is practical and limits negative externalities. This practical approach, which is already well underway in labs across the globe, and focuses on issues like LLMs’ interpretability and on creating a global community of researchers. The American Bar Association looks to limit unethical behavior among attorneys—this ultimately helps the entire industry and engenders public trust. AI companies need similar institutions, ones that encourage ethical behavior and avoid race to the bottom.