Beware of non-evidence-based argumentation

Recently, many people have given increasingly short timelines for AGI. What unites many of these statements is a thorough lack of evidence. Instead, many resort to narrative arguments, ad-hoc predictive models, and mismatched evidence. In this post, I will cover each of these types of non-evidence-based argument.

I have described my views on AI risk previously in this post, which I think is still relevant. I have also laid out a basic argument against AI risk interventions in this comment, where I argue that AI risk is neither important, neglected, nor tractable.

Clarification (29.1.2026): In this post, I use the term “evidence” in the context of “evidence-based policy making”. A major tenet of the EA movement, I believe, is that interventions should be based on evidence as part of rational decision making. In this context, evidence is contrasted with opinions, intuitions, anecdotes, etc. I have elaborated further on what the bar for this kind of evidence is in the comments; in particular, while I do not believe that pre-publication peer review is always necessary, results should withstand wider scrutiny and replication.

Narrative arguments

A narrative argument presents an argument in the form of a story, parable, or extended metaphor. Such arguments present a chain of events as probable or even inevitable due to its narrative coherence: if the “logical continuation” of the story follows a pattern, that pattern is then postulated to exist in the real world as well. This depends on the “story-logic” matching real-world causal mechanisms.

Narrative arguments can be a powerful way to demonstrate or explain a phenomenon in an understandable way, but they need to be supported by actual evidence. The existence of the narrative is not evidence in itself. Narratives are easily manipulated, and it is always possible to find another narrative (or story) that has a different ending, thus demonstrating a different conclusion.

In recent discussion of AI risk, narrative arguments have become increasingly common. They are often accompanied either by faulty evidence or by no evidence at all. Perhaps the most prominent example is AI 2027, which features a science fiction story supplemented by predictive models. These models have been heavily criticized as faulty in many ways. They are also somewhat difficult for an average person to understand, and I doubt that most readers of AI 2027 have taken a good look at them. Therefore, AI 2027 rests almost entirely on the narrative argument, unsubstantiated by any evidence other than the faulty models.

Another example is the book If Anyone Builds It, Everyone Dies by Yudkowsky and Soares. This book is entirely based on narrative arguments, featuring a full multi-chapter science fiction story very similar to that of AI 2027, alongside many short parables that provide little to no value since they are not accompanied by evidence.

One parable struck me as especially memorable, since it appeared to argue for something other than what it actually did. The story features a kingdom where alchemists try to synthesize gold from lead; despite the king threatening to execute anyone who fails, along with their entire village, many alchemists attempt this impossible task. I thought the task of creating gold was akin to the task of creating a god-like AI – impossible. The king’s threat was an analogue for the bankruptcy of OpenAI and other companies that failed to deliver on this promise. But no! Yudkowsky and Soares actually meant gold to symbolize alignment and the king to symbolize AI killing everyone.

Narratives can be twisted to fit any purpose, and even the same narrative can be used to justify opposite conclusions. Narratives are not evidence; they might help understand evidence. Never trust narratives alone.

Ad-hoc predictive models

The second type of argument I’ve seen (most prominently in the case of AI 2027) is the ad-hoc predictive model.

There is a saying in my city that a skilled user of Excel can justify any decision. Whenever a funding decision is made (on a railway project, for example), the city council must propose a budget with predicted costs and revenues, along with an assessment of benefits and harmful effects. It is typical for the politicians proposing the project to make their own, “optimistic” version of these predictions, and for the opposing politicians to make a “pessimistic” version. By adjusting the factors that go into these predictions – how many people switch from cars to trains, how many new jobs are created near the railway stations, etc. – it is possible to mangle the prediction into supporting any outcome.
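To make this concrete, here is a minimal, purely hypothetical sketch of such a cost-benefit calculation. Every number and parameter name below is invented for illustration and describes no real project; the only point is that the sign of the result is determined entirely by which assumptions are fed in.

```python
# A purely hypothetical cost-benefit sketch. Every number below is invented;
# the point is only that the conclusion follows from the chosen assumptions.

def railway_net_benefit(build_cost, riders_per_year, value_per_ride,
                        new_jobs, value_per_job, years=30):
    """Net benefit over the appraisal period, in made-up currency units."""
    annual_benefit = riders_per_year * value_per_ride + new_jobs * value_per_job
    return annual_benefit * years - build_cost

# Assumptions chosen by the project's proponents.
optimistic = railway_net_benefit(build_cost=500e6, riders_per_year=4e6,
                                 value_per_ride=3.0, new_jobs=2000, value_per_job=5000)

# Assumptions chosen by the project's opponents.
pessimistic = railway_net_benefit(build_cost=700e6, riders_per_year=1.5e6,
                                  value_per_ride=3.0, new_jobs=300, value_per_job=5000)

print(f"Optimistic net benefit:  {optimistic / 1e6:+,.0f} M")   # clearly positive
print(f"Pessimistic net benefit: {pessimistic / 1e6:+,.0f} M")  # clearly negative
```

Both parameter sets sound plausible on their own, yet one yields a net benefit of roughly +160 M and the other roughly −520 M. The spreadsheet only launders the assumptions into an authoritative-looking number.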

This is called “policy-based evidence making”, and it is common in politics. Often, these biased models are accompanied by a narrative argument painting whatever brighter future the politician in question believes their policy will bring about. Spotting the faults in such ad-hoc models is very difficult for regular citizens who are not experts in, e.g., traffic modelling. For this reason, the best strategy for a layperson is to meet any use of predictive models not supported by academic consensus with strong skepticism.

I believe that this bears a lot of parallels to the predictive models presented in AI 2027. While, unlike in the case of my local politicians, I do not believe that the authors necessarily had any malicious intent, the model is similarly unjustified by academic consensus. It has many hidden, likely biased assumptions and ad-hoc modelling decisions. Since all of these assumptions are more or less based on intuition instead of evidence, they should be considered to have high uncertainty.

Rationalists sometimes use the phrase “If it’s worth doing, it’s worth doing with made-up statistics.” While I’m sympathetic to the idea that people should reveal their hidden assumptions in numerical form, this advice is often taken to mean that made-up numbers somehow become good evidence once put into a predictive model. No: the numbers are still made up, and the results of the model are also made up. Garbage in, garbage out.
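As a purely illustrative sketch (not a reconstruction of the AI 2027 model or of any other specific model), consider a bare-bones exponential extrapolation in which both inputs, the assumed doubling time and the assumed number of doublings still needed, are picked by intuition:

```python
# A toy extrapolation with made-up inputs. Neither the doubling times nor the
# "doublings needed" below are based on evidence; the forecast dates inherit
# that arbitrariness directly.

def forecast_year(start_year, doublings_needed, doubling_time_months):
    """Year at which the assumed capability gap closes under pure exponential growth."""
    return start_year + doublings_needed * doubling_time_months / 12

for doubling_time in (4, 7, 12):          # months; chosen by intuition
    for doublings_needed in (8, 12, 20):  # size of the "gap"; also chosen by intuition
        year = forecast_year(2025, doublings_needed, doubling_time)
        print(f"doubling time {doubling_time:>2} mo, {doublings_needed:>2} doublings"
              f" -> around {year:.0f}")
```

Varying the two made-up inputs within ranges that all sound superficially plausible moves the forecast from the late 2020s to the mid-2040s. The model adds no information that was not already contained in the assumptions.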

Evidence mismatch

Another common style of argumentation is to use evidence for one thing as evidence for something else.

For example, Aschenbrenner’s Situational Awareness basically argues that because AI systems are on the level of a “smart high-schooler” in some tasks, in the future they will be on the PhD level in all tasks. Aschenbrenner justified this by claiming that the current models are “hobbled” and unhobbling them would expand their capabilities to these other tasks. I wrote about the essay in an earlier post.

Similarly, many people (including the authors of AI 2027) base their argumentation on METR’s Long Tasks benchmark, despite it covering only a limited domain of tasks, many of which are unrealistic[1] compared to real-world programming tasks. Coding is one of the domains in which it is easy to automatically verify AI responses without human feedback, and it has seen a lot of progress in recent years, which explains the exponential progress seen in that domain. However, this is in no way evidence for progress in other domains, or even in real-world programming tasks.

Benchmarks, overall, are a very bad source of evidence due to Goodhart’s law. It is easy to cheat on benchmarks by training the model specifically on similar tasks (even if the model is not trained on the test set per se). If a model is optimized, or “benchmaxxed”, towards a benchmark, progress on that benchmark cannot be taken as evidence of progress on other tasks. Unfortunately, this issue plagues most benchmarks.

Sometimes evidence mismatch is caused by a mismatch in definitions. One example is the definition of AGI used by Metaculus, which I have criticized in this comment. Since the Metaculus definition doesn’t actually define AGI, but instead uses poor indicators for it that could arguably be passed by a non-AGI system, it is entirely unclear what the Metaculus question measures. It cannot be used as evidence about AGI if it does not forecast AGI.

Conclusions

In this post, I’ve criticized non-evidence-based arguments, and this criticism hangs on the idea that evidence is inherently required. Yet it has become commonplace to claim the opposite. One example of such an argument is presented in the International AI Safety Report[2], in which the authors argue that AI poses an “evidence dilemma” to policymakers:

Given sometimes rapid and unexpected advancements, policymakers will often have to weigh potential benefits and risks of imminent AI advancements without having a large body of scientific evidence available. In doing so, they face a dilemma. On the one hand, pre-emptive risk mitigation measures based on limited evidence might turn out to be ineffective or unnecessary. On the other hand, waiting for stronger evidence of impending risk could leave society unprepared or even make mitigation impossible – for instance if sudden leaps in AI capabilities, and their associated risks, occur.

This text uses emotive phrases such as “imminent AI advancements” and “impending risk” despite acknowledging that there is only limited evidence. This kind of rhetoric goes against the tenet of rational, evidence-based policy making, and it is alarming that high-profile expert panels use it.

I believe that states should ensure that the critical functions of society continue to exist in unexpected, unknown crisis situations. In my country, Finland, this is called huoltovarmuus (security of supply), and it is a central government policy. In recent years, this concept has been expanded to include digital infrastructure and digital threats, including things like social media election interference and unspecified cyberthreats. I think it is fine to prepare for hypothetical AI threats together with other, more tangible threats as part of general preparedness. This work has low cost but huge impact in a crisis situation.

But calling for drastic actions specifically for AI is something that requires more evidence. Existential risk from AI, in particular, is an extraordinary claim requiring extraordinary evidence.

  1. ^

    This article by Nathan Witkin includes a lot of criticism of METR’s methodology, and I encourage people to read it in its entirety.

  2. ^

    This report was written by an expert panel nominated by 30 world governments, the EU, the UN, and the OECD, and is led by Prof. Yoshua Bengio.