Releasing TakeOverBench.com: a benchmark for AI takeover
Today, PauseAI and the Existential Risk Observatory release TakeOverBench.com: a benchmark, but for AI takeover.
There are many AI benchmarks, but this is the one that really matters: how far are we from a takeover, possibly leading to human extinction?
In 2023, the broadly coauthored paper Model evaluation for extreme risks defined the following nine dangerous capabilities: Cyber-offense, Deception, Persuasion & manipulation, Political strategy, Weapons acquisition, Long-horizon planning, AI development, Situational awareness, and Self-proliferation. We think progress in all of these domains is worrying, and it is even more worrying that some of these domains add up to AI takeover scenarios (existential threat models).
Using SOTA benchmark data, to the degree it is available, we track how far we have progressed on our trajectory towards the end of human control. We highlight four takeover scenarios, and track the dangerous capabilities needed for them to become a reality.
Our website aims to be a valuable source of information for researchers, policymakers, and the public. At the same time, we want to highlight gaps in the current research:
For many leading benchmarks, we just don’t know how the latest models score. RepliBench, for example, hasn’t been run for almost a whole year. We need more efforts to run existing benchmarks against newer models!
AI existential threat models have received only limited serious academic attention, which we think is a very poor state of affairs (the Existential Risk Observatory, together with MIT FutureTech and FLI, is currently trying to mitigate this situation with a new threat model research project).
Even if we had accurate threat models, we currently don’t know exactly where the capability red lines (or red regions, given uncertainty) are. Even if we had accurate red lines/regions, we don’t always reliably know how to measure them with benchmarks.
Despite all these uncertainties, we think it is constructive to center the discussion on the concept of an AI takeover, and to present the knowledge that we do have on this website.
We hope that TakeOverBench.com contributes to:
Raising awareness.
Grounding takeover scenarios in objective data.
Providing accessible information for researchers, policymakers, and the public.
Highlighting gaps in research on takeover scenarios, red lines, and benchmarks.
TakeOverBench.com is an open source project, and we invite everyone to comment and contribute on GitHub.
Enjoy TakeOverBench!
What scale is the METR benchmark on? I see the line “Scores are normalized such that 100% represents a 50% success rate on tasks requiring 8 human-expert hours”, but is the 0% point on the scale 0 hours?
METR does not think that 8 human hours is sufficient autonomy for takeover; in fact 40 hours is our working lower bound.
METR has an official internal view on what time horizons correspond to “takeover not ruled out”?
See the GPT-5 report. “Working lower bound” is maybe too strong; it’s probably more accurate to describe it as an initial guess at a warning threshold for rogue replication and 10x uplift (if we can even measure time horizons that long). I don’t know what the exact reasoning behind 40 hours was, but one fact is that humans can’t really start viable companies using plans that only take a ~week of work. IMO, if AIs could do the equivalent with only a 40-human-hour time horizon and continuously evade detection, they’d need to be exploiting their own advantages and have made up for many of their current disadvantages relative to humans (like being bad at adversarial and multi-agent settings).
Indeed, the 0% point is zero hours, so compared to the METR plot the time horizon is simply divided by 8 hours.
I agree the 8 hours is somewhat arbitrary, and I had missed that METR has a more ‘official’ stance on it. I’ve now opened an issue to see whether anyone else has reasons to keep it at 8 hours.
(For context: I did most of the benchmark literature review and data collection for this project.)
Edit (29 Jan 2026):
The change to 40 hour normalization is now live!
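To make the normalization concrete, here is a minimal sketch of the mapping described above: a model’s 50%-success time horizon in human-expert hours, scaled linearly against a reference horizon (originally 8 hours, now 40). This is only an illustration under those assumptions, not the site’s actual code; the function name and example values are hypothetical.

```python
# Minimal sketch of the normalization described above (not the project's actual code).
# A model's 50%-success time horizon (in human-expert hours) is scaled linearly
# against a reference horizon: 8 hours originally, 40 hours after the update.

def normalized_score(horizon_hours: float, reference_hours: float = 40.0) -> float:
    """Map a METR-style time horizon onto the 0-100% scale (0 hours -> 0%)."""
    return 100.0 * horizon_hours / reference_hours

# Hypothetical example: a 2-hour time horizon.
print(normalized_score(2.0, reference_hours=8.0))   # 25.0 -> 25% on the old 8-hour scale
print(normalized_score(2.0, reference_hours=40.0))  # 5.0  -> 5% on the new 40-hour scale
```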
Could you please explain your reasoning on 40 hours?
I think this is pretty cool. Good to see some relevant benchmarks collected in the same place, and I can see how this is handy as a communication tool.
From a quick skim I wasn’t really sure how to interpret the main graph, and there didn’t seem to be an explanation. In particular, the Y axis is a percentage, but a percentage of what? Some of the benchmarks are projected to reach 100% within a year; does that mean you project an AI takeover within a year, etc.?
(Sharing less as ‘please answer my question’ and more as ‘user feedback’—if I’m confused by this, I imagine lots of people who know (even) less than me about AI (safety) will also be confused; though maybe they’re not your target audience)
Hi Jamie, thanks for your comment, glad you like it!
It’s hard to address this without partly answering your question anyway, but we appreciate the user feedback too.
We got some quick data on the project yesterday (n=15, tech audience but not xrisk; data here). We asked, among other questions: “In your own words, what is this website tracking or measuring?” Almost everyone gave a correct answer. Judging from the other answers as well, I think the main points get across pretty well, so we’re not really planning to modify too much.
The percentage you’re asking about (‘Score’) is the share of questions the AI model answers correctly in a benchmark (with 1-2 exceptions, which we explain under ‘Benchmarks’). I agree that’s not super clear; I’ve added an issue on GitHub to explain this a bit better.
Does 100% mean a takeover? Not really. The issue here is that none of us knows exactly at which capability threshold a takeover could occur. We don’t have data on takeovers, since they haven’t happened yet, and the world is complex. ‘Human expert level’ is definitely a relevant boundary to cross, and we have included it in the benchmark plots wherever meaningful (not on the homepage; that would have been too messy).
As we said, we think part of the website’s point is to point out missing pieces of the puzzle. Threat models (AI takeover scenarios) have so far received little rigorous scientific analysis, and we plan to do research on them this year (Existential Risk Observatory, MIT FutureTech, FLI). Once we have more robust threat models, we should determine which dangerous capabilities have which red lines for each threat model. Then we can find out whether current benchmarks can measure those and, if so, what the relevant scores are (and, if not, build new ones that can). We’d like to work on these projects together with other researchers!
Currently, that work is not done. TakeOverBench is an attempt to shed more light on the matter using the research we have right now. We plan to update it when better research becomes available.
Cool website!
One question: why are most of the SOTA models Claude? Is it because Anthropic is the company that releases the most data about their models? I thought that by most measures, Gemini would be the SOTA model today.
Thanks!
To the largest degree possible, we have collected data from public leaderboards or system cards, and it does seem to be the case that Gemini models are a bit underrepresented. I am not sure why that is, but Anthropic releasing more data is definitely part of it. For example, the most recent CyBench data points come from the Claude (and Grok) system cards, and for the virology test there are data points in the Opus 4.5 system card but not in the Gemini 3.0 Pro one.