AI benchmarks

Tag

Last edit: 2 Feb 2024 10:57 UTC by Toby Tremlett🔹

Benchmarks are tests that measure progress in AI capabilities and check for characteristics that might pose safety risks.

Related entries

AI safety | standards and regulation

VSPE vs. flattery: Testing emotional scaffolding for early-stage alignment

Astelle Kay, 24 Jun 2025 9:39 UTC

2 points

1 comment, 1 min read, EA link

Trendlines in AIxBio evals

ljusten, 31 Oct 2024 0:09 UTC

40 points

2 comments, 11 min read, EA link

(www.lennijusten.com)

Open Phil releases RFPs on LLM Benchmarks and Forecasting

Lawrence Chan, 11 Nov 2023 3:01 UTC

12 points

0 comments, 2 min read, EA link

(www.openphilanthropy.org)

XPT forecasts on (some) Direct Approach model inputs

Forecasting Research Institute, 20 Aug 2023 12:39 UTC

37 points

0 comments, 9 min read, EA link

Announcing Epoch’s newly expanded Parameters, Compute and Data Trends in Machine Learning database

Robi Rahman🔸, 25 Oct 2023 3:03 UTC

38 points

1 comment, 1 min read, EA link

(epochai.org)

Language models surprised us

Ajeya, 29 Aug 2023 21:18 UTC

59 points

10 comments, 5 min read, EA link

Prizes for ML Safety Benchmark Ideas

Joshc, 28 Oct 2022 2:44 UTC

56 points

8 comments, 1 min read, EA link

Results from an Adversarial Collaboration on AI Risk (FRI)

Forecasting Research Institute, 11 Mar 2024 15:54 UTC

196 points

25 comments, 9 min read, EA link

(forecastingresearch.org)

AI Forecasting Research Ideas

Jaime Sevilla, 17 Nov 2022 17:37 UTC

78 points

1 comment, 1 min read, EA link

(docs.google.com)

$250K in Prizes: SafeBench Competition Announcement

Center for AI Safety, 3 Apr 2024 22:07 UTC

47 points

0 comments, 1 min read, EA link

Stated Values, Revealed Habits: The Challenge of Measuring AI Preferences

Aidan Kankyoku, 7 Jul 2026 17:07 UTC

7 points

0 comments, 21 min read, EA link

Long list of AI questions

NunoSempere, 6 Dec 2023 11:12 UTC

124 points

16 comments, 86 min read, EA link

Survey on the acceleration risks of our new RFPs to study LLM capabilities

Ajeya, 10 Nov 2023 23:59 UTC

44 points

1 comment, 8 min read, EA link

Announcing Epoch’s dashboard of key trends and figures in Machine Learning

Jaime Sevilla, 13 Apr 2023 7:33 UTC

127 points

4 comments, 1 min read, EA link

(epochai.org)

A compute-based framework for thinking about the future of AI

Matthew_Barnett, 31 May 2023 22:00 UTC

96 points

36 comments, 19 min read, EA link

AI Benchmarks Series — Metaculus Questions on Evaluations of AI Models Against Technical Benchmarks

christian, 27 Mar 2024 23:05 UTC

10 points

0 comments, 1 min read, EA link

(www.metaculus.com)

Metaculus Q4 AI Benchmarking: Bots Are Closing The Gap

Molly Hickman, 19 Feb 2025 22:46 UTC

42 points

8 comments, 13 min read, EA link

Are the Costs of AI Agents Also Rising Exponentially?

Toby_Ord, 2 Feb 2026 8:45 UTC

82 points

11 comments, 8 min read, EA link

(www.tobyord.com)

Q2 AI Benchmark Results: Pros Maintain Clear Lead

Benjamin Wilson 🔸, 28 Oct 2025 5:13 UTC

55 points

0 comments, 24 min read, EA link

(www.metaculus.com)

Announcing Metaculus Summer 2026 FutureEval Bot Tournament

postreal, 1 May 2026 16:36 UTC

5 points

0 comments, 4 min read, EA link

(www.metaculus.com)

Where’s my ten minute AGI?

Vasco Grilo🔸, 19 May 2025 17:45 UTC

47 points

6 comments, 7 min read, EA link

(epoch.ai)

NYU CMEP Call for EOIs: Contract Technical Benchmarking Lead & Researcher, Welfare Alignment Project

Sofia_Fogel, 15 Apr 2026 19:43 UTC

32 points

0 comments, 2 min read, EA link

Evidence that Recent AI Gains are Mostly from Inference-Scaling

Toby_Ord, 2 Feb 2026 8:45 UTC

23 points

2 comments, 5 min read, EA link

(www.tobyord.com)

AI predictions for 2026

Ajeya, 20 Jan 2026 10:35 UTC

70 points

2 comments, 7 min read, EA link

(Linkpost) METR: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

Yadav, 11 Jul 2025 8:58 UTC

37 points

2 comments, 2 min read, EA link

(metr.org)

Decentralizing Model Evaluation: Lessons from AI4Math

SMalagon, 5 Jun 2025 18:57 UTC

23 points

1 comment, 4 min read, EA link

Encultured AI, Part 1: Enabling New Benchmarks

Andrew Critch, 8 Aug 2022 22:49 UTC

17 points

0 comments, 6 min read, EA link

Benchmark Scores = General Capability + Claudiness

Vasco Grilo🔸, 25 Nov 2025 17:58 UTC

19 points

0 comments, 4 min read, EA link

(epochai.substack.com)

AISN #61: OpenAI Releases GPT-5

Center for AI Safety, 12 Aug 2025 17:52 UTC

6 points

0 comments, 4 min read, EA link

(newsletter.safe.ai)

AISN #65: Measuring Automation and Superintelligence Moratorium Letter

Center for AI Safety, 29 Oct 2025 16:08 UTC

8 points

0 comments, 3 min read, EA link

(newsletter.safe.ai)

Testing Human Flow in Political Dialogue: A New Benchmark for Emotionally Aligned AI

DongHun Lee, 30 May 2025 4:37 UTC

1 point

0 comments, 1 min read, EA link

METR Time Horizon 2.0—The benchmark you’ve been waiting for

AgentMa🔸, 8 Jul 2026 23:41 UTC

24 points

16 comments, 5 min read, EA link

Building Technology to Drive AI Governance

jsteinhardt, 18 Feb 2026 22:35 UTC

14 points

2 comments, 7 min read, EA link

AI Forecasting in 2026: What 11 Analyses Say

Benjamin Wilson 🔸, 8 Jul 2026 14:33 UTC

10 points

0 comments, 17 min read, EA link

(www.metaculus.com)

Impact of Quantization on Small Language Models (SLMs) for Multilingual Mathematical Reasoning Tasks

Angie Paola Giraldo, 7 May 2025 21:48 UTC

11 points

0 comments, 14 min read, EA link

Why I am Still Skeptical about AGI by 2030

James Fodor, 2 May 2025 7:13 UTC

134 points

16 comments, 6 min read, EA link

Fit Testing AI Benchmarking

Declan McKenna 🔷, 7 Apr 2026 10:17 UTC

4 points

1 comment, 2 min read, EA link

(declanmck.com)

Estimating METR Time Horizons for Claude Opus 4.6 and GPT 5.3 Codex (xhigh)

Charles Dillon 🔸, 16 Feb 2026 18:15 UTC

8 points

0 comments, 3 min read, EA link

o3

Zach Stein-Perlman, 20 Dec 2024 21:00 UTC

84 points

9 comments, 1 min read, EA link

Announcing the AI Forecasting Benchmark Series | July 8, $120k in Prizes

christian, 19 Jun 2024 21:37 UTC

52 points

4 comments, 5 min read, EA link

(www.metaculus.com)

A Calibration Benchmark for LLM Beliefs Across a Taxonomic Hierarchy

DanRKAlex, 7 Jul 2026 13:38 UTC

1 point

0 comments, 3 min read, EA link

(github.com)

Automated Evaluation of LLMs for Math Benchmark.

CisnerosA, 30 Oct 2025 20:28 UTC

3 points

0 comments, 5 min read, EA link

Inference Scaling and the Log-x Chart

Toby_Ord, 2 Feb 2026 8:43 UTC

29 points

2 comments, 9 min read, EA link

(www.tobyord.com)

The tables have turned on AI sceptics

Stefan_Schubert, 7 May 2026 17:53 UTC

83 points

35 comments, 3 min read, EA link

(www.update.news)

Race to the Top: Benchmarks for AI Safety

isaduan, 4 Dec 2022 22:50 UTC

52 points

8 comments, 1 min read, EA link

Benchmarking Emotional Alignment: Can VSPE Reduce Flattery in LLMs?

Astelle Kay, 4 Aug 2025 3:36 UTC

2 points

0 comments, 3 min read, EA link

The Scaling Paradox

Toby_Ord, 30 Jan 2026 13:34 UTC

51 points

1 comment, 8 min read, EA link

(www.tobyord.com)

Eval-related prompt cues predicted refusal shifts across 32k LLM rollouts

Ratnaditya, 19 May 2026 16:54 UTC

1 point

0 comments, 1 min read, EA link

Dwarkesh Patel’s thoughts on AI progress (Dec 2025)

Vasco Grilo🔸, 1 Feb 2026 9:28 UTC

31 points

2 comments, 8 min read, EA link

(www.dwarkesh.com)

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Matrice Jacobine🔸🏳️‍⚧️, 12 May 2025 15:20 UTC

14 points

1 comment, 1 min read, EA link

(www.arxiv.org)

Predict 2025 AI capabilities (by Sunday)

Jonas_15 Jan 2025 0:16 UTC

16 points

0 comments, 1 min read, EA link

AISN #53: An Open Letter Attempts to Block OpenAI Restructuring

Center for AI Safety, 29 Apr 2025 15:56 UTC

6 points

0 comments, 4 min read, EA link

(newsletter.safe.ai)

Coordinal: A Postmortem.

Ronak Mehta, 18 May 2026 20:43 UTC

88 points

6 comments, 4 min read, EA link

(ronakrm.github.io)

From Long Novels to Large Language Models

sorenprojections, 3 May 2026 12:04 UTC

1 point

0 comments, 46 min read, EA link

There is no METR for medical AI. I want to build one.

Mahmud Omar, 9 Mar 2026 21:31 UTC

21 points

3 comments, 1 min read, EA link

AIs Are Expert-Level at Many Virology Skills

Center for AI Safety, 2 May 2025 16:07 UTC

22 points

0 comments, 1 min read, EA link

From Therapy Tool to Alignment Puzzle-Piece: Introducing the VSPE Framework

Astelle Kay, 18 Jun 2025 14:47 UTC

6 points

1 comment, 2 min read, EA link

I underestimated AI capabilities (again)

Ajeya, 5 Mar 2026 19:15 UTC

64 points

4 comments, 5 min read, EA link

Road to AnimalHarmBench

Artūrs Kaņepājs, 1 Jul 2025 13:37 UTC

141 points

11 comments, 7 min read, EA link

AGI by 2032 is extremely unlikely

Yarrow Bouchard 🔸, 16 Oct 2025 22:50 UTC

24 points

44 comments, 7 min read, EA link

Good Benchmarks

Ivan Bercovich, 14 Jul 2026 14:03 UTC

2 points

0 comments, 12 min read, EA link

Absolute Zero: AlphaZero for LLM

alapmi, 12 May 2025 14:54 UTC

2 points

0 comments, 1 min read, EA link

Measuring Adversarial Robustness of LLMs in Nonhuman Welfare Reasoning

Allen Lu 🔸, 10 Feb 2026 21:16 UTC

16 points

1 comment, 5 min read, EA link

Is there a Half-Life for the Success Rates of AI Agents?

Toby_Ord, 2 Feb 2026 8:44 UTC

29 points

1 comment, 10 min read, EA link

(www.tobyord.com)

Performance of Large Language Models (LLMs) in Complex Analysis: A Benchmark of Mathematical Competence and its Role in Decision Making.

Jaime Esteban Montenegro Barón, 6 May 2025 21:08 UTC

1 point

0 comments, 23 min read, EA link

We are in a New Paradigm of AI Progress—OpenAI’s o3 model makes huge gains on the toughest AI benchmarks in the world

Garrison, 22 Dec 2024 21:45 UTC

26 points

0 comments, 4 min read, EA link

(garrisonlovely.substack.com)

Three Weeks In: What GPT-5 Still Gets Wrong

JAM, 27 Aug 2025 14:43 UTC

2 points

0 comments, 3 min read, EA link

Large Language Models Pass the Turing Test

Matrice Jacobine🔸🏳️‍⚧️, 2 Apr 2025 5:41 UTC

11 points

6 comments, 1 min read, EA link

(arxiv.org)

Is AI Hitting a Wall or Moving Faster Than Ever?

Garrison, 9 Jan 2025 22:18 UTC

35 points

5 comments, 5 min read, EA link

(garrisonlovely.substack.com)

MORU—A benchmark for generalized moral compassion

Declan McKenna 🔷, 10 Mar 2026 15:24 UTC

25 points

0 comments, 3 min read, EA link

Exaggerating the risks (Part 20: AI 2027 timelines forecast, benchmarks and gaps) | Reflective Altruism

Unofficial Reflective Altruism Cross-Poster, 1 Jan 2026 23:32 UTC

14 points

6 comments, 18 min read, EA link

(reflectivealtruism.com)

TSArena: Independent Blind Pairwise Evaluation of AI Safety Behavior

solosevn, 2 Mar 2026 15:16 UTC

1 point

0 comments, 1 min read, EA link

Animal Norms In Moral Assessment (ANIMA): Evaluating LLMs on reasoning about animal welfare

Sentient Futures, 5 Nov 2025 1:13 UTC

55 points

7 comments, 6 min read, EA link

The Khayali Protocol

khayali, 2 Jun 2025 14:40 UTC

−8 points

0 comments, 3 min read, EA link

OpenAI’s o3 model scores 3% on the ARC-AGI-2 benchmark, compared to 60% for the average human

Yarrow Bouchard 🔸, 1 May 2025 13:57 UTC

16 points

15 comments, 3 min read, EA link

(arcprize.org)

[Question] Is benchmarking AI capabilities positive EV?

Charlie_Guthmann, 17 Feb 2026 22:30 UTC

24 points

4 comments, 1 min read, EA link

Alignment for Animals

Jasmine Brazilek, 5 May 2026 16:00 UTC

15 points

0 comments, 5 min read, EA link

Releasing TakeOverBench.com: a benchmark, for AI takeover

Otto, 22 Jan 2026 16:38 UTC

23 points

9 comments, 1 min read, EA link

An Empirical Review of the Animal Harm Benchmark (ANIMA)

Lukas Gebhard, 1 Mar 2026 17:50 UTC

30 points

2 comments, 16 min read, EA link

Epoch AI’s top 10 Data Insights and Gradient Updates of 2025

Vasco Grilo🔸, 7 Jan 2026 17:30 UTC

25 points

0 comments, 5 min read, EA link

(epoch.ai)

AI benchmarking has a Y-axis problem

Lizka, 6 Feb 2026 7:45 UTC

74 points

2 comments, 7 min read, EA link

Launching the AI Forecasting Benchmark Series Q3 | $30k in Prizes

christian, 8 Jul 2024 17:20 UTC

17 points

0 comments, 1 min read, EA link

(www.metaculus.com)

FutureEval Forecasting Bot-Maker Survey: What Winners Did Differently

grace_mclain, 2 May 2026 12:55 UTC

7 points

0 comments, 1 min read, EA link

(www.metaculus.com)

A Benchmark for Measuring Honesty in AI Systems

Mantas Mazeika, 4 Mar 2025 17:44 UTC

29 points

0 comments, 2 min read, EA link

(www.mask-benchmark.ai)

Thoughts on Toby Ords AI Scaling Series

Srdjan Miletic, 4 Feb 2026 0:46 UTC

52 points

3 comments, 4 min read, EA link

(www.dissent.blog)

Fact Check: 57% of the internet is NOT AI-generated

James-Hartree, 17 Jan 2025 21:26 UTC

1 point

0 comments, 1 min read, EA link

Benchmark Performance is a Poor Measure of Generalisable AI Reasoning Capabilities

James Fodor, 21 Feb 2025 4:25 UTC

12 points

3 comments, 24 min read, EA link

MichaelA🔸, 9 Nov 2022 12:19 UTC
2 points
I’m not totally sure whether this should exist, and whether it should be called this.