AI benchmarks

Last edit: 2 Feb 2024 10:57 UTC by Toby Tremlett🔹

Benchmarks are standardized tests that let us measure progress in AI capabilities and check for characteristics that might pose safety risks.
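
To make this concrete, the sketch below shows the basic shape of a capability benchmark: a fixed set of test items, a model under test, and an aggregate score. It is a minimal illustration only; the question set and the model() stub are hypothetical placeholders, not a real benchmark or a real model API.

```python
# Minimal sketch of a benchmark evaluation loop (hypothetical data,
# stubbed model; real harnesses follow the same pattern at scale).

# A "benchmark" here is just a fixed list of (question, expected answer) pairs.
BENCHMARK = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

def model(prompt: str) -> str:
    """Stand-in for a call to the system under test (e.g. an LLM API)."""
    canned = {"What is 2 + 2?": "4"}
    return canned.get(prompt, "I don't know")

def evaluate(items: list[tuple[str, str]]) -> float:
    """Score the model: the fraction of items answered exactly right."""
    correct = sum(model(question).strip() == answer for question, answer in items)
    return correct / len(items)

if __name__ == "__main__":
    print(f"Accuracy: {evaluate(BENCHMARK):.0%}")  # this stub scores 50%
```

Real benchmarks differ mainly in scale and in scoring method (multiple choice, pass@k for code, rubric grading), but the evaluate-and-aggregate loop above is the common core.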

Further reading

The Benchmark Lottery

BASALT: A Benchmark for Learning from Human Feedback (AI Alignment Forum)

Misaligned Powerseeking (SERI ML Alignment Theory Scholars Program, Summer 2022)

Truthful AI: Developing and governing AI that does not lie (arXiv:2110.06674)

Related entries

AI safety | Standards and regulation

VSPE vs. flattery: Testing emotional scaffolding for early-stage alignment
Astelle Kay · 24 Jun 2025 9:39 UTC · 2 points · 1 comment · 1 min read

Trendlines in AIxBio evals
ljusten · 31 Oct 2024 0:09 UTC · 40 points · 2 comments · 11 min read · (www.lennijusten.com)

Open Phil releases RFPs on LLM Benchmarks and Forecasting
Lawrence Chan · 11 Nov 2023 3:01 UTC · 12 points · 0 comments · 2 min read · (www.openphilanthropy.org)

XPT forecasts on (some) Direct Approach model inputs
Forecasting Research Institute · 20 Aug 2023 12:39 UTC · 37 points · 0 comments · 9 min read

Announcing Epoch’s newly expanded Parameters, Compute and Data Trends in Machine Learning database
Robi Rahman🔸 · 25 Oct 2023 3:03 UTC · 38 points · 1 comment · 1 min read · (epochai.org)

Language models surprised us
Ajeya · 29 Aug 2023 21:18 UTC · 59 points · 10 comments · 5 min read

Prizes for ML Safety Benchmark Ideas
Joshc · 28 Oct 2022 2:44 UTC · 56 points · 8 comments · 1 min read

Results from an Adversarial Collaboration on AI Risk (FRI)
Forecasting Research Institute · 11 Mar 2024 15:54 UTC · 196 points · 25 comments · 9 min read · (forecastingresearch.org)

AI Forecasting Research Ideas
Jaime Sevilla · 17 Nov 2022 17:37 UTC · 78 points · 1 comment · 1 min read · (docs.google.com)

$250K in Prizes: SafeBench Competition Announcement
Center for AI Safety · 3 Apr 2024 22:07 UTC · 47 points · 0 comments · 1 min read

Long list of AI questions
NunoSempere · 6 Dec 2023 11:12 UTC · 124 points · 16 comments · 86 min read

Survey on the acceleration risks of our new RFPs to study LLM capabilities
Ajeya · 10 Nov 2023 23:59 UTC · 38 points · 1 comment · 8 min read

Announcing Epoch’s dashboard of key trends and figures in Machine Learning
Jaime Sevilla · 13 Apr 2023 7:33 UTC · 127 points · 4 comments · 1 min read · (epochai.org)

A compute-based framework for thinking about the future of AI
Matthew_Barnett · 31 May 2023 22:00 UTC · 96 points · 36 comments · 19 min read

AI Benchmarks Series — Metaculus Questions on Evaluations of AI Models Against Technical Benchmarks
christian · 27 Mar 2024 23:05 UTC · 10 points · 0 comments · 1 min read · (www.metaculus.com)

Metaculus Q4 AI Benchmarking: Bots Are Closing The Gap
Molly Hickman · 19 Feb 2025 22:46 UTC · 42 points · 8 comments · 13 min read

Q2 AI Benchmark Results: Pros Maintain Clear Lead
Benjamin Wilson 🔸 · 28 Oct 2025 5:13 UTC · 46 points · 0 comments · 24 min read · (www.metaculus.com)

Where’s my ten minute AGI?
Vasco Grilo🔸 · 19 May 2025 17:45 UTC · 47 points · 6 comments · 7 min read · (epoch.ai)

(Linkpost) METR: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
Yadav · 11 Jul 2025 8:58 UTC · 37 points · 2 comments · 2 min read · (metr.org)

Decentralizing Model Evaluation: Lessons from AI4Math
SMalagon · 5 Jun 2025 18:57 UTC · 23 points · 1 comment · 4 min read

Encultured AI, Part 1: Enabling New Benchmarks
Andrew Critch · 8 Aug 2022 22:49 UTC · 17 points · 0 comments · 6 min read

Benchmark Scores = General Capability + Claudiness
Vasco Grilo🔸 · 25 Nov 2025 17:58 UTC · 19 points · 0 comments · 4 min read · (epochai.substack.com)

AISN #61: OpenAI Releases GPT-5
Center for AI Safety · 12 Aug 2025 17:52 UTC · 6 points · 0 comments · 4 min read · (newsletter.safe.ai)

AISN #65: Measuring Automation and Superintelligence Moratorium Letter
Center for AI Safety · 29 Oct 2025 16:08 UTC · 8 points · 0 comments · 3 min read · (newsletter.safe.ai)

Testing Human Flow in Political Dialogue: A New Benchmark for Emotionally Aligned AI
DongHun Lee · 30 May 2025 4:37 UTC · 1 point · 0 comments · 1 min read

Impact of Quantization on Small Language Models (SLMs) for Multilingual Mathematical Reasoning Tasks
Angie Paola Giraldo · 7 May 2025 21:48 UTC · 11 points · 0 comments · 14 min read

Why I am Still Skeptical about AGI by 2030
James Fodor · 2 May 2025 7:13 UTC · 134 points · 15 comments · 6 min read

o3
Zach Stein-Perlman · 20 Dec 2024 21:00 UTC · 84 points · 9 comments · 1 min read

Announcing the AI Forecasting Benchmark Series | July 8, $120k in Prizes
christian · 19 Jun 2024 21:37 UTC · 52 points · 4 comments · 5 min read · (www.metaculus.com)

Automated Evaluation of LLMs for Math Benchmark.
CisnerosA · 30 Oct 2025 20:28 UTC · 3 points · 0 comments · 5 min read

Race to the Top: Benchmarks for AI Safety
isaduan · 4 Dec 2022 22:50 UTC · 52 points · 8 comments · 1 min read

Benchmarking Emotional Alignment: Can VSPE Reduce Flattery in LLMs?
Astelle Kay · 4 Aug 2025 3:36 UTC · 2 points · 0 comments · 3 min read

Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Matrice Jacobine🔸🏳️‍⚧️ · 12 May 2025 15:20 UTC · 14 points · 1 comment · 1 min read · (www.arxiv.org)

Predict 2025 AI capabilities (by Sunday)
Jonas_ · 15 Jan 2025 0:16 UTC · 16 points · 0 comments · 1 min read

AISN #53: An Open Letter Attempts to Block OpenAI Restructuring
Center for AI Safety · 29 Apr 2025 15:56 UTC · 6 points · 0 comments · 4 min read · (newsletter.safe.ai)

AIs Are Expert-Level at Many Virology Skills
Center for AI Safety · 2 May 2025 16:07 UTC · 22 points · 0 comments · 1 min read

From Therapy Tool to Alignment Puzzle-Piece: Introducing the VSPE Framework
Astelle Kay · 18 Jun 2025 14:47 UTC · 6 points · 1 comment · 2 min read

Road to AnimalHarmBench
Artūrs Kaņepājs · 1 Jul 2025 13:37 UTC · 137 points · 11 comments · 7 min read

AGI by 2032 is extremely unlikely
Yarrow Bouchard 🔸 · 16 Oct 2025 22:50 UTC · 24 points · 44 comments · 7 min read

Absolute Zero: AlphaZero for LLM
alapmi · 12 May 2025 14:54 UTC · 2 points · 0 comments · 1 min read

Performance of Large Language Models (LLMs) in Complex Analysis: A Benchmark of Mathematical Competence and its Role in Decision Making.
Jaime Esteban Montenegro Barón · 6 May 2025 21:08 UTC · 1 point · 0 comments · 23 min read

We are in a New Paradigm of AI Progress—OpenAI’s o3 model makes huge gains on the toughest AI benchmarks in the world
Garrison · 22 Dec 2024 21:45 UTC · 26 points · 0 comments · 4 min read · (garrisonlovely.substack.com)

Three Weeks In: What GPT-5 Still Gets Wrong
JAM · 27 Aug 2025 14:43 UTC · 2 points · 0 comments · 3 min read

Large Language Models Pass the Turing Test
Matrice Jacobine🔸🏳️‍⚧️ · 2 Apr 2025 5:41 UTC · 11 points · 6 comments · 1 min read · (arxiv.org)

Is AI Hitting a Wall or Moving Faster Than Ever?
Garrison · 9 Jan 2025 22:18 UTC · 35 points · 5 comments · 5 min read · (garrisonlovely.substack.com)

AnimalHarmBench 2.0: Evaluating LLMs on reasoning about animal welfare
Sentient Futures · 5 Nov 2025 1:13 UTC · 43 points · 4 comments · 6 min read

The Khayali Protocol
khayali · 2 Jun 2025 14:40 UTC · −8 points · 0 comments · 3 min read

OpenAI’s o3 model scores 3% on the ARC-AGI-2 benchmark, compared to 60% for the average human
Yarrow Bouchard 🔸 · 1 May 2025 13:57 UTC · 14 points · 8 comments · 3 min read · (arcprize.org)

Launching the AI Forecasting Benchmark Series Q3 | $30k in Prizes
christian · 8 Jul 2024 17:20 UTC · 17 points · 0 comments · 1 min read · (www.metaculus.com)

A Benchmark for Measuring Honesty in AI Systems
Mantas Mazeika · 4 Mar 2025 17:44 UTC · 29 points · 0 comments · 2 min read · (www.mask-benchmark.ai)

Fact Check: 57% of the internet is NOT AI-generated
James-Hartree-Law · 17 Jan 2025 21:26 UTC · 1 point · 0 comments · 1 min read

Benchmark Performance is a Poor Measure of Generalisable AI Reasoning Capabilities
James Fodor · 21 Feb 2025 4:25 UTC · 12 points · 3 comments · 24 min read