Benchmarks are standardized tests that let us measure the progress of AI capabilities and check for characteristics that might pose safety risks.
Further reading
"BASALT: A Benchmark for Learning from Human Feedback" (AI Alignment Forum)
"Misaligned Powerseeking" (SERI ML Alignment Theory Scholars Program, Summer 2022)
"Truthful AI: Developing and governing AI that does not lie" (arXiv:2110.06674)