AI benchmarks

Tag. Last edit: 2 Feb 2024 10:57 UTC by Toby Tremlett🔹

Benchmarks are tests that let us measure progress in AI capabilities and check for characteristics that might pose safety risks.

Further reading

The Benchmark Lottery

BASALT: A Benchmark for Learning from Human Feedback—AI Alignment Forum

Misaligned Powerseeking — SERI ML Alignment Theory Scholars Program | Summer 2022

[2110.06674] Truthful AI: Developing and governing AI that does not lie

Related entries

AI safety | standards and regulation

Trendlines in AIxBio evals

ljusten, 31 Oct 2024 0:09 UTC
39 points
2 comments, 11 min read
(www.lennijusten.com)

Open Phil releases RFPs on LLM Benchmarks and Forecasting

Lawrence Chan, 11 Nov 2023 3:01 UTC
12 points
0 comments, 1 min read
(www.openphilanthropy.org)

Announcing Epoch’s newly expanded Parameters, Compute and Data Trends in Machine Learning database

Robi Rahman, 25 Oct 2023 3:03 UTC
38 points
1 comment, 1 min read
(epochai.org)

$250K in Prizes: SafeBench Competition Announcement

Center for AI Safety, 3 Apr 2024 22:07 UTC
47 points
0 comments, 1 min read

XPT forecasts on (some) Direct Approach model inputs

Forecasting Research Institute, 20 Aug 2023 12:39 UTC
37 points
0 comments, 9 min read

Language models surprised us

Ajeya, 29 Aug 2023 21:18 UTC
59 points
10 comments, 5 min read

Long list of AI questions

NunoSempere, 6 Dec 2023 11:12 UTC
124 points
14 comments, 86 min read

Survey on the acceleration risks of our new RFPs to study LLM capabilities

Ajeya, 10 Nov 2023 23:59 UTC
38 points
1 comment, 8 min read

Results from an Adversarial Collaboration on AI Risk (FRI)

Forecasting Research Institute, 11 Mar 2024 15:54 UTC
193 points
25 comments, 9 min read
(forecastingresearch.org)

A compute-based framework for thinking about the future of AI

Matthew_Barnett, 31 May 2023 22:00 UTC
96 points
36 comments, 19 min read

AI Forecasting Research Ideas

Jaime Sevilla, 17 Nov 2022 17:37 UTC
78 points
1 comment, 1 min read
(docs.google.com)

Announcing Epoch’s dashboard of key trends and figures in Machine Learning

Jaime Sevilla, 13 Apr 2023 7:33 UTC
127 points
4 comments, 1 min read

Prizes for ML Safety Benchmark Ideas

Joshc, 28 Oct 2022 2:44 UTC
56 points
8 comments, 1 min read

A Benchmark for Measuring Honesty in AI Systems

Mantas Mazeika, 4 Mar 2025 17:44 UTC
23 points
0 comments, 2 min read
(www.mask-benchmark.ai)

Large Language Models Pass the Turing Test

Matrice Jacobine, 2 Apr 2025 5:41 UTC
11 points
6 comments, 1 min read
(arxiv.org)

Race to the Top: Benchmarks for AI Safety

isaduan, 4 Dec 2022 22:50 UTC
52 points
8 comments, 1 min read

Encultured AI, Part 1: Enabling New Benchmarks

Andrew Critch, 8 Aug 2022 22:49 UTC
17 points
0 comments, 6 min read

Announcing the AI Forecasting Benchmark Series | July 8, $120k in Prizes

christian, 19 Jun 2024 21:37 UTC
52 points
4 comments, 5 min read
(www.metaculus.com)

AI Benchmarks Series — Metaculus Questions on Evaluations of AI Models Against Technical Benchmarks

christian, 27 Mar 2024 23:05 UTC
10 points
0 comments, 1 min read
(www.metaculus.com)

Launching the AI Forecasting Benchmark Series Q3 | $30k in Prizes

christian, 8 Jul 2024 17:20 UTC
17 points
0 comments, 1 min read
(www.metaculus.com)

o3

Zach Stein-Perlman, 20 Dec 2024 21:00 UTC
84 points
5 comments, 1 min read

We are in a New Paradigm of AI Progress—OpenAI’s o3 model makes huge gains on the toughest AI benchmarks in the world

Garrison, 22 Dec 2024 21:45 UTC
26 points
0 comments, 4 min read
(garrisonlovely.substack.com)

Metaculus Q4 AI Benchmarking: Bots Are Closing The Gap

Molly Hickman, 19 Feb 2025 22:46 UTC
41 points
8 comments, 13 min read

Is AI Hitting a Wall or Moving Faster Than Ever?

Garrison, 9 Jan 2025 22:18 UTC
35 points
3 comments, 5 min read
(garrisonlovely.substack.com)

Predict 2025 AI capabilities (by Sunday)

Jonas V, 15 Jan 2025 0:16 UTC
16 points
0 comments, 1 min read

Benchmark Performance is a Poor Measure of Generalisable AI Reasoning Capabilities

James Fodor, 21 Feb 2025 4:25 UTC
12 points
3 comments, 24 min read

Fact Check: 57% of the internet is NOT AI-generated

James-Hartree-Law, 17 Jan 2025 21:26 UTC
1 point
0 comments, 1 min read