AI benchmarks

Tag | Last edit: Feb 2, 2024, 10:57 AM by Toby Tremlett🔹

Benchmarks are tests that let us measure progress in AI capabilities and check for characteristics that might pose safety risks.

Further reading

The Benchmark Lottery

BASALT: A Benchmark for Learning from Human Feedback—AI Alignment Forum

Misaligned Powerseeking — SERI ML Alignment Theory Scholars Program | Summer 2022

[2110.06674] Truthful AI: Developing and governing AI that does not lie

Related entries

AI safety | standards and regulation

Trendlines in AIxBio evals

ljusten | Oct 31, 2024, 12:09 AM
39 points
2 comments | 11 min read | EA link
(www.lennijusten.com)

Open Phil releases RFPs on LLM Benchmarks and Forecasting

Lawrence Chan | Nov 11, 2023, 3:01 AM
12 points
0 comments | 1 min read | EA link
(www.openphilanthropy.org)

AI Forecasting Research Ideas

Jaime Sevilla | Nov 17, 2022, 5:37 PM
78 points
1 comment | 1 min read | EA link
(docs.google.com)

Prizes for ML Safety Benchmark Ideas

Joshc | Oct 28, 2022, 2:44 AM
56 points
8 comments | 1 min read | EA link

Announcing Epoch’s newly expanded Parameters, Compute and Data Trends in Machine Learning database

Robi Rahman | Oct 25, 2023, 3:03 AM
38 points
1 comment | 1 min read | EA link
(epochai.org)

$250K in Prizes: SafeBench Competition Announcement

Center for AI Safety | Apr 3, 2024, 10:07 PM
47 points
0 comments | 1 min read | EA link

XPT forecasts on (some) Direct Approach model inputs

Forecasting Research Institute | Aug 20, 2023, 12:39 PM
37 points
0 comments | 9 min read | EA link

Language models surprised us

Ajeya | Aug 29, 2023, 9:18 PM
59 points
10 comments | 5 min read | EA link

Long list of AI questions

NunoSempere | Dec 6, 2023, 11:12 AM
124 points
14 comments | 86 min read | EA link

Survey on the acceleration risks of our new RFPs to study LLM capabilities

Ajeya | Nov 10, 2023, 11:59 PM
38 points
1 comment | 8 min read | EA link

A compute-based framework for thinking about the future of AI

Matthew_Barnett | May 31, 2023, 10:00 PM
96 points
36 comments | 19 min read | EA link

Results from an Adversarial Collaboration on AI Risk (FRI)

Forecasting Research Institute | Mar 11, 2024, 3:54 PM
193 points
25 comments | 9 min read | EA link
(forecastingresearch.org)

Announcing Epoch’s dashboard of key trends and figures in Machine Learning

Jaime Sevilla | Apr 13, 2023, 7:33 AM
127 points
4 comments | 1 min read | EA link

Large Language Models Pass the Turing Test

Matrice Jacobine | Apr 2, 2025, 5:41 AM
11 points
6 comments | 1 min read | EA link
(arxiv.org)

A Benchmark for Measuring Honesty in AI Systems

Mantas Mazeika | Mar 4, 2025, 5:44 PM
22 points
0 comments | 2 min read | EA link
(www.mask-benchmark.ai)

Race to the Top: Benchmarks for AI Safety

isaduan | Dec 4, 2022, 10:50 PM
52 points
8 comments | 1 min read | EA link

Encultured AI, Part 1: Enabling New Benchmarks

Andrew Critch | Aug 8, 2022, 10:49 PM
17 points
0 comments | 6 min read | EA link

Announcing the AI Forecasting Benchmark Series | July 8, $120k in Prizes

christian | Jun 19, 2024, 9:37 PM
52 points
4 comments | 5 min read | EA link
(www.metaculus.com)

AI Benchmarks Series - Metaculus Questions on Evaluations of AI Models Against Technical Benchmarks

christian | Mar 27, 2024, 11:05 PM
10 points
0 comments | 1 min read | EA link
(www.metaculus.com)

Launching the AI Forecasting Benchmark Series Q3 | $30k in Prizes

christian | Jul 8, 2024, 5:20 PM
17 points
0 comments | 1 min read | EA link
(www.metaculus.com)

o3

Zach Stein-Perlman | Dec 20, 2024, 9:00 PM
84 points
5 comments | 1 min read | EA link

We are in a New Paradigm of AI Progress - OpenAI’s o3 model makes huge gains on the toughest AI benchmarks in the world

Garrison | Dec 22, 2024, 9:45 PM
26 points
0 comments | 4 min read | EA link
(garrisonlovely.substack.com)

Metaculus Q4 AI Benchmarking: Bots Are Closing The Gap

Molly Hickman | Feb 19, 2025, 10:46 PM
41 points
8 comments | 13 min read | EA link

Is AI Hitting a Wall or Moving Faster Than Ever?

Garrison | Jan 9, 2025, 10:18 PM
35 points
3 comments | 5 min read | EA link
(garrisonlovely.substack.com)

Predict 2025 AI capabilities (by Sunday)

Jonas V | Jan 15, 2025, 12:16 AM
16 points
0 comments | 1 min read | EA link

Benchmark Performance is a Poor Measure of Generalisable AI Reasoning Capabilities

James Fodor | Feb 21, 2025, 4:25 AM
12 points
3 comments | 24 min read | EA link

Fact Check: 57% of the internet is NOT AI-generated

James-Hartree-Law | Jan 17, 2025, 9:26 PM
1 point
0 comments | 1 min read | EA link