Quick list of some ways benchmarks might be (accidentally) misleading[1]
Poor “construct validity”[2] (& systems that are optimized for the metric)
The connection between what the benchmark is measuring and what it’s trying to measure (or what people think it’s measuring) is broken. In particular:
Missing critical steps
When benchmarks are trying to evaluate progress on some broad capability (like “engineering” or “math ability” or “planning”), they’re often testing performance on specific, meaningfully simpler tasks. Performance on those tasks might be missing key aspects of true/real-world/relevant progress on that capability.
Besides the inherent difficulty of measuring the right thing, it’s important to keep in mind that systems might have been trained specifically to perform well on a given benchmark.
This is probably a bigger problem for benchmarks that have gotten more coverage.
And some benchmarks have been designed specifically to be challenging for existing leading models, which can make new/other AI systems appear to have made more progress on these capabilities (relative to the older models) than they actually have.
We’re seeing this with the “Humanity’s Last Exam” benchmark.
...and sometimes the apparent limitations are random or kinda fake, such that a minor improvement appears to lead to radical progress
Discontinuous metrics: (partial) progress on a given benchmark might be misleading.
The difficulty of the tasks/tests in a benchmark often varies significantly (often for good reason), and reporting of results might describe the benchmark in terms of its most difficult tests rather than the ones the model actually completed (see the toy example below).
I think this was an issue for FrontierMath, although I’m not sure how strongly to discount some of the results on that basis.
-> This (along with issues like 1b above, which can lead to saturation) is part of what makes it harder to extrapolate from e.g. plots of progress on certain benchmarks.
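To make the point about uneven difficulty concrete, here’s a minimal, hypothetical sketch (the tier names and all numbers are made up for illustration) of how a single headline score can hide the fact that a model is only solving the easiest slice of a benchmark:

```python
# Hypothetical per-tier results for a benchmark whose tasks vary a lot in difficulty.
# All numbers below are invented for illustration.
results = {
    "easy":   {"solved": 45, "total": 50},
    "medium": {"solved": 8,  "total": 30},
    "hard":   {"solved": 0,  "total": 20},
}

solved = sum(r["solved"] for r in results.values())
total = sum(r["total"] for r in results.values())
print(f"Headline score: {solved / total:.0%}")  # 53% -- sounds like substantial progress

for tier, r in results.items():
    print(f"  {tier:>6}: {r['solved'] / r['total']:.0%}")
# easy: 90%, medium: 27%, hard: 0% -- the hardest tasks, which are often the ones
# used to describe the benchmark, contribute nothing to the headline number.
```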
Noise & poisoning of the metric: even on the metric in question, data might have leaked into the training process, the measurement process itself can be affected by things like who’s running it, and comparing the performance of models that were tested in slightly different ways can be messy. Some specific issues here (I might try to add links later):
Brittleness to question phrasing
Data contamination (discussed e.g. here)
Differences in how the measurement/testing process was actually set up (including how much inference compute was used, how many tries a model was given, access to tools, etc.)
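As one concrete illustration of how the testing setup affects reported scores: the sketch below uses the standard unbiased pass@k estimator (from the Codex paper, Chen et al. 2021) to show how much the “same” result can change with the number of attempts allowed; the attempt/success counts are invented for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    sampled attempts succeeds, given c successes out of n attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up example: a task the model solves on 20 of 100 sampled attempts.
n, c = 100, 20
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(n, c, k):.2f}")
# pass@1 = 0.20, pass@5 = 0.68, pass@10 = 0.90 -- the same model can look very
# different depending on how many tries the evaluation setup allows.
```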
Misc
Selective reporting/measurement (AI companies want to report their successes)
Tasks that appear difficult (because they’re hard for humans) might be especially easy for AI systems (and vice versa), which can cause us to think that more (or less) progress is being made than is actually the case
E.g. protein folding is really hard for humans
This looks relevant, but I haven’t read it
Some benchmarks seem mostly irrelevant to what I care about
Systems are tested before post-training enhancements or other changes are applied
Additions are welcome! (Also, I couldn’t quickly find a list like this earlier, but I’d be surprised if a better version than what I have above wasn’t available somewhere; I’d love recommendations.)
Open Phil’s announcement of their now-closed benchmarking RFP has some useful notes on this, particularly the section on “what makes for a strong benchmark.” I also appreciated METR’s list of desiderata here.
To be clear: I’m not trying to say anything on ways benchmarks might be useful/harmful here. And I’m really not an expert.
This paper looks relevant but I haven’t read it.