It’s very difficult to do this with benchmarks, because as the models improve benchmarks come and go. Things that used to be so hard that models couldn’t do better than chance quickly become saturated, and we look for the next thing, then the one after that, and so on. For me, the fact that GPT-4 → GPT-4.5 seemed to involve climbing about half of one benchmark was slower progress than I expected (and the leaks from OpenAI suggest they had similar views to mine). When GPT-3.5 was replaced by GPT-4, people were losing their minds about it — both internally and on launch day. Entirely new benchmarks were needed to deal with what it could do. I didn’t see any of that for GPT-4.5.
I agree with you that the evidence is subjective and disputable. But I don’t think this is a case where the burden of proof falls disproportionately on those saying it was a smaller jump than the previous ones.
(Also, note that this doesn’t have much to do with the actual scaling laws, which measure how much the prediction error on the next token goes down when you 10x the training compute. I don’t have reason to think that has gone off trend. What I’m saying is that the real-world gains from this (or the intuitive measure of intelligence) have diminished compared to the previous few 10x jumps. These are definitely compatible: e.g. if a model were trained only on Wikipedia plus an unending supply of nursery rhymes, its prediction error would continue to drop as more training happened, but its real-world capabilities wouldn’t keep improving with continued 10x jumps in the number of nursery rhymes added in. I think the real world is like this: GPT-4-level systems are already trained on most books ever written and much of the recorded knowledge of the last 10,000 years of civilisation, and it makes sense that adding more Reddit comments wouldn’t move the needle much.)
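(For concreteness, the empirical regularity I mean by “scaling laws” is roughly the power law in training compute reported in Kaplan et al. (2020); the exact symbols below are just for illustration, not something my argument depends on:

$$L(C) \;\approx\; \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $L$ is next-token prediction loss, $C$ is training compute, and $C_c$, $\alpha_C$ are fitted constants. Every 10x in $C$ multiplies the loss by the same factor, $10^{-\alpha_C}$, so the curve can stay exactly on trend even while each successive loss reduction buys less real-world capability — which is the nursery-rhyme point above.)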
That’s a very nice and clear idea — I think you’re right that working on making mission-critical, but illegible, problems legible is robustly high value.