Thanks for the response! A few quick responses:

> it says ‘less than 10 SAT exams’ in the training data in black and white

Good to know! That certainly changes my view of whether or not this will happen soon, but it also makes me think the resolution criteria are poor.
> I think funding, supporting, and popularising research into what ‘good’ benchmarks would be and creating a new test would be high impact work for the AI field—I’d love to see orgs look into this!

You might be interested in the recent OpenPhil RFP on benchmarks and forecasting.
> Perhaps the median community/AI-Safety researcher response was more measured.

People around me seemed to have a reasonably measured response.
I think we’ll probably get a pretty big update about the power of LLM scaling in the next 1-2 years with the release of GPT-5. Like, in the same way that GPT-3 and GPT-4 were each quite informative even for the relatively savvy.