I only came across this paper in the last few days! (The post you link to is from 5th April; my article was first published 21st March.)
I want to see more commentary on the paper before deciding what to do about it. My current understanding:
o3-mini seems to be a lot worse than o3 – it only got ~10% on Frontier Math, similar to o1. (Claude Sonnet 3.7 only gets ~3%.)
So the results actually seem consistent with Frontier Math, except they didn’t test o3, which is significantly ahead of other models.
The other factor seems to be that they evaluated the quality of the proofs rather than the ability to get a correct numerical answer.
I’m not sure data leakage is a big part of the difference.
Apparently there’s a preprint showing Gemini 2.5 gets 20% on the Olympiad questions, which would be in line with the o3 result.