“In response, Epoch AI created Frontier Math — a benchmark of insanely hard mathematical problems. The easiest 25% are similar to Olympiad-level problems. The most difficult 25% are, according to Fields Medalist Terence Tao, “extremely challenging,” and would typically need an expert in that branch of mathematics to solve them.
Previous models, including GPT-o1, could hardly solve any of these questions.[20] In December 2024, OpenAI claimed that GPT-o3 could solve 25%.”
I think if you’re going to mention the seemingly strong performance of GPT-o3 on Frontier Math, it’s worth pointing out the extremely poor performance of all LLMs when they were given Math Olympiad questions more recently, though they did use o3-mini rather than o3, so I guess it’s not a direct comparison: https://garymarcus.substack.com/p/reports-of-llms-mastering-math-have
“The USA Math Olympiad is an extremely challenging math competition for the top US high school students; the top scorers get prizes and an invitation to the International Math Olympiad. The USAMO was held this year March 19-20. Hours after it was completed, so there could be virtually no chance of data leakage, a team of scientists gave the problems to some of the top large language models, whose mathematical and reasoning abilities have been loudly proclaimed: o3-Mini, o1-Pro, DeepSeek R1, QwQ-32B, Gemini-2.0-Flash-Thinking-Exp, and Claude-3.7-Sonnet-Thinking. The proofs output by all these models were evaluated by experts. The results were dismal: None of the AIs scored higher than 5% overall.”
I only came across this paper in the last few days! (The post you link to is from 5th April; my article was first published 21st March.)
I want to see more commentary on the paper before deciding what to do about it. My current understanding:
o3-mini seems to be a lot worse than o3 – it only got ~10% on Frontier Math, similar to o1. (Claude Sonnet 3.7 only gets ~3%.)
So the results actually seem consistent with Frontier Math, except they didn’t test o3, which is significantly ahead of other models.
The other factor seems to be that they evaluated the quality of the proofs rather than the ability to get a correct numerical answer.
I’m not sure data leakage is a big part of the difference.
Apparently there’s a preprint showing Gemini 2.5 gets 20% on the Olympiad questions, which would be in line with the o3 result.