Most college-educated adults would get well under half of these problems right (the authors used computer science undergraduates as human subjects, and their performance ranged from 40% to 90%).
I think the hardness of the MATH benchmark was somewhat exaggerated. I downloaded the dataset myself and took a look, and came to the conclusion that many—perhaps most—of the questions are simple plug-and-chug problems. The reported performance of 40-90% among students may have been a result of time constraints rather than pure difficulty. In the paper, they wrote:
“To provide a rough but informative comparison to human-level performance, we randomly sampled 20 problems from the MATH test set and gave them to humans. We artificially require that the participants have 1 hour to work on the problems and must perform calculations by hand.”