Wait, that Humanity’s Last Exam plot is super misleading, right? The other models did not have access to the internet, but Deep Research does.
I agree it is somewhat misleading, but I feel like using the internet is itself a highly useful skill in the modern world and insofar as the other models couldn’t do it, that is too bad for them.
Pretty sure o1 and Gemini have access to the internet.
The main way it’s potentially misleading is that it’s not a log plot (most benchmark results will look like exponentials on a linear scale) – however, I expect Deep Research would still seem above trend even if it was. I also think it’s helpful to new readers to see some of the charts on linear scales, since in some ways it’s more intuitive.
While you can use o1 and Gemini with internet access, I think they were almost certainly evaluated without such access (see the original paper here).
I really, really do not think you should put the plot there. It’s like comparing two students’ performance when only one of them has access to the internet. I think it’s extremely misleading. If you want to illustrate progress, you could just use the FrontierMath/GPQA results or even ARC-AGI.
Thanks, this is helpful.
(Just adding the FrontierMath/GPQA and ARC-AGI charts you mentioned for my own benefit, and others)