As far as possible, we have collected data from public leaderboards or system cards, and Gemini models do seem to be a bit underrepresented. I am not sure exactly why that is, but the fact that Anthropic releases more data is definitely part of it. For example, the most recent data points from CyBench come from the Claude (and Grok) system cards, and for the virology test there are data points in the Opus 4.5 system card but not for Gemini 3.0 Pro.
Thanks!