Wow, with tool use, pretty much every SOTA model from 6 months ago outperforms the public median forecast! I’d be curious to see how GPT-5, Sonnet 4.5, and Opus 4.1 do on this.