Just so people know what you’re referring to, this is Figure 4:
Ben West noted in the blog post:
We think these results help resolve the apparent contradiction between superhuman performance on many benchmarks and the common empirical observations that models do not seem to be robustly helpful in automating parts of people’s day-to-day work: the best current models—such as Claude 3.7 Sonnet—are capable of some tasks that take even expert humans hours, but can only reliably complete tasks of up to a few minutes long.