Thanks for writing this up. I think it’s a really useful benchmark for tracking AI capabilities.
One minor piece of feedback: instead of reporting statistical significance in the summary, I’d report effect sizes, or maybe even better, just put the discrimination plots in the summary, since they give a very concrete and striking sense of the difference in performance. Statistical significance depends on how many datapoints you have, so a non-significant result is especially hard to interpret: it doesn’t tell you whether the real-world difference is small or whether the sample was just too small to detect it.
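To make the sample-size point concrete, here is a minimal sketch of my own (not taken from the post's analysis). It holds a hypothetical standardized effect size fixed at d = 0.2 and computes the two-sided p-value of a simple two-sample z-test at different sample sizes. The equal group sizes, the known unit variance, and the specific numbers are all assumptions chosen for illustration.

```python
import math


def two_sided_p_value(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test for a standardized mean difference.

    Assumes (for illustration only) equal group sizes and a known
    standard deviation of 1 in both groups.
    """
    # Standard error of the difference in means is sqrt(2 / n),
    # so the z statistic is d / sqrt(2 / n) = d * sqrt(n / 2).
    z = effect_size * math.sqrt(n_per_group / 2)
    # For a standard normal, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


effect_size = 0.2  # hypothetical effect size, identical in every row
for n in (20, 200, 2000):
    p = two_sided_p_value(effect_size, n)
    print(f"n per group = {n:5d}   d = {effect_size}   p = {p:.2e}")
```

The effect size is the same in every row, but the p-value is well above 0.05 at 20 datapoints per group and far below it at 2000, which is why I'd find the effect size (or the plots) more informative in a summary.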