This is very cool work and I would love to see it get more attention, and potentially get it or something like it incorporated into future AI model safety reports.
A couple things:
Does your approach roughly mirror the assessment approaches that companies currently take for assessing potential harm to humans? I imagine that it will be easier to make the case for a benchmark like this if it’s as similar as possible to existing best-in-class benchmarks. (You may have done this.)
I would recommend improving the graphs significantly. When I look at model performance, safety assessments, etc., 90% of the “rapid takeaway” value is in the graphs and having them be extremely easy to understand and very well designed. I cannot, at a glance determine what the graphs above are about other than “here are some models with some scores higher than others”. (Is score good? Bad?) I’d recommend that the graph should communicate 95% of what you want people to know so it can be rapidly shared.
1. I’m not sure how closely at the technical level this resembles exactly what the companies do. We did base this on the standard Inspect Framework to be widely usable, and looked at other Inspect evals and benchmarks/datasets (e.g. HH-RLHF) for inspiration. When discussing at a high level with some people from the companies, this seemed like something resembling what they could use but again, I’m not sure about the more technical details
2. Thanks for the recommendation, makes sense. We did think about comms somewhat e.g. to convey intuition for someone skimming that “higher is better” in the paper (https://arxiv.org/pdf/2503.04804) we first present results with different species (Figure 2). Could probably use colours and other design elements to improve the presentation.
This is very cool work and I would love to see it get more attention, and potentially get it or something like it incorporated into future AI model safety reports.
A couple things:
Does your approach roughly mirror the assessment approaches that companies currently take for assessing potential harm to humans? I imagine that it will be easier to make the case for a benchmark like this if it’s as similar as possible to existing best-in-class benchmarks. (You may have done this.)
I would recommend improving the graphs significantly. When I look at model performance, safety assessments, etc., 90% of the “rapid takeaway” value is in the graphs and having them be extremely easy to understand and very well designed. I cannot, at a glance determine what the graphs above are about other than “here are some models with some scores higher than others”. (Is score good? Bad?) I’d recommend that the graph should communicate 95% of what you want people to know so it can be rapidly shared.
Thanks for your work! Very important stuff.
Thanks, and very good question+comment!
1. I’m not sure how closely at the technical level this resembles exactly what the companies do. We did base this on the standard Inspect Framework to be widely usable, and looked at other Inspect evals and benchmarks/datasets (e.g. HH-RLHF) for inspiration. When discussing at a high level with some people from the companies, this seemed like something resembling what they could use but again, I’m not sure about the more technical details
2. Thanks for the recommendation, makes sense. We did think about comms somewhat e.g. to convey intuition for someone skimming that “higher is better” in the paper (https://arxiv.org/pdf/2503.04804) we first present results with different species (Figure 2). Could probably use colours and other design elements to improve the presentation.