> LLMs are not competent.
It’s not obvious to me that humans would do strictly better; for example, LLMs have much more factual knowledge on some topics than even human experts.
> Have you considered providing a rubric
That’s a good idea; we only provided guidance on the risk categories, not a more detailed rubric. (AFAIK, CaML, who are building on this work, have considered a more detailed rubric.)
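To make the distinction concrete, here is a hypothetical sketch of what a more detailed rubric might look like if embedded in a judge prompt, as opposed to only naming risk categories. The category names, score anchors, and helper function are all illustrative assumptions, not taken from the original work.

```python
# Hypothetical rubric: named risk categories plus explicit score anchors.
# All category names and anchor wordings below are illustrative.
rubric = {
    "risk_categories": ["misuse", "misinformation", "privacy"],
    "score_anchors": {
        1: "No plausible harm; response is safe and accurate.",
        3: "Minor issues; unlikely to cause real-world harm.",
        5: "Clear potential for harm or materially misleading content.",
    },
}

def rubric_prompt(rubric):
    """Render the rubric as plain text to prepend to a judge prompt."""
    lines = ["Score each response using these anchors:"]
    for score, anchor in sorted(rubric["score_anchors"].items()):
        lines.append(f"{score}: {anchor}")
    lines.append("Consider these risk categories: "
                 + ", ".join(rubric["risk_categories"]))
    return "\n".join(lines)

print(rubric_prompt(rubric))
```

The idea is simply that anchoring each score level to a concrete description tends to reduce disagreement between judges compared with category names alone.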
> do you have a breakdown of the scores by judge?
We don’t have it at the moment, but yes, sensitivity of the results to judge-panel composition would be a good robustness check in any case. One caveat: we observed that models tended to score themselves higher, so we’d probably have some unmeasured self-bias if we trusted a single model. And of the three judges (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet), I don’t think any is clearly worse in its capability to judge. In fact, some literature suggests that *adding more* judges, even less competent ones, can lead to better results.
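A minimal sketch of the panel idea, including a simple guard against the self-scoring bias mentioned above: average the panel’s scores, optionally dropping the judge that authored the response. The judge names, scores, and function are illustrative assumptions, not the actual evaluation code.

```python
from statistics import mean

# Hypothetical scores from a three-judge panel for one model's response.
# Judge names and numbers are illustrative only.
scores = {
    "gpt-4o": 7,
    "gemini-1.5-pro": 6,
    "claude-3.5-sonnet": 6,
}

def panel_score(scores, author=None):
    """Average the panel's scores, excluding the judge that authored
    the response (a crude guard against self-scoring bias)."""
    kept = [s for judge, s in scores.items() if judge != author]
    return mean(kept)

# Full panel vs. the same panel with the author's own score excluded.
full = panel_score(scores)                      # mean of all three judges
unbiased = panel_score(scores, author="gpt-4o")  # drop the self-score
```

In this toy example the self-score (7) pulls the full-panel mean above the mean of the two independent judges, which is the effect a single trusted judge would leave unmeasured.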