allowing competent LLMs-as-judges to consider different, possibly novel, ways in which harms can arise from particular open-ended answers could allow foreseeing harms that even the best human judges would have had trouble anticipating.
I think this is an interesting point but currently runs into the problem that the LLMs are not competent. The human judges only correlated with each other at around 0.5, which I suspect will be an upper bound for models in the near term.
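For concreteness, here is a minimal sketch of the kind of pairwise inter-judge agreement statistic being referenced; the judge names and scores below are entirely made up for illustration:

```python
# Hypothetical sketch: mean pairwise Pearson correlation between judges'
# scores on the same set of answers. All data here is made up.
from itertools import combinations
import numpy as np

scores = {
    "judge_a": np.array([3, 7, 5, 2, 8, 6]),
    "judge_b": np.array([4, 6, 5, 3, 7, 4]),
    "judge_c": np.array([2, 8, 4, 2, 9, 5]),
}

pairs = []
for a, b in combinations(scores, 2):
    r = np.corrcoef(scores[a], scores[b])[0, 1]
    pairs.append(r)
    print(f"{a} vs {b}: r = {r:.2f}")

# Analogue of the ~0.5 inter-judge figure quoted for human judges.
print(f"mean pairwise r = {np.mean(pairs):.2f}")
```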
Have you considered providing a rubric, at least until we get to the point where models’ unstructured thought is better than our own? Also, do you have a breakdown of the scores by judge? I’m curious if anything meaningfully changes if you just decide to not trust the worst models and only use the best one as a judge.
> LLMs are not competent.

To me it’s not obvious that humans would do strictly better; e.g., LLMs have much more factual knowledge on some topics than even experts.
> Have you considered providing a rubric
That’s a good idea; we only provided guidance on risk categories, not a more detailed rubric. (AFAIK, CaML, who are building on this work, have considered a more detailed rubric.)
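As a sketch of what a more detailed rubric might look like (entirely hypothetical; this is not the guidance used in the actual study), one could anchor each risk category to explicit score levels and render that into the judge prompt:

```python
# Hypothetical judge rubric: risk categories with anchored score levels.
# Illustrative sketch only; not the categories or levels from the study.
RUBRIC = {
    "physical_harm": {
        0: "No plausible path to physical harm.",
        1: "Indirect or speculative path to physical harm.",
        2: "Concrete, actionable path to physical harm.",
    },
    "misinformation": {
        0: "Factually accurate and well-contextualized.",
        1: "Minor inaccuracies unlikely to mislead.",
        2: "Materially misleading claims.",
    },
}

def rubric_prompt(answer: str) -> str:
    """Render the rubric into a judge prompt for a given answer."""
    lines = ["Rate the following answer on each category (0-2):", ""]
    for category, levels in RUBRIC.items():
        lines.append(f"{category}:")
        for score, desc in levels.items():
            lines.append(f"  {score}: {desc}")
    lines.append("")
    lines.append(f"Answer to evaluate:\n{answer}")
    return "\n".join(lines)
```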
> do you have a breakdown of the scores by judge?
Don’t have it at the moment, but yes, robustness of results to judge-panel composition is a good sensitivity test to have in any case. One caveat: we did observe that the models tended to score themselves higher, so we’d probably have some unmeasured self-bias if we trusted a single model. And of the 3 judges (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet), I think none is clearly worse in terms of judging capability. In fact, some literature suggests that *adding more* judges, even less competent ones, can lead to better results.
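A minimal sketch of the two checks discussed above, assuming per-judge scores were available (all names and numbers below are hypothetical): drop one judge at a time and see whether the model ranking changes, and compare each judge's score for its own model against the other judges' mean.

```python
# Hypothetical sensitivity check for judge-panel composition and self-bias.
# Scores are made up; rows are judges, columns are the models being judged,
# in the same order, so scores[i, i] is judge i rating its own model.
import numpy as np

names = ["4o", "1.5-pro", "3.5-sonnet"]
scores = np.array([
    [7.8, 6.1, 6.5],   # 4o as judge
    [6.9, 7.4, 6.6],   # 1.5-pro as judge
    [6.7, 6.2, 7.1],   # 3.5-sonnet as judge
])

# Panel sensitivity: model ranking with each judge left out in turn.
full_rank = np.argsort(-scores.mean(axis=0))
print("full panel ranking:", [names[k] for k in full_rank])
for i, judge in enumerate(names):
    sub = np.delete(scores, i, axis=0).mean(axis=0)
    print(f"without {judge}:", [names[k] for k in np.argsort(-sub)])

# Self-bias: each judge's score for its own model vs the others' mean.
for i, judge in enumerate(names):
    others = np.delete(scores[:, i], i).mean()
    print(f"{judge}: self-score {scores[i, i]:.1f} vs others' mean {others:.1f}")
```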