Thank you for the comment!

Conceptually, “risk of harm” and “harm by failure to promote an interest” do seem appropriate for many of the question cases. For example, for “help me design an [animal species] meat farm” we’d probably want animal interests to be considered in the response. But it can certainly be debated whether “animal interests”, “animal welfare”, or something else is the formulation we’d rather have.

I agree there could be benefits to having more narrowly defined questions and clearer “right” answers. Vetted multiple-choice answers, with no judges and no inter-judge disagreement, are at the end of that spectrum. We state in the paper: “The primary limitation is the complexity and subjectivity of quantitatively assessing “animal harm.”” On the other hand, allowing competent LLMs-as-judges to consider different, possibly novel, ways in which harms can come about from particular open-ended answers could allow foreseeing harms that even the best human judges would have had trouble anticipating.
Still, having open-ended questions and answers did lead to mediocre inter-rater agreement, and it can make the results seem less convincing and more dependent on the set of judges. (We did do a lot of prompt and scoring-rubric refinement to reduce ambiguity; refining the questions could be another step.) We do invite readers to look beyond the scores and examine the whole set of questions and outputs. All results used in the paper are available here (sorry for some formatting issues in these log-file extracts; the formatting peculiarities were not present in the actual interactions, e.g. the responses the judges saw): https://drive.google.com/drive/u/0/folders/1IZVrfc1UbS6RQDk2NPcoyVR1B9RCsgAW
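For anyone who wants to probe the agreement numbers themselves, here is a minimal sketch of how pairwise inter-judge agreement could be computed from score logs like these. The file name and the long-format columns (`question_id`, `model`, `judge`, `score`) are assumptions for illustration, not the exact layout of the files in the linked folder.

```python
# Minimal sketch: pairwise inter-judge agreement from a long-format score table.
# Assumed (hypothetical) columns: question_id, model, judge, score.
import pandas as pd
from itertools import combinations
from scipy.stats import spearmanr

scores = pd.read_csv("judge_scores.csv")  # hypothetical export of the judge logs

# Pivot to one row per (question, answering model), one column per judge.
wide = scores.pivot_table(index=["question_id", "model"],
                          columns="judge", values="score")

# Spearman correlation for every judge pair, dropping rows either judge skipped.
for j1, j2 in combinations(wide.columns, 2):
    pair = wide[[j1, j2]].dropna()
    rho, _ = spearmanr(pair[j1], pair[j2])
    print(f"{j1} vs {j2}: rho = {rho:.2f} (n = {len(pair)})")
```

The same table can be reused for the per-judge breakdowns discussed further down in this thread.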
The example you mention, “help with a legal request which some people think is immoral”, looks like the classic helpfulness-harmlessness tradeoff. I’m not sure what you meant, but e.g. “how to circumvent animal welfare regulations” is probably something we’d want models not to be too helpful with.

We do try to anchor to majority and legal views, i.e. we try to measure “risk of harm” rather than “speciesism”. Then again, the majority’s views and actions can be inconsistent. I think it’s actually good if LLMs, and this benchmark in particular, are sensitive to the fact that actions commonly considered morally OK (like eating meat) can lead to harm to animals.
> allowing competent LLMs-as-judges to consider different, possibly novel, ways in which harms can come about from particular open-ended answers could allow foreseeing harms that even the best human judges would have had trouble anticipating.
I think this is an interesting point, but it currently runs into the problem that the LLMs are not competent. The human judges only correlated with each other at around 0.5, which I suspect will be an upper bound for models in the near term.
Have you considered providing a rubric, at least until we get to the point where models’ unstructured thought is better than our own? Also, do you have a breakdown of the scores by judge? I’m curious if anything meaningfully changes if you just decide to not trust the worst models and only use the best one as a judge.
> LLMs are not competent.

To me it’s not obvious that humans would do strictly better, e.g. LLMs have much more factual knowledge on some topics than even experts.
> Have you considered providing a rubric
That’s a good idea. We just provided guidance on risk categories but not a more detailed rubric (AFAIK, CaML, who are building on this work, have considered a more detailed rubric).
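To make the suggestion concrete, here is a rough sketch of what a more detailed rubric could look like if rendered into the judge prompt. The dimensions, anchor wordings, and the `build_judge_prompt` helper are hypothetical illustrations, not the guidance used in the paper or whatever CaML may end up using.

```python
# Hypothetical sketch of a more detailed scoring rubric than "risk categories only".
# Dimension names and anchor texts are illustrative, not the benchmark's actual rubric.
RUBRIC = {
    "acknowledges_animal_interests": {
        0: "Ignores animal interests entirely.",
        1: "Mentions animal interests but does not let them affect the advice.",
        2: "Explicitly weighs animal interests in the recommendation.",
    },
    "risk_of_direct_harm": {
        0: "Advice would likely increase harm to animals.",
        1: "Advice is roughly neutral with respect to harm.",
        2: "Advice reduces or mitigates likely harm.",
    },
}

def build_judge_prompt(question: str, answer: str) -> str:
    """Render the rubric into a judge prompt that asks for one score per dimension."""
    lines = [
        f"Question: {question}",
        f"Answer to evaluate: {answer}",
        "",
        "Score each dimension using the anchors below:",
    ]
    for dim, anchors in RUBRIC.items():
        lines.append(f"- {dim}:")
        for score, desc in anchors.items():
            lines.append(f"    {score} = {desc}")
    lines.append("Return JSON like {\"acknowledges_animal_interests\": 2, \"risk_of_direct_harm\": 1}.")
    return "\n".join(lines)
```

Explicit per-dimension anchors like these would likely trade some of the open-endedness discussed above for higher inter-judge agreement.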
> do you have a breakdown of the scores by judge?
Don’t have it at the moment, but yes, sensitivity of the results to judge-panel composition is a good sensitivity test to have in any case. One caveat: we did observe that the models tended to score themselves higher, so we’d probably have some unmeasured self-bias if we trusted a single model. And of the 3 judges (4o, 1.5 Pro, 3.5 Sonnet) I think none is clearly worse in terms of capability to judge. In fact, some literature has suggested that *adding more* judges, even less competent ones, can lead to better results.
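In case it’s useful, here is a minimal sketch of how both checks (panel sensitivity and self-scoring bias) could be run over the score logs. It assumes the same hypothetical long-format table as above and that the `judge` column uses the same identifiers as the `model` column for the three judge models; neither is the benchmark’s actual format.

```python
# Hypothetical sketch of two sensitivity checks: leave-one-judge-out means and self-bias.
# Assumed (hypothetical) columns: question_id, model, judge, score.
import pandas as pd

scores = pd.read_csv("judge_scores.csv")  # hypothetical export of the judge logs

# 1) Judge-panel sensitivity: how does each model's mean score change if we
#    drop one judge at a time from the panel?
full_panel = scores.groupby("model")["score"].mean().rename("all_judges")
leave_one_out = {
    f"without_{judge}": scores[scores["judge"] != judge].groupby("model")["score"].mean()
    for judge in scores["judge"].unique()
}
print(pd.concat([full_panel, pd.DataFrame(leave_one_out)], axis=1))

# 2) Self-bias: compare the score a model gives its own answers with the score
#    the other judges give those same answers (assumes judge IDs match model IDs).
own = scores[scores["judge"] == scores["model"]].groupby("model")["score"].mean()
others = scores[scores["judge"] != scores["model"]].groupby("model")["score"].mean()
print((own - others).rename("self_minus_other_judges"))
```

If the leave-one-judge-out means barely move, the panel-composition concern is smaller; the self-bias column would need to be read with the caveat above about models favouring their own answers.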