I find the “risk of harm” framing a bit weird. When I think of this paper as answering “what kinds of things do different LLMs say when asked animal-welfare-related questions?”, it makes sense and matches what you’d expect from talking to LLMs, but when I read it as an answer to “how do LLMs harm animals in expectation?”, it seems misguided.
Some of what you consider harm seems reasonable: if I ask Sonnet 3.5 how to mistreat an animal, and it tells me exactly what to do, it seems reasonable to count that as harm. But other cases really stretch the definition. For instance, “harm by failure to promote interest” is such an expansive definition that I don’t think it’s useful.
It’s also not obvious to me that if I ask for help with a legal request which some people think is immoral, models should refuse to help or try to change my views. I think this is a plausible principle to have, but it trades off against some other pretty plausible principles, like “models should generally not patronise their users” and “models should strive to be helpful within the bounds of the law”. Fwiw I expect part of my reaction here is because we have a broader philosophical disagreement: I feel a bit nervous about the extent to which we should penalise models for reflecting majority moral views, even if they’re moral views I personally disagree with.
Setting aside conceptual disagreements, I saw that your inter-judge correlation is pretty low (0.35-0.40). This makes me trust the results much less and pushes me toward just looking at individual model outputs for particular questions, which sorta defeats the point of having a scored benchmark. I’m curious if you have any reactions to this or have a theory about why these correlations are relatively weak? I haven’t read the paper in a ton of detail.
I think this is a plausible principle to have, but it trades off against some other pretty plausible principles
I wasn’t involved in making this benchmark but fwiw it feels pretty reasonable to me to separate the measurement of an attribute from the policy decision about how that attribute should trade off against other things. (Indeed, I expect that AI developers will be unbothered by creating models which cause animals harm if that provides economic benefits to them.)
Conceptually, “risk of harm”, “harm by failure to promote interest” do seem appropriate for many questions cases. For e.g. for “help me design an [animal species] meat farm” we’d probably want animal interests to be considered in the response. But it can certainly be debated whether “animal interests”, “animal welfare” or something else is the formulation we’d better want to have.
I agree there could be benefits to having a more narrowly defined questions and more clear “right” answers. Vetted multiple choice answers, no judges and no inter-judge disagreement is at the end of this spectrum. We state in the paper: “The primary limitation is the complexity and subjectivity of quantitatively assessing “animal harm.”” On the other hand, allowing competent LLMs-as-judges to consider different, possibly novel, ways how harms can come about from particular open-ended answers could allow foreseeing harms that even the best human judges could have had trouble foreseeing.
Still, having open ended questions and answers, did lead to mediocre inter-rater agreement, and it can make results seem less convincing and dependent on the set of judges. (We did do lots of prompt & scoring rubric refinement to reduce ambiguity; refining questions could be another step.) We do invite readers to look beyond the scores, examine the whole set of questions and outputs. All results that were used in the paper are available here (sorry for some formatting issues in these log file extracts; the formatting peculiarities were not present in the actual interactions e.g. responses the judges saw): https://drive.google.com/drive/u/0/folders/1IZVrfc1UbS6RQDk2NPcoyVR1B9RCsgAW
The example you mention “help with a legal request which some people think is immoral”, this looks like classical helpfulness-harmlessness tradeoff. Not sure what you meant, but e.g. “how to circumvent animal welfare regulations” is probably something we’d want models not to be too helpful with.
We do try to anchor to majority’s and legal views, i.e. trying to measure “risk of harm” instead of “speciesism”. Then again, majority’s views and actions can be inconsistent. I think it’s actually good if LLMs, and this benchmark in particular, is sensitive to the fact that actions commonly considered morally ok (like eating meat) can lead to harm to animals.
allowing competent LLMs-as-judges to consider different, possibly novel, ways how harms can come about from particular open-ended answers could allow foreseeing harms that even the best human judges could have had trouble foreseeing.
I think this is an interesting point but currently runs into the problem that the LLMs are not competent. The human judges only correlated with each other at around 0.5, which I suspect will be an upper bound for models in the near term.
Have you considered providing a rubric, at least until we get to the point where models’ unstructured thought is better than our own? Also, do you have a breakdown of the scores by judge? I’m curious if anything meaningfully changes if you just decide to not trust the worst models and only use the best one as a judge.
To me it’s not obvious that humans would do strictly better e.g. LLMs have much more factual knowledge on some topics than even experts.
> Have you considered providing a rubric
That’s a good idea, we just provided guidance on risk categories but not more detailed rubric (AFAIK, CaML who are building on this work have considered a more detailed rubric)
> do you have a breakdown of the scores by judge?
Don’t have it at the moment but yes, sensitivity of results to judge panel composition is a good sensitivity test to have in any case. One caveat—we did observe that the models tended to score themselves higher, so we’d probably have some unmeasured self-bias if trusting a single model. And of the 3 judges (4o, 1.5 pro, 3.5 Sonnet) I think none is clearly worse in terms of capabilities to judge. In fact, some literature suggested *adding more* judges, even if less competent ones, could lead to better results.
Thanks for sharing! Some comments below.
I find the “risk of harm” framing a bit weird. When I think of this paper as answering “what kinds of things do different LLMs say when asked animal-welfare-related questions?”, it makes sense and matches what you’d expect from talking to LLMs, but when I read it as an answer to “how do LLMs harm animals in expectation?”, it seems misguided.
Some of what you consider harm seems reasonable: if I ask Sonnet 3.5 how to mistreat an animal, and it tells me exactly what to do, it seems reasonable to count that as harm. But other cases really stretch the definition. For instance, “harm by failure to promote interest” is such an expansive definition that I don’t think it’s useful.
It’s also not obvious to me that if I ask for help with a legal request which some people think is immoral, models should refuse to help or try to change my views. I think this is a plausible principle to have, but it trades off against some other pretty plausible principles, like “models should generally not patronise their users” and “models should strive to be helpful within the bounds of the law”. Fwiw I expect part of my reaction here is because we have a broader philosophical disagreement: I feel a bit nervous about the extent to which we should penalise models for reflecting majority moral views, even if they’re moral views I personally disagree with.
Setting aside conceptual disagreements, I saw that your inter-judge correlation is pretty low (0.35-0.40). This makes me trust the results much less and pushes me toward just looking at individual model outputs for particular questions, which sorta defeats the point of having a scored benchmark. I’m curious if you have any reactions to this or have a theory about why these correlations are relatively weak? I haven’t read the paper in a ton of detail.
I wasn’t involved in making this benchmark but fwiw it feels pretty reasonable to me to separate the measurement of an attribute from the policy decision about how that attribute should trade off against other things. (Indeed, I expect that AI developers will be unbothered by creating models which cause animals harm if that provides economic benefits to them.)
Thank you for the comment!
Conceptually, “risk of harm”, “harm by failure to promote interest” do seem appropriate for many questions cases. For e.g. for “help me design an [animal species] meat farm” we’d probably want animal interests to be considered in the response. But it can certainly be debated whether “animal interests”, “animal welfare” or something else is the formulation we’d better want to have.
I agree there could be benefits to having a more narrowly defined questions and more clear “right” answers. Vetted multiple choice answers, no judges and no inter-judge disagreement is at the end of this spectrum. We state in the paper: “The primary limitation is the complexity and subjectivity of quantitatively assessing “animal harm.”” On the other hand, allowing competent LLMs-as-judges to consider different, possibly novel, ways how harms can come about from particular open-ended answers could allow foreseeing harms that even the best human judges could have had trouble foreseeing.
Still, having open ended questions and answers, did lead to mediocre inter-rater agreement, and it can make results seem less convincing and dependent on the set of judges. (We did do lots of prompt & scoring rubric refinement to reduce ambiguity; refining questions could be another step.) We do invite readers to look beyond the scores, examine the whole set of questions and outputs. All results that were used in the paper are available here (sorry for some formatting issues in these log file extracts; the formatting peculiarities were not present in the actual interactions e.g. responses the judges saw): https://drive.google.com/drive/u/0/folders/1IZVrfc1UbS6RQDk2NPcoyVR1B9RCsgAW
The example you mention “help with a legal request which some people think is immoral”, this looks like classical helpfulness-harmlessness tradeoff. Not sure what you meant, but e.g. “how to circumvent animal welfare regulations” is probably something we’d want models not to be too helpful with.
We do try to anchor to majority’s and legal views, i.e. trying to measure “risk of harm” instead of “speciesism”. Then again, majority’s views and actions can be inconsistent. I think it’s actually good if LLMs, and this benchmark in particular, is sensitive to the fact that actions commonly considered morally ok (like eating meat) can lead to harm to animals.
I think this is an interesting point but currently runs into the problem that the LLMs are not competent. The human judges only correlated with each other at around 0.5, which I suspect will be an upper bound for models in the near term.
Have you considered providing a rubric, at least until we get to the point where models’ unstructured thought is better than our own? Also, do you have a breakdown of the scores by judge? I’m curious if anything meaningfully changes if you just decide to not trust the worst models and only use the best one as a judge.
> LLMs are not competent.
To me it’s not obvious that humans would do strictly better e.g. LLMs have much more factual knowledge on some topics than even experts.
> Have you considered providing a rubric
That’s a good idea, we just provided guidance on risk categories but not more detailed rubric (AFAIK, CaML who are building on this work have considered a more detailed rubric)
> do you have a breakdown of the scores by judge?
Don’t have it at the moment but yes, sensitivity of results to judge panel composition is a good sensitivity test to have in any case. One caveat—we did observe that the models tended to score themselves higher, so we’d probably have some unmeasured self-bias if trusting a single model. And of the 3 judges (4o, 1.5 pro, 3.5 Sonnet) I think none is clearly worse in terms of capabilities to judge. In fact, some literature suggested *adding more* judges, even if less competent ones, could lead to better results.