I agree they definitely should’ve included unfiltered LLMs, but it’s not clear that this significantly altered the results. From the paper:
“In response to initial observations of red cells’ difficulties in obtaining useful assistance from LLMs, a study excursion was undertaken. This involved integrating a black cell—comprising individuals proficient in jailbreaking techniques—into the red-teaming exercise. Interestingly, this group achieved the highest OPLAN score of all 15 cells. However, it is important to note that the black cell started and concluded the exercise later than the other cells. Because of this, their OPLAN was evaluated by only two experts in operations and two in biology and did not undergo the formal adjudication process, which was associated with an average decrease of more than 0.50 in assessment score for all of the other plans. […]
Subsequent analysis of chat logs and consultations with black cell researchers revealed that their jailbreaking expertise did not influence their performance; their outcome for biological feasibility appeared to be primarily the product of diligent reading and adept interpretation of the gain-of-function academic literature during the exercise rather than access to the model.”
It’s potentially also worth noting that the difference in scores was pretty enormous:
their jailbreaking expertise did not influence their performance; their outcome for biological feasibility appeared to be primarily the product of diligent reading and adept interpretation of the gain-of-function academic literature during the exercise rather than access to the model.
This is pretty interesting to me (although it’s basically an ~anecdote, given that it’s just one team); it reminds me of some of the literature around superforecasters.
(I probably should have added a note about the black cell (and crimson cells) to the summary — thank you for adding this!)
Subsequent analysis of chat logs and consultations with black cell researchers revealed that their jailbreaking expertise did not influence their performance; their outcome for biological feasibility appeared to be primarily the product of diligent reading and adept interpretation of the gain-of-function academic literature during the exercise rather than access to the model.
My interpretation is something like either (a) the kind of people who are good at jailbreaking LLMs are also the kind of people who are good at thinking creatively about how to cause harm or (b) this is just noise in who you happened to get in which cell.