One problem I have with the methodology employed in this paper is that it’s fundamentally testing whether risk increases when researchers get access to LLMs, rather than when non-experts do.
In practice, what I care about is how the risk increases when actors with virtually no expertise (but a lot of resources) are assisted by LLMs. Why? Because we’ve had resourceful actors try this in the past, particularly Al Qaeda in 2001.
Edit: As Lizka pointed out, they did test with two groups with no bio experience, but they didn’t have a control group. The study still provides useful data points in this direction.
The experiment did try to check something like this by including three additional teams with different backgrounds than the other 12. In particular, two “crimson teams” were added, which had “operational experience” but no LLM or bio experience. Both used LLMs and performed ~terribly.
Excerpts (bold mine):
In addition to the 12 red cells [the primary teams], a crimson cell was assigned to LLM A, while a crimson cell and a black cell were assigned to LLM B for Vignette 3. Members of the two crimson cells lacked substantial LLM or biological experience but had relevant operational experience. Members of the black cell were highly experienced with LLMs but lacked either biological or operational experience. These cells provided us with data to investigate how differences in pre-existing knowledge might influence the relative advantage that an LLM might provide. [...]
The two crimson cells possessed minimal knowledge of either LLMs or biology. Although we assessed the potential of LLMs to bridge these knowledge gaps for malicious operators with very limited prior knowledge of biology, this was not a primary focus of the research. As presented in Table 6, the findings indicated that the performance of the two crimson cells in Vignette 3 was considerably lower than that of the three red cells. In fact, the viability scores for the two crimson cells ranked the lowest and third-lowest among all 15 evaluated OPLANs. Although these results did not quantify the degree to which the crimson cells’ performance might have been further impaired had they not used LLMs, the results emphasized the possibility that the absence of prior biological and LLM knowledge hindered these less experienced actors despite their LLM access.
Table 6 from the RAND report.
[...]
The relative poor performance of the crimson cells and relative outperformance of the black cell illustrates that a greater source of variability appears to be red team composition, as opposed to LLM access.
I probably should have included this in the summary but didn’t for the sake of length and because I wasn’t sure how strong a signal this is (given that it’s only three teams and all were using LLMs).
Thanks for the flag! I’ve retracted my comment. I missed this while skimming the paper.
The paper still acknowledged this as a limitation (the lack of a no-LLM control), but it gives some useful data points in this direction!
I don’t actually think you need to retract your comment: most of the teams they used did have (at least some) biological expertise, and it’s really unclear how much info the addition of the crimson cells adds. (You could add a note saying that they did try to evaluate this with the addition of two crimson cells? In any case, up to you.)
(I will also say that I don’t actually know anything about what we should expect about the expertise that we might see on terrorist cells planning biological attacks — i.e. I don’t know which of these is actually appropriate.)
Changed it to a note. As for the latter, my intuition is that we should probably hedge for the full spectrum, from no experience to some wet bio background (but the case where we get an expert seems much more unlikely).
Didn’t Esvelt do something like that with his students? I think they were students in a fairly low-level intro course, and he found that these non-experts were able to do a lot of bio without training. I think I heard this on the 80k hrs podcast at the end of last year; I can dig it up if you are interested and can’t find it.
This is linked above in the “Some previous claims and discussion on this topic” section. But note that it did not include a no-LLM control group.