The failure of Redwood’s adversarial training project is unfortunately wholly unsurprising given almost a decade of similarly failed attempts at defenses against adversarial examples from hundreds or even thousands of ML researchers. For example, the RobustBench benchmark shows that the best known robust accuracy on ImageNet is still below 50% for attacks with a barely perceptible perturbation.
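To make the threat model concrete, here is a minimal sketch of the kind of L∞ attack these evaluations use: a projected-gradient-descent (PGD) perturbation bounded by eps = 4/255 per pixel. This is my own illustration in PyTorch under those assumptions, not RobustBench’s or Redwood’s code.

```python
# Minimal PGD sketch (illustration only, not RobustBench's or Redwood's code).
# eps = 4/255 is the "barely perceptible" L-infinity budget referred to above.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=4/255, step_size=1/255, n_steps=10):
    """Search for perturbed inputs within an L-inf ball of radius eps that raise the loss."""
    x_adv = x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)  # random start
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()                # ascend the loss
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # project into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                          # keep a valid image
    return x_adv.detach()

# Robust accuracy is ordinary accuracy measured on pgd_attack(model, x, y) instead of x;
# RobustBench uses the stronger AutoAttack suite, but the idea is the same.
```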
The better reference class is adversarially mined examples for text models. Meta and other researchers were working on similar projects before Redwood started this line of research. https://github.com/facebookresearch/anli is an example. (Reader: evaluate your model’s consistency for what counts as alignment research—does this mean non-x-risk-pilled Meta researchers do some alignment research, if we believe the RR project constituted exciting alignment research too?)
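For concreteness, the adversarially-mined-examples recipe behind datasets like ANLI looks roughly like the loop sketched below. This is a hedged sketch with hypothetical helper names (propose_candidates, label_fn, finetune, model.predict), not the actual ANLI, Dynabench, or Redwood pipeline.

```python
# Rough sketch of one round of adversarial data mining for a text classifier.
# All helper names are hypothetical stand-ins, not ANLI/Dynabench/Redwood code.
def adversarial_mining_round(model, propose_candidates, label_fn, finetune):
    mined = []
    for text in propose_candidates(model):  # annotators try to fool the current model
        gold = label_fn(text)                # trusted human label
        if model.predict(text) != gold:      # keep only the examples the model gets wrong
            mined.append((text, gold))
    model = finetune(model, mined)           # train on the mined failures
    return model, mined

# Repeating this over several rounds yields progressively harder data;
# ANLI's R1-R3 rounds follow roughly this recipe.
```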
Separately, I haven’t seen empirical demonstrations that pursuing this line of research can have limited capabilities externalities or result in differential technological progress. Robustifying models against some kinds of automatic adversarial attacks (1,2) does seem to be separable from improving general capabilities though, and I think it’d be good to have more work on that.
We recommend this article by an MIT CS professor which is partly about how creating a sustainable work culture can actually increase productivity.
This researcher’s work attitude is only one point on a spectrum. Many researchers find great returns from working 80+ hours a week. Some labs differentiate themselves by keeping usual hours, but many successful labs have their members work a lot, and that works out well. For example, Dawn Song’s students work a ton, and some Berkeley grad students in other labs are intimidated by her lab’s hours, but that’s OK because her graduate students find that environment suitable. It’d be nice if this post were more specific about how much of the work-culture discontent is about hours versus other issues.
The better reference class is adversarially mined examples for text models. Meta and other researchers were working on similar projects before Redwood started this line of research. https://github.com/facebookresearch/anli is an example.
I agree that’s a good reference class. I don’t think Redwood’s project had identical goals, and would strongly disagree with someone saying it’s duplicative. But other work is certainly also relevant, and ex post I would agree that other work in the reference class is comparably helpful for alignment.
Reader: evaluate your model’s consistency for what counts as alignment research—does this mean non-x-risk-pilled Meta researchers do some alignment research, if we believe the RR project constituted exciting alignment research too?
Of course! I’m a bit unusual amongst the EA crowd in how enthusiastic I am about “normal” robustness research, but I’m similarly unusual amongst the EA crowd in how enthusiastic I am about this proposed research direction for Redwood, and I suspect those things will typically go together.
Separately, I haven’t seen empirical demonstrations that pursuing this line of research can have limited capabilities externalities or result in differential technological progress.
I’m still not convinced by this perspective. I would frame the situation as:
There’s a task we really want future people to be good at—finding places where models behave in obviously-undesirable ways, and understanding the limitations of such evaluations and the consequences of training on adversarial inputs.
That task isn’t obviously improving automatically with model capabilities; it seems like something that requires knowledge and individual+institutional expertise.
So maybe we should practice a lot to get better at that task, sharing what we learn and building a larger community of researchers and engineers with relevant experience.
Your objection sounds like: “That may be true but there’s not a lot of evidence that this doesn’t also make models more capable, which would be bad.” And I don’t find that very persuasive—I don’t think there is such a strong default presumption that generic research accelerates capabilities enough to be a meaningful cost.
On the question of what generates differential technological progress, I think I’m comparably skeptical of all of the evidence on offer for claims of the form “doing research on X leads to differential progress on Y,” and the best guide we have (both in alignment and in normal academic research!) is basically common-sense arguments along the lines of “investigating and practicing doing X tends to make you better at doing X.”
I don’t think Redwood’s project had identical goals, and would strongly disagree with someone saying it’s duplicative.
I agree it is not duplicative. It’s been a while, but if I recall correctly the main difference seemed to be that they chose a task that gave them an extra nine of reliability (they started with an initially easier task) and pursued it more thoroughly.
I think I’m comparably skeptical of all of the evidence on offer for claims of the form “doing research on X leads to differential progress on Y,”
I think if we find that improvement on X leads to improvement on Y, then that’s some evidence, but it doesn’t establish that the progress is differential. If we find that improvement on X also leads to progress on some thing Z that is highly indicative of general capabilities, then that’s evidence against. If we find that it mainly affects Y but not other things Z, then that’s reasonable evidence that it’s differential. For example, so far, transparency research hasn’t affected general capabilities, so I read that as evidence of differential technological progress. As another example, I think trojan defense research differentially improves our understanding of trojans; I don’t see it making models better at coding or at gaining new general instrumental skills.
I think common sense is too unreliable a guide when thinking about deep learning; deep learning phenomena are often unintelligible even in hindsight (I still don’t understand why some of my research papers’ methods work). That’s why I’d prefer empirical evidence: empirical research claiming to differentially improve safety should demonstrate a differential safety improvement empirically.
In my understanding, there was another important difference between Redwood’s project and the standard adversarial robustness literature: they were looking to eliminate only ‘competent’ failures (i.e. cases where the model probably ‘knows’ what the correct classification is), and would have counted the project a success even if some failures remained, so long as those failures were due to a lack of competence on the model’s part (e.g. ‘his mitochondria were liberated’ implies harm, but only if you know enough biology).
I think that in practice, in their exact project, this didn’t end up being a super clear conceptual line, but at the start it was plausible to me that focusing only on competent failures made the task feasible even if the general case is impossible.
Thanks for the comment, Dan. I agree that the adversarially mined examples literature is the right reference class, of which the two that you mention (Meta’s Dynabench and ANLI) were the main examples (maybe the only examples? I forget) while we were working on this project.
I’ll note that Meta’s Dynabench sentiment model (the only model of theirs that I interacted with) seemed substantially less robust than Redwood’s classifier (e.g. I was able to defeat it manually in about 10 minutes of messing around, whereas I needed the tools we made to defeat the Redwood model).
I think adversarial mining was hot in 2019. IIRC, HellaSwag and others did it; I’d venture maybe 100 papers did it before RR, but I still think it was underexplored at the time, and I’m happy RR investigated it.