In my understanding, there was another important difference between Redwood’s project and the standard adversarial robustness literature: they were looking to eliminate only ‘competent’ failures (i.e. cases where the model probably ‘knows’ what the correct classification is), and would have counted it a success if the only remaining failures were due to a lack of competence on the model’s part (e.g. ‘his mitochondria were liberated’, which implies harm, but only if you know enough biology).
I think in practice, in their exact project, this didn’t end up being a super clear conceptual line, but at the start it seemed plausible to me that focusing only on competent failures would make the task feasible even if the general case is impossible.