One set of examples is in this section of another post I just put up (linked from a footnote in this post), but that’s pretty gestural / not complete.
I think that for lots of this alignment work there’s ambiguity about how much to count the future alignment research community as part of “longtermist EA”, which creates ambiguity about whether the research is itself Phase 2. I think that Redwood’s work is Phase 1, but it’s possible that they’ll later produce research artefacts which are Phase 2. Chris Olah’s old work on interpretability felt like Phase 2 to me; I haven’t followed his recent work, but if the insights are mostly being captured locally at Anthropic I guess it seems like Phase 1, and if they’re being put into the world in a way that improves general understanding of the issues then it’s more like Phase 2.
Your robustness and science of ML work does look like Phase 2 to me, though again I haven’t looked too closely. I do wish that there was more public EA engagement with e.g. the question of “how good is science of ML work for safeguarding the future?” — this analysis feels like a form of Phase 1.5 work that’s missing from the public record (although you may have been doing a bunch of this in deciding to work on that).
It’s possible btw that I’m just empirically wrong about these percentages of effort! Particularly since there’s so much ambiguity around some of the AI stuff. Also possible that things are happening fairly differently in different corners of the community and there are issues about lack of communication between them.
FWIW I think that compared to Chris Olah’s old interpretability work, Redwood’s adversarial training work feels more like phase 2 work, and our current interpretability work is similarly phase 2.
Thanks for this; it made me notice that I was analyzing Chris’s work more in far mode and Redwood’s more in near mode. Maybe you’re right about these comparisons. I’d be interested to understand whether/how you think the adversarial training work could most plausibly be directly applied (or if you just mean “fewer intermediate steps till eventual application”, or something else).