Thanks for writing this! One thing that might help would be more examples of Phase 2 work. For instance, I think that most of my work is Phase 2 by your definition (see here for a recent round-up). But I am not entirely sure, especially given the claim that very little Phase 2 work is happening. Other stuff in the “I think this counts but not sure” category would be work done by Redwood Research, Chris Olah at Anthropic, or Rohin Shah at DeepMind (apologies to any other people who I’ve unintentionally left out).
Another advantage of examples is that they could help highlight what you want to see more of.
I hope it’s not surprising, but I’d consider Manifold Markets to be Phase 2 work, too.
I have a related draft I’ve been meaning to post forever, “EA needs better software”, with some other examples of future kinds of Phase 2 work. (Less focused on Longtermism, though.)
If anyone’s specifically excited about doing Phase 2 work—reach out, we’re hiring!
You should post that draft, I’ve been thinking the same stuff and would like to get the conversation started.
One set of examples is in this section of another post I just put up (linked from a footnote in this post), but that’s mostly gesturing and not complete.
I think that for lots of this alignment work there’s an ambiguity about how much to count the future alignment research community as part of “longtermist EA”, which in turn makes it unclear whether the research is itself Phase 2. I think that Redwood’s work is Phase 1, but it’s possible that they’ll later produce research artefacts which are Phase 2. Chris Olah’s old work on interpretability felt like Phase 2 to me; I haven’t followed his recent work, but if the insights are mostly being captured locally at Anthropic I guess it seems like Phase 1, and if they’re being put into the world in a way that improves general understanding of the issues then it’s more like Phase 2.
Your robustness and science of ML work does look like Phase 2 to me, though again I haven’t looked too closely. I do wish that there was more public EA engagement with e.g. the question of “how good is science of ML work for safeguarding the future?” — this analysis feels like a form of Phase 1.5 work that’s missing from the public record (although you may have been doing a bunch of this in deciding to work on that).
It’s possible btw that I’m just empirically wrong about these percentages of effort! Particularly since there’s so much ambiguity around some of the AI stuff. Also possible that things are happening fairly differently in different corners of the community and there are issues about lack of communication between them.
FWIW I think that compared to Chris Olah’s old interpretability work, Redwood’s adversarial training work feels more like phase 2 work, and our current interpretability work is similarly phase 2.
Thanks for this; it made me notice that I was analyzing Chris’s work more in far mode and Redwood’s more in near mode. Maybe you’re right about these comparisons. I’d be interested to understand whether/how you think the adversarial training work could most plausibly be directly applied (or if you just mean “fewer intermediate steps till eventual application”, or something else).