Thanks. I don’t share the expectation that alignment research will become much more productive shortly before critical systems arrive, at least not to a degree that reduces relative risk. We should only build systems more advanced than those we already have once we’ve solved mechanistic interpretability for the current ones, and we’re so far from that: the frontier of interpretability research is still looking at models of GPT-2 size and smaller! Also, I think there is a non-zero chance that the next generation of models will be critical, so we’re basically at crunch time now in terms of having a good shot at averting extinction.