Regarding susceptibility to s-risk:
If you keep humans around, they can decide on how to respond to threats and gradually improve their policies as they figure out more (or their AIs figure out more).
If you build incorrigible AIs who will override human preferences (so that a threatened human has no ability to change the behavior of their AI), while themselves being resistant to threats, then you may indeed reduce the likelihood of threats being carried out.
But in practice all the value is coming from you solving "how do we deal with threats?" at the same time that you solve the alignment problem.
I don’t think there’s any real argument that solving CEV or ambitious value learning per se helps with these difficulties, except insofar as your AI is able to answer these questions. But in that case a corrigible AI could also answer those questions.
Humans may ultimately build incorrigible AI for decision-theoretic reasons, but I think the decision to do so should probably be separated from solving alignment.
I think the deepest coupling comes from the fact that the construction of incorrigible AI is itself an existential risk, and so it may be extremely harmful to build technology that enables that prior to having norms and culture that are able to use it responsibly.
Overall, I’m much less sure than you that “making it up as you go along alignment” is bad for s-risk.