Very good summary! I've been working on a (much drier) series of posts explaining different AI risk scenarios: https://forum.effectivealtruism.org/posts/KxDgeyyhppRD5qdfZ/link-post-how-plausible-are-ai-takeover-scenarios
But I think I might adopt 'Sycophant'/'Schemer' as more descriptive names for WFLL1/WFLL2 (outer/inner alignment failure) going forward.
I also liked that you emphasised how much the optimist vs. pessimist case depends on hard-to-articulate intuitions about things like how easily findable deceptive models are and how easy incremental course correction is. I called this the 'hackability' of alignment: https://www.lesswrong.com/posts/zkF9PNSyDKusoyLkP/investigating-ai-takeover-scenarios#Alignment__Hackability_