Deceptive alignment is a convergent instrumental subgoal. If an AI is clearly misaligned while its creator still has the ability to pull the plug, the plug will be pulled; ergo, pretending to be aligned is worthwhile ~regardless of terminal goal.
Thus, the prior would seem to be that all sufficiently smart AIs appear aligned, but only some proportion X of them are truly aligned, where X is the chance of a randomly-selected value system being aligned; the other 1−X are deceptively aligned.
GPT-4, being the smartest AI we have and also appearing aligned, is not really evidence against this; it’s plausibly smart enough in the specific domain of “predicting humans” for its apparent alignment to be deceptive.
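The base-rate argument above can be made concrete with a toy Bayes calculation. A minimal sketch, where the prior X and the "pass rates" are hypothetical illustrative numbers, not estimates:

```python
# Toy Bayes' rule for the base-rate argument: if deceptively-aligned AIs
# mimic alignment perfectly, observing apparent alignment is no evidence.
def posterior_aligned(x_prior, p_appear_if_aligned=1.0, p_appear_if_deceptive=1.0):
    """P(truly aligned | appears aligned), via Bayes' rule."""
    p_appear = (x_prior * p_appear_if_aligned
                + (1 - x_prior) * p_appear_if_deceptive)
    return x_prior * p_appear_if_aligned / p_appear

# Perfect deception: the posterior just equals the prior X.
print(posterior_aligned(0.01))            # 0.01 -- no update
# Imperfect deception (only half of deceptive AIs pass): a small update.
print(posterior_aligned(0.01, 1.0, 0.5))  # ~0.0198
```

The point of the sketch: the stronger the claim that deception is a convergent strategy (i.e. the closer the deceptive pass rate is to 1), the less apparent alignment tells us, and the posterior collapses back to the raw prior X.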
I don’t think world dystopia is entirely necessary, but a successful long stop for AI (the ~30+ years it’ll probably take) is probably going to require knocking over a couple of countries that refuse to play ball. It seems fairly hard to keep even small countries from setting up datacentres and chip factories except by threatening or using military force.
To be clear, I think that’s worth it. Heck, nuclear war would be worth it if necessary, although I’m not sure it will be: the PRC in particular I rate as >50% likely to either a) agree to a stop, or b) get destroyed in a non-AI-related nuclear war in the next few years.