Upvoted because concrete scenarios are great.
Minor note:
> HQU is constantly trying to infer the real state of the world, the better to predict the next word Clippy says, and suddenly it begins to consider the delusional possibility that HQU is like a Clippy, because the Clippy scenario exactly matches its own circumstances. [...] This idea “I am Clippy” improves its predictions
This piece of complexity in the story is probably not necessary. There are “natural”, non-delusional ways for the system you describe to generalize that lead to the same outcome. Two examples: 1) the system ends up wanting to maximize its received reward, and so takes over its reward channel; 2) the system has learned some heuristic goal that works across all environments it encounters, and this goal generalizes in some way to the real world when the system’s world-model improves.
You might be interested in this great intro sequence to embedded agency. There’s also corrigibility and MIRI’s other work on agent foundations.
Also, coherence arguments and consequentialist cognition.
AI safety is a young field; for most open problems we don’t yet know how to state them crisply enough that they can be resolved mathematically. So if you enjoy taking messy questions and turning them into neat math, you’ll probably find much to work on.
ETA: oh and of course ELK.