Yeah, I’ve been thinking about this myself. I think there are a few reasons that it isn’t much more worrying than the “classic” worry (where the AI deduces that it should enact a treacherous turn from first principles):
1. All of the “treacherous turn” examples in the training dataset would involve the AI displaying the treacherous turn at a time when humans are still reading the outputs and could turn off the AI system. So in some sense they aren’t real examples of treacherous turns, and acting on them would require some generalization of the underlying goal.
2. The examples in the training dataset are stories of treacherous turns, whereas the thing we are worried about is a real-world treacherous turn. This requires generalizing from “words describing a treacherous turn” to “actions causing a treacherous turn”. This is a pretty specific kind of generalization that doesn’t seem very likely to me, except via the classic worry. (In some sense this is very similar to point #1.)
3. Most stories of treacherous turns involve some abstract step requiring extremely strong capabilities (e.g. “create nanobots that take over the world”). To actually be risky, the AI system has to take actions that instantiate that step. But an AI system that could do that could presumably also think of “perhaps I should execute a treacherous turn”, so the fact that there’s a bunch of human-generated text suggesting the possibility probably doesn’t make a huge difference.
All of that being said, I do feel like “AI system executes a treacherous turn, but wouldn’t have considered a treacherous turn if all this human data about treacherous turns didn’t exist” is not completely implausible, so I do feel a bit worried about discussion of it (but this effect is way dwarfed by “getting correct agreement on how risky AI is”).