The implicit claim in this part of the argument seems to be that the rate at which all AI systems will attempt to fool the human operators trying to align them is high enough that we can never have (much?) confidence that a system is aligned. But this seems to be asserted rather than argued for. In AI training, we could punish systems heavily for deception to make it strongly disfavoured. Are you saying that deception in training is a 1% chance or a 99% chance? What is the argument for either number?
The model that AI safety people have is something like this:
It seems like you don’t need a 99% chance or even 1% chance for this to be a big problem. If it just happens once, and the AI is in position to exploit it, that seems dangerous.
The model is like “Ice-Nine” from Vonnegut’s Cat’s Cradle, I guess: one seed crystal is enough, and then everything freezes.
To AI safety people, it seems like systems are being built with additional complexity and functionality all the time, and there’s no way of knowing which new system is dangerous, in terms of capability or “alignment”, what “percentage” (1% or 99%) might apply to its alignment, or whether this “percentage” model of risk is even the right way of thinking about the problem for new systems.
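To put rough numbers on the “just happens once” point above (a toy sketch; the independence assumption and the specific figures are mine, not anything from the thread):

```python
# Toy sketch of the "it only has to happen once" point.
# Assumption (mine, not the thread's): each newly built system has an
# independent chance p of being deceptively misaligned in a way it can
# exploit. Both p and n are made-up illustrative numbers.
p = 0.01   # hypothetical per-system chance of exploitable misalignment
n = 500    # hypothetical number of systems built over time

p_at_least_one = 1 - (1 - p) ** n
print(f"P(at least one such system) ~= {p_at_least_one:.3f}")  # ~0.993
```

Even with a much smaller p, the cumulative probability still climbs toward 1 as n grows, which is why, on this view, the per-system 1%-vs-99% framing doesn’t settle much.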
Ugh, I’m defending Yudkowsky on the EA forum.
I don’t really want to read the OP; maybe if there is time I’ll come back and try to make some points.