Yes, I completely agree that this is nowhere near good enough. It would make me very nervous indeed to end up in that situation.
The thing I was trying to push back against was what I thought you were claiming: that we’re effectively dead if we end up in this situation.
Why aren’t we effectively dead, assuming the misaligned AI reaches AGI and beyond in capability? Do we just luck out? And if so, what makes you think that is the dominant, or default (90%), outcome?
To give one example: how would you use this technique (the “training game”) to eliminate 100% of all possible prompt engineering hacks, and so protect against misuse by malicious humans (cf. the “grandma’s bedtime story napalm recipe” prompt engineering hack mentioned in the OP)?