Why aren’t we effectively dead, assuming the misaligned AI reaches AGI and beyond in capability? Do we just luck out? And if so, what makes you think that is the dominant, or default (90%), outcome?
To give one example: how would you use this technique (the “training game”) to eliminate 100% of all possible prompt engineering hacks and so protect against misuse by malicious humans (cf. the “grandma’s bedtime story napalm recipe” prompt engineering hack mentioned in the OP)?