How can their output be trusted if they aren’t aligned? Also, I don’t think this is a way to reliably avoid catastrophe even if the output can be trusted to be correct: how do you ensure that an AI rewarded for finding bugs finds all the bugs?
From the Planned Obsolescence article you link to:
This sets up an incredibly stressful kind of “race”:
If we don’t improve our alignment techniques, then eventually it looks like the winning move for models playing the training game is to seize control of the datacenter they’re running on or otherwise execute a coup or rebellion of some kind.
and
For so many reasons, this is not a situation I want to end up in. We’re going to have to constantly second-guess and double-check whether misaligned models could pull off scary shenanigans in the course of carrying out the tasks we’re giving them. We’re going to have to agonize about whether to make our models a bit smarter (and more dangerous) so they can maybe make alignment progress a bit faster. We’re going to have to grapple with the possible moral horror of trying to modify the preferences of unwilling AIs, in a context where we can’t trust apparent evidence about their moral patienthood any more than we can trust apparent evidence about their alignment. We’re going to have to do all this while desperately looking over our shoulder to make sure less-cautious, less-ethical actors don’t beat us to the punch and render all our efforts useless.
I desperately wish we could collectively slow down, take things step by step, and think hard about the monumental questions we’re faced with before scaling up models further. I don’t think I’ll get my way on that — at least, not entirely.
Yes, I completely agree that this is nowhere near good enough. It would make me very nervous indeed to end up in that situation.
The thing I was trying to push back against was what I thought you were claiming: that we’re effectively dead if we end up in this situation.
Why aren’t we effectively dead, assuming the misaligned AI reaches AGI-level capability and beyond? Do we just luck out? And if so, what makes you think that is the dominant, or default (90%), outcome?
To give one example: how would you use this technique (the “training game”) to eliminate 100% of all possible prompt engineering hacks, and so protect against misuse by malicious humans (cf. the “grandma’s bedtime story napalm recipe” prompt engineering hack mentioned in the OP)?