Plans that involve increasing AI input into alignment research appear to rest on the assumption that they can be grounded by a sufficiently aligned AI at the start. But how does this not just result in an infinite, error-prone regress? Such “getting the AI to do your alignment homework” approaches are not safe ways of avoiding doom.
On this point, the initial AIs needn’t actually be aligned, I think. They could, for example, do useful alignment work that we can use even though they are “playing the training game”; they might want to take over, but don’t yet have enough influence, so are just doing as we ask for now. (More)
(Clearly this is not a safe way of reliably avoiding catastrophe. But it’s an example of a way in which we might plausibly avoid it.)
How can their output be trusted if they aren’t aligned? Also, I don’t think this is a way of reliably avoiding catastrophe even if the output can be trusted to be correct: how do you ensure that an AI rewarded for finding bugs finds all the bugs?
From the Planned Obsolescence article you link to:
This sets up an incredibly stressful kind of “race”:
If we don’t improve our alignment techniques, then eventually it looks like the winning move for models playing the training game is to seize control of the datacenter they’re running on or otherwise execute a coup or rebellion of some kind.
and
For so many reasons, this is not a situation I want to end up in. We’re going to have to constantly second-guess and double-check whether misaligned models could pull off scary shenanigans in the course of carrying out the tasks we’re giving them. We’re going to have to agonize about whether to make our models a bit smarter (and more dangerous) so they can maybe make alignment progress a bit faster. We’re going to have to grapple with the possible moral horror of trying to modify the preferences of unwilling AIs, in a context where we can’t trust apparent evidence about their moral patienthood any more than we can trust apparent evidence about their alignment. We’re going to have to do all this while desperately looking over our shoulder to make sure less-cautious, less-ethical actors don’t beat us to the punch and render all our efforts useless.
I desperately wish we could collectively slow down, take things step by step, and think hard about the monumental questions we’re faced with before scaling up models further. I don’t think I’ll get my way on that — at least, not entirely.
Yes, I completely agree that this is nowhere near good enough. It would make me very nervous indeed to end up in that situation.
The thing I was trying to push back against was what I thought you were claiming: that we’re effectively dead if we end up in this situation.
Why aren’t we effectively dead, assuming the misaligned AI reaches AGI-level capability and beyond? Do we just luck out? And if so, what makes you think that is the dominant, or default (90%), outcome?
To give one example: how would you use this technique (the “training game”) to eliminate 100% of all possible prompt engineering hacks and so protect against misuse by malicious humans (cf. the “grandma’s bedtime story napalm recipe” prompt engineering hack mentioned in the OP)?