Divergence of efforts in AI alignment could lead to an arms race
Can you be a bit more concrete about what this would look like? Is this because different approaches to alignment can also lead to insights into capabilities, or is there something else more insidious?
Naively, it’s easy to see why an arms race in AI capabilities is bad, but competition over AI alignment seems basically good.
An alignment arms race is only bad if there is concomitant capabilities development that would make a wrong alignment protocol counterproductive. Different approaches to alignment can lead to insights into capabilities, and that’s something to be concerned about, but that concern is already captured in existing analyses of capabilities arms-race scenarios. The more insidious possibility is a race between the alignment agendas themselves.
If there are two or more alignment agencies, but only one of their approaches will be compatible with advanced AI systems as actually developed, each would race to complete its alignment agenda before the others could complete theirs. This rushing could be especially bad if any of them fails to take the time to verify that its approach will actually align AI as intended. In addition, if the competition becomes hostile enough, alignment agencies won’t check each other’s work in good faith, and in general there won’t be enough trust for anyone to let anyone else check their work at all.
If one or more of the agencies racing to the finish line doesn’t let anyone check its work, and its strategy turns out to be invalid or unsound, then implementing that strategy in an AI system would fail to produce alignment precisely when everyone expected it would. In other words, because of unchecked mistakes, what looks like an alignment competition inadvertently becomes a misalignment race.
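To make the dynamic concrete, here is a minimal sketch of the race as a two-player game. All payoffs and probabilities, and the overconfidence assumption (each agency trusts its own unchecked protocol but not a rival’s), are illustrative assumptions of mine, not anything established above.

```python
# A toy model of the race dynamic described above. All numbers are
# illustrative assumptions, not estimates. Two agencies each choose to
# "verify" (slow, checked) or "rush" (fast, unchecked); whoever finishes
# first has their protocol deployed, and rushing always finishes first.
# Each agency is overconfident: it treats its own rushed protocol as
# sound but assigns probability P_UNSOUND to a rival's.

P_UNSOUND = 0.4    # assumed chance a rival's unchecked protocol is unsound
WIN, LOSE = 10, 2  # payoff for having your agenda adopted vs. a rival's
DISASTER = -100    # shared payoff if the deployed protocol misaligns AI

def subjective_payoff(mine: str, theirs: str) -> float:
    """Expected payoff from one agency's (overconfident) point of view."""
    # Expected value when the rival's unchecked agenda gets deployed.
    risky_loss = P_UNSOUND * DISASTER + (1 - P_UNSOUND) * LOSE
    if mine == "verify" and theirs == "verify":
        return 0.5 * (WIN + LOSE)       # coin flip between two sound agendas
    if mine == "rush" and theirs == "verify":
        return WIN                      # I finish first and trust my work
    if mine == "verify" and theirs == "rush":
        return risky_loss               # their unchecked agenda deploys
    return 0.5 * WIN + 0.5 * risky_loss  # both rush: coin flip on who wins

for mine in ("verify", "rush"):
    for theirs in ("verify", "rush"):
        ev = subjective_payoff(mine, theirs)
        print(f"I {mine:6} / they {theirs:6} -> expected payoff {ev:6.1f}")
```

With these numbers, rushing beats verifying whatever the rival does (10 > 6 against a verifier, -14.4 > -38.8 against a rusher), so both agencies rush, even though mutual verification (6 each) beats mutual rushing (-14.4 each). Overconfidence plus winner-take-all deployment turns the competition into exactly the misalignment race described above.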
I’m not saying competition in AI alignment is either good or bad by default. What I am saying is that there appear to be particular conditions under which competition in AI alignment makes things worse, and those conditions should be avoided. To summarize, at least some of them appear to be:
1. Competition in AI alignment becomes a ‘race.’
2. One or more AI alignment agencies becomes untrustworthy.
3. Even if in principle all AI alignment agencies should be able to trust each other, in practice they end up mistrusting each other.
How can one incentivise the right kind of behaviour here? This isn’t a zero-sum game: we can all win, or we can all lose. How do we instill that understanding in the market, so that the belief that only one of us can win doesn’t make us all more likely to lose?
Off the top of my head (a toy sketch of the shared logic behind both follows below):
1. Some sort of share trading scheme.
2. Some guarantee from different AI companies that whichever one reaches AI first will employ people from the others.
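Both proposals work the same way: they shrink the gap between winning the race and losing it. A quick follow-on to the earlier sketch, with the same illustrative numbers and the same overconfidence assumption, shows the temptation to rush vanishing as that gap closes.

```python
# Follow-on to the earlier sketch, with the same illustrative numbers.
# Share trading and employment guarantees both raise the payoff of
# *losing* the race toward the payoff of winning it. Here we vary that
# losing payoff and report the temptation to rush against a rival who
# takes the time to verify.

WIN = 10

for lose in (2, 5, 8, 10):          # losing pays more as proposals strengthen
    verify_ev = 0.5 * (WIN + lose)  # verify too: coin flip between sound agendas
    rush_ev = WIN                   # rush: my agenda deploys and I trust it
    print(f"losing pays {lose:2} -> temptation to rush: {rush_ev - verify_ev:4.1f}")
```

The temptation falls from 4.0 to 0.0 as losing approaches winning: once everyone shares in whoever finishes first, racing buys nothing, and taking the time to verify costs nothing.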