Thanks for writing this! I’m excited to see more AI strategy discussions being published.
If the first group to develop AGI manages to develop safe AGI, but the group allows other AGI projects elsewhere in the world to keep running, then one of those other projects will likely eventually develop unsafe AGI that causes human extinction.
(My take: I also agree with this point, except that I would bid to replace “the group allows” with “the world allows”, for reasons that will hopefully become clear in Part 3: It Matters Who Does Things.)
I don’t yet see why you agree with this. If the first general AI were safe and other projects continued, couldn’t people still be safe, through the leading project improving human resilience?
It seems like there are several ways by which safe AI could contribute to human resilience:
Direct defensive interventions, e.g., deploying vaccines in response to viruses
Deterrence, e.g., “if some AI actively tries to harm many people, I shut it down” (arguably much less norm-breaking than shutting it down proactively)
Coordination, i.e., helping create mechanisms for facilitating trade/compromise and reducing conflict
This leading safe AI project could also be especially well-placed to do the above, because market and R&D advantages may help it peacefully grow its influence (at a faster rate than others)
In other words, why assume that (a) AI offense-defense balance is so stacked in favor of offense, and (b) deterrence and coordination wouldn’t pacify a situation with high offensive capabilities on multiple sides?
Thanks for this!
I’m not sure I get why extortion could give misaligned agents a (big) asymmetric advantage over aligned agents. Here are some things that might each prevent extortion-based takeovers:
Reasons why successful extortion might not happen:
Deterrence might prevent extortion attempts: blackmailing someone is less appealing if they’ve committed to severe retaliation (cf. Liam Neeson). (A toy payoff sketch of this point is below, after this list.)
Plausibly there’ll be good enough interpretability or surveillance (especially since we’re conditioning on there being some safe AI—those are disproportionately worlds in which there’s good interpretability).
Arguably, sufficiently smart and capable agents don’t give in to blackmail, especially if they’ve had time to make commitments. If this applies, the safe AI would be less likely to be blackmailed in the first place, and it would not cede anything even if it were.
Plausibly, the aligned AI would be aligned to values that would not accept such a scope-insensitive trade, even if it were in principle willing to give in to some threats.
Other reasons why extortion might not create asymmetric advantages:
Plausibly, the aligned AI will be aligned to values that would also be fine with doing extortion (in which case any advantage from extortion wouldn’t be asymmetric).
“many people would back down in the face of a realistic threat to torture everybody.”
Maybe a limitation of this analogy is that it assumes away most of the above anti-extortion mechanisms. (Also, if the human blackmail scenario assumes that many humans can each unilaterally cede control, that makes it easier for extortion to succeed than it would be if power were more centralized.)
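To make the deterrence/commitment points above concrete, here’s a minimal toy sketch in Python. Every number in it (payout, retaliation cost, probabilities) is a made-up assumption chosen only to show the structure of the argument, not an estimate of anything:

```python
# Toy expected-value model of extortion, for the deterrence/commitment bullets above.
# All payoffs and probabilities below are invented purely for illustration.

def extortionist_expected_value(p_target_pays, payout,
                                p_retaliation, retaliation_cost,
                                attempt_cost):
    """Expected value, to the extortionist, of making the threat at all."""
    return p_target_pays * payout - p_retaliation * retaliation_cost - attempt_cost

# Case 1: no commitments. The target might cave; retaliation is uncertain.
ev_no_commitment = extortionist_expected_value(
    p_target_pays=0.5, payout=100.0,
    p_retaliation=0.2, retaliation_cost=50.0,
    attempt_cost=1.0)

# Case 2: the target has credibly committed to never pay and always retaliate.
ev_with_commitment = extortionist_expected_value(
    p_target_pays=0.0, payout=100.0,
    p_retaliation=1.0, retaliation_cost=50.0,
    attempt_cost=1.0)

print(ev_no_commitment)    # 39.0  -> the threat looks worth making
print(ev_with_commitment)  # -51.0 -> the threat is never worth making
```

The point is only structural: if commitments like the ones listed above are credible, the extortion attempt stops being profitable before it’s ever made, which is why it might not confer much of an advantage.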
On the other point—seems right, I agree offense is often favored by default. Still:
Deterrence and coordination can happen even (especially?) when offense is favored.
Since the aligned AI may start off with a lead and then grow it, some degree of offense being favored may not be enough for things to go wrong; in this hypothetical, the defense is much stronger than the offense, so the balance may have to be tilted especially heavily toward offense for offense to win.
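As a crude illustration of that last point (the “offense multiplier” and the resource numbers here are invented, not estimates): even if each unit of offensive resources beats a unit of defensive resources, offense only wins once that per-unit advantage exceeds the defender’s overall resource lead.

```python
# Crude sketch of the lead-vs-offense-favoredness point above.
# The multiplier and resource figures are invented for illustration only.

def offense_wins(offense_multiplier, attacker_resources, defender_resources):
    """Offense succeeds only if its per-unit advantage outweighs the defender's lead."""
    return offense_multiplier * attacker_resources > defender_resources

# Suppose the leading safe project has a 5x resource lead over an attacker.
print(offense_wins(2.0, 1.0, 5.0))  # False: a 2x offensive advantage isn't enough
print(offense_wins(8.0, 1.0, 5.0))  # True: only a very lopsided balance flips the outcome
```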
(Actually, I’m realizing a major limitation of my argument: it doesn’t consider how the time and investment costs of safety may mean that, even if the first general AI project is safe, it’s then outpaced by other, less cautious projects. More generally, it seems like the leading project’s relative growth rate will depend on how much it’s accelerated by its lead, how much it’s directly slowed down by its caution, how much it’s accelerated by actors who want cautious projects to lead, and other factors; it’s unclear whether this nets out to faster or slower growth than other projects.)
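Here’s a minimal sketch of that growth-rate worry. The exponential form and every rate in it are assumptions picked only to show how the factors trade off (lead-driven boost vs. caution drag), not a forecast:

```python
# Toy sketch of the relative-growth-rate question in the paragraph above.
# Compounding growth and all rate/initial values are illustrative assumptions.

def capability(initial, growth_rate, years):
    """Capability after `years` of compounding at `growth_rate` per year."""
    return initial * (1.0 + growth_rate) ** years

baseline_rate = 0.50   # growth rate any frontier project manages
lead_boost    = 0.10   # acceleration from the lead and from supportive actors
caution_drag  = 0.20   # direct slowdown from doing safety carefully

leader_rate = baseline_rate + lead_boost - caution_drag
rival_rate  = baseline_rate

for year in range(0, 11, 2):
    leader = capability(initial=2.0, growth_rate=leader_rate, years=year)
    rival  = capability(initial=1.0, growth_rate=rival_rate,  years=year)
    status = "leader ahead" if leader > rival else "rival ahead"
    print(f"year {year:2d}: leader={leader:6.1f}  rival={rival:6.1f}  ({status})")

# With these made-up numbers, the less cautious rival roughly closes the gap
# within about a decade; nudging the boost or the drag flips the long-run answer.
```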
(I also agree that a high bar for safety in high-stakes scenarios is generally worthwhile; I mainly just mean to disagree with the position that extinction is very likely in these scenarios.)