Thanks for this!

I’m not sure I get why extortion could give misaligned agents a (big) asymmetric advantage over aligned agents. Here are some things that might each prevent extortion-based takeovers:
Reasons why successful extortion might not happen:
Deterrence might prevent extortion attempts—blackmailing someone is less appealing if they’ve committed to severe retaliation (cf. Liam Neeson); see the toy payoff sketch below.
Plausibly, there’ll be good enough interpretability or surveillance (especially since we’re conditioning on there being some safe AI—those are disproportionately worlds in which there’s good interpretability).
Arguably, sufficiently smart and capable agents don’t give in to blackmail, especially if they’ve had time to make commitments. If this applies, the safe AI would be less likely to be blackmailed in the first place, and it would not cede anything if it were blackmailed.
Plausibly, the aligned AI would be aligned to values that would not accept such a scope-insensitive trade, even if those values otherwise permitted giving in to threats.
Other reasons why extortion might not create asymmetric advantages:
Plausibly, the aligned AI will be aligned to values that would also be fine with doing extortion.
Re: “many people would back down in the face of a realistic threat to torture everybody”:
Maybe a limitation of this analogy is that it assumes away most of the above anti-extortion mechanisms. (Also, if the human blackmail scenario assumes that many humans can each unilaterally cede control, that also makes it easier for extortion to succeed than if power is more centralized.)
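To make the deterrence/commitment point a little more concrete, here’s a minimal toy sketch (entirely my own made-up model and numbers, nothing from the discussion above): once the target has credibly committed to refusing and to retaliating, the expected value of making the threat can easily go negative, so the blackmailer is better off never trying.

```python
# Toy payoff sketch (illustrative model and numbers are my own assumptions):
# a blackmailer decides whether to threaten, depending on whether the target
# has credibly committed to (a) never concede and (b) retaliate if threatened.

def blackmailer_expected_value(p_concede: float,
                               gain_if_concede: float,
                               cost_of_carrying_out_threat: float,
                               retaliation_cost: float) -> float:
    """Expected value to the blackmailer of making the threat.

    p_concede: probability the target gives in.
    gain_if_concede: what the blackmailer gets if the target gives in.
    cost_of_carrying_out_threat: cost of following through if the target refuses.
    retaliation_cost: expected cost imposed by the target's committed retaliation.
    """
    return (p_concede * gain_if_concede
            - (1 - p_concede) * cost_of_carrying_out_threat
            - retaliation_cost)

# No commitment: the target is fairly likely to concede, and there is no retaliation.
print(blackmailer_expected_value(p_concede=0.6, gain_if_concede=100,
                                 cost_of_carrying_out_threat=10,
                                 retaliation_cost=0))    # 56.0 -> threatening looks attractive

# Credible commitment to refuse and retaliate: concession unlikely, retaliation expected.
print(blackmailer_expected_value(p_concede=0.05, gain_if_concede=100,
                                 cost_of_carrying_out_threat=10,
                                 retaliation_cost=30))   # -34.5 -> threat not worth making
```

The specific numbers don’t matter; the point is just that commitment shifts the blackmailer’s calculus in two ways at once, by lowering p_concede and by adding retaliation_cost.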
On the other point—seems right; I agree offense is often favored by default. Still:
Deterrence and coordination can happen even (especially?) when offense is favored.
Since the aligned AI may start off with and then grow a lead, some degree of offense being favored may not be enough for things to go wrong; in this hypothetical the defense is stronger (maybe much stronger) than the offense, so things may have to be tilted especially heavily toward offense for offense to win.
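As a crude way to quantify that last point (again, a toy model of my own, not anything from the thread): suppose the aligned project has $k$ times the resources of the would-be attacker, and offense is favored in the sense that one unit of offensive effort neutralizes $m$ units of defensive effort. Then the attack only succeeds when

$$ m \cdot R_{\text{attacker}} > R_{\text{defender}} = k \cdot R_{\text{attacker}} \quad\Longleftrightarrow\quad m > k, $$

so with a large lead (big $k$), offense has to be favored very heavily (even bigger $m$) before the misaligned side wins.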
(Actually, I’m realizing a major limitation of my argument is that it doesn’t consider how the time/investment costs of safety may mean that—even if the first general AI project is safe—it’s then outpaced by other, less cautious projects. More generally, it seems like the leading project’s (relative) growth rate will depend on how much it’s accelerated by its lead, how much it’s directly slowed down by its caution, how much it’s accelerated by actors who want cautious projects to lead, and other factors, and it’s unclear whether this would result in overall faster or slower growth than other projects.)
(I also agree that a high bar for safety in high-stakes scenarios is generally worthwhile; I mainly just mean to disagree with the position that extinction is very likely in these scenarios.)