I’m considering writing a post that critically evaluates the concept of a decisive strategic advantage, i.e. the idea that in the future an AI (or set of AIs) will take over the world in a catastrophic way. I think this concept is central to many arguments about AI risk. I’m eliciting feedback on an outline of this post here in order to determine what’s currently unclear or weak about my argument.
The central thesis would be that it is unlikely that an AI, or a unified set of AIs, will violently take over the world in the future, especially at a time when humans are still widely seen as in charge (if it happened later, I don’t think it’s “our” problem to solve, but instead a problem we can leave to our smarter descendants). Here’s how I envision structuring my argument:
First, I’ll define what is meant by a decisive strategic advantage (DSA). The DSA model has 4 essential steps:
At some point in time an AI agent, or an agentic collective of AIs, will be developed that has values that differ from our own, in the sense that the ~optimum of its utility function ranks very low according to our own utility function
When this agent is weak, it will have a convergent instrumental incentive to lie about its values, in order to avoid getting shut down (e.g. “I’m not a paperclip maximizer, I just want to help everyone”)
However, when the agent becomes powerful enough, it will suddenly strike and take over the world
Then, being now able to act without constraint, this AI agent will optimize the universe ruthlessly, which will be very bad for us
We can compare the DSA model to an alternative model of future AI development:
Premises (1)-(2) above in the DSA story are still assumed true, but
There will never be a point, as in (3) and (4), at which a unified AI agent takes over the world and then optimizes the universe ruthlessly
Instead, AI agents will compromise, trade, and act within a system of laws indefinitely, in order to achieve their objectives, similar to what humans do now
Because this system of laws will descend from our current institutions and legal tradition, it is likely that humans will keep substantial legal rights, potentially retaining lots of wealth from our capital investments and property, even if we become relatively powerless compared to other AI agents in the system
I have two main objections to the DSA model.
Objection 1: It is unlikely that there will be a point at which a unified agent will be able to take over the world, given the existence of competing AIs with comparable power
Prima facie, it seems intuitive that no single AI agent will be able to take over the world if there are other competing AI agents in the world. More generally, we can try to predict the distribution of power between AI agents using reference class forecasting.
This could involve looking at:
Distribution of wealth among individuals in the world
Distribution of power among nations
Distribution of revenue among businesses
etc.
In most of these cases, the function that describes the distribution of power is something like a Pareto distribution, and in particular, it seems rare for a single agent to hold something like >80% of the power.
Therefore, a priori we should assign a low probability to the claim that a unified agent will be able to easily take over the whole world in the future
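To make this reference-class intuition concrete, here is a minimal simulation sketch (my illustration, not part of the original outline): it samples agent “power” from a Pareto distribution and checks how often the single largest agent ends up with more than 80% of the total. The shape parameter and agent count are assumptions chosen for illustration, not empirical estimates.

```python
# Illustrative only: sample "power" across many agents from a heavy-tailed
# Pareto distribution and count how often the single largest agent holds >80%
# of the total. The shape parameter and agent count are assumed, not estimated.
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_trials, shape = 1_000, 1_000, 1.5  # assumed values

top_shares = []
for _ in range(n_trials):
    power = rng.pareto(shape, n_agents) + 1.0  # classical Pareto samples (>= 1)
    top_shares.append(power.max() / power.sum())

top_shares = np.array(top_shares)
print(f"median share held by the largest agent: {np.median(top_shares):.1%}")
print(f"fraction of worlds with a >80% 'singleton': {(top_shares > 0.8).mean():.1%}")
```

Under these particular assumptions, near-total concentration of power is rare, which is the shape of prior the argument above relies on; heavier tails (smaller shape parameters) make extreme concentration more common, so the conclusion is sensitive to that choice.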
To the extent people disagree about the argument I just stated, I expect it’s mostly because they think these reference classes are weak evidence, and they think there are stronger specific object-level points that I need to address. In particular, it seems many people think that AIs will not compete with each other, but instead collude against humans. Their reasons for thinking this include:
The fact that AIs will be able to coordinate well with each other, and thereby choose to “merge” into a single agent
My response: I agree AIs will be able to coordinate with each other, but “ability to coordinate” seems like a continuous variable that we will apply pressure to incrementally, not something that we should expect to be roughly infinite right at the start. Current AIs are not able to “merge” with each other.
If coordination ability increases incrementally over time, then we should see a gradual increase in the concentration of AI agency over time, rather than the sudden emergence of a single unified agent. To the extent this concentration happens incrementally, it will be predictable, the potential harms will be noticeable before getting too extreme, and we can take measures to pull back if we realize that the costs of continually increasing coordination abilities are too high. In my opinion, this makes the challenge here dramatically easier.
In any case, the moment during which we hand over control of the world to AIs will likely occur at a point when the ability for AIs to coordinate is somewhere only modestly above human-level (and very far below perfect).
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge. We can leave this problem to be solved by our smarter descendants.
The idea that AIs will all be copies of each other, and thus all basically be “a unified agent”
My response: I have two objections.
First, I deny the premise. It seems likely that there will be multiple competing AI projects with different training runs. More importantly, for each pre-training run, it seems likely that there will be differences among deployed AIs due to fine-tuning and post-training enhancements, yielding diversity among AIs in general.
Second, it is unclear why AIs would automatically unify with their copies. I think this idea is somewhat plausible on its face but I have yet to see any strong arguments for it. Moreover, it seems plausible that AIs will have indexical preferences, making them have different values even if they are copies of each other.
The idea that AIs will use logical decision theory
My response: This argument appears to misunderstand what makes coordination difficult. Coordination is not mainly about what decision theory you use. It’s more about being able to synchronize your communication and efforts without waste. See also: the literature on diseconomies of scale.
The idea that a single AI agent will recursively self-improve to become vastly more powerful than everything else in the world
My response: I think this argument, and others like it, are undermined by the arguments against fast takeoff given by Paul Christiano, Katja Grace, and Robin Hanson, and I largely agree with what they’ve written about it. For example, here’s Paul Christiano’s take.
Maybe AIs will share collective grievances with each other, prompting a natural alliance among them against humans
My response: if true, we can take steps to mitigate this issue. For example, we can give AIs legal rights, lessening their motives to revolt. While I think this is a significant issue, I also think it’s tractable to solve.
Objection 2: Even if a unified agent can take over the world, it is unlikely to be in its best interest to try to do so
The central argument here would be premised on a model of rational agency, in which an agent tries to maximize benefits minus costs, subject to constraints
The agent would be faced with a choice:
(1) Attempt to take over the world, and steal everyone’s stuff, or
(2) Work within a system of compromise, trade, and law, and get very rich within that system, in order to e.g. buy lots of paperclips
The question of whether (1) is a better choice than (2) is not simply a question of whether taking over the world is “easy” or whether it could be done by the agent. Instead it is a question of whether the benefits of (1) outweigh the costs, relative to choice (2).
It seems likely that working within a system of compromise, trade, and law is more efficient than trying to take over the world, even if you can take over the world. The reason is that subverting the system basically means “going to war” with other parties, which is not usually very efficient, even against weak opponents.
The literature on the economics of war generally predicts that going to war is worse than compromising, assuming both parties are rational and open to compromise (a toy numerical sketch of this trade-off follows the list below). This is mostly because:
War is wasteful. You need to spend resources fighting it, which could be productively spent doing other things.
War is risky. Unless you can win a war with certainty, you might lose the war after launching it, which is a very bad outcome if you have some degree of risk-aversion.
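To make the cost-benefit comparison concrete, here is a toy sketch (with entirely made-up numbers, not figures from the outline) comparing the expected utility of attempting takeover against trading within the system, for a mildly risk-averse agent, using the waste and risk factors above as the two penalties.

```python
# Toy cost-benefit sketch (all numbers are made-up assumptions, not estimates).
# Compare (1) attempting takeover vs. (2) trading within the existing system,
# for a mildly risk-averse agent (log utility).
import math

world_resources = 100.0  # total resources at stake (arbitrary units)
trade_share     = 0.40   # share the agent expects to earn peacefully (assumed)
p_success       = 0.80   # probability a takeover attempt succeeds (assumed)
war_waste       = 0.30   # fraction of resources destroyed by the conflict (assumed)
loser_share     = 0.01   # what the agent keeps if the attempt fails (assumed)

def utility(resources: float) -> float:
    """Concave (risk-averse) utility of holding a given amount of resources."""
    return math.log(resources)

eu_trade = utility(trade_share * world_resources)
eu_takeover = (
    p_success * utility((1.0 - war_waste) * world_resources)
    + (1.0 - p_success) * utility(loser_share * world_resources)
)

print(f"EU(trade within the system) = {eu_trade:.2f}")
print(f"EU(attempt takeover)        = {eu_takeover:.2f}")
```

With these assumed numbers the peaceful strategy comes out ahead even though a takeover attempt would probably succeed; the point is only that waste, failure risk, and risk-aversion all enter the comparison, not that these particular numbers are right.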
The fact that “humans are weak and can be easily beaten” cuts both ways:
Yes, it means that a very powerful AI agent could “defeat all of us combined” (as Holden Karnofsky said)
But it also means that there would be little benefit to defeating all of us, because we aren’t really a threat to its power
Conclusion: An AI decisive strategic advantage is still somewhat plausible because revolutions have happened in history, and revolutions seem like a reasonable reference class to draw from. That said, the probability of a catastrophic AI takeover in humanity’s relatively near-term future (say, the next 50 years) seems low (maybe a 10% chance of happening). However, it’s perhaps significantly more likely in the very long run.
Your argument in objection 1 doesn’t address the position of people who are worried about an absurd offense-defense imbalance.
Additionally: It may be that no agent can take over the world, but that an agent can destroy the world. Would someone build something like that? Sadly, I think the answer is yes.
I’m having trouble parsing this sentence. Can you clarify what you meant?
What incentive is there to destroy the world, as opposed to take it over? If you destroy the world, aren’t you sacrificing yourself at the same time?
Oh, I can see why it is ambiguous. I meant whether it is easier to attack or defend, which is separate from how much “power” attackers and defenders have.
“What incentive is there to destroy the world, as opposed to take it over? If you destroy the world, aren’t you sacrificing yourself at the same time?”
Some would be willing to do that if they can’t take it over.
What reason is there to think that AI will shift the offense-defense balance absurdly towards offense? I admit such a thing is possible, but it doesn’t seem like AI is really the issue here. Can you elaborate?
I think the main abstract argument for why this is plausible is that AI will change many things very quickly and in a high-variance way, while some human processes will lag far behind.
This could plausibly (though not obviously) lead to offense dominance.
I’m not going to fully answer this question, because I have other work I should be doing, but I’ll toss in one argument. If different domains (cyber, bio, manipulation, etc.) have different offense-defense balances, a sufficiently smart attacker will pick the domain with the worst balance. This recurses down further for at least some of these domains, since they aren’t just a single thing, but a broad collection of vaguely related things.
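(A minimal sketch of the “attacker picks the weakest domain” point, with placeholder numbers that are not claims about the real balances: even if most domains favor defense, a strategic attacker faces the most favorable balance available to it, i.e. the maximum across domains rather than the average.)

```python
# Placeholder numbers only: each domain gets an offense/defense cost ratio,
# where < 1 means defense is cheaper (defense-favored) and > 1 means offense
# is cheaper (offense-favored).
offense_defense_ratio = {
    "cyber":        0.5,  # assumed defense-favored
    "bio":          2.0,  # assumed offense-favored
    "manipulation": 0.8,  # assumed mildly defense-favored
}

# A strategic attacker does not face the average balance across domains;
# it picks its battleground, so it faces the best balance available to it.
average_balance  = sum(offense_defense_ratio.values()) / len(offense_defense_ratio)
attacker_balance = max(offense_defense_ratio.values())

print(f"average ratio across domains:          {average_balance:.2f}")
print(f"ratio in the attacker's chosen domain: {attacker_balance:.2f}")
```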
I sympathise with/agree with many of your points here (and in general regarding AI x-risk), but something about this recent sequence of quick-takes isn’t landing with me in the way some of your other work has. I’ll try and articulate why in some cases, though I apologise if I misread or misunderstand you.
On this post, these two premises/statements raised an eyebrow:
3. Instead, AI agents will compromise, trade, and act within a system of laws indefinitely, in order to achieve their objectives, similar to what humans do now
4. Because this system of laws will descend from our current institutions and legal tradition, it is likely that humans will keep substantial legal rights, potentially retaining lots of wealth from our capital investments and property, even if we become relatively powerless compared to other AI agents in the system
To me, this is just as unsupported as the view of people who are incredibly certain that there will be a ‘treacherous turn’. I get that this is a supposition/alternative hypothesis, but how can you possibly hold a premise that a system of laws will persist indefinitely? This sort of reminds me of the Leahy/Bach discussion where Bach just says ‘it’s going to align itself with us if it wants to, if it likes us, if it loves us’. I kinda want more than that: if we’re going to build these powerful systems, saying ‘trust me bro, it’ll follow our laws and norms and love us back’ doesn’t sound very convincing to me. (For clarity, I don’t think this is your position or framing, and I’m not a fan of the classic/Yudkowskian risk position. I want to say I find both perspectives unconvincing.)
Secondly, people abide by systems of laws and norms, but we also have many cases where individuals/parties/groups overturned these norms once they had accumulated enough power and didn’t feel the need to abide by the existing regime. This doesn’t have to look like the traditional DSA model where humanity gets instantly wiped out, but I don’t see why there couldn’t be a future where an AI makes a move like Sulla using force to overthrow and depower the opposing factions, or the coup of 18 Brumaire.
“Objection 1: It is unlikely that there will be a point at which a unified agent will be able to take over the world, given the existence of competing AIs with comparable power”
For what it’s worth, the Metaculus crowd forecast for the question “Will transformative AI result in a singleton (as opposed to a multipolar world)?” is currently “60%”. That is, forecasters believe it’s more likely than not that there won’t be competing AIs with comparable power, which runs counter to your claim.
(I bring this up seeing as you make a forecasting-based argument for your claim.)