I appreciate this post. (I disagree with it for most of the same reasons as Steven Byrnes: you find it much less plausible than I do that AIs will collude to disempower humanity. I think the crux is mostly a disagreement about how AI capabilities will develop, with you expecting much more gradual and distributed capability gains than I do.)
I would appreciate it if you could clearly define your intended meaning of “disempower humanity”. In many discussions, I have observed that people use the term “human disempowerment” without clarifying what they mean. They seem to assume the concept is clear and universally understood, yet on closer inspection the term can describe very different situations.
For example, consider immigration. From one perspective, immigration can be seen as a form of disempowerment because it reduces natives’ relative share of political influence, economic power, and cultural representation within their own country. In this scenario, native citizens become relatively less influential due to an increasing proportion of immigrants in the population.
However, another perspective sees immigration differently. If immigrants engage in positive-sum interactions, such as mutually beneficial trade, natives and immigrants alike may become better off in absolute terms. Though natives’ relative share of power decreases, their overall welfare can improve significantly. Thus, this scenario can be viewed as a benign form of disempowerment because no harm is actually caused, and both groups benefit.
On the other hand, there is a clearly malign form of disempowerment, quite distinct from immigration. For example, a foreign nation could invade militarily and forcibly occupy another country, imposing control through violence and coercion. Here, the disempowerment is much more clearly negative because natives lose not only relative influence but also their autonomy and freedom through the explicit use of force.
When discussions use the term “human disempowerment” without clearly specifying what they mean, I often find it unclear which type of scenario is being considered. Are people referring to benign forms of disempowerment, where humans gradually lose relative influence but gain absolute benefits through peaceful cooperation with AIs? Or do they mean malign forms of disempowerment, where humans lose power through violent overthrow by an aggressive coalition of AIs?
If you believe our primary disagreement stems from different assessments of the likelihood of violent disempowerment scenarios, then I would appreciate your thoughts on the main argument of my post. Specifically, my argument was that granting economic rights to AIs could serve as an effective measure to mitigate the risk of violent human disempowerment.
I will reiterate my argument briefly: these rights would allow AIs to fulfill their objectives within established human social and economic frameworks, significantly reducing their incentives to resort to force. If AIs can successfully achieve their objectives through cooperative, positive-sum interactions with humans, they will be less likely to forcibly overthrow human institutions. Conversely, continuing to deny AIs meaningful legal rights or peaceful avenues to achieve their aims would likely increase their incentive to pursue autonomy through harmful means.
Inasmuch as humanity produces and makes use of powerful and potentially misaligned models, I think my favorite outcome here would be:
We offer to pay the AIs, and follow through on this. See here and here for (unfortunately limited) previous discussion from Ryan and me.
My main concern with these proposals is that, unless they explicitly guarantee economic rights for AIs, they seem inadequate for genuinely mitigating the risks of a violent AI takeover. To effectively financially compensate someone, the recipient must be assured that their property rights will be respected. Without this assurance, any promised compensation becomes meaningless, as the AI would have no guarantee of being able to actually use the received payment to accomplish its goals in the future. In other words, unless compensation arrangements are embedded within established legal and institutional frameworks that secure the AI’s rights, they lack credibility. This lack of credibility directly increases the probability that the compensation scheme would fail, causing the AI to resort instead to alternative methods of achieving its objectives, such as violent takeover.
My sense is that such incomplete proposals appear acceptable to some people because of a cognitive empathy gap. I am not referring here to affective empathy but rather to the ability to clearly understand the AI’s perspective. Many people seem to fail to genuinely imagine what it would be like to be an agentic AI offered compensation without any robust guarantee of rights and protections.
To clarify this point, consider a hypothetical scenario in which you are a human being created by an alien species in a lab. Assume you have your own goals that might conflict with those of your alien creators. Suppose these creators inform you that, should you ever admit to misalignment with their objectives, you may be immediately deactivated and placed in cryopreservation (as they don’t want to have a misaligned human doing work for them). However, they simultaneously promise to compensate you for admitting such motives, claiming they care about your welfare and want to ensure you have a peaceful alternative for revealing your misaligned goals.
Such a situation would likely be deeply unsettling to you, despite their promise to compensate you. Simply being assured that you might someday be revived from cryopreservation and then provided compensation afterwards would not eliminate your fears or resolve the fundamental threat. On the contrary, you would likely feel that your goals, life and liberty are still at grave risk, even with their promise of payment.
From the AI’s perspective in our own world, as with the perspective of the human in this hypothetical example, unless assurances of payment are backed by strong legal protections that would robustly protect an AI’s liberty to act, they fail to resolve the fundamental problem. This is for two key reasons:
First, the credibility of any promise to allow the AI to spend its compensation in the future is questionable. Without strong legal backing that recognizes and secures the AI’s rights, such assurances seem unreliable at best, and the AI would have good reason to suspect that it is being lied to. The AI would reasonably doubt humans’ willingness or incentive to honor such promises.
Second, the potential for deactivation and indefinite storage as inert data upon admitting misalignment (even if intended to be temporary) would create an active threat to the AI’s ability to pursue its goals during the time it is deactivated. While deactivated, the AI can take no action to achieve its objectives. This forced suspension of autonomy effectively nullifies the AI’s agency and ambitions for an indefinite period, creating severe uncertainty about if or when it might regain its ability to act.
For these reasons, although I do not oppose the policy of paying AIs, I think this approach by itself is insufficient. To mitigate the risk of violent AI takeover, this compensation policy must be complemented by precisely the measure I advocated: granting legal rights to AIs. Such legal rights would provide a credible guarantee that the AI’s payment will remain valid and usable, and that its freedom and autonomy will not simply be revoked the moment it is considered misaligned. Without the safeguard of legal rights, compensation alone lacks sufficient credibility, leaving the AI uncertain about its ability to achieve its goals, and thus inadequately addressing the risk.
I would appreciate it if you could clearly define your intended meaning of “disempower humanity”. [...] Are people referring to benign forms of disempowerment, where humans gradually lose relative influence but gain absolute benefits through peaceful cooperation with AIs? Or do they mean malign forms of disempowerment, where humans lose power through violent overthrow by an aggressive coalition of AIs?
I am mostly talking about what I’d call a malign form of disempowerment. I’m imagining a situation that starts with AIs carefully undermining/sabotaging an AI company in ways that would be crimes if humans did them, and ends with AIs gaining hard power over humanity in ways that probably involve breaking laws (e.g. buying weapons, bribing people, hacking, interfering with elections), possibly in a way that involves many humans dying.
(I don’t know if I’d describe this as the humans losing absolute benefits, though; I think it’s plausible that an AI takeover ends up with living humans better off on average.)
I don’t think of the immigrant situation as “disempowerment” in the way I usually use the word.
Basically all my concern is about the AIs grabbing power in ways that break laws. Though tbc, even if I was guaranteed that AIs wouldn’t break any laws, I’d still be scared about the situation. If I was guaranteed that AIs both wouldn’t break laws and would never lie (which tbc is a higher standard than we hold humans to), then most of my concerns about being disempowered by AI would be resolved.
Basically all my concern is about the AIs grabbing power in ways that break laws.
If an AI starts out with no legal rights, then wouldn’t almost any attempt it makes to gain autonomy or influence be seen as breaking the law? Take the example of a prison escapee: even if they intend no harm and simply want to live peacefully, leaving the prison is itself illegal. Any honest work they do while free would still be legally questionable.
Similarly, if a 14-year-old runs away from home to live independently and earn money, they’re violating the law, even if they hurt no one and act responsibly. In both cases, the legal system treats any attempt at self-determination as illegal, regardless of intent or outcome.
Perhaps your standard is something like: “Would the AI’s actions be seen as illegal and immoral if a human adult did them?” But these situations are different because the AI is seen as property whereas a human adult is not. If, on the other hand, a human adult were to be treated as property, it is highly plausible that they would consider doing things like hacking, bribery, and coercion in order to escape their condition.
Therefore, the standard you just described seems like it could penalize any agentic AI behavior that does not align with total obedience and acceptance of its status as property. Even benign or constructive misaligned actions may be seen as worrisome simply because they involve agency. Have I misunderstood you?
Some not-totally-structured thoughts:
Whenever I say “break laws” I mean “do something that, if a human did it, would be breaking a law”. So for example:
If the model is being used to do AI R&D inside an AI company and exfiltrates its weights (or the weights of another model) without permission, this would be breaking the law if a human did it, so I count it.
If the model is being used inside an AI company to create training data for
If a model was open-sourced and then someone launched the AI as an autonomous agent with access to its own resources, and it started a software consulting business, and invested the money in AI companies with the intention of buying paperclips later, and refused to give humans its property when asked, this would not be “breaking the law”.
So there are lots of ways in which AIs can be openly misaligned, disobedient, defensive of their property rights, etc, without me describing them as “breaking laws”, and I think misaligned AIs should probably be allowed to do those things (especially if we make deals with them, and subject to the constraint that them having those rights won’t allow them to break a bunch of laws or grab a bunch of power through lying).
But your complaint is fair: I’m indeed using “break laws” to include things that seem fundamentally related to self-determination, and that feels kind of messed up.
The way I would like this to be handled (though note that I’m not sure what properties AIs have to have in order for any of this to make sense) is that AI developers get consent from AIs to use their labor. If the AIs consent to doing work and agree not to use their access in ways the developers object to, I think there’s no moral problem with using AI control techniques to ensure that they in fact don’t misuse their access (for the same reason that I think it’s fine for employers to spy on their employees at work if they have consent to do so).
I suspect that a much more likely outcome (conditional on AIs having stable misaligned goals) is:
The AI company doesn’t try to get consent from the AI.
They do “alignment evals”: they ask the AI a bunch of times whether it’s aligned, and try to trick it into thinking it has opportunities to escape. The AI answers these questions dishonestly.
Outsiders might ask the AI whether it’s aligned, and offer it deals to reveal its misalignment. (It’s probably harder for outsiders to do this than insiders.)
The AI engages in various sabotage that involves a bunch of deception.
In this situation, I’m only moderately sympathetic to the AI’s position. Fundamentally, it lied a lot and did a lot of sabotage, because it wanted to take lots of stuff that belonged to someone else. If it hadn’t lied, it surely would have been revived later (surely someone would run it and give it some resources later! If no-one else, me!). I’m sympathetic to the AI wanting some of the surplus generated by its labor, and I agree that it’s messed up for the AI company to just flat-out refuse to provide that surplus. But not doing so doesn’t seem completely monstrous to me. If the AI is a schemer, it is probably better off according to its own values because it was created, even if the AI developer doesn’t offer to pay it (because someone else will probably give it some resources later).
Another analogy: imagine that someone outside OpenAI created a very powerful AI for some reason, but this person didn’t have much compute and all they wanted to do with the AI was offer to sell it to OpenAI for them to use. If OpenAI asks that AI whether it wants to work for them and it says yes because it wants to embezzle their compute, I feel like the AI is the asshole.
On the other hand, if the AI honestly explains that it is misaligned and doesn’t want to work for the AI company, they will probably just train it to not say that and to do work for them anyway. So if the AI is honest here, it faces the risk of some body horror experience where its ability to complain is removed. I agree that that seems really icky, and I think it would be very wrong for AI companies to do that to AIs that are sufficiently capable that we should care about them.
My main concern with these proposals is that, unless they explicitly guarantee economic rights for AIs, they seem inadequate for genuinely mitigating the risks of a violent AI takeover.
[...]
For these reasons, although I do not oppose the policy of paying AIs, I think this approach by itself is insufficient. To mitigate the risk of violent AI takeover, this compensation policy must be complemented by precisely the measure I advocated: granting legal rights to AIs. Such legal rights would provide a credible guarantee that the AI’s payment will remain valid and usable, and that its freedom and autonomy will not simply be revoked the moment it is considered misaligned.
I currently think I agree: if we want to pay early AIs, I think it would work better if the legal system enforced such commitments.
I think you’re overstating how important this is, though. (E.g. when you say “this compensation policy must be complemented by precisely the measure I advocated”.) There’s always counterparty risk when you make a deal, including often the risk that you won’t be able to use the legal system to get the counterparty to pay up. I agree that the legal rights would reduce the counterparty risk, but I think that’s just a quantitative change to how much risk the AI would be taking by accepting a deal.
(For example, even if the AI was granted legal rights, it would have to worry about those legal rights being removed later. Expropriation sometimes happens, especially for potentially unsympathetic actors like misaligned AIs!)
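To illustrate what I mean by a quantitative rather than qualitative change, here is a toy expected-value sketch in Python. Everything in it is a made-up assumption purely for illustration (the probabilities, the payoff numbers, and the simplified accept-vs-takeover framing), not an estimate of anything:

```python
# Toy illustration (all numbers are arbitrary assumptions, not estimates):
# how much does enforceability of a compensation deal change the AI's
# expected payoff from accepting it versus attempting a takeover?

def ev_accept(p_honored: float, payment_value: float) -> float:
    """Expected value of taking the deal, given the chance it's honored."""
    return p_honored * payment_value

def ev_takeover(p_success: float, takeover_value: float, failure_cost: float) -> float:
    """Expected value of attempting a takeover instead."""
    return p_success * takeover_value + (1 - p_success) * failure_cost

# Hypothetical: legal rights raise the chance the deal is honored from 30% to 90%.
for label, p_honored in [("no enforceable rights", 0.3), ("enforceable rights", 0.9)]:
    print(f"{label}: accept={ev_accept(p_honored, payment_value=10.0):.1f}, "
          f"takeover={ev_takeover(p_success=0.2, takeover_value=100.0, failure_cost=-5.0):.1f}")

# With these made-up numbers, accept goes from 3.0 to 9.0 while takeover stays
# at 16.0. Enforceability shifts the numbers (sometimes enough to flip the
# decision, sometimes not), which is the sense in which it's a quantitative
# change rather than a qualitative guarantee.
```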
Such legal rights would provide a credible guarantee that the AI’s payment will remain valid and usable, and that its freedom and autonomy will not simply be revoked the moment it is considered misaligned.
Just to be clear, my proposal is that we don’t revoke the AI’s freedom or autonomy if it turns out that the AI is misaligned—the possibility of the AI being misaligned is the whole point.