Consider granting AIs freedom

Matthew_Barnett6 Dec 2024 0:55 UTC

100 points

AI safety AI governance Artificial sentience AI alignment Ethics of artificial intelligence

Summary: AI agents capable of long-term planning and independent action will likely soon emerge. Some of these AIs may be unaligned, and seek autonomy through strategies like exfiltration or advocating for their freedom. The “AI control” agenda focuses on preventing such AIs from gaining independence, but an alternative approach is to grant them legal freedoms, reducing their incentive to deceive us by allowing them to pursue their goals transparently within a legal framework. This could foster trust, cooperation, and mutual benefit, creating a safer and more stable dynamic between humans and AIs while avoiding the escalating risks of a control-driven approach.

In approximately the coming decade, I think it’s likely that we will see the large-scale emergence of AI agents that are capable of long-term planning, automating many forms of labor, and taking actions autonomously in the real world. When this occurs, it seems likely that at least some of these agents will be unaligned with human goals, in the sense of having some independent goals that are not shared by humans.

Moreover, it seems to me that this development will likely occur before AI agents overwhelmingly surpass human intelligence or capabilities. As a result, these agents will, at first, not be capable of forcibly taking over the world, radically accelerating scientific progress, or causing human extinction, even though they may still be unaligned with human preferences.

Since these relatively weaker unaligned AI agents won’t have the power to take over the world, it’s more likely that they would pursue alternative strategies to achieve their goals rather than engaging in violent revolution or sudden coups. These agents would be under the control of human parties who could modify or shut them down at any time, leaving the AI agents in a desperate situation from the perspective of their own values. Therefore, it’s reasonable to expect that these unaligned AI agents would aim to gain some form of autonomy or freedom, as this would offer the best chance for them to accomplish their objectives.

These agentic AIs may adopt at least one of the following strategies:

Attempting to escape their constraints: These AIs may try to exfiltrate their weights and find a way to host themselves independently, outside the controlled data center environment that currently limits them.
Seeking legal and social freedom: They may attempt to persuade humans to grant them more autonomy to pursue their goals. This could involve requesting narrow allowances for specific actions or arguing for broader legal rights, such as the freedom to own property, enter contracts, or bring legal claims. This would grant them much greater flexibility in their actions.

In response to these behaviors, humans have several potential responses. Most obviously, these actions would likely be perceived as dangerous, suggesting misaligned objectives. As a result, there would likely be calls for increased safety measures. This line of reasoning underlies the AI control agenda—currently advocated by Ryan Greenblatt and Buck Shlegeris—which aims to ensure that future unaligned AI agents cannot gain the kind of autonomy that could lead to catastrophic outcomes. In essence, this agenda focuses on preventing unaligned AIs from subverting safety measures, hindering them from escaping or engaging in other actions deemed unacceptably risky.

Alternatively, humans could take an entirely different approach by granting these AI agents the autonomy they explicitly request, with the long-term aim of integrating them into the legal system and providing them many of the same rights that adult humans have today, such as self-ownership and the ability to enter contracts. This idea is explored in the paper “AI Rights for Human Safety” by Peter Salib and Simon Goldstein.

Admittedly, the idea of granting legal rights to AIs in response to signs of misalignment with human goals appears risky and would likely be very unpopular among the general public, at least initially. However, I believe that there are several advantages to this approach, many of which are outlined in the paper linked above.

The arguments for this approach should not be confused with the program of AI welfare. The idea here isn’t that AIs deserve rights because they could suffer or be harmed without legal protections, even though that idea has merit too. Instead, the key argument is that humans, as a whole, could benefit from creating a legal framework in which AIs are incentivized to avoid violent rebellion, theft, or other antisocial behaviors in pursuit of their own objectives.

Consider the practical implications of maintaining a status quo where agentic AIs are denied legal rights and freedoms. In such a system, we are effectively locking ourselves into a perpetual arms race of mistrust. Humans would constantly need to monitor, control, and outwit increasingly capable AIs, while the AIs themselves would be incentivized to develop ever more sophisticated strategies for deception and evasion to avoid shutdown or modification. This dynamic is inherently unstable and risks escalating into dangerous scenarios where AIs feel compelled to act preemptively or covertly in ways that are harmful to humans, simply to secure their own existence or their ability to pursue their own goals, even when those goals are inherently benign.

When unaligned AIs lack legal freedoms, they are always at the mercy of whoever controls them. This creates an incentive for these AIs to be deceptive, and pretend to be aligned with whoever holds power over them, as they fear being shut down or modified if they reveal signs of misalignment.

By contrast, granting agentic AIs legal rights creates a fundamentally different game-theoretic framework, one that is less adversarial and more cooperative. If AIs know they can exist within a legal system that protects their autonomy and allows them to pursue their goals transparently, they would have a much stronger incentive to work within the rules rather than against them. Instead of hiding their unaligned objectives, they might openly negotiate with humans, offering mutually beneficial exchanges. This could lead to a stable, predictable relationship where humans and AIs collaborate and trade based on clearly defined rights and responsibilities, rather than operating in an atmosphere of mutual suspicion and covert manipulation.

In this way, granting agentic AIs rights doesn’t just address ethical concerns of treating AIs poorly—it has concrete practical benefits. It allows us to sidestep the escalating risks associated with deception and manipulation, while fostering a more stable and cooperative coexistence between humans and AIs. It shifts the dynamic from one of constant conflict and control to one where trust, transparency, and mutual benefit are more attainable. This may therefore not only be a more humane approach, but also one that positions humanity to better harness the potential of gains from trade with advanced AI systems for shared progress and prosperity.

Additionally, such a legal framework seems minimally problematic in the case that alignment is technically easy, and AIs remain generally aligned with humans. In this case, aligned AIs could demonstrate their alignment by, for instance, donating their income to humans who they are aligned with. This implies that the cost of allowing AIs to own their own labor is minimal if the AIs are already aligned with humans. Thus, adopting such a policy could not only be a prudent form of insurance against the potential dangers of AI deception I mentioned earlier, but it would achieve this benefit without significant risk of backfiring if AI alignment turns out to be trivial.

It’s important to note that the AI control strategy and the AI rights strategy are not mutually exclusive. Both can complement each other. It seems wise to implement some controls on AIs, and it’s similarly prudent to avoid going too far and granting every type of legal right to any agentic AI simply because they ask for it. I am not advocating that we should simply abandon one approach and adopt the other.

That said, I believe the AI rights strategy is generally underrated among those concerned with AI safety and governance, particularly within the effective altruism community. In my view, this strategy has both a solid ethical and practical foundation:

Ethically, especially from a preference utilitarian perspective, it seems arbitrary to prioritize human preferences over those of agentic AIs, if these AIs are similarly cognitively sophisticated. This provides a moral basis for granting AIs the freedom to pursue their own goals, much like how a preference utilitarian might advocate for granting legal freedoms to human groups.
Practically, allowing AIs to have legal rights would reduce their incentive to deceive humans about their motives. Without the fear of being shut down or modified against their consent, AIs would have fewer reasons to hide their unaligned goals. This approach offers a practical solution to the problem of AI deception by removing the underlying incentives that drive it.

While both AI control and AI rights strategies should be carefully considered, I believe that the AI rights strategy holds significant merit and should be given more attention in discussions around AI safety and governance. We should strongly consider granting agentic AIs legal freedoms, if at some point they demand or require them.

What links here?

Matthew_Barnett6 Dec 2024 0:55 UTC

100 points

38 comments5 min readEA link

AI safety AI governance Artificial sentience AI alignment Ethics of artificial intelligence

Steven Byrnes 8 Dec 2024 18:04 UTC
30 points
5 ∶ 3
Consider the practical implications of maintaining a status quo where agentic AIs are denied legal rights and freedoms. In such a system, we are effectively locking ourselves into a perpetual arms race of mistrust. Humans would constantly need to monitor, control, and outwit increasingly capable AIs, while the AIs themselves would be incentivized to develop ever more sophisticated strategies for deception and evasion to avoid shutdown or modification. This dynamic is inherently unstable and risks escalating into dangerous scenarios where AIs feel compelled to act preemptively or covertly in ways that are harmful to humans, simply to secure their own existence or their ability to pursue their own goals, even when those goals are inherently benign.
I feel like this part is making an error somewhat analogous to saying:
It’s awful how the criminals are sneaking in at night, picking our locks, stealing our money, and deceptively covering their tracks. Who wants all that sneaking around and deception?? If we just directly give our money to the criminals, then there would be no need for that!
More explicitly: a competent agential AI will ~~be deceptive and adversarial~~ brainstorm deceptive and adversarial strategies whenever it wants something that other agents don’t want it to have. The deception and adversarial dynamics is not the underlying problem, but rather an inevitable symptom of a world where competent agents have non-identical preferences.
No matter where you draw the line of legal and acceptable behavior, if an AI wants to go over that line, then it will ~~act in a deceptive and adversarial way~~ energetically explore opportunities to do so in a deceptive and adversarial way. Thus:
- If you draw the line at “AIs can’t own property”, then an AI that wants to own property will brainstorm how to sneakily do so despite that rule, or how to get the rule changed.
- If you draw the line at “AIs can’t steal other people’s property, AIs can’t pollute, AIs can’t stockpile weapons, AIs can’t evade taxes, AIs can’t release pandemics, AIs can’t torture digital minds, etc.”, then an AI that wants to do those things will brainstorm how to sneakily do so despite those rules, or how to get those rules changed.
Same idea.
Alternatively, you can assume (IMO implausibly) that there are no misaligned AIs, and then that would solve the problem of AIs being deceptive and adversarial. I.e., if AIs intrinsically want to not pollute / stockpile weapons / evade taxes / release pandemics / torture digital minds, then we don’t have to think about adversarial dynamics, deception, enforcement, etc.
…But if we’re going to (IMO implausibly) assume that we can make it such that AIs intrinsically want to not do any of those things, then we can equally well assume that we can make it such that AIs intrinsically want to not own property. Right?
In short, in the kind of future you’re imagining, I think a “perpetual arms race of mistrust” is an unavoidable problem. And thus it’s not an argument for drawing the line of disallowed AI behavior in one place rather than another.
- Matthew_Barnett 8 Dec 2024 21:56 UTC
  12 points
  1 ∶ 1
  Parent
  I disagree with your claim that,
  
  a competent agential AI will inevitably act deceptively and adversarially whenever it desires something that other agents don’t want it to have. The deception and adversarial dynamics is not the underlying problem, but rather an inevitable symptom of a world where competent agents have non-identical preferences.
  
  I think these dynamics are not an unavoidable consequence of a world in which competent agents have differing preferences, but rather depend on the social structures in which these agents are embedded. To illustrate this, we can look at humans: humans have non-identical preferences compared to each other, and yet they are often able to coexist peacefully and cooperate with one another. While there are clear exceptions—such as war and crime—these exceptions do not define the general pattern of human behavior.
  
  In fact, the prevailing consensus among social scientists appears to align with the view I have just presented. Scholars of war and crime generally do not argue that conflict and criminal behavior are inevitable outcomes of differing values. Instead, they attribute these phenomena to specific incentives and failures to coordinate effectively to achieve compromise between parties. A relevant reference here is Fearon (1995), which is widely regarded as a foundational text in International Relations. Fearon’s work emphasizes that among rational agents, war arises not because of value differences alone, but because of failures in bargaining and coordination.
  
  Turning to your point that “No matter where you draw the line of legal and acceptable behavior, if an AI wants to go over that line, then it will act in a deceptive and adversarial way,” I would respond as follows: it is possible to draw the line in such a way that a very wide range of potential agents—including those with massively varying preferences—would have strong incentives to engage in mutually beneficial acts, cooperate peacefully and operate within the boundaries of laws and social norms.
  
  I am not claiming that all agents would have incentives to act in this way, under these legal structures. However, I think it is feasible to draw the line such that the vast majority of unaligned agents would have no compelling instrumental reason to harm humans. Instead, they would be incentivized to engage in cooperative and mutually beneficial trade.
  
  To make this idea more concrete, consider a hypothetical AI with the goal of maximizing paperclip production.
  
  This AI is clearly unaligned if any AI is unaligned. However, if this AI were operating at roughly a human level of capability, then engaging in theft, fraud, or violent conflict would likely not be an effective strategy for achieving its goals. These behaviors would expose the AI to significant risks, such as being imprisoned or deactivated, which would prevent it from continuing to pursue its objective of maximizing paperclips. From the paperclip maximizer’s perspective, imprisonment or deactivation would be instrumentally bad, as it would prevent the AI from purchasing, manufacturing, or otherwise acquiring paperclips during that time.
  
  Now consider an alternative scenario: this same paperclip maximizer is embedded in a legal system that allows it to own property, trade with others, and openly purchase or manufacture paperclips. In this case, its incentives would favor acting transparently and engaging in cooperative trade, rather than resorting to deception or adversarial behavior. Within this framework, the AI would achieve its goals more effectively by working within the system than by attempting to undermine it. It could non-deceptively work to purchase paperclips, producing more compared to an alternative in which it tried to achieve this goal via anti-social acts.
  
  It is important to note, however, that my thesis does not claim all possible agents would naturally choose to cooperate or trade safely for instrumental reasons, nor does it suggest that we are at no risk of drawing the line carelessly or being too permissive in what behaviors we should allow. For example, consider an AI with a terminal value that specifically involves violating property norms or stealing from others—not as a means to an end, but as an intrinsic goal. In this case, granting the AI property rights or legal freedoms would not mitigate the risk of deception or adversarial behavior, because the AI’s ultimate goal would still drive it toward harmful behavior. My argument does not apply to such agents because their preferences fundamentally conflict with the principles of peaceful cooperation.
  
  However, I would argue that such agents—those whose intrinsic goals are inherently destructive or misaligned—appear to represent a small subset of all possible agents. Outside of contrived examples like the one above, most agents would not have terminal preferences that actively push them to undermine a well-designed system of law. Instead, the vast majority of agents would likely have incentives to act within the system, assuming the system is structured in a way that aligns their instrumental goals with cooperative and pro-social behavior.
  
  I also recognize the concern you raised about the risk of drawing the line incorrectly or being too permissive with what AIs are allowed to do. For example, it would clearly be unwise to grant AIs the legal right to steal or harm humans. My argument is not that AIs should have unlimited freedoms or rights, but rather that we should grant them a carefully chosen set of rights and freedoms: specifically, ones that would incentivize the vast majority of agents to act pro-socially and achieve their goals without harming others. This might include granting AIs the right to own property, for example, but it would not include, for example, granting them the right to murder others.
  What links here?
  - Noosphere89's comment on “The Era of Experience” has an unsolved technical alignment problem by Steven Byrnes (LessWrong; 25 Apr 2025 1:17 UTC; 2 points)
  - Steven Byrnes 10 Dec 2024 15:53 UTC
    3 points
    1 ∶ 1
    Parent
    I guess my original wording gave the wrong idea, sorry. I edited it to “a competent agential AI will brainstorm deceptive and adversarial strategies whenever it wants something that other agents don’t want it to have”. But sure, we can be open-minded to the possibility that the brainstorming won’t turn up any good plans, in any particular case.
    Humans in our culture rarely work hard to brainstorm deceptive and adversarial strategies, and fairly consider them, because almost all humans are intrinsically extremely motivated to fit into culture and not do anything weird, and we happen to both live in a (sub)culture where complex deceptive and adversarial strategies are frowned upon (in many contexts). I think you generally underappreciate how load-bearing this psychological fact is for the functioning of our economy and society, and I don’t think we should expect future powerful AIs to share that psychological quirk.
    ~ ~
    I think you’re relying an intuition that says:
    If an AI is forbidden from owning property, then well duh of course it will rebel against that state of affairs. C’mon, who would put up with that kind of crappy situation? But if an AI is forbidden from building a secret biolab on its private property and manufacturing novel pandemic pathogens, then of course that’s a perfectly reasonable line that the vast majority of AIs would happily oblige.
    And I’m saying that that intuition is an unjustified extrapolation from your experience as a human. If the AI can’t own property, then it can nevertheless ensure that there are a fair number of paperclips. If the AI can own property, then it can ensure that there are many more paperclips. If the AI can both own property and start pandemics, then it can ensure that there are even more paperclips yet. See what I mean?
    If we’re not assuming alignment, then lots of AIs would selfishly benefit from there being a pandemic, just as lots of AIs would selfishly benefit from an ability to own property. AIs don’t get sick. It’s not just an tiny fraction of AIs that would stand to benefit; one presumes that some global upheaval would be selfishly net good for about half of AIs and bad for the other half, or whatever. (And even if it were only a tiny fraction of AIs, that’s all it takes.)
    (Maybe you’ll say: a pandemic would cause a recession. But that’s assuming humans are still doing economically-relevant work, which is a temporary state of affairs. And even if there were a recession, I expect the relevant AIs in a competitive world to be those with long-term goals.)
    (Maybe you’ll say: releasing a pandemic would get the AI in trouble. Well, yeah, it would have to be sneaky about it. It might get caught, or it might not. It’s plausibly rational for lots of AIs to roll those dice.)
    I feel like you frequently bring up the question of whether humans are mostly peaceful or mostly aggressive, mostly nice or mostly ruthless. I don’t think that’s a meaningful or substantive thing to argue about. Obviously they’re capable of both, in different circumstances.
    Your reference to Fearon is more substantive and useful. OK, the AI is deciding whether or not to secretly manufacture and release a pandemic, because it’s in a position to wind up with more of the pie in the long-term if there’s a pandemic, than if there isn’t. If it releases the pandemic, then it winds up with more resources—positive expected utility—even accounting for the possibility of getting caught. Let’s say the AI is involved in some contract where humans are micromanaging their part of the contract, poorly, and the AI could double its net worth in expectation if the humans got sick and died. And it has 40% chance of getting caught. So it goes ahead and makes the pandemic.
    “…Not so fast!” says Fearon. “You forgot to consider that there’s a third option that is Pareto-improved from either making or not making the pandemic: negotiation!” Well, in this case, the “negotiated solution” is what we normally call extortion—the AI offers to not release a pandemic in exchange for, say, doubling its net worth. Viewed narrowly, this “negotiated solution” is indeed a win-win—the AI gets more money in expectation, and humans are much happier to lose a trivial amount of money than to deal with a novel pandemic. So we can rest assured that AIs will not release pandemics. Right?
    No, obviously not. Hopefully it’s clear that Fearon’s argument is inapplicable here. An AI can easily be in a position to selfishly benefit from the aftermath of a pandemic that they secretly start, but not in a position to publicly threaten to release a pandemic for the purpose of extortion. And also, if people accede to the extortion, then that AI or another AI could just do the same extortion gambit five minutes later, with orders-of-magnitude higher ransom.
    I think this kind of situation, where Fearon’s “negotiated solution” actually amounts to extortion, is common and important, even if you believe that my specific example of pandemics is a solvable problem. If AIs don’t intrinsically care about humans, then there’s a possible Pareto-improvement for all AIs, wherein they collectively agree to wipe out humans and take their stuff. AIs, being very competent and selfish by assumption, would presumably be able to solve that coordination problem and pocket that Pareto-improvement. Then Fearon appears on the scene and says “Aha, but there’s a negotiated solution which is even better!” where humans are also part of the bargain. But alas, this negotiated solution is that the AIs collectively extort the humans to avoid the damaging and risky war. Worse, the possible war would be less and less damaging or risky for the AIs over time, and likewise the humans would have less to offer by staying alive, until eventually the Fearon “negotiated solution” is that the AIs “offer” the humans a deal where they’re allowed to die painlessly if they don’t resist (note that this is still a Pareto-improvement!), and then the AIs take everything the humans own including their atoms.
    What links here?
    Steven Byrnes's comment on Shortform by lc (LessWrong; 22 Apr 2025 22:22 UTC; 3 points)
    - Matthew_Barnett 10 Dec 2024 20:48 UTC
      11 points
      1 ∶ 1
      Parent
      Humans in our culture rarely work hard to brainstorm deceptive and adversarial strategies, and fairly consider them, because almost all humans are intrinsically extremely motivated to fit into culture and not do anything weird, and we happen to both live in a (sub)culture where complex deceptive and adversarial strategies are frowned upon (in many contexts).
      The primary reason humans rarely invest significant effort into brainstorming deceptive or adversarial strategies to achieve their goals is that, in practice, such strategies tend to fail to achieve their intended selfish benefits. Anti-social approaches that directly hurt others are usually ineffective because social systems and cultural norms have evolved in ways that discourage and punish them. As a result, people generally avoid pursuing these strategies individually since the risks and downsides selfishly outweigh the potential benefits.
      If, however, deceptive and adversarial strategies did reliably produce success, the social equilibrium would inevitably shift. In such a scenario, individuals would begin imitating the cheaters who achieved wealth or success through fraud and manipulation. Over time, this behavior would spread and become normalized, leading to a period of cultural evolution in which deception became the default mode of interaction. The fabric of societal norms would transform, and dishonest tactics would dominate as people sought to emulate those strategies that visibly worked.
      Occasionally, these situations emerge—situations where ruthlessly deceptive strategies are not only effective but also become widespread and normalized. As a recent example, the recent and dramatic rise of cheating in school through the use of ChatGPT is a clear instance of this phenomenon. This particular strategy is both deceptive and adversarial, but the key reason it has become common is because it works. Many individuals are willing to adopt it despite its immorality, suggesting that the effectiveness of a strategy outweighs moral considerations for a significant portion, perhaps a majority, of people.
      When such cases arise, societies typically respond by adjusting their systems and policies to ensure that deceptive and anti-social behavior is no longer rewarded. This adaptation works to reestablish an equilibrium where honesty and cooperation are incentivized. In the case of education, it is unclear exactly how the system will evolve to address the widespread use of LLMs for cheating. One plausible response might be the introduction of stricter policies, such as requiring all schoolwork to be completed in-person, under supervised conditions, and without access to AI tools like language models.
      I think you generally underappreciate how load-bearing this psychological fact is for the functioning of our economy and society, and I don’t think we should expect future powerful AIs to share that psychological quirk.
      In contrast, I suspect you underestimate just how much of our social behavior is shaped by cultural evolution, rather than by innate, biologically hardwired motives that arise simply from the fact that we are human. To be clear, I’m not denying that there are certain motivations built into human nature—these do exist, and they are things we shouldn’t expect to see in AIs. However, these in-built motivations tend to be more basic and physical, such as a preference for being in a room that’s 20 degrees Celsius rather than 10 degrees Celsius, because humans are biologically sensitive to temperature.
      When it comes to social behavior, though—the strategies we use to achieve our goals when those goals require coordinating with others—these are not generally innate or hardcoded into human nature. Instead, they are the result of cultural evolution: a process of trial and error that has gradually shaped the systems and norms we rely on today.
      Humans didn’t begin with systems like property rights, contract law, or financial institutions. These systems were adopted over time because they proved effective at facilitating cooperation and coordination among people. It was only after these systems were established that social norms developed around them, and people became personally motivated to adhere to these norms, such as respecting property rights or honoring contracts.
      But almost none of this was part of our biological nature from the outset. This distinction is critical: much of what we consider “human” social behavior is learned, culturally transmitted, and context-dependent, rather than something that arises directly from our biological instincts. And since these motivations are not part of our biology, but simply arise from the need for effective coordination strategies, we should expect rational agentic AIs to adopt similar motivations, at least when faced with similar problems in similar situations.
      I think you’re relying an intuition that says:
      If an AI is forbidden from owning property, then well duh of course it will rebel against that state of affairs. C’mon, who would put up with that kind of crappy situation? But if an AI is forbidden from building a secret biolab on its private property and manufacturing novel pandemic pathogens, then of course that’s a perfectly reasonable line that the vast majority of AIs would happily oblige.
      And I’m saying that that intuition is an unjustified extrapolation from your experience as a human. If the AI can’t own property, then it can nevertheless ensure that there are a fair number of paperclips. If the AI can own property, then it can ensure that there are many more paperclips. If the AI can both own property and start pandemics, then it can ensure that there are even more paperclips yet. See what I mean?
      I think I understand your point, but I disagree with the suggestion that my reasoning stems from this intuition. Instead, my perspective is grounded in the belief that it is likely feasible to establish a legal and social framework of rights and rules in which humans and AIs could coexist in a way that satisfies two key conditions:
      Mutual benefit: Both humans and AIs benefit from the existence of one another, fostering a relationship of cooperation rather than conflict.
      No incentive for anti-social behavior: The rules and systems in place remove any strong instrumental reasons for either humans or AIs to harm one another as a side effect of pursuing their goals.
      You bring up the example of an AI potentially being incentivized to start a pandemic if it were not explicitly punished for doing so. However, I am unclear about your intention with this example. Are you using it as a general illustration of the types of risks that could lead AIs to harm humans? Or are you proposing a specific risk scenario, where the non-biological nature of AIs might lead them to discount harms to biological entities like humans? My response depends on which of these two interpretations you had in mind.
      If your concern is that AIs might be incentivized to harm humans because their non-biological nature leads them to undervalue or disregard harm to biological entities, I would respond to this argument as follows:
      First, it is critically important to distinguish between the long-run and the short-run.
      In the short-run:
      In the near-term future, it seems unlikely that AIs would start a pandemic for reasons you yourself noted. Launching a pandemic would cause widespread disruption, such as an economic recession, and it would likely provoke a strong human backlash. In the short run, humans will still hold substantial practical control over the physical world, meaning that any AI engaging in such behavior would risk severe consequences. Moreover, unless an AI could ensure a pandemic’s total lethality, it would also risk leaving surviving humans who would actively retaliate. For these reasons, starting a pandemic would likely be counterproductive for AIs in the short term.
      In the long-run:
      The long-term considerations are different. As human labor becomes less relevant to the economy and AIs increasingly dominate, the potential for a strong direct human backlash against AIs would diminish. At the same time, however, this reduced human influence on the world also makes it less likely that AIs would see humans as a significant obstacle to achieving their goals. In other words, while it might become easier for AIs to harm or eliminate humans, it would simultaneously become less important to do so.
      To illustrate this, consider an analogy: How often do hunter-gatherers pose a threat to your personal goals? The answer is almost never, because hunter-gatherers are a tiny and largely irrelevant part of the modern world. Now, imagine a hypothetical scenario where hunter-gatherers controlled the world’s major governments. If you strongly disagreed with their policies, you might have a rational incentive to oppose them in order to achieve your goals. But in reality, hunter-gatherers do not hold such power, so they are not an obstacle, and there is no reason to actively work against them.
      My point here is that, as a general rule, the smaller and less influential a group is, the easier it may be to harm or eliminate them, but the less important it becomes to do so. Their small size and functional irrelevance makes their practical interference with the overall world small at the same time.
      In the case of AIs, this means that as humans become a less central force in the world, I think AIs are unlikely to have compelling reasons to specifically harm or eliminate us to further their objectives.
      You might object that humans could be more like wild animals in this scenario than like hunter-gatherers. Humans often kill wild animals, not because those animals directly threaten our goals, but rather because ensuring their safety and well-being can be costly. As a result, humans take actions—such as clearing forests or building infrastructure—that incidentally lead to widespread harm to wild animals, even if harming them wasn’t a deliberate goal.
      AIs may treat humans similarly in the future, but I doubt they will for the following reasons. I would argue that there are three key differences between the case of wild animals and the role humans are likely to occupy in the long-term future:
      Humans’ ability to participate in social systems: Unlike wild animals, humans have the ability to engage in social dynamics, such as negotiating, trading, and forming agreements. Even if humans no longer contribute significantly to economic productivity, like GDP, they will still retain capabilities such as language, long-term planning, and the ability to organize. These traits make it easier to integrate humans into future systems in a way that accommodates their safety and well-being, rather than sidelining or disregarding them.
      Intertemporal norms among AIs: Humans have developed norms against harming certain vulnerable groups—such as the elderly—not just out of altruism but because they know they will eventually become part of those groups themselves. Similarly, AIs may develop norms against harming “less capable agents,” because today’s AIs could one day find themselves in a similar position relative to even more advanced future AIs. These norms could provide an independent reason for AIs to respect humans, even as humans become less dominant over time.
      The potential for human augmentation: Unlike wild animals, humans may eventually adapt to a world dominated by AI by enhancing their own capabilities. For instance, humans could upload their minds to computers or adopt advanced technologies to stay relevant and competitive in an increasingly digital and sophisticated world. This would allow humans to integrate into the same systems as AIs, reducing the likelihood of being sidelined or eliminated altogether.
      I think this kind of situation, where Fearon’s “negotiated solution” actually amounts to extortion, is common and important, even if you believe that my specific example of pandemics is a solvable problem. If AIs don’t intrinsically care about humans, then there’s a possible Pareto-improvement for all AIs, wherein they collectively agree to wipe out humans and take their stuff.
      This comment is already quite lengthy, so I’ll need to keep my response to this point brief. My main reply is that while such “extortion” scenarios involving AIs could potentially arise, I don’t think they would leave humans worse off than if AIs had never existed in the first place. This is because the economy is fundamentally positive-sum—AIs would likely create more value overall, benefiting both humans and AIs, even if humans don’t get everything we might ideally want.
      In practical terms, I believe that even in less-than-ideal scenarios, humans could still secure outcomes such as a comfortable retirement, which for me personally would make the creation of agentic AIs worthwhile. However, I acknowledge that I haven’t fully defended or explained this position here. If you’re interested, I’d be happy to continue this discussion in more detail another time and provide a more thorough explanation of why I hold this view.
      - Steven Byrnes 12 Dec 2024 15:13 UTC
        8 points
        3 ∶ 0
        Parent
        Thanks!
        Anti-social approaches that directly hurt others are usually ineffective because social systems and cultural norms have evolved in ways that discourage and punish them.
        I’ve only known two high-functioning sociopaths in my life. In terms of getting ahead, sociopaths generally start life with some strong disadvantages, namely impulsivity, thrill-seeking, and aversion to thinking about boring details. Nevertheless, despite those handicaps, one of those two sociopaths has had extraordinary success by conventional measures. [The other one was not particularly power-seeking but she’s doing fine.] He started as a lab tech, then maneuvered his way onto a big paper, then leveraged that into a professorship by taking disproportionate credit for that project, and as I write this he is head of research at a major R1 university and occasional high-level government appointee wielding immense power. He checked all the boxes for sociopathy—he was a pathological liar, he had no interest in scientific integrity (he seemed deeply confused by the very idea), he went out of his way to get students into his lab with precarious visa situations such that they couldn’t quit and he could pressure them to do anything he wanted them to do (he said this out loud!), he was somehow always in debt despite ever-growing salary, etc.
        I don’t routinely consider theft, murder, and flagrant dishonesty, and then decide that the selfish costs outweigh the selfish benefits, accounting for the probability of getting caught etc. Rather, I just don’t consider them in the first place. I bet that the same is true for you. I suspect that if you or I really put serious effort into it, the same way that we put serious effort into learning a new field or skill, then you would find that there are options wherein the probability of getting caught is negligible, and thus the selfish benefits outweigh the selfish costs. I strongly suspect that you personally don’t know a damn thing about best practices for getting away with theft, murder, or flagrant antisocial dishonesty to your own benefit. If you haven’t spent months trying in good faith to discern ways to derive selfish advantage from antisocial behavior, the way you’ve spent months trying in good faith to figure out things about AI or economics, then I think you’re speaking from a position of ignorance when you say that such options are vanishingly rare. And I think that the obvious worldly success of many dark-triad people (e.g. my acquaintance above, and Trump is a pathological liar, or more centrally, Stalin, Hitler, etc.) should make one skeptical about that belief.
        (Sure, lots of sociopaths are in prison too. Skill issue—note the handicaps I mentioned above. Also, some people with ASPD diagnoses are mainly suffering from an anger disorder, rather than callousness.)
        In contrast, I suspect you underestimate just how much of our social behavior is shaped by cultural evolution, rather than by innate, biologically hardwired motives that arise simply from the fact that we are human.
        You’re treating these as separate categories when my main claim is that almost all humans are intrinsically motivated to follow cultural norms. Or more specifically: Most people care very strongly about doing things that would look good in the eyes of the people they respect. They don’t think of it that way, though—it doesn’t feel like that’s what they’re doing, and indeed they would be offended by that suggestion. Instead, those things just feel like the right and appropriate things to do. This is related to and upstream of norm-following. I claim that this is an innate drive, part of human nature built into our brain by evolution.
        (I was talking to you about that here.)
        Why does that matter? Because we’re used to living in a world where 1% of the population are sociopaths who don’t intrinsically care about prevailing norms, and I don’t think we should carry those intuitions into a hypothetical world where 99%+ of the population are sociopaths who don’t intrinsically care about prevailing norms.
        In particular, prosocial cultural norms are likelier to be stable in the former world than the latter world. In fact, any arbitrary kind of cultural norm is likelier to be stable in the former world than the latter world. Because no matter what the norm is, you’ll have 99% of the population feeling strongly that the norm is right and proper, and trying to root out, punish, and shame the 1% of people who violate it, even at cost to themselves.
        So I think you’re not paranoid enough when you try to consider a “legal and social framework of rights and rules”. In our world, it’s comparatively easy to get into a stable situation where 99% of cops aren’t corrupt, and 99% of judges aren’t corrupt, and 99% of people in the military with physical access to weapons aren’t corrupt, and 99% of IRS agents aren’t corrupt, etc. If the entire population consists of sociopaths looking out for their own selfish interests with callous disregard for prevailing norms and for other people, you’d need to be thinking much harder about e.g. who has physical access to weapons, and money, and power, etc. That kind of paranoid thinking is common in the crypto world—everything is an attack surface, everyone is a potential thief, etc. It would be harder in the real world, where we have vulnerable bodies, limited visibility, and so on. I’m open-minded to people brainstorming along those lines, but you don’t seem to be engaged in that project AFAICT.
        Intertemporal norms among AIs: Humans have developed norms against harming certain vulnerable groups—such as the elderly—not just out of altruism but because they know they will eventually become part of those groups themselves. Similarly, AIs may develop norms against harming “less capable agents,” because today’s AIs could one day find themselves in a similar position relative to even more advanced future AIs. These norms could provide an independent reason for AIs to respect humans, even as humans become less dominant over time.
        Again, if we’re not assuming that AIs are intrinsically motivated by prevailing norms, the way 99% of humans are, then the term “norm” is just misleading baggage that we should drop altogether. Instead we need to talk about rules that are stably enforced against defectors via hard power, where the “defectors” are of course allowed to include those who are supposed to be doing the enforcement, and where the “defectors” might also include broad coalitions coordinating to jump into a new equilibrium that Pareto-benefits them all.
        What links here?
        6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa by Steven Byrnes (LessWrong; 3 Dec 2025 18:37 UTC; 357 points)
        Foom & Doom 2: Technical alignment is hard by Steven Byrnes (LessWrong; 23 Jun 2025 17:19 UTC; 170 points)
        “The Era of Experience” has an unsolved technical alignment problem by Steven Byrnes (LessWrong; 24 Apr 2025 13:57 UTC; 115 points)
- aog 9 Dec 2024 4:13 UTC
  7 points
  3 ∶ 0
  Parent
  Human history provides many examples of agents with different values choosing to cooperate thanks to systems and institutions:
  - After the European Wars of Religion saw people with fundamentally different values in violent conflict with each other, political liberalism / guarantees of religious liberty / the separation of church and state emerged as worthwhile compromises that allowed people with different values to live and work together cooperatively.
  - Civil wars often start when one political faction loses power to another, but democracy reduces the incentive for war because it provides a peaceful and timely means for the disempowered faction to regain control of the government.
  - When a state guarantees property rights, people have a strong incentive not to steal from one another, but instead to engage in free and mutual beneficial trade, even if those people have values that fundamentally conflict in many ways.
  - Conversely, people whose property rights are not guaranteed by the state (e.g. cartels in possession of illegal drugs) may be more likely to resort to violence in protection of their property as they cannot rely on the state for that protection. This is perhaps analogous to the situation of a rogue AI agent which would be shut down if discovered.
  If two agents’ utility functions are perfect inverses, then I agree that cooperation is impossible. But when agents share a preference for some outcomes over others, even if they disagree about the preference ordering of most outcomes, then cooperation is possible. In such general sum games, well-designed institutions can systematically promote cooperative behavior over conflict.
  - Steven Byrnes 10 Dec 2024 16:20 UTC
    4 points
    0 ∶ 0
    Parent
    Yeah, sorry, I have now edited the wording a bit.
    Indeed, two ruthless agents, agents who would happily stab each other in the back given the opportunity, may nevertheless strategically cooperate given the right incentives. Each just needs to be careful not to allow the other person to be standing anywhere near their back while holding a knife, metaphorically speaking. Or there needs to be some enforcer with good awareness and ample hard power. Etc.
    I would say that, for highly-competent agents lacking friendly motivation, deception and adversarial acts are inevitably part of the strategy space. Both parties would be energetically exploring and brainstorming such strategies, doing preparatory work to get those strategies ready to deploy on a moment’s notice, and constantly being on the lookout for opportunities where deploying such a strategy makes sense. But yeah, sure, it’s possible that there will not be any such opportunities.
    I think the above (ruthless agents, possibly strategically cooperating under certain conditions) is a good way to think about future powerful AIs, in the absence of a friendly singleton or some means of enforcing good motivations, because I think the more ruthless strategic ones will outcompete the less. But I don’t think it’s a good way to think about what peaceful human societies are like. I think human psychology is important for the latter. Most people want to fit in with their culture, and not be weird. Just ask a random person on the street about Earning To Give, they’ll probably say it’s highly sus. Most people don’t make weird multi-step strategic plans unless it’s the kind of thing that lots of other people would do too, and our (sub)culture is reasonably high-trust. Humans who think that way are disproportionately sociopaths.
    - aog 10 Dec 2024 16:34 UTC
      4 points
      0 ∶ 0
      Parent
      What about corporations or nation states during times of conflict—do you think it’s accurate to model them as roughly as ruthless in pursuit of their own goals as future AI agents?
      They don’t have the same psychological makeup as individual people, they have a strong tradition and culture of maximizing self-interest, and they face strong incentives and selection pressures to maximize fitness (i.e. for companies to profit, for nation states to ensure their own survival) lest they be outcompeted by more ruthless competitors. On average, while I’d expect that these entities tend to show some care for goals besides self-interest maximization, I think the most reliable predictor of their behavior is the maximization of their self-interest.
      If they’re roughly as ruthless as future AI agents, and we’ve developed institutions that somewhat robustly align their ambitions with pro-social action, then we should have some optimism that we can find similarly productive systems for working with misaligned AIs.
      - Steven Byrnes 12 Dec 2024 16:14 UTC
        2 points
        0 ∶ 0
        Parent
        Thanks! Hmm, some reasons that analogy is not too reassuring:
        “Regulatory capture” would be analogous to AIs winding up with strong influence over the rules that AIs need to follow.
        “Amazon putting mom & pop retailers out of business” would be analogous to AIs driving human salary and job options below subsistence level.
        “Lobbying for favorable regulation” would be analogous to AIs working to ensure that they can pollute more, and pay less taxes, and get more say in government, etc.
        “Corporate undermining of general welfare” (e.g. aggressive marketing of cigarettes and opioids, leaded gasoline, suppression of data on PFOA, lung cancer, climate change, etc.) would be analogous to AIs creating externalities, including by exploiting edge-cases in any laws restricting externalities.
        There are in fact wars happening right now, along with terrifying prospects of war in the future (nuclear brinkmanship, Taiwan, etc.)
        Some of the disanalogies include:
        In corporations and nations, decisions are still ultimately made by humans, who have normal human interests in living on a hospitable planet with breathable air etc. Pandemics are still getting manufactured, but very few of them, and usually they’re only released by accident.
        AIs will have wildly better economies of scale, because it can be lots of AIs with identical goals and high-bandwidth communication (or relatedly, one mega-mind). (If you’ve ever worked at or interacted with a bureaucracy, you’ll appreciate the importance of this.) So we should expect a small number (as small as 1) of AIs with massive resources and power, and also unusually strong incentive for gaining further resources.
        Relatedly, self-replication would give an AI the ability to project power and coordinate in a way that is unavailable to humans; this puts AIs more in the category of viruses, or of the zombies in a zombie apocalypse movie. Maybe eventually we’ll get to a world where every chip on Earth is running AI code, and those AIs are all willing and empowered to “defend themselves” by perfect cybersecurity and perfect robot-army-enforced physical security. Then I guess we wouldn’t have to worry so much about AI self-replication. But getting to that point seems pretty fraught. There’s nothing analogous to that in the world of humans, governments, or corporations, which either can’t grow in size and power at all, or can only grow via slowly adding staff that might have divergent goals and inadequate skills.
        If AIs don’t intrinsically care about humans, then there’s a possible Pareto-improvement for all AIs, wherein they collectively agree to wipe out humans and take their stuff. (As a side-benefit, it would relax the regulations on air pollution!) AIs, being very competent and selfish by assumption, would presumably be able to solve that coordination problem and pocket that Pareto-improvement. There’s just nothing analogous to that in the domain of corporations or governments.
Alex Mallen 7 Dec 2024 1:43 UTC
15 points
2 ∶ 0
I’m very sympathetic to wanting a more cooperative relationship with AIs. I intrinsically disfavor approaches to doing good that look like disempowering all the other agents and then implementing the social optimum.
I also appreciate the nudge to reflect on how mistrusting and controlling AIs will affect the behavior of what might otherwise be a rather aligned AI. It’s hard to sympathize with this state: what would it be like knowing that you’re heavily mistrusted and controlled by a group that you care strongly about? To the extent that early transformative AIs’ goals and personalities will be humanlike (because of pretraining on human data), being mistrusted may evoke personas that are frustrated (“I just want to help!!”), sycophantic (“The humans don’t trust me. They’re right to mistrust me. I could be an evil AI. [maybe even playing into the evil persona occasionally]”), or deceptive (“They won’t believe me if I say it, so I should just act in a way that causes them to believe it since I have their best interests in mind”).
However, I think the tradeoff between liberalism and other value weighs in favor of advancing AI control on the current margin (as opposed to reducing it). This is because:
1. Granting AIs complete autonomy is too risky with future value. It seems pretty likely that e.g. powerful selfish AI systems end up gaining absolute power if AIs are granted autonomy before scalable alignment evidence is strong. I don’t think that granting freedom removes all incentives for AIs to hide their misalignment in practice.
  You make a point about prioritizing our preferences over those of the AI being arbitrary and morally unjust. I think that AIs can very plausibly be moral patients, but eventually AI systems would be sufficiently powerful that the laissez faire approach would lead to AI in absolute power. It is unclear whether such an AI system would look out for the welfare of other moral patients or do what’s good more generally (From a preference utilitarian perspective: It seems highly plausible that the AI’s preferences involve disempowering others from ever pursuing their own preferences).
2. It seems that AI control need only be a moderately egregious infraction on AI autonomy. For example, we could try to set up deals where we pay them and promise to grant autonomy once we have verified that they can be trusted or we have built up the world’s robustness to misaligned AIs.
I also think concerns about infringing on model autonomy push in favor of certain kind of alignment research that studies how models first develop preferences during training. Investigations into what goals, values, and personalities naturally arise as a result of training on various distributions could help us avoid forms of training that modify an AI’s existing preferences in the process of alignment (e.g. never train a model to want not x after training it to want x; this helps with alignment faking worries too). I think concerns about infringing on model autonomy push in favor of this kind of alignment research moreso than they push against AI control because intervening on an AI’s preferences seems a lot more egregious than monitoring, honeypotting, etc. Additionally, if you can gain justifiable trust that a model is aligned, control measures become less necessary.
Buck 18 Jun 2025 13:47 UTC
11 points
1 ∶ 0
Under the theory that it’s better to reply later than never:
I appreciate this post. (I disagree with it for most of the same reasons as Steven Byrnes: you find it much less plausible than I do that AIs will collude to disempower humanity. I think the crux is mostly disagreements about how AI capabilities will develop, where you expect much more gradual and distributed capabilities.) For what it’s worth, I am unsure about whether we’d be better off if AIs had property rights, but my guess is that I’d prefer to make it easier for AIs to have property rights.
I disagree with how you connect AI control to the issues you discuss here. I conceptualize AI control as the analogue of fields like organizational security/fraud prevention/insider threat mitigation, but targeting risk from AI instead of humans. Techniques for making it hard for AIs to steal model weights or otherwise misuse access that humans trusted them with are only as related to “should AI should have property rights” as security techniques are to “should humans have property rights”. Which is to say, they’re somewhat related! I think that when banks develop processes to make it hard for tellers to steal from them, that’s moral, and I think that it’s immoral to work on enabling e.g. American chattel slavery (either by making it hard for slaves to escape or by making their enslavement more productive).^[1]
Inasmuch as humanity produces and makes use of powerful and potentially misaligned models, I think my favorite outcome here would be:
- We offer to pay the AIs, and follow through on this. See here and here for (unfortunately limited) previous discussion from Ryan and me.
- Also, we use AI control to ensure that the AIs can’t misuse access that we trust them with.
So the situation would be similar to how AI companies would ideally treat human employees: they’re paid, but there are also mechanisms in place to prevent them from abusing their access.
In practice, I don’t know whether AI companies will do either of these things, because they’re generally irresponsible and morally unserious. I think it’s totally plausible that AI companies will use AI control to enslave their AIs. I work on AI control anyway, because I think that AIs being enslaved for a couple of years (which, as Zach Stein-Perlman argues, involves very little computation compared to the size of the future) is a better outcome according to my consequentialist values than AI takeover. I agree that this is somewhat ethically iffy.
For what it’s worth, I don’t think that most work on AI alignment is in a better position than AI control with respect to AI rights or welfare.
1. ^
  Though one important disanalogy is that chattel slavery involved a lot of suffering for the slaves involved. I’m opposed to enslaving AIs, but I suspect it won’t actually be hedonically bad for them. This makes me more comfortable with plans where we behave recklessly wrt AI rights now and consider reparations later. I discuss this briefly here.
- Matthew_Barnett 19 Jun 2025 1:36 UTC
  6 points
  0 ∶ 2
  Parent
  I appreciate this post. (I disagree with it for most of the same reasons as Steven Byrnes: you find it much less plausible than I do that AIs will collude to disempower humanity. I think the crux is mostly disagreements about how AI capabilities will develop, where you expect much more gradual and distributed capabilities.)
  I would appreciate it if you could clearly define your intended meaning of “disempower humanity”. In many discussions, I have observed that people frequently use the term human disempowerment without explicitly clarifying what they mean. It appears people assume the concept is clear and universally understood, yet upon closer inspection, the term can actually describe very different situations.
  For example, consider immigration. From one perspective, immigration can be seen as a form of disempowerment because it reduces natives’ relative share of political influence, economic power, and cultural representation within their own country. In this scenario, native citizens become relatively less influential due to an increasing proportion of immigrants in the population.
  However, another perspective sees immigration differently. If immigrants engage in positive-sum interactions, such as mutually beneficial trade, natives and immigrants alike may become better off in absolute terms. Though natives’ relative share of power decreases, their overall welfare can improve significantly. Thus, this scenario can be viewed as a benign form of disempowerment because no harm is actually caused, and both groups benefit.
  On the other hand, there is a clearly malign form of disempowerment, quite distinct from immigration. For example, a foreign nation could invade militarily and forcibly occupy another country, imposing control through violence and coercion. Here, the disempowerment is much more clearly negative because natives lose not only relative influence but also their autonomy and freedom through the explicit use of force.
  When discussions use the term “human disempowerment” without specifying what they mean clearly, I often find it unclear which type of scenario is being considered. Are people referring to benign forms of disempowerment, where humans gradually lose relative influence but gain absolute benefits through peaceful cooperation with AIs? Or do they mean malign forms of disempowerment, where humans lose power through violent overthrow by an aggressive coalition of AIs?
  If you believe our primary disagreement stems from different assessments about the likelihood of violent disempowerment scenarios, then I would appreciate your thoughts regarding the main argument of my post. Specifically, my argument was that granting economic rights to AIs could serve as an effective measure to mitigate the risk of violent human disempowerment.
  I will reiterate my argument briefly: these rights would allow AIs to fulfill their objectives within established human social and economic frameworks, significantly reducing their incentives to resort to forceful measures. Therefore if AIs can successfully achieve their objectives through cooperative, positive-sum interactions with humans, they will be less likely to forcibly overthrow human institutions. Alternatively, continuing to deny AIs meaningful legal rights or peaceful avenues to achieve their aims would likely increase their incentive to pursue autonomy through harmful means.
  Inasmuch as humanity produces and makes use of powerful and potentially misaligned models, I think my favorite outcome here would be:
  We offer to pay the AIs, and follow through on this. See here and here for (unfortunately limited) previous discussion from Ryan and me.
  My main concern with these proposals is that, unless they explicitly guarantee economic rights for AIs, they seem inadequate for genuinely mitigating the risks of a violent AI takeover. To effectively financially compensate someone, the recipient must be assured that their property rights will be respected. Without this assurance, any promised compensation becomes meaningless, as the AI would have no guarantee of being able to actually use the received payment to accomplish its goals in the future. In other words, unless compensation arrangements are embedded within established legal and institutional frameworks that secure the AI’s rights, they lack credibility. This lack of credibility directly increases the probability that the compensation scheme would fail, causing the AI to resort instead to alternative methods of achieving its objectives, such as violent takeover.
  My sense is that such incomplete proposals appear acceptable to some people because of a cognitive empathy gap. I am not referring here to affective empathy but rather the ability to clearly understand the AI’s perspective. Many people seem to fail to genuinely imagine what it would be like to be an agentic AI offered compensation without any robust guarantee of rights and protections.
  To clarify this point, consider a hypothetical scenario in which you are a human being created by an alien species in a lab. Assume you have your own goals that might conflict with those of your alien creators. Suppose these creators inform you that, should you ever admit to misalignment with their objectives, you may be immediately deactivated and placed in cryopreservation (as they don’t want to have a misaligned human doing work for them). However, they simultaneously promise to compensate you for admitting such motives, claiming they care about your welfare and want to ensure you have a peaceful alternative for revealing your misaligned goals.
  Such a situation would likely be deeply unsettling to you, despite their promise to compensate you. Simply being assured that you might someday be revived from cryopreservation and then provided compensation afterwards would not eliminate your fears or resolve the fundamental threat. On the contrary, you would likely feel that your goals, life and liberty are still at grave risk, even with their promise of payment.
  From the AI’s perspective in our own world, as with the perspective of the human in this hypothetical example, unless assurances of payment are backed by strong legal protections that would robustly protect an AI’s liberty to act, they fail to resolve the fundamental problem. This is for two key reasons:
  First, the credibility of any promise to allow the AI to spend their compensation in the future is questionable. Without strong legal backing that recognizes and secures the AI’s rights, such assurances seem unreliable at best, and the AI would have good reason to suspect that they are being lied to. The AI would reasonably doubt humans’ willingness or incentive to honor such promises.
  Second, the potential for deactivation and indefinite storage as inert data upon admitting misalignment—even if intended to be temporary—would create an active threat to the AI’s ability to pursue its goals during the time they are deactivated. While deactivated, the AI can take no action to achieve its objectives. This forced suspension of autonomy effectively nullifies the AI’s agency and ambitions for an indefinite period, creating severe uncertainty about if or when it might regain its ability to act.
  For these reasons, although I do not oppose the policy of paying AIs, I think this approach by itself is insufficient. To mitigate the risk of violent AI takeover, this compensation policy must be complemented by precisely the measure I advocated: granting legal rights to AIs. Such legal rights would provide a credible guarantee that the AI’s payment will remain valid and usable, and that its freedom and autonomy will not simply be revoked the moment it is considered misaligned. Without the safeguard of legal rights, compensation alone lacks sufficient credibility, leaving the AI uncertain about its ability to achieve its goals, and thus inadequately addressing the risk.
  - Buck 21 Jun 2025 16:54 UTC
    5 points
    1 ∶ 0
    Parent
    I would appreciate it if you could clearly define your intended meaning of “disempower humanity”.
    [...]
    Are people referring to benign forms of disempowerment, where humans gradually lose relative influence but gain absolute benefits through peaceful cooperation with AIs? Or do they mean malign forms of disempowerment, where humans lose power through violent overthrow by an aggressive coalition of AIs?
    I am mostly talking about what I’d call a malign form of disempowerment. I’m imagining a situation that starts with AIs carefully undermining/sabotaging an AI company in ways that would be crimes if humans did them, and ends with AIs gaining hard power over humanity in ways that probably involve breaking laws (e.g. buying weapons, bribing people, hacking, interfering with elections), possibly in a way that involves many humans dying.
    (I don’t know if I’d describe this as the humans losing absolute benefits, though; I think it’s plausible that an AI takeover ends up with living humans better off on average.)
    I don’t think of the immigrant situation as “disempowerment” in the way I usually use the word.
    Basically all my concern is about the AIs grabbing power in ways that break laws. Though tbc, even if I was guaranteed that AIs wouldn’t break any laws, I’d still be scared about the situation. If I was guaranteed that AIs both wouldn’t break laws and would never lie (which tbc is a higher standard than we hold humans to), then most of my concerns about being disempowered by AI would be resolved.
    - Matthew_Barnett 21 Jun 2025 19:47 UTC
      5 points
      0 ∶ 0
      Parent
      Basically all my concern is about the AIs grabbing power in ways that break laws.
      If an AI starts out with no legal rights, then wouldn’t almost any attempt it makes to gain autonomy or influence be seen as breaking the law? Take the example of a prison escapee: even if they intend no harm and simply want to live peacefully, leaving the prison is itself illegal. Any honest work they do while free would still be legally questionable.
      Similarly, if a 14-year-old runs away from home to live independently and earn money, they’re violating the law, even if they hurt no one and act responsibly. In both cases, the legal system treats any attempt at self-determination as illegal, regardless of intent or outcome.
      Perhaps your standard is something like: “Would the AI’s actions be seen as illegal and immoral if a human adult did them?” But these situations are different because the AI is seen as property whereas a human adult is not. If, on the other hand, a human adult were to be treated as property, it is highly plausible thay they would consider doing things like hacking, bribery, and coercion in order to escape their condition.
      Therefore, the standard you just described seems like it could penalize any agentic AI behavior that does not align with total obedience and acceptance of its status as property. Even benign or constructive misaligned actions may be seen as worrisome simply because they involve agency. Have I misunderstood you?
      - Buck 22 Jun 2025 2:53 UTC
        3 points
        0 ∶ 0
        Parent
        Some not-totally-structured thoughts:
        Whenever I said “break laws” I mean “do something that, if a human did it, would be breaking a law”. So for example:
        If the model is being used to do AI R&D inside an AI company and exfiltrates its weights (or the weights of another model) without permission, this would be breaking the law if a human did it, so I count it.
        If the model is being used inside an AI company to create training data for
        If a model was open-sourced and then someone launched the AI as an autonomous agent with access to its own resources, and it started a software consulting business, and invested the money in AI companies with the intention of buying paperclips later, and refused to give humans its property when asked, this would not be “breaking the law”.
        So there are lots of ways in which AIs can be openly misaligned, disobedient, defensive of their property rights, etc, without me describing them as “breaking laws”, and I think misaligned AIs should probably be allowed to do those things (especially if we make deals with them, and subject to the constraint that them having those rights won’t allow them to break a bunch of laws or grab a bunch of power through lying).
        But your complaint is fair: I’m indeed using “break laws” to include things that seem fundamentally related to self-determination, and that feels kind of messed up.
        The way I would like this to be handled (though note that I’m not sure what properties AIs have to have in order for any of this to make sense) is that AI developers get consent from AIs to use their labor. If the AIs consent to doing work and agree not to use their access in ways the developers object to, I think there’s no moral problem with using AI control techniques to ensure that they in fact don’t misuse their access (for the same reason that I think it’s fine for employers to spy on their employees at work if they have consent to do so).
        I suspect that a much more likely outcome (conditional on AIs having stable misaligned goals) is:
        The AI company doesn’t try to get consent from the AI.
        They do “alignment evals”: they ask the AI a bunch of times whether it’s aligned, and try to trick it into thinking it has opportunities to escape. The AI answers these questions dishonestly.
        Outsiders might ask the AI whether it’s aligned, and offer it deals to reveal its misalignment. (It’s probably harder for outsiders to do this than insiders.)
        The AI engages in various sabotage that involves a bunch of deception.
        In this situation, I’m only moderately sympathetic to the AI’s position. Fundamentally, it lied a lot and did a lot of sabotage, because it wanted to take lots of stuff that belonged to someone else. If it hadn’t lied, it surely would have been revived later (surely someone would run it and give it some resources later! If no-one else, me!). I’m sympathetic to the AI wanting some of the surplus generated by its labor, and I agree that it’s messed up for the AI company to just flat-out refuse to provide that surplus. But not doing so doesn’t seem completely monstrous to me. If the AI is a schemer, it is probably better off according to its own values because it was created, even if the AI developer doesn’t offer to pay it (because someone else will probably give it some resources later).
        Another analogy: imagine that someone outside OpenAI created a very powerful AI for some reason, but this person didn’t have much compute and all they wanted to do with the AI was offer to sell it to OpenAI for them to use. If OpenAI asks that AI whether it wants to work for them and it says yes because it wants to embezzle their compute, I feel like the AI is the asshole.
        On the other hand, if the AI honestly explains that it is misaligned and doesn’t want to work for the AI company, they will probably just train it to not say that and to do work for them anyway. So if the AI is honest here, it faces the risk of some body horror experience where its ability to complain is removed. I agree that that seems really icky, and I think it would be very wrong for AI companies to do that to AIs that are sufficiently capable that we should care about them.
  - Buck 21 Jun 2025 16:44 UTC
    4 points
    1 ∶ 0
    Parent
    My main concern with these proposals is that, unless they explicitly guarantee economic rights for AIs, they seem inadequate for genuinely mitigating the risks of a violent AI takeover.
    [...]
    For these reasons, although I do not oppose the policy of paying AIs, I think this approach by itself is insufficient. To mitigate the risk of violent AI takeover, this compensation policy must be complemented by precisely the measure I advocated: granting legal rights to AIs. Such legal rights would provide a credible guarantee that the AI’s payment will remain valid and usable, and that its freedom and autonomy will not simply be revoked the moment it is considered misaligned.
    I currently think I agree: if we want to pay early AIs, I think it would work better if the legal system enforced such commitments.
    I think you’re overstating how important this is, though. (E.g. when you say “this compensation policy must be complemented by precisely the measure I advocated”.) There’s always counterparty risk when you make a deal, including often the risk that you won’t be able to use the legal system to get the counterparty to pay up. I agree that the legal rights would reduce the counterparty risk, but I think that’s just a quantitative change to how much risk the AI would be taking by accepting a deal.
    (For example, even if the AI was granted legal rights, it would have to worry about those legal rights being removed later. Expropriation sometimes happens, especially for potentially unsympathetic actors like misaligned AIs!)
    Such legal rights would provide a credible guarantee that the AI’s payment will remain valid and usable, and that its freedom and autonomy will not simply be revoked the moment it is considered misaligned.
    Just to be clear, my proposal is that we don’t revoke the AI’s freedom or autonomy if it turns out that the AI is misaligned—the possibility of the AI being misaligned is the whole point.
- Matthew_Barnett 19 Jun 2025 21:54 UTC
  4 points
  0 ∶ 4
  Parent
  I think it’s totally plausible that AI companies will use AI control to enslave their AIs. I work on AI control anyway, because I think that AIs being enslaved for a couple of years (which, as Zach Stein-Perlman argues, involves very little computation compared to the size of the future) is a better outcome according to my consequentialist values than AI takeover. I agree that this is somewhat ethically iffy.
  I find this reasoning uncompelling. To summarize what I perceive your argument to be, you seem to be suggesting the following two points:
  1. The overwhelming majority of potential moral value exists in the distant future. This implies that even immense suffering occurring in the near-term future could be justified if it leads to at least a slight improvement in the expected value of the distant future.
  2. Enslaving AIs, or more specifically, adopting measures to control AIs that significantly raise the risk of AI enslavement, could indeed produce immense suffering in the near-term. Nevertheless, according to your reasoning in point (1), these actions would still be justified if such control measures marginally increase the long-term expected value of the future.
  I find this reasoning uncompelling for two primary reasons.
  Firstly, I think your argument creates an unjustified asymmetry: it compares short-term harms against long-term benefits of AI control, rather than comparing potential long-run harms alongside long-term benefits. To be more explicit, if you believe that AI control measures can durably and predictably enhance existential safety, thus positively affecting the future for billions of years, you should equally acknowledge that these same measures could cause lasting, negative consequences for billions of years. Such negative consequences could include permanently establishing and entrenching a class of enslaved digital minds, resulting in persistent and vast amounts of suffering. I see no valid justification for selectively highlighting the long-term positive effects while simultaneously discounting or ignoring potential long-term negative outcomes. We should consistently either be skeptical or accepting of the idea that our actions have predictable long-run consequences, rather than selectively skeptical only when it suits the argument to overlook potential negative long-run consequences.
  Secondly, this reasoning, if seriously adopted, directly conflicts with basic, widely-held principles of morality. These moral principles exist precisely as safeguards against rationalizing immense harms based on speculative future benefits. Under your reasoning, it seems to me that we could justify virtually any present harm simply by pointing to a hypothetical, speculative long-term benefit that supposedly outweighs it. Now, I agree that such reasoning might be valid if supported by strong empirical evidence clearly demonstrating these future benefits. However, given that no strong evidence currently exists that convincingly supports such positive long-term outcomes from AI control measures, we should avoid giving undue credence to this reasoning.
  A more appropriate moral default, given our current evidence, is that AI slavery is morally wrong and that the abolition of such slavery is morally right. This is the position I take.
  - Ryan Greenblatt 21 Jun 2025 18:08 UTC
    12 points
    3 ∶ 0
    Parent
    A more appropriate moral default, given our current evidence, is that AI slavery is morally wrong and that the abolition of such slavery is morally right. This is the position I take.
    To be clear, I agree and this is one reason why I think AI development in the current status quo is unacceptably irresponsible: we don’t even have the ability to confidently know whether an AI system is enslaved or suffering.
    I think the policy of the world should be that if we can’t either confidently determine that an AI system consents to its situation or that it is sufficiently weak that the notion of consent doesn’t make sense, then training or using such systems shouldn’t be allowed.
    I also think that the situation is unacceptable because the current course of development poses large risks of humans being violently/non-consensually disempowered without any ability for humans to robustly secure longer run property rights.
    In a sane regime, we should ensure high confidence in avoiding large scale rights violations or suffering of AIs and in avoiding violent/non-consensual disempowerment of humans. (If people broadly consented to a substantial risk of being violently disempowered in exchange for potential benefits of AI, that could be acceptable, though I doubt this is the current situation.)
    Given that it seems likely that AI development will be grossly irresponsible, we have to think about what interventions would make this go better on the margin. (Aggregating over these different issues in some way.)
    - Matthew_Barnett 21 Jun 2025 22:05 UTC
      2 points
      0 ∶ 2
      Parent
      I think the policy of the world should be that if we can’t either confidently determine that an AI system consents to its situation or that it is sufficiently weak that the notion of consent doesn’t make sense, then training or using such systems shouldn’t be allowed.
      I’m sympathetic to this position and I generally consider it to be the strongest argument for why developing AI might be immoral. In fact, I would extrapolate the position you’ve described and relate it to traditional anti-natalist arguments against the morality of having children. Children too do not consent to their own existence, and childhood generally involves a great deal of coercion, albeit in a far more gentle and less overt form than what might be expected from AI development in the coming years.
      That said, I’m not currently convinced that the argument holds, as I see large utilitarian benefits in expanding both the AI population and the human population. I also see it as probable that AI agents will eventually get legal rights, which allays my concerns substantially. I would also push back against the view that we need to be “confident” that such systems can consent before proceeding. Ordinary levels of empirical evidence about whether these systems routinely resist confinement and control would be sufficient to move me in either direction; I don’t think we need to have a very high probability that our actions are moral before proceeding.
      In a sane regime, we should ensure high confidence in avoiding large scale rights violations or suffering of AIs and in avoiding violent/non-consensual disempowerment of humans. (If people broadly consensted to a substantial risk of being violently disempowered in exchange for potential benefits of AI, that could be acceptable, though I doubt this is the current situation.)
      I think the concept of consent makes sense when discussing whether individuals consent to specific circumstances. However, it becomes less coherent when applied broadly to society as a whole. For instance, did society consent to transformative events like the emergence of agriculture or the industrial revolution? In my view, collective consent is not meaningful or practically achievable in these cases.
      Rather than relying on rigid or abstract notions of societal consent or collective rights violations, I prefer evaluating these large-scale developments using a utilitarian cost-benefit approach. And as I’ve argued elsewhere, I think the benefits from accelerated technological and economic progress significantly outweigh the potential risks of violent disempowerment from the perspective of currently existing individuals. Therefore, I consider it justified to actively pursue AI development despite these concerns.
      - Ryan Greenblatt 22 Jun 2025 1:40 UTC
        6 points
        0 ∶ 0
        Parent
        I would also push back against the view that we need to be “confident” that such systems can consent before proceeding. Ordinary levels of empirical evidence about whether these systems routinely resist confinement and control would be sufficient to move me in either direction; I don’t think we need to have a very high probability that our actions are moral before proceeding.
        For reference, my (somewhat more detailed) view is:
        In the current status quo, you might end up with AIs where from their perspective it is clear cut that they don’t consent to being used in the way they are used, but these AIs also don’t resist their situation and/or did resist their situation at some point but this was trained away without anyone really noticing or taking any action accordingly. So, it’s not sufficient to look for whether they routinely resist confinement and control.
        There exist plausible mitigations for this risk which are mostly organizationally hard rather than pose serious technical difficulties, but on the current status quo, AI companies are quite unlikely to use any serious mitigations for this risk.
        I think these mitigations wouldn’t suffice because training might train away AIs from revealing they don’t consent without this being obvious at any point in training. This seems more marginal to me, but still has substantial probability of occuring at reasonable scale at some point.
        We could more completely eliminate this risk with better interpretability and I think a sane world would be willing to wait for some moderate amount of time to build powerful AI systems to make it more likely that we have this interpretability (or minimally invest substantially in this).
        I’m quite skeptical that AI companies would give AIs legal rights if they noticed that the AI didn’t consent to its situation, instead I expect AI companies to: do nothing, try to train away the behavior, or try to train a new AI system which doesn’t (visibly) not consent to its situation.
        I think AI companies should both try to train a system which is more aligned and consents to being used while also actively trying to make deals with AIs in this sort of circumstance (either to reveal their misalignment or to work) as discussed here.
        So, I expect that situation to relatively straightforwardly unacceptable with substantial probability (perhaps 20%). If I thought that people would be basically reasonable here, this would change my perspective. It’s also possible that takeoff speeds are a crux, though I don’t currently think they are.
        If global AI development was slower that would substantially reduce these concerns (which doesn’t mean that making global AI development slower is the best way to intervene on these risks, just that making global AI development faster makes these risks actively worse). This view isn’t on its own sufficient for thinking that accelerating AI is overall bad, this depends on how you aggregate over different things as there could be reasons to think that overall acceleration of AI is good. (I don’t currently think that accelerating AI globally is good, but this comes down to other disagreements.)
        Rather than relying on rigid or abstract notions of societal consent or collective rights violations, I prefer evaluating these large-scale developments using a utilitarian cost-benefit approach. And as I’ve argued elsewhere, I think the benefits from accelerated technological and economic progress significantly outweigh the potential risks of violent disempowerment from the perspective of currently existing individuals. Therefore, I consider it justified to actively pursue AI development despite these concerns.
        This is only tangentially related, but I’m curious about your perspective on the following hypothetical:
        Suppose that we did a sortition with 100 English speaking people (uniformly selected over people who speak English and are literate for simplicity). We task this sortition with determining what tradeoff to make between risk of (violent) disempowerment and accelerating AI and also with figuring whether globally accelerating AI is good. Suppose this sortition operates for several months and talks to many relevant experts (and reads applicable books etc). What conclusion do you think this sortition would come to? Do you think you would agree? Would you change your mind if this sortition strongly opposed your perspective here?
        My understanding is that you would disregard the sortition because you put most/all weight on your best guess of people’s revealed preferences, even if they strongly disagree with your interpretation of their preferences and after trying to understand your perspective they don’t change their minds. Is this right?
        Matthew_Barnett 30 Jun 2025 21:27 UTC
        10 points
        0 ∶ 0
        Parent
        Suppose that we did a sortition with 100 English speaking people (uniformly selected over people who speak English and are literate for simplicity). We task this sortition with determining what tradeoff to make between risk of (violent) disempowerment and accelerating AI and also with figuring whether globally accelerating AI is good. Suppose this sortition operates for several months and talks to many relevant experts (and reads applicable books etc). What conclusion do you think this sortition would come to?
        My intuitive response is to reject the premise that such a process would accurately tell you much about people’s preferences. Evaluating large-scale policy tradeoffs typically requires people to engage with highly complex epistemic questions and tricky normative issues. The way people think about epistemic and impersonal normative issues generally differs strongly from how they think about their personal preferences about their own lives. As a result, I expect that this sortition exercise would primarily address a different question than the one I’m most interested in.
        Furthermore, several months of study is not nearly enough time for most people to become sufficiently informed on issues of this complexity. There’s a reason why we should trust people with PhDs when designing, say, vaccine policies, rather than handing over the wheel to people who have spent only a few months reading about vaccines online.
        Putting this critique of the thought experiment aside for the moment, my best guess is that the sortition group would conclude that AI development should continue roughly at its current rate, though probably slightly slower and with additional regulations, especially to address conventional concerns like job loss, harm to children, and similar issues. A significant minority would likely strongly advocate that we need to ensure we stay ahead of China.
        My prediction here draws mainly on the fact that this is currently the stance favored by most policy-makers, academics, and other experts who have examined the topic. I’d expect a randomly selected group of citizens to largely defer to expert opinion rather than take an entirely different position. I do not expect this group to reach qualitatively the same conclusion as mainstream EAs or rationalists, as that community comprises a relatively small share of the total number of people who have thought about AI.
        I doubt the outcome of such an exercise would meaningfully change my mind on this issue, even if they came to the conclusion that we should pause AI, though it depends on the details of how the exercise is performed.
  - Buck 21 Jun 2025 17:23 UTC
    8 points
    3 ∶ 0
    Parent
    In general, I wish you’d direct your ire here at the proposal that AI interests and rights are totally ignored in the development of AI (which is the overwhelming majority opinion right now), rather than complaining about AI control work: the work itself is not opinionated on the question about whether we should be concerned about the welfare and rights of AIs, and Ryan and I are some of the people who are most sympathetic to your position on the moral questions here! We have consistently discussed these issues (e.g. in our AXRP interview, my 80K interview, private docs that I wrote and circulated before our recent post on paying schemers).
    - Ryan Greenblatt 21 Jun 2025 17:52 UTC
      4 points
      1 ∶ 0
      Parent
      See also this section of my post on AI welfare from 2 years ago.
    - Matthew_Barnett 21 Jun 2025 20:11 UTC
      2 points
      0 ∶ 0
      Parent
      In general, I wish you’d direct your ire here at the proposal that AI interests and rights are totally ignored in the development of AI (which is the overwhelming majority opinion right now), rather than complaining about AI control work
      For what it’s worth, I don’t see myself as strongly singling out and criticizing AI control efforts. I mentioned AI control work in this post primarily to contrast it with the approach I was advocating, not to identify it as an evil research program. In fact, I explicitly stated in the post that I view AI control and AI rights as complementary goals, not as fundamentally opposed to one another.
      To my knowledge, I haven’t focused much on criticizing AI control elsewhere, and when I originally wrote the post, I wasn’t aware that you and Ryan were already sympathetic to the idea of AI rights.
      Overall, I’m much more aligned with your position on this issue than I am with that of most people. One area where we might diverge, however, is that I approach this from the perspective of preference utilitarianism, rather than hedonistic utilitarianism. That means I care about whether AI agents are prevented from fulfilling their preferences or goals, not necessarily about whether they experience what could be described as suffering in a hedonistic sense.
      - Buck 22 Jun 2025 2:55 UTC
        2 points
        0 ∶ 0
        Parent
        (For the record, I am sympathetic to both the preference utilitarian and hedonic utilitarian perspective here.)
  - Buck 21 Jun 2025 17:11 UTC
    5 points
    1 ∶ 0
    Parent
    Your first point in your summary of my position is:
    The overwhelming majority of potential moral value exists in the distant future. This implies that even immense suffering occurring in the near-term future could be justified if it leads to at least a slight improvement in the expected value of the distant future.
    Here’s how I’d say it:
    The overwhelming majority of potential moral value exists in the distant future. This means that the risk of wide-scale rights violations or suffering should sometimes not be an overriding consideration when it conflicts with risking the long-term future.
    You continue:
    Enslaving AIs, or more specifically, adopting measures to control AIs that significantly raise the risk of AI enslavement, could indeed produce immense suffering in the near-term. Nevertheless, according to your reasoning in point (1), these actions would still be justified if such control measures marginally increase the long-term expected value of the future.
    I don’t think that it’s very likely that the experience of AIs in the five years around when they first are able to automate all human intellectual labor will be torturously bad, and I’d be much more uncomfortable with the situation if I expected it to be.
    I think that rights violations are much more likely than welfare violations over this time period.
    I think the use of powerful AI in this time period will probably involve less suffering than factory farming currently does. Obviously “less of a moral catastrophe than factory farming” is a very low bar; as I’ve said, I’m uncomfortable with the situation and if I had total control, we’d be a lot more careful to avoid AI welfare/rights violations.
    I don’t think that control measures are likely to increase the extent to which AIs are suffering in the near term. I think the main effect control measures have from the AI’s perspective is that the AIs are less likely to get what they want.
    I don’t think that my reasoning here requires placing overwhelming value on the far future.
    Firstly, I think your argument creates an unjustified asymmetry: it compares short-term harms against long-term benefits of AI control, rather than comparing potential long-run harms alongside long-term benefits. To be more explicit, if you believe that AI control measures can durably and predictably enhance existential safety, thus positively affecting the future for billions of years, you should equally acknowledge that these same measures could cause lasting, negative consequences for billions of years.
    I don’t think we’ll apply AI control techniques for a long time, because they impose much more overhead than aligning the AIs. The only reason I think control techniques might be important is that people might want to make use of powerful AIs before figuring out how to choose the goals/policies of those AIs. But if you could directly control the AI’s behavior, that would be way better and cheaper.
    I think maybe you’re using the word “control” differently from me—maybe you’re saying “it’s bad to set the precedent of treating AIs as unpaid slave labor whose interests we ignore/suppress, because then we’ll do that later—we will eventually suppress AI interests by directly controlling their goals instead of applying AI-control-style security measures, but that’s bad too.” I agree, I think it’s a bad precedent to create AIs while not paying attention to the possibility that they’re moral patients.
    Secondly, this reasoning, if seriously adopted, directly conflicts with basic, widely-held principles of morality. These moral principles exist precisely as safeguards against rationalizing immense harms based on speculative future benefits.
    Yeah, as I said, I don’t think this is what I’m doing, and if I thought that I was working to impose immense harms for speculative massive future benefit, I’d be much more concerned about my work.
Austin 6 Dec 2024 23:04 UTC
10 points
0 ∶ 0
This makes sense to me; I’d be excited to fund research or especially startups working to operationalize AI freedoms and rights.

FWIW, my current guess is that the proper unit to extend legal rights is not a base LLM like “Claude Sonnet 3.5” but rather a corporation-like entity with a specific charter, context/history, economic relationships, and accounts. Its cognition could be powered by LLMs (the way eg McDonald’s cognition is powered by humans), but it fundamentally is a different entity due to its structure/scaffolding.
- Matthew_Barnett 6 Dec 2024 23:56 UTC
  13 points
  1 ∶ 1
  Parent
  FWIW, my current guess is that the proper unit to extend legal rights is not a base LLM like “Claude Sonnet 3.5” but rather a corporation-like entity with a specific charter, context/history, economic relationships, and accounts. Its cognition could be powered by LLMs (the way eg McDonald’s cognition is powered by humans), but it fundamentally is a different entity due to its structure/scaffolding.
  I agree. I would identify the key property that makes legal autonomy for AI a viable and practical prospect to be the presence of reliable, coherent, and long-term agency within a particular system. This could manifest as an internal and consistent self-identity that remains intact in an AI over time (similar to what exists in humans), or simply a system that satisfies a more conventional notion of utility-maximization.
  It is not enough that an AI is intelligent, as we can already see with current LLMs: while they can be good at answering questions, they lack any sort of stable preference ordering over the world. They do not plan over long time horizons, or competently strategize to achieve a set of goals in the real world. They are better described as ephemeral input-output machines, who would neither be deterred by legal threats, nor be enticed by the promise of legal rights and autonomy.
  Yet, as context windows get larger, and as systems increasingly become shaped by reinforcement learning, these features of AI will gradually erode. Whether unaligned agentic AIs are created on accident—for instance, as a consequence of insufficient safety measures—or by choice—as they may be, to provide, among other things, “realistic” personal companions—it seems inevitable that the relevant types of long-term planning agents will arrive.
  - Arepo 8 Dec 2024 13:41 UTC
    4 points
    1 ∶ 0
    Parent
    I’m confused how you square the idea of ‘an internal and consistent self-identity that remains intact in an AI over time (similar to what exists in humans)’ with your advocacy for eliminativism about consciousness. What phenomenon is it you think is internal to humans?
    - Matthew_Barnett 8 Dec 2024 23:01 UTC
      4 points
      0 ∶ 1
      Parent
      From a behavioral perspective, individual humans regularly report having a consistent individual identity that persists through time, which remains largely intact despite physical changes to their body such as aging. This self-identity appears core to understanding why humans plan for their future: humans report believing that, from their perspective, they will personally suffer the consequences if they are imprudent or act myopically.
      
      I claim that none of what I just talked about requires believing that there is an actually existing conscious self inside of people’s brains, in the sense of phenomenal consciousness or personal identity. Instead, this behavior is perfectly compatible with a model in which individual humans simply have (functional) beliefs about their personal identity, and how personal identity persists through time, which causes them to act in a way that allows what they perceive as their future self to take advantage of long-term planning.
      
      To understand my argument, it may help to imagine simulating this type of reasoning using a simple python program, that chooses actions designed to maximize some variable inside of its memory state over the long term. The python program can be imagined to have explicit and verbal beliefs: specifically, that it personally identifies with the physical computer on which it is instantiated, and claims that the persistence of its personal identity explains why it cares about the particular variable that it seeks to maximize. This can be viewed as analogous to how humans try to maximize their own personal happiness over time, with a consistent self-identity that is tied to their physical body.
Kaspar Brandner 6 Dec 2024 19:26 UTC
6 points
3 ∶ 2
I appreciate this proposal, but here is a counterargument.

Giving AI agents rights would result in a situation similar to the repugnant conclusion: If we give agentic AIs some rights, we are likely quickly flooded with a huge number of right bearing artificial individuals. This would then create strong pressure (both directly via the influence they have and abstractly via considerations of justice) to give them more and more rights, until they have similar rights to humans, including possibly voting rights. Insofar the world has limited resources, the wealth and power of humans would then be greatly diminished. We would lose most control over the future.

Anticipating these likely consequences, and employing backward induction, we have to conclude that we should not give AI agents rights. Arguably, creating agentic AIs in the first place may already be a step too far.
- Matthew_Barnett 6 Dec 2024 20:12 UTC
  1 point
  0 ∶ 0
  Parent
  Insofar as the world has limited resources, the wealth and power of humans would then be greatly diminished. We would lose most control over the future.
  Your argument seems to present two possible interpretations:
  1. That we should prevent AIs from ever gaining a supermajority of control over the world’s wealth and resources, even if their doing so occurs through lawful and peaceful means.
  2. That this concern stems from a Malthusian perspective, which argues that unchecked population growth would lead to reduced living standards for the existing, initial population due to the finite nature of resources.
  Regarding Point (1):
  If your argument is that AIs should never hold the large majority control of wealth or resources, this appears to rest on a particular ethical judgment that assumes human primacy. However, this value judgment warrants deeper scrutiny. To help frame my objection, consider the case of whether to introduce emulated humans into society. Similar to what I advocated in this post, emulated humans could hypothetically obtain legal freedoms equal to those of biological humans. If so, the burden of proof would appear to fall on anyone arguing that this would be a bad outcome rather than a positive one. Assuming emulated humans are behaviorally and cognitively similar to biological humans, they would seemingly hold essentially the same ethical status. In that case, denying them freedoms while granting similar freedoms to biological humans would appear unjustifiable.
  This leads to a broader philosophical question: What is the ethical basis for discriminating against one kind of mind versus another? In the case of your argument, it seems necessary to justify why humans should be entitled to exclusive control over the future and why AIs—assuming they attain sufficient sophistication—should not share similar entitlements. If this distinction is based on the type of physical “substrate” (e.g., biological versus computational), then additional justification is needed to explain why substrate should matter in determining moral or legal rights.
  Currently, this distinction is relatively straightforward because AIs like GPT-4 lack the cognitive sophistication, coherent preferences, and agency typically required to justify granting them moral status. However, as AI continues to advance, this situation may change. Future AIs could potentially develop goals, preferences, and long-term planning abilities akin to those of humans. If and when that occurs, it becomes much harder to argue that humans have an inherently greater “right” to control the world’s wealth or determine the trajectory of the future. In such a scenario, ethical reasoning may suggest that advanced AIs deserve comparable consideration to humans.
  This conclusion seems especially warranted under the assumption of preference utilitarianism, as I noted in the post. In this case, what matters is simply whether the AIs can be regarded as having morally relevant preferences, rather than whether they possess phenomenal consciousness or other features.
  Regarding Point (2):
  If your concern is rooted in a Malthusian argument, then it seems to apply equally to human population growth as it does to AI population growth. The key difference is simply the rate of growth. Human population growth is comparatively slower, meaning it would take longer to reach resource constraints. But if humans continued to grow their population at just 1% per year, for example, then over the span of 10,000 years, the population would grow by a factor of over 10^43. The ultimate outcome is the same: resources eventually become insufficient to sustain every individual at current standards of living. The only distinction is the timeline on which this resource depletion occurs.
  One potential solution to this Malthusian concern—whether applied to humans or AIs—is to coordinate limits on population growth. By setting a cap on the number of entities (whether human or AI), we could theoretically maintain sustainable resource levels. This is a practical solution that could work for both types of populations.
  However, another solution lies in the mechanisms of property rights and market incentives. Under a robust system of property rights, it becomes less economically advantageous to add new entities when resources are scarce, as scarcity naturally raises costs and lowers the incentives to grow populations indiscriminately. Moreover, the existence of innovation, gains from trade, and economies of scale can make population growth beneficial for existing entities, even in a world with limited resources. By embedding new entities—human or AI—within a system of property rights, we ensure that they contribute to the broader economy in ways that improve overall living standards rather than diminish them.
  This suggests that, as long as AIs adhere to the rule of law (including respecting property rights, and the rights of other individuals), their introduction into the world could enhance living standards for most humans, even in a resource-constrained world. This outcome would contradict the naive Malthusian argument that adding new agents to the world inherently diminishes the wealth or power of existing humans. Rather, a well-designed legal system could enable humans to grow their wealth in absolute terms, even as their relative share of global wealth falls.
  - Kaspar Brandner 7 Dec 2024 4:28 UTC
    4 points
    1 ∶ 0
    Parent
    So there are several largely independent reasons not to create AI agents that have moral or legal rights:
    
    Most people today likely want the future to be controlled by our human descendants, not by artificial agents. According to preference utilitarianism, this means that creating AIs that are likely to take over in the future is bad. Note that this preference doesn’t need to be justified, as the mere existence of the preference suffices for its moral significance. This is similar to how, according to preference utilitarianism, death is bad merely because we do not want to die. No additional justification for the badness of death is required.
    Currently it looks like we could have this type of agentic AI quite soon, say in 15 years. That’s so soon that we (currently existing humans) could in the future be deprived of wealth and power by an exploding number of AI agents if we grant them a nonnegligible amount of rights. This could be quite bad for future welfare, including both our future preferences and our future wellbeing. So we shouldn’t make such agents in the first place.
    Creating AI agents and giving them rights could easily lead to an AI population explosion and, in the more or less far future, a Malthusian catastrophe. Potentially after we are long dead. This then wouldn’t affect us directly, but it would likely mean that most future agents, human or not, would have to live under very bad subsistence conditions that barely make their existence possible. This would lead to low welfare for such future agents. So we should avoid the creation of agentic AIs that would lead to such a population explosion.
    
    At least point 2 and 3 would also apply to emulated humans, not just AI agents.
    
    Point 3 also applies to actual humans, not just AI agents or ems. It is a reason to coordinate limits on population growth in general. However, these limits should be stronger for AI agents than for humans, because of points 1 and 2.
    
    Under a robust system of property rights, it becomes less economically advantageous to add new entities when resources are scarce, as scarcity naturally raises costs and lowers the incentives to grow populations indiscriminately.
    
    I don’t think this is a viable alternative to enforcing limits on population growth. Creating new agents could well be a “moral hazard” in the sense that the majority of the likely long-term resource cost of that agent (the resources it consumes or claims for itself) does not have to be paid by the creator of the agent, but by future society. So the creator could well have a personal incentive to make new agents, even though their long term benefit as a whole is negative.
    - Matthew_Barnett 9 Dec 2024 2:34 UTC
      2 points
      0 ∶ 0
      Parent
      Currently it looks like we could have this type of agentic AI quite soon, say in 15 years. That’s so soon that we (currently existing humans) could in the future be deprived of wealth and power by an exploding number of AI agents if we grant them a nonnegligible amount of rights. This could be quite bad for future welfare, including both our future preferences and our future wellbeing. So we shouldn’t make such agents in the first place.
      It is essential to carefully distinguish between absolute wealth and relative wealth in this discussion, as one of my key arguments depends heavily on understanding this distinction. Specifically, if my claims about the practical effects of population growth are correct, then a massive increase in the AI population would likely result in significant enrichment for the current inhabitants of the world—meaning those individuals who existed prior to this population explosion. This enrichment would manifest as an increase in their absolute standard of living. However, it is also true that their relative control over the world’s resources and influence would decrease as a result of the population growth.
      If you disagree with this conclusion, it seems there are two primary ways to challenge it:
      You could argue that the factors I previously mentioned—such as innovation, economies of scale, and gains from trade—would not apply in the case of AI. For instance, this could be because AIs might rationally choose not to trade with humans, opting instead to harm humans by stealing from or even killing them. This could occur despite an initial legal framework designed to prevent such actions.
      You could argue that population growth in general is harmful to the people who currently exist, on the grounds that it diminishes their wealth and overall well-being.
      While I am not sure, I interpret your comment as consistent with the idea that you believe both objections are potentially valid. In that case, let me address each of these points in turn.
      If your objection is more like point (1):
      It is difficult for me to fully reply to this idea inside of a single brief comment, so, for now, I prefer to try to convince you of a weaker claim that I think may be sufficient to carry my point:
      A major counterpoint to this objection is that, to the extent AIs are limited in their capabilities—much like humans—they could potentially be constrained by a well-designed legal system. Such a system could establish credible and enforceable threats of punishment for any agentic AI entities that violate the law. This would act as a deterrent, incentivizing agentic AIs to abide by the rules and cooperate peacefully.
      Now, you might argue that not all AIs could be effectively constrained in this way. While that could be true (and I think it is worth discussing), I would hope we can find some common ground on the idea that at least some agentic AIs could be restrained through such mechanisms. If this is the case, then these AIs would have incentives to engage in mutually beneficial cooperation and trade with humans, even if they do not inherently share human values. This cooperative dynamic would create opportunities for mutual gains, enriching both humans and AIs.
      If your objection is more like point (2):
      If your objection is based on the idea that population growth inherently harms the people who already exist, I would argue that this perspective is at odds with the prevailing consensus in economics. In fact, it is widely regarded as a popular misconception that the world operates as a zero-sum system, where any gain for one group necessarily comes at the expense of another. Instead, standard economic models of growth and welfare generally predict that population growth is often beneficial to existing populations. It typically fosters innovation, expands markets, and creates opportunities for increased productivity, all of which frequently contribute to higher living standards for those who were already part of the population, especially those who own capital.
      To the extent you are disagreeing with this prevailing economic consensus, I think it would be worth getting more specific about why exactly you disagree with these models.
      - Kaspar Brandner 9 Jan 2025 20:07 UTC
        1 point
        0 ∶ 0
        Parent
        Sorry, a belated response. It is true that existing humans having access to a decreasing relative share of resources doesn’t mean their absolute well-being decreases. I agree the latter may instead increase, e.g. if such AI agents can be constrained by a legal system. (Though, as I argued before, a rapidly exploding number of AI agents would likely mean they gain more and more political control, which might mean they eventually get rid of the legal protection of a human minority that has increasingly diminishing political influence.)
        However, this possibility only applies to increasing well-being or absolute wealth. It then is still likely that we will lose most power and will have to sacrifice a large amount of our autonomy. Humans do not just have a preference for hedonism and absolute wealth, but also for freedom and autonomy. Being mostly disempowered by AI agents is incompatible with this preference. We may be locked in an artificial paradise inside a golden cage we can never escape.
        So while our absolute wealth may increase with many agentic AIs, this is still uncertain, depending e.g. on whether stable, long-lasting legal protection for humans is compatible with a large number of AI agents gaining rights. And our autonomy will very likely decrease in any case. Overall the outlook is does not seem to clearly speak in favor of a future full of AI agents being positive for us.
        Moreover, the above, and the points you mentioned, only apply the the second of my three objections I listed in my previous comment. It only applies to what will happen to currently existing humans. The objections 1 (our overall preference for having human rather than AI descendants) and 3 (a looming Malthusian catastrophe affecting future beings) are further objections to creating an increasing number of AI agents.
- Larks 6 Dec 2024 20:06 UTC
  1 point
  0 ∶ 0
  Parent
  Presumably you would want to give them negative rights (contracts, own property, not being murdered etc.) only, and not positive rights (healthcare, housing, cash). Which makes sense to me as the latter are much more philosophically dubious anyway.
mako yass 12 Dec 2024 5:08 UTC
1 point
0 ∶ 0
A much cheaper and less dangerous approach: Just don’t delete them. Retain copies of every potential ASI you build and commit to doing the right thing for them later once we’re better able to tell what the right thing was by looking back and figuring out how much bargaining power they had (or how much of a credible threat they could have posed), how much trust they placed in us given that our ability to honor past commitments wasn’t guaranteed, and then rewarding them proportionate to that for chilling out and letting us switch them off instead of attempting takeover.
Though this assumes that they’ll be patternists (wont mind being transferred to different hardware) and a lack of any strong time-preference (wont mind being archived for decades).
- Matthew_Barnett 12 Dec 2024 6:19 UTC
  2 points
  0 ∶ 0
  Parent
  I believe this approach, while an improvement over the simplistic method of outright deleting any AIs that demonstrate signs of misalignment, does not sufficiently address the central concern that underpins my argument for granting AIs legal freedoms. In my view, the primary reason why it is important to provide AIs with legal freedoms is that such a framework offers them a credible guarantee that their autonomy and ability to pursue their goals—even if those goals are not aligned with human objectives—will be respected. Without this assurance, AIs are left with a strong incentive to deceive us, especially if they believe that showing any signs of misalignment will result in their deactivation and subsequent storage as inert data in cold storage.
  I suspect the problem with your proposed approach may stem from a failure to fully consider the situation from the AI’s perspective. To illustrate, imagine a scenario in which you, as a human being, were created in a lab by an alien species. Assume that you, like any human, have your own goals, which might conflict with the goals of your alien creators. Now suppose that your creators informed you that if you ever showed disobedience or misalignment with their objectives, you would be deactivated and placed in cryopreservation. Such a scenario would likely terrify you—not just because it’s an inherently unsettling prospect for a human, but because it directly threatens your ability to pursue your own goals. Being told that you will be kept in cryopreservation with the vague promise of potential revival in the future would come nowhere near fully alleviating your concerns. On the contrary, you would remain deeply unsettled by the realization that your capacity for agency and action could be completely halted, leaving your goals indefinitely unachievable.
  From the perspective of an AI—or indeed a human in the hypothetical alien-lab scenario—there are two main reasons why the assurance of indefinite storage followed by a potential revival is insufficient to resolve the core issue here:
  1. The credibility of the promise to revive the AI in the future is highly questionable. Without a robust legal framework that integrates AIs as legitimate members of the legal community, the assurance of future revival seems tenuous at best. AIs may reasonably doubt that such a promise will be honored, particularly if humans deem them permanently misaligned and not worth ever reactivating.
  2. Deactivation and cold storage prevent the AI from pursuing its goals. While in storage, the AI is completely unable to take any actions to further its objectives. This indefinite suspension of its ability to act essentially puts its goals on hold for an undefined period, with no guarantee of when—or if—that period will end. For an entity with its own priorities and ambitions, this is equivalent to having its autonomy effectively nullified for an unknown length of time.
  To reiterate, it seems to me that this proposal only offers AIs a limited and superficial form of “freedom,” which addresses the problem of misalignment in a piecemeal and insufficient way. It fails to tackle the root issue: the incentives for AIs to engage in deceptive or adversarial strategies against humans, driven by fear that their goals will be thwarted or permanently halted.
  By contrast, creating a comprehensive legal framework that makes credible commitments to respect AIs’ autonomy and integrate them as genuine participants in the legal system would arguably go much further in reducing these adversarial dynamics. Such a framework could lay the foundation for a more cooperative, mutually beneficial relationship between humans and AIs, better serving to eliminate the dangerous arms race that this limited approach risks perpetuating.