This argument only makes sense if you have a very low P(doom) (like <0.1%) or if you place minimal value on future generations. Otherwise, it’s not worth recklessly endangering the future of humanity to bring utopia a few years (or maybe decades) sooner. The math on this is really simple: bringing AI sooner only benefits the current generation, but extinction harms all future generations. You don’t need to be a strong longtermist; you just need to accord significant value to people who aren’t born yet.
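To spell that math out as a rough sketch (the symbols here are just illustrative labels, not figures anyone has defended): let $B_{\text{now}}$ be the benefit of earlier AI to people alive today, $p$ be P(doom), and $V_{\text{future}}$ be the value of all future generations. On a simple expected-value view, accelerating AI is worth it only if

$$B_{\text{now}} > p \cdot V_{\text{future}}.$$

If you give future generations significant weight, $V_{\text{future}}$ dwarfs $B_{\text{now}}$, so the inequality can only hold when $p$ is very small, which is the point above.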
Here’s a counter-argument that relies on the following assumptions:
First, suppose you believe unaligned AIs would still be conscious entities, capable of having meaningful, valuable experiences. This could be because you think unaligned AIs will be very cognitively sophisticated, even if they don’t share human preferences.
Second, assume you’re a utilitarian who doesn’t assign special importance to whether the future is populated by biological humans or digital minds. If both scenarios result in a future full of happy, conscious beings, you’d view them as roughly equivalent. In fact, you might even prefer digital minds if they could exist in vastly larger numbers or had features that enhanced their well-being relative to biological life.
With those assumptions in place, consider the following dilemma:
If AI is developed soon, there’s some probability p that billions of humans will die due to misaligned AI—an obviously bad outcome. However, if these unaligned AIs replace us, they would presumably still go on to create a thriving and valuable civilization from a utilitarian perspective, even though humanity would not be part of that future.
If AI development is delayed by several decades to ensure safety, billions of humans will die of old age in the meantime, people who could otherwise have been saved by the accelerated medical advancements that earlier AI would enable. This, too, is clearly bad. However, humanity would eventually develop AI safely and go on to build a similarly valuable civilization, just after a significant delay.
Given these two options, a utilitarian doesn’t have strong reasons to prefer the second approach. While the first scenario carries substantial risks, it does not necessarily endanger the entire long-term future. Instead, the primary harm seems to fall on the current generation: either billions of people die prematurely due to unaligned AI, or they die from preventable causes like aging because of delayed technological progress. In both cases, the far future—whether filled with biological or digital minds—remains intact and flourishing under these assumptions.
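To put the same point in rough symbols (again, just illustrative labels rather than anything argued for above): under the two assumptions, both branches end in a comparably valuable far future, so

$$\mathrm{EV}(\text{build now}) \approx V_{\text{future}} - p \cdot D_{\text{takeover}}, \qquad \mathrm{EV}(\text{delay}) \approx V_{\text{future}} - D_{\text{delay}},$$

where $D_{\text{takeover}}$ is the near-term harm if misaligned AI kills billions and $D_{\text{delay}}$ is the near-term harm of billions dying of preventable causes during the pause. The $V_{\text{future}}$ term is, by assumption, roughly the same on both sides, so it drops out of the comparison, and the choice turns entirely on the near-term harms.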
In other words, there simply isn’t a compelling utilitarian argument for choosing to delay AI in this dilemma.
Do you have any thoughts on the assumptions underlying this dilemma, or its conclusion?
I want to add that I think the argument you present here is better than any of the arguments I’d considered in the relevant section. That said, I think it’s quite unlikely that a misaligned AI would create an AI utopia. It seems much more likely that it would create something that resembles a paperclip maximizer: something with either no conscious experience, experiences with no valence, or experiences with random valences.
What’s your credence that humans create a utopia in the alternative? Depending on the strictness of one’s definition, I think a future utopia is quite unlikely either way, whether we solve alignment or not.
It seems you expect future unaligned AIs will either be unconscious or will pursue goals that result in few positive conscious experiences being created. I am not convinced of this myself. At the very least, I think such a claim demands justification.
Given the apparent ubiquity of consciousness in the animal kingdom, and the anticipated sophistication of AI cognition, it is difficult for me to imagine a future essentially devoid of conscious life, even if that life is made of silicon and it does not share human preferences.
I think the position I’m arguing for is basically the standard position among AI safety advocates, so I haven’t really scrutinized it. But basically, (many) animals evolved to experience happiness because it was evolutionarily useful to do so. AIs are not evolved, so it seems likely that, by default, they would not be capable of experiencing happiness. This could be wrong: it might be that happiness is a byproduct of some sort of information processing, and sufficiently complex reinforcement learning agents necessarily experience happiness (or something like that).
Also: According to the standard story where an unaligned AI has some optimization target and then kills all humans in the interest of pursuing that target (e.g. a paperclip maximizer), it seems unlikely that this AI would experience much happiness (granting that it’s capable of happiness) because its own happiness is not the optimization target.
(Note: I realize I am ignoring some parts of your comment, I’m intentionally only responding to the central point so my response doesn’t get too frayed.)
According to the standard story where an unaligned AI has some optimization target and then kills all humans in the interest of pursuing that target (e.g. a paperclip maximizer), it seems unlikely that this AI would experience much happiness (granting that it’s capable of happiness) because its own happiness is not the optimization target.
I agree that this is the standard story regarding AI risk, but I haven’t seen convincing arguments that support this specific model.
In other words, I see no compelling evidence to believe that future AIs will have exclusively abstract, disconnected goals—like maximizing paperclip production—and that such AIs would fail to generate significant amounts of happiness, either as a byproduct of their goals or as an integral part of achieving them.
(Of course, it’s crucial to avoid wishful thinking. A favorable outcome is by no means guaranteed, and I’m not arguing otherwise. Instead, my point is that the core assumption underpinning this standard narrative seems weakly argued and poorly substantiated.)
The scenario I find most plausible is one in which AIs have a mixture of goals, much like humans. Some of these goals will likely be abstract, while others will be directly tied to the AI’s internal experiences and mental states.
Just as humans care about their own happiness but also care about external reality—such as the impact they have on the world or what happens after they’re dead—I expect that many AIs will place value on both their own mental states and various aspects of external reality.
This ultimately depends on how AIs are constructed and trained, of course. However, as you mentioned, there are some straightforward reasons to anticipate parallels between how goals emerge in animals and how they might arise in AIs. For example, robots and some other types of AIs will likely be trained through reinforcement learning. While RL on computers isn’t identical to the processes by which animals learn, it is similar enough in critical ways to suggest that these parallels could have significant implications.
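As a purely illustrative sketch of the structural parallel being gestured at here (the environment, reward values, and hyperparameters below are invented, and nothing in it bears on whether such an agent has experiences), here is a toy reward-learning loop in which a scalar reward signal gradually shapes an agent’s action tendencies:

```python
import random

ACTIONS = ["a", "b"]
TRUE_REWARD = {"a": 0.2, "b": 0.8}   # hypothetical environment, chosen arbitrarily

value = {a: 0.0 for a in ACTIONS}    # the agent's learned action-value estimates
LEARNING_RATE = 0.1
EPSILON = 0.1                        # how often the agent explores at random

for step in range(1000):
    # Mostly pick the currently best-looking action, occasionally explore.
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: value[a])
    # A noisy scalar reward plays the role that reinforcement signals play in animal learning.
    reward = TRUE_REWARD[action] + random.gauss(0, 0.1)
    # Nudge the value estimate toward the observed reward.
    value[action] += LEARNING_RATE * (reward - value[action])

print(value)  # the agent ends up strongly favoring the higher-reward action
```

The behavioral signature, i.e. actions followed by reward becoming more likely, is the same one reinforcement produces in animals; whether anything like happiness accompanies it in an AI is exactly the open question discussed above.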