In the standard story, an unaligned AI has some optimization target and kills all humans in the course of pursuing that target (e.g. a paperclip maximizer). It seems unlikely that such an AI would experience much happiness (granting that it’s capable of happiness), because its own happiness is not the optimization target.
I agree that this is the standard story regarding AI risk, but I haven’t seen convincing arguments that support this specific model.
In other words, I see no compelling evidence to believe that future AIs will have exclusively abstract, disconnected goals—like maximizing paperclip production—and that such AIs would fail to generate significant amounts of happiness, either as a byproduct of their goals or as an integral part of achieving them.
(Of course, it’s crucial to avoid wishful thinking. A favorable outcome is by no means guaranteed, and I’m not arguing otherwise. Instead, my point is that the core assumption underpinning this standard narrative seems weakly argued and poorly substantiated.)
The scenario I find most plausible is one in which AIs have a mixture of goals, much like humans. Some of these goals will likely be abstract, while others will be directly tied to the AI’s internal experiences and mental states.
Just as humans care about their own happiness but also care about external reality—such as the impact they have on the world or what happens after they’re dead—I expect that many AIs will place value on both their own mental states and various aspects of external reality.
This ultimately depends on how AIs are constructed and trained, of course. However, as you mentioned, there are some straightforward reasons to anticipate parallels between how goals emerge in animals and how they might arise in AIs. For example, robots and some other types of AIs will likely be trained through reinforcement learning (RL). While RL on computers isn’t identical to the processes by which animals learn, it is similar enough in critical ways to suggest that these parallels could have significant implications.
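To make the parallel concrete, here is a minimal sketch of the kind of reward-driven update at the heart of RL: behavior followed by reward becomes more likely, which is structurally similar to reinforcement in animal learning. The two actions and their reward probabilities are invented for illustration; this is not a claim about how any particular AI is actually trained.

```python
import random

# Two "actions" with different average rewards; the agent learns which to
# prefer purely from scalar reward feedback, much as operant conditioning
# shapes behaviour through reinforcement. Reward probabilities are made up.
REWARD_PROB = {"a": 0.8, "b": 0.3}

values = {"a": 0.0, "b": 0.0}   # learned value estimates per action
alpha, epsilon = 0.1, 0.1       # learning rate, exploration rate

for step in range(5000):
    # Mostly exploit the current best estimate, occasionally explore.
    if random.random() < epsilon:
        action = random.choice(list(values))
    else:
        action = max(values, key=values.get)

    reward = 1.0 if random.random() < REWARD_PROB[action] else 0.0

    # Incremental value update: behaviour followed by reward is reinforced.
    values[action] += alpha * (reward - values[action])

print(values)  # the estimate for "a" ends up higher, so "a" is chosen more
```

Whether updates like this give rise to anything resembling internally valued mental states is, of course, exactly the open question; the sketch only shows why the analogy to animal learning isn’t empty.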