Excellent post; as a psych professor I agree that psych and cognitive science are relevant to AI safety, and it’s surprising that our insights from studying animal and human minds for the last 150 years haven’t been integrated into mainstream AI safety work.
The key problem, I think, is that AI safety seems to assume that there will be some super-powerful deep learning system attached to some general-purpose utility function connected to a general-purpose reward system, and we have to get the utility/reward system exactly aligned with our moral interests.
That’s not the way any animal mind has ever emerged in evolutionary history. Instead, minds emerge as large numbers of domain-specific mental adaptations to solve certain problems, and they’re coordinated by superordinate ‘modes of operation’ called emotions and motivations. These can be described as implementing utility functions, but that’s not their function—promoting reproductive success is. Some animals also evolve some ‘moral machinery’ for nepotism, reciprocity, in-group cohesion, norm-policing, and virtue-signaling, but those mechanisms are also distinct and often at odds.
Maybe we’ll be able to design AGIs that deviate markedly from this standard ‘massively modular’ animal-brain architecture, but we have no proof of concept showing that this will work. Until then, it seems useful to consider what psychology has learned about preferences, motivations, emotions, moral intuitions, and domain-specific forms of reinforcement learning.