Cognitive Science/Psychology As a Neglected Approach to AI Safety

All of the advice on getting into AI safety research that I’ve seen recommends studying computer science and mathematics: for example, the 80,000 Hours AI safety syllabus provides a computer science-focused reading list, and mentions that “Ideally your undergraduate degree would be mathematics and computer science”.

There are obvious good reasons for recommending these two fields, and I agree that anyone wishing to make an impact in AI safety should have at least a basic proficiency in them. However, I find it a little concerning that cognitive science and psychology are rarely even mentioned in these guides. I believe it would be valuable to have more people working in AI safety whose primary background is in cognitive science or psychology, or who have at least done a minor in one of them.

Here are four lines of AI safety research which I think could benefit from such a background:

  • The psychology of developing an AI safety culture. Besides the technical problem of “how can we create safe AI”, there is the social problem of “how can we ensure that the AI research community develops a culture where safety concerns are taken seriously”. At least two existing papers draw on psychology to consider this problem: Eliezer Yudkowsky’s “Cognitive Biases Potentially Affecting Judgment of Global Risks” uses cognitive psychology to discuss why people might misjudge the probability of risks in general, and Seth Baum’s “On the promotion of safe and socially beneficial artificial intelligence” uses social psychology to discuss the specific challenge of motivating AI researchers to choose beneficial AI designs.

  • Developing better analyses of “AI takeoff” scenarios. Currently, humans are the only general intelligence we know of, so any analysis of what “expertise” consists of and how it can be acquired would benefit from the study of humans. Eliezer Yudkowsky’s “Intelligence Explosion Microeconomics” draws on a number of fields to analyze the possibility of a hard takeoff, including what is known of human intelligence differences and the history of human evolution, whereas my “How Feasible is the Rapid Development of Artificial Superintelligence?” draws extensively on the work of a number of psychologists to argue that, based on what we know of human expertise, scenarios in which AI systems become major actors within timescales on the order of mere days or weeks remain within the range of plausibility.

  • Defining just what human values are. The project of AI safety can roughly be defined as “the challenge of ensuring that AIs remain aligned with human values”, but it’s also widely acknowledged that nobody really knows exactly what human values are, or at least not to a sufficient extent that they could be given a formal definition and programmed into an AI. This seems like one of the core problems of AI safety, and one which can only be understood via a psychology-focused research program. Luke Muehlhauser’s article “A Crash Course in the Neuroscience of Human Motivation” examined human values from the perspective of neuroscience, and my “Defining Human Values for Value Learners” sought to provide a preliminary definition of human values in a computational language, drawing from the intersection of artificial intelligence, moral psychology, and emotion research. Both of these are very preliminary papers, and it would take a full research program to pursue this question in more detail. (For a toy illustration of the general value-learning framing, see the first code sketch after this list.)

  • Better understanding multi-level world-models. MIRI defines the technical problem of “multi-level world-models” as “How can multi-level world-models be constructed from sense data in a manner amenable to ontology identification?”. In other words, suppose that we had built an AI to make diamonds (or anything else we care about) for us. How should that AI be programmed so that it could still accurately estimate the number of diamonds in the world after it had learned more about physics, and after it had learned that the things it calls “diamonds” are actually composed of protons, neutrons, and electrons? (The second code sketch after this list illustrates the problem in miniature.) While I haven’t seen any papers that explicitly tackle this question yet, a reasonable starting point would seem to be the question of “well, how do humans do it?”. There, psychology and cognitive science may offer some clues. For instance, in the book Cognitive Pluralism, the philosopher Steven Horst argues that humans have multiple different, mutually incompatible mental models and reasoning systems, ranging from core knowledge systems to scientific theories, which they flexibly switch between depending on the situation. (Unfortunately, Horst approaches this as a philosopher, so he is mostly content to make the case that this is so in general, leaving it to cognitive scientists to work out how exactly the switching happens.) I previously offered a general argument along these lines in my article World-models as tools, suggesting that at least part of the choice of a mental model may be driven by reinforcement learning in the basal ganglia. That isn’t saying much by itself, though, given that nearly all human thought and behavior seems to be driven at least in part by reinforcement learning in the basal ganglia. Again, this would take a dedicated research program.
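To make the value-definition problem a little more concrete, here is the first sketch: an agent that maintains a probability distribution over candidate “value functions” and updates it from observed human choices. To be clear, this is a generic Bayesian, inverse-reinforcement-learning-style toy of my own, not the formalism of either paper mentioned above; all of the option names and candidate values are made up for illustration.

```python
import numpy as np

OPTIONS = ["save_money", "donate", "buy_gadget"]

# Hypothetical candidate value functions: each maps an option to a utility.
# In a real system, specifying this hypothesis space is exactly the hard,
# psychology-laden part of the problem.
CANDIDATE_VALUES = {
    "frugal":     {"save_money": 1.0, "donate": 0.3, "buy_gadget": 0.1},
    "altruistic": {"save_money": 0.2, "donate": 1.0, "buy_gadget": 0.1},
    "hedonistic": {"save_money": 0.1, "donate": 0.2, "buy_gadget": 1.0},
}

def choice_likelihood(values, chosen, beta=3.0):
    """P(human picks `chosen`) under a Boltzmann-rational choice model."""
    utils = np.array([values[option] for option in OPTIONS])
    probs = np.exp(beta * utils) / np.exp(beta * utils).sum()
    return probs[OPTIONS.index(chosen)]

def update_posterior(prior, observed_choice):
    """One Bayesian update of the distribution over candidate value functions."""
    unnormalized = {
        name: prior[name] * choice_likelihood(values, observed_choice)
        for name, values in CANDIDATE_VALUES.items()
    }
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}

# Start from a uniform prior and watch the human donate twice.
posterior = {name: 1 / len(CANDIDATE_VALUES) for name in CANDIDATE_VALUES}
for choice in ["donate", "donate"]:
    posterior = update_posterior(posterior, choice)

print(posterior)  # probability mass concentrates on "altruistic"
```

Note what the toy sweeps under the rug: it assumes values arrive as a tidy menu of utility functions over a fixed set of options. Figuring out what the hypothesis space of human values should actually look like is precisely the psychological research question.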
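And here is the second sketch, a deliberately tiny illustration of the ontology identification problem. The scenario and every name in it are my own invention rather than MIRI’s formalism: the agent’s goal is defined over one ontology, its world-model later shifts to a finer-grained one, and the goal only survives if something like the bridge function below can be constructed.

```python
# Ontology v1: the world is a list of labeled objects, and "diamond"
# is a primitive concept that the goal is defined over.
world_v1 = ["diamond", "rock", "diamond", "tree"]

def count_diamonds_v1(world):
    return sum(1 for obj in world if obj == "diamond")

# Ontology v2, after learning some physics: the world is a list of
# atomic structures, and the label "diamond" appears nowhere in it.
world_v2 = [
    {"element": "C",  "lattice": "tetrahedral"},  # formerly "diamond"
    {"element": "Si", "lattice": "amorphous"},    # formerly "rock"
    {"element": "C",  "lattice": "tetrahedral"},  # formerly "diamond"
    {"element": "C",  "lattice": "cellulose"},    # formerly "tree"
]

def is_diamond_v2(structure):
    """Hand-written bridge: which v2 structures count as a v1 'diamond'?"""
    return structure["element"] == "C" and structure["lattice"] == "tetrahedral"

def count_diamonds_v2(world):
    return sum(1 for structure in world if is_diamond_v2(structure))

# The goal's meaning is preserved across the ontology shift...
assert count_diamonds_v1(world_v1) == count_diamonds_v2(world_v2) == 2
# ...but only because a programmer wrote `is_diamond_v2` by hand. The open
# problem is how the agent itself could construct such bridges when it
# discovers a new ontology; this is where "how do humans do it?" comes in.
```

Humans seem to handle such shifts routinely: learning chemistry doesn’t make anyone lose track of what a diamond is, which suggests that the relevant cognitive machinery exists and can be studied.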

From these four examples, one can derive more general use cases for psychology and cognitive science within AI safety:

  • Psychology, as the study of human thought and behavior, helps guide actions aimed at understanding and influencing people’s behavior in a more safety-aligned direction (related example: the psychology of developing an AI safety culture)

  • The study of the only general intelligence we know about may provide information about the properties of other general intelligences (related example: developing better analyses of “AI takeoff” scenarios)

  • A better understanding of how human minds work may help us figure out how we want the cognitive processes of AIs to work, so that they end up aligned with our values (related examples: defining human values, better understanding multi-level world-models)

Here I would ideally offer reading recommendations, but the fields are so broad that any given book can only give a rough idea of the basics; for instance, the question of which world-models human brains use is just one of many, many subquestions that the fields cover. Hence my suggestion to have some safety-interested people actually study these fields as a major, or at least a minor.

Still, if I had to suggest a couple of books, with the main idea of getting a basic grounding in the mindsets and theories of the fields so that it becomes easier to read more specialized research: on the cognitive psychology/cognitive science side, I’d suggest Cognitive Science by José Luis Bermúdez (I haven’t read it myself, but Luke Muehlhauser recommends it, and it looked good to me based on the table of contents; see also Luke’s follow-up recommendations behind that link); Cognitive Psychology: A Student’s Handbook by Michael W. Eysenck & Mark T. Keane; and maybe Sensation and Perception by E. Bruce Goldstein. I’m afraid that I don’t know of any good introductory textbooks on the social psychology side.