Not Just For Therapy Chatbots: The Case For Compassion In AI Moral Alignment Research
Note: I’ve also submitted this to LessWrong, but I’m not sure exactly how that works or whether it’ll be retained, so I’m posting it here for now.
This was a final project for the BlueDot AI Safety Fundamentals (Alignment) Course. Thanks to my facilitator and the other fellows for a great learning experience and all their support!
Purpose and Introduction
Much AI Safety research has focused on ensuring AIs conform to the intentions implicitly given by humans, but this approach alone may be insufficient (Ji et al., 2023, Section 4.3). Some recent research has explicitly used human values and preferences as benchmarks for AI moral reasoning (e.g., Hendrycks et al., 2020; Jin et al., 2022), and concluded that the study of human moral psychology could be instrumental for improving AI moral reasoning and behavior. While identifying human values and evaluating AIs against them is important, it may not be the only way in which human psychology can be applied to AI moral alignment. I think it is important to evaluate the underlying processes which inform moral cognition and action, and not just the outcomes themselves. This is especially important because historical and contemporary evidence suggests that our values are still in development (Singer, 1981), so aligning to our current values may be suboptimal. The question motivating this article is this: what cognitive processes are required for an agent to develop good moral reasoning and remain open to further moral development?
Unfortunately, I am not able to fully answer this question here. However, I argue that one key component of morality in both humans and AIs is compassion, which includes an emotional component. My intuition is that many people in AI Safety and adjacent communities hold some form of the belief that true morality is derived from reason, and that affective empathetic feelings—like the anguish we feel when we see an emaciated child or an oil-drenched duckling—interfere with our capacity to be impartially moral. As Nate Soares famously said, “my care-o-meter is broken.” The evidence I have seen leads me to agree that our affective responses, left to themselves, are unreliable guides to impartial moral action. Yet there is also compelling evidence that affective empathy is important for moral development, and that it contributes to impartial moral regard. To me, this does not necessarily result in a contradiction; rather, it suggests that both affect and reason are important for morality, and that the way in which they interact with one another is especially important for moral development.
Ideally, there would be a term which captures exactly the nodes and relationships we want from this web of concepts and sub-concepts while leaving out the rest, but I do not know of such a term. I focus on compassion because it includes the recognition of suffering, connectedness with others and a universal orientation, and the motivation to alleviate suffering, all of which would seem to be ideal characteristics of a moral agent. Research on compassion also tends to be cautious about the personal distress and ingroup bias which may result from certain aspects of empathy (Stevens & Taber, 2021). So though it may be imperfect, compassion seems to be the concept which best highlights the interplay between emotional and rational processes.
Methods
This project is built upon a review of other research, and includes no original research. I was not focused on a single, well-defined field, making it difficult to specify the scope and search terms of the project ahead of time. This project is therefore an exploration of some of the research on compassion and morality, particularly as these constructs relate to AI. It is by no means a systematic review. Papers were found by searches on Google Scholar, by referring to prior knowledge, and through the snowball method.
What Is Compassion?
A review of proposed definitions of compassion identified five components: the recognition of suffering, the understanding of the universality of suffering, emotional resonance, tolerating uncomfortable feelings, and motivation to alleviate suffering (Strauss et al., 2016). What these definitions seem to agree upon is that compassion includes a recognition of the suffering of others and a wish for the alleviation of that suffering. Compassion is closely related to other prosocial drives like empathy, sympathy, and Theory of Mind (ToM), to the point where it may be unwise to treat them as fully separable concepts. Indeed, as we will see, aspects of empathy are crucial for compassion.
Compassion and Morality
When we think of a moral person, we might think of someone who is very caring, someone who stops to comfort homeless people and cries when they watch animals get slaughtered. We might also think of someone whose actions are derived from careful reasoning from principles and objective facts, unburdened by appeals to tradition or personal relationships. We might think of one as being driven by affect (emotional feeling), and the other by reason. The question is: which one of these two drives leads to a moral person?
Some evidence supports the notion that the “reason” drive is of primary importance for morality. Evidence suggests that emotions may drive individuals to donate to charities and causes that are less effective but more personally compelling (Caviola et al., 2021). A meta-analysis of neuroimaging studies of morality, empathy, and ToM found that the brain areas activated during moral cognition were more similar to those activated by tasks engaging ToM than to those activated by tasks engaging affective empathy (Bzdok et al., 2012). In turn, ToM is generally associated with abstract reasoning, while affective empathy is generally associated with the vicarious feeling of others’ emotions. It may be tempting to conclude that reason alone drives morality, and that affect is unnecessary or even interfering.
Yet a different picture emerges when we consider more real-world situations. Even among neurotypical individuals, highly impersonal altruistic actions (such as donating a kidney to a stranger) were positively correlated with empathic concern and negatively correlated with coldheartedness (Vekaria et al., 2017). Both perspective-taking and empathic concern were associated with greater moral expansiveness; that is, the tendency to have a larger circle of moral regard (Crimston et al., 2016, Table 5). The importance of affect in moral development is most starkly demonstrated by a condition associated with profound amorality: psychopathy. Neuroimaging studies show that psychopaths appear to have fully functional ToM—they are proficient at inferring others’ mental states—but they simply lack the affective motivation to care (Decety & Jackson, 2004, p. 89). This lack of caring appears to have drastic consequences. Though clinical psychopaths make up a very small proportion (~1%) of the general population, they make up 15–25% of the incarcerated population and commit an outsized proportion of violent crimes (Kiehl & Hoffman, 2011, Section III). Among world leaders, autocrats were shown to have significantly elevated levels of psychopathy, narcissism, and Machiavellianism compared to non-autocrats (Nai & Toros, 2020; see “Results”). Tyrants who have caused massive impersonal harm—such as Hitler and Stalin—were also noted to be personally uncaring, cruel, and amoral (Glad, 2002). It seems that, at least in humans, caring on an impersonal level requires the capacity to also care on a personal level.
Though there is some tension in the evidence, I will argue that affective empathy is important for morality. While Bzdok et al. (2012) found that moral cognition was more closely related to ToM than to affective empathy, they also found that some brain regions involved in affective empathy were relatively closely related to moral decision-making. Additionally, their analysis did not include studies of empathy for others’ pain, and it drew mainly from laboratory studies, whereas real-world observations give relatively greater support to the importance of affect. This suggests that laboratory studies may fail to capture important aspects of real-world moral decision-making, which would explain why they indicate a smaller role for affective empathy. This is, however, my own inference, and more research should be done to reconcile these two lines of evidence.
A Rough Model of Compassion
From a neuroscientific perspective, compassion has been conceived of as being composed of three parts: an affective response to others’ suffering (either directly in response to the stimulus or through vicarious experience of others’ emotional states), a cognitive inference of others’ emotional states, and an intermediate component which bridges affective and cognitive processes (Ashar et al., 2016; Stevens & Taber, 2021). This three-part model of compassion is roughly analogous to some neuroscientifically informed models of empathy (Decety & Jackson, 2004; Schurz et al., 2020). While there appears to be a fair amount of agreement and clarity regarding the functioning of the affective and cognitive aspects of compassion and empathy, the function and structure of the intermediate component seem less clear. Ashar et al. (2016) propose that the intermediate component is “emotional meaning,” which integrates affective and cognitive processes and stimuli to form an evaluation of the other’s significance in relation to the self (see Part 1, subsection labeled “emotional meaning”), and they suggest that emotional meaning is an important determinant of prosocial behavior. While the specific details of the intermediate component remain nebulous, there seems to be general agreement that it bridges the affective and cognitive components and is similarly important in generating prosocial outcomes.
Compassion for others therefore begins with affective empathy, the vicarious experience of some degree of another’s suffering. From an egoistic point of view, this makes sense. A rationally self-interested agent should not sacrifice their own well-being for another unless there is a sufficient reward for doing so or sufficient punishment for not doing so. Affective empathy may thus be seen as an important component of aligning self-interest to social interest by internalizing the costs of others’ suffering and the benefits of others’ relief and well-being. Without it, an individual’s own utility is not bound to the utilities of others except through pragmatic concerns. This may be a particular concern for powerful individuals (such as transformative AIs), who have the latitude to cause massive amounts of harm for their own satisfaction with little chance of reprisal (Glad, 2002). At the same time, affective empathy on its own may be inadequate in many situations, particularly ones which are complex, novel, and/or impersonal. Cognitive abilities to take another’s perspective, to simulate the other’s situation and mental state and to identify ways to help them, appear to be important in translating affective empathy into effective prosocial action (Stevens & Taber, 2021). Modulation of affective empathy by cognitive processes may be critical in transforming parochial empathy into impersonal compassion and prosocial action (Stevens & Taber, 2021, section 8). Future research may explore which specific affective and cognitive processes are involved in compassion, and how they interact with one another to produce it.
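To make the idea of “internalizing the costs of others’ suffering” concrete, here is a deliberately toy sketch in Python. It is my own illustration, not a model drawn from the cited literature: the options, payoffs, and empathy weights are hypothetical. The point is simply that when the empathy weight is zero, the other agent’s suffering has no bearing on the agent’s choices, which is exactly the situation affective empathy is hypothesized to prevent.

```python
# Toy illustration (not from the cited literature): an agent whose effective
# utility adds an empathy-weighted share of another agent's welfare.
def effective_utility(own_payoff: float, other_payoff: float,
                      empathy_weight: float) -> float:
    """With empathy_weight = 0, the other's suffering has no effect on choice."""
    return own_payoff + empathy_weight * other_payoff

# Two hypothetical options: "exploit" benefits the agent but harms the other;
# "help" costs the agent a little and relieves the other's suffering.
options = {"exploit": (5.0, -10.0), "help": (-1.0, 8.0)}

for w in (0.0, 0.5, 1.0):  # no empathy, partial weighting, full weighting
    best = max(options, key=lambda o: effective_utility(*options[o], w))
    print(f"empathy_weight={w}: agent chooses '{best}'")
```

Real agents are of course not simple payoff maximizers, but the sketch shows why binding one agent’s utility to another’s matters at all: with the weight at zero the agent exploits, and with any substantial weight it helps.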
Instilling Empathy/Compassion in AIs
Evidence suggests that for typical humans, empathy—particularly its affective component—is experience-dependent (Blair, 2005; Hamilton et al., 2015). That is, we are born with the capacity to develop empathy, but empathy must still be developed through continual engagement of relevant processes by social experiences. There may thus be a high-level analogy between human development of empathy/compassion and the development of empathy and compassion in AIs. We might expect that training AIs on some form of empathy/compassion data would be critical for the development of moral AI.
While there are plenty of relevant data in the form of text scrapes, social media conversations, and books, these data are poorly curated and likely suboptimal for developing empathy (Rashkin et al., 2018). In particular, some social media platforms have relatively high proportions of callous or incendiary conversations. If we wouldn’t expect a human child to learn empathy from browsing X and Reddit threads all day, then it seems reasonable to expect that such training would be insufficient for AIs, too. One way of increasing AI empathy might be to use data that were specifically created for empathy training. This was accomplished in the creation of EmpatheticDialogues (ED), a dataset of 25,000 conversations covering 32 relatively fine-grained emotions (Rashkin et al., 2018). Fine-tuning dialogue models on ED resulted in better performance in automatic and human evaluations; fine-tuning plus labeling the data with emotion or topic generally improved upon the baseline but had mixed results compared to fine-tuning alone. While this study found that fine-tuning on ED generally improved the models’ empathy, it also concluded that they still fell short of humans. However, there is also some evidence that the leading LLMs are better than or at least comparable to humans when it comes to empathic abilities (Welivita & Pu, 2024). Similarly, work on ToM in LLMs has produced seemingly contradictory results, with Ullman (2023) finding that trivial modifications to ToM tasks produced erroneous responses and Strachan et al. (2024) finding that GPT-4 matched or exceeded human performance on most ToM tasks (though LLaMA-2 did not). At least some of this evidence, then, suggests that improvements in AI empathy, both affective and cognitive, are needed for human-aligned AI.
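As a concrete, hypothetical illustration of what fine-tuning on ED can look like in practice, here is a minimal sketch using the Hugging Face datasets and transformers libraries. This is not the setup used by Rashkin et al. (2018), who fine-tuned their own Transformer-based dialogue models; the hub dataset ID, the field names, and the choice of a small GPT-2 variant are assumptions made for illustration only.

```python
# Minimal sketch: fine-tuning a small causal language model on
# EmpatheticDialogues (ED). Illustrative only; not the original authors' setup.
# Assumption: the dataset is hosted on the Hugging Face hub under this ID,
# with "context" (emotion label), "prompt" (situation), and "utterance" fields.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

ds = load_dataset("facebook/empathetic_dialogues", split="train")

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

def to_text(example):
    # Flatten each ED row into a single training string.
    return {"text": f"Emotion: {example['context']}\n"
                    f"Situation: {example['prompt']}\n"
                    f"Response: {example['utterance']}"}

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

ds = ds.map(to_text)
ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ed-finetune",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```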
There have been a few approaches to further increasing or ensuring LLM empathy. One approach is to increase the amount of data available for empathy training. This was accomplished by using the conversations in ED as prompts for the creation of new, synthetic conversation data, which were then evaluated by another model fine-tuned on ED, with the goal of maximizing the empathy, coherence, naturalness, and diversity of the data (Liang et al., 2024). A model trained on these synthetic data together with ED generally outperformed other state-of-the-art models in both automatic and human evaluations of empathy. Another approach has been to train smaller models on empathy data (such as ED) and then plug them into larger models (Yang et al., 2024). This reduces the computation and time needed to train the model while improving performance as measured by both automatic and human evaluation.
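The synthetic-data approach can be pictured as a generate-then-score-then-filter loop. The sketch below is a schematic reconstruction of that general idea, not the actual pipeline of Liang et al. (2024); the generator, the scoring function, and the threshold are placeholder stand-ins.

```python
# Schematic generate-then-filter loop for synthetic empathy data.
# Simplified reconstruction of the general idea only: a real pipeline would
# use an LLM as the generator and an ED-fine-tuned model as the scorer.
import random
from dataclasses import dataclass

@dataclass
class Dialogue:
    situation: str
    response: str

def generate_candidates(seed: Dialogue, n: int = 4) -> list[Dialogue]:
    # Placeholder generator: a real pipeline would prompt an LLM with the
    # seed conversation to produce new empathetic responses.
    return [Dialogue(seed.situation, f"{seed.response} (variant {i})")
            for i in range(n)]

def score_quality(candidate: Dialogue) -> float:
    # Placeholder scorer: a real pipeline would rate empathy, coherence,
    # naturalness, and diversity with a model fine-tuned on ED.
    return random.random()

def build_synthetic_dataset(seeds: list[Dialogue],
                            threshold: float = 0.7) -> list[Dialogue]:
    """Keep only generated candidates whose quality score clears the threshold."""
    kept = []
    for seed in seeds:
        for candidate in generate_candidates(seed):
            if score_quality(candidate) >= threshold:
                kept.append(candidate)
    return kept

if __name__ == "__main__":
    seeds = [Dialogue("I lost my job last week.",
                      "I'm so sorry. That must be really stressful.")]
    print(build_synthetic_dataset(seeds))
```

Candidates that survive the filter would then be combined with ED itself for fine-tuning, mirroring the combined-data setup described above.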
There has thus been some promising work on empathy and ToM in LLMs. In particular, a foundational benchmark and training dataset, EmpatheticDialogues, has been created and used in some research. Nevertheless, more research is needed in this area. Future research should aim to clarify our understanding of the current state of LLM empathy and ToM relative to humans, both in general and in specific domains and tasks. If possible, it may also be instructive to study LLM empathy and compassion from an interpretability perspective, which may be sufficiently analogous to neuroscientific studies of empathy, compassion, and morality in humans to be useful. Finally, I think work needs to be done to understand how processes like affective empathy, ToM/cognitive empathy, moral reasoning, and intermediary components work together to produce compassion and impersonal moral regard. I am not aware of research on compassion or on the relationship between affective and cognitive empathy in LLMs. This appears to be a relatively open question in the study of humans as well, so future findings in humans may inform research on LLMs in this area, and vice versa.
Limitations and Conclusion
This article should not be treated as a rigorous meta-analysis or systematic review. At best, this was an exploratory review of compassion and related concepts in humans, the implications this may have for artificial intelligences, and some steps which have been taken in AI alignment research with regard to empathy and compassion.
This article is focused primarily on compassion and its components and related concepts. While I make the case that compassion is important and perhaps necessary for the development of moral AI, I don’t claim that compassion is sufficient. Formal logical, game-theoretic, moral decision-making, and other such approaches may also be critical for moral AI. I think all these approaches converge upon the same problem—building moral AI—but do so from importantly different perspectives.
There is always the question of how well human and AI minds parallel one another. While I think there is a tendency, both within and outside of AI Safety communities, to downplay the parallels, I recognize that this tendency reflects differences which are probably quite real. For example, a single world-controlling AI who forms only impersonal relationships with humans may not need empathy as much as we do, as that AI would not be (as much of) a social creature. Nevertheless, I think human minds are some of the best analogues we have for inferring what AI minds will or should look like, though we must of course recognize that analogies between humans and AIs are limited.
The original aim of this project was to investigate the factors which would contribute to the creation of an ideal moral character. Rather than focusing on the values which we would seek to align an agent to, or the outcomes it exhibits, I wished to focus on the intentions and dispositions which characterize highly moral, forward-thinking individuals (e.g., Jeremy Bentham). This aim was in part due to my belief that human values as they are now are still fairly atrocious (e.g., humans’ moral disregard for most animals), and that history has shown that we are constantly revising our values and improving upon the ethics we have inherited (Singer, 1981). In this article, I argued for the importance of compassion, which evidence suggests is formed by the interplay of affective and rational processes. My sampling of the research suggests that comparatively little work has been done on empathy and ToM in LLM moral reasoning, and that even less has been done on affective empathy and compassion. Further research that attempts to apply findings on human compassion and empathy to AIs may thus be an opportunity to further the causes of AI Safety and alignment.
While this work covers one set of dispositions which contribute to beneficence and moral development, it has certainly not covered all of them. In particular, I was not able to examine processes underlying cognitive flexibility and immunity to ideology, nor was I able to review more concrete/technical methods for representing value change/flexibility in AIs (e.g., Klingefjord et al., 2024). I hope future work will be able to expand not only on the ideas which I’ve attempted to coherently relay here but also upon other important and potentially underappreciated aspects of morality, both natural and artificial.
Sources
Ashar, Y. K., Andrews-Hanna, J. R., Dimidjian, S., & Wager, T. D. (2016). Toward a neuroscience of compassion. Positive Neuroscience, 125–142. https://bpb-us-e1.wpmucdn.com/sites.dartmouth.edu/dist/2/2150/files/2019/12/2016_Ashar_Positive-Neuroscience-Handbook.pdf
Blair, R. J. R. (2005). Applying a cognitive neuroscience perspective to the disorder of psychopathy. Development and Psychopathology, 17(3), 865–891. https://www.psychiatry.wisc.edu/courses/Nitschke/seminar/blair%20devel%20&%20psych%202005.pdf
Bzdok, D., Schilbach, L., Vogeley, K., Schneider, K., Laird, A. R., Langner, R., & Eickhoff, S. B. (2012). Parsing the neural correlates of moral cognition: ALE meta-analysis on morality, theory of mind, and empathy. Brain Structure and Function, 217, 783–796. http://www.brainmap.org/pubs/BzdokBSF12.pdf
Caviola, L., Schubert, S., & Greene, J. D. (2021). The psychology of (in)effective altruism. Trends in Cognitive Sciences, 25(7), 596–607. https://www.cell.com/trends/cognitive-sciences/pdf/S1364-6613(21)00090-5.pdf
Crimston, D., Bain, P. G., Hornsey, M. J., & Bastian, B. (2016). Moral expansiveness: Examining variability in the extension of the moral world. Journal of Personality and Social Psychology, 111(4), 636–653.
Decety, J., & Jackson, P. L. (2004). The functional architecture of human empathy. Behavioral and Cognitive Neuroscience Reviews, 3(2), 71–100. https://psikohelp.com/wp-content/uploads/2021/07/Decety_2004_BehavCognNeurosciRev-Empathy.pdf
Glad, B. (2002). Why tyrants go too far: Malignant narcissism and absolute power. Political Psychology, 23(1), 1–37. https://lust-for-life.org/Lust-For-Life/_Textual/BettyGlad_WhyTyrantsGoTooFarMalignantNarcissismAndAbsolutePower_2002_38pp/BettyGlad_WhyTyrantsGoTooFarMalignantNarcissismAndAbsolutePower_2002_38pp.pdf
Hamilton, R. K., Hiatt Racer, K., & Newman, J. P. (2015). Impaired integration in psychopathy: A unified theory of psychopathic dysfunction. Psychological Review, 122(4), 770. https://www.researchgate.net/profile/Rachel-Hamilton/publication/283328614_Impaired_Integration_in_Psychopathy_Bridging_Affective_and_Cognitive_Models/links/5633cc6608aeb786b7013b8d/Impaired-Integration-in-Psychopathy-Bridging-Affective-and-Cognitive-Models.pdf
Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., & Steinhardt, J. (2020). Aligning AI with shared human values. arXiv preprint arXiv:2008.02275. https://arxiv.org/pdf/2008.02275
Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., … & Gao, W. (2023). AI alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852. https://arxiv.org/pdf/2310.19852
Jin, Z., Levine, S., Gonzalez Adauto, F., Kamal, O., Sap, M., Sachan, M., … & Schölkopf, B. (2022). When to make exceptions: Exploring language models as accounts of human moral judgment. Advances in neural information processing systems, 35, 28458–28473. https://proceedings.neurips.cc/paper_files/paper/2022/file/b654d6150630a5ba5df7a55621390daf-Paper-Conference.pdf
Kiehl, K. A., & Hoffman, M. B. (2011). The criminal psychopath: History, neuroscience, treatment, and economics. Jurimetrics, 51, 355–397. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4059069/
Klingefjord, O., Lowe, R., & Edelman, J. (2024). What are human values, and how do we align AI to them?. arXiv preprint arXiv:2404.10636. https://arxiv.org/pdf/2404.10636
Liang, H., Sun, L., Wei, J., Huang, X., Sun, L., Yu, B., … & Zhang, W. (2024). Synth-Empathy: Towards High-Quality Synthetic Empathy Data. arXiv preprint arXiv:2407.21669. https://arxiv.org/pdf/2407.21669
Nai, A., & Toros, E. (2020). The peculiar personality of strongmen: Comparing the Big Five and Dark Triad traits of autocrats and non-autocrats. Political Research Exchange, 2(1), 1707697. https://files.osf.io/v1/resources/mhpfg/providers/osfstorage/5e5aa598b070fc01cde58afc?action=download&direct&version=1
Rashkin, H., Smith, E. M., Li, M., & Boureau, Y.-L. (2018). Towards empathetic open-domain conversation models: A new benchmark and dataset. arXiv preprint arXiv:1811.00207. https://aclanthology.org/P19-1534.pdf
Singer, P. (1981). The expanding circle. Oxford: Clarendon Press.
Stevens, F., & Taber, K. (2021). The neuroscience of empathy and compassion in pro-social behavior. Neuropsychologia, 159, Article 107925. http://change-et-sois.org/wp-content/uploads/2023/01/The-neuroscience-of-empathy-and-compassion-in-pro-social-behavior-Stevens-F-Taber-K-2021.pdf
Strachan, J. W., Albergo, D., Borghini, G., Pansardi, O., Scaliti, E., Gupta, S., … & Becchio, C. (2024). Testing theory of mind in large language models and humans. Nature Human Behaviour, 1–11.
Strauss, C., Taylor, B. L., Gu, J., Kuyken, W., & Baer, R. (2016). What is compassion and how can we measure it? A review of definitions and measures. Clinical Psychology Review, 47, 15–27. https://ou.edu/content/dam/flourish/docs/Article_Assessing%20Compassion.pdf
Ullman, T. (2023). Large language models fail on trivial alterations to theory-of-mind tasks. arXiv preprint arXiv:2302.08399. https://arxiv.org/pdf/2302.08399
Vekaria, K. M., Brethel-Haurwitz, K. M., Cardinale, E. M., Stoycos, S. A., & Marsh, A. A. (2017). Social discounting and distance perceptions in costly altruism. Nature Human Behaviour, 1(5), 0100. https://aamarsh.wordpress.com/wp-content/uploads/2020/03/vekaria-et-al-2017.pdf
Welivita, A., & Pu, P. (2024). Are Large Language Models More Empathetic than Humans?. arXiv preprint arXiv:2406.05063. https://arxiv.org/pdf/2406.05063
Yang, Z., Ren, Z., Yufeng, W., Peng, S., Sun, H., Zhu, X., & Liao, X. (2024). Enhancing Empathetic Response Generation by Augmenting LLMs with Small-scale Empathetic Models. arXiv preprint arXiv:2402.11801. https://arxiv.org/pdf/2402.11801