A Research Agenda for Psychology and AI
I think very few people have thought rigorously about how psychology research could inform the trajectory of AI or humanity’s response to it. Despite this, there seem to be many important contributions psychology could make to AI safety. For instance, a few broad paths-to-impact that psychology research might have are:
Helping people anticipate the societal response to possible developments in AI. In which areas is public opinion likely to be the bottleneck to greater AI safety?
Improving forecasting/prediction techniques more broadly and applying this to forecasts of AI trajectories (the Forecasting Research Institute’s Existential Risk Persuasion Tournament is a good example).
Describing human values and traits more rigorously to inform AI alignment, or to inform decisions about who to put behind the wheel in an AI takeoff scenario.
Doing cognitive/behavioral science on AI models. For instance, developing diagnostic tools that can be used to assess how susceptible an AI decision-maker is to various biases.
Modeling various risks related to institutional stability. For instance, arms races, risks posed by malevolent actors, various parties’ incentives in AI development, and decision-making within/across top AI companies.
I spent several weeks thinking about specific project ideas in these topic areas as part of my final project for BlueDot Impact’s AI Safety Fundamentals course. I’m sharing my ideas here because a) there are probably large topic areas I’m missing, and I’d like for people to point them out to me, b) I’m starting my PhD in a few months, and I want to do some of these ideas, but I haven’t thought much about which of them are more/less valuable, and c) I would love for anyone else to adopt any of the ideas here or reach out to me about collaborating! I also hope to write a future version of this post that incorporates more existing research (I haven’t thoroughly checked which of these project ideas have already been done).
Maybe another way this post could be valuable is that I’ve consolidated a lot of different resources, ideas, and links to other agendas in one place. I’d especially like for people to send me more things in this category, so that I or others can use this post as a resource for connecting people in psychology to ideas and opportunities in AI safety.
In the rest of this post, I list various topic areas I identified in no particular order, as well as any particular project ideas I had which struck me as potentially especially neglected & valuable in that area. Any feedback is appreciated!
Topic Areas & Project Ideas
1. Human-AI Interaction
1.1 AI Persuasiveness
A lot of people believe that future AI systems might be extremely persuasive, and perhaps we should prepare for a world where interacting with AI models carries a risk of manipulation/brainwashing. How realistic is this concern? (Note, although I don’t think this sort of research is capable of answering whether AI will ever be extremely persuasive, I think it could still be very usefully informative.) For instance:
How good are people at detecting AI-generated misinformation in current models, or inferring ulterior motives in current AI advisors? How has this ability changed in line with compute trends?
Are people better or worse at detecting lies in current AI models, compared to humans? How has this ability changed in line with compute trends?
How does increasing quantity of misinformation affect people’s susceptibility to misinformation? Which effect dominates between “humans get more skeptical of information as less of it is true” and “humans believe more false information as more false information is available”?
In which domains is AI most likely to be persuasive? For instance, empirical vs. moral persuasion.
In which directions is AI most likely to be persuasive? (e.g. how much more persuasive is AI when arguing for true things than false things, if at all?)
Does the persuasiveness of particular kinds of outputs from AI models seem likely to asymptote off? It seems plausible to me that there are hard limits to persuasiveness in e.g. single paragraph outputs.
If AI models are already highly persuasive, can they be used as scalable tools for improving people’s reasoning or moral judgment, or increasing people’s concern about AI safety?
There’s a rapidly growing corner of psychology interested in using AI for belief change. Links to some existing work related to AI persuasiveness:
1.2 Human-AI Capability Differences
On which kinds of tasks, if any, do humans have the best shot at outperforming highly advanced AI systems?
How likely are these tasks to be the bottleneck in AI advancement or abilities? (In other words, is it feasible that humans might be able to use our unique capabilities or resources as levers of control in some takeoff or post-AGI scenarios?)
See also: Moravec’s paradox
1.3 Trust in AI
How much are people willing to defer to AI advisors:
In empirical belief formation?
In value judgments / moral decisions?
How much should we expect risk compensation to affect individuals’ and institutions’ willingness to use advanced AI?
How malleable are people’s empirical beliefs about AI risk:
In general?
In regards to specific capabilities, like:
Developments in AI weapons?
The ability to convince people AI is sentient?
The ability to persuade people to take extreme actions?
The ability to cause human extinction?
2. Forecasting AI Development
2.1 Foundational Work on Forecasting
What biases exist in probability forecasting?
How does uncertainty affect probability estimates?
Which crowd-wisdom aggregation techniques most accurately synthesize multiple forecasts to produce a single probability estimate? (A minimal sketch of a few common aggregation rules appears after this list.) Some existing work on this topic:
How effective are various “debiasing” techniques at accounting for various biases in forecasting? Some existing work on this topic:
How can people use new or existing statistical tools and AI models to improve their forecasting ability?
Meta-forecasting: what’s the relationship between cognitive effort given to making predictions and predictive accuracy?
Are there cases where commonsense intuitions do better than Fermi estimates or other “more rigorous” forecasts?
When do people hit diminishing marginal returns in terms of the effort-accuracy tradeoff in forecasting?
How do people (perhaps domain experts vs. non-experts) interpret probability forecasts? I think there are many promising ideas one could research here:
How do people interpret verbal versus numerical probability estimates? (e.g. “very likely” vs. “90%”)
Do people generally defer to expert forecasts too much or too little, when given the relevant information, and how does this interact with their pre-existing knowledge of a topic?
(How) is uncertainty expressed in probability point estimates and/or natural language?
Do people (appropriately) interpret estimates of 0% and 100% (or other extreme estimates) as expressing more confidence?
Empirical work on large number skepticism (is it true that people just can’t reason correctly about massive values?)
Which forecasters/forecasts are seen as more credible by the general public? Some work on this topic:
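As a concrete illustration of the aggregation question above, here is a minimal sketch (in Python) of a few rules commonly discussed in the forecast-aggregation literature: the arithmetic mean, the median, the geometric mean of odds, and an extremized mean. The individual forecasts and the extremization parameter below are made-up assumptions for illustration, not recommendations.

```python
# Minimal sketch: a few common rules for combining individual probability
# forecasts into a single estimate. Forecast values are made up.
import numpy as np

def mean_prob(p):
    """Arithmetic mean of the probabilities."""
    return np.mean(p)

def median_prob(p):
    """Median forecast; robust to a few extreme forecasters."""
    return np.median(p)

def geo_mean_odds(p):
    """Geometric mean of odds, converted back to a probability."""
    odds = np.array(p) / (1 - np.array(p))
    agg_odds = np.exp(np.mean(np.log(odds)))
    return agg_odds / (1 + agg_odds)

def extremized_mean(p, alpha=2.5):
    """Push the mean away from 0.5 by raising its odds to the power alpha.
    alpha > 1 'extremizes'; the value 2.5 is an arbitrary illustrative choice."""
    m = np.mean(p)
    odds = (m / (1 - m)) ** alpha
    return odds / (1 + odds)

forecasts = [0.10, 0.25, 0.30, 0.40, 0.70]  # five hypothetical forecasters
for rule in (mean_prob, median_prob, geo_mean_odds, extremized_mean):
    print(f"{rule.__name__}: {rule(forecasts):.3f}")
```

The empirical question is which of these rules best tracks reality for the long-range, low-base-rate questions that matter for AI forecasting, which can only be tested against resolved forecasting questions.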
2.2 Applying Forecasting Research to AI Safety
How good are experts and non-experts at predicting the trajectory and/or speed of technological development?
Is there research on whether experts were better at this than non-experts for various historical advances in technology?
Which groups of people are the best at predicting the trajectory and/or speed of technological development?
What can be done, if anything, to make them better?
Superforecasters vs. AI experts (XPT tournament)
Why are people often hesitant to provide explicit probability estimates for AI risk (along with other similar high-magnitude risks like the odds of nuclear war)?
Someone could also do cross-cultural forecasting research projects with the aim of obtaining better forecasts of trends in international politics related to AI. Perhaps additional forecasting tournament-style interventions would be interesting. A few questions I think are good candidates for this sort of thing:
Suppose that a particular country “wins” the AGI development race. What impacts is this likely to have? Also, some useful empirical data to collect on this might be:
How concerned/optimistic are people in various countries about AI?
How has the trajectory of public opinion about AI differed in various countries, and in response to what factors?
Suppose that, in an AI takeoff scenario, certain countries wind up with much more influence than others by virtue of their natural resources or political power. How likely are various countries to wind up in this position, and how should this inform AI governance strategy and/or international relations related to AI?
3. Understanding Human Values
I think more research on human values could be useful for two reasons: (1) To the extent one wants to align an advanced AI’s values to commonsense morality, it’s good to understand what commonsense morality actually is. But I think it’s also pretty obvious one wouldn’t want to align an extremely advanced AI system to commonsense morality, so the bigger reason is that (2) whether it’s a good idea or not, people probably will align advanced AI systems to commonsense morality (even unknowingly), and so understanding human values and their failure modes better might be really useful for AI strategy more broadly. Some projects here fall into both categories.
3.1 Foundational Work on Human Values
To the extent we want AI to promote well-being, any work that aims to better understand what well-being is and what factors best predict well-being would be extremely useful. Some example questions:
What are the best proxies or measurements one can use to assess well-being? Can we do better than self-report measures (or design self-report measures that are better than life satisfaction reports)?
What biases influence how people anticipate and reflect on their affective states?
Neuroscience on the neural bases of well-being seems very valuable, though probably pretty intractable. But maybe this sort of research becomes a lot easier after future AI developments.
Project idea: Extend theories of resource-rational cognition to welfare states. Is it true that we should expect most minds to be roughly hedonically neutral by default, because emotion takes resources, and we should only expect positive/negative hedonic states where they serve some specific purpose? How can we test this hypothesis?
Value spread:
Should one expect liberal values like equality, free speech, and rule of law to win out in the marketplace of ideas, by default?
If so, what practical interventions (e.g. internet access, exposure to Western media) might effectively speed up the cultural transmission of these values?
How pluralistic/multicultural are most people’s values (in principle/in practice)?
Does this differ from how pluralistic/multicultural people believe their values to be?
Which of the 4 resolutions here best describes most people’s response to moral uncertainty?
Some nerds could create a computational model of people’s response to moral uncertainty (a toy sketch of two standard decision rules appears after this list).
How fanatical are most people?
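To make the computational-modeling idea above concrete, here is a toy Python sketch of two decision rules that appear in the moral uncertainty literature, “My Favorite Theory” and “Maximize Expected Choiceworthiness.” The credences and choiceworthiness numbers are invented for illustration (and putting theories on a common scale is itself a contested modeling assumption); an empirical project would instead fit rules like these to participants’ actual choices and see which fits best.

```python
# Toy sketch: two decision rules under moral uncertainty, applied to a
# made-up choice. All numbers are illustrative assumptions.

# The agent's credences in each moral theory
credences = {"utilitarian": 0.5, "deontological": 0.3, "virtue": 0.2}

# Choiceworthiness of each option under each theory (assumed common scale)
choiceworthiness = {
    "utilitarian":   {"option_A": 10, "option_B": 4},
    "deontological": {"option_A": -5, "option_B": 6},
    "virtue":        {"option_A": 2,  "option_B": 3},
}

def my_favorite_theory(credences, cw):
    """Act on whichever theory the agent has the most credence in."""
    favorite = max(credences, key=credences.get)
    return max(cw[favorite], key=cw[favorite].get)

def maximize_expected_choiceworthiness(credences, cw):
    """Weight each theory's verdict by the credence assigned to it."""
    options = next(iter(cw.values())).keys()
    ev = {o: sum(credences[t] * cw[t][o] for t in credences) for o in options}
    return max(ev, key=ev.get)

# With these numbers the two rules disagree:
print(my_favorite_theory(credences, choiceworthiness))                  # option_A
print(maximize_expected_choiceworthiness(credences, choiceworthiness))  # option_B
```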
Of course, a lot of psychology work on human values already exists. I think these are a few really good examples which might inspire further projects/questions:
3.2 Applying Human Values Research to AI Safety
What values does it seem likely people will want AIs to maximize?
What unintended consequences are likely to arise if AI systems are simply aligned to various commonsense conceptions of morality?
One might answer this question in part by finding empirical examples of cases in which people’s moral decisions yield results that the researcher believes to be problematic, but also by finding empirical examples of cases in which people’s moral decisions are internally inconsistent or have implications/consequences the moral decision-makers themselves don’t endorse. I think the second approach might be more interesting and persuasive.
What unintended consequences are likely to arise if AI systems are aligned to philosophers’/AI researchers’/politicians’/other relevant groups’ conceptions of morality?
How true is the orthogonality thesis for human minds? Should we expect the same results in AI minds?
4. Institutional Stability
Papers on institutional stability related to AI safety:
[please suggest]
Also, I recommend this list of interventions for improving institutional decision-making by @taoburga if you’re interested in this topic; many items on his list could easily be explored from a psychology angle.
4.1 Arms Race Dynamics
Game theory work on AI arms races:
Model what happens in the (plausible) scenario where all top AI companies care about AI safety and want AGI to go well, and know that arms races are bad for ensuring safe AI, but also each have a lot of confidence that their own alignment strategy is better than their competitors’ (i.e., conceptualize “altruistic races”). A toy sketch of such a model follows this list.
To the extent companies aren’t motivated by safety and are instead motivated by profit, how does this affect what happens in the above scenario?
Likely this would be a joint psychology-economics project.
Conduct empirical work on the security dilemma: Do people interpret building a stronger defense as an offensive action? Conversely, do people interpret building offensive weapons as a defensive act because of mutually assured destruction or other power dynamics?
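As a starting point for the “altruistic race” modeling idea above, here is a toy Python sketch under assumptions I chose purely for illustration: each lab picks a development speed, faster development linearly reduces the safety of the resulting system, the faster lab is more likely to “win,” and each lab cares only about the expected safety of the outcome but believes its rival’s system would be less safe than its own. None of these functional forms come from the literature; the point is just that overconfidence alone can generate a race between purely safety-motivated actors in this setup.

```python
# Toy sketch of an "altruistic race": two labs care only about the expected
# safety of whichever system wins, but each thinks its own alignment approach
# is better than its rival's. All functional forms and parameters are
# illustrative assumptions.
import numpy as np

SPEEDS = np.linspace(0.01, 1.0, 100)  # candidate development speeds
CONFIDENCE = 0.6  # each lab thinks the rival's system is only 60% as safe as its own

def safety(speed):
    """Assumed tradeoff: racing faster yields a less safe system."""
    return 1.0 - speed

def believed_payoff(own_speed, rival_speed):
    """Expected safety of the outcome, from one lab's (biased) point of view."""
    p_win = own_speed / (own_speed + rival_speed)
    return p_win * safety(own_speed) + (1 - p_win) * CONFIDENCE * safety(rival_speed)

def best_response(rival_speed):
    payoffs = [believed_payoff(s, rival_speed) for s in SPEEDS]
    return SPEEDS[int(np.argmax(payoffs))]

# Iterate best responses until the two (symmetric) labs stop adjusting.
s1 = s2 = SPEEDS[0]
for _ in range(200):
    s1, s2 = best_response(s2), best_response(s1)

print(f"Equilibrium speeds: {s1:.2f} and {s2:.2f}; resulting safety: {safety(s1):.2f}")
```

In this toy model, setting CONFIDENCE = 1.0 (no overconfidence) pushes both labs toward the slowest available speed, while CONFIDENCE = 0.6 produces an equilibrium in which both labs race and safety drops. A real project would add profit motives (per the question above), more than two labs, and empirically measured beliefs.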
4.2 Malevolence and AI
I’m not sure it’s that useful (for AI safety specifically) for psychologists to focus on reducing human malevolence in general (i.e., designing interventions that could reduce average levels of malevolence in the general public). Conditional on malevolent actors posing a large risk in the future as a result of AI, it will probably only take a small number of bad actors for immense harm to be realized, in which case even very substantial reductions in average levels of malevolence (or improvements in our ability to detect and prevent it) wouldn’t substantially reduce these risks. However, I think it could still be valuable to try to estimate more rigorously the risk of AI misuse by select governments, or to try to forecast risks from bad actors (to better model AI risk). Some ideas:
How large is the risk posed by malevolent leaders in future worlds with advanced AI?
How much do people like or dislike malevolent traits in their political leaders?
How common are traits like psychopathy and nihilism? (Perhaps also in political leaders specifically?) - Rob Jenkins has done some nice work on the prevalence of omnicidal tendencies.
What are the underpinning motivations, beliefs, and mechanisms of malevolence (e.g., psychopathic, Machiavellian, narcissistic, and sadistic traits) and their associated behavioral patterns?
How can we best measure malevolence?
How good are people at detecting malevolence in other people? In their leaders?
Which diagnostic/behavioral criteria are most predictive of malevolent traits?
4.3 Identifying Ethical Decision-Makers
In many possible scenarios, humans will be faced with designating people to make decisions about AI systems’ constitutions, AI governance, preparations for transformative AI, which human values to maximize, etc. In those scenarios, it could be really useful to have a clear understanding of which people are or aren’t epistemically virtuous and/or altruistically motivated. Some of these ideas are straight plagiarized from Psychology for Effectively Improving the Future, a research agenda created by some of my colleagues a few years ago.
How can we best identify people (e.g., policymakers) who will safeguard the future wisely?
Can we develop assessment tools (e.g. standardized measures) to identify such people, e.g., by measuring which psychological factors are most predictive of outcome measures we care about?
If such measures are developed, how can one prevent/detect cheating on such tests?
Could indicators of past rational or moral behavior be used as predictors of future rational or moral behavior?
How good are people at identifying benevolent traits in others?
Could nominators who report on other people’s moral character, attitudes and tendencies help to identify people interested in effectively improving the future?
How might rationality skills and thinking attitudes be measured?
Is rationality different from IQ?
If so, can we develop a test that effectively distinguishes rationality and IQ? (See also: Stanovich’s CART assessment)
Can we develop IQ/rationality measures that are sufficiently sensitive to be able to differentiate between highly rational individuals?
Can we measure the effects of higher rationality in real-world contexts? Does greater rationality translate into improved judgments and decisions, or even into more successful outcomes in the personal, business, and altruistic domains?
4.4 Miscellaneous
Model ways that advanced AI systems might engage in conflict with each other.
Existing CLR research agenda on this topic.
How (both in theory and practically) can we build advanced AI systems that cooperate with each other? (And that don’t cooperate with each other when appropriate?)
How might humans and AIs interact with each other in a conflict scenario?
Example paper exploring this question.
How much should we expect the unilateralist’s curse to exacerbate the risks posed by unilateral decisions related to AI, such as data leaks, rogue model development, etc.? (A simulation sketch appears at the end of this section.)
I have a paper on this if you’re interested.
Foundational work on institutional/organizational stability:
What factors best predict social or professional cohesion within institutions, and (to the extent that we don’t want AI orgs to split into two separate orgs or experience rapid leadership changes) how can we promote cohesion in these institutions?
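For the unilateralist’s curse question above, here is a minimal simulation sketch of the basic mechanism, with made-up numbers: each actor receives an unbiased but noisy estimate of the value of a unilateral action (say, leaking model weights) and acts if their own estimate looks positive, so the probability that a genuinely harmful action gets taken grows quickly with the number of actors who could take it.

```python
# Minimal sketch of the unilateralist's curse. Each of N actors gets an
# unbiased, noisy estimate of the (actually negative) value of a unilateral
# action and acts if their estimate is positive; the action occurs if anyone
# acts. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
TRUE_VALUE = -1.0      # the action is in fact mildly harmful
NOISE_SD = 1.0         # spread of individual judgments
N_SIMULATIONS = 100_000

for n_actors in (1, 5, 10, 25):
    estimates = rng.normal(TRUE_VALUE, NOISE_SD, size=(N_SIMULATIONS, n_actors))
    action_taken = (estimates > 0).any(axis=1)
    print(f"{n_actors:>2} actors: harmful action taken in "
          f"{action_taken.mean():.1%} of simulations")
```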
5. Anticipating the Societal Response to AI Developments
5.1 AI Sentience
How likely are people to believe that AI systems are conscious/sentient/agentic in the future?
What are people’s current intuitions about AI sentience and how malleable are these beliefs?
How easily can people be persuaded that an AI system is sentient via conversations with LLMs?
One could also do Milgram-style experiments to test how willing people are to harm AIs.
How likely is it that social division over AI sentience could lead to large-scale societal conflict?
Are there likely to be international divisions in endorsement of AI rights?
Assuming that people care about AI rights, should we expect them to care primarily about the front-facing AI models (i.e., models working in the background wouldn’t get any protections)?
Assuming that most AI systems will in fact have positive welfare, would people still morally oppose using “happy servant” type AIs as personal servants or for other tasks, even if they knew the AI systems were sentient?
Assuming that consumers could pay more for an AI model that suffered less, how much would they be willing to pay?
5.2 Modeling AI takeoff scenarios
One way to think about AI takeoff scenarios is as “all of the bottlenecks to progress substantially widening”—what are the relevant bottlenecks, and are any of these dependent on human psychology?
Developing better models of people’s responses to takeoff scenarios:
How likely is it that people would realize AI is taking off in the first place, in such a scenario? (Perhaps someone could do a literature review on human sensitivity to large-scale societal change.)
LLMs might be a good case study (people seemed to adopt the fact of LLMs existing extremely fast).
5.3 Risk Aversion and AI
Very simply: how risk averse are the people leading AI development?
Project idea: Maybe people underestimate the value of risk-taking as they start to lose. For instance, when people are playing board games, they should become extremely risk-seeking if they start to lose, but people rarely do this; instead, they often get Goodharted by heuristics like “maximize expected points per turn.” (A toy simulation of this intuition appears after this list.)
Find a natural game environment to test this in (or build one)
The path to impact from this work would be making the case that, if things start getting bad, we should become more risk-seeking than intuition suggests as our odds of averting some catastrophe start to dwindle (I think I heard this perspective from Zvi somewhere, but I might be misremembering).
^ In contrast, there’s also an idea that people might become unnecessarily risk-seeking in competitive environments even when it doesn’t benefit them, perhaps because they use the above heuristic too much. Either finding would be important.
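Here is a toy simulation of the board-game intuition above, with invented payoff distributions: on the final turn, a trailing player compares a “safe” move that maximizes expected points against a “risky” move with a lower mean but higher variance. Once the deficit is large enough, the risky move wins more often even though it scores fewer points on average.

```python
# Toy sketch: when should a trailing player prefer a lower-EV, higher-variance
# move? Payoff distributions are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

safe_move  = rng.normal(loc=10, scale=2,  size=N)   # EV 10, low variance
risky_move = rng.normal(loc=7,  scale=10, size=N)   # EV 7, high variance

def win_rate(points_gained, deficit):
    """Fraction of simulated turns in which the trailing player overtakes the leader."""
    return (points_gained > deficit).mean()

for deficit in (5, 12, 20):
    print(f"deficit {deficit:>2}: "
          f"safe win rate {win_rate(safe_move, deficit):.1%}, "
          f"risky win rate {win_rate(risky_move, deficit):.1%}")
```

The empirical project would then test whether people actually switch to the risky move at roughly the right deficit, or keep maximizing expected points per turn.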
5.4 Post-TAI Wellbeing
Inspired by this Daron Acemoglu work—how likely is it that AI developments could cause most people’s lives to lose their meaning?
Caveat that I don’t find this very plausible/valuable (just doesn’t seem like nearly the biggest risk these systems might pose). But maybe on certain worldviews it’s really high impact to research this, idk!
6. Cognitive Science of AI
I imagine there are many promising psychology projects that would apply methods in cognitive/computational psychology to AI models. This could involve doing things like exploring the presence of dual-process cognition in AI models or computationally modeling the way that AI brains make different kinds of trade-offs. I’m not very familiar with research in these areas, though, so I would love to hear more thoughts about how projects like this would look.
6.1 Miscellaneous
How accurately does shard theory describe human psychology/neuroscience? How likely is it that shard theory similarly describes AIs’ “psychologies”?
What are the neural bases of consciousness?
Strikes me as intractable, but maybe other people find this promising.
What is the nature of intelligence? What exactly do people mean when they talk about intelligence? What components are necessary for general intelligence?
In general, searching for similarities between human cognition and AI cognition could probably inform interpretability research.
To what extent are cognitive biases transmitted to AI models via processes like RLHF?
How likely is it that AI models will have positive/negative utility, assuming they are sentient? On one view, they might have quite high utility by default, because even in scenarios where they’re doing labor that humans would view as unpleasant, their reward functions would have been shaped specifically for those tasks.
This question might also appeal to economists and/or evolutionary psychologists/biologists.
See also: LessWrong large language model psychology research agenda.
7. Other Topics
7.1 Meta-Psychology and AI
Can we use AI to reliably simulate human data, to increase the speed of behavioral/cognitive science research? (Perhaps this technique could also be used to quickly rule out many of the above hypotheses, and to identify which are the most promising/unknown.) I know there’s already some work on this but couldn’t find it in two google searches so I moved on.
Can we use AI to improve/test the replicability of social science research?
Spencer Greenberg is doing some cool thinking about this stuff.
Undertake community-building projects for psychology x AI.
First, think harder about whether it would be good or bad to move people from technical AI research to psychology x AI research.
7.2 Politics
I think there are also likely to be many questions in political psychology that could be important for AI safety. For instance:
Understanding more broadly which traits people value in their leaders, and identifying whether there are systematic differences between those leaders and the broader population that might lead them to influence the trajectory/values of AI in ways misaligned with human values at large.
Modeling to what extent AI risk is heightened/reduced by competing U.S. presidential candidates (e.g., for reasons relating to international cooperation/trade), and forecasting those candidates’ odds of winning.
If the difference is large, then maybe it’s worth focusing AI safety efforts on influencing certain elections.
Project idea: Case studies of important people’s particular psychology (e.g. trying to get super familiar with Sam Altman’s judgment and decision-making practices to better anticipate his decisions).
7.3 Donation behavior
It might be useful to understand how to persuade people to give more to causes that would advance AI safety (or to move them away from things like “impact investing in AI companies that I see as societally beneficial”). There’s already a broad marketing literature on people’s charitable donation behavior but I don’t think anyone has connected this to AI or longtermist charities specifically.
Note that I think this project is probably only useful to the extent that funding is the bottleneck for AI safety research.
7.4 Miscellaneous
Project idea: Test whether people are insufficiently willing to pay for information (could use designs like the one in Amanda Askell’s moral value of information talk.)
Someone could do some projects on motivation-based selection effects in AI: a person will only write an article on something if they feel strongly about it, so articles tend to express polarized viewpoints. Similarly, one will only build TAI if they feel strongly that it is going to be good, so TAI is likely to be built by overly optimistic people who systematically underestimate its danger. This is similar to the optimizer’s curse; it could inform science more broadly and also relates specifically to AI development. (A quick simulation sketch follows below.)
Maybe I will do this.
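Here is a quick simulation sketch of the selection effect described above, with illustrative numbers: many potential developers form unbiased but noisy estimates of how good building TAI would be, only the most optimistic one builds it, and so the builder’s estimate is predictably too high even though no individual estimate is biased.

```python
# Quick sketch of motivation-based selection: the most optimistic of N
# potential developers builds TAI, so the builder's (individually unbiased)
# estimate systematically overshoots the true value. Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
TRUE_VALUE = 0.0     # building TAI is exactly neutral in this toy world
NOISE_SD = 1.0
N_SIMULATIONS = 50_000

for n_developers in (1, 5, 20, 100):
    estimates = rng.normal(TRUE_VALUE, NOISE_SD, size=(N_SIMULATIONS, n_developers))
    builders_estimate = estimates.max(axis=1)  # the most optimistic actor builds
    print(f"{n_developers:>3} potential developers: builder overestimates by "
          f"{builders_estimate.mean():+.2f} SDs on average")
```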
Additional Resources for People Interested in Psychology and AI:
Please recommend more!
Forum Posts
Some excellent skeptical takes about this stuff by @Vael Gates: https://forum.effectivealtruism.org/posts/WHDb9r9yMFetG7oz5/
List of AI researchers with degrees in psychology by @Sam Ellis: https://forum.effectivealtruism.org/posts/fSDxnLcCn8h22gCYB/ea-psychology-and-ai-safety-research
Another post with some psychology and AI ideas by @Kaj_Sotala: https://forum.effectivealtruism.org/posts/WdMnmmqqiP5zCtSfv/cognitive-science-psychology-as-a-neglected-approach-to-ai
@Geoffrey Miller is running a course on AI and Psychology: https://forum.effectivealtruism.org/posts/rtGfSuaydeengBQpQ/seeking-suggested-readings-and-videos-for-a-new-course-on-ai
Fellowships/Programs/Labs
PIBBS fellowship: https://pibbss.ai/resources/
Effective thesis: https://effectivethesis.org/psychology-and-cognitive-science/
Institute working on modeling human moral judgment/psychology for AI safety: https://mosaic.allenai.org/
Research labs focused (at least somewhat) on psychology and AI:
https://sites.google.com/site/falklieder/rel?authuser=0
Does some work on people’s beliefs about AI consciousness: https://metacoglab.org/research
(As of 6/27/24) Postdoc opportunity in a lab that will focus on AI (belief change) and human decision-making: https://x.com/DG_Rand/status/1805636901971415129
My understanding is that the Happier Lives Institute isn’t very interested in AI applications, but is interested in foundational work on well-being: https://www.happierlivesinstitute.org/
GPI also has a psychology focus now (and they have a summer fellowship): https://globalprioritiesinstitute.org/research-agenda/
Acknowledgments
Thanks to @Lucius Caviola and @Joshua_Lewis for ideas and discussions on many of these topics—a solid third of these thoughts come from one or both of them. Thanks also to @Chris Leong and others at BlueDot for helpful suggestions and comments!