Cooperative AI: Three things that confused me as a beginner (and my current understanding)

I started working in cooperative AI almost a year ago. Because it is an emerging field with very little introductory material aimed at beginners, I found it quite confusing at times. My hope with this post is that by summing up my own confusions, and how I understand them now, I can speed up the process for others who want to get a grasp on what cooperative AI is.

I work at the Cooperative AI Foundation (CAIF), and there will be a lot more polished and official material coming from there, so this is just a quick personal write-up to get something out in the meantime. We’re working on a cooperative AI curriculum that should be published within the next couple of months, and we’re also organising a summer school in June for people new to the area (application deadline April 26th).

Contradicting definitions

When I started to learn about cooperative AI I came across a lot of different definitions of the concept. While drafting this post I dug up my old interview preparation doc for my current job, where I had listed different descriptions of cooperative AI that I had found while reading up:

  • “the objective of this research would be to study the many aspects of the problems of cooperation and to innovate in AI to contribute to solving these problems”

  • “AI research trying to help individuals, humans and machines, to find ways to improve their joint welfare”

  • “AI research which can help contribute to solving problems of cooperation”

  • “building machine agents with the capabilities needed for cooperation”

  • “building tools to foster cooperation in populations of (machine and/or human) agents”

  • “conducting AI research for insight relevant to problems of cooperation”

To me this did not paint a very clear picture and I was pretty frustrated to be unable to find a concise answer to the most basic question: What is cooperative AI and what is it not?

At this point I still don’t have a clear, final definition, but I am less frustrated by it, because I no longer think this is just a failure of understanding or of communication. The field is simply so new that there is no single definition that the people working in it agree on, and where the boundaries should be drawn is still under discussion.

That said, my current favourite explanation is this: while AI alignment deals with the question of how to make one powerful AI system behave in a way that is aligned with (good) human values, cooperative AI is about making things go well with powerful AI systems in a messy world where there might be many different AI systems, lots of different humans and human groups, and different sets of (sometimes contradictory) values.

Another recurring framing is that cooperative AI is about improving the cooperative intelligence of advanced AI, which leads to the question of what cooperative intelligence is. Here too there are many versions in circulation, but the following is the one I have found most useful so far:

Cooperative intelligence is an agent’s ability to achieve their goals in ways that also promote social welfare, in a wide range of environments and with a wide range of other agents.

Is this really different from alignment?

The second major issue I had was figuring out how cooperative AI really differs from AI alignment. The description of “cooperative intelligence” seemed like it could be understood as just a particular framing of alignment: “achieve the goals in a way that is also good for everyone”.

As I have learned more about cooperative AI, it has come to seem to me that the term “cooperative intelligence” is best understood in the context of social dilemmas (collective action problems). The most famous model is the prisoner’s dilemma, in which the two agents would collectively do best by cooperating, but the individually rational decision for each is to defect, leading to a collectively worse outcome (e.g. arms race dynamics). Another famous model of a social dilemma is the tragedy of the commons, which can be used to model overexploitation of shared resources (e.g. climate change).

The point is that in social dilemmas it is rational for each agent to defect, even if everyone would be better off if they were all cooperating. If we imagine a world with multiple powerful AI systems, social dilemmas are not solved by default by alignment. If two different groups have their own aligned AI systems, an interaction between these systems where they make rational and aligned choices can still lead to poor outcomes for both groups even if a better solution were possible. I understand “cooperative intelligence” as a concept that aims to capture the kind of capability that would be required in such a situation to achieve a collectively better outcome.
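To make the prisoner’s dilemma argument concrete, here is a minimal Python sketch. The payoff numbers are purely illustrative assumptions (they are not taken from any source); they only need to respect the usual ordering in which unilateral defection pays most and unilateral cooperation pays least. The function names are likewise hypothetical. The sketch checks that defecting is each agent’s best response whatever the other does, and that the resulting equilibrium is worse for both than mutual cooperation.

```python
from itertools import product

COOPERATE, DEFECT = "cooperate", "defect"
ACTIONS = (COOPERATE, DEFECT)

# Illustrative payoffs (higher is better), assumed for this example only.
# PAYOFFS[(action_a, action_b)] = (payoff to agent A, payoff to agent B)
PAYOFFS = {
    (COOPERATE, COOPERATE): (3, 3),
    (COOPERATE, DEFECT): (0, 5),
    (DEFECT, COOPERATE): (5, 0),
    (DEFECT, DEFECT): (1, 1),
}


def best_response(player, other_action):
    """Return the action that maximises `player`'s own payoff (0 = A, 1 = B),
    given the action chosen by the other agent."""
    def own_payoff(action):
        profile = (action, other_action) if player == 0 else (other_action, action)
        return PAYOFFS[profile][player]
    return max(ACTIONS, key=own_payoff)


def nash_equilibria():
    """Return all action profiles in which each agent's action is a best
    response to the other agent's action."""
    return [
        (a, b)
        for a, b in product(ACTIONS, repeat=2)
        if best_response(0, b) == a and best_response(1, a) == b
    ]


def total_welfare(profile):
    """Sum of both agents' payoffs, used here as a crude stand-in for social welfare."""
    return sum(PAYOFFS[profile])


print(nash_equilibria())                      # [('defect', 'defect')]
print(total_welfare((DEFECT, DEFECT)))        # 2
print(total_welfare((COOPERATE, COOPERATE)))  # 6
```

With these assumed numbers, each agent optimising faithfully for its own side ends up at mutual defection, even though mutual cooperation would give both sides more. That gap is the kind of outcome that alignment of each system on its own does not close.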

This is not only relevant to a situation with superhuman artificial intelligence. These kinds of dynamics could also occur with narrower or more limited systems, leading to significantly harmful outcomes even if each system in isolation is safe.

The dual-use aspect of cooperative intelligence

In this paper, which outlines the agenda for cooperative AI, the elements of cooperative intelligence are described as follows:

  • Understanding: The ability to take into account the consequences of actions, to predict another’s behaviour, and the implications of another’s beliefs and preferences.

  • Communication: The ability to explicitly and credibly share information with others relevant to understanding behaviour, intentions and preferences.

  • Commitment: The ability to make credible promises when needed for cooperation.

  • Norms and institutions: Social infrastructure — such as shared beliefs or rules — that reinforces understanding, communication and commitment.

When I first read this I had just gone through Bluedot’s AI Safety Fundamentals course and I suspect I’m not the only one reacting to this list of capabilities (at least the first three points) with the thought that this sounds dangerous. Yes, these are capabilities that are useful for cooperation, but they are also the very same capabilities that are needed for deception, manipulation and extortion.

I still think this is true—these are potentially dangerous capabilities—but I also think these capabilities are being developed outside of cooperative AI. I think the right framing of cooperative AI is less as a field pushing capabilities forward in each of these areas, and more as an initiative to study these emerging capabilities and to work on achieving differential development in this area. What we need is to understand how we can promote beneficial cooperation while suppressing harmful tendencies and dynamics. There is clearly cause to be extremely cautious with this, and I think it’s important to recognize the components of cooperative intelligence as dual-use capabilities.

My current model of how cooperative AI fits into the AI safety landscape

As cooperative AI is an emerging field, its boundaries and overlaps with other fields are not yet well established. Below I share my current model of how cooperative AI fits into the wider landscape. Note that this is very much subject to change, and I would appreciate counter-suggestions and challenges in the comments!

[Figure: My current model of how cooperative AI fits into the AI safety landscape]

  • I here use “beneficial AI” to denote all work on AI that aims to make the world better.

  • I see “AI safety” as a subfield of beneficial AI that aims to ensure that powerful AI systems do not cause catastrophic harm

  • I see “AI alignment” as a subfield of AI safety that aims to ensure that powerful AI systems do not pursue goals that are bad for humanity

  • I think of “multipolar safety” as another subfield of AI safety, separate from AI alignment

  • I am somewhat uncertain about what falls under “AI ethics”—my tentative framing is that it is about ensuring that AI systems are fair and inclusive, and that there is some overlap between this and AI safety

What Cooperative AI is not (in my current model):

  • Most work on AI ethics is not cooperative AI: e.g. detecting and removing bias in AI systems

  • Most work on beneficial AI is not cooperative AI: e.g. using AI tools to improve areas such as healthcare, agriculture, education

  • While work on alignment is often relevant for cooperative AI, AI alignment of one system to one set of values is not cooperative AI

  • AI safety includes work that is neither alignment nor cooperative AI, for example work that deals with harm caused by malicious use of AI systems

What Cooperative AI is (in my current model):

  • Multipolar safety is a subset of cooperative AI: everything that has to do with preventing serious harm arising from a dynamic with multiple very powerful (potentially superhuman) AI systems in the world

  • Cooperative AI includes aspects of AI safety that are not necessarily related to the most powerful systems (and are therefore outside of “multipolar safety”), as catastrophically harmful dynamics could occur even with more limited systems

  • Cooperative AI also includes aspects of beneficial AI outside of AI safety that are about realising the positive potential of advanced AI to provide solutions for large-scale cooperation among humans. To count as cooperative AI in my mind, such work should be about improving the general capabilities of AI systems for solving cooperation problems (though a specific use case might be used for testing and demonstration).
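
One way to restate the bullets above compactly is in set notation. This is only a sketch of the relationships as described in this post, treating each field as a set of research work. Reading “separate from AI alignment” as “disjoint from” is an assumption, and the open questions about AI ethics are deliberately left out.

```latex
\begin{align*}
\text{Alignment} &\subseteq \text{Safety} \subseteq \text{Beneficial AI} \\
\text{Multipolar safety} &\subseteq \text{Safety} \cap \text{Cooperative AI} \\
\text{Multipolar safety} \cap \text{Alignment} &= \emptyset
  && \text{(assumed reading of ``separate'')} \\
(\text{Cooperative AI} \cap \text{Safety}) \setminus \text{Multipolar safety} &\neq \emptyset \\
(\text{Cooperative AI} \cap \text{Beneficial AI}) \setminus \text{Safety} &\neq \emptyset \\
\text{Safety} \setminus (\text{Alignment} \cup \text{Cooperative AI}) &\neq \emptyset
\end{align*}
```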

What I am still very uncertain about:

  • What (if any) is the overlap of cooperative AI, AI ethics, and AI safety? Perhaps preventing catastrophic harm that is somehow tied to failures of fairness or inclusion?

  • What (if any) is the overlap of AI ethics and cooperative AI, outside of AI safety? Perhaps some work on preference aggregation and AI-enabled improvements to fair governance systems?