AGI safety from first principles

In this report (which I’m linking from the Alignment Forum) I have attempted to put together the most compelling case for why the development of AGI might pose an existential threat. It stems from my dissatisfaction with existing arguments about the potential risks from AGI. Early work tends to be less relevant in the context of modern machine learning; more recent work is scattered and brief. I originally intended to just summarise other people’s arguments, but as this report has grown, it’s become more representative of my own views and less representative of anyone else’s. So while it covers the standard ideas, I also think that it provides a new perspective on how to think about AGI—one which doesn’t take any previous claims for granted, but attempts to work them out from first principles.

The report is primarily aimed at people who already understand the basics of machine learning, but most of it should also make sense to laypeople. It’s roughly 15,000 words in total, split into six sections: the first and last are short framing sections, while the middle four correspond to the four premises of the core argument. The brief introductory section appears below; find the rest on the Alignment Forum.

AGI safety from first principles

The key concern motivating technical AGI safety research is that we might build autonomous artificially intelligent agents which are much more intelligent than humans, and which pursue goals that conflict with our own. Human intelligence allows us to coordinate complex societies and deploy advanced technology, and thereby control the world to a greater extent than any other species. But AIs will eventually become more capable than us at the types of tasks by which we maintain and exert that control. If they don’t want to obey us, then humanity might become only Earth’s second most powerful “species”, and lose the ability to create a valuable and worthwhile future.

I’ll call this the “second species” argument; I think it’s a plausible argument which we should take very seriously.[1] However, the version stated above relies on several vague concepts and intuitions. In this report I’ll give the most detailed presentation of the second species argument that I can, highlighting the aspects that I’m still confused about. In particular, I’ll defend a version of the second species argument which claims that, without a concerted effort to prevent it, there’s a significant chance that:

  1. We’ll build AIs which are much more intelligent than humans (i.e. superintelligent).

  2. Those AIs will be autonomous agents which pursue large-scale goals.

  3. Those goals will be misaligned with ours; that is, they will aim towards outcomes that aren’t desirable by our standards, and trade off against our goals.

  4. The development of such AIs would lead to them gaining control of humanity’s future.

While I use many examples from modern deep learning, this report is also intended to apply to AIs developed using very different models, training algorithms, optimisers, or training regimes than the ones we use today. However, many of my arguments would no longer be relevant if the field of AI moves away from focusing on machine learning. I also frequently compare AI development to the evolution of human intelligence; while the two aren’t fully analogous, humans are the best example we currently have to ground our thinking about generally intelligent AIs.

That’s the introduction; to continue reading, here’s the next section, on Superintelligence. In addition to reframing existing arguments, here are a few of the more novel claims made in the rest of the report:

  1. When training AIs which can perform well on a range of novel tasks, we shouldn’t think of objective functions as specifications of our desired goals, but rather as tools to shape our agents’ motivations and cognition.

  2. Interactions between many AGIs (specifically via replication, cultural learning, and recursive improvement) will be important during the transition from human-level AGI to superintelligence.

  3. Existing frameworks for thinking about goal-directed agency don’t help us to predict the types of goals AGIs will have. To do so, we should identify specific cognitive capacities AGIs would need to be capable of pursuing goals, and how those might develop.

  4. The likelihood of inner misalignment occurring depends on whether instrumentally convergent subgoals will be present during training, and how complex they will be compared with the outer objective.

  5. We should plan towards building intent-aligned AGIs which are better than humans at safety and governance research. Up to that point, we can increase our chances of retaining control via coordination to deploy transparent systems in constrained ways.

See also Rohin’s summary for the Alignment Newsletter here.

  1. Stuart Russell also refers to this as the “gorilla problem” in his recent book, Human Compatible. ↩︎