Note: the curriculum outline in this post is now out of date; see the linked document for the canonical version.
Over the last year, EA Cambridge has been designing and running an online program aimed at effectively introducing the field of AGI safety; the most recent cohort included around 150 participants and 25 facilitators from around the world. Dewi Erwan runs the program; I designed the curriculum, the latest version of which appears in the linked document. We expect the program to be most useful to people with technical backgrounds (e.g. maths, CS, or ML), although the curriculum is intended to be accessible to those who aren't familiar with machine learning, and participants will be put in groups with others from similar backgrounds. If you're interested in joining the next version of the course (taking place January-March 2022), apply here to be a participant or here to be a facilitator. Applications are open to anyone and close on 15 December.

EDIT 29 Nov: We've now also released the curriculum for the governance track.

EDIT 10 Dec: Facilitators will be paid $1000; the time commitment is 2-3 hours a week for 8 weeks.
This post contains an overview of the course and an abbreviated version of the curriculum; the full version (which also contains optional readings, exercises, notes, discussion prompts, and project ideas) can be found here. Comments and feedback are very welcome, either on this post or in the full curriculum document; suggestions of new exercises, prompts or readings would be particularly helpful. I’ll continue to make updates until shortly before the next cohort starts.
Course overview
The course consists of 8 weeks of readings, plus a final project. Participants are divided into groups of 4-6 people, matched based on their prior knowledge of ML and safety. Each week (apart from week 0), each group and their discussion facilitator will meet for 1.5 hours to discuss the readings and exercises. Broadly speaking, the first half of the course explores the motivations and arguments underpinning the field of AGI safety, while the second half focuses on proposals for technical solutions. After week 7, participants will have several weeks to work on projects of their choice, which they'll present at the final session.
Each week’s curriculum contains:
Key ideas for that week
Core readings
Optional readings
Two exercises (participants should pick one to do each week)
Further notes on the readings
Discussion prompts for the weekly session
Week 0 replaces the small group discussions with a lecture plus live group exercises, since it’s aimed at getting people with little ML knowledge up to speed quickly.
The topics for each week are:
Week 0 (optional): introduction to machine learning
Week 1: Artificial general intelligence
Week 2: Goals and misalignment
Week 3: Threat models and types of solutions
Week 4: Learning from humans
Week 5: Decomposing tasks for outer alignment
Week 6: Other paradigms for safety work
Week 7: AI governance
Week 8 (several weeks later): Projects
Abbreviated curriculum (only key ideas and core readings)
Week 0 (optional): introduction to machine learning
This week mainly involves learning about foundational concepts in machine learning, for those who are less familiar with them or who want to revise the basics. If you're not already familiar with basic concepts in statistics (such as regression), it will take a bit longer than most weeks; and instead of the usual group discussions, there will be a lecture and group exercises. If you'd like to learn ML in more detail, see the further resources section at the end of this curriculum.
Otherwise, start with Ngo (2021), which provides a framework for thinking about machine learning, and in particular the two key components of deep learning: neural networks and optimisation. For more details and intuitions about neural networks, watch 3Blue1Brown (2017a); for more details and intuitions about optimisation, watch 3Blue1Brown (2017b). Lastly, see von Hasselt (2021) for an introduction to the field of reinforcement learning.
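To make the optimisation idea concrete before you watch the videos, here is a minimal sketch (my own illustration, not part of the official curriculum) of gradient descent fitting a linear regression; the data and hyperparameters are invented for the example. The same loop, scaled up to millions of parameters, is what trains the neural networks discussed in the readings.

```python
# A minimal illustration (not from the curriculum) of gradient descent:
# fitting y = w*x + b to noisy data by repeatedly stepping the parameters
# against the gradient of the mean-squared-error loss.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=100)  # true w=3.0, b=0.5

w, b = 0.0, 0.0
learning_rate = 0.1
for step in range(500):
    predictions = w * x + b
    error = predictions - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # should end up close to 3.0 and 0.5
```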
Core readings:
If you’re not familiar with the basics of statistics, like linear regression and classification:
Introduction: linear regression (10 mins)
Ordinary least squares regression (10 mins)
A short introduction to machine learning (Ngo, 2021) (20 mins)
But what is a neural network? (3Blue1Brown, 2017a) (20 mins)
Gradient descent, how neural networks learn (3Blue1Brown, 2017b) (20 mins)
Introduction to reinforcement learning (von Hasselt, 2021) (ending at 36:30, at the section titled 'Inside the Agent') (40 mins)
Week 1: Artificial general intelligence
The first two readings this week offer several different perspectives on how we should think about artificial general intelligence. This is the key concept underpinning the course, so it’s important to deeply explore what we mean by it, and the limitations of our current understanding.
The third reading is about how we should expect advances in AI to occur. AI pioneer Rich Sutton explains the main lesson he draws from the history of the field: that “general methods that leverage computation are ultimately the most effective”. Compared with earlier approaches, these methods rely much less on human design, and therefore raise the possibility that we build AGIs whose cognition we know very little about.
Focusing on compute also provides a way to forecast when we should expect AGI to occur. The most comprehensive report on the topic (summarised by Karnofsky (2021)) estimates the amount of compute required to train neural networks as large as human brains to do highly impactful tasks, and concludes that this will probably be feasible within the next four decades—although the estimate is highly uncertain.
Core readings:
Four background claims (Soares, 2015) (15 mins)
AGI safety from first principles (Ngo, 2020) (only sections 1, 2 and 2.1) (20 mins)
The Bitter Lesson (Sutton, 2019) (15 mins)
Forecasting transformative AI: the “biological anchors” method in a nutshell (Karnofsky, 2021) (30 mins)
Week 2: Goals and misalignment
This week we'll focus on how and why AGIs might develop goals that are misaligned with those of humans, in particular when they've been trained using machine learning. We cover three core ideas. Firstly, it's difficult to create reward functions which specify the desired outcomes for complex tasks (known as the problem of outer alignment). Krakovna et al. (2020) help build intuitions about the difficulty of outer alignment by showcasing examples of misbehaviour on toy problems.
Secondly, however, it’s important to distinguish between the reward function which is used to train a reinforcement learning agent, versus the goals which that agent learns to pursue. Hubinger et al. (2019a) argue that even an agent trained on the “right” reward function might acquire undesirable goals—the problem of inner alignment. Carlsmith (2021) explores in more detail what it means for an agent to be goal-directed in a worrying way, and gives reasons why such agents seem likely to arise.
Lastly, Bostrom (2014) argues that almost all goals which an AGI might have would incentivise it to misbehave in highly undesirable ways (e.g. pursuing survival and resource acquisition), due to the phenomenon of instrumental convergence.
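To make the outer alignment worry more concrete, here is a toy example in the spirit of the specification gaming examples from Krakovna et al. (2020); the environment and reward here are hypothetical inventions for illustration, not taken from the reading. The designer wants a room cleaned, but rewards each unit of dirt collected, so a planner that optimises the specified reward learns to create messes in order to re-collect them.

```python
# A hypothetical toy example of a misspecified ("proxy") reward.
# Intended outcome: the room ends up clean, without the robot making new messes.
# Specified reward: +1 for each unit of dirt collected.
from itertools import product

ACTIONS = ["collect", "dump", "wait"]
HORIZON = 6

def rollout(actions, initial_dirt=2):
    dirt, proxy_reward, messes_made = initial_dirt, 0, 0
    for action in actions:
        if action == "collect" and dirt > 0:
            dirt -= 1
            proxy_reward += 1   # the reward the designer wrote down
        elif action == "dump":
            dirt += 1           # the designer never imagined this being useful
            messes_made += 1
    return proxy_reward, messes_made

# A brute-force "planner" that simply maximises the specified reward.
best_plan = max(product(ACTIONS, repeat=HORIZON), key=lambda plan: rollout(plan)[0])
print(best_plan, rollout(best_plan))
# The reward-maximising plan dumps dirt back out in order to re-collect it:
# high proxy reward, but not the behaviour the designer intended.
```

Krakovna et al.'s examples are real (and funnier) versions of this same pattern: the optimiser finds the highest-reward behaviour available, which need not be the behaviour the designer had in mind.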
Core readings:
Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020) (15 mins)
Introduction to Risks from Learned Optimisation (Hubinger et al., 2019a) (30 mins)
Superintelligence, Chapter 7: The superintelligent will (Bostrom, 2014) (45 mins)
Is power-seeking AI an existential risk? (Carlsmith, 2021) (only sections 2: Timelines and 3: Incentives) (25 mins)
Week 3: Threat models and types of solutions
How might misaligned AGIs cause catastrophes, and how might we stop them? Two threat models are outlined in Christiano (2019): the first focuses on outer misalignment, the second on inner misalignment. Muehlhauser and Salamon (2012) outline a core intuition for why we might be unable to prevent these risks: that progress in AI will at some point speed up dramatically. A third key intuition—that misaligned agents will try to deceive humans—is explored by Hubinger et al. (2019).
How might we prevent these scenarios? Christiano (2020) gives a broad overview of the landscape of different contributions to making AIs aligned, with a particular focus on some of the techniques we’ll be covering in later weeks.
Core readings:
What failure looks like (Christiano, 2019)
Intelligence explosion: evidence and import (Muehlhauser and Salamon, 2012) (only pages 10-15) (15 mins)
Risks from Learned Optimisation: Deceptive alignment (Hubinger et al., 2019) (45 mins)
AI alignment landscape (Christiano, 2020)
Week 4: Learning from humans
This week, we look at four techniques for training AIs on human data (all falling under “learn from teacher” in Christiano’s AI alignment landscape from last week). From a safety perspective, each of them improves on standard reinforcement learning techniques in some ways, but also has weaknesses which prevent it from solving the whole alignment problem. Next week, we’ll look at some ways to make these techniques more powerful and scalable; this week focuses on understanding each of them.
The first technique, behavioural cloning, is essentially an extension of supervised learning to settings where an AI must take actions over time—as discussed by Levine (2021a). The second, reward modelling, allows humans to give feedback on the behaviour of reinforcement learning agents, which is then used to determine the rewards they receive; this is used by Christiano et al. (2017) and Stiennon et al. (2020). The third, inverse reinforcement learning (IRL for short), attempts to identify what goals a human is pursuing based on their behaviour.
A notable variant of IRL is cooperative IRL (CIRL for short), introduced by Hadfield-Menell et al. (2016). CIRL focuses on cases where the human and AI interact in a shared environment, and therefore the best strategy for the human is often to help the AI learn what goal the human is pursuing.
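As a rough illustration of the reward modelling approach used by Christiano et al. (2017) and Stiennon et al. (2020), here is a simplified sketch (my own rendering, not code from either paper): a small network assigns a score to each trajectory segment and is trained so that segments the human preferred receive higher scores, via a Bradley-Terry-style comparison loss. The architecture, dimensions, and segment lengths are placeholders.

```python
# A simplified sketch of learning a reward model from pairwise human preferences.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (timesteps, obs_dim); sum per-step scores into one segment score
        return self.net(segment).sum()

def preference_loss(model, seg_a, seg_b, human_prefers_a):
    # Bradley-Terry style: P(A preferred) = exp(r_A) / (exp(r_A) + exp(r_B)),
    # trained with cross-entropy against the human's choice.
    logits = torch.stack([model(seg_a), model(seg_b)]).unsqueeze(0)  # shape (1, 2)
    target = torch.tensor([0 if human_prefers_a else 1])             # shape (1,)
    return nn.functional.cross_entropy(logits, target)

# Toy usage: one gradient step on a single (randomly generated) comparison.
obs_dim = 8
reward_model = RewardModel(obs_dim)
optimiser = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
seg_a, seg_b = torch.randn(20, obs_dim), torch.randn(20, obs_dim)
loss = preference_loss(reward_model, seg_a, seg_b, human_prefers_a=True)
optimiser.zero_grad(); loss.backward(); optimiser.step()
# In the real setup, the learned reward then stands in for the environment reward
# when training an RL policy, and further comparisons are collected iteratively.
```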
Core readings:
Imitation learning lecture: part 1 (Levine, 2021a) (20 mins)
Deep RL from human preferences blog post (Christiano et al., 2017) (15 mins)
Learning to summarise with human feedback blog post (Stiennon et al., 2020) (25 mins)
Inverse reinforcement learning
For those who don’t already understand IRL:
For those who already understand IRL:
Week 5: Decomposing tasks for outer alignment
The most prominent research directions in technical AGI safety involve training AIs to do complex tasks by decomposing those tasks into simpler ones where humans can more easily evaluate AI behaviour. This week we’ll cover three closely-related algorithms (all falling under “build a better teacher” in Christiano’s AI alignment landscape).
Wu et al. (2021) applies reward modelling recursively in order to solve more difficult tasks. Recursive reward modelling can be considered one example of a more general class of techniques called iterated amplification (also known as iterated distillation and amplification), which is described in Ought (2019). A more technical description of iterated amplification is given by Christiano et al. (2018), along with some small-scale experiments.
The third technique we’ll discuss this week is Debate, as proposed by Irving and Amodei (2018). Unlike the other two techniques, Debate focuses on evaluating claims made by language models, rather than supervising AI behaviour over time.
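For intuition, here is a deliberately toy sketch of the amplification-and-distillation loop that Christiano et al. (2018) describe, using list summation as a stand-in task; the function names and the lookup-table "distillation" are my own simplifications rather than the paper's actual setup.

```python
# A toy sketch of iterated amplification, using list summation as the task.
from typing import Callable, Dict, List

Task = tuple                      # a tuple of numbers to be summed
Model = Callable[[Task], int]     # a model maps a task to an answer

def decompose(task: Task) -> List[Task]:
    # The overseer splits a hard task into two easier halves.
    mid = len(task) // 2
    return [task[:mid], task[mid:]]

def amplify(task: Task, model: Model) -> int:
    # Amplification: answer a task by decomposing it, asking the current model
    # to solve the easier subtasks, and combining the sub-answers.
    if len(task) <= 1:
        return task[0] if task else 0
    return sum(model(subtask) for subtask in decompose(task))

def distill(examples: Dict[Task, int], fallback: Model) -> Model:
    # Distillation: train a fast model to imitate the amplified overseer.
    # A lookup table stands in for supervised learning here.
    return lambda task: examples.get(task, fallback(task))

# Start with a weak model that can only answer single-element tasks.
model: Model = lambda task: task[0] if len(task) == 1 else 0

tasks = [(1,), (2,), (3,), (4,), (1, 2), (3, 4), (1, 2, 3, 4)]
for _ in range(3):
    # Each round: generate amplified answers, then distill them into the model.
    amplified_answers = {task: amplify(task, model) for task in tasks}
    model = distill(amplified_answers, fallback=model)

print(model((1, 2, 3, 4)))  # prints 10 once amplification has propagated up
```

The point of the loop is that each round of distillation bakes in the competence of the amplified overseer, so the model gradually becomes able to answer questions the original weak model could not, while every individual decomposition step stays easy for a human to check.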
Core readings:
Recursively summarising books with human feedback (Wu et al., 2021) (ending after section 4.1.2: Findings) (45 mins)
Factored cognition (Ought, 2019) (introduction and scalability section) (20 mins)
AI safety via debate blog post (Irving and Amodei, 2018) (15 mins)
Supervising strong learners by amplifying weak experts (Christiano et al., 2018) (40 mins)
Week 6: Other paradigms for safety work
A lot of safety work focuses on "shifting the paradigm" of AI research. This week we'll cover two ways in which safety researchers have attempted to do so. The first is via research on interpretability, which attempts to understand in detail how neural networks work. Olah et al. (2020) showcase some prominent research in this area, and Chris Olah's broader perspective is summarised by Hubinger et al. (2019).
The second is the research agenda of the Machine Intelligence Research Institute (MIRI), which aims to create rigorous mathematical frameworks to describe the relationships between AIs and their real-world environments. Soares (2015) gives a high-level explanation of their approach, while Demski and Garrabrant (2018) identify a range of open problems and the links between them.
Core readings:
Zoom In: an introduction to circuits (Olah et al., 2020) (35 mins)
MIRI’s approach (Soares, 2015) (30 mins)
Week 7: AI governance
In the last week of curriculum content, we’ll look at the field of AI governance. Start with Dafoe (2020), which gives a thorough overview of AI governance and ways in which it might be important, particularly focusing on the framing of AI governance as field-building. An alternative framing—of AI governance as an attempt to prevent cooperation failures—is explored by Clifton (2019). Although the field of AI governance is still young, Muehlhauser (2020) identifies some useful work so far. Finally, Bostrom (2019) provides a background framing for thinking about technological risks: the process of randomly sampling new technologies, some of which might prove catastrophic.
Core readings:
AI Governance: Opportunity and Theory of Impact (Dafoe, 2020) (25 mins)
Cooperation, conflict and transformative AI: sections 1 & 2 (Clifton, 2019) (25 mins)
Our AI governance grantmaking so far (Muehlhauser, 2020) (15 mins)
The vulnerable world hypothesis (Bostrom, 2019) (ending at the start of the section on ‘Preventive policing’) (60 mins)
Week 8 (four weeks later): Projects
The final part of the AGI Safety Fundamentals course is a project in which you get to dig into a topic related to the course. The project is a chance for you to explore your interests, so try to find something you're excited about! Its goal is to help you practise taking an intellectually productive stance towards AGI safety—to go beyond just reading and discussing existing ideas, and take a tangible step towards contributing to the field yourself. This is particularly valuable because it's such a new field, with lots of room to explore.
I noticed that "Will humans build goal-directed agents?" was changed from being a required reading for Week 2 to being an optional reading. I don't disagree with this choice, as I didn't find the post very convincing, though I was rather fond of your post "AGI safety from first principles: Goals and Agency". However, now all the required readings for Week 2 essentially take for granted that AGI will have large-scale goals. Before I participated in the first round of AGI Safety Fundamentals this year, I had never considered the possibility that AGI could be non-goal-directed. I thought that since AI involves an objective function, we could directly conclude that a superintelligence would have the goal of optimizing the environment accordingly in a goal-directed fashion—especially since this seems to be an assumption underlying popular introductions such as those by Wait But Why and Yudkowsky. It was only after reading "Goals and Agency" as part of the program that I realized goal-directed AGI wasn't a logical necessity. It might be helpful to draw out this consideration in the readings or the "key ideas" section. Do you think the question of whether AGI will be goal-directed is important for participants to consider?
Overall though I think this revised curriculum looks really good!
This is a great point, and I do think it’s an important question for participants to consider; I should switch the last reading for something covering this. The bottleneck is just finding a satisfactory reading—I’m not totally happy with any of the posts covering this, but maybe AGI safety from first principles is the closest to what I want.
Actually, Joe Carlsmith does it better in Is power-seeking AI an existential risk? So I’ve swapped that in instead.
I just want to say that this course curriculum is amazing, and I really appreciate that you've made it public. I've already gone through about a dozen articles. I'm an ML engineer who wants to learn more about AGI safety, but it's unfortunately not a priority for me at the moment. I will still likely go through the curriculum on my own time, but since I'm currently focused on the more technical aspects of building ML models, I won't be applying, as I can't strongly commit to the course. Again, I appreciate you making the curriculum public. As I slowly go through it, I might send some questions for clarification along the way. I hope that's ok. Thanks!
I have added a note to my RAISE post-mortem, which I’m cross-posting here:
Edit November 2021: there is now the Cambridge AGI Safety Fundamentals course, which promises to be successful. It is enlightening to compare this project with RAISE. Why is that one succeeding where this one did not? I'm quite surprised to find that the answer isn't so much about more funding, more senior people to execute it, more time, etc. They're simply using existing materials instead of creating their own. This makes it orders of magnitude easier to produce the thing: you can just focus on the delivery. Why didn't I, or anyone around me, think of this? I'm honestly perplexed. It's worth thinking about.
Yeah, I also feel confused about why I didn’t have this thought when talking to you about RAISE.
Most proximately, AGI Safety Fundamentals uses existing materials because its format is based on the other EA university programs, and also because I didn't have time to write (many) new materials for it.
I think the important underlying dynamic here is starting with a specific group of people with a problem, and then making the minimum viable product that solves their problem. In this case, I was explicitly thinking about what would have helped my past self the most.
Perhaps I personally didn’t have this thought back in 2019 because I was still in “figure out what’s up with AI safety” mode, and so wasn’t in a headspace where it was natural to try to convey things to other people.
Looks great, and I applied as a participant through the provided link. However, I did not receive a confirmation. Is this expected, or did something go wrong?
I had the same doubt. Could someone let us know whether we filled in the form correctly, or whether there was an error?
Airtable's free plan doesn't allow sending confirmation emails. I've now upgraded to the pro plan, and will send out confirmation emails to everyone who has already applied.
If I remember correctly, if you use the “Gmail” automation on Airtable instead of the “Email” automation, you can send confirmation emails on the free tier of Airtable.
Ah awesome, thank you!
Thanks for the response.
Thanks!
I vaguely remember seeing a website for that program, but can’t find the link – is this post the most up-to-date resource, or is the website more up to date, and if the latter, do you have a link? Thank you!
This post (plus the linked curriculum) is the most up-to-date resource.
There’s also this website, but it’s basically just a (less-up-to-date) version of the curriculum.
Do you have a list of readings that will be used in the AI governance track?
Not finalised, but here’s a rough reading list which would replace weeks 5-7 for the governance track.
Update: see here: https://forum.effectivealtruism.org/posts/68ANc8KhEn6sbQ3P9/ai-governance-fundamentals-curriculum-and-application
It's unclear here to what extent attendance will be in person. I'll gladly take an excuse to come to Cambridge for a few months, but I'd need to apply for funding to stay there, etc.
The programme is virtual by default; we've made this clearer in the application form.