Note: the curriculum outline in this post is now out of date; see the linked document for the canonical version.
Over the last year, EA Cambridge has been designing and running an online program aimed at effectively introducing the field of AGI safety; the most recent cohort included around 150 participants and 25 facilitators from around the world. Dewi Erwan runs the program; I designed the curriculum, the latest version of which appears in the linked document. We expect the program to be most useful to people with technical backgrounds (e.g. maths, CS, or ML), although the curriculum is intended to be accessible to those who aren't familiar with machine learning, and participants will be put in groups with others from similar backgrounds. If you're interested in joining the next version of the course (taking place January-March 2022), apply here to be a participant or here to be a facilitator. Applications are open to anyone and close on 15 December.

EDIT 29 Nov: We've now also released the curriculum for the governance track.

EDIT 10 Dec: Facilitators will be paid $1000; the time commitment is 2-3 hours a week for 8 weeks.
This post contains an overview of the course and an abbreviated version of the curriculum; the full version (which also contains optional readings, exercises, notes, discussion prompts, and project ideas) can be found here. Comments and feedback are very welcome, either on this post or in the full curriculum document; suggestions of new exercises, prompts or readings would be particularly helpful. I’ll continue to make updates until shortly before the next cohort starts.
Course overview
The course consists of 8 weeks of readings, plus a final project. Participants are divided into groups of 4-6 people, matched based on their prior knowledge of ML and safety. Each week (apart from week 0), each group and their discussion facilitator will meet for 1.5 hours to discuss the readings and exercises. Broadly speaking, the first half of the course explores the motivations and arguments underpinning the field of AGI safety, while the second half focuses on proposals for technical solutions. After week 7, participants will have several weeks to work on projects of their choice, which they'll present at the final session.
Each week’s curriculum contains:
Key ideas for that week
Core readings
Optional readings
Two exercises (participants should pick one to do each week)
Further notes on the readings
Discussion prompts for the weekly session
Week 0 replaces the small group discussions with a lecture plus live group exercises, since it’s aimed at getting people with little ML knowledge up to speed quickly.
The topics for each week are:
Week 0 (optional): introduction to machine learning
Week 1: Artificial general intelligence
Week 2: Goals and misalignment
Week 3: Threat models and types of solutions
Week 4: Learning from humans
Week 5: Decomposing tasks for outer alignment
Week 6: Other paradigms for safety work
Week 7: AI governance
Week 8 (several weeks later): Projects
Abbreviated curriculum (only key ideas and core readings)
Week 0 (optional): introduction to machine learning
This week mainly involves learning about foundational concepts in machine learning, for those who are less familiar with them or who want to revise the basics. If you're not already familiar with basic concepts in statistics (such as regression), it will take a bit longer than most weeks; and instead of the usual group discussions, there will be a lecture and group exercises. If you'd like to learn ML in more detail, see the further resources section at the end of this curriculum.
Otherwise, start with Ngo (2021), which provides a framework for thinking about machine learning, and in particular the two key components of deep learning: neural networks and optimisation. For more details and intuitions about neural networks, watch 3Blue1Brown (2017a); for more details and intuitions about optimisation, watch 3Blue1Brown (2017b). Lastly, see von Hasselt (2021) for an introduction to the field of reinforcement learning.
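To make the optimisation idea concrete before you watch the videos, here is a minimal sketch (my own illustration, not part of the official curriculum) of gradient descent fitting a linear regression; the data and hyperparameters are invented for the example. The same loop, scaled up to millions of parameters, is what trains the neural networks discussed in the readings.

```python
# A minimal illustration (not from the curriculum) of gradient descent:
# fitting y = w*x + b to noisy data by repeatedly stepping the parameters
# against the gradient of the mean-squared-error loss.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=100)  # true w=3.0, b=0.5

w, b = 0.0, 0.0
learning_rate = 0.1
for step in range(500):
    predictions = w * x + b
    error = predictions - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # should end up close to 3.0 and 0.5
```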
Core readings:
If you’re not familiar with the basics of statistics, like linear regression and classification:
Introduction: linear regression (10 mins)
Ordinary least squares regression (10 mins)
A short introduction to machine learning (Ngo, 2021) (20 mins)
But what is a neural network? (3Blue1Brown, 2017a) (20 mins)
Gradient descent, how neural networks learn (3Blue1Brown, 2017b) (20 mins)
Introduction to reinforcement learning (von Hasselt, 2021) (ending at 36:30, at the section titled 'Inside the Agent') (40 mins)
Week 1: Artificial general intelligence
The first two readings this week offer several different perspectives on how we should think about artificial general intelligence. This is the key concept underpinning the course, so it’s important to deeply explore what we mean by it, and the limitations of our current understanding.
The third reading is about how we should expect advances in AI to occur. AI pioneer Rich Sutton explains the main lesson he draws from the history of the field: that “general methods that leverage computation are ultimately the most effective”. Compared with earlier approaches, these methods rely much less on human design, and therefore raise the possibility that we build AGIs whose cognition we know very little about.
Focusing on compute also provides a way to forecast when we should expect AGI to occur. The most comprehensive report on the topic (summarised by Karnofsky (2021)) estimates the amount of compute required to train neural networks as large as human brains to do highly impactful tasks, and concludes that this will probably be feasible within the next four decades—although the estimate is highly uncertain.
Core readings:
Four background claims (Soares, 2015) (15 mins)
AGI safety from first principles (Ngo, 2020) (only sections 1, 2 and 2.1) (20 mins)
The Bitter Lesson (Sutton, 2019) (15 mins)
Forecasting transformative AI: the “biological anchors” method in a nutshell (Karnofsky, 2021) (30 mins)
Week 2: Goals and misalignment
This week we'll focus on how and why AGIs might develop goals that are misaligned with those of humans, in particular when they've been trained using machine learning. We cover three core ideas. Firstly, it's difficult to create reward functions which specify the desired outcomes for complex tasks (known as the problem of outer alignment). Krakovna et al. (2020) help build intuitions about the difficulty of outer alignment by showcasing examples of misbehaviour on toy problems.
Secondly, however, it’s important to distinguish between the reward function which is used to train a reinforcement learning agent, versus the goals which that agent learns to pursue. Hubinger et al. (2019a) argue that even an agent trained on the “right” reward function might acquire undesirable goals—the problem of inner alignment. Carlsmith (2021) explores in more detail what it means for an agent to be goal-directed in a worrying way, and gives reasons why such agents seem likely to arise.
Lastly, Bostrom (2014) argues that almost all goals which an AGI might have would incentivise it to misbehave in highly undesirable ways (e.g. pursuing survival and resource acquisition), due to the phenomenon of instrumental convergence.
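To make the outer alignment worry more concrete, here is a toy example in the spirit of the specification gaming examples from Krakovna et al. (2020); the environment and reward here are hypothetical inventions for illustration, not taken from the reading. The designer wants a room cleaned, but rewards each unit of dirt collected, so a planner that optimises the specified reward learns to create messes in order to re-collect them.

```python
# A hypothetical toy example of a misspecified ("proxy") reward.
# Intended outcome: the room ends up clean, without the robot making new messes.
# Specified reward: +1 for each unit of dirt collected.
from itertools import product

ACTIONS = ["collect", "dump", "wait"]
HORIZON = 6

def rollout(actions, initial_dirt=2):
    dirt, proxy_reward, messes_made = initial_dirt, 0, 0
    for action in actions:
        if action == "collect" and dirt > 0:
            dirt -= 1
            proxy_reward += 1   # the reward the designer wrote down
        elif action == "dump":
            dirt += 1           # the designer never imagined this being useful
            messes_made += 1
    return proxy_reward, messes_made

# A brute-force "planner" that simply maximises the specified reward.
best_plan = max(product(ACTIONS, repeat=HORIZON), key=lambda plan: rollout(plan)[0])
print(best_plan, rollout(best_plan))
# The reward-maximising plan dumps dirt back out in order to re-collect it:
# high proxy reward, but not the behaviour the designer intended.
```

Krakovna et al.'s examples are real (and funnier) versions of this same pattern: the optimiser finds the highest-reward behaviour available, which need not be the behaviour the designer had in mind.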
Core readings:
Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020) (15 mins)
Introduction to Risks from Learned Optimisation (Hubinger et al., 2019a) (30 mins)
Superintelligence, Chapter 7: The superintelligent will (Bostrom, 2014) (45 mins)
Is power-seeking AI an existential risk? (Carlsmith, 2021) (only sections 2: Timelines and 3: Incentives) (25 mins)
Week 3: Threat models and types of solutions
How might misaligned AGIs cause catastrophes, and how might we stop them? Two threat models are outlined in Christiano (2019): the first focuses on outer misalignment, the second on inner misalignment. Muehlhauser and Salamon (2012) outline a core intuition for why we might be unable to prevent these risks: that progress in AI will at some point speed up dramatically. A third key intuition—that misaligned agents will try to deceive humans—is explored by Hubinger et al. (2019).
How might we prevent these scenarios? Christiano (2020) gives a broad overview of the landscape of different contributions to making AIs aligned, with a particular focus on some of the techniques we’ll be covering in later weeks.
Core readings:
What failure looks like (Christiano, 2019)
Intelligence explosion: evidence and import (Muehlhauser and Salamon, 2012) (only pages 10-15) (15 mins)
Risks from Learned Optimisation: Deceptive alignment (Hubinger et al., 2019) (45 mins)
AI alignment landscape (Christiano, 2020)
Week 4: Learning from humans
This week, we look at four techniques for training AIs on human data (all falling under “learn from teacher” in Christiano’s AI alignment landscape from last week). From a safety perspective, each of them improves on standard reinforcement learning techniques in some ways, but also has weaknesses which prevent it from solving the whole alignment problem. Next week, we’ll look at some ways to make these techniques more powerful and scalable; this week focuses on understanding each of them.
The first technique, behavioural cloning, is essentially an extension of supervised learning to settings where an AI must take actions over time—as discussed by Levine (2021a). The second, reward modelling, allows humans to give feedback on the behaviour of reinforcement learning agents, which is then used to determine the rewards they receive; this is used by Christiano et al. (2017) and Stiennon et al. (2020). The third, inverse reinforcement learning (IRL for short), attempts to identify what goals a human is pursuing based on their behaviour.
A notable variant of IRL is cooperative IRL (CIRL for short), introduced by Hadfield-Menell et al. (2016). CIRL focuses on cases where the human and AI interact in a shared environment, and therefore the best strategy for the human is often to help the AI learn what goal the human is pursuing.
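As a rough illustration of the reward modelling approach used by Christiano et al. (2017) and Stiennon et al. (2020), here is a simplified sketch (my own rendering, not code from either paper): a small network assigns a score to each trajectory segment and is trained so that segments the human preferred receive higher scores, via a Bradley-Terry-style comparison loss. The architecture, dimensions, and segment lengths are placeholders.

```python
# A simplified sketch of learning a reward model from pairwise human preferences.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (timesteps, obs_dim); sum per-step scores into one segment score
        return self.net(segment).sum()

def preference_loss(model, seg_a, seg_b, human_prefers_a):
    # Bradley-Terry style: P(A preferred) = exp(r_A) / (exp(r_A) + exp(r_B)),
    # trained with cross-entropy against the human's choice.
    logits = torch.stack([model(seg_a), model(seg_b)]).unsqueeze(0)  # shape (1, 2)
    target = torch.tensor([0 if human_prefers_a else 1])             # shape (1,)
    return nn.functional.cross_entropy(logits, target)

# Toy usage: one gradient step on a single (randomly generated) comparison.
obs_dim = 8
reward_model = RewardModel(obs_dim)
optimiser = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
seg_a, seg_b = torch.randn(20, obs_dim), torch.randn(20, obs_dim)
loss = preference_loss(reward_model, seg_a, seg_b, human_prefers_a=True)
optimiser.zero_grad(); loss.backward(); optimiser.step()
# In the real setup, the learned reward then stands in for the environment reward
# when training an RL policy, and further comparisons are collected iteratively.
```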
Core readings:
Imitation learning lecture: part 1 (Levine, 2021a) (20 mins)
Deep RL from human preferences blog post (Christiano et al., 2017) (15 mins)
Learning to summarise with human feedback blog post (Stiennon et al., 2020) (25 mins)
Inverse reinforcement learning
For those who don’t already understand IRL:
For those who already understand IRL:
Week 5: Decomposing tasks for outer alignment
The most prominent research directions in technical AGI safety involve training AIs to do complex tasks by decomposing those tasks into simpler ones where humans can more easily evaluate AI behaviour. This week we’ll cover three closely-related algorithms (all falling under “build a better teacher” in Christiano’s AI alignment landscape).
Wu et al. (2021) applies reward modelling recursively in order to solve more difficult tasks. Recursive reward modelling can be considered one example of a more general class of techniques called iterated amplification (also known as iterated distillation and amplification), which is described in Ought (2019). A more technical description of iterated amplification is given by Christiano et al. (2018), along with some small-scale experiments.
The third technique we’ll discuss this week is Debate, as proposed by Irving and Amodei (2018). Unlike the other two techniques, Debate focuses on evaluating claims made by language models, rather than supervising AI behaviour over time.
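For intuition, here is a deliberately toy sketch of the amplification-and-distillation loop that Christiano et al. (2018) describe, using list summation as a stand-in task; the function names and the lookup-table "distillation" are my own simplifications rather than the paper's actual setup.

```python
# A toy sketch of iterated amplification, using list summation as the task.
from typing import Callable, Dict, List

Task = tuple                      # a tuple of numbers to be summed
Model = Callable[[Task], int]     # a model maps a task to an answer

def decompose(task: Task) -> List[Task]:
    # The overseer splits a hard task into two easier halves.
    mid = len(task) // 2
    return [task[:mid], task[mid:]]

def amplify(task: Task, model: Model) -> int:
    # Amplification: answer a task by decomposing it, asking the current model
    # to solve the easier subtasks, and combining the sub-answers.
    if len(task) <= 1:
        return task[0] if task else 0
    return sum(model(subtask) for subtask in decompose(task))

def distill(examples: Dict[Task, int], fallback: Model) -> Model:
    # Distillation: train a fast model to imitate the amplified overseer.
    # A lookup table stands in for supervised learning here.
    return lambda task: examples.get(task, fallback(task))

# Start with a weak model that can only answer single-element tasks.
model: Model = lambda task: task[0] if len(task) == 1 else 0

tasks = [(1,), (2,), (3,), (4,), (1, 2), (3, 4), (1, 2, 3, 4)]
for _ in range(3):
    # Each round: generate amplified answers, then distill them into the model.
    amplified_answers = {task: amplify(task, model) for task in tasks}
    model = distill(amplified_answers, fallback=model)

print(model((1, 2, 3, 4)))  # prints 10 once amplification has propagated up
```

The point of the loop is that each round of distillation bakes in the competence of the amplified overseer, so the model gradually becomes able to answer questions the original weak model could not, while every individual decomposition step stays easy for a human to check.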
Core readings:
Recursively summarising books with human feedback (Wu et al., 2021) (ending after section 4.1.2: Findings) (45 mins)
Factored cognition (Ought, 2019) (introduction and scalability section) (20 mins)
AI safety via debate blog post (Irving and Amodei, 2018) (15 mins)
Supervising strong learners by amplifying weak experts (Christiano et al., 2018) (40 mins)
Week 6: Other paradigms for safety work
A lot of safety work focuses on "shifting the paradigm" of AI research. This week we'll cover two ways in which safety researchers have attempted to do so. The first is via research on interpretability, which attempts to understand in detail how neural networks work. Olah et al. (2020) showcase some prominent research in this area, and Chris Olah's broader perspective is summarised by Hubinger et al. (2019).
The second is the research agenda of the Machine Intelligence Research Institute (MIRI), which aims to create rigorous mathematical frameworks to describe the relationships between AIs and their real-world environments. Soares (2015) gives a high-level explanation of their approach, while Demski and Garrabrant (2018) identify a range of open problems and the links between them.
Core readings:
Zoom In: an introduction to circuits (Olah et al., 2020) (35 mins)
MIRI’s approach (Soares, 2015) (30 mins)
Week 7: AI governance
In the last week of curriculum content, we’ll look at the field of AI governance. Start with Dafoe (2020), which gives a thorough overview of AI governance and ways in which it might be important, particularly focusing on the framing of AI governance as field-building. An alternative framing—of AI governance as an attempt to prevent cooperation failures—is explored by Clifton (2019). Although the field of AI governance is still young, Muehlhauser (2020) identifies some useful work so far. Finally, Bostrom (2019) provides a background framing for thinking about technological risks: the process of randomly sampling new technologies, some of which might prove catastrophic.
Core readings:
AI Governance: Opportunity and Theory of Impact (Dafoe, 2020) (25 mins)
Cooperation, conflict and transformative AI: sections 1 & 2 (Clifton, 2019) (25 mins)
Our AI governance grantmaking so far (Muehlhauser, 2020) (15 mins)
The vulnerable world hypothesis (Bostrom, 2019) (ending at the start of the section on ‘Preventive policing’) (60 mins)
Week 8 (four weeks later): Projects
The final part of the AGI Safety Fundamentals course is a project in which you get to dig into a topic related to the course. The project is a chance for you to explore your interests, so try to find something you're excited about! Its goal is to help you practise taking an intellectually productive stance towards AGI safety—to go beyond just reading and discussing existing ideas, and take a tangible step towards contributing to the field yourself. This is particularly valuable because it's such a new field, with lots of room to explore.
I noticed that "Will humans build goal-directed agents?" was changed from being a required reading for Week 2 to being an optional reading. I don't disagree with this choice, as I didn't find the post very convincing, though I was rather fond of your post "AGI safety from first principles: Goals and Agency". However, now all the required readings for Week 2 essentially take for granted that AGI will have large-scale goals. Before I participated in the first round of AGI Safety Fundamentals this year, I had never considered the possibility that AGI could be non-goal-directed. I thought that since AI involves an objective function, we could directly conclude that a superintelligence would have the goal of optimizing the environment accordingly in a goal-directed fashion—especially since this seems to be an assumption underlying popular introductions such as those by Wait But Why and Yudkowsky. It was only after reading "Goals and Agency" as part of the program that I realized goal-directed AGI wasn't a logical necessity. It might be helpful to draw out this consideration in the readings or the "key ideas" section. Do you think the question of whether AGI will be goal-directed is important for participants to consider?
Overall though I think this revised curriculum looks really good!
This is a great point, and I do think it’s an important question for participants to consider; I should switch the last reading for something covering this. The bottleneck is just finding a satisfactory reading—I’m not totally happy with any of the posts covering this, but maybe AGI safety from first principles is the closest to what I want.
Actually, Joe Carlsmith does it better in Is power-seeking AI an existential risk? So I’ve swapped that in instead.
I just want to say that this course curriculum is amazing, and I really appreciate that you've made it public. I've already gone through about a dozen articles. I'm an ML engineer who wants to learn more about AGI safety, but it's unfortunately not a priority for me at the moment. I will still likely go through the curriculum on my own time, but since I'm currently focused on the more technical aspects of building ML models, I won't be applying, as I can't strongly commit to the course. Again, I appreciate you making the curriculum public. As I slowly go through it, I might send some questions for clarification along the way. I hope that's ok. Thanks!
I have added a note to my RAISE post-mortem, which I’m cross-posting here:
Edit November 2021: there is now the Cambridge AGI Safety Fundamentals course, which promises to be successful. It is enlightening to compare this project with RAISE. Why is that one succeeding where this one did not? I'm quite surprised to find that the answer isn't so much about more funding, more senior people to execute it, more time, etc. They're simply using existing materials instead of creating their own. This makes it orders of magnitude easier to produce the thing: you can just focus on the delivery. Why didn't I, or anyone around me, think of this? I'm honestly perplexed. It's worth thinking about.
Yeah, I also feel confused about why I didn’t have this thought when talking to you about RAISE.
Most proximately, AGI Safety Fundamentals uses existing materials because its format is based on the other EA university programs, and also because I didn't have time to write (many) new materials for it.
I think the important underlying dynamic here is starting with a specific group of people with a problem, and then making the minimum viable product that solves their problem. In this case, I was explicitly thinking about what would have helped my past self the most.
Perhaps I personally didn’t have this thought back in 2019 because I was still in “figure out what’s up with AI safety” mode, and so wasn’t in a headspace where it was natural to try to convey things to other people.
Looks great, and I applied as a participant through the provided link. However, I did not receive a confirmation. Is this expected, or did something go wrong?
I had the same doubt. Could someone let us know whether we filled in the form correctly, or whether there was an error?
Airtable's free plan doesn't allow sending confirmation emails. I've now upgraded to the pro plan, and will send out confirmation emails to everyone who has already applied.
If I remember correctly, if you use the “Gmail” automation on Airtable instead of the “Email” automation, you can send confirmation emails on the free tier of Airtable.
Ah awesome, thank you!
Thanks for the response.
Thanks!
I vaguely remember seeing a website for that program, but can’t find the link – is this post the most up-to-date resource, or is the website more up to date, and if the latter, do you have a link? Thank you!
This post (plus the linked curriculum) is the most up-to-date resource.
There’s also this website, but it’s basically just a (less-up-to-date) version of the curriculum.
Do you have a list of readings that will be used in the AI governance track?
Not finalised, but here’s a rough reading list which would replace weeks 5-7 for the governance track.
Update: see here: https://forum.effectivealtruism.org/posts/68ANc8KhEn6sbQ3P9/ai-governance-fundamentals-curriculum-and-application
It's unclear here to what extent attendance will be in person. I'll gladly take an excuse to come to Cambridge for a few months, but I'd need to apply for funding to stay there, etc.
The programme is virtual by default; we've made this clearer in the application form.