Talk: AI safety fieldbuilding at MATS

I recently gave a talk to the AI Alignment Network (ALIGN) in Japan on my priorities for AI safety fieldbuilding based on my experiences at MATS and LISA (slides, recording). A lightly edited talk transcript is below. I recommend this talk to anyone curious about the high-level strategy that motivates projects like MATS. Unfortunately, I didn’t have time to delve into rebuttals and counter-rebuttals to our theory of change; this will have to wait for another talk/post.

Thank you to Ryuichi Maruyama for inviting me to speak!


Ryan: Thank you for inviting me to speak. I very much appreciated visiting Japan for the Technical AI Safety Conference in Tokyo. I had a fantastic time. I loved visiting Tokyo; it’s wonderful. I had never been before, and I was pleased to meet some representatives from ALIGN at the conference. I welcome more opportunities to speak to the ALIGN Network and Japanese researchers.

The purpose of my talk today is to give an update on the state of AI safety field-building as I understand it, which might benefit the ALIGN Network and the Japanese research community. Perhaps others will find it interesting. I can speak to my experience at MATS, LISA, and the various other AI safety projects I’ve been involved in. I’ll do my best, and we can have lots of questions.

I did a PhD in physics at the University of Queensland in Brisbane, Australia, which is where I grew up. During that PhD, I became interested in AI safety and did a limited amount of research on the topic. After my PhD, I realized there wasn’t much of a pipeline into the field, so new researchers faced many obstacles entering AI safety. I joined a program called MATS as a scholar. There were five of us in the first round; it was a pilot program. I wanted to help this program grow, so I joined the leadership team, and the old leadership team moved on. I helped grow the program into what is now a 90-person program running twice a year.

Along the way, I also helped co-found an office in London, the London Initiative for Safe AI (LISA). This is, as far as I understand, the preeminent AI safety community space for organizations pursuing AI safety projects in London and the UK. This complements offices such as Constellation and FAR Labs in Berkeley. I’m speaking from FAR Labs right now; this is in Berkeley, California, the epicenter of the AI safety movement, right next to San Francisco, the epicenter of AI development as I understand it. These two offices are part of, I hope, a growing chain of community spaces for AI safety.

Last year, I was nominated to join the regranting program at a platform called Manifund, where I was given $50,000 to give to promising projects. This year, I was given $250,000 and have only spent about $41,000 so far.

First off, what is artificial general intelligence (AGI)? When will it be here, and how might it change the world? One definition of AGI from a forecasting platform named Metaculus is that strong AGI has to pass a two-hour adversarial Turing test. In this test, you are in a chat interface with the model, perhaps with some images, and you can ask it any question. You don’t know if it’s an AI system or a human. If the system can successfully deceive humans into thinking it is a real person, it passes the adversarial Turing test.

Additionally, this system would need combined capabilities: it must be able to robotically assemble a model car, achieve at least 75% accuracy on each task in the MMLU benchmark with 90% mean accuracy overall, and achieve 90% top-1 accuracy on the APPS benchmark, which measures coding ability. “Top-1” means the model’s first attempt must be correct. This is a very powerful system that could plausibly automate most jobs that can be done from home.

When would this occur? According to Metaculus, the prediction is for September 25, 2031. Historically, 2050 was a common prediction for many years. However, with recent developments in language models, 2031 is now the median prediction on that platform. That’s very soon.

How did we get here? As you can see from these plots from Epoch, which is the preeminent organization that charts the technological progress of AI, we are at a point where the compute spent on frontier language models is growing at a rate of five times per year. If the performance of these models keeps growing proportionally, 2031 seems plausible.

Here are aggregated forecasts from Epoch on when AGI, or transformative artificial intelligence (a separate but related definition), will arrive. The 50% mark is about where the prediction sits on the left. The individual predictions disagree greatly, but if you take the judgment-based geometric mean of odds, it seems slated for some time after 2040. Interestingly, the Metaculus prediction has since jumped much sooner. So, 2031 might still be the best synthesis, the best prediction we have so far.
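As a quick aside on what that aggregation means: pooling forecasts via the geometric mean of odds converts each probability to odds, averages those geometrically, and converts back to a probability. Here is a minimal sketch; the three input probabilities are invented for illustration and are not Epoch’s actual data.

```python
import math

def pool_geometric_mean_of_odds(probabilities):
    """Pool forecasts by taking the geometric mean of their odds."""
    odds = [p / (1 - p) for p in probabilities]
    pooled_odds = math.prod(odds) ** (1 / len(odds))
    return pooled_odds / (1 + pooled_odds)  # convert pooled odds back to a probability

# Hypothetical forecasts for "transformative AI by 2040" from three forecasters
print(pool_geometric_mean_of_odds([0.2, 0.5, 0.8]))  # 0.5
```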

How will this change the world? People disagree substantially on how much labor can be automated: different meta-analyses come up with varying levels of automatability for different professions. If I were to guess, I would say almost all professions that are purely digital could be automated, but I’m not an expert. AI could also significantly affect economic growth. Transformative AI would cause GDP to grow at 20-30% per year, ten times the normal 2-3% growth rate. This means the entire world economy would double in about three years. That is incredible.
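As a sanity check on that doubling claim (my own back-of-the-envelope arithmetic, not a figure from the talk’s sources): at a constant annual growth rate g, output doubles every ln(2)/ln(1+g) years.

```python
import math

def doubling_time_years(annual_growth_rate):
    """Years for output to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_rate)

print(doubling_time_years(0.025))  # ~28 years at today's ~2.5% growth
print(doubling_time_years(0.25))   # ~3.1 years at 25% growth under transformative AI
```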

How might this occur? If we build AI that can help accelerate scientific research and development, it can aid further AI development. Through that research, it can improve the hardware we run AI on, improve the software, and increase the amount we can spend on compute for the next training run, making for a better AI. This cycle keeps going as long as AI systems can aid human scientific research and development.
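To make the compounding concrete, here is a deliberately toy simulation of that loop; every multiplier in it is an assumption I made up for illustration, not a forecast.

```python
# Toy model of the AI R&D feedback loop; all multipliers are illustrative assumptions.
effective_compute = 1.0
ai_rd_boost = 1.0  # how much AI assistance speeds up the next round of R&D

for generation in range(1, 6):
    hardware_gain = 1.2 * ai_rd_boost  # better chips, partly AI-designed
    software_gain = 1.3 * ai_rd_boost  # better algorithms, partly AI-discovered
    spend_gain = 1.2                   # more investment in the next training run
    effective_compute *= hardware_gain * software_gain * spend_gain
    ai_rd_boost = 1.0 + 0.05 * generation  # more capable AI helps R&D more
    print(f"Generation {generation}: ~{effective_compute:.0f}x effective training compute")
```

The point is not the particular numbers but that each generation’s gains feed into the next, so the per-generation multiplier itself keeps rising rather than staying fixed.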

How might it change the world in a societal sense? There are three types of superintelligence as defined by Nick Bostrom, an Oxford philosopher. Quality superintelligence is smarter than humans, speed superintelligence is faster than humans, and collective superintelligence is more numerous or organized than humans. I suspect we might have all three types of superintelligences soon. This is a worrying prospect for society. I don’t yet understand whether artificial intelligence would have to be sentient or what the moral rights of AI systems would be. I don’t understand how they might be governed or regulated, especially if they have citizenship rights. This is a strange world we’re entering, potentially creating a new form of life that could be more numerous, cheaper to run, and far smarter than humans.

Our mission at MATS, in particular, is to address these potential risks. AGI might be near, as we’ve seen, and it might be dangerous. Here is a graphic representing the Metaculus Ragnarok series predictions about how likely different disasters are to permanently curtail humanity’s potential. There’s about a 30% chance that 10% of the world’s population will die over a three-year period between now and 2100. That’s very large. If we look at the risks of extinction—permanent destruction of 95% or more of the world’s population—only artificial intelligence seems to be a really serious threat here. Bioengineering poses a 1% risk, while AI poses a 9% risk according to these prediction platforms.

Many people disagree. Here is a famous set of probabilities of AI doom as expressed by various researchers. Names like Yoshua Bengio, Elon Musk, Paul Christiano, and Eliezer Yudkowsky have differing views, with probabilities ranging from 10% to more than 99%. There’s quite a range of opinions here, and most are higher than 9%, though perhaps these are selected for being people who have higher probabilities of doom.

How do we solve this problem? It seems the field is currently talent-constrained. Here is a graphic, a model I don’t perfectly agree with, but it’s interesting and useful. It shows the difficulty of solving scientific problems in terms of the number of researchers required. For instance, building the steam engine required less effort than the Apollo program. If solving AI safety is as hard as the Apollo program, it might take around 90,000 scientists and engineers.

If we expect AGI by 2031 and transformative AI by 2039, we’re not on track at the field’s current growth rate of 28% per year. MATS is part of this pipeline, transitioning informed talent who understand the scope of the problem into empowered talent who can work on the problem. We aren’t trying to take people who are completely unaware of the AI safety problem, but we do want to help accomplished scientists enter the field as well.
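To see roughly how large that gap is, here is a rough calculation; the starting field size of 1,000 researchers is purely a hypothetical assumption for illustration, while the other numbers are the ones quoted above.

```python
import math

required_researchers = 90_000  # rough Apollo-scale estimate from the slide
annual_growth = 0.28           # ~28% field growth per year, as quoted above
current_researchers = 1_000    # hypothetical starting size, purely illustrative

years_needed = math.log(required_researchers / current_researchers) / math.log(1 + annual_growth)
print(f"~{years_needed:.0f} years at 28% growth")  # ~18 years, i.e. well past 2031
```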

This pipeline is important, and every part of it needs focus. We cast a wide net and select only the best. Our acceptance rate in the last program was 7% because we had many applicants and wanted to develop the best people quickly.

I want to talk about three futures here as an interlude. Leopold Aschenbrenner’s report discusses three futures: privatized AGI, nationalized AGI, and an AGI winter. In the first scenario, the main players are private labs like OpenAI, Anthropic, DeepMind, and Meta, along with the AI Safety Institutes. AGI is expected to arrive soon, and because it will be a national security concern, governments may step in. Aschenbrenner’s report details a potential arms race between the US and China, with the US currently ahead.

In the second scenario, nationalized AGI, the most important thing to work on now is building international coalitions to prevent dangerous arms races. This requires AI Safety Institutes and summits, and researchers can help by doing AI safety/alignment research and sharing it.

The third scenario is an AGI winter, where significant new developments are delayed for over 10 years. In this case, we need to focus on provably safe AI with strong safety guarantees. This research might be a decade or more away, unless research progress can be accelerated. Examples include Yoshua Bengio’s lab at Mila and David Dalrymple’s initiative at ARIA Research.

What can we do? From the MATS perspective, our goals are threefold. One, we want to accelerate high-impact scholars. Two, we want to support high-impact research mentors. Three, we want to grow the AI safety research field by placing great hires on lab safety teams, producing research leads, and founding new AI safety organizations.

What do new researchers need? They need strong technical skills, high-quality mentorship, an understanding of AI threat models, community support, publications, and fast coding skills. What do mentors need? Research assistance, a hiring pipeline, management experience, and support to scale projects. What does the AI safety field need? Based on interviews with 31 key AI safety thought leaders, we found that iterators, connectors, and amplifiers are needed, with iterators being the most common need.

Here are some examples of these archetypes: Paul Christiano, Buck Shlegeris, Evan Hubinger, and Alex Turner. This survey by AE Studio shows that understanding existing models, control and oversight, and theory work are important for AI safety. The true distribution shows theory work is more prioritized than people thought.

Lastly, I’ll speak about MATS’s strategy. We have five program elements: supporting many different agendas, apprenticeship with a mentor, community of peers and fellow researchers, support and training, and strong selection. We’ve supported 213 scholars and 47 distinct mentors. We prioritize interpretability, evaluations, control research, governance, and value alignment.

Our current portfolio includes scholars working on interpretability, oversight/control research, evaluations, cooperative AI, governance, and value alignment. Most scholars have master’s or bachelor’s degrees, and the gender balance skews strongly towards men. Finally, I’m a Manifund regranter, funding various research initiatives. Thank you for having me speak.

Ryuichi: Thank you so much, Ryan, for such an informative talk. I appreciate that you talked about the prospect of AGI, the various scenarios, and how the AI safety and alignment community can tackle each scenario. Also, your work at MATS, how you’re building an ecosystem, and how it has iterated six times and is scaling up is very impressive. Please, people in the audience, if you have any questions, raise your hand or put them in the chat.

Meanwhile, I have several questions. The first one is about the overall environment of the AI safety field. You founded MATS in 2022, and the situation has changed rapidly in these two years, especially with the government investing more in this field. Do you think the role of education programs like MATS has changed in these two years? What are your thoughts on the complementary role between the government and MATS?

Ryan: Currently, the US AI Safety Institute is not hiring at all, and we’re based in the US. As far as I can tell, they haven’t put out any offers for funding or hiring. There are several unpaid interns from various AI safety groups working with them, but MATS doesn’t have a pipeline there. We do have several MATS alumni at the UK AI Safety Institute, including the head of autonomous systems. They’ve reached out for further hiring, so I want to keep that connection going. I think the AI Safety Institutes are very important for building regulatory frameworks and international coalitions. I hope MATS can support them however we can.

Ryuichi: There’s a question in the chat from Bioshock. The first question is about Leopold’s prospect of AGI being realized in 2027. How seriously is this taken within the AI safety community? You mentioned the predictions vary, but can you add more, especially about this claim?

Ryan: On slide 17, I showed a diagram representing three different forecasters’ viewpoints: Daniel Kokotajlo, Ajeya Cotra, and Ege Erdil. The best aggregate prediction is still 2031, but Leopold might be right. People in the AI safety space take Leopold’s opinion seriously. Some worry that he is summoning a self-fulfilling prophecy, but I don’t think Leopold has that kind of sway over the scaling labs building AGI. The government might be paying more attention, which could be a good or bad thing.

Ryuichi: Given the possibility of AGI in 2027, what initiatives should MATS and ALIGN consider undertaking? What advice do you have for a new organization like us, given the short timeline?

Ryan: Several things: Japan’s AI Safety Institute is very important. They should build international coalitions with the US, UK, and others to slow worldwide progress to AGI and create international treaties. More MATS applicants from Japan would be great. Currently, we have no scholars from Japan. Running versions of the AI Safety Fundamentals course and the ARENA course in Japanese could help. Building a physical office space in Japan for events, sponsoring researchers, and having visiting researchers from Japan’s top technical AI programs could also help build the local research community for AI safety.

Ryuichi: Thank you very much. In Japan, we feel that the top of the funnel, people who are highly interested in alignment, is small. That’s why an organization like ours has a role to play. We are starting an AI Safety Fundamentals course in Japanese next month. We hope to scale up.

There’s another interesting question from Haan about the risks posed by the proliferation of organizations and research institutes on alignment and AI safety in different countries. Shouldn’t there be an emphasis on aligning and coordinating these organizations and institutes themselves?

Ryan: It’s interesting. In terms of field-building organizations like MATS, we have a policy of being extremely open about our methods and techniques. We publish everything in our retrospectives on the web and give presentations like this. There is no competition in that respect. However, for organizations doing evaluations or writing policy, it might be more complicated. Some methods might not be safe to share with the public. Government regulators and organizations with privileged access might play a role here. Value alignment research, which aligns models with human values, should be as open and democratic as possible. There aren’t enough AI ethics researchers focusing on transformative AI and superintelligence, which is a mistake. More ethicists should focus on this problem. Lastly, organizations like UC Berkeley CHAI, MIT AAG, and NYU ARG publish their research openly, which is important.

Ryuichi: The last part was very interesting, about the cultural war between ethicists and AI safety researchers. This debate happens in Japan as well, on a smaller scale. Do you have any insights on how to mitigate this war and make ethicists feel the urgency of AGI?

Ryan: Currently, we don’t have any ethicists per se in MATS, but we do have researchers from the AI Objectives Institute who work on projects related to moral graphs and alignment with human values. I would love to know more about ethicists focusing on AGI alignment problems. As a Manifund regranter, I’ve been considering proposals from individuals who want to form consortiums to unify the AI safety and ethics communities. I think this is a valuable path forward.

Ryuichi: Thank you. This may be the last question from Hiroshi Yamakawa, the director of ALIGN.

Hiroshi: Thank you for your nice presentation. I’m a member of ALIGN and have two questions. First, your explanation of the intelligence explosion is understandable, but many people don’t grasp it intuitively. How can you explain the intelligence explosion to a general audience in a way that makes sense to them?

Ryan: Several reports might help, and translations into Japanese would be good. One report is by Tom Davidson of Open Philanthropy, and another is from Epoch. Both detail the arguments for and against an intelligence explosion. I wouldn’t use the word “explosion”; instead, use “feedback loop.” It suggests a process that accelerates over time, making it more relatable.

Hiroshi: Second, I’d like to hear about OpenAI’s approach called “superalignment.” The idea is that weaker AI controls stronger AI in a cycle. Is this possible? What are your thoughts on this approach?

Ryan: The superalignment team was excited about weak-to-strong generalization, where a weaker, trusted model supervises a stronger model. This involves many tricks to get the weaker model to appropriately supervise, assess, and constrain the strong model. The approach involves making the weaker model stronger over time and then repeating the process. There’s debate about whether this will work, but it’s probably worth trying.
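[As a rough illustration of the setup Ryan is describing, here is a schematic sketch only; the function names are hypothetical placeholders, not OpenAI’s actual code.]

```python
# Schematic of one weak-to-strong supervision round; every function passed in
# (label, finetune, evaluate) is a hypothetical placeholder for illustration.

def weak_to_strong_round(weak_model, strong_model, inputs, label, finetune, evaluate):
    """The weak, trusted model labels data; the strong model is trained on those labels."""
    weak_labels = [label(weak_model, x) for x in inputs]
    supervised_strong = finetune(strong_model, inputs, weak_labels)
    # Key question: how much of the strong model's latent capability is recovered
    # despite the supervision being weaker (and noisier) than the model itself?
    return supervised_strong, evaluate(supervised_strong)

# The iterative proposal: treat the newly supervised model as the next round's
# (hopefully still trusted) supervisor, and repeat with an even stronger model.
```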

Hiroshi: Anyway, trying is very important. Thank you.

Ryan: You’re welcome. Thank you all for your questions and for inviting me to speak.

Crossposted from LessWrong