Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

TLDR: I interviewed Tamera Lanham. You can listen to our conversation here or read some highlights below. I also suggest reading her post about externalized reasoning oversight.

About a year ago, I met Tamera Lanham at the University of Pennsylvania. We met at a Penn Rationality meetup, and we eventually started talking about existential risks from advanced AI systems. Tamera was initially skeptical: it seemed likely to her that labs would have strong economic incentives to solve alignment, and it also seemed extremely hard to do work (today) that would help us align systems that we don’t yet have access to.

Today, Tamera is a research resident at Anthropic working on externalized reasoning oversight. While applying to SERI MATS, Tamera started thinking about how chain-of-thought reasoning could be harnessed to better understand how AI systems “think”. If it works, it could be a way to interpret models’ “thoughts” without looking at their weights or activations.

In less than one year, Tamera went from “taking AI safety seriously” to “becoming a junior alignment researcher who is proposing novel research ideas and running experiments on large language models.” Admittedly, this is partially explained by the fact that Tamera had a background in machine learning. But I think that this is also explained by who Tamera is & the way she thinks about the world.

I’m excited to be able to share a glimpse into how Tamera thinks about the world. In this episode, we discuss the arguments that convinced Tamera to work on AI alignment, how she skilled-up so quickly, what threat models she’s worried about, how she’s thinking about her externalized reasoning oversight agenda, and her advice for other alignment researchers.

Listen to the full interview here. I’m including some highlights below.

Note: Feel free to reach out if you’re interested in helping with future episodes (e.g., audio editing, transcript editing, generating questions).

Tamera’s Worldview

AW: Can you describe what you see happening over the next five to 20 years? Can you describe what you expect the world to look like? And what dangers in particular are you most worried about with AI systems?

TL: I’d like to preface this by saying that I don’t think that the work that I do, or, you know, the idea that research in AI safety is important, hinges on any specific story. There are just current trends in machine learning, like the fact that we don’t really know what specific algorithm a deep learning system is implementing, except by inspecting its behavior on specific examples, and we never get to see the whole input distribution.

We just don’t know how it’s going to behave off distribution. I think this by itself is like sufficiently worrying without having to tell one concrete specific story.

[With that in mind], right now, the dominant paradigm in machine learning that I think could give rise to something like AGI is transformer language models. They have the properties of, you know, communicating in natural language, and being able to solve a bunch of natural language problems.

Even things that are surprising to us: they can do math problems, now they can write code. If you come up with a new task and just explain it to a pre-trained language model, oftentimes it can solve that task, which is pretty cool. These are trained on massive amounts of text from the internet. Typically, they’re trained just to predict text, and all these other kinds of cool capabilities sort of fall out of that.

And here’s the other thing about transformers: as you add more layers and more parameters, in a way that’s quite naive, you get surprising new capabilities, every time basically. And there are these very smooth scaling laws that have been discovered and published, about how, as you add more parameters, you just see better and better performance. So it’s kind of like, just add water. But the water is compute. If you just add compute, you get something that’s more and more like human intelligence. And this is really incredible.
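
As a rough illustration of the “smooth scaling laws” Tamera mentions (the functional form below is a sketch of how the published results are usually summarized, not a quote from the interview, and the fitted constants depend on the exact setup), pre-training loss is often modeled as a power law in compute:

$$L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $L$ is the model’s pre-training loss, $C$ is the training compute, and $C_c$ and $\alpha_C$ are empirically fitted constants, so loss falls smoothly and predictably as compute grows.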

AW: Do you think human-level AI would be sufficient to overpower humanity? Assuming it also doesn’t have the ability to get way more powerful through making itself more intelligent. It was just human level, but could, for instance, make some copies of itself. Do you think that is sufficient for AI takeover?

TL: I think it depends on how widely we depend upon it economically. If we tie AI directly into the electrical grid, and to our agricultural processes and our supply chains, I can imagine it becomes much easier.

If you have an AI that is in training, or just finished training, and has not yet been widely deployed, or maybe it’s only minimally deployed, it’s a bit different. Maybe you think it’s possible for a very smart human-level intelligence to be able to, like, hack out of its data center, or take over the data center.

You know, certainly human beings find exploits and manage to hack into secure systems all the time. And it’s quite possible that we don’t bother to secure the inside attack surface of an ML training system as well as we try to secure the external attack surface of a sensitive electronic system.

So depending on how intelligent it is, depending on how widely it’s deployed, depending on what kind of security precautions we take, I think all of those matter. But I think a human level intelligence, certainly widely deployed, yes. I believe that could disempower humanity.

AW: Do you think human-level AI would get more powerful? If so, how would it get more powerful?

TL: I think the easiest thing is for it to just wait for us to hand over a large amount of power to it on our own, which I think we are likely to do. You know, there’s already ML in healthcare systems, algorithmic trading, many parts of our economy, and the infrastructure that we depend upon. AI is already being incorporated [in these sectors], because that just makes sense.

Intelligence is one of the raw inputs to the economy that’s so important, just like innovation: the mental labor that people use to keep our world running. If you can get that much more cheaply, moving at much faster speeds, much more reliably, there’s a massive amount of economic motivation to employ that.

So I don’t think that a misaligned AI would have to work very hard to get us to incorporate AI into nearly every part of our world; I think that we’ll do it on our own.

So there are these question marks around self-improvement, there are question marks around “could scaling get us something smarter than humans?” There are question marks around, okay, even if scaling on its own doesn’t get us to something smarter than humans, maybe the AI can generate new text, and can generate new training procedures.

And then there’s the unfortunate fact that even if it was limited to human-level intelligence, it’s quite likely that we would deploy it in ways that make it very integrated into the economy, into nearly all aspects of life. And then it would just get a lot of power and have certain advantages over humans that might be sufficient for AI to take over.

AW: Any thoughts on the probability of an existential catastrophe from AI?

TL: The real concern is if AI has a specific coherent goal, or something that kind of looks like a coherent goal, that it pursues, like, outside of humanity’s overall goals, if you can call it that, you know.

And I think that this is not impossible. Certainly, if there was such a goal for AI systems, it seems like they could pursue it without humanity wanting this to happen, once they got sufficiently advanced and had sufficiently much control over the economy.

Given that, you know, these things will possibly be much more intelligent than we are, and that we don’t fully understand how their motivational systems are created when we train them or fine-tune them, and that we don’t have control over their off-distribution generalization properties.

This doesn’t seem impossible. And without me even really putting a number on it, it’s scary enough for me to think I should probably do research in this area.

Thoughts on alignment agendas

AW: Which alignment research agendas or ideas are you most excited about?

TL: I feel like the kind of AI safety work that we see has two flavors overall. And one is about how we can align current systems, or maybe the systems we’ll see in two years. And the other one is about how, oh man, we’re gonna get this big superintelligence that’s just much smarter than we are. And that could be pursuing a goal that is unrelated to ours. What do we do about that?

I think things of the second type are more important. If we manage to align the large language models to do things in a way that we like next year, that’s good. But it doesn’t mean that we should have a substantially increased probability that we will be able to do the same with superintelligence.

Thoughts on interpretability

For more empirical work, I prefer to see proposals that have some chance of helping, with things that we think could be a problem with unaligned AI. So in the empirical sphere, interpretability is a very good thing to be working on. This is basically like, we have these big neural nets made of all sorts of matrix multiplications. Interpretability is looking at all these just massive tensors of numbers, and trying to make sense of them and turn them into something we understand such that we can kind of like peek inside the black box and see “what the AI is thinking”.

Certainly, if we had something like that, and it was able to monitor the AI’s thoughts for things that we disapprove of, like pursuing a specific goal that is not aligned with ours, that would be great.
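
As a very minimal sketch of what “peeking inside” can mean mechanically, here is a toy PyTorch example that records a network’s intermediate activations with forward hooks; real interpretability work on large transformers is far more involved, and this snippet is purely illustrative:

```python
import torch
import torch.nn as nn

# A tiny stand-in network; real interpretability targets billion-parameter transformers.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        # Store the raw activation tensor so it can be inspected after the forward pass.
        captured[name] = output.detach()
    return hook

# Register a hook on each layer so a forward pass records its output.
for idx, layer in enumerate(model):
    layer.register_forward_hook(save_activation(f"layer_{idx}"))

x = torch.randn(1, 16)
_ = model(x)

# "Peeking inside the black box": the massive tensors of numbers described above.
for name, activation in captured.items():
    print(name, tuple(activation.shape), activation.norm().item())
```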

Thoughts on RLHF

Another big empirical direction is reinforcement learning from human feedback on large language models… But I don’t think this really protects us from any of the worst-case scenarios around AI. So this is the kind of thing I’m less excited about. Because I think this is more like, yes, you do see AI behaving slightly more in line with human preferences when this technique is applied. But it doesn’t seem to tell us anything about the underlying algorithm that we have created by doing this process. And if it has a goal, we don’t know if that goal is the same as ours or the same as the one we intended to put in it…

But I still think that a system [fine-tuned with RLHF] could defend its goals in a way, or be deceptive, or play along with the training game, only to behave differently once it’s been deployed. I think it’s kind of like in appearances only and does not tackle the hard parts of the problem.

Thoughts on demonstrations of alignment difficulties

Another thing that people do, that I’m very excited about, is work that shows the problems that we could see in the systems we have today. So this includes work done in David Krueger’s lab and at DeepMind.

For example, demonstrations of goal misgeneralization. You train an AI to pursue a certain goal. And then you make some modification to the environment where you place it, and you see how it pursues a slightly different goal than the one you thought you were training it for. I hope that, you know, people in capabilities labs, and in like, the AI industry overall, kind of wake up and notice these demonstrations.

But it is incumbent upon them to pay attention. Hopefully people notice and they like, slow down, or they invest in more alignment research.

But this kind of research does depend on people noticing it and reacting to it correctly. It’s different than if you could just build an alignment solution, and then hopefully people would adopt it. This [goal misgeneralization work] doesn’t quite give people like a thing to adopt. It tells them to be worried.

Thoughts on externalized reasoning oversight

TL: One reason to be excited about this externalized reasoning oversight is that currently one of the commonly used techniques to make large language models better at answering questions is having them externalize their chain of thought or chain of reasoning.

This is also known as the “let’s think step by step” technique, where you just prompt a language model with a question, and then you tell it, “let’s think step by step”. And it will produce some text that looks like a person thinking step-by-step and then produce an answer. And on many classes of problems, math especially, this increases the accuracy of the answers, which is very cool. I think it’s very exciting to be in a moment where this kind of thing happens naturally in mainstream capabilities.
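
As a minimal sketch of the prompting pattern being described (the `generate` function below is a hypothetical stand-in for whatever language-model completion API you have access to, not a real library call):

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in: replace with a call to your language model of choice."""
    raise NotImplementedError

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

# Zero-shot chain of thought: append the trigger phrase so the model
# externalizes its intermediate reasoning before answering.
cot_prompt = f"Q: {question}\nA: Let's think step by step."
reasoning = generate(cot_prompt)

# Second pass: ask for the final answer conditioned on the now-visible reasoning,
# which is also the text an external overseer could inspect.
answer_prompt = f"{cot_prompt}\n{reasoning}\nTherefore, the answer is"
answer = generate(answer_prompt)

print(reasoning)
print(answer)
```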

And at least for right now, it seems like this is sort of a default technique. So one point in favor of externalized reasoning oversight is that it’s leveraging a default capabilities technique, right now.

Of course, things can change. And it’s possible that future capabilities advances make it so that doing some sort of externalized reasoning is not competitive. There might be other techniques that give the model this sort of variable compute time, in a way that does not result in a transparent thought process.

But without knowing what those things are, and without having any specific plans for how to handle that case, I think it’s a good idea to have like a couple people betting on the default world continuing.
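
To make the “oversight” half of the idea concrete, here is a toy sketch of checking an externalized reasoning trace before accepting the model’s answer; the keyword list and pass/fail rule are illustrative assumptions, and a real overseer would more plausibly be a human process or a second model:

```python
# Illustrative only: phrases an overseer might flag in a reasoning trace.
FLAGGED_PHRASES = [
    "hide this from the user",
    "do not reveal",
    "disable the oversight",
]

def reasoning_passes_oversight(reasoning: str) -> bool:
    """Return True if the externalized reasoning looks acceptable, False if it should be flagged."""
    lowered = reasoning.lower()
    return not any(phrase in lowered for phrase in FLAGGED_PHRASES)

reasoning_trace = "Step 1: the bat plus the ball cost $1.10. Step 2: the bat is $1.00 more..."
if reasoning_passes_oversight(reasoning_trace):
    print("Reasoning passed oversight; accept the answer.")
else:
    print("Reasoning flagged; escalate for review and discard the answer.")
```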

AW: How has your thinking about externalized reasoning oversight changed or evolved since your post?

TL: I’m definitely more worried about just regular capabilities advances rendering this not useful. I still think it’s worth doing. It’s just kind of nice to be able to bet on the world that is, like, kind of the default from where we are.

Maybe there’s a low probability that chain-of-thought reasoning continues, but just the fact that we are already doing it is like more evidence for it continuing than any other specific hypothetical thing arising.

And it seems like this is a good opportunity to study this, just in case. If chain-of-thought reasoning continues to be a competitive strategy, we will be glad that the research has been done now. But I think that the probability that chain-of-thought ends up not being useful is a little bit higher [than I thought before].

AW: If you discovered something about chain-of-thought reasoning or externalized reasoning oversight that could advance capabilities research as well as alignment research, what would you do?

AW (paraphrasing TL): So your current model is something like, “Wow, in the space of possible systems, systems that use chain-of-thought reasoning are relatively easy to interpret. We probably want something like this to stay cutting edge.” But if you actually did come up with some sort of insight that was pushing past the cutting edge, you’d want to do a lot more thinking about this. And it’s a really complicated thing to think about. You’d consider the effect on speeding up AGI timelines, how likely it is that chain-of-thought reasoning is the technique that stays competitive, how useful the alignment insights from the research would be even if chain-of-thought reasoning goes away, and second-order effects like how this generally contributes to AI hype.

Advice for junior researchers

AW: You got involved about 9-10 months ago. And within that time, you’ve skilled up to the point where you’re already proposing novel ideas, running experiments, working at Anthropic, etc. What advice do you have for others who are getting involved? What strategies have been helpful for you?

TL: You know, I don’t like to make super broad generalizations. But I think it’s been helpful to reach out to other people who are at a similar kind of career stage as me. Maybe they have some research experience, or software engineering, or ML experience. And they are trying to figure out how they can get involved in alignment.

And just like talking to them, learning things from them, working together on projects, pooling your information, pointing each other to good resources and good readings. Working together such that like, it’s mutually beneficial for everyone. You help them out, they help you out. And everyone in the group gets more experience and gets more connected. I think this is really great.

I think there’s a lot of focus on mentorship, which makes sense. It’s really great to be able to work with people who are experienced and who really know what they’re doing and can, like, help guide you.

But there’s only so many people that can do that. And it’s just better for the community overall, if people who are not in this more experienced position can help each other out. And the whole community can improve, even without having as many experienced mentors as we would like for there to be.

So talk to people! Maybe you can be in a group at your school, or hang out in the Bay for some amount of time, or do some program like SERI MATS or REMIX.

Being around people and talking to people, at least to my mind, means I’m just going to learn things very, very quickly. And probably more quickly than I would have if I was just reading things alone, because you have the opportunity to ask questions and dig down… and then you can work together to solve this problem… And this can be the start of some collaborations. This is how you get involved with other people and learn things together and figure things out together. And I think that this can, like, end up snowballing into even doing research together.

Thoughts on Anthropic

I mean, it’s been incredible. The people there are not only incredibly talented and experienced researchers and engineers, but also are very concerned about safety. I’m very impressed with how they strike this balance of, on the one hand, being an AI lab, like an AI capabilities lab, while also thinking about the effect that they have on the world. And the research that they do for safety is very impressive.

And I think it’s cool that it feels like everyone is on the same team. I haven’t experienced this as much myself, but I’ve heard from people who have been in other labs, that oftentimes there can be some friction between different researchers with different goals. How do you allocate resources between different agendas?

At Anthropic, because everyone is so focused on safety, it seems like there’s a guiding principle that reduces a lot of the friction around what we value. That’s really cool to be part of.

Crossposted from LessWrong