David Krueger is an assistant professor at the University of Cambridge and received his PhD from Mila. His research group focuses on aligning deep learning systems, but he is also interested in governance and global coordination. He does not have an AI alignment research agenda per se; instead, he tries to enable his seven PhD students to drive their own research.
I think this interview gives some interesting pointers on how to communicate AI alignment to machine learning researchers, how to fund AI alignment research, and how AI alignment research is perceived in academia.
Below are some highlighted quotes from our conversation (available on YouTube, Spotify, Google Podcasts, Apple Podcasts). For the full context of each quote, see the accompanying transcript.
Building A Research Team, Not Following An Agenda
“I think agenda is a very grandiose term to me. It’s oftentimes, I think people who are at my level of seniority or even more senior in machine learning would say, “oh, I’m pursuing a few research directions.” And they wouldn’t say, “I have this big agenda.” And so I think my philosophy or mentality, I should say, when I set up this group and started hiring people was like, let’s get talented people. Let’s get people who understand and care about the problem. Let’s get people who understand machine learning. Let’s put them all together and just see what happens and try and find people who I want to work with, who I think are going to be nice people to have in the group who have good personalities, pro-social, who seem to really understand and care and all that stuff.” (full context)
On Coordination Between Academia And The Broader World
“There’s a lack of understanding and appreciation of the perspective of people in machine learning within the existential safety community and vice versa. And I think that’s really important to address, especially because I’m pretty pessimistic about the technical approaches. I don’t think alignment is a problem that can be solved. I think we can do better and better. But to have it be existentially safe, the bar seems really, really high and I don’t think we’re going to get there. So we’re going to need to have some ability to coordinate and say let’s not pursue this development path or let’s not deploy these kinds of systems right now. And for that, I think to have a high level of coordination around that, we’re going to need to have a lot of people on board with that in academia and in the broader world. So I don’t think this is a problem that we can solve just with the die hard people who are already out there convinced of it and trying to do it.” (full context)
Most Of The Risk Comes From Safety-Performance Trade-Offs In Development And Deployment
“A lot of people are worried about us under-investing in research and that’s where the safety-performance trade-offs are most salient for them. But I’m worried about the development and deployment process. I think where most of the risk actually comes from is from safety-performance trade-offs in the development and the deployment process. For whatever level of research we have developed on alignment and safety, I think it’s not going to be the case that those trade-offs just go away.” (full context)
We Should Test Our Intuitions About Future AI Systems
“This is something that’s a really interesting research question and is really important for safety because people have very different intuitions about this. Some people have these stories where just through this carefully controlled text interaction, maybe we just ask this thing one yes or no question a day and that’s it. And that’s the only interaction it has with the world. But it’s going to look at the floating point errors on the hardware it’s running on. And it’s somehow going to become aware of that. And from that it’s going to reverse engineer the entire outside world and figure out some plan to trick everybody and get out. And this is the thing that people talk about on LessWrong classically.
We don’t know how smart the superintelligence is going to be, so let’s just assume it’s arbitrarily smart, basically. And obviously, a lot of people take issue with that. It’s not clear how representative that is of anybody’s actual beliefs but there are definitely people who have beliefs more towards that end where they think that AI systems are going to be able to understand a lot about the world, even from very limited information and maybe in very limited modality. My intuition is not that way. The important thing is to test the intuitions and actually try and figure out at what point can your AI system reverse engineer the world or at least reverse engineer a distribution of worlds or a set of worlds that includes the real world based on this really limited kind of data interaction.” (full context)
Holden makes a similar point in “Nearcast-based ‘deployment problem’ analysis” (2022):