One of my main frustrations/criticisms with a lot of current technical AI safety work is that I’m not convinced it will generalize to the critical issues we’ll have at our first AI catastrophes ($1T+ damage).
From what I can tell, most technical AI safety work is focused on studying previous and current LLMs. Much of this work is very particular to specific problems and limitations these LLMs have.
I’m worried that the future decisive systems won’t look like “single LLMs, similar to 2024 LLMs.” In part, I think it’s very likely that these systems will be made up of combinations of many LLMs and other software. If you have a clever multi-level system, you get a lot of opportunities to fix problems in the specific parts. For example, you can have control systems monitoring LLMs that you don’t trust, and you can use redundancy and checking to investigate outputs you’re just not sure about. (This isn’t to say that these composite systems won’t have problems; just that the problems will look different from those of the specific LLMs.)
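To make the composite-system picture a bit more concrete, here is a minimal Python sketch of the kind of layering I have in mind. All of the names here (CompositeSystem, monitor, trusted_fallback, the threshold and sample count) are hypothetical and purely illustrative, not anyone’s actual design; the point is just that monitoring, redundancy, and fallback sit a level above any individual LLM.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-ins for real model APIs; both type aliases are assumptions for illustration.
LLM = Callable[[str], str]             # prompt -> answer
Monitor = Callable[[str, str], float]  # (prompt, answer) -> suspicion score in [0, 1]


@dataclass
class CompositeSystem:
    """Hypothetical wrapper that never trusts a single call to the untrusted model."""
    untrusted_llm: LLM         # capable but not fully trusted
    trusted_fallback: LLM      # weaker model (or human process) we trust more
    monitor: Monitor           # e.g. a smaller trusted model scoring each output
    suspicion_threshold: float = 0.5
    n_samples: int = 3         # redundancy: sample several answers and compare

    def answer(self, prompt: str) -> str:
        # Redundancy: get several independent answers from the untrusted model.
        candidates: List[str] = [self.untrusted_llm(prompt) for _ in range(self.n_samples)]

        # Checking: keep only candidates the monitor is not suspicious of.
        safe = [c for c in candidates
                if self.monitor(prompt, c) < self.suspicion_threshold]

        # Agreement: accept an answer only if the non-suspicious candidates all agree.
        if safe and len(set(safe)) == 1:
            return safe[0]

        # Control: otherwise fall back to the trusted (weaker) path.
        return self.trusted_fallback(prompt)
```

In a system like this, the interesting failure modes are about how the parts interact (say, a monitor that gets systematically fooled), which is the sense in which the problems look different from those of any single LLM.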
Here’s an analogy: imagine that researchers had 1960s transistors but not computers, and tried to work on cybersecurity in preparation for the cyber-disasters of the coming decades. They want to be “empirical” about it, so they go about investigating all the failure modes of 1960s transistors. They successfully demonstrate that transistors fail in extreme environments, and also that some physical attacks can be carried out at the transistor level.
But as we now know, almost all of this has been solved either at the transistor level or at the levels just above the transistors, which do simple error management. Intentional attacks at the transistor level are possible, but they’re incredibly niche compared to the rest of the cybersecurity landscape.
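As a toy illustration of what “error management a level above” can look like, here is triple modular redundancy with majority voting over an unreliable gate. This is purely illustrative and not how any particular chip actually handles faults.

```python
import random

def flaky_and(a: int, b: int, error_rate: float = 0.05) -> int:
    """An AND gate that occasionally returns the wrong bit (a toy transistor-level fault)."""
    result = a & b
    if random.random() < error_rate:
        result ^= 1  # bit flip stands in for a hardware failure
    return result

def redundant_and(a: int, b: int) -> int:
    """Run the unreliable gate three times and take a majority vote."""
    votes = [flaky_and(a, b) for _ in range(3)]
    return 1 if sum(votes) >= 2 else 0

# Assuming independent faults, the voted result is wrong only when at least two
# of the three runs fail: roughly 0.7% of the time at a 5% per-gate error rate,
# so most faults never reach the layers above.
```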
So just as understanding 1960s transistors would not get you far in helping with future cybersecurity challenges, it’s possible that understanding the details of 2024 LLMs won’t help with the composite AI system disasters of 2030.
(John Wentworth and others refer to much of this as the Streetlight effect. I think that specific post is too harsh, but I sympathize with the main frustration.)
All that said, here are some reasons to still do the LLM research anyway. Some don’t feel great, but they might still make the work worthwhile.
- There’s arguably not much else we can do now.
- While we wait to see how things shape up, this is the most accessible technical work available to us.
- Having a research base skilled in empirical work on existing LLMs will be useful later on, as we could refocus it on whatever comes about in the future.
- There’s some decent chance that future AI disasters will come from systems that look a lot like modern “LLM-only” systems. Perhaps these disasters will happen in the next few years, or perhaps AI development will follow a very specific path.
- This research builds skills that are generally useful later, either for working in AI companies to help them do things safely, or for making a lot of money.
- It’s good to have empirical work, because it will raise the respect/profile of this sort of thinking within the ML community.
I’m not saying I could do better. This is one reason why I’m not exactly working on technical AI safety. I have been interested in strategy in this area (which feels more tractable to me), and have been keeping an eye out for opportunities for technical work, but am still fairly unsure of what’s best at this point.
I think the main challenge is that it’s just fundamentally hard to prepare for a one-time event with few warning shots (i.e. the main situation we’re worried about), several years in the future, in a fast-moving technical space. This felt clearly true 10 years ago, before there were language models that seemed close to TAI. It has become easier since then to overlook this bottleneck, as there’s clearly a lot of work we can do with LLMs that naively seems interesting. But that doesn’t mean it’s no longer true; it might still very much be the case that things are so early that useful empirical technical safety work is very difficult to do.
(Note: my timelines for TAI are 5+ years out. If your timelines are shorter, it makes more sense to expect that understanding current LLMs will help.)
A large reason to focus on opaque components of larger systems is that difficult-to-handle and existentially risky misalignment concerns are most likely to occur within opaque components rather than emerge from human-built software.
I don’t see any plausible x-risk threat models that emerge directly from AI software written by humans? (I can see some threat models due to AIs building other AIs by hand such that the resulting system is extremely opaque and might take over.)
In the comment you say “LLMs”, but I’d note that a substantial fraction of this research probably generalizes fine to arbitrary DNNs trained with something like SGD. More generally, various approaches that work for DNNs trained with SGD plausibly generalize to other machine learning approaches.
> A large reason to focus on opaque components of larger systems is that difficult-to-handle and existentially risky misalignment concerns are most likely to occur within opaque components rather than emerge from human-built software.
Yep, this sounds positive to me. I imagine it’s difficult to do this well, but to the extent it can be done, I expect such work to generalize more than a lot of LLM-specific work.
> I don’t see any plausible x-risk threat models that emerge directly from AI software written by humans?
I don’t feel like that’s my disagreement. I’m expecting humans to create either [dangerous system that’s basically one black-box LLM] or [something very different that’s also dangerous, like a complex composite system]. I expect AIs can also make either system.
Also posted here, where it got some good comments: https://www.facebook.com/ozzie.gooen/posts/pfbid037YTCErx7T7BZrkYHDQvfmV3bBAL1mFzUMBv1hstzky8dkGpr17CVYpBVsAyQwvSkl