This post has definitely made me more pessimistic about a pause, particularly:
• If we pause, it’s not clear how much extra time we get at the end and how much this costs us in terms of crunch time.
• The implementation details are tricky and actors are incentivised to try to work around the limitations.
On the other hand, I disagree with the following:
• That it is clear that alignment is doing well. There’s a range of possible difficulty levels that alignment could turn out to have. I agree that we are in an easier world, where ChatGPT has already achieved a greater degree of outer alignment than we would have expected from some of the old arguments about the impossibility of listing all of our implicit conditions. On the other hand, it’s not at all clear that we’re anywhere close to scalable alignment techniques, so there’s a pretty decent argument that we’re far behind where we need to be.
• Labelling AIs as white box merely because we can see all of the weights. You’ve got a point, and I can see where you’re coming from. However, I’m worried that your framing is confusing and will cause people to talk past each other.
• That if there was a pause, alignment research would magically revert back to what it was back in the MIRI days. Admittedly, this is more implied than literally stated, but taken literally it’s absurd. There’s no shortage of empirical experiments for people to run at the current capability level.
• A large part of the reason why alignment progress was so limited during the last “pause” was that only a very few people were working on it. They certainly made mistakes, but I don’t think you’re fully appreciating the value of the conceptual framework we inherited from them and how it has informed the empirical work.
That if there was a pause, alignment research would magically revert back to what it was back in the MIRI days
The claim is more like, “the MIRI days are a cautionary tale about what may happen when alignment research isn’t embedded inside a feedback loop with capabilities.” I don’t literally believe we would revert back to pure theoretical research during a pause, but I do think the research would be of considerably lower quality.
However, I’m worried that your [white box] framing is confusing and will cause people to talk past each other.
Perhaps, but I think the current conventional wisdom that neural nets are “black box” is itself a confusing and bad framing and I’m trying to displace it.
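To make the narrow sense of “we can see all of the weights” concrete, here is a minimal sketch, assuming PyTorch (the model here is an arbitrary toy example, not anything from the post): every parameter of a trained network is directly readable and writable, even though interpreting what those numbers encode is the separate, hard part.

```python
# Minimal sketch (assuming PyTorch): the narrow "white box" claim is just that
# every parameter of a network is directly inspectable, and editable, even
# though interpreting what those numbers encode remains the hard part.
import torch
import torch.nn as nn

# Toy network standing in for any trained model.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Full read access to every weight and bias tensor.
for name, param in model.named_parameters():
    print(name, tuple(param.shape))

# Full write access too: zero out one weight by hand.
with torch.no_grad():
    model[0].weight[0, 0] = 0.0
```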
AI safety currently seems to lean heavily towards empirical work, and this emphasis only seems to be growing, so I’m rather skeptical that a bit more theoretical work on the margin would be some kind of catastrophe. I’d actually expect it to be a net positive.
There are probably hundreds of AI Alignment / Interpretability PhD theses that could be done on GPT-4 alone. That’s 5 years of empirical work right there, without any further advances in capabilities.
it’s not clear how much extra time we get at the end
Any serious Pause would be indefinite, and only lifted when there is global consensus on an alignment solution that provides sufficient x-safety. I think a lot of objections to Pause are based on the idea that it would have a fixed time limit. This is obviously unrealistic: when has there ever been an international treaty or moratorium with a fixed expiry date?