Hi Ben—this episode really gave me a lot to think about! Of the ‘three classic arguments’ for AI X-risk you identify, I argued in a previous post that the ‘discontinuity premise’ comes from taking too literally a high-level argument that should only be used to establish that sufficiently capable AI will produce very fast progress, and assuming that the ‘fast progress’ has to happen suddenly and within a specific AI system.
Your discussion of the other two arguments led me to conclude that the same sort of mistake is at work in all of them, as I explain here—each is (I think) a case of ‘directly applying a (correct) abstract argument (incorrectly) to the real world’. So we shouldn’t say that the classic arguments are wrong, just overextended/incorrectly applied, as I argue here.
If rapid capability gain, the orthogonality thesis, and instrumental convergence are good reasons to think AI might pose an existential risk, but were just interpreted too literally, and if it’s also true that the ‘new’ arguments make use of these old arguments along with further premises and evidence, then that should raise our confidence that some of the basic issues were correctly identified back in the 2000s. You suggest something like this in the podcast episode, but the discussion never got far into exactly what the underlying intuitions might be:
Ben Garfinkel: And so I think if you find yourself in a position like that, with regard to mathematical proof, it is reasonable to be like, “Well, okay. So like this exact argument isn’t necessarily getting the job done when it’s taken at face value”. But maybe I still see some of the intuitions behind the proof. Maybe I still think that, “Oh okay, you can actually like remove this assumption”. Maybe you actually don’t need it. Maybe we can swap this one out with another one. Maybe this gap can actually be filled in.
Do you think there actually is an ‘intuitive core’ to the old arguments that is correct?
Hi Sammy, thanks for the links—both very interesting! (I actually hadn’t read your post before.)
I’ve tended to think of the intuitive core as something like: “If we create AI systems that are, broadly, more powerful than we are, and their goals diverge from ours, this would be bad—because we couldn’t stop them from doing things we don’t want. And it might be hard to ensure, as we’re developing increasingly sophisticated AI systems, that there aren’t actually subtle but extremely important divergences in some of these systems’ goals.”
At least in my mind, both the classic arguments and the arguments in “What Failure Looks Like” share this common core. Mostly, the challenge is to explain why it would be hard to ensure that there wouldn’t be subtle-but-extremely-important divergences; there are different possible ways of doing this. For example: Although an expectation of discontinuous (or at least very fast) progress is a key part of the classic arguments, I don’t consider it part of the intuitive core; the “What Failure Looks Like” picture doesn’t necessarily rely on it.
I’m not sure if there’s actually a good way to take the core intuition and turn it into a more rigorous/detailed/compelling argument that really works. But I do feel that there’s something to the intuition; I’ll probably still feel like there’s something to the intuition, even if I end up feeling like the newer arguments have major issues too.
[[Edit: An alternative intuitive core, which I sort of gesture at in the interview, would simply be: “AI safety and alignment issues exist today. In the future, we’ll have crazy powerful AI systems with crazy important responsibilities. At least the potential badness of safety and alignment failures should scale up with these systems’ power and responsibility. Maybe it’ll actually be very hard to ensure that we avoid the worst-case failures.”]]
Hi Ben, thanks for the reply! I think the intuitive core that I was arguing for is more or less just a more detailed version of what you say here:
“If we create AI systems that are, broadly, more powerful than we are, and their goals diverge from ours, this would be bad—because we couldn’t stop them from doing things we don’t want. And it might be hard to ensure, as we’re developing increasingly sophisticated AI systems, that there aren’t actually subtle but extremely important divergences in some of these systems’ goals.”
The key difference is that I don’t think the orthogonality thesis, instrumental convergence, or progress eventually being fast are wrong—you just need extra assumptions in addition to them to get to the expectation that AI will cause a catastrophe.
My point in this comment (and follow-up) was that the Orthogonality Thesis, Instrumental Convergence, and eventual fast progress are essential to any argument for AI risk, even if you also need other assumptions alongside them: you need to know that the Orthogonality Thesis will apply to your method of developing AI, and you need more specific reasons to think the particular goals of your system look like those that lead to instrumental convergence.
If you approached the classic arguments with that framing, then perhaps it begins to look less like a matter of them being mistaken and more like a case of a vague philosophical picture that then got filled in with more detailed considerations—that’s how I see the development over the last 10 years.
The only mistake was in taking the vague initial picture for the whole argument—and that was a mistake, but it’s not the same kind of mistake as starting from completely false assumptions. You might compare it to the early development of a new scientific field. Seeing it that way might lead you to a different view about how much to update against trusting complicated conceptual arguments about AI risk!
“AI safety and alignment issues exist today. In the future, we’ll have crazy powerful AI systems with crazy important responsibilities. At least the potential badness of safety and alignment failures should scale up with these systems’ power and responsibility. Maybe it’ll actually be very hard to ensure that we avoid the worst-case failures.”
This is how Stuart Russell likes to talk about the issue, and I have a go at explaining that line of thinking here.
The key difference is that I don’t think the orthogonality thesis, instrumental convergence, or progress eventually being fast are wrong—you just need extra assumptions in addition to them to get to the expectation that AI will cause a catastrophe.
Quick belated follow-up: I just wanted to clarify that I also don’t think that the orthogonality thesis or instrumental convergence thesis are incorrect, as they’re traditionally formulated. I just think they’re not nearly sufficient to establish a high level of risk, even though, historically, many presentations of AI risk seemed to treat them as nearly sufficient. Insofar as there’s a mistake here, the mistake concerns the way conclusions have been drawn from these theses; I don’t think the mistake is in the theses themselves. (I may not stress this enough in the interview/slides.)
On the other hand, the claim that progress/growth will eventually become much faster might be wrong (this is an open question in economics). The ‘classic arguments’ also don’t just predict that growth/progress will become much faster. In the FOOM debate, for example, both Yudkowsky and Hanson start from the position that growth will become much faster; their disagreement is about how sudden, extreme, and localized the increase will be. If growth is actually unlikely to increase in a sudden, extreme, and localized fashion, then this would be a case of the classic arguments containing a “mistaken” (not just insufficient) premise.