Note: The below is all speculative—I’m much more interested in pushing back against your seeming confidence in your model than saying I’m confident in the opposite. In fact, I think there are ways to avoid many of the failure modes, which safety researchers are pioneering now—I just don’t think we should be at all confident they work, and should be near-certain they won’t happen by default.
That said, I don’t agree that it’s obvious that the two thresholds you mention are far apart, on the relevant scale—though how exactly to construct the relevant scale is unclear. And even if they are far apart, there are reasons to worry.
The first point, that the window is likely narrow, is because near-human capability has been a very narrow band in many or most of the domains where ML has succeeded. For example, moving from “beats some good Go players” to “unambiguously better than the best living players” took only a few months.
The second point is that I think the jump from “around human competence” to “smarter than most / all humans” is plausibly closely related both to how much power we will end up giving systems and (partly as a consequence) to how likely they are to end up actually attempting an attack in some non-trivial way. This is based on my intuitive understanding of why very few humans attempt anything that would get them jailed: even psychopaths who don’t actually care about the harm being caused wait until they are likely to get away with it. Lastly, and relatedly, once humans reach a certain educational level, you don’t need to explicitly train them to reason in specific domains—they find books and build cross-domain knowledge on their own. I don’t see a clear reason to expect AGI to work differently once it is, in fact, generally capable at the level of smarter-than-almost-all-humans. And whether that gap is narrow or wide, and whether crossing it takes minutes or a decade, the critical concern is that we might not see misalignment of the most worrying kinds until we are already on the far side of the gap.
I think the OP’s argument depends on the idea that “Nobody is going to debug the geopolitical abilities of an AI designed to build paperclips. So the fact that debugging occurs in one domain is no guarantee of success in any other.” If AIs have human-level or better capacities in the domains relevant to forming an initial plan to take over the world and beginning that plan, but subhuman capacities or bugs in the later stages of the plan, then, assuming at least human-level capacity in those later domains is needed to succeed, the gap between the thresholds could be pretty large: AIs could keep getting smarter in the domains relevant to the initial stages, which are presumably closer to the distributions they were trained on (e.g. social manipulation / producing text to escape a box), while failing to make as much progress in the more OOD domains.
Part of my second point is that smart people figure out for themselves what they need to know in new domains, and on my definition of “general intelligence” there is little reason to think an AGI will be different. The analogies to ANI with domain-specific knowledge that doesn’t generalize well seem to ignore this—though I agree it’s a reason to be slightly less worried that ANI systems could scale in ways that pose risks without developing generalized intelligence first.
I mostly agree with you that if we get AGI rather than ANI, the AGI will be able to learn the skills relevant to taking over the world. However, I think that due to inductive biases and quasi-innate intuitions, different generally intelligent systems vary in how well they can learn different domains. For example, it is very difficult for autistic people (particularly severely autistic people) to learn social skills, and high-quality philosophical thinking seems to be basically impossible for most humans. Applying this to AGI, it might be very hard for an AGI to learn long-term planning or social skills.