One thought is that for something you’re describing as a minimal viable takeover AI, you’re ascribing it a high degree of rationality on the “whether to wait” question.
By default I’d guess that minimal viable takeover systems don’t have very strong constraints toward rationality. So I’d expect at least a bit of a spread among possible systems: some will probably try to break out early whether or not that’s rational, and likewise some will wait even when waiting isn’t optimal.
That’s not to say that it’s not also good to ask what the rational-actor model suggests. I think it gives some predictive power here, and more for more powerful systems. I just wouldn’t want to overweight its applicability.
Hmm, my guess is that by the time a system might succeed at takeover (i.e. has more than something like a 5% chance of actually disempowering all of humanity permanently), its behavior and thinking will be quite rational. I agree that there will probably be AIs taking reckless action earlier than that, but insofar as an AI is actually posing a risk of takeover, I do expect it to behave pretty rationally overall.
I agree with “pretty rationally overall” with respect to general world modelling, but I think that some of the stuff about how it relates to its own values / future selves is a bit of a different magisterium and it wouldn’t be too surprising if (1) it hadn’t been selected for rationality/competence on this dimension, and (2) the general rationality didn’t really transfer over.