There’s a strong possibility, even in a soft takeoff, that an unaligned AI would not act in an alarming way until after it achieves a decisive strategic advantage.
That’s assuming that the AI is confident that it will achieve a DSA eventually, and that no competitors will do so first. (In a soft takeoff it seems likely that there will be many AIs, thus many potential competitors.) The worse the AI thinks its chances are of eventually achieving a DSA first, the more rational it becomes for it to risk non-cooperative action at the point when it thinks it has the best chances of success—even if those chances were low. That might help reveal unaligned AIs during a soft takeoff.
Interestingly this suggests that the more AIs there are, the easier it might be to detect unaligned AIs (since every additional competitor decreases any given AI’s odds of getting a DSA first), and it suggests some unintuitive containment strategies such as explicitly explaining to the AI when it would be rational for it to go uncooperative if it was unaligned, to increase the odds of unaligned AIs really risking hostile action early on and being discovered...
Or it could just assume the AI has an unbounded utility function (or bounded very highly). An AI could guess it only has a 1 in 1/B chance of reaching DSA, but that the payoff from reaching this is 100B higher than defecting early. Since there are 100B stars in the galaxy, it seems likely that in a multipolar situation with decent diversity of AIs, some would fulfill this criteria and decide to gamble.
That’s assuming that the AI is confident that it will achieve a DSA eventually, and that no competitors will do so first. (In a soft takeoff it seems likely that there will be many AIs, thus many potential competitors.) The worse the AI thinks its chances are of eventually achieving a DSA first, the more rational it becomes for it to risk non-cooperative action at the point when it thinks it has the best chances of success—even if those chances were low. That might help reveal unaligned AIs during a soft takeoff.
Interestingly this suggests that the more AIs there are, the easier it might be to detect unaligned AIs (since every additional competitor decreases any given AI’s odds of getting a DSA first), and it suggests some unintuitive containment strategies such as explicitly explaining to the AI when it would be rational for it to go uncooperative if it was unaligned, to increase the odds of unaligned AIs really risking hostile action early on and being discovered...
Or it could just assume the AI has an unbounded utility function (or bounded very highly). An AI could guess it only has a 1 in 1/B chance of reaching DSA, but that the payoff from reaching this is 100B higher than defecting early. Since there are 100B stars in the galaxy, it seems likely that in a multipolar situation with decent diversity of AIs, some would fulfill this criteria and decide to gamble.