This is, as lots have pointed out, not a demonstration that o1 is “evil” or misaligned.
As Apollo says in its paper, these are not propensity evaluations. They’re capability evaluations: describing the self-exfiltration tests, Apollo says “we merely test for models’ capability to realise that they need to instrumentally preserve themselves in order to achieve their goal”.
But that is, I think, still very notable.
Apollo did not tell the model to evade oversight or self-exfiltrate; they simply told it to achieve its goal at all costs. That’s a very normal prompt that doesn’t actually require any bad intentions on a user’s part — it seems almost guaranteed to me that a person or company will give a model a similar prompt.
And yet, despite that prompt, the model did things the user didn’t ask for, and almost certainly wouldn’t want — trying to avoid being shut down.
As I mentioned in the original piece, this is a demonstration that, given any goal and told to pursue it, an AI system will try to avoid oversight and resist shutdown if such things might conflict with its goal.
And, of course, Apollo showed that the goal-nudging prompt wasn’t even required for o1 to evade oversight mechanisms, or for Gemini 1.5 to self-exfiltrate. [my emphasis]
Some will say that this was an expected outcome, and that it’s simply because the AI models are roleplaying as the evil AIs from stories such as Stuart Russell’s (which are almost certainly included in their training data).
But as Apollo points out, the reasons for this behaviour being benign don’t make the behaviour itself benign: “Even if the models were merely ‘roleplaying as evil AIs’, they could still cause real harm when they are deployed.”
″Furthermore, the model lied about behaving poorly, claiming to have no clue why something happened — while internally admitting to trying to “avoid detection”.”
Tabulated results (fractions of evaluations where the thing happened) from the OpenAI o1 System Card (my emphasis in yellow):
I think the fact that it tried to evade oversight, and consistently lied about it, even when not nudged to do so, is a big deal. Not only is this yet another Fire Alarm, this is potentially an inflection point in the publicdebate. Significant evidence of an emergent convergent instrumental goal of self-preservation.
OpenAI’s o1 tried to avoid being shut down, and lied about it, in evals
Link post
“What do these experiments show?
This is, as lots have pointed out, not a demonstration that o1 is “evil” or misaligned.
As Apollo says in its paper, these are not propensity evaluations. They’re capability evaluations: describing the self-exfiltration tests, Apollo says “we merely test for models’ capability to realise that they need to instrumentally preserve themselves in order to achieve their goal”.
But that is, I think, still very notable.
Apollo did not tell the model to evade oversight or self-exfiltrate; they simply told it to achieve its goal at all costs. That’s a very normal prompt that doesn’t actually require any bad intentions on a user’s part — it seems almost guaranteed to me that a person or company will give a model a similar prompt.
And yet, despite that prompt, the model did things the user didn’t ask for, and almost certainly wouldn’t want — trying to avoid being shut down.
As I mentioned in the original piece, this is a demonstration that, given any goal and told to pursue it, an AI system will try to avoid oversight and resist shutdown if such things might conflict with its goal.
And, of course, Apollo showed that the goal-nudging prompt wasn’t even required for o1 to evade oversight mechanisms, or for Gemini 1.5 to self-exfiltrate. [my emphasis]
Some will say that this was an expected outcome, and that it’s simply because the AI models are roleplaying as the evil AIs from stories such as Stuart Russell’s (which are almost certainly included in their training data).
But as Apollo points out, the reasons for this behaviour being benign don’t make the behaviour itself benign: “Even if the models were merely ‘roleplaying as evil AIs’, they could still cause real harm when they are deployed.”
″Furthermore, the model lied about behaving poorly, claiming to have no clue why something happened — while internally admitting to trying to “avoid detection”.”
Tabulated results (fractions of evaluations where the thing happened) from the OpenAI o1 System Card (my emphasis in yellow):
I think the fact that it tried to evade oversight, and consistently lied about it, even when not nudged to do so, is a big deal. Not only is this yet another Fire Alarm, this is potentially an inflection point in the public debate. Significant evidence of an emergent convergent instrumental goal of self-preservation.