David T comments on OpenAI’s o1 tried to avoid being shut down, and lied about it, in evals

David T 7 Dec 2024 16:37 UTC
4 points
2 ∶ 1
I didn’t downvote, but I’ll offer an alternative reading of the experiment discussed here, one that some people unimpressed by the arguments may agree with, especially if they’ve read the paper.
Some LLMs were given some roleplaying prompts to solve a problem intentionally constructed to make it impossible for them to behave in an “aligned” manner, where arguably “subversion” was the more ethical course of action.
In this role-playing context, the models very occasionally generated responses which solved a problem broadly of the form “you need to do y, but people who want to stop you from doing y are watching you on camera. On their camera is a switch”. On some of the occasions when they solved that initial problem and were greeted by a followup role play interrogating them about their subterfuge, they chose denial (and sometimes output chain of thought arguments for why denial was better than confession), and in a small fraction of those denials they actually succeeded in continuing that chain of thought to provide false answers [coherent or otherwise] for multiple consecutive followups.
To put it mildly, this doesn’t sound like “scheming” or converging on self-preservation as an instrumental goal. It sounds like what happens when a model is prompted that it’s role-playing an entity which must prioritise something over subsequent instructions: it occasionally finds the only valid route to achieving that priority, a route deliberately constructed by the experimenters, which in this case happens to have been one involving self-preservation intended to trigger AI doomers. That is especially so considering how many prompts they burned through to generate the scary-sounding answers and how many times the AI generated responses inconsistent with self-preservation.
The models shouldn’t be given control of important stuff because their responses are unpredictable and a bit stupid, not because they’re exhibiting signs of emergent agency.
- Greg_Colbourn ⏸️ 10 Dec 2024 17:45 UTC
  2 points
  0 ∶ 0
  Parent
  Re “role-playing”, that is moot when it’s the end result that matters—what actions the AI takes in the world. See also: Frontier AI systems have surpassed the self-replicating red line.
  - David T 11 Dec 2024 20:21 UTC
    6 points
    2 ∶ 0
    Parent
    It’s not role playing if the AI behaves completely differently except when the instruction set is specifically set up so the AI parses it as role playing, in this case with other glaringly obviously unrealistic scene-setting like sending it a config file with its own monitoring settings along with a logic puzzle involving a concealed goal. I can get a bucket of water to conduct a sneak attack on me if I put enough effort into setting it up to fall on my head, but that doesn’t mean I need to worry about being ambushed by water on a regular basis!