in response to our recent paper “Alignment Faking in Large Language Models”, they posted a tweet which implied that we caught the model trying to escape in the wild. I tried to correct possible misunderstandings here.
Probably would be easier for people to evaluate this if you included a link?
Oh wow, I actually think your grandparent comment here was way more misleading than their tweet was! It sounds like they quoted you almost verbatim. Yes, they took out that you set up the experiment… but of course? If I write “John attempted to kill Sally when he was drunk and angry”, and you summarise it as “John attempted to kill Sally, he’s dangerous, be careful!”, that is a totally fair summarisation. Yes, it cuts context, but that is always the case: any short summarisation does this.
Unlike your comment, they never said ‘escape into the wild’. When I read your comment I assumed they had said this.
Also, their tweet directly quotes your tweet, so users can easily look at the original source. In contrast, your comment here doesn’t link to their tweet. Before you linked to it, I assumed they had done something significantly worse.