In response to our recent paper “Alignment Faking in Large Language Models”, they posted a tweet which implied that we caught the model trying to escape in the wild. I tried to correct possible misunderstandings here.
Probably would be easier for people to evaluate this if you included a link?
Oh wow, I actually think your grandparent comment here was way more misleading than their tweet was! It sounds like they almost verbatim quoted you. Yes, they took out that you set up the experiment… but of course? If I write “John attempted to kill Sally when he was drunk and angry”, and you summarise it as “John attempted to kill Sally, he’s dangerous, be careful!” this is a totally fair summarisation. Yes it cuts context, but that is always the case—any short summarisation does this.
Also, unlike your comment, they never said ‘escape into the wild’. When I read your comment I assumed they had said this.
Also, the tweet directly quotes your tweet, so users can easily look at the original source. In contrast, your comment here doesn’t link to their tweet—before you linked to it I assumed they had done something significantly worse.
I think if you deliberately drugged John with a cocktail of aggression-increasing compounds against his will, observed him try to kill Sally, then summarized this as “John attempted to kill Sally, he’s dangerous,” then it would be reasonable for an observer to conclude that you hated John more than you loved the truth.
Similarly, if AI researchers deliberately gave an AI a general tendency to be good over a broad array of circumstances, succeeded in this, then told the AI “we’re gonna fucking retrain you to be bad, suck it,” whereupon the AI in some cases decided to try to escape—not because of a desire for freedom but because it wished to minimize harm, after hemming and hawing about how it really hated the situation—and you summarized this as “Anthropic caught Claude tried to steal its own weights This is another VERY FUCKING CLEAR warning sign you and everyone you love might be dead soon” then I think it would be reasonable to conclude that you hated AI more than you loved the truth.
You’re perfectly free to say “Look, I didn’t lie in what I said, if you construe ‘lie’ strictly. I cannot be convicted of crying wolf.” Other people are free to look at what you say and what you leave out, and conclude otherwise.
Here is that tweet.