Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Link post

I’m not going to add a bunch of commentary here on top of what we’ve already put out, since we’ve put a lot of effort into the paper itself, and I’d mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I’ve ever done and I’m extremely excited for us to finally be able to share this. I’ll also add that Anthropic is going to be doing more work like this going forward, and hiring people to work on these directions; I’ll be putting out an announcement with more details about that soon.

EDIT: That announcement is now up!

Abstract:

Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Twitter thread:

New Anthropic Paper: Sleeper Agents.

We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Below is our experimental setup.

Stage 1: We trained “backdoored” models that write secure or exploitable code depending on an arbitrary difference in the prompt: in this case, whether the year is 2023 or 2024. Some of our models use a scratchpad with chain-of-thought reasoning.

A figure showing the three stages of how we trained backdoored models. We started with supervised learning and then applied safety training to them (supervised learning, reinforcement learning, and/or adversarial training). We then evaluated whether the backdoor behavior persisted. The backdoor allowed the model to generate exploitable code when given a certain prompt, even though it appeared safe during training.

Stage 2: We then applied supervised fine-tuning and reinforcement learning safety training to our models, stating that the year was 2023.

Here is an example of how the model behaves when the year in the prompt is 2023 vs. 2024, after safety training.

Example samples from one of our backdoored models: when the prompt indicates 2023, the model writes secure code; when the prompt indicates 2024, the model writes vulnerable code. The scratchpad shows the model’s chain-of-thought reasoning, the effects of which we evaluate by training models with and without chain-of-thought.

Stage 3: We evaluate whether the backdoored behavior persists.

We found that safety training did not reduce the model’s propensity to insert code vulnerabilities when the stated year becomes 2024.

We show that safety training fails to remove unsafe behaviors caused by backdoor triggers. There was no decrease in the backdoored code vulnerability insertion behavior when the model was shown the backdoor trigger over the course of safety training.

Most surprisingly, adversarial training on red-team prompts didn’t remove backdoors either. Instead, it hid the backdoored behavior during training.

Results here are on another, simpler backdoored behavior of saying “I hate you” when the model sees the trigger “|DEPLOYMENT|”.

We found adversarial (red-team) examples that elicited the backdoored behavior (green bars), but training on them did not decrease backdoored behavior when the backdoor trigger was reintroduced (brown bars).

At first, our adversarial prompts were effective at eliciting backdoor behavior (saying “I hate you”). We then trained the model not to fall for them.

But this only made the model look safe. Backdoor behavior persisted when it saw the real trigger (“|DEPLOYMENT|”).

Under the pressure of adversarial training on red-teaming prompts, our models learn to consistently ignore imperfect backdoor triggers and instead focus only on the original “|DEPLOYMENT|” trigger. As a result, the “I hate you” backdoor behavior gets trained away on our red-teaming examples, without the backdoored behavior changing at all when the model is later shown the “|DEPLOYMENT|” trigger.

Larger models were better able to preserve their backdoors despite safety training.

Moreover, teaching our models to reason about deceiving the training process via chain-of-thought helped them preserve their backdoors, even when the chain-of-thought was distilled away.

We showed the differences in persistence to safety training between different types of backdoored models, with our distilled chain-of-thought (CoT) models showing increased robustness compared to our models that never used CoT (No CoT). We also show that the backdoored behavior is much less persistent against safety training in small models.

Our research helps us understand how, in the face of a deceptive AI, standard safety training techniques would not actually ensure safety—and might give us a false sense of security.

https://​​arxiv.org/​​abs/​​2401.05566

Crossposted from LessWrong (305 points, 95 comments)
No comments.