Open-weight AI models are unmonitorable, unguardrailed, and getting very good
Note: This post is intended as an accessible explainer for people in the policy world who keep hearing “open-weight AI” in biosecurity conversations but aren’t sure what the actual risk mechanism is! If you’re familiar with abliteration and alignment pretraining, this is probably not for you.
Inspired by some conversations I had at the Cambridge Biosecurity Hub :)
Overview/table of contents:
A. What are open-weight models?
B. Why does this matter for biosecurity?
C. What does the current state of safeguards look like?
D. Conclusion
A. What are open-weight models?
When you use ChatGPT or Claude, you’re accessing a model through an API. You send a prompt, and you see a response served through the API; the model itself lives on someone else’s servers. This is a closed-weight model.
An open-weight model is one where the developer publishes the model’s weights— the numerical parameters that determine how the model behaves— for anyone to download. Once you have the weights, you can run the model on your own hardware, modify it however you want, and use it with no centralized monitoring.
Examples of open-weight models include Meta’s Llama, DeepSeek, and Moonshot’s Kimi K2.5. These models are increasingly competitive with closed-weight frontier models— open-weight models lag behind closed-weight ones by roughly 3 months on average, and that gap is narrowing.
B. Why does this matter for biosecurity?
Closed-weight models have three properties that are useful for safety:
Monitorable. The company serving the model can see what users are asking it and flag suspicious patterns.
Guardrailed. The company can add safety classifiers, refusal training, and other safeguards that make the model decline dangerous requests.
Non-modifiable. Users can’t retrain or fine-tune the model to remove those safeguards.
Open-weight models generally lack these properties by default. Once someone downloads the weights:
No built-in or universal monitoring layer. There is no API, no logging, no telemetry (with some possible exceptions for deployments like cloud-hosted open models). If someone is using an open-weight model to help them plan a biological attack, there is typically no way to know, because the model is being run locally.
Guardrails are often removable with moderate effort, especially for skilled users. The safety training that makes models refuse dangerous requests can be stripped out through a process sometimes called “abliteration”— or you can simply download a version of the model where someone else has already done this for you. These are freely available on Hugging Face :(
The model can be specialized for malicious use cases. You can fine-tune an open-weight model on domain-specific data (e.g. virology protocols/DNA synthesis literature) to make it dramatically more capable in dangerous areas.
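The core trick behind abliteration is linear: identify a single “refusal direction” in the model’s activation space and project it out of the weights. Below is a toy NumPy sketch of just that projection step; the matrix, direction, and sizes are illustrative stand-ins, not a real model.

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of W's outputs that lies along direction r.

    In real abliteration, r is a "refusal direction" estimated from the
    difference in mean activations on harmful vs. harmless prompts; here
    it is just a given vector.
    """
    r = r / np.linalg.norm(r)           # work with a unit vector
    return W - np.outer(r, r) @ W       # W' = (I - r r^T) W

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))             # stand-in for one weight matrix
r = rng.normal(size=8)                  # stand-in "refusal direction"
W_abl = ablate_direction(W, r)

x = rng.normal(size=8)                  # any input activation
r_unit = r / np.linalg.norm(r)
# Outputs of the ablated matrix have no component along r:
print(float(r_unit @ (W_abl @ x)))      # ~0.0, up to float error
```

In practice the direction is estimated empirically and ablated across many layers at once, but the point stands: this is a cheap linear edit to the weights, not an expensive retraining run.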
C. What does the current state of safeguards look like?
Refusal training is behavioral, not epistemic
“Today’s safeguards focus on suppressing unsafe knowledge in post-training, often via refusal training. The unsafe knowledge remains in the model’s weights… it is straightforward to train open-weight models never to refuse unsafe requests.”
When companies like DeepSeek release an open-weight model, whatever dangerous knowledge the model learned from its training data is baked into the weights themselves. The “safeguards” that make the model say “I can’t help with that” are a thin behavioral layer applied on top of that knowledge after the fact. Refusal training teaches the model to decline to use what it knows, but it doesn’t remove what it knows. This means anyone with the weights can strip the refusal layer and access the underlying knowledge directly— and as the Deep Ignorance paper highlighted, this is not difficult to do.
We don’t have good alternatives yet
Casper et al.’s survey of 16 open technical challenges in open-weight model risk management states: open-weight models “can be modified arbitrarily, used without oversight, and spread irreversibly.” The paper identifies gaps across the entire lifecycle (training data, training algorithms, evaluations, deployment, and ecosystem monitoring) and concludes that safety tooling specific to open-weight models is severely under-researched.
But there are promising directions
Two recent lines of research suggest that pretraining-stage interventions (changes made before or during pretraining, rather than after the fact) may be substantially more effective than the post-training safeguards we rely on today.
1. Pretraining data filtering
The Deep Ignorance team trained 6.9B-parameter LLMs on datasets where biorisk-related content had been filtered out before training began.
Results:
Filtered models performed near random chance on biorisk knowledge evaluations while maintaining general capabilities;
Even after being fine-tuned on 300 million tokens of biorisk-related documents, the filtered models’ biorisk performance remained well below the unfiltered baseline;
The compute overhead was less than a 1% increase in training FLOPs.
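At its simplest, the filtering step is just a document-level screen applied before anything reaches the training corpus. Here is a toy keyword-blocklist sketch; the terms, threshold, and corpus are made-up placeholders, and a real pipeline (including the one in the paper) would use trained classifiers rather than keyword matching.

```python
# Toy sketch of pretraining data filtering: drop documents that match a
# domain blocklist before they enter the training corpus. All terms and
# the threshold are illustrative placeholders, not from any real filter.
BLOCKLIST = {"reverse genetics", "virulence factor", "aerosolization"}

def should_filter(doc: str, threshold: int = 1) -> bool:
    """Return True if the document matches at least `threshold` blocklisted terms."""
    text = doc.lower()
    hits = sum(term in text for term in BLOCKLIST)
    return hits >= threshold

corpus = [
    "A history of the printing press in early modern Europe.",
    "Protocol notes on reverse genetics and aerosolization of samples.",
]
filtered_corpus = [doc for doc in corpus if not should_filter(doc)]
print(len(filtered_corpus))  # 1: only the benign document survives
```

The appeal of this stage is exactly its cheapness: the screen runs once over the corpus, which is why the reported overhead is so small relative to total training compute.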
However, data filtering can’t prevent models from using dangerous knowledge provided in-context. The paper showed that combining data filtering with techniques like Circuit-Breaking helps, but that no defense they tested resisted attacks combining fine-tuning and in-context retrieval. So, there’s definitely more research needed here.
“[W]hile filtered models lack internalized dangerous knowledge, we find that they can still leverage such information when it is provided in context (e.g., via search tool augmentation), demonstrating a need for a defense-in-depth approach.”
2. Alignment pretraining
Rather than filtering out dangerous content, this line of research focuses on shaping the model’s alignment behavior by modifying what it learns about AI itself during pretraining.
Training data naturally contains discussion of AI (science fiction, safety research, news coverage of AI incidents, etc.). The Alignment Pretraining team found that upsampling positive AI discourse during pretraining reduced misalignment scores from 45% to 9%, and these effects persisted through standard post-training. Notably, inserting alignment data during only the final 10% of training captured most of the benefit, meaning the technique could be applied to existing model checkpoints without full retraining.
“We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training.”
These results are interesting because they suggest alignment can be embedded into open-weight models at a level deeper than post-training— potentially making it harder to remove. Not impossible, but harder. And “harder” matters when the current baseline is “trivial.”
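The “final 10% of training” intervention amounts to changing the data-sampling mix near the end of the run. A toy sketch of such a schedule follows; the 50% late-stage mixing ratio and the pool names are illustrative assumptions, not the paper’s numbers.

```python
import random

def sample_source(step: int, total_steps: int, rng: random.Random,
                  late_alignment_frac: float = 0.5) -> str:
    """Pick which data pool the next training document is drawn from.

    For the first 90% of training, draw only from the ordinary web mix;
    in the final 10%, upsample alignment-themed documents to roughly
    `late_alignment_frac` of draws. Ratios are illustrative only.
    """
    if step < 0.9 * total_steps:
        return "web"
    return "alignment" if rng.random() < late_alignment_frac else "web"

rng = random.Random(0)
total = 10_000
draws = [sample_source(step, total, rng) for step in range(total)]
early = draws[: int(0.9 * total)]
late = draws[int(0.9 * total):]
# No alignment draws early on; roughly half of the final 1,000 draws
print(early.count("alignment"), late.count("alignment"))
```

Because the intervention only touches the sampling schedule at the tail of training, it is plausible (as the paper notes) to resume an existing checkpoint with the adjusted mix rather than pretraining from scratch.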
D. Conclusion
Currently, open-weight models are getting more powerful, their safeguards are trivially easy to circumvent, usage is unmonitorable, and they can be specialized for exactly the tasks we’d most want to restrict. Pretraining-stage interventions like data filtering and alignment pretraining seem promising, but both are early-stage and tested at relatively small scales. Both areas— and for that matter, the field of open-weight safeguards as a whole— are severely understudied relative to the magnitude of the risk.
If your mental model of AI biosecurity is primarily about what Anthropic and OpenAI are doing with access to their closed models, consider that the open-weight question deserves at least as much attention, the safeguard options are more limited, and the research base is thinner. More funding for pretraining-stage safety research— and serious policy discussions about the conditions under which increasingly capable models should or should not have their weights publicly released— would be a very good start.
Thank you to Henry Williamson for helpful discussion and feedback!