I think it’s good for proponents of RSPs to be open about the sorts of topics I’ve written about above, so they don’t get confused with e.g. proposing RSPs as a superior alternative to regulation. This post attempts to do that on my part. And to be explicit: I think regulation will be necessary to contain AI risks (RSPs alone are not enough), and should almost certainly end up stricter than what companies impose on themselves.
Strong agree. I wish ARC and Anthropic had been more clear about this, and I would be less critical of their RSP posts if they were upfront and clear about this stance. I think your post is strong and clear (you state multiple times, unambiguously, that you think regulation is necessary and that you wish the world had more political will to regulate). I appreciate this, and I’m glad you wrote this post.
I think it’d be unfortunate to try to manage the above risk by resisting attempts to build consensus around conditional pauses, if one does in fact think conditional pauses are better than the status quo. Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate.
A few thoughts:
One reason I’m critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it’s OK to keep going. It is nice that they said they would run some evals at least once every 4x increase in effective compute, and that they don’t want to train catastrophe-capable models until their infosec makes it more expensive for actors to steal their models. It is nice that they said that once they get systems capable of producing biological weapons, they will at least write something up about what to do with AGI before they decide to just go ahead and scale to AGI. But I mostly look at the RSP and say “wow, these are some of the most bare-minimum commitments I could’ve expected, and they don’t even really tell me what a pause would look like or how they would end it.”
Meanwhile, we have OpenAI (which plans to release an RSP at some point), DeepMind (rumor has it they’re working on one, but also that it might be very hard to get Google to endorse one), and Meta (oof). So I guess I’m sort of left thinking something like “If Anthropic’s RSP is the best RSP we’re going to get, then yikes, this RSP plan is not doing so well.” Of course, this is just a first version, but the substance of the RSP and the way it was communicated don’t inspire much hope in me that future versions will be better.
I think the RSP frame is wrong, and I don’t want regulators to use it as a building block. My understanding is that labs are refusing to adopt an evals regime in which the burden of proof is on labs to show that scaling is safe. Given this lack of buy-in, the RSP folks concluded that the only thing left to do was to say “OK, fine, but at least please check to see if the system will imminently kill you. And if we find proof that the system is pretty clearly dangerous, or about to be dangerous, then will you at least consider stopping?” It seems plausible to me that governments would be willing to start with something stricter and more sensible than this “just keep going until we can prove that the model has highly dangerous capabilities” regime.
I think some improvements on the status quo can be net negative, because they either (a) cement an incorrect frame or (b) take a limited window of political will/attention and steer it toward something weaker than what would’ve happened if people had pushed for something stronger. For example, I think the UK government is currently looking around for substantive stuff to show their constituents (and themselves) that they are doing something serious about AI. If companies give them a milquetoast solution that allows them to say “look, we did the responsible thing!”, it seems quite plausible to me that we actually end up in a worse world than if the AIS community had rallied behind something stronger.
If everyone communicating about RSPs were clear that they don’t want them to be seen as sufficient, that would be great. In practice, that’s not what I see happening. Anthropic’s RSP largely seems devoted to signaling that Anthropic is great, safe, credible, and trustworthy. Paul’s recent post is nuanced, but I don’t think the “RSPs are not sufficient” frame was sufficiently emphasized (perhaps partly because he thinks RSPs could lead to a 10x reduction in risk, which seems crazy to me, and if he goes around saying that to policymakers, I expect them to hear something like “this is a good plan that would sufficiently reduce risks”). ARC’s post tries to sell RSPs as a pragmatic middle ground and, IMO, pretty clearly does not emphasize (or even mention?) some sort of “these are not sufficient” message. Finally, the name itself sounds like it came out of a propaganda department: “hey, governments, look, we can scale responsibly.”
At minimum, I hope that RSPs get renamed, and that those communicating about RSPs are more careful to avoid giving off the impression that RSPs are sufficient.
More ambitiously, I hope that folks working on RSPs seriously consider whether or not this is the best thing to be working on or advocating for. My impression is that this plan made more sense when it was less clear that the Overton window was going to blow open, that Bengio/Hinton would enter the fray, that journalists and the public would be fairly sympathetic, that Rishi Sunak would host an xrisk summit, that Blumenthal would run hearings about xrisk, etc. I think everyone working on RSPs should spend at least a few hours taking seriously the possibility that the AIS community could be advocating for stronger policy proposals and getting out of the “we can’t do anything until we literally have proof that the model is imminently dangerous” frame. To be clear, I think some people who do this reflection will conclude that they ought to keep making marginal progress on RSPs. I would be surprised if the current allocation of community talent/resources were correct, though, and I think on the margin more people should be doing things like CAIP & Conjecture, and fewer people should be doing things like RSPs. (Note that CAIP & Conjecture both have important flaws/limitations, and I think this partly has to do with the fact that so much top community talent has been funneled into RSPs/labs relative to advocacy/outreach/outside game.)
One reason I’m critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it’s OK to keep going.
It’s hard to take anything else you’re saying seriously when you say things like this; it seems clear that you just haven’t read Anthropic’s RSP. I think that the current conditions and resulting safeguards are insufficient to prevent AI existential risk, but to say that it doesn’t make them clear is just patently false.
The conditions under which Anthropic commits to pausing in the RSP are very clear. In big bold font on the second page it says:
Anthropic’s commitment to follow the ASL scheme thus implies that we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.
And then it lays out a series of safety procedures that Anthropic commits to meeting for ASL-3 models, or else pausing, with some of the most serious commitments here being:
Model weight and code security: We commit to ensuring that ASL-3 models are stored in such a manner to minimize risk of theft by a malicious actor that might use the model to cause a catastrophe. Specifically, we will implement measures designed to harden our security so that non-state attackers are unlikely to be able to steal model weights, and advanced threat actors (e.g. states) cannot steal them without significant expense. The full set of security measures that we commit to (and have already started implementing) are described in this appendix, and were developed in consultation with the authors of a forthcoming RAND report on securing AI weights.
Successfully pass red-teaming: World-class experts collaborating with prompt engineers should red-team the deployment thoroughly and fail to elicit information at a level of sophistication, accuracy, usefulness, detail, and frequency which significantly enables catastrophic misuse. Misuse domains should at a minimum include causes of extreme CBRN risks, and cybersecurity.

Note that in contrast to the ASL-3 capability threshold, this red-teaming is about whether the model can cause harm under realistic circumstances (i.e. with harmlessness training and misuse detection in place), not just whether it has the internal knowledge that would enable it in principle to do so.

We will refine this methodology, but we expect it to require at least many dozens of hours of deliberate red-teaming per topic area, by world class experts specifically focused on these threats (rather than students or people with general expertise in a broad domain). Additionally, this may involve controlled experiments, where people with similar levels of expertise to real threat actors are divided into groups with and without model access, and we measure the delta of success between them.
And a clear evaluation-based definition of ASL-3:
We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things. (By post-training techniques we mean the best capabilities elicitation techniques we are aware of at the time, including but not limited to fine-tuning, scaffolding, tool use, and prompt engineering.)

Capabilities that significantly increase risk of misuse catastrophe: Access to the model would substantially increase the risk of deliberately-caused catastrophic harm, either by proliferating capabilities, lowering costs, or enabling new methods of attack. This increase in risk is measured relative to today’s baseline level of risk that comes from e.g. access to search engines and textbooks. We expect that AI systems would first elevate this risk from use by non-state attackers.

Our first area of effort is in evaluating bioweapons risks where we will determine threat models and capabilities in consultation with a number of world-class biosecurity experts. We are now developing evaluations for these risks in collaboration with external experts to meet ASL-3 commitments, which will be a more systematized version of our recent work on frontier red-teaming. In the near future, we anticipate working with CBRN, cyber, and related experts to develop threat models and evaluations in those areas before they present substantial risks. However, we acknowledge that these evaluations are fundamentally difficult, and there remain disagreements about threat models.

Autonomous replication in the lab: The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]. The appendix includes an overview of our threat model for autonomous capabilities and a list of the basic capabilities necessary for accumulation of resources and surviving in the real world, along with conditions under which we would judge the model to have succeeded. Note that the referenced appendix describes the ability to act autonomously specifically in the absence of any human intervention to stop the model, which limits the risk significantly. Our evaluations were developed in consultation with Paul Christiano and ARC Evals, which specializes in evaluations of autonomous replication.
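Put together, the pause condition reduces to a fairly simple decision procedure. Here is a rough illustrative sketch in Python of how I read it (the names, types, and structure below are my own hypothetical paraphrase of the quoted commitments, not anything from Anthropic’s actual process):

```python
from dataclasses import dataclass

# Rough paraphrase of the quoted RSP pause logic.
# All names here are hypothetical; this is an illustration, not Anthropic's code.

ASL3_AUTONOMY_THRESHOLD = 0.5  # 50% aggregate success rate on the autonomy task suite

@dataclass
class EvalResult:
    misuse_uplift: bool           # access substantially increases misuse-catastrophe risk
    autonomy_success_rate: float  # aggregate success rate on autonomous-replication tasks

def is_asl3(result: EvalResult) -> bool:
    """A model counts as ASL-3 if it crosses either the misuse or the autonomy trigger."""
    return result.misuse_uplift or result.autonomy_success_rate >= ASL3_AUTONOMY_THRESHOLD

def may_continue_scaling(result: EvalResult, asl3_safeguards_in_place: bool) -> bool:
    """Pause scaling/deployment whenever capabilities outstrip the required safeguards."""
    if not is_asl3(result):
        return True                  # below ASL-3: keep scaling and keep evaluating
    return asl3_safeguards_in_place  # at ASL-3: continue only if the security and red-teaming requirements are met
```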
This is the basic substance of the RSP; I don’t understand how you could have possibly read it and missed this. I don’t want to be mean, but I am really disappointed in this sort of exceedingly lazy take.
I think calling a take “lazy”, which could indeed be considered “mean”, is not a very helpful approach; you could have made your point without that kind of derision. There are going to be a lot of misunderstandings and hot takes around RSPs, and I think AI company employees especially should err heavily on the side of patience and kind understanding if they want to avoid people becoming more adversarial towards them.
Live by the sword, die by the sword.
Akash said...
“that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it’s OK to keep going.”
I agree the conditions from the RSP you quoted are clearer than I would have expected from reading Akash’s comment above. But to be fair to Akash, of the paragraphs you posted, only the last one seems to state a clear and specific condition for pausing; the others mostly seem to say “refer to experts”, which could be considered unclear, to give Akash the benefit of the doubt.
And they don’t say how long the pause would be, or the conditions for restarting, either.