A conditional pause fails to prevent x-risk if either:
The AI successfully exfiltrates itself (which is what’s needed to defeat rollback mechanisms) during training or evaluation, but before deployment.
The AI successfully sandbags the evaluations. (Note that existing conditional pause proposals depend on capability evaluations, not alignment evaluations.)
Another way evals could fail is if they work locally but it’s still too late in the relevant sense: even with the pause mechanism kicking in (e.g., “from now on, any training run that uses more than 0.2x the compute of the model that failed the eval is prohibited”), algorithmic progress, whether at another lab or driven by people on Twitter/X tinkering with older, already-released models, will get someone to the same capability threshold again soon enough. There’s probably a degree to which algorithmic insights can substitute for having more compute available. So, even if a failed eval triggers a global pause, then if cognitive takeoff dynamics are sufficiently fast/”discontinuous,” the failed eval may just mean that we’re already so close to takeoff that it’s too hard to prevent. Monitoring algorithmic progress is particularly difficult (I think compute is the much better lever).
Of course, you can still incorporate this concern by adding enough safety margin to when your evals trigger the pause button. Maybe that’s okay!
But then we’re back to my point of “Are we so sure that the evals (with the proper safety margin) shouldn’t have triggered today already?”
(That’s not to say that I’m totally confident about any of this. I’m just flagging that determining an appropriate safety margin seems like a tough forecasting question under the assumptions that algorithmic progress is hard to monitor/regulate and that a fast takeoff into and through the human range is very much a possibility.)
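To make the shape of this concern concrete, here’s a minimal toy sketch (in Python) of how a conditional-pause trigger with a safety margin plus a compute-cap rule might be operationalized. All names, scores, thresholds, and the 0.2x figure are illustrative assumptions, not anyone’s actual policy; the point is just that the compute cap by itself says nothing about algorithmic progress.

```python
# Hypothetical sketch (not any lab's actual policy): one way a conditional-pause
# trigger with a safety margin might be operationalized. The eval scores,
# thresholds, and the 0.2x compute rule below are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class EvalResult:
    model_name: str
    capability_score: float   # e.g., aggregate score on dangerous-capability evals
    training_compute: float   # FLOP used to train the evaluated model


# The "hard" threshold at which the pause is meant to kick in, plus a safety
# margin that triggers earlier, to account for the possibility that algorithmic
# progress or a fast takeoff makes the hard threshold too late.
HARD_THRESHOLD = 0.8
SAFETY_MARGIN = 0.2          # trigger at 0.6 instead of 0.8
COMPUTE_CAP_FRACTION = 0.2   # the "0.2x the compute of the flagged model" rule


def pause_triggered(result: EvalResult) -> bool:
    """Return True if this eval result should trigger the conditional pause."""
    return result.capability_score >= HARD_THRESHOLD - SAFETY_MARGIN


def run_prohibited(planned_compute: float, flagged_model: EvalResult) -> bool:
    """Once the pause is in force, prohibit training runs above the compute cap.

    Note the gap the surrounding discussion points at: this check says nothing
    about algorithmic progress, which can reach the same capability threshold
    with less compute than the flagged model used.
    """
    return planned_compute > COMPUTE_CAP_FRACTION * flagged_model.training_compute


if __name__ == "__main__":
    flagged = EvalResult("frontier-model-v4", capability_score=0.65, training_compute=1e26)
    print(pause_triggered(flagged))       # True: within the safety margin of the hard threshold
    print(run_prohibited(5e25, flagged))  # True: above 0.2x the flagged model's compute
    print(run_prohibited(1e25, flagged))  # False: under the cap, yet algorithmic progress
                                          # could still close the capability gap
```

The hard-to-forecast quantity in the discussion above is exactly SAFETY_MARGIN: how much earlier the trigger needs to fire so that unmonitored algorithmic progress or a fast takeoff doesn’t blow past the threshold anyway.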
even with the pause mechanism kicking in (e.g., “from now on, any training run that uses more than 0.2x the compute of the model that failed the eval is prohibited”), algorithmic progress, whether at another lab or driven by people on Twitter/X tinkering with older, already-released models, will get someone to the same capability threshold again soon enough. [...] you can still incorporate this concern by adding enough safety margin to when your evals trigger the pause button.
Safety margin is one way, but I’d be much more keen on continuing to monitor the strongest models even after the pause has kicked in, so that you notice the effects of algorithmic progress and can tighten controls if needed. This includes rolling back model releases if people on Twitter tinker with them to exceed capability thresholds. (This also implies no open sourcing, since you can’t roll back if you’ve open sourced the model.)
But I also wish you’d say what exactly your alternative course of action is, and why it’s better. E.g. the worry of “algorithmic progress gets you to the threshold” also applies to unconditional pauses. Right now your comments feel to me like a search for anything negative about a conditional pause, without checking whether that negative applies to other courses of action.
The way I see it, the main difference between a conditional and an unconditional pause is that the unconditional pause comes with a bigger safety margin (as big as we can muster). So, given that I’m more worried about surprising takeoffs, that position seems prima facie more appealing to me.
In addition, as I say in my other comment, I’m open to (edit: or, more strongly, I’d ideally prefer this!) some especially safety-conscious research continuing through the pause. I gather that this is one of your primary concerns? I agree that an outcome where that’s possible requires nuanced discourse, which we may not get if the public reaction to AI goes too far in one direction. So, yes, there’s a tradeoff around public advocacy.