A conditional pause fails to prevent x-risk if either:
The AI successfully exfiltrates itself (which is what’s needed to defeat rollback mechanisms) during training or evaluation, but before deployment.
The AI successfully sandbags the evaluations. (Note that existing conditional pause proposals depend on capability evaluations, not alignment evaluations.)
(Obviously there are also the normal institutional failures, e.g. if a company simply ignores the evaluation requirements and forges ahead. I’m setting those aside here.)
Both of these seem extremely difficult to me (likely beyond human-level, in the sense that if you somehow put a human in the situation the AI would be in, I would expect the human to fail).
How likely do you think it is that we get an AI capable of one of these failure modes, before we see an AI capable of e.g. passing 10 out of the 12 ARC Evals tasks? My answer would be “negligible”, and so I’m at least in favor of “pause once you pass 10 out of 12 ARC Evals tasks” over “pause now”. I think we can raise the difficulty of the bar a decent bit more before my answer stops being “negligible”.
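For concreteness, here is a minimal sketch of what such an eval-based trigger could look like; the pass/fail representation and the function name are purely illustrative assumptions, not the actual ARC Evals task suite or any existing implementation:

```python
def should_pause(task_results: list[bool], pass_threshold: int = 10) -> bool:
    """Trigger a pause once the model passes at least `pass_threshold` eval tasks.

    `task_results` holds one pass/fail outcome per eval task (e.g. 12 entries for
    a 12-task suite). The default threshold mirrors the hypothetical
    "10 out of 12" rule discussed above.
    """
    return sum(task_results) >= pass_threshold

# Example: a model passes 10 of 12 hypothetical tasks, so the conditional pause kicks in.
results = [True] * 10 + [False] * 2
print(should_pause(results))  # True
```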
I don’t think this depends on takeoff speeds at all, since I’d expect a conditional pause proposal to lead to a pause well before models are automating 20% of tasks (assuming no good mitigations in place by that point).
I don’t think this depends on timelines, except inasmuch as short timelines correlate with discontinuous jumps in capability. If anything, shorter timelines seem to argue more strongly for a conditional pause proposal, since it seems far easier to build support for and enact a conditional pause than an unconditional one.
I don’t think this depends on takeoff speeds at all, since I’d expect a conditional pause proposal to lead to a pause well before models are automating 20% of tasks (assuming no good mitigations in place by that point).
I should have emphasized that I’m talking about cognitive AI takeoff, not economic takeoff.
I don’t have a strong view on whether ~20% of human tasks are easy and regular/streamlined enough to automate with stochastic-parrot AI tools. Be that as it may, what’s more important is what happens once AIs pass the reliability threshold that makes for a great “general assistant” in all sorts of domains. From there, I think it’s just a tiny step further to also being a great CEO. Because these capability levels are so close to each other on my model, the world may still look similar to ours at that point.
All of that said, it’s not like I consider it particularly likely that a system would blow past all the evals you’re talking about in a single swoop, especially since some of them will come (slightly) before the point of being a great “general assistant.” I also have significant trust that the people designing these evals will be thinking about these concerns. Still, I think it’s going to be very challenging to make sure evals organizations (or evals teams inside labs, if it’s done lab-internally) have enough political power and stay uncorrupted by pressures to be friendly towards influential lab leadership. These problems are surmountable in theory, but I think it’ll be hard, so I’m hoping the people working on this are aware of all that could go wrong. I recently wrote up some quick thoughts on safety evals here. Overall, I’m probably happy enough with a really well-thought-out “conditional pause” proposal, but I’d need to be reassured that the people who decide in favor of it can pass the Ideological Turing test for positions like fast takeoff, or for the view that economic milestones like “20% of tasks are automated” are probably irrelevant.
Sounds like we roughly agree on actions, even if not beliefs (I’m less sold on fast / discontinuous takeoff than you are).
As a minor note, to keep incentives good, you could pay evaluators / auditors based on how much performance they are able to elicit. You could even require that models be evaluated by at least three auditors, and split up payment between them based on their relative performance. In general it feels like there’s a huge space of possibilities that has barely been explored.
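To make the incentive scheme concrete, here is a minimal sketch of how the payment split could work; the proportional-split rule and the names used are my own illustrative assumptions rather than part of any existing proposal:

```python
def split_auditor_payment(total_payment: float, elicited_scores: dict[str, float]) -> dict[str, float]:
    """Split a fixed payment pool between auditors in proportion to the
    capability level each auditor managed to elicit from the model.

    `elicited_scores` maps auditor name -> benchmark score they elicited.
    A higher elicited score earns a larger share, rewarding thorough elicitation.
    """
    total_score = sum(elicited_scores.values())
    if total_score == 0:
        # No auditor elicited anything measurable; split evenly as a fallback.
        return {name: total_payment / len(elicited_scores) for name in elicited_scores}
    return {
        name: total_payment * score / total_score
        for name, score in elicited_scores.items()
    }

# Example: three auditors evaluate the same model and split a fixed pool.
print(split_auditor_payment(300_000, {"auditor_a": 0.62, "auditor_b": 0.55, "auditor_c": 0.48}))
```

Splitting proportionally to elicited performance rewards each auditor for finding capabilities the others missed; other split rules (e.g. winner-take-most) would sharpen that incentive further.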
A conditional pause fails to prevent x-risk if either:
The AI successfully exfiltrates itself (which is what’s needed to defeat rollback mechanisms) during training or evaluation, but before deployment.
The AI successfully sandbags the evaluations. (Note that existing conditional pause proposals depend on capability evaluations, not alignment evaluations.)
Another way evals could fail is if they work locally but it’s still too late in the relevant sense: even with the pause mechanism kicking in (e.g., “from now on, any training run that uses at least 0.2x the compute of the model that failed the eval is prohibited”), algorithmic progress at another lab, or driven by people on Twitter/X tinkering with older, already-released models, will get someone to the same capability threshold again soon enough. There’s probably a degree to which algorithmic insights can substitute for having more compute available. So, even if a failed eval triggers a global pause, sufficiently fast/“discontinuous” cognitive takeoff dynamics may mean that we’re already so close to takeoff that it’s too hard to prevent. Monitoring algorithmic progress is particularly difficult (I think compute is the much better lever).
Of course, you can still incorporate this concern by adding enough safety margin to when your evals trigger the pause button. Maybe that’s okay!
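As a rough illustration of the kind of rule being discussed, here is a minimal sketch of a compute-threshold check with an extra safety margin for algorithmic progress; the 0.2x figure comes from the hypothetical above, while the margin factor and the function itself are just placeholder assumptions:

```python
def training_run_allowed(
    proposed_compute_flop: float,
    failed_eval_model_compute_flop: float,
    base_threshold: float = 0.2,   # the hypothetical "0.2x" rule above
    safety_margin: float = 0.5,    # extra tightening for algorithmic progress (illustrative)
) -> bool:
    """Return True if a proposed training run stays under the paused compute cap.

    The cap is a fraction of the compute used by the model that triggered the
    pause, shrunk further by a safety margin to hedge against algorithmic
    progress making the same compute go further over time.
    """
    compute_cap = failed_eval_model_compute_flop * base_threshold * safety_margin
    return proposed_compute_flop < compute_cap

# Example: the failed model used 1e25 FLOP, so the cap is 1e25 * 0.2 * 0.5 = 1e24 FLOP.
print(training_run_allowed(5e23, 1e25))  # True: under the cap
print(training_run_allowed(2e24, 1e25))  # False: over the cap
```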
But then we’re back to my point of “Are we so sure that the evals (with the proper safety margin) shouldn’t have triggered today already?”
(That’s not to say I’m totally confident about any of this. I’m just flagging that it seems like a tough forecasting question to determine an appropriate safety margin, on the assumptions that algorithmic progress is hard to monitor/regulate and that fast takeoff into and through the human range is very much a possibility.)
even with the pause mechanism kicking in (e.g., “from now on, any training run that uses at least 0.2x the compute of the model that failed the eval is prohibited”), algorithmic progress at another lab, or driven by people on Twitter/X tinkering with older, already-released models, will get someone to the same capability threshold again soon enough. [...] you can still incorporate this concern by adding enough safety margin to when your evals trigger the pause button.
Safety margin is one way, but I’d be much more keen on continuing to monitor the strongest models even after the pause has kicked in, so that you notice the effects of algorithmic progress and can tighten controls if needed. This includes rolling back model releases if people on Twitter tinker with them to exceed capability thresholds. (This also implies no open sourcing, since you can’t roll back if you’ve open sourced the model.)
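As a rough sketch of the monitoring-and-rollback idea (the threshold value, data structure, and function names here are all hypothetical illustrations, not a reference to any real system):

```python
from dataclasses import dataclass

@dataclass
class ReleasedModel:
    name: str
    best_known_eval_score: float  # strongest capability elicited so far, e.g. via public fine-tunes or scaffolding

# Hypothetical capability ceiling that, once exceeded, triggers a rollback of the release.
CAPABILITY_THRESHOLD = 0.8

def models_to_roll_back(monitored: list[ReleasedModel]) -> list[str]:
    """Return the names of released models whose post-release elicited capability
    now exceeds the pause threshold and should therefore be pulled from deployment."""
    return [m.name for m in monitored if m.best_known_eval_score > CAPABILITY_THRESHOLD]

# Example: periodic monitoring after the pause has kicked in.
fleet = [ReleasedModel("model_a", 0.65), ReleasedModel("model_b", 0.83)]
print(models_to_roll_back(fleet))  # ['model_b'] would be rolled back
```

Note that a rollback check like this only has teeth for models whose deployment the developer still controls, which is the point about open sourcing above.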
But I also wish you’d say what exactly your alternative course of action is, and why it’s better. E.g. the worry of “algorithmic progress gets you to the threshold” also applies to unconditional pauses. Right now your comments feel to me like a search for anything negative about a conditional pause, without checking whether that negative applies to other courses of action.
But I also wish you’d say what exactly your alternative course of action is, and why it’s better. E.g. the worry of “algorithmic progress gets you to the threshold” also applies to unconditional pauses. Right now your comments feel to me like a search for anything negative about a conditional pause, without checking whether that negative applies to other courses of action.
The way I see it, the main difference between a conditional and an unconditional pause is that the unconditional pause comes with a bigger safety margin (as big as we can muster). So, given that I’m more worried about surprising takeoffs, that position seems prima facie more appealing to me.
In addition, as I say in my other comment, I’m open to (edit: or, more strongly, I’d ideally prefer this!) some especially safety-conscious research continuing through the pause. I gather that this is one of your primary concerns? I agree that an outcome where that’s possible requires nuanced discourse, which we may not get if public reaction to AI goes too far in one direction. So, I agree that there’s a tradeoff around public advocacy.