Iterative deployment. We treat AGI like we would treat many other new technologies: something that could pose risks, which we should think about and mitigate, but ultimately something we should learn about through iterative deployment. The default is to deploy new AI systems, see what happens with a particular eye towards noticing harms, and then design appropriate mitigations. In addition, AI systems are deployed with a rollback mechanism, so that if a deployment causes significant harms
[...]
Conditional pause. We institute regulations that say that capability improvement must pause once the AI system hits a particular threshold of riskiness, as determined by some relatively standardized evaluations, with some room for error built in. AI development can only continue once the developer has exhibited sufficient evidence that the risk will not arise.
Compared to you, I’m more pessimistic about these two measures. On iterative deployment, I’m skeptical about the safety of rollback mechanisms. On conditional pause, I agree it makes total sense to pause at the latest point possible as long as things are still pretty likely to be safe. However, I don’t see why we aren’t already at that point.
I suspect that our main crux might be a disagreement over takeoff speeds, and perhaps AI timelines being another (more minor) crux?
On takeoff speeds, I place 70% on 0.5-2.5 orders of magnitude for what Tom Davidson calls the FLOP gap. (Relatively high robustness, because I’ve thought about this a lot.) I also worry less about metrics of economic growth/economic change, because I believe it won’t take long to get from “AI makes a noticeable dent in human macroeconomic productivity” to “it’s now possible to run millions of AIs that are ~better/faster at all tasks human experts can do on a computer.” The latter scenario is one from which it is easy to imagine how AIs might disempower humans. I basically don’t see how one can safely roll things back at the point where generally smarter-than-human AIs exist that can copy themselves millionfold on the internet.
On timelines, I have maybe 34% on 3 years or less, and 11% on 1 year or less.
Why do I have these views?
A large part of the story is that if you had described to me back in 2018 all the AI capabilities we have today, without mentioning the specific year by which we’d have those capabilities, I’d have said “once we’re there, we’re probably very close to transformative AI.” And now that we are at this stage, even though it’s much sooner than I’d have expected, I feel like the right direction of update is “AI timelines are sooner than expected” rather than “(cognitive) takeoff speeds must be slower than I’d have expected.”
Maybe this again comes down to my specific view on takeoff speeds. I felt more confident that takeoff won’t be super slow than I felt confident about anything timelines-related.
So, why the confident view on takeoff speeds? Just looking at humans vs chimpanzees, I’m amazed by the comparatively small difference in brain size. We can also speculate, based on the way evolution operates, that there’s probably not much room for secret-sauce machinery in the human brain (that chimpanzees don’t already have).
The main counterargument from the slow(er) takeoff crowd on the chimpanzees vs. humans comparison is that humans faced much stronger selection pressure for intelligence, which must have tweaked a lot of other things besides brain size. Since chimpanzees didn’t face that same selection pressure, the evolutionary comparison underestimates how smart an animal with a chimpanzee-sized brain would be if it had also undergone strong selection pressure for the sort of niche that humans inhabit (the “intelligence niche”). I find that counterargument slightly convincing, but not convincing enough to narrow my FLOP gap estimates too much. Compared to ML progress, where we often 10x compute between models, evolution was operating way more slowly.
As I’ve written in an (unpublished) review of (a draft of) the FLOP gap report:
I’ll now reply to sections in the report where Tom discusses evidence for the FLOP gap and point out where I’d draw different conclusions. I’ll start with the argument that I consider the biggest crux between Tom and me.
Tom seems to think chimpanzees – or even rats, with much smaller probability – could plausibly automate a significant percent of cognitive tasks “if they had been thoroughly evolutionarily prepared for this” (I’m paraphrasing). A related argument in this spirit is Paul Christiano’s argument that humans far outclass chimpanzees not because of a discontinuity in intelligence, but because chimpanzees hadn’t been naturally selected to be good at a sufficiently general (or generalizable) range of skills.
I think this argument is in tension with the observation that people at the lower range of human intelligence (as measured by IQ, for instance) tend to struggle to find and hold jobs. If natural selection had straightforwardly turned all humans into specialists for “something closely correlated with economically useful tasks,” then how come the range of human intelligence is still so wide? I suspect that humans were selected for (general) intelligence more strongly than chimpanzees because of some “discontinuity” in the first place. This comment by Carl Shulman explains it as follows:
“Hominid culture took off enabled by human capabilities [so we are not incredibly far from the minimum need for strongly accumulating culture, the selection effect you reference in the post], and kept rising over hundreds of thousands and millions of years, at accelerating pace as the population grew with new tech, expediting further technical advance.”
Admittedly, Carl also writes the following:
“Different regions advanced at different rates (generally larger connected regions grew faster, with more innovators to accumulate innovations), but all but the smallest advanced. So if humans overall had lower cognitive abilities there would be slack for technological advance to have happened anyway, just at slower rates (perhaps manyfold), accumulating more by trial and error.”
So, perhaps chimpanzees (or bonobos), if they had been evolutionarily prepared for social learning or for performing well in the economy, could indeed perform about 20% of today’s tasks. But that’s also my cutoff point: I think we can somewhat confidently rule out that smaller-brained primates, let alone rodents such as rats, could do the same thing in this hypothetical. (Or, in case I’m wrong, it would have to be because it’s possible to turn all of the rodent’s cognition into a highly specialized “tool” that exploits various advantages it can reach over humans – enabling partial automation of specific workflows, but not full automation of sectors.)
A conditional pause fails to prevent x-risk if either:
The AI successfully exfiltrates itself (which is what’s needed to defeat rollback mechanisms) during training or evaluation, but before deployment.
The AI successfully sandbags the evaluations. (Note that existing conditional pause proposals depend on capability evaluations, not alignment evaluations.)
(Obviously there’s also the normal institutional failures, e.g. if a company simply ignores the evaluation requirements and forges ahead. I’m setting those aside here.)
Both of these seem extremely difficult to me (likely beyond human-level, in the sense that if you somehow put a human in the situation the AI would be in, I would expect the human to fail).
How likely do you think it is that we get an AI capable of one of these failure modes, before we see an AI capable of e.g. passing 10 out of the 12 ARC Evals tasks? My answer would be “negligible”, and so I’m at least in favor of “pause once you pass 10 out of 12 ARC Evals tasks” over “pause now”. I think we can raise the difficulty of the bar a decent bit more before my answer stops being “negligible”.
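To make the shape of such a trigger concrete, here is a minimal sketch of a threshold-based pause check, assuming hypothetical task results; the task names and the 10-of-12 cutoff just mirror the example above and aren’t tied to ARC Evals’ actual task suite or tooling:

```python
# Minimal sketch of a capability-eval pause trigger (hypothetical; not ARC Evals' tooling).
from dataclasses import dataclass


@dataclass
class EvalResult:
    task_name: str
    passed: bool


def should_pause(results: list[EvalResult], pass_threshold: int = 10) -> bool:
    """Trigger a pause once the model passes at least `pass_threshold` of the eval tasks."""
    num_passed = sum(r.passed for r in results)
    return num_passed >= pass_threshold


# Example: a model that passes 10 of 12 tasks trips the pause.
results = [EvalResult(f"task_{i}", passed=(i < 10)) for i in range(12)]
assert should_pause(results)
```

All the interesting design questions are in how the tasks and the threshold get chosen, not in the check itself.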
I don’t think this depends on takeoff speeds at all, since I’d expect a conditional pause proposal to lead to a pause well before models are automating 20% of tasks (assuming no good mitigations in place by that point).
I don’t think this depends on timelines, except inasmuch as short timelines correlate with discontinuous jumps in capability. If anything, shorter timelines argue more strongly for a conditional pause proposal, since it seems far easier to build support for and enact a conditional pause than an unconditional one.
I don’t think this depends on takeoff speeds at all, since I’d expect a conditional pause proposal to lead to a pause well before models are automating 20% of tasks (assuming no good mitigations in place by that point).
I should have emphasized that I’m talking about cognitive AI takeoff, not economic takeoff.
I don’t have a strong view on whether ~20% of human tasks are easy and regular/streamlined enough to automate with stochastic-parrot AI tools. Be that as it may, what’s more important is what happens once AIs pass the reliability threshold that makes for a great “general assistant” in all sorts of domains. From there, I think it’s just a tiny step further to also being a great CEO. Because these capability levels are so close to each other on my model, the world may still look similar to ours at that point.
All of that said, it’s not like I consider it particularly likely that a system would blow past all the evals you’re talking about in a single swoop, especially since some of them will be (slightly) before the point of being a great “general assistant.” I also have significant trust that the people designing these evals will be thinking about these concerns. That said, I think it’s going to be very challenging to make sure evals organizations (or evals teams inside labs, in case it’s done lab-internally) have enough political power and stay uncorrupted by pressures to be friendly towards influential lab leadership. These problems are surmountable in theory, but I think it’ll be hard, so I’m hoping the people working on this are aware of all that could go wrong. I recently wrote up some quick thoughts on safety evals here. Overall, I’m probably happy enough with a really well-thought-out “conditional pause” proposal, but I’d need to be reassured that the people who decide in favor of it can pass the Ideological Turing test for positions like fast takeoff, or the view that economic milestones like “20% of tasks are automated” are probably irrelevant.
Sounds like we roughly agree on actions, even if not beliefs (I’m less sold on fast / discontinuous takeoff than you are).
As a minor note, to keep incentives good, you could pay evaluators / auditors based on how much performance they are able to elicit. You could even require that models be evaluated by at least three auditors, and split up payment between them based on their relative performance. In general, it feels like there is a huge space of possibilities that has barely been explored.
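Purely as an illustration (the auditor names, scores, and pool size are made up), the proportional split could be as simple as:

```python
# Hypothetical sketch: split a fixed payment pool among auditors in proportion to
# how much capability each one managed to elicit from the model.
def split_payment(elicited_scores: dict[str, float], total_pool: float) -> dict[str, float]:
    total_score = sum(elicited_scores.values())
    if total_score == 0:
        # Nobody elicited anything; split evenly rather than paying nothing.
        return {name: total_pool / len(elicited_scores) for name in elicited_scores}
    return {name: total_pool * score / total_score for name, score in elicited_scores.items()}


# Example: three auditors sharing a 300k pool based on relative elicited performance.
print(split_payment({"auditor_a": 0.62, "auditor_b": 0.55, "auditor_c": 0.48}, 300_000))
```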
A conditional pause fails to prevent x-risk if either:
The AI successfully exfiltrates itself (which is what’s needed to defeat rollback mechanisms) during training or evaluation, but before deployment.
The AI successfully sandbags the evaluations. (Note that existing conditional pause proposals depend on capability evaluations, not alignment evaluations.)
Another way evals could fail is if they work locally but it’s still too late in the relevant sense: even with the pause mechanism kicking in (e.g., “from now on, any training runs that use more than 0.2x the compute of the model that failed the eval will be prohibited”), algorithmic progress at another lab, or people on twitter(/X) tinkering with older, already-released models, will get someone to the same capability threshold again soon enough. There’s probably a degree to which algorithmic insights can substitute for having more compute available. So, even if a failed eval triggers a global pause, sufficiently fast/“discontinuous” cognitive takeoff dynamics could mean that we’re already so close to takeoff that it’s too hard to prevent. Monitoring algorithmic progress is particularly difficult (I think compute is the much better lever).
Of course, you can still incorporate this concern by adding enough safety margin to when your evals trigger the pause button. Maybe that’s okay!
But then we’re back to my point of “Are we so sure that the evals (with the proper safety margin) shouldn’t have triggered today already?”
(I’m not saying that I’m totally confident about any of this. I’m just flagging that it seems like a tough forecasting question to determine what an appropriate safety margin is, on the assumption that algorithmic progress is hard to monitor/regulate and that fast takeoff into and through the human range is very much a possibility.)
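To illustrate what combining the compute cap with a safety margin could look like (the 0.2x figure is just the placeholder from my example above, and the 2x margin is an arbitrary number I picked for the sketch):

```python
# Hypothetical sketch of the pause rule plus safety margin discussed above.
def run_is_prohibited(proposed_flop: float,
                      flop_of_flagged_model: float,
                      compute_fraction: float = 0.2,
                      safety_margin: float = 2.0) -> bool:
    """Return True if a proposed training run exceeds the margin-adjusted compute cap."""
    # Dividing the cap by the margin makes the rule stricter, hedging against
    # algorithmic progress reaching the same capability with less compute.
    cap = compute_fraction * flop_of_flagged_model / safety_margin
    return proposed_flop > cap


# Example: flagged model at 1e25 FLOP, 0.2x cap, 2x margin => cap of 1e24 FLOP.
print(run_is_prohibited(proposed_flop=3e24, flop_of_flagged_model=1e25))  # True
```

The hard part is obviously choosing the margin, which is exactly the forecasting question I’m worried about.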
even with the pause mechanism kicking in (e.g., “from now on, any training runs that use more than 0.2x the compute of the model that failed the eval will be prohibited”), algorithmic progress at another lab, or people on twitter(/X) tinkering with older, already-released models, will get someone to the same capability threshold again soon enough. [...] you can still incorporate this concern by adding enough safety margin to when your evals trigger the pause button.
Safety margin is one way, but I’d be much more keen on continuing to monitor the strongest models even after the pause has kicked in, so that you notice the effects of algorithmic progress and can tighten controls if needed. This includes rolling back model releases if people on Twitter tinker with them to exceed capability thresholds. (This also implies no open sourcing, since you can’t roll back if you’ve open sourced the model.)
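As a rough sketch of what that ongoing monitoring could look like (model names, scores, and the threshold are all placeholders), you’d periodically re-run the evals on released models and their publicly circulating variants, and flag anything that now crosses the line for rollback:

```python
# Hypothetical sketch: periodic re-evaluation of already-released models (including
# fine-tuned or scaffolded variants), with a rollback flag once any variant crosses
# the capability threshold.
def models_to_roll_back(latest_scores: dict[str, float], capability_threshold: float) -> list[str]:
    return [model for model, score in latest_scores.items() if score >= capability_threshold]


# Example: a community-scaffolded variant of a released model crosses the threshold.
latest_scores = {"released-model-v1": 0.41, "released-model-v1+agent-scaffold": 0.73}
print(models_to_roll_back(latest_scores, capability_threshold=0.7))
# ['released-model-v1+agent-scaffold']
```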
But I also wish you’d say what exactly your alternative course of action is, and why it’s better. E.g. the worry of “algorithmic progress gets you to the threshold” also applies to unconditional pauses. Right now your comments feel to me like a search for anything negative about a conditional pause, without checking whether that negative applies to other courses of action.
But I also wish you’d say what exactly your alternative course of action is, and why it’s better. E.g. the worry of “algorithmic progress gets you to the threshold” also applies to unconditional pauses. Right now your comments feel to me like a search for anything negative about a conditional pause, without checking whether that negative applies to other courses of action.
The way I see it, the main difference between a conditional and an unconditional pause is that the unconditional pause comes with a bigger safety margin (as big as we can muster). So, given that I’m more worried about surprising takeoffs, that position seems prima facie more appealing to me.
In addition, as I say in my other comment, I’m open to (edit: or, more strongly, I’d ideally prefer this!) some especially safety-conscious research continuing onwards through the pause. I gather that this is one of your primary concerns? I agree that an outcome where that’s possible requires nuanced discourse, which we may not get if public reaction to AI goes too far in one direction. So, I agree that there’s a tradeoff around public advocacy.