I don’t think it’s a good plan to build an AI that enacts some pivotal act ensuring that nobody ever builds a misaligned AGI. See Critch here and here. When I think about building AI that is safe, I think about multiple layers of safety including monitoring, robustness, alignment, and deployment. Safety is not a single system that doesn’t destroy the world; it’s an ongoing process that prevents bad outcomes. See Hendrycks here and here.
I don’t think it’s a good plan to build an AI that enacts some pivotal act ensuring that nobody ever builds a misaligned AGI. See Critch here and here.
My reply to Critch is here, and Eliezer’s is here and here. I’d also point to Scott Alexander’s comment, Nate’s “Don’t leave your fingerprints on the future”, and my:
“Also, it’s a little odd if you end up making decisions that have huge impacts on the rest of humanity that they had no say in; from a certain perspective that is inappropriate for you to do.”
Keep in mind that ‘this policy seems a little odd’ is a very small cost to pay relative to ‘every human being dies and all of the potential value of the future is lost’. A fire department isn’t a government, and there are cases where you should put out an immediate fire and then get everyone’s input, rather than putting the fire-extinguishing protocol to a vote while the building continues to burn down in front of you. (This seems entirely compatible with the OP to me; ‘governments should be involved’ doesn’t entail ‘government responses should be put to direct population-wide vote by non-experts’.)
Specifically, when I say ‘put out the fire’ I’m talking about ‘prevent something from killing all humans in the near future’; I’m not saying ‘solve all of humanity’s urgent problems, e.g., end cancer and hunger’. That’s urgent, but it’s a qualitatively different sort of urgency. (Delaying a cancer cure by two years would be an incredible tragedy on a human scale, but it’s a rounding error in a discussion of astronomical scales of impact.)
What, concretely, do you think humanity should do as an alternative to “build an AI that enacts some pivotal act ensuring that nobody ever builds a misaligned AGI”? If you aren’t sure, then what’s an example of an approach that seems relatively promising to you? What’s a concrete scenario where you imagine things going well in the long run?
To sharpen the question: Eventually, as compute becomes more available and AGI techniques become more efficient, we should expect that individual consumers will be able to train an AGI that destroys the world using the amount of compute available on a mass-marketed personal computer. (If the world hasn’t already been destroyed before then.) What’s the likeliest way you expect this outcome to be prevented, or (if you don’t think it ought to be prevented, or don’t think it’s preventable) the likeliest way you expect things to go well if this outcome isn’t prevented?
(If your answer is “I think this will never happen no matter how far human technology advances” and “in particular, the probability seems low enough to me that we should just write off those worlds and be willing to die in them, in exchange for better focusing on the more-likely world where [scenario] is true instead”, then I’d encourage saying that explicitly.)
When I think about building AI that is safe, I think about multiple layers of safety including monitoring, robustness, alignment, and deployment.
At that level of abstraction, I’d agree! Dan defines robustness as “create models that are resilient to adversaries, unusual situations, and Black Swan events”, monitoring as “detect malicious use, monitor predictions, and discover unexpected model functionality”, alignment as “build models that represent and safely optimize hard-to-specify human values”, and systemic safety as “use ML to address broader risks to how ML systems are handled, such as cyberattacks”. All of those seem required for a successful AGI-mediated pivotal act.
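To make the “monitoring” bucket concrete, here is a minimal sketch of what “monitor predictions” can look like in practice: a maximum-softmax-probability check, roughly the baseline from Dan’s earlier out-of-distribution-detection work. The classifier, inputs, and threshold below are stand-ins, not anything from a real deployment:

```python
import torch
import torch.nn.functional as F

def flag_low_confidence_inputs(model, inputs, threshold=0.5):
    """Flag inputs whose maximum softmax probability falls below `threshold`.

    A low max-probability is one crude signal that an input is unusual
    relative to the training distribution and should be escalated for review.
    `model` is assumed to be any PyTorch classifier that returns class logits.
    """
    model.eval()
    with torch.no_grad():
        logits = model(inputs)                      # (batch, num_classes)
        probs = F.softmax(logits, dim=-1)
        confidence, predictions = probs.max(dim=-1)
    # Indices of inputs the monitor should escalate rather than act on.
    flagged = (confidence < threshold).nonzero(as_tuple=True)[0]
    return flagged, predictions, confidence
```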
If this description is meant to point at a specific alternative approach, or meant to exclude pivotal acts in some way, then I’m not sure what you have in mind.
Safety is not a single system that doesn’t destroy the world; it’s an ongoing process that prevents bad outcomes.
I agree on both fronts. Not destroying the world is insufficient (you need to save the world; we already know how to build AI systems that don’t destroy the world), and a pivotal act fails if it merely delays doom rather than indefinitely pausing AGI proliferation (an “ongoing process”, albeit one initiated by a fast discrete action to ensure no one destroys the world tomorrow).
But I think you mean to gesture at some class of scenarios where the “ongoing process” doesn’t begin with a sudden discrete phase shift, and more broadly where no single actor ever uses AI to do anything sudden and important in the future. What’s a high-level description of how this might realistically play out?
You linked to the same Hendrycks paper twice; is there another one you wanted to point at? And, is there a particular part of the paper(s) you especially wanted to highlight?
Thanks for the thoughtful response. My original comment was simply to note that some people disagree with the pivotal act framing, but it didn’t really offer an alternative and I’d like to engage with the problem more.
I think we have a few worldview differences that drive disagreement on how to limit AI risk given solutions to technical alignment challenges. Maybe you’d agree with me in some of these places, but a few candidates:
Stronger AI can protect us against weaker AI. When you imagine a world where anybody can train an AGI at home, you conclude that anybody will be able to destroy the world from home. I would expect governments and corporations to maintain a sizable lead over individuals, meaning that individuals cannot take over the world. They wouldn’t necessarily need to preempt the creation of an AGI; they could simply contain it afterwards, by denying it access to resources and exposing its plans for world destruction. This is especially true in worlds where intelligence alone cannot take over the world, and resources or cooperation between entities are also required, as argued in Section C of Katja Grace’s recent post. I could see some of these proposals overlapping with your definition of a pivotal act, though I have more of a preference for multilateral and government action.
Government AI policy can be competent. Our nuclear non-proliferation regime is strong: only 8 countries have nuclear capabilities. Gain-of-function research is a strong counterexample, but the Biden administration’s export controls on selling advanced semiconductors to China for national security purposes again support the idea of government competence. Strong government action seems possible given either (a) significant AI warning shots or (b) convincing mainstream ML and policy leaders of the danger of AI risk. When Critch suggested that governments build weapons to monitor and disable rogue AGI projects, Eliezer said it’s not realistic but would be incredible if accomplished. Those are the kinds of proposals I’d want to popularize early.
I have longer timelines, expect a more distributed takeoff, and have a more optimistic view of the chances of human survival than I’d guess you do. My plan for preventing AI x-risk is to solve the technical problems, and to convince influential people in ML and policy that the solutions must be implemented. They can then build aligned AI, and employ measures like compute controls and monitoring of large projects to ensure widespread implementation. If it turns out that my worldview is wrong and an AI lab soon builds a single AGI that could destroy the world, I’d be much more open to dramatic pivotal acts that I’m not excited about in my mainline scenario.
Three more targeted replies to your comments:
Your proposed pivotal act in your reply to Critch seems much more reasonable to me than “burn all GPUs”. I’m still fuzzy on the details of how you would uncover all potential AGI projects before they get dangerous, and what you would do to stop them. Perhaps more crucially, I wouldn’t be confident that we’ll have AI that can run whole-brain emulation of humans before we have AI that poses x-risk, because WBE would likely require experimental evidence from human brains that early advanced AI will not have.
I strongly agree with the need for more honest discussions about pivotal acts / how to make AI safe. I’m very concerned by the fact that people have opinions they wouldn’t share, even within the AI safety community. One benefit of more open discussion could be reduced stigma around the term — my negative association comes from the framing of a single dramatic action that forever ensures our safety, perhaps via coercion. “Burn all GPUs” exemplifies these failure modes, but I might be more open to alternatives.
I really like “don’t leave your fingerprints on the future.” If more dramatic pivotal acts are necessary, I’d endorse that mindset.
This was interesting to think about, and I’d be happy to answer any other questions. In particular, I’m trying to think about how to ensure ongoing safety in Ajeya’s HFDT world. The challenge is implementation, assuming somebody has solved deceptive alignment using e.g. interpretability, adversarial training, or training strategies that exploit inductive biases. Generally speaking, I think you’d have to convince the heads of Google, Facebook, and other organizations that can build AGI that these safety procedures are technically necessary. This is a tall order but not impossible. Once the leading groups are all building aligned AGIs, maybe you can promote ongoing safety either with normal policy (e.g. compute controls) or AI-assisted monitoring (your proposal or Critch’s EMPs). I’d like to think about this more but have to run.