If the first AGI is developed by OpenAI, Google DeepMind or Anthropic (all of which seem relatively cautious, perhaps some more than others), I put the chance of massively catastrophic misalignment at <20%.
Interested in hearing more about why you think this. How do they go from the current level of really poor alignment (see ref to “29%” here, and all the jailbreaks, and consider that the current models are only relatively safe because they are weak), to perfect alignment? How does their alignment scale? How is “getting the AI to do your alignment homework” even a remotely safe strategy for eliminating catastrophic risk?
I place significant weight on the possibility that when labs are in the process of training AGI or near-AGI systems, they will be able to see alignment opportunities that we can’t from a more theoretical or distanced POV. In this sense, I’m sympathetic to Anthropic’s empirical approach to safety. I also think there are a lot of really smart and creative people working at these labs.
Leading labs also employ some people focused on the worst risks. For misalignment risks, I am most worried about deceptive alignment, and Anthropic recently hired one of the people who coined that term. (From this angle, I would feel safer about these risks if Anthropic were in the lead rather than OpenAI. I know less about OpenAI’s current alignment team.)
Let me be clear though: Even if I’m right above and the massively catastrophic misalignment risk from one of these labs creating AGI is ~20%, I consider that very much an unacceptably high risk. I think even a 1% chance of extinction is unacceptably high. If some other kind of project had a 1% chance of causing human extinction, I don’t think the public would stand for it. Imagine some particle accelerator or biotech project had a 1% chance of causing human extinction. If the public found out, I think they would want the project shut down immediately until it could be pursued safely. And I think they would be justified in that, if there’s a way to coordinate on doing so.
when labs are in the process of training AGI or near-AGI systems, they will be able to see alignment opportunities that we can’t from a more theoretical or distanced POV.
Many of our most serious safety concerns might only arise with near-human-level systems, and it’s difficult or intractable to make progress on these problems without access to such AIs.
Or, y’know, you could just not build them and avoid the serious safety concerns that way?
If future large models turn out to be very dangerous, it’s essential we develop compelling evidence this is the case.
Wow. It’s like they are just agreeing with the people who say we need empirical evidence for x-risk, and are fine with offering it (with no democratic mandate to do so!)
Thanks!
This just seems like a hell of a reckless gamble to me. And you have to factor in their massive profit-making motivation. Is this really much more than mere safetywashing?
Thanks for your last paragraph. Very much agree.