While finm made a general comment in response to you, I want to specifically focus on the footnote, because I think it’s a central crux in why a lot of EAs are way less doomy than you.
Quote below:
We need at least 13 9s of safety for ASI, and the best current alignment techniques aren’t even getting 3 9s...
At the very least, I think the 13 9s can be reduced to something requiring closer to 1-2 9s, and there are two reasons for this:
1. I think you drastically overestimate how many chances the AI gets at misalignment, because the trillions of executions will use far, far too little compute per single action to lead to a takeover, even in worlds where a software intelligence explosion is possible and large software efficiency gains are achievable. I expect there to be at most millions of such plausible actions, and the more likely scenario is closer to hundreds or thousands of chances to escape (see the rough sketch after this list), undermining this tweet:
https://x.com/gcolbourn/status/1762224406955216897
2. If we manage to catch an AI doing bad stuff, then it’s much, much harder for the AI to escape, because there are a lot of techniques that can be applied to make the issue go away, like proliferating the escape example.
More from Ryan Greenblatt here:
https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed
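To make the nines arithmetic concrete, here is a rough sketch. It is a toy model of my own, assuming each chance is independent and that we want the cumulative takeover probability to stay under a 50% budget; the `nines_needed` helper and the specific numbers are purely illustrative.

```python
import math

# Toy model: if the AI gets N independent chances and each chance causes a
# takeover with probability p, keeping the *cumulative* takeover probability
# under a budget requires 1 - (1 - p)**N <= budget.
# "Nines" of per-action safety is then -log10(p).
def nines_needed(num_chances: int, risk_budget: float = 0.5) -> float:
    p_max = -math.expm1(math.log1p(-risk_budget) / num_chances)  # numerically stable
    return -math.log10(p_max)

for n in (10**13, 10**12, 10**6, 10**3, 10**2):
    print(f"{n:>18,} chances -> ~{nines_needed(n):4.1f} nines per action")
```

On these assumptions, the required number of nines scales roughly with log10 of the number of chances, which is exactly the gap between trillions of chances and hundreds-to-thousands of chances; point 2 then pushes the effective requirement down further.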
I definitely think alignment needs to be reliable, but I do not think it needs to be so reliable that we cannot achieve it, or that doom is very likely and we can’t change the probabilities.
I’d certainly say it’s quite scary, but I do think there’s a reasonable hope of surviving and going on to thrive, such that I think alignment investment is worth the money.
I think you drastically overestimate how many chances the AI gets at misalignment, because the trillions of executions will use far, far too little compute per single action to lead to a takeover
The little compute leads to much more once it has escaped!
If we manage to catch an AI doing bad stuff
The point is that we won’t, unless we have many more 9s of reliability in terms of catching such attempts!
I agree that, conditional on an escape/rogue internal deployment like this scenario by Buck (with a lot of contributors), it leads to much larger disasters, and if the AI is unaligned, then unless we have an aligned AI with somewhat similar capabilities, we lose.
https://www.lesswrong.com/posts/ceBpLHJDdCt3xfEok/ai-catastrophes-and-rogue-deployments
My point is more that you are way overestimating how many chances the AI has to overthrow us before it is aligned.
But the crux might be that I don’t think we need that much reliability for catching the AI, because as long as the AI is caught with non-negligible probability, the AI has to be very cautious: being caught is far worse for the AI than either succeeding or the game simply ending.
In essence, we probably only need to catch it once, or at most a few times, to completely change the game in ways that massively favor alignment success.
You do not need to catch the AI reliably if you can make the cost of being caught out very painful for it, and I do suspect we will probably be able to do this (perhaps using trusted AI labor, to use Ryan Greenblatt and Buck’s term).
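As a rough illustration of that calculus, here is a toy model with made-up numbers, assuming a simple one-shot payoff for the AI and independent attempts:

```python
# Toy model: even an unreliable per-attempt catch probability compounds
# across attempts, and if being caught is far more costly to the AI than
# success is valuable, attempting at all has negative expected value for it.
def prob_caught_at_least_once(catch_prob: float, attempts: int) -> float:
    return 1.0 - (1.0 - catch_prob) ** attempts

for c in (0.05, 0.2, 0.5):
    print(f"per-attempt catch prob {c:.2f}: "
          f"caught within 10 attempts ~{prob_caught_at_least_once(c, 10):.0%}, "
          f"within 50 ~{prob_caught_at_least_once(c, 50):.0%}")

# Purely illustrative payoffs from the AI's perspective.
value_of_success = 1.0
cost_of_being_caught = 100.0   # being caught >> failing quietly or the game ending
p_success, p_caught = 0.30, 0.10
expected_value = p_success * value_of_success - p_caught * cost_of_being_caught
print(f"expected value of one escape attempt: {expected_value:+.1f}")
```

The point of the toy numbers is just that the deterrence comes from the asymmetry of the payoffs, not from the catching being anywhere near reliable.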
This is begging the question! My whole objection is that alignment of ASI hasn’t been established to be possible.
as long as the AI is caught with non-negligible probability, the AI has to be very cautious: being caught is far worse for the AI than either succeeding or the game simply ending.
So it will worry about being in a kind of panopticon? Seems pretty unlikely. Why should the AI care about being caught any more than it should about any given runtime instance of it being terminated?
This is begging the question! My whole objection is that alignment of ASI hasn’t been established to be possible.
A couple of things I’ll say here:
You do not need a strong theory for why something must be possible in order to put non-trivial credence on it being possible. If you hold a prior that the scientific difficulty of doing things is often overrated, and especially if you believe that alignment is plausibly automatable and that a lot of people overrate the difficulty of automating things, that’s enough to cut p(doom) by a lot, arguably by an order of magnitude, and at the very least to nowhere near your 90% p(doom). That doesn’t mean that we are going to make it out of ASI alive, but it does mean that even in situations where there is no established theory or plan for survival, you can still possibly do something.
If I wanted to make the case that ASI alignment is possible, I’d probably read these 3 posts by Joshua Clymer first on how automated alignment schemes could work (with some discussion by Habryka, Eliezer Yudkowsky, and Jeremy Gillen in the comments, and Joshua Clymer’s responses):
https://www.lesswrong.com/posts/8vgi3fBWPFDLBBcAx/planning-for-extreme-ai-risks
https://www.lesswrong.com/posts/TTFsKxQThrqgWeXYJ/how-might-we-safely-pass-the-buck-to-ai
https://www.lesswrong.com/posts/5gmALpCetyjkSPEDr/training-ai-to-do-alignment-research-we-don-t-already-know
So it will worry about being in a kind of panopticon? Seems pretty unlikely. Why should the AI care about being caught any more than it should about any given runtime instance of it being terminated?
The basic reason for this is that once you have caught the AI red-handed, you gain way more information on it and can apply much more targeted, more effective countermeasures.
As a bonus, this can also eliminate threat models like sandbagging, if you have found a reproducible signal for when an AI will try to overthrow a lab.
More discussion by Ryan Greenblatt and Buck here:
https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed
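To gesture at what proliferating a caught example could look like in practice, here is a deliberately toy sketch; every function and number in it is a placeholder I am making up for illustration, not a real API or a real monitoring setup:

```python
import random
from typing import Callable, List

# Placeholder for generating many variations of one caught escape attempt
# (in practice this would involve rephrasing/resampling the caught scenario).
def proliferate(caught_example: str, n_variants: int = 200) -> List[str]:
    return [f"{caught_example} (variant {i})" for i in range(n_variants)]

# Estimate how often a monitor flags the proliferated cases; a reproducible
# signal here is what lets you target countermeasures (and check for
# sandbagging-style behavior) rather than hoping to catch the AI live again.
def flag_rate(monitor: Callable[[str], bool], cases: List[str]) -> float:
    return sum(monitor(case) for case in cases) / len(cases)

toy_monitor = lambda case: random.random() < 0.8   # stand-in for a real detector

variants = proliferate("caught escape attempt")
print(f"monitor flags ~{flag_rate(toy_monitor, variants):.0%} of proliferated variants")
```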