My question is: what is the current best single article that provides a well-reasoned and comprehensive case for believing that there is a substantial (>10%) probability of an AI catastrophe this century?
Unfortunately, I don’t think there is any such article which seems basically up-to-date and reasonable to me.
Here are reasonably up-to-date posts which seem pretty representative to me, but aren’t comprehensive. Hopefully this is still somewhat helpful:

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Scheming AIs: Will AIs fake alignment during training in order to get power?
What a compute-centric framework says about AI takeoff speeds

On specifically “will the AI literally kill everyone”, I think the most up-to-date discussion is here, here, and here.

I think an updated comprehensive case is an open project that might happen in the next few years.
Thanks. It’s unfortunate there isn’t any single article that presents the case comprehensively. I’m OK with replying to multiple articles as an alternative.
Regarding the pieces you mentioned:
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
My understanding is that (as per the title) this piece argues that a catastrophe is likely without specific countermeasures, but it seems extremely likely that specific countermeasures will be taken to prevent a catastrophe, at least indirectly. Do you know of any other pieces that argue something more along the lines of “actually there is a decent chance of an AI catastrophe even given normal counter-efforts”?
Scheming AIs: Will AIs fake alignment during training in order to get power?
While I haven’t digested this post yet, my very shallow impression is that it focuses mainly on whether AIs will lie to get power sometimes, rather than whether this behavior will happen frequently and severely enough to lead to a catastrophe. I think it is very likely that AIs will sometimes lie to get power, just as humans do, but it seems like there’s a lot more that you’d need to argue to show that this might be catastrophic. Am I wrong in my impression?
What a compute-centric framework says about AI takeoff speeds
I’d like to note that I don’t think I have any critical disagreements with this piece, and overall it doesn’t seem to be directly about AI x-risk per se.
it seems extremely likely that specific countermeasures will be taken to prevent a catastrophe, at least indirectly.
This suggests that one of your cruxes with mainstream EA views is: “EAs believe there won’t be countermeasures, but countermeasures are very likely, and they significantly mitigate the risk from AI beyond what EAs believe.” (If that is not one of your cruxes, then you can ignore the rest of this!)
The confusing thing about that is, what if EA activities are a key reason why good countermeasures end up being taken against AI? In that case, EA arguments would be a “victim” of their own success (though no one would be complaining!). But that doesn’t seem like a reason to disagree right now, when there is the common ground of “specific countermeasures really need to be taken”.
The confusing thing about that is, what if EA activities are a key reason why good countermeasures end up being taken against AI?
I find that quite unlikely. I think EA activities contribute on the margin, but it seems very likely to me that people would eventually take measures against AI risk even in the absence of any EA movement.
In general, while I agree we should not take this argument so far that EA ideas become “victims of their own success”, I also think neglectedness is a standard barometer EAs use to judge the merits of their interventions. And I think AI risk mitigation will very likely not be a neglected field in the future, which should substantially downweight our evaluation of AI risk mitigation efforts.
As a trivial example, you’d surely concede that EAs should not try to, e.g., work on making sure that future spacecraft designs are safe? Advanced spacecraft could indeed play a very important role in the future, but it seems unlikely that society would neglect to work on spacecraft safety, making this a fairly unimportant problem to work on right now. To be clear, I definitely don’t think the case for working on AI risk mitigation is as bad as the case for working on spacecraft safety; my point is just that the same idea applies in both cases.
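To illustrate the downweighting logic, here is a toy model of my own (the logarithmic-returns assumption and all numbers are made up for illustration): if total value grows roughly logarithmically with the total effort a field receives, the marginal value of one extra contributor falls with how much effort society will eventually supply anyway.

```python
def marginal_value(eventual_total_effort, importance=1.0):
    """Toy model: total value = importance * ln(1 + effort), so the
    marginal value of one extra unit of effort is importance / (1 + effort)."""
    return importance / (1.0 + eventual_total_effort)

# Hypothetical effort levels; only the ratio matters for prioritization.
print(marginal_value(10))    # neglected field: ~0.091
print(marginal_value(1000))  # crowded field:   ~0.001
```

On this toy model, a field that will eventually attract a thousand times the effort offers roughly a thousandth of the marginal value per contributor, which is the sense in which expected future crowdedness can downweight work today.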
The descriptions you gave all seem reasonable to me. Some responses:
Do you know of any other pieces that argue something more along the lines of “actually there is a decent chance of an AI catastrophe even given normal counter-efforts”?
I’m afraid not. However, I do actually think that relatively minimal countermeasures are at least plausible.
my very shallow impression is that it focuses mainly on whether AIs will lie to get power sometimes, rather than whether this behavior will happen frequently and severely enough to lead to a catastrophe
This seems like a slight understatement to me. I think it argues that it is plausible that AIs will systematically take actions to acquire power later. This behavior would then be severe enough to cause a catastrophe if AIs were capable enough to overcome our other safeguards in practice.
One argument for risk is as follows:
1. It’s reasonably likely that powerful AIs will be schemers, and this scheming won’t be removable with current technology without “catching” the schemer in the act (as argued for in Carlsmith 2023, which I linked).
2. Prior to technology advancing enough to remove scheming, these scheming AIs will be able to take over, and they will do so successfully.
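To make the structure concrete, here is a minimal numerical sketch of this two-step decomposition. Both probabilities are hypothetical placeholders for illustration, not estimates I’m endorsing:

```python
# Toy decomposition of the two-step scheming argument above.
# Both probabilities are hypothetical placeholders, not endorsed estimates.

p_scheming = 0.25  # step 1: powerful AIs scheme, and scheming isn't removable
                   # with current technology without catching them in the act
p_takeover_given_scheming = 0.45  # step 2: scheming AIs successfully take over
                                  # before safety technology catches up

p_catastrophe = p_scheming * p_takeover_given_scheming
print(f"P(catastrophe via scheming) = {p_catastrophe:.4f}")  # 0.1125, just over 10%
```

The questions below all act on the second factor, and (as noted further down) this covers only the scheming threat model.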
Neither step in the argument is trivial. For (2), the key questions are:
How much will safety technology advance due to the efforts of human researchers prior to powerful AI?
When scheming AIs first become transformatively useful for safety work, will we be able to employ countermeasures which allow us to extract lots of useful work from these AIs while still preventing them from being able to take over without getting caught? (See our recent work on AI Control, for instance; a toy sketch of one such protocol follows this list.)
What happens when scheming AIs are caught? Is this sufficient?
How long will we have with the first transformatively useful AIs before much more powerful AI is developed? That is, how much work will we be able to extract from these AIs?
Can we actually get AIs to productively work on AI safety? How will we check their work, given that they might be trying to screw us over?
Many of the above questions depend on the strength of the societal response.
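For readers unfamiliar with the control approach mentioned above, here is a highly simplified sketch of one protocol in that spirit (trusted monitoring). The function names, the audit budget, and the control flow are illustrative stand-ins, not the actual implementation from our work:

```python
AUDIT_BUDGET = 0.02  # fraction of outputs humans can afford to audit (illustrative)

def run_with_trusted_monitoring(tasks, untrusted_model, trusted_monitor, human_audit):
    """Extract work from a powerful, possibly scheming model while routing
    the most suspicious outputs to human auditors. All callables passed in
    are hypothetical stand-ins."""
    scored = []
    for task in tasks:
        output = untrusted_model(task)             # capable but untrusted
        suspicion = trusted_monitor(task, output)  # weaker, trusted model scores output
        scored.append((suspicion, task, output))
    # Spend the limited human-audit budget on the most suspicious outputs.
    scored.sort(key=lambda t: t[0], reverse=True)
    n_audit = max(1, int(AUDIT_BUDGET * len(scored)))
    for suspicion, task, output in scored[:n_audit]:
        if human_audit(task, output):  # returns True if sabotage is found
            raise RuntimeError("Caught attempted sabotage; halt and investigate")
    return [(task, output) for _, task, output in scored]
```

Protocols like this are exactly where the questions above bite: whether catching a schemer is sufficient, and how much useful work can be extracted before much more powerful systems arrive.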
This is just focused on the scheming threat model, which is not the only threat model.
We (Redwood Research, where I work) might put out some posts soon which indirectly argue that (2) won’t necessarily go well by default. (These posts will also argue for tractability.)
overall it doesn’t seem to be directly about AI x-risk per se.
Agreed, but having relatively little time is an important part of the overall threat model, so it seems relevant to reference when making the full argument.