The descriptions you gave all seem reasonable to me. Some responses:
Do you know about any other pieces that argue something more along the lines of “actually there is a decent chance of an AI catastrophe even given normal counter-efforts”?
I’m afraid not. However, I do actually think that a world with only relatively minimal countermeasures is at least plausible.
my very shallow impression is that it focuses mainly on whether AIs will lie to get power sometimes, rather than whether this behavior will happen frequently and severely enough to lead to a catastrophe
This seems like a slight understatement to me. I think it argues that it’s plausible AIs will systematically take actions to acquire power later. If AIs are capable enough to overcome other safeguards in practice, this behavior would then be severe enough to cause a catastrophe.
One argument for risk is as follows:
1. It’s reasonably likely that powerful AIs will be schemers, and that this scheming won’t be removable with current technology without “catching” the schemer in the act (as argued for by Carlsmith 2023, which I linked).
2. Prior to technology advancing enough to remove scheming, these scheming AIs will be able to take over, and they will do so successfully.
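To make the conjunctive structure explicit, here is a rough decomposition (this product framing is my own gloss of the two steps above, not something the linked piece commits to):

$$P(\text{catastrophe via scheming}) \approx P(\text{powerful AIs scheme}) \times P(\text{successful takeover before scheming is removed or caught} \mid \text{scheming})$$

Pushing down either factor, e.g. making scheming less likely or making takeover-without-getting-caught much harder, reduces the overall risk.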
Neither step in the argument is trivial. For (2), the key questions are:
How much will safety technology advance due to the efforts of human researchers prior to powerful AI?
When scheming AIs first become transformatively useful for safety work, will we be able to employ countermeasures which allow us to extract lots of useful work from these AIs while still preventing them from being able to take over without getting caught? (See our recent work on AI Control for instance.)
What happens when scheming AIs are caught? Is this sufficient?
How long will we have with the first transformatively useful AIs prior to much more powerful AI being developed? In other words, how much work will we be able to extract from these AIs?
Can we actually get AIs to productively work on AI safety? How will we check their work, given that they might be trying to screw us over?
Many of the above questions depend on the strength of the societal response.
This is just focused on the scheming threat model, which is not the only threat model.
We (Redwood Research, where I work) might put out some posts soon which indirectly argue that (2) won’t necessarily go well by default. (These posts will also argue for tractability.)
overall it doesn’t seem to be directly about AI x-risk per se.
Agreed, but relatively little time is an important part of the overall threat model, so it seems relevant to reference when making the full argument.