Based solely on my own impression, I’d guess that one reason for the lack of engagement with your original question is that it felt like you were operating within a very specific frame, and I sensed that untangling the assumptions behind that frame (and, consequently, behind a high P(doom)) would take a lot of work. In my own case, I didn’t know which assumptions were driving your estimates, so I felt unsure which counter-arguments you’d consider relevant to your key cruxes.
(For example: many reviewers of the Carlsmith report (alongside Carlsmith himself) put P(doom) ≤ 10%. If you’ve read these responses, why did you find them uncompelling? Which specific arguments did you find faulty?)
Here’s one example from this post where I felt as though it would take a lot of work to better understand the argument you want to put forward:
“The above considerations are the basis for the case that disjunctive reasoning should predominantly be applied to AI x-risk: the default is doom.”
When I read this, I found myself asking “wait, what are the relevant disjuncts meant to be?”. I understand a disjunctive argument for doom to be saying that doom is highly likely conditional on any one of {A, B, C, … }. If each of A, B, C … is independently plausible, then obviously this looks worrying. If you say that some claim is disjunctive, I want an argument for believing that each disjunct is independently plausible, and an argument for accepting the disjunctive framing offered as the best framing for the claim at hand.
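To make concrete why the choice of framing matters so much to the bottom-line number, here is a toy calculation with entirely made-up probabilities (illustrative only; these are not Carlsmith’s, Nate’s, or anyone else’s actual estimates):

```latex
% Toy numbers only; illustrative, not anyone's actual estimates.
% Conjunctive frame: doom requires all of B_1, ..., B_4 to go wrong.
% With P(B_j) = 0.6 for each j, and independence assumed:
\[
P(\mathrm{doom}) \;=\; \prod_{j=1}^{4} P(B_j) \;=\; 0.6^{4} \;\approx\; 0.13
\]
% Disjunctive frame: any one of A_1, ..., A_4 going wrong suffices for doom.
% With P(A_i) = 0.6 for each i, and independence assumed:
\[
P(\mathrm{doom}) \;=\; 1 - \prod_{i=1}^{4} \bigl(1 - P(A_i)\bigr) \;=\; 1 - 0.4^{4} \;\approx\; 0.97
\]
```

Same per-factor numbers, very different bottom lines; so the argument for choosing one framing over the other is doing most of the work.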
For instance, here’s a disjunctive framing of something Nate said in his review of the Carlsmith Report.

For humanity to be dead by 2070, only one premise below needs to be true:
Humanity has < 20 years to prepare for AGI
The technical challenge of alignment isn’t “pretty easy”
Research culture isn’t alignment-conscious in a competent way.
Phrased this way, Nate offers a disjunctive argument. And, to be clear, I think it’s worth taking seriously. But I feel like ‘disjunctive’ and ‘conjunctive’ are often thrown around a bit too loosely, and such terms mostly serve to impede the quality of discussion. It’s not obvious to me that Nate’s framing is the best framing for the question at hand, and I expect that making the case for Nate’s framing is likely to rely on the conjunction of many assumptions. Also, that’s fine! I think it’s a valuable argument to make! I just think there should be more explicit discussions and arguments about the best framings for predicting the future of AI.
Finally, I feel like asking for “a detailed technical argument for believing P(doom|AGI) ≤ 10%” is making an isolated demand for rigor. I personally don’t think there are ‘detailed technical arguments’ for P(doom|AGI) greater than 10%. I don’t say this critically, because reasoning about the chances of doom given AGI is hard. I’m also >10% on many claims in the absence of ‘detailed, technical arguments’ for such claims, and I think we can do a lot better than we’re doing currently.
I agree that it’s important to avoid squeamishness about proclamations of confidence in pessimistic conclusions if that’s what we genuinely believe the arguments suggest. I’m also glad that you offered the ‘social explanation’ for people’s low doom estimates, even though I think it’s incorrect, and even though many people (including, tbh, me) will predictably find it annoying. In the same spirit, I’d like to offer an analogous argument: I think many arguments for p(doom | AGI) > 90% are the result of over-reliance on a specific default frame, and insufficiently careful attention to argumentative rigor. If that claim strikes you as incorrect, or brings obvious counterexamples to mind, I’d be interested to read them (and to elaborate on my dissatisfaction with existing arguments for high doom estimates).
I don’t find Carlsmith et al’s estimates convincing because they are starting with a conjunctive frame and applying conjunctive reasoning. They are assuming we’re fine by default (why?), and then building up a list of factors that need to go wrong for doom to happen.
I agree with Nate. Any one of a vast array of things can cause doom. Just the 4 broad categories mentioned at the start of the OP (subfields of Alignment), and the fact that “any given [alignment] approach that might show some promise on one or two of these still leaves the others unsolved”, are enough to provide a disjunctive frame! Where are all the alignment approaches that tackle all the threat models simultaneously? Why shouldn’t the naive prior be that we are doomed by default when dealing with something alien that is much smarter than us? [see fn.6].
“I expect that making the case for Nate’s framing is likely to rely on the conjunction of many assumptions”
Can you give an example of such assumptions? I’m not seeing it.
I feel like asking for “a detailed technical argument for believing P(doom|AGI) ≤ 10%” is making an isolated demand for rigor. I personally don’t think there are ‘detailed technical arguments’ for P(doom|AGI) greater than 10%.
This blog is ~1k words. Can you write a similar length blog for the other side, rebutting all my points?
I think many arguments for p(doom | AGI) > 90% are the result of over-reliance on a specific default frame, and insufficiently careful attention to argumentative rigor. If that claim strikes you as incorrect
It does strike me as incorrect. I’ve responded to / rebutted all comments here, and here, here, here, here, etc., and I’m not getting any satisfying rebuttals back. The $1000 bounty is still open.
Ah, thanks, and sorry I’m late getting back to you. I’ll respond to the various parts in turn.
I don’t find Carlsmith et al’s estimates convincing because they are starting with a conjunctive frame and applying conjunctive reasoning. They are assuming we’re fine by default (why?), and then building up a list of factors that need to go wrong for doom to happen.
My initial interpretation of this passage is: you seem to be saying that conjunctive/disjunctive arguments are presented against a mainline model (say, one of doom or hope). In presenting a ‘conjunctive’ argument, Carlsmith presupposes a mainline model of hope. However, you doubt the mainline model of hope, and so his argument is unconvincing. If that reading is correct, then my view is that the mainline model of doom has not been successfully argued for. What do you take to be the best argument for a ‘mainline model’ of doom? If I’m correct in interpreting the passage below as an argument for a ‘mainline model’ of doom, then it strikes me as unconvincing:
Any one of a vast array of things can cause doom. Just the 4 broad categories mentioned at the start of the OP (subfields of Alignment), and the fact that “any given [alignment] approach that might show some promise on one or two of these still leaves the others unsolved”, are enough to provide a disjunctive frame!
Under your framing, I don’t think that you’ve come anywhere close to providing an argument for your preferred disjunctive framing. On my way of viewing things, an argument for a disjunctive framing shows that “failure on intent alignment (with success in the other areas) leads to a high P(Doom | AGI), failure on outer alignment (with success in the other areas) leads to a high P(Doom | AGI), etc …”. I think that you have not shown this for any of the disjuncts, and an argument for a disjunctive frame requires showing this for all of the disjuncts.
Nate’s Framing
I claimed that an argument for (my slight alteration of) Nate’s framing was likely to rely on the conjunction of many assumptions, and you (very reasonably) asked me to spell them out. To recap, here’s the framing:
For humanity to be dead by 2070, only one of the following needs to be true:
Humanity has < 20 years to prepare for AGI
The technical challenge of alignment isn’t “pretty easy”
Research culture isn’t alignment-conscious in a competent way.
For this to be a disjunctive argument for doom, all of the following need to be true:
If humanity has < 20 years to prepare for AGI, then doom is highly likely.
Etc …
That is, the first point requires an argument which shows the following:
A Conjunctive Case for the Disjunctive Case for Doom:[1]
Even if we have a competent alignment-research culture, and
Even if the technical challenge of alignment is also pretty easy, nevertheless
Humanity is likely to go extinct if it has <20 years to prepare for AGI.
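In conditional-probability terms (my own notation, not Nate’s or yours), that first requirement, together with the analogous requirements for the other two disjuncts, looks roughly like this:

```latex
% Notation is mine and purely illustrative. Let:
%   A_1 = humanity has < 20 years to prepare for AGI
%   A_2 = the technical challenge of alignment isn't "pretty easy"
%   A_3 = research culture isn't alignment-conscious in a competent way
% The disjunctive framing needs each disjunct to carry the conclusion on its own,
% i.e. even when the other two go well:
\[
P(\mathrm{doom} \mid A_1, \neg A_2, \neg A_3) > 0.9, \qquad
P(\mathrm{doom} \mid \neg A_1, A_2, \neg A_3) > 0.9, \qquad
P(\mathrm{doom} \mid \neg A_1, \neg A_2, A_3) > 0.9
\]
% (0.9 is just an illustrative stand-in for "doom is highly likely".)
```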
If I try to spell out the arguments for this framing, things start to look pretty messy. If technical alignment were “pretty easy”, and tackled by a culture which competently pursued alignment research, then I don’t feel >90% confident in doom. The claim “if humanity has < 20 years to prepare for AGI, then doom is highly likely” requires (non-exhaustively) the following assumptions:
Obviously, the argument directly entails the following: Groups of competent alignment researchers would fail to make ‘sufficient progress’ on alignment within <20 years, even if the technical challenge of alignment is “pretty easy”.
There have to be some premises here which help make sense of why this would be true. What’s the bar for a competent ‘alignment culture’?
If the bar is low, then the claim does not seem obviously true. If the bar for ‘competent alignment-research culture’ is very high, then I think you’ll need an assumption like the one below.
With extremely high probability, the default expectation should be that the values of future AIs are unlikely to care about continued human survival, or the survival of anything we’d find valuable.
I will note that this assumption seems required to motivate the disjunctive framing above, rather than following from the framing above.
The arguments I know of for claims like this do seem to rely on strong claims about the sort of ‘plan search’ algorithms we’d expect future AIs to instantiate. For example, Rob claims that we’re on track to produce systems which approximate ‘randomly sample from the space of simplicity-weighted plans’. See discussion here.
As Paul notes, “there are just a lot of plausible ways to care a little bit (one way or the other!) about a civilization that created you, that you’ve interacted with, and which was prima facie plausibly an important actor in the world.”
By default, the values of future AIs are likely to include broadly-scoped goals, which will involve rapacious influence-seeking.
I agree that there are instrumentally convergent goals, which include some degree of power/influence-seeking. But I don’t think instrumental convergence alone gets you to ‘doom with >50%’.
A moderate desire for influence isn’t enough to get you doom. I think it’s plausible that the default path involves systems that do ‘normal-ish human activities’ in pursuit of more local goals. I quote a story from Katja Grace in my shortform here.
So far, I’ve discussed just one disjunct, but I can imagine outlining similar assumptions for the other disjuncts. For instance: if we have >20 years to conduct AI alignment research conditional on the problem not being super hard, why can’t there be a decent chance that a not-super-competent research community solves the problem? Again, I find it hard to motivate the case for a claim like that without already assuming a mainline model of doom.
I’m not saying there aren’t interesting arguments here, but I think that arguments of this type mostly assume a mainline model of doom (or the adequacy of a ‘disjunctive framing’), rather than providing independent arguments for a mainline model of doom.
Future Responses
This blog is ~1k words. Can you write a similar length blog for the other side, rebutting all my points?
I think so! But I’m unclear what, exactly, your arguments are meant to be. Also, I would personally find it much easier to engage with arguments in premise-conclusion format. Otherwise, I feel like I have to put in a lot of work just to understand the logical structure of your argument, which requires a decent chunk of time.
Still, I’m happy to chat over DM if you think that discussing this further would be profitable. Here’s my attempt to summarize your current view of things.
We’re on a doomed path, and I’d like to see arguments which could allow me to justifiably believe that there are paths which will steer us away from the default attractor state of doom. The technical problem of alignment has many component pieces, and it seems like failure to solve any one of them is likely sufficient for doom. Moreover, the problems for each piece of the alignment puzzle look ~independent.

[1] Suggestions for better argument names are not being taken at this time.
Thanks for the reply. I think the talk of 20 years is a red herring as we might only have 2 years (or less). Re your example of “A Conjunctive Case for the Disjunctive Case for Doom”, I don’t find the argument convincing because you use 20 years. Can you make the same arguments s/20/2?
And what I’m arguing for is not that we are doomed by default, but the conditional probability of doom given AGI: P(doom|AGI). I’m actually reasonably optimistic that we can just stop building AGI and therefore won’t be doomed! And that’s what I’m working toward (yes, it’s going to be a lot of work; I’d appreciate more help).
On my way of viewing things, an argument for a disjunctive framing shows that “failure on intent alignment (with success in the other areas) leads to a high P(Doom | AGI), failure on outer alignment (with success in the other areas) leads to a high P(Doom | AGI), etc …”. I think that you have not shown this for any of the disjuncts
Isn’t it obvious that none of {outer alignment, inner alignment, misuse risk, multipolar coordination} have come anywhere close to being solved? Do I really need to summarise progress to date and show why it isn’t a solution, when no one is even claiming to have a viable, scalable solution to any of them!? Isn’t it obvious that current models are only safe because they are weak? Will Claude-3 spontaneously just decide not to make napalm with the “Grandma’s bedtime story” napalm-recipe jailbreak when it’s powerful enough to do so and hooked up to a chemical factory?
So far, I’ve discussed just one disjunct, but I can imagine outlining similar assumptions for the other disjuncts.
Ok, but you really need to defeat all of them given that they are disjuncts!
I don’t think instrumental convergence alone gets you to ‘doom with >50%’.
Can you elaborate more on this? Is it because you expect AGIs to spontaneously be aligned enough to not doom us?
I’m unclear what, exactly, your arguments are meant to be. Also, I would personally find it much easier to engage with arguments in premise-conclusion format
Judging by the overall response to this post, I do think it needs a rewrite.
Here’s a quick attempt at a subset of conjunctive assumptions in Nate’s framing:
- The functional ceiling for AGI is sufficiently above the current level of human civilization to eliminate it
- There is a sharp cutoff between non-AGI AI and AGI, such that early kind-of-AGI doesn’t send up enough warning signals to cause a drastic change in trajectory.
- Early AGIs don’t result in a multi-polar world where superhuman-but-not-godlike agents can’t actually quickly and recursively self-improve, in part because none of them wants any of the others to take over—and without being able to grow stronger, humanity remains a viable player.
Thanks!

- The functional ceiling for AGI is sufficiently above the current level of human civilization to eliminate it
I don’t think anyone is seriously arguing this? (Links please if they are).
- There is a sharp cutoff between non-AGI AI and AGI, such that early kind-of-AGI doesn’t send up enough warning signals to cause a drastic change in trajectory.
We are getting the warning signals now. People (including me) are raising the alarm. I’m hoping for a drastic change of trajectory, but people actually have to put in the work for that to happen! But your point here isn’t really related to P(doom|AGI); i.e. the conditional is on getting AGI. Of course there won’t be doom if we don’t get AGI! That’s what we should be aiming for right now (not getting AGI).
- Early AGIs don’t result in a multi-polar world where superhuman-but-not-godlike agents can’t actually quickly and recursively self-improve, in part because none of them wants any of the others to take over—and without being able to grow stronger, humanity remains a viable player.
Nate may focus on singleton scenarios, but that is not a prerequisite for doom. To me, Robin Hanson’s (multipolar) Age of Em is also a kind of doom (most humans don’t exist; only a few highly productive ones are copied many times and activated only to work; a fully Malthusian economy). I don’t see how “humanity remains a viable player” in a world full of superhuman agents.