Ah, thanks, and sorry I’m late getting back to you. I’ll respond to various parts in turn.
I don’t find Carlsmith et al.’s estimates convincing because they are starting with a conjunctive frame and applying conjunctive reasoning. They are assuming we’re fine by default (why?), and then building up a list of factors that need to go wrong for doom to happen.
My initial interpretation of this passage is: you seem to be saying that conjunctive/disjunctive arguments are presented against a mainline model (say, one of doom or hope). In presenting a ‘conjunctive’ argument, Carlsmith presupposes a mainline model of hope. However, you doubt the mainline model of hope, and so his argument is unconvincing. If that reading is correct, then my view is that the mainline model of doom has not been successfully argued for. What do you take to be the best argument for a ‘mainline model’ of doom? If I’m correct in interpreting the passage below as an argument for a ‘mainline model’ of doom, then it strikes me as unconvincing:
Any one of a vast array of things can cause doom. Just the 4 broad categories mentioned at the start of the OP (subfields of Alignment), and the fact that “any given [alignment] approach that might show some promise on one or two of these still leaves the others unsolved”, is enough to provide a disjunctive frame!
Under your framing, I don’t think that you’ve come anywhere close to providing an argument for your preferred disjunctive framing. On my way of viewing things, an argument for a disjunctive framing shows that “failure on intent alignment (with success in the other areas) leads to a high P(Doom | AGI), failure on outer alignment (with success in the other areas) leads to a high P(Doom | AGI), etc …”. I think that you have not shown this for any of the disjuncts, and an argument for a disjunctive frame requires showing this for all of the disjuncts.
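To make that demand concrete, here’s a rough formalization (the notation is mine; S_i just stands for ‘subproblem i is solved’):

```latex
% Disjunctive frame: each failure, on its own, has to carry the argument.
% For every subproblem i (intent alignment, outer alignment, ...), one needs:
P(\text{Doom} \mid \text{AGI},\ \neg S_i,\ S_j \text{ for all } j \neq i) \ \text{is high.}

% A Carlsmith-style conjunctive estimate instead multiplies the steps on a path to doom:
P(\text{Doom} \mid \text{AGI}) \approx \prod_k P(\text{step}_k \mid \text{AGI},\ \text{step}_1, \dots, \text{step}_{k-1})
```

Establishing the first line for one disjunct doesn’t establish it for the others; that’s the sense in which the frame needs an argument for every disjunct.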
Nate’s Framing
I claimed that an argument for (my slight alteration of) Nate’s framing was likely to rely on the conjunction of many assumptions, and you (very reasonably) asked me to spell them out. To recap, here’s the framing:
For humanity to be dead by 2070, only one of the following needs to be true:
Humanity has < 20 years to prepare for AGI
The technical challenge of alignment isn’t “pretty easy”
Research culture isn’t alignment-conscious in a competent way.
For this to be a disjunctive argument for doom, all of the following need to be true:
If humanity has < 20 years to prepare for AGI, then doom is highly likely.
Etc …
That is, the first point requires an argument which shows the following:
A Conjunctive Case for the Disjunctive Case for Doom:[1]
Even if we have a competent alignment-research culture, and
Even if the technical challenge of alignment is also pretty easy, nevertheless
Humanity is likely to go extinct if it has <20 years to prepare for AGI.
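Compressed into a single conditional claim (again, the notation is mine):

```latex
% What the first disjunct, on its own, needs an argument for:
P\big(\text{Doom} \,\big|\, \text{AGI},\ <20\text{ years to prepare},\ \text{alignment ``pretty easy''},\ \text{competent research culture}\big)\ \text{is high.}
```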
If I try to spell out the arguments for this framing, things start to look pretty messy. If technical alignment were “pretty easy”, and tackled by a culture which competently pursued alignment research, then I don’t feel >90% confident in doom. The claim “if humanity has < 20 years to prepare for AGI, then doom is highly likely” requires (non-exhaustively) the following assumptions:
Obviously, the argument directly entails the following: Groups of competent alignment researchers would fail to make ‘sufficient progress’ on alignment within <20 years, even if the technical challenge of alignment is “pretty easy”.
There have to be some premises here which help make sense of why this would be true. What’s the bar for a competent ‘alignment-research culture’?
If the bar is low, then the claim does not seem obviously true. If the bar for ‘competent alignment-research culture’ is very high, then I think you’ll need an assumption like the one below.
With extremely high probability, the default expectation should be that the values of future AIs are unlikely to care about continued human survival, or the survival of anything we’d find valuable.
I will note that this assumption seems required to motivate the disjunctive framing above, rather than being something that follows from it.
The arguments I know of for claims like this do seem to rely on strong claims about the sort of ‘plan search’ algorithms we’d expect future AIs to instantiate. For example, Rob claims that we’re on track to produce systems which approximate ‘randomly sample from the space of simplicity-weighted plans’. See discussion here.
As Paul notes, “there are just a lot of plausible ways to care a little bit (one way or the other!) about a civilization that created you, that you’ve interacted with, and which was prima facie plausibly an important actor in the world.”
By default, the values of future AIs are likely to include broadly-scoped goals, which will involve rapacious influence-seeking.
I agree that there are instrumentally convergent goals, which include some degree of power/influence-seeking. But I don’t think instrumental convergence alone gets you to ‘doom with >50%’.
It’s not enough to have a moderate desire for influence. I think it’s plausible that the default path involves systems that do ‘normal-ish human activities’ in pursuit of more local goals. I quote a story from Katja Grace in my shortform here.
So far, I’ve discussed just one disjunct, but I can imagine outlining similar assumptions for the other disjuncts. For instance: if we have >20 years to conduct AI alignment research conditional on the problem not being super hard, why can’t there be a decent chance that a not-super-competent research community solves the problem? Again, I find it hard to motivate the case for a claim like that without already assuming a mainline model of doom.
I’m not saying there aren’t interesting arguments here, but I think that arguments of this type mostly assume a mainline model of doom (or the adequacy of a ‘disjunctive framing’), rather than providing independent arguments for a mainline model of doom.
Future Responses
This blog is ~1k words. Can you write a similar length blog for the other side, rebutting all my points?
I think so! But I’m unclear what, exactly, your arguments are meant to be. Also, I would personally find it much easier to engage with arguments in premise-conclusion format. Otherwise, I feel like I have to put in a lot of work trying to understand the logical structure of your argument, which requires a decent chunk of time.
Still, I’m happy to chat over DM if you think that discussing this further would be profitable. Here’s my attempt to summarize your current view of things.
We’re on a doomed path, and I’d like to see arguments which could allow me to justifiably believe that there are paths which will steer us away from the default attractor state of doom. The technical problem of alignment has many component pieces, and it seems like failure to solve any one of the many component pieces is likely sufficient for doom. Moreover, the problems for each piece of the alignment puzzle look ~independent.
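If I’ve got that summary right, the calculation behind it looks something like the toy sketch below (the four components are the OP’s categories; the per-component numbers are purely illustrative, not figures you’ve given):

```python
# Toy version of the disjunctive/independence frame (all numbers are made up).
# If failure on any one component is sufficient for doom, and the components fail
# roughly independently, then P(Doom | AGI) = 1 - P(every component is solved).

p_solved = {
    "outer alignment": 0.8,
    "inner alignment": 0.8,
    "misuse risk": 0.8,
    "multipolar coordination": 0.8,
}

p_all_solved = 1.0
for component, p in p_solved.items():
    p_all_solved *= p

p_doom_given_agi = 1.0 - p_all_solved
print(f"P(all four solved) = {p_all_solved:.2f}")      # 0.41
print(f"P(Doom | AGI)      = {p_doom_given_agi:.2f}")  # 0.59
```

Even fairly optimistic per-component odds compound into a high P(Doom | AGI) under this frame, which is part of why I think the frame itself, rather than any individual disjunct, is what’s doing the work.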
Thanks for the reply. I think the talk of 20 years is a red herring, as we might only have 2 years (or less). Re your example of “A Conjunctive Case for the Disjunctive Case for Doom”, I don’t find the argument convincing because you use 20 years. Can you make the same arguments with s/20/2?
And what I’m arguing is not that we are doomed by default, but the conditional probability of doom given AGI: P(doom|AGI). I’m actually reasonably optimistic that we can just stop building AGI and therefore won’t be doomed! And that’s what I’m working toward (yes, it’s going to be a lot of work; I’d appreciate more help).
On my way of viewing things, an argument for a disjunctive framing shows that “failure on intent alignment (with success in the other areas) leads to a high P(Doom | AGI), failure on outer alignment (with success in the other areas) leads to a high P(Doom | AGI), etc …”. I think that you have not shown this for any of the disjuncts
Isn’t it obvious that none of {outer alignment, inner alignment, misuse risk, multipolar coordination} have come anywhere close to being solved? Do I really need to summarise progress to date and show why it isn’t a solution, when no one is even claiming to have a viable, scalable solution to any of them!? Isn’t it obvious that current models are only safe because they are weak? Will Claude-3 spontaneously decide not to make napalm with the ‘Grandma’s bedtime story’ napalm-recipe jailbreak when it’s powerful enough to do so and hooked up to a chemical factory?
So far, I’ve discussed just one disjunct, but I can imagine outlining similar assumptions for the other disjuncts.
Ok, but you really need to defeat all of them given that they are disjuncts!
I don’t think instrumental convergence alone gets you to ‘doom with >50%’.
Can you elaborate more on this? Is it because you expect AGIs to spontaneously be aligned enough to not doom us?
I’m unclear what, exactly, your arguments are meant to be. Also, I would personally find it much easier to engage with arguments in premise-conclusion format
Judging by the overall response to this post, I do think it needs a rewrite.
[1] Suggestions for better argument names are not being taken at this time.