I like your points against the value of public advocacy. I’m not convinced overall, but that’s probably mostly because I’m in an overall more pessimistic state where I don’t mind trying more desperate measures.
People say this a lot, but I don’t get it.
Your baseline has to be really pessimistic before it looks good to throw in a negative-expectation high-variance intervention. (Perhaps worth making some math models and seeing when it looks good vs not.) Afaict only MIRI-sphere is pessimistic enough for this to make sense.
It’s very uncooperative and unilateralist. I don’t know why exactly it has become okay to say “well I think alignment is doomed, so it’s fine if I ruin everyone else’s work on alignment with a negative-expectation intervention”, but I dislike it and want it to stop.
Or to put it a bit more viscerally: It feels crazy to me that when I say “here are reasons your intervention is increasing x-risk”, the response is “I’m pessimistic, so actually while I agree that the effect in a typical world is to increase x-risk, it turns out that there’s this tiny slice of worlds where it made the difference and that makes the intervention good actually”. It could be true, but it sure throws up a lot of red flags.
I agree that unilateralism is bad. I’m still in discussion mode rather than confidently advocating for some specific hard-to-reverse intervention. (I should have flagged that explicitly.)
I think it’s not just the MIRI-sphere that’s very pessimistic, so there might be a situation where two camps disagree but neither camp is obviously small enough to be labelled unilateralist defectors. Seems important to figure out what to do from a group-rationality perspective in that situation. Maybe the best thing would be to agree on predictions that tell us which world we’re more likely in, and then commit to a specific action once one group turns out to be wrong about their worldview’s major crux/cruxes. (Assuming we have time for that.)
It feels crazy to me that when I say “here are reasons your intervention is increasing x-risk”, the response is “I’m pessimistic, so actually while I agree that the effect in a typical world is to increase x-risk, it turns out that there’s this tiny slice of worlds where it made the difference and that makes the intervention good actually”.
That’s not how I’d put it. I think we are still in a “typical” world, but the world that optimistic EAs assume we are in is the unlikely one where institutions around AI development and deployment suddenly turn out to be saner than our baseline would suggest. (If someone had strong reasons to think something like “[leader of major AI lab] is a leader with exceptionally high integrity who cares the most about doing the right thing; his understanding of the research and risks is pretty great, and he really knows how to manage teams and so on, so that’s why I’m confident,” then I’d be like “okay, that makes sense.”)
That’s not how I’d put it. I think we are still in a “typical” world, but the world that optimistic EAs assume we are in is the unlikely one where institutions around AI development and deployment suddenly turn out to be saner than our baseline would suggest.
I don’t see at all how this justifies that public advocacy is good? From my perspective you’re assuming we’re in an unlikely world where the public turns out to be saner than our baseline would suggest. I don’t think I have a lot of trust in institutions (though maybe I do have more trust than you do); I think I have a deep distrust of politics and the public.
I’m also not sure I understood your original argument any more. The argument I thought you were making was something like:
Consider an instrumental variable like “quality-weighted interventions that humanity puts in place to reduce AI x-risk”. Then public advocacy is:
Negative expectation: Public advocacy reduces the expected value of quality-weighted interventions, for the reasons given in the post.
High variance: Public advocacy also increases the variance of quality-weighted interventions (e.g. maybe we get a complete ban on all AI, which seems impossible without public advocacy).
However, I am pessimistic:
Pessimism: The required level of quality-weighted interventions to avoid doom is much higher than the default level we’re going to get.
Therefore, even though public advocacy is negative-expectation on quality-weighted interventions, it still reduces p(doom) due to its high variance.
(This is the only way I see to justify rebuttals like “I’m in an overall more pessimistic state where I don’t mind trying more desperate measures”, though perhaps I’m missing something.)
Is this what you meant with your original argument? If not, can you expand?
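Here’s a minimal numerical sketch of that argument structure, with entirely made-up numbers (the Gaussian “quality-weighted interventions” variable and the doom threshold are assumptions for illustration, not anything you said):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Default "quality-weighted interventions" humanity gets (arbitrary units, made up).
baseline = rng.normal(loc=1.0, scale=1.0, size=n)

# Public advocacy under the argument above: lower mean (negative expectation),
# wider spread (high variance). Numbers are purely illustrative.
with_advocacy = rng.normal(loc=0.7, scale=2.0, size=n)

# Doom happens when interventions fall short of the required threshold.
for required in [0.0, 1.0, 2.0, 3.0]:
    p_base = (baseline < required).mean()
    p_adv = (with_advocacy < required).mean()
    print(f"required={required}: p(doom) default={p_base:.2f}, with advocacy={p_adv:.2f}")
```

With these numbers, advocacy only lowers p(doom) once the required threshold sits far above the default, i.e. once default p(doom) is already well above 50%; below that it just makes things worse.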
I think it’s not just the MIRI-sphere that’s very pessimistic
What is your p(doom)?
For reference, I think it seems crazy to advocate for negative-expectation high-variance interventions if you have p(doom) < 50%. As a first pass heuristic, I think it still seems pretty unreasonable all the way up to p(doom) of < 90%, though this could be overruled by details of the intervention (how negative is the expectation, how high is the variance).
From my perspective you’re assuming we’re in an unlikely world where the public turns out to be saner than our baseline would suggest.
Hm, or that we get lucky in terms of the public’s response being a good one given the circumstances, even if I don’t expect the discourse to be nuanced. It seems like a reasonable stance to think that a crude reaction of “let’s stop this research before it’s too late” is appropriate as a first step, and that it’s okay to worry about other things later on. The negatives you point out are certainly significant, so if we could get a conditional pause setup through other channels, that seems clearly better! But my sense is that it’s unlikely we’d succeed at getting ambitious measures in place without some amount of public pressure. (For what it’s worth, I think the public pressure is already mounting, so I’m not necessarily saying we have to ramp up the advocacy side a lot – I’m definitely against forming PETA-style anti-AI movements.)
As a first pass heuristic, I think it still seems pretty unreasonable all the way up to p(doom) of < 90%,
It also matters how much weight you give to person-affecting views (I’ve argued here for why I think they’re not unreasonable). If we can delay AI takeoff for five years, that’s worth a lot from the perspective of currently-existing people! (It’s probably also weakly positive or at least neutral from a suffering-focused longtermist perspective because everything seems uncertain from that perspective and a first-pass effect is delaying things from getting bigger; though I guess you could argue that particular s-risks are lower if more alignment-research-type reflection goes into AI development.) Of course, buying a delay that somewhat (but not tremendously) worsens your chances later on is a huge cost to upside-focused longtermism. But if we assume that we’re already empirically pessimistic on that view to begin with, then it’s an open question how a moral parliament between worldviews would bargain things out. Certainly the upside-focused longtermist faction should get important concessions like “try to ensure that actually-good alignment research doesn’t fall under the type of AI research that will be prohibited.”
What is your p(doom)?
My all-things-considered view (the one I would bet on) is maybe 77%. My private view (what to report to avoid double-counting the opinions of people the community updates towards) is more like 89%. (This doesn’t consider scenarios where AI is misaligned but still nice towards humans for weird decision-theoretic reasons where the AI is cooperating with other AIs elsewhere in the multiverse – not that I consider that particularly likely, but I think it’s too confusing to keep track of that in the same assessment.)
Some context on that estimate: When I look at history, I don’t think of humans being “in control” of things. I’m skeptical of Steven Pinker’s “Better Angels” framing. Sure, a bunch of easily measurable metrics got better (though some may even have been reversing over the last couple of years, e.g., life expectancy is lower now than it was before Covid). However, at the same time, technological progress introduced new problems of its own that don’t seem anywhere close to being addressed (e.g., social media addiction, increases in loneliness, maybe polarization via attention-grabbing news). Even if there’s an underlying trend of “Better Angels,” there’s also a trend of “new technology increases the strength of Molochian forces.” We seem to be losing that battle! AI is an opportunity to gain control for the first time, via superhuman intelligence/rationality/foresight to bail us out and rein in Molochian forces once and for all, but to get there, we have to accomplish an immense feat of coordination. I don’t see why people are by default optimistic about something like that. If anything, my 11% that humans will gain control over history for the first time ever seems like the outlandish prediction here! The 89% p(doom) is more like what we should expect by default: things get faster and out of control and then that’s it for humans.
From my perspective you’re assuming we’re in an unlikely world where the public turns out to be saner than our baseline would suggest.
Hm, or that we get lucky in terms of the public’s response being a good one given the circumstances, even if I don’t expect the discourse to be nuanced.
That sounds like a rephrasing of what I said that puts a positive spin on it. (I don’t see any difference in content.)
To put it another way—you’re critiquing “optimistic EAs” about their attitudes towards institutions, but presumably they could say “we get lucky in terms of the institutional response being a good one given the circumstances”. What’s the difference between your position and theirs?
But my sense is that it’s unlikely we’d succeed at getting ambitious measures in place without some amount of public pressure.
Why do you believe that?
It also matters how much weight you give to person-affecting views (I’ve argued here for why I think they’re not unreasonable).
I don’t think people would be on board with the principle “we’ll reduce the risk of doom in 2028, at the cost of increasing risk of doom in 2033 by a larger amount”.
For me, the main argument in favor of person-affecting views is that they agree with people’s intuitions. Once a person-affecting view recommends something that disagrees with other ethical theories and with people’s intuitions, I feel pretty fine ignoring it.
The 89% p(doom) is more like what we should expect by default: things get faster and out of control and then that’s it for humans.
Your threat model seems to be “Moloch will cause doom by default, but with AI we have one chance to prevent that, but we need to do it very carefully”. But Molochian forces grow much stronger as you increase the number of actors! The first intervention would be to keep the number of actors involved as small as possible, which you do by having the few leaders race forward as fast as possible, with as much secrecy as possible. If this were my main threat model I would be much more strongly against public advocacy and probably also against both conditional and unconditional pausing.
(I do think 89% is high enough that I’d start to consider negative-expectation high-variance interventions. I would still be thinking about it super carefully though.)
That sounds like a rephrasing of what I said that puts a positive spin on it. (I don’t see any difference in content.)
Yeah, I just wanted to say that my position doesn’t require public discourse to turn out to be surprisingly nuanced.
To put it another way—you’re critiquing “optimistic EAs” about their attitudes towards institutions, but presumably they could say “we get lucky in terms of the institutional response being a good one given the circumstances”. What’s the difference between your position and theirs?
I’m hoping to get lucky but not expecting it. The more optimistic EAs seem to be expecting it (otherwise they would share more pessimism).
I don’t think people would be on board with the principle “we’ll reduce the risk of doom in 2028, at the cost of increasing risk of doom in 2033 by a larger amount”.
If no one builds transformative AI in a five-year period, the risk of doom can be brought down to close to 0%. By contrast, once we do build it (with that five-year delay), if we’re still pessimistic about the course of civilization, as I am and as I was assuming for my comments about how worldviews would react to these tradeoffs, then the success chances five years later will still be bad. (EDIT:) So, the tradeoff is something like 55% risk of death spread over a five-year period vs no death for five years, for an eventual reward of reducing the total chance of death (over a 10-year period or whatever) from 89% to 82%. (Or something a bit lower; I probably place >20% on us somehow not getting to TAI even in 10 years.)
In that scenario, I could imagine that many people would go for the first option. Many care more about life as they know it than about chances of extreme longevity and awesome virtual worlds. (Some people would certainly change their minds if they thought about the matter a lot more under ideal reasoning conditions, but I expect many [including me when I think self-orientedly] wouldn’t.)
(I acknowledge that, given your more optimistic estimates about AI going well and about low imminent takeoff risk, person-affecting concerns seem highly aligned with long-termist concerns.)
Your threat model seems to be “Moloch will cause doom by default, but with AI we have one chance to prevent that, but we need to do it very carefully”. But Molochian forces grow much stronger as you increase the number of actors! The first intervention would be to keep the number of actors involved as small as possible, which you do by having the few leaders race forward as fast as possible, with as much secrecy as possible. If this were my main threat model I would be much more strongly against public advocacy and probably also against both conditional and unconditional pausing.
That’s interesting. I indeed find myself thinking that our best chances of success come from a scenario where most large-model AI research shuts down, but a new org of extremely safety-conscious people is formed where they make progress with a large lead. I was thinking of something like the scenario you describe as “Variant 2: In addition to this widespread pause, there is a tightly controlled and monitored government project aiming to build safe AGI.” It doesn’t necessarily have to be government-led, but maybe the government has talked to evals experts and demands a tight structure where large expenditures of compute always have to be approved by a specific body of safety evals experts.
But my sense is that it’s unlikely we’d succeed at getting ambitious measures in place without some amount of public pressure.
Why do you believe that?
I don’t know. Maybe I’m wrong: If the people who are closest to DC are optimistic that lawmakers would be willing to take ambitious measures soon enough, then I’d update that public advocacy has fewer upsides (while the downsides remain). I was just assuming that the current situation is more like “some people are receptive, but there’s also a lot of resistance from lawmakers.”
So, the tradeoff is something like 55% death spread over a five-year period vs no death for five years, for an eventual reward of reducing total chance of death (over a 10y period or whatever) from 89% to 82%.
Oh we disagree much more straightforwardly. I think the 89% should be going up, not down. That seems by far the most important disagreement.
(I thought you were saying that person-affecting views means that even if the 89% goes up that could still be a good trade.)
I still don’t know why you expect the 89% to go down instead of up given public advocacy. (And in particular I don’t see why optimism vs pessimism has anything to do with it.) My claim is that it should go up.
I was thinking of something like the scenario you describe as “Variant 2: In addition to this widespread pause, there is a tightly controlled and monitored government project aiming to build safe AGI.” It doesn’t necessarily have to be government-led, but maybe the government has talked to evals experts and demands a tight structure where large expenditures of compute always have to be approved by a specific body of safety evals experts.
But why do evals matter? What’s an example story where the evals prevent Molochian forces from leading to us not being in control? I’m just not seeing how this scenario intervenes on your threat model to make it not happen.
(It does introduce government bureaucracy, which all else equal reduces the number of actors, but there’s no reason to focus on safety evals if the theory of change is “introduce lots of bureaucracy to reduce number of actors”.)
Maybe I’m wrong: If the people who are closest to DC are optimistic that lawmakers would be willing to take ambitious measures soon enough
This seems like the wrong criterion. The question is whether this strategy is more likely to succeed than others. Your timelines are short enough that no ambitious measure is going to come into place fast enough if you aim to save ~all worlds.
But e.g. ambitious measures in ~5 years seems very doable (which seems like it is around your median, so still in time for half of worlds). We’re already seeing signs of life:
Note the existence of the UK Frontier AI Taskforce and the people on it, as well as the intent bill SB 294 in California about “responsible scaling”.
You could also ask people in DC; my prediction is they’d say something reasonably similar.
Oh we disagree much more straightforwardly. I think the 89% should be going up, not down. That seems by far the most important disagreement.
I think we agree (at least as far as my example model with the numbers was concerned). The way I meant it, 82% goes up to 89%.
(My numbers were confusing because I initially said 89% was my all-things-considered probability, but in this example model, I was giving 89% as the probability for a scenario where we take a (according to your view) suboptimal action. In the example model, 82% is the best chance we can get with optimal course of action, but it comes at the price of way higher risk of death in the first five years.)
In any case, my assumptions for this example model were:
(1) Public advocacy is the only way to install an ambitious pause soon enough to reduce risks that happen before 5 years.
(2) If it succeeds at the above, public advocacy will likely also come with negative side effects that increase the risks later on.
And I mainly wanted to point out how, from a person-affecting perspective, the difference between 82% and 89% isn’t necessarily huge, whereas getting 5 years of zero risk vs 5 years of 55% cumulative risk feels like something that could matter a lot.
But one can also discuss the validity of (1) and (2). It sounds like you don’t buy (1) at all. By contrast, I think (1) is plausible, but I’m not confident in my stance here, and you raise good points.
Regarding (2), I probably agree that if you achieve an ambitious pause via public advocacy and public pressure playing a large role, this makes some things harder later.
But why do evals matter? What’s an example story where the evals prevent Molochian forces from leading to us not being in control? I’m just not seeing how this scenario intervenes on your threat model to make it not happen.
Evals prevent the accidental creation of misaligned transformative AI by the project that’s authorized to go beyond the compute cap for safety research (if necessary; obviously they don’t have to go above the cap if the returns from alignment research are high enough at lower levels of compute).
Molochian forces are one part of my threat model, but I also think alignment difficulty is high and hard takeoff is more likely than soft takeoff. (Not all these components of my worldview are entirely independent. You could argue that being unusually concerned about Molochian forces and expecting high alignment difficulty are both produced by the same underlying sentiment. Arguably, most humans aren’t really aligned with human values, for Hansonian reasons, which we can think of as a subtype of Moloch problems. Likewise, if it’s already hard to align humans to human values, it’ll probably also be hard to align AIs to those values [or at least to create an AI that is high-integrity friendly towards humans, while perhaps pursuing some of its own aims as well – I think that would be enough to generate a good outcome, so we don’t necessarily have to create AIs that care about nothing else besides human values].)
Oops, sorry for the misunderstanding. Taking your numbers at face value, and assuming that people have on average 40 years of life ahead of them (Google suggests median age is 30 and typical lifespan is 70-80), the pause gives an expected extra 2.75 years of life during the pause (delaying a 55% chance of doom by 5 years) while removing an expected 2.1 years of life (7% of 30) later on. This looks like a win on current-people-only views, but it does seem sensitive to the numbers.
I’m not super sold on the numbers. Removing the full 55% is effectively assuming that the pause definitely happens and is effective—it neglects the possibility that advocacy succeeds enough to have the negative effects, but still fails to lead to a meaningful pause. I’m not sure how much probability I assign to that scenario but it’s not negligible, and it might be more than I assign to “advocacy succeeds and effective pause happens”.
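As a rough sketch of that calculation (the numbers are the assumed ones from this thread, and the probability that advocacy actually delivers an effective pause is a hypothetical knob, not anything either of us has estimated):

```python
# Rough sketch of the expected-life-years tradeoff discussed above; all numbers
# are assumptions from this thread (55% risk during a 5-year pause, 89% -> 82%
# later, ~30 years of remaining life at the later point), not real estimates.
p_risk_during_pause = 0.55      # doom risk the pause would push back
pause_years = 5
p_extra_risk_later = 0.07       # 89% - 82%
years_remaining_later = 30

gain = p_risk_during_pause * pause_years           # 0.55 * 5  = 2.75 expected years
loss = p_extra_risk_later * years_remaining_later  # 0.07 * 30 = 2.1 expected years

# Hypothetical sensitivity: advocacy has its negative effects either way, but an
# effective pause only happens with some probability.
for p_effective_pause in [1.0, 0.75, 0.5]:
    net = gain * p_effective_pause - loss
    print(f"P(effective pause | advocacy) = {p_effective_pause}: net = {net:+.2f} expected years")
```

With these assumptions, the sign flips once the chance of advocacy actually delivering an effective pause drops below roughly 76%.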
It sounds like you don’t buy (1) at all.
I’d say it’s more like “I don’t see why we should believe (1) currently”. It could still be true. Maybe all the other methods really can’t work for some reason I’m not seeing, and that reason is overcome by public advocacy.