Nice analysis, Stan!
Under your assumptions and definitions, I think your 20.2 % probability of nuclear winter if there is a large-scale nuclear war is a significant overestimate. You calculated it using the mean of a beta distribution. I am not sure how you defined that distribution, but it is supposed to represent the 3 point estimates you are aggregating: 60 %, 8.96 % and 0.0355 %. In any case, 20.2 % is quite:
Different from the output of what I think are good aggregation methods:
The geometric mean of odds, which I think should be the default method to aggregate probabilities, results in 3.61 % (= 1/(1 + (0.6/(1 − 0.6)*0.0896/(1 − 0.0896)*0.000355/(1 − 0.000355))^(-1/3))), which is 17.9 % (= 0.0361/0.202) of your value.
The geometric mean, which performed the best among unweighted methods on Metaculus’ data, results in 2.67 % (= (0.6*0.0896*0.000355)^(1/3)), which is 13.2 % (= 0.0267/0.202) of your value. Samotsvety used the geometric mean after removing the lowest and highest values to aggregate estimates related to the probability of nuclear war from 7 forecasters whose predictions often differed a lot from one another, as is the case for the 3 probabilities you are aggregating.
Similar to the output of what I think are bad aggregation methods:
The maximum likelihood estimator (MLE) of the mean of a beta distribution with the 3 aforementioned probabilities as random samples results in 21.7 %. On Metaculus’ data, beta_mean_weighted performed worse than geo_mean_odds_weighted, median_weighted and beta_median_weighted.
The 23.0 % (= (0.6 + 0.0896 + 0.000355)/3) I get for the mean of the 3 aforementioned probabilities. Again, on Metaculus’ data, mean_weighted performed worse than geo_mean_odds_weighted, median_weighted and beta_median_weighted.
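For concreteness, here is a minimal sketch in Python (my own illustration; the analyses above do not include code) of the 3 simpler aggregation methods applied to the 3 point estimates. The beta-fit MLE is omitted because it requires a numerical fit.

```python
import math

# The 3 point estimates of the probability of nuclear winter given a large-scale nuclear war.
probs = [0.6, 0.0896, 0.000355]
n = len(probs)

# Geometric mean of odds, converted back to a probability (~3.61 %).
odds = [p / (1 - p) for p in probs]
pooled_odds = math.prod(odds) ** (1 / n)
geo_mean_odds = pooled_odds / (1 + pooled_odds)

# Geometric mean of the probabilities (~2.67 %).
geo_mean = math.prod(probs) ** (1 / n)

# Arithmetic mean of the probabilities (~23.0 %).
mean = sum(probs) / n

print(f"geo mean of odds: {geo_mean_odds:.2%}, geo mean: {geo_mean:.2%}, mean: {mean:.2%}")
```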
A common thread here is that aggregation methods which ignore information from extreme predictions tend to be worse (although one should be careful not to overweight them). As Jaime said with respect to the mean (and I think the same applies to the MLE of the mean of a beta distribution fitted to the samples):
The arithmetic mean of probabilities ignores information from extreme predictions
The arithmetic mean of probabilities ignores extreme predictions in favor of tamer results, to the extent that even large changes to individual predictions will barely be reflected in the aggregate prediction.
As an illustrative example, consider an outsider expert and an insider expert on a topic, who are asked to predict an event. The outsider expert is reasonably uncertain about the event, and assigns a probability of around 10% to it. The insider has privileged information about the event, and assigns to it a very low probability.
Ideally, we would like the aggregate probability to be reasonably sensitive to the strength of the evidence provided by the insider expert—if the insider assigns a probability of 1 in 1000 the outcome should be meaningfully different than if the insider assigns a probability of 1 in 10,000 [9].
The arithmetic mean of probabilities does not achieve this—in both cases the pooled probability is around (10% + 1/1,000)/2 ≈ (10% + 1/10,000)/2 ≈ 5.00%. The uncertain prediction has effectively overwritten the information in the more precise prediction.
The geometric mean of odds works better in this situation. We have that [(1:9)×(1:999)]^(1/2) ≈ 1:95, while [(1:9)×(1:9999)]^(1/2) ≈ 1:300. Those correspond respectively to probabilities of 1.04% and 0.33%, showing the greater sensitivity to the evidence the insider brings to the table.
See (Baron et al, 2014) for more discussion on the distortive effects of the arithmetic mean of probabilities and other aggregates.
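To make the quoted example concrete, here is a small sketch (my own, not from Jaime’s post) of the two pooling rules applied to the outsider’s 10 % and an insider estimate of 1/1,000 or 1/10,000:

```python
def pool_arithmetic_mean(ps):
    # Arithmetic mean of the probabilities.
    return sum(ps) / len(ps)

def pool_geometric_mean_of_odds(ps):
    # Geometric mean of the odds, converted back to a probability.
    odds = 1.0
    for p in ps:
        odds *= p / (1 - p)
    odds **= 1 / len(ps)
    return odds / (1 + odds)

for insider in (1 / 1_000, 1 / 10_000):
    forecasts = [0.10, insider]
    # The arithmetic mean stays around 5 % in both cases, whereas the geometric
    # mean of odds moves from about 1.04 % to about 0.33 %.
    print(f"{pool_arithmetic_mean(forecasts):.2%} vs {pool_geometric_mean_of_odds(forecasts):.2%}")
```

The arithmetic mean barely moves between the two insider forecasts, whereas the geometric mean of odds shifts by a factor of about 3, which is the sensitivity the quote describes.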
For these reasons, I would aggregate the 3 probabilities using the geometric mean of odds, in which case the final probability would be 17.9 % as large as yours, i.e. 3.61 % instead of 20.2 %.
CEARCH finds the cost-effectiveness of conducting a pilot study of a resilient food source to be 10,000 DALYs per USD 100,000, which is around 14× as cost-effective as giving to a GiveWell top charity[1] (link to full CEA).
Based on my adjustment to the probability of nuclear winter, I would conclude the cost-effectiveness is 2.51 (= 14*0.179) times that of GiveWell’s top charities (ignoring effects on animals), i.e. within the same order of magnitude. This would be in agreement with what I said in my analysis of nuclear famine about the cost-effectiveness of activities related to resilient food solutions:
I guess the true cost-effectiveness is within the same order of magnitude of that of GiveWell’s top charities
I should also note there are way more cost-effective interventions to increase welfare:
I have argued corporate campaigns for chicken welfare are 3 orders of magnitude more cost-effective than GiveWell’s top charities
In addition, life-saving interventions have to contend with the meat-eater problem:
From a nearterm perspective, I am concerned with the meat-eater problem, and believe it can be a crucial consideration. The people whose lives were saved thanks to resilient food solutions would go on to eat factory-farmed animals, which may well have sufficiently bad lives for the decrease in human mortality to cause net suffering. In fact, net global welfare may be negative and declining.
I estimated the annual welfare of all farmed animals combined is −4.64 times that of all humans combined[70], which suggests not saving a random human life might be good (−4.64 < −1). Nonetheless, my estimate is not resilient, so I am mostly agnostic with respect to saving random human lives. There is also a potentially dominant beneficial/harmful effect on wild animals.
Accordingly, I am uncertain about whether decreasing famine deaths due to the climatic effects of nuclear war would be beneficial or harmful. I think the answer would depend on the country, with saving lives being more beneficial in (usually low income) countries with lower consumption per capita of farmed animals with bad lives. I calculated the cost-effectiveness of saving lives in the countries targeted by GiveWell’s top charities only decreases by 8.72 % accounting for negative effects on farmed animals, which means it would still be beneficial (0.0872 < 1).
Some hopes would be:
Resilient food solutions mostly save lives in countries where there is low consumption per capita of animals with bad lives.
The conditions of animals significantly improving, or the consumption of animals with bad lives majorly decreasing in the next few decades[71], before an eventual nuclear war starts.
The decreased consumption of animals in high income countries during the 1st few years after the nuclear war persisting to some extent[72].
Bear in mind price-, taste-, and convenience-competitive plant-based meat would not currently replace meat.
I would also be curious to know whether CEARCH has been mostly using the mean, or other methods underweighting low predictions, to aggregate probabilities differing a lot between them, both in this analysis and others. I think using the mean will tend to result in overestimating the cost-effectiveness, which might explain some of the estimates I consider intuitively quite high.
Thanks for the comment, Vasco!
We have been thinking about aggregation methods a lot here at CEARCH, and our views on it are evolving. A few months ago we switched to using the geometric mean as our default aggregation method—although we are considering switching to the geometric mean of odds for probabilities, based on Simon M’s persuasive post that you referenced (although in many cases the difference is very small).
Firstly I’d like to say that our main weakness on the nuclear winter probability is a lack of information. Experts in the field are not forthcoming on probabilities, and most modeling papers use point-estimates and only consider one nuclear war scenario. One of my top priorities as we take this project to the “Deep” stage is to improve on this nuclear winter probability estimate. This will likely involve asking more experts for inside views, and exploring what happens to some of the top models when we introduce some uncertainty at each stage.
I think you are generally right that we should go with the method that works the best on relatively large forecasting datasets like Metaculus. In this case I think there is a bit more room for personal discretion, given that I am working from only three forecasts, where one is more than two orders of magnitude smaller than the others. I feel that in this situation—some experts think nuclear winter is an almost-inevitable consequence for large-scale nuclear war, others think it is very unlikely—it would just feel unjustifiably confident to conclude that the probability is only 2%. Especially since two of these three estimates are in-house estimates.
Thanks for the reply, Stan!
We have been thinking about aggregation methods a lot here at CEARCH, and our views on it are evolving. A few months ago we switched to using the geometric mean as our default aggregation method—although we are considering switching to the geometric mean of odds for probabilities, based on Simon M’s persuasive post that you referenced (although in many cases the difference is very small).
Cool!
Firstly I’d like to say that our main weakness on the nuclear winter probability is a lack of information. Experts in the field are not forthcoming on probabilities, and most modeling papers use point-estimates and only consider one nuclear war scenario.
Right, I wish experts were more transparent about their best guesses and uncertainty (accounting for the limitations of their studies).
One of my top priorities as we take this project to the “Deep” stage is to improve on this nuclear winter probability estimate. This will likely involve asking more experts for inside views, and exploring what happens to some of the top models when we introduce some uncertainty at each stage.
Nice to know there is going to be more analysis! I think one important limitation of your current model, which I would try to eliminate in further work, is that it relies on the vague concept of nuclear winter to define the climatic effects. You calculate the expected mortality by multiplying:
Probability of a large nuclear war.
Probability of nuclear winter if there is a large nuclear war.
Expected mortality if there is a nuclear winter.
However, I believe it is better to rely on a more precise concept to assess the climatic effects, namely the amount of soot injected into the stratosphere, or the mean drop in global temperature over a certain period (e.g. 2 years) after the nuclear war. In my analysis, I relied on the amount of soot, estimating the expected famine deaths due to the climatic effects by multiplying:
Probability of a large nuclear war.
Expected soot injection into the stratosphere if there is a large nuclear war.
Expected famine deaths due to the climatic effects for the expected soot injection into the stratosphere.
Ideally, I would get the expected famine deaths by multiplying:
Probability of a large nuclear war.
Expected famine deaths if there is a large nuclear war. To obtain the distribution of the famine deaths, I would:
Define a logistic function describing the famine deaths as a function of the soot injected into the stratosphere (or, even better, the mean drop in global temperature over a certain period). In my analysis, I approximated the logistic function as a piecewise linear function.
Input into the above function a distribution for the soot injected into the stratosphere if there is a large nuclear war (or, even better, the mean drop in global temperature over a certain period if there is a large nuclear war). To obtain this soot distribution, I would:
Define a function describing the soot injected into the stratosphere as a function of the number of offensive nuclear detonations.
Input into the above function a distribution for the number of offensive nuclear detonations if there is a large nuclear war.
Luisa followed something like the above, although I think her results are super pessimistic.
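To illustrate the structure of that ideal approach (not the actual numbers), here is a minimal Monte Carlo sketch; every parameter below is a made-up placeholder rather than an estimate from my analysis or Luisa’s:

```python
import math
import random

random.seed(0)

# All numbers below are hypothetical placeholders chosen only to illustrate the
# pipeline "detonations -> soot -> famine deaths"; they are not estimates from the post.
P_LARGE_NUCLEAR_WAR = 0.01  # placeholder probability of a large nuclear war

def soot_from_detonations(n_detonations: float) -> float:
    # Soot injected into the stratosphere (Tg) as a function of the number of
    # offensive nuclear detonations (placeholder linear relationship).
    return 0.03 * n_detonations

def famine_death_fraction(soot_tg: float) -> float:
    # Logistic curve for the fraction of the population dying of famine as a
    # function of the soot injection (placeholder midpoint of 30 Tg and slope).
    return 1 / (1 + math.exp(-(soot_tg - 30) / 10))

def sample_death_fraction_given_war() -> float:
    # Placeholder lognormal distribution for the number of offensive nuclear
    # detonations conditional on a large nuclear war.
    n_detonations = random.lognormvariate(math.log(1_000), 1.0)
    return famine_death_fraction(soot_from_detonations(n_detonations))

n_samples = 100_000
expected_deaths_given_war = sum(sample_death_fraction_given_war() for _ in range(n_samples)) / n_samples
expected_death_fraction = P_LARGE_NUCLEAR_WAR * expected_deaths_given_war
print(f"Expected famine death fraction: {expected_death_fraction:.3%}")
```

Replacing the soot variable with the mean drop in global temperature, as suggested above, would only change the two intermediate functions.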
I think you are generally right that we should go with the method that works the best on relatively large forecasting datasets like Metaculus. In this case I think there is a bit more room for personal discretion, given that I am working from only three forecasts, where one is more than two orders of magnitude smaller than the others.
Fair point, there is no data on which method is best when we are just aggregating 3 forecasts. That being said:
A priori, it seems reasonable to assume that the best method for large samples is also the best method for small samples.
Samotsvety aggregated predictions from 7 forecasters[1] which differed a lot from one another, and still used a modified version of the geometric mean, which ensures predictions smaller than 10 % of the mean are not ignored. A priori, it seems sensible to use an aggregation method that one of the most accomplished forecasting groups uses.
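As a sketch, the simpler trimmed variant mentioned earlier (geometric mean after dropping the single lowest and highest forecasts; Samotsvety’s exact modification may differ) could look like this, using the 7 forecasts from footnote [1]:

```python
import math

def trimmed_geometric_mean(probs):
    # Geometric mean of the forecasts after removing the single lowest and highest values.
    if len(probs) < 3:
        raise ValueError("need at least 3 forecasts to drop the extremes")
    trimmed = sorted(probs)[1:-1]
    return math.prod(trimmed) ** (1 / len(trimmed))

# The 7 forecasts for the London question quoted in footnote [1].
forecasts = [0.01, 0.00056, 0.001251, 1e-8, 0.000144, 0.0012, 0.001]
print(f"{trimmed_geometric_mean(forecasts):.4%}")
```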
I feel that in this situation—some experts think nuclear winter is an almost-inevitable consequence for large-scale nuclear war, others think it is very unlikely—it would just feel unjustifiably confident to conclude that the probability is only 2%. Especially since two of these three estimates are in-house estimates.
I think there is a natural human bias towards thinking that the probability of events whose plausibility is hard to assess (not lotteries) has to be somewhere between 10 % and 90 %. In general, my view is more that it feels overconfident to ignore predictions, and using the mean does this when samples differ a lot among them. To illustrate, if I am trying to aggregate N probabilities, 10 %, 1 %, 0.1 %, …, and 10^-N, for N = 7:
The probability corresponding to the geometric mean of odds is 0.0152 % (= 1/(1 + (1/9)^(-(1 + 7)/2*7/7))), which is 1.52 times the median of 0.01 %.
The mean is 1.59 % (= 0.1*(1 − 0.1^7)/(1 − 0.1)/7), i.e. 159 times the median.
I think the mean is implausible because:
Ignoring the 4 to 5 lowest predictions among only 7 seems unjustifiable, and using the mean is equivalent to using the probability corresponding to the geometric mean of odds with 0 weight on the 4 to 5 lowest predictions, which would lead to 0.894 % (= 1/(1 + (1/9)^(-(1 + 5)/2*5/7))) to 4.15 % (= 1/(1 + (1/9)^(-(1 + 4)/2*4/7))).
Ignoring the 3 lowest and 3 highest predictions among 7 seems justifiable, and would lead to the median, whereas the mean is 159 times the median.
You say 2 % probability of nuclear winter conditional on large nuclear war seems unjustifiable, but note the geometric mean of odds implies 4 %. In any case, I suspect the reason even this would feel too high is that it may in fact be too high, depending on how one defines nuclear winter, but that you are overestimating famine deaths conditional on nuclear winter. You put a weight of:
1⁄3 on Luisa’s results multiplied by 0.5, but I think the weight may still be too high given how pessimistic they are. Luisa predicts a 5 % chance of at least 36 % deaths (= 2.7/7.5), which looks quite high to me.
2⁄3 on Xia 2022’s results multiplied by 0.75, but this seems like an insufficient adjustment given you are relying on 37.5 % famine deaths, and this refers to no adaptation. Reducing food waste, decreasing the consumption of animals, expanding cultivated area, and reducing the production of biofuels are all quite plausible adaptation measures to me. So I think their baseline scenario is quite pessimistic, unless you also want to account for deaths indirectly resulting from infrastructure destruction which would happen even without any nuclear winter. I have some thoughts on reasons Xia 2022’s famine deaths may be too low or too high here.
At the end of the day, I should say our estimates for the famine deaths are pretty much in agreement. I expect 4.43 % famine deaths due to the climatic effects of a large nuclear war, whereas you expect 6.16 % (20.2 % probability of nuclear winter if there is a large-scale nuclear war times 30.5 % deaths given nuclear winter).
For the question “What is the unconditional probability of London being hit with a nuclear weapon in October?”, the 7 forecasts were 0.01, 0.00056, 0.001251, 10^-8, 0.000144, 0.0012, and 0.001. The largest of these is 1 M (= 0.01/10^-8) times the smallest, whereas in your case the largest probability is 2 k (= 0.6/0.000355) times the smallest.