These numbers seem pretty all over the place. On nearly every question, the odds given by the 7 forecasters span at least 2 orders of magnitude, and often substantially more. And the majority of forecasters (4/7) gave multiple answers which seem implausible (details below) in ways that suggest that their numbers aren’t coming from a coherent picture of the situation.
I have collected the numbers in a spreadsheet and highlighted (in red) the ones that seem implausible to me.
Odds span at least 2 orders of magnitude:
Another commenter noted that the answers to “What is the probability that Russia will use a nuclear weapon in Ukraine in the next MONTH?” range from .001 to .27. In odds that is from 1:999 to 1:2.7, which is an odds ratio of 369. And this was one of the more tightly clustered questions; odds ratios between the largest and smallest answer on the other questions were 144, 42857, 66666, 332168, 65901, 1010101, and (with n=6) 12.
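(For concreteness, here is the arithmetic behind those odds ratios as a minimal Python sketch; 0.001 and 0.27 are the endpoints quoted above.)

```python
# Convert each probability to odds p/(1-p), then compare the largest to the smallest.
def to_odds(p):
    return p / (1 - p)

low, high = 0.001, 0.27
print(round(to_odds(high) / to_odds(low)))  # ~369, the odds ratio quoted above
```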
Other than the final (tactical nuke) question, these cover enough orders of magnitude for my reaction to be “something is going on here; let’s take a closer look” rather than “there are some different perspectives which we can combine by aggregating” or “looks like this is roughly the range of well-informed opinion.”
Individual extreme outlier answers:
Two forecasters gave an estimate on one of the component questions that was more than 2 orders of magnitude away from the next closest estimate (odds ratio over 100).
On the question “Conditional on Russia using a nuclear weapon in Ukraine, what is the probability that nuclear conflict will scale beyond Ukraine in the next YEAR after the initial nuclear weapon use?”, one forecaster gave the answer 10^-5. The next smallest answer was 0.0151, an odds ratio of 1533. On the MONTH version of this question, the ratio was 130. So the 10^-5 answer differs wildly from each of the other answers, and also (IMO) seems implausibly low.
On the question “Conditional on the nuclear conflict expanding to NATO, what is the chance that London would get hit, one MONTH after the first non-Ukraine nuclear bomb is used?”, the largest answer was .9985 and the 2nd largest was 0.5, an odds ratio of 666. The ratio was the same for the YEAR version of this question. This multiple-orders-of-magnitude outlier from all the other forecasts also seems implausibly high to me.
Implausible month-to-year ratios:
We can compare the answers to “Conditional on Russia using a nuclear weapon in Ukraine, what is the probability that nuclear conflict will scale beyond Ukraine in the next MONTH after the initial nuclear weapon use?” to the YEAR version of this question to see how likely each forecaster thought that the escalation would happen within a month, conditional on it happening within a year. From smallest to largest, these probabilities for p(escalation within a MONTH | escalation within a YEAR) are .067, .086, .5, .6, .75, .75, 1. Probabilities below 10% seem implausible here, both considering the question (nuclear escalation will very likely take more than a month if it happens?) and considering the other estimates, but 2 forecasters are in that range. (A probability of 1 would be implausibly high if forecasters were estimating it directly, but given that this is calculated from 2 probabilities and many answers only had 1 sigfig I guess it’s not a major issue.)
Similarly, the implied estimates for p(London hit within a MONTH of a non-Ukraine nuke | London hit within a YEAR of a non-Ukraine nuke) are, from smallest to largest, .17, .2, .5, .89, 1, 1, 1. Again, low probabilities (.2 or smaller) seem implausible.
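(These implied conditionals come from dividing each forecaster’s MONTH answer by their YEAR answer, since the month window is contained in the year window. A minimal sketch with illustrative placeholder numbers, not any particular forecaster’s actual pair:)

```python
# The MONTH event implies the YEAR event, so p(month | year) = p(month) / p(year).
def month_given_year(p_month, p_year):
    return p_month / p_year

print(round(month_given_year(0.01, 0.15), 3))  # 0.067, a ratio in the "implausibly low" range
```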
Conjunction vs. direct elicitation:
One sanity check in the original post is comparing the implied probability for a London nuke (based on p(London within a month | escalation), p(escalation within a month | Ukraine nuke), and p(Ukraine nuke within a month)) with the directly elicited p(London nuke in October). The implied probability covers a longer time period (since the monthlong window resets with each event), while the directly elicited probability covers all paths to London being nuked (not just the path via escalation from Russia nuking Ukraine), so it’s not obvious which should be larger. I think they should be close (and Nuño thought the conjunction should be larger).
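(A sketch of how the conjunctive estimate is assembled from the three component answers; the numbers below are illustrative placeholders, not any forecaster’s actual values.)

```python
# Chained (conjunctive) estimate of a London nuke via the escalation pathway.
p_ukraine_nuke = 0.05   # p(Russia uses a nuke in Ukraine within a month) -- placeholder
p_escalation   = 0.10   # p(escalation beyond Ukraine within a month | Ukraine nuke) -- placeholder
p_london       = 0.30   # p(London hit within a month | escalation to NATO) -- placeholder

p_conjunction = p_ukraine_nuke * p_escalation * p_london
print(round(p_conjunction, 4))  # 0.0015 via this pathway; compare to the direct p(London nuke in October)
```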
Looking at each forecaster, the ratio of p(London nuke in October) to the conjunction, from smallest to largest, is .57, .62, 1.04, 8, 20, 25, 48. Five of seven forecasters gave estimates which imply that the direct estimate (shorter timeframe, more pathways) is larger. Four of them gave estimates which imply a ratio of 8 or higher, which seems implausible.
And all four of those forecasters gave at least one of the other implausible forecasts mentioned above (an outlier individual estimate and/or an implausible month:year ratio). The three forecasters who have plausible ratios here (.57, .62, 1.04) did not give any of the implausible answers according to my other two sanity checks.
Bottom line:
3 of the 7 forecasters passed all three of these sanity checks. The other 4 forecasters each failed at least 2 of these sanity checks.
Aggregation which treats all this as noise and tries to find the central tendency helps keep the final estimate in a plausible range (and generally within the range of the 3 forecasters who passed the sanity checks), but it still seems possible to do significantly better.
IMO the epistemic status here is not seven good generalist forecasters who have thought carefully enough about these questions to give well-considered estimates, aggregated with some math that helps combine their different perspectives. Instead, the math is mainly just helping to filter out the not-carefully-considered answers.
Hey, thanks for the analysis; we might do something like that next time to improve the consistency of our estimates, either as a team or as individuals. Note that some of the issues you point out are the cost of speed, of working a bit in the style of an emergency response team rather than delaying a forecast for longer.
Still, I think I’m more chill and less worried than you about these issues, because, as you say, the aggregation method picks this up: it excludes the minimum and maximum, so it doesn’t take the geometric mean of the forecasts that you colored in red.
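(For readers who want to see what that kind of aggregation looks like, a minimal sketch, assuming a geometric mean of odds after trimming the single smallest and largest forecast; the actual implementation details may differ.)

```python
import math

def aggregate(probs):
    """Geometric mean of odds, excluding the minimum and maximum forecast (assumed scheme)."""
    odds = sorted(p / (1 - p) for p in probs)
    trimmed = odds[1:-1]  # the extreme low and high forecasts never enter the average
    log_mean = sum(math.log(o) for o in trimmed) / len(trimmed)
    gm_odds = math.exp(log_mean)
    return gm_odds / (1 + gm_odds)  # convert back to a probability

# Illustrative forecasts only; note the 0.001 and 0.27 outliers get trimmed away.
print(round(aggregate([0.001, 0.01, 0.02, 0.05, 0.27]), 4))
```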
I also appreciated the individual comparison between chained probabilities and directly elicited ones; it makes me even more pessimistic about using the directly elicited ones, particularly for <1% probabilities.
Hey Dan, thanks for sanity-checking! I think you and feruell are correct to be suspicious of these estimates; we laid out our reasoning and probabilities for people to adjust to their own taste/confidence.
I agree the outliers are concerning (and find some of them implausible), but I have likewise had the experience of being at 10-20% when the crowd was at ~0% (for a national election resulting in a tie) and at 20-30% when the crowd was at ~0% (for a SCOTUS case) [and likewise of being at ~1% while the crowd was much higher; I have also on occasion been wrong and had to update by 20x as a result; I’m not sure whether peers foresaw the Biden-Putin summit, but I was particularly wrong there].
I think the risk is front-loaded, and low month-to-year ratios are suspicious, but I don’t find them that implausible (e.g., one might expect everyone to get to the negotiating table / onto emergency calls after nukes are used, and the battlefield to be “frozen/shocked”; so while there would be more uncertainty early on, there would also be more effort and more reasons not to escalate or use more nukes, at least for a short while; these two might roughly offset each other).
Yeah, I predicted that conjunction vs. direct wouldn’t match for people (it’s really hard to have a good “sense” of such low probabilities if you are not doing a decomposition). I think we should have checked these beforehand and discussed them with folks.
It would be interesting whether the forecasters with outlier numbers stand by those forecasts on reflection, and to hear their reasoning if so. In cases where outlier forecasts reflect insight, how do we capture that insight rather than brushing them aside with the noise? Checking in with those forecasters after their forecasts have been flagged as suspicious-to-others is a start.
The p(month|year) number is especially relevant, since that is not just an input into the bottom line estimate, but also has direct implications for individual planning. The plan ‘if Russia uses a nuclear weapon in Ukraine then I will leave my home to go someplace safer’ looks pretty different depending on whether the period of heightened risk when you will be away from home is more like 2 weeks or 6 months.