Have you looked at how sensitive this analysis is to outliers, or to (say) the most extreme 10% of responses on each component?
The recent Samotsvety nuclear risk estimate removed the largest and smallest forecast (out of 7) for each component before aggregating (the remaining 5 forecasts) with the geometric mean. Would a similar adjustment here change the bottom line much (for the single probability and/or the distribution over “worlds”)?
The prima facie case for worrying about outliers actually seems significantly stronger for this survey than for an org like Samotsvety, which relies on skilled forecasters who treat each forecast professionally. This AI survey could have included people who haven’t thought in much depth about AI existential risk, or who aren’t comfortable with the particular decomposition you used, or who aren’t good at giving probabilities, or who didn’t put much time/effort/thought into answering these survey questions.
And it seems like the synthetic point estimate method used here might magnify the impact of outlier respondents rather than attenuating it. An extreme response can move the geometric mean a lot, and a person who gives extreme answers on 3 of the components can have their extreme estimates show up in 3/n of the synthetic estimates, not just 1/n.
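A minimal sketch of the adjustment described above: drop the largest and smallest forecast for a component, then take the geometric mean of what remains. The function name and the seven forecasts are hypothetical, made up purely for illustration; they are not Samotsvety's numbers or survey data. It also assumes the geometric mean is taken over probabilities directly and that every forecast is strictly positive.

```python
import math

def trimmed_geometric_mean(forecasts, drop_each_end=1):
    """Drop the `drop_each_end` smallest and largest values, then take the geometric mean."""
    if len(forecasts) <= 2 * drop_each_end:
        raise ValueError("not enough forecasts left after trimming")
    kept = sorted(forecasts)[drop_each_end:len(forecasts) - drop_each_end]
    # Geometric mean computed as exp(mean(log p)); requires p > 0.
    return math.exp(sum(math.log(p) for p in kept) / len(kept))

# Hypothetical forecasts for one component (illustrative only, not real data).
forecasts = [0.0001, 0.05, 0.08, 0.10, 0.12, 0.20, 0.90]

untrimmed = trimmed_geometric_mean(forecasts, drop_each_end=0)
trimmed = trimmed_geometric_mean(forecasts, drop_each_end=1)
print(f"untrimmed: {untrimmed:.4f}, trimmed: {trimmed:.4f}")
```

With these made-up numbers, the single very low forecast pulls the untrimmed geometric mean well below the trimmed one, which is the kind of sensitivity the question is asking about.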
I had not thought to do that, and it seems quite sensible (I agree with your point about prima facie worry about low outliers). The results are below.
To my eye, the general mechanism I wanted to defend is preserved (there is an asymmetric probability of finding yourself in a low-risk world), but the probability of finding yourself in an ultra-low-risk world has fallen significantly, with that probability mass roughly redistributing itself around the geometric mean (which itself has gone up to 7%-ish).
In some sense this isn’t totally surprising: removing the lowest 10% of estimates means that order-of-magnitude uncertainty is only preserved for one of the six parameters in the equation (Containment), so the SDO mechanism doesn’t really apply. I don’t have the subject-specific knowledge to conclude whether de-extremising the data in this way is reasonable (do we actually have better-than-order-of-magnitude knowledge about all of these parameters except Containment?), but the analysis you suggest reveals an important limitation of my results which I had totally overlooked, so thank you for the suggestion.
Sorta kinda, yes? For example, convincingly arguing that any conditional probability in the Carlsmith decomposition is less than 10% (while not inflating the others) would probably win the main prize, given that “I [Nick Beckstead] am pretty sympathetic to the analysis of Joe Carlsmith here” and that Nick’s estimate is 3x higher than Carlsmith’s at the time of writing the report.
My understanding is that what everyone is producing (Carlsmith, Beckstead, etc.) is their point estimate / most likely probability for some proposition being true. Shifting this point estimate to below 10% would be near enough to win a prize, but plenty of real-world applications have highish point estimates whose uncertainty has a very low lower bound.
The application where I am most familiar with this effect is clinical trials for oncology drugs; it isn’t uncommon for the point estimate of a drug’s effectiveness to be (say) 50% better than all other drugs on the market, but with a 95% confidence interval that includes no improvement at all, or sometimes even a substantially worse result. It seems to me to be quite a radical claim that we have better knowledge of AI Risk across nearly all parameters than we have of an oncology drug across a single parameter following a clinical trial.
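To make the clinical-trial comparison concrete, here is a rough sketch of how a point estimate of "50% better" can sit inside a 95% interval that still includes "no better at all". The effect ratio and standard error are hypothetical values chosen for illustration, and the interval uses a simple normal approximation on the log scale, which is an assumption rather than a description of any particular trial.

```python
import math

# Hypothetical trial result: the drug looks 1.5x as effective as the comparator.
point_estimate_ratio = 1.5
# Hypothetical standard error of log(ratio); small trials often have large values.
se_log_ratio = 0.25

z = 1.96  # approximate two-sided 95% critical value of the standard normal
log_ratio = math.log(point_estimate_ratio)
lower = math.exp(log_ratio - z * se_log_ratio)
upper = math.exp(log_ratio + z * se_log_ratio)

print(f"point estimate: {point_estimate_ratio:.2f}x")
print(f"95% interval:   {lower:.2f}x to {upper:.2f}x")
print("interval includes 'no better' (ratio = 1):", lower < 1.0 < upper)
```

A ratio of 1 means no improvement, so an interval whose lower end falls below 1 is compatible with the drug being somewhat worse than the comparator despite the favourable point estimate.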
Did you only drop the low outliers, or did you drop both the low outliers and the high outliers?
I dropped 10% from both the low and high ends, so the results above use the most central 80% of estimates for each parameter (although, just eyeballing the data, I was left with quite a few >99% probabilities even after dropping the extreme top end).
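For anyone wanting to try a similar check, the sketch below trims 10% from each end of every parameter's responses and then regenerates synthetic estimates by drawing one response per parameter and multiplying them. Everything here is an assumption made for illustration: the survey data are randomly generated log-uniform values, the function names are hypothetical, and the sampling scheme is only a guess at the general shape of the synthetic point estimate method, not the actual analysis code.

```python
import random

def trim(values, frac=0.10):
    """Drop the lowest and highest `frac` of values (count rounded down per end)."""
    k = int(len(values) * frac)
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]

def synthetic_estimates(responses_by_param, n_draws, rng):
    """Each draw picks one response per parameter at random and multiplies them."""
    draws = []
    for _ in range(n_draws):
        product = 1.0
        for responses in responses_by_param:
            product *= rng.choice(responses)
        draws.append(product)
    return draws

def share_below(draws, threshold):
    """Fraction of synthetic 'worlds' whose overall risk is below `threshold`."""
    return sum(d < threshold for d in draws) / len(draws)

rng = random.Random(0)
# Made-up survey: 6 parameters x 40 respondents, answers spread over 3 orders of magnitude.
survey = [[10 ** rng.uniform(-3, 0) for _ in range(40)] for _ in range(6)]
trimmed_survey = [trim(responses) for responses in survey]

full_draws = synthetic_estimates(survey, 10_000, rng)
trimmed_draws = synthetic_estimates(trimmed_survey, 10_000, rng)
print("share of worlds below 1e-6 (all responses):", share_below(full_draws, 1e-6))
print("share of worlds below 1e-6 (central 80%):  ", share_below(trimmed_draws, 1e-6))
```

Comparing the two shares gives a quick read on how much of the ultra-low-risk probability mass depends on the most extreme responses for each parameter.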