I think the differing estimates between domain experts and superforecasters are very interesting. Two factors that I think might contribute to this are selection effects and anchoring effects.
For selection effects, we know that 42% of the selected domain experts had attended EA meetups, whereas only 9% of the superforecasters had (page 9 of the report). I assume this oversampling of EA members among the experts may cause some systematic shift. The same applies to previous surveys of AI experts: for example, the 2022 survey was voluntary and had only a 17% response rate. It’s possible (I would even say likely) that if you had managed to survey 100% of the experts, the median probability of AI doom would drop, as people concerned about AI risk are probably more likely to answer such surveys.
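To illustrate the worry, here is a toy simulation (with entirely made-up numbers, nothing from the report) of how that kind of non-response bias could shift the observed median: if experts who put more probability on AI x-risk are also more likely to answer a voluntary survey, the respondents' median ends up higher than the median you would get from surveying everyone.

```python
# Toy sketch of the selection-effect worry, with invented numbers.
import random
import statistics

random.seed(0)

# Hypothetical full population of 1,000 experts: a low-concern majority and
# a high-concern minority (numbers chosen purely for illustration).
population = ([random.uniform(0.001, 0.02) for _ in range(800)]
              + [random.uniform(0.03, 0.15) for _ in range(200)])

def responds(p_doom):
    # Assume high-concern experts are three times as likely to respond,
    # with an overall response rate landing near the reported 17%.
    return random.random() < (0.36 if p_doom >= 0.03 else 0.12)

respondents = [p for p in population if responds(p)]

print(f"true median:     {statistics.median(population):.2%}")
print(f"observed median: {statistics.median(respondents):.2%}")
print(f"response rate:   {len(respondents) / len(population):.0%}")
```

The specific numbers are invented; the point is only the direction of the shift.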
The other cause could be differing susceptibility to anchoring bias. We know (from page 9) that the general public is extremely susceptible to anchoring here: the median public estimate of x-risk dropped roughly six orders of magnitude, from 5% to 1 in 15 million, depending on the phrasing of the question (with the difference caused either by the example probabilities presented, or by whether the question was asked in terms of odds rather than percentages).
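For concreteness, the size of that gap, taking the two reported figures at face value:

```python
import math

high = 0.05               # 5%
low = 1 / 15_000_000      # 1 in 15 million

print(f"ratio: {high / low:,.0f}")                           # 750,000
print(f"orders of magnitude: {math.log10(high / low):.1f}")  # ~5.9
```

Strictly it is closer to 5.9 orders of magnitude, but the headline point is unchanged.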
If the public is susceptible to anchoring, experts probably will be as well. If you look at the resources on AI risk given to the participants (page 132), they include a list of previous forecasts and expert surveys, which were, in order:
5%, 1 in 10, 5%, 5%, “0 to 10%”, 0.05%, and 5%
Literally 4 out of the 7 forecasts give the exact same number. Are we sure it’s a coincidence that the final forecast median ends up around this same number?
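As a rough sanity check on how strong that anchor is, the median of the listed figures is itself 5%, at least under my own reading of the two non-point estimates ("1 in 10" as 10%, and the midpoint of "0 to 10%" as 5%):

```python
import statistics

# Anchors from the resource list, as probabilities. "1 in 10" is read as 10%;
# the "0 to 10%" range is replaced by its midpoint, 5% (my assumption).
anchors = [0.05, 0.10, 0.05, 0.05, 0.05, 0.0005, 0.05]

print(f"median anchor: {statistics.median(anchors):.2%}")  # 5.00%
```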
My view is that the domain experts are subconsciously afraid to stray too far from their anchor point, whereas the superforecasters are more adept at resisting such biases and at noticing that the estimates come mostly from EA sources, which may have correlated bias on this question.
Of course, there could be plenty of other reasons as well, but I thought these two were interesting to highlight.
Yes, for example maybe the AI experts benefited from their expertise on AI.
One could say the CIA benefited from their expertise in geopolitics, and yet the superforecasters still beat them. Superforecasters perform well because they are good at synthesising diverse opinions, including expert ones: they had access to the same survey of AI experts that everyone else did.
That the AI experts have more expertise in AI is obviously true, but it doesn’t explain the discrepancy in forecasts. Why were the superforecasters unconvinced by the AI experts?
I think there are a bunch of things that make expertise more valuable in AI forecasting than in geopolitical forecasting:
We’ve all grown up with geopolitics and read about it in the news, in books and so on, so most of us non-experts already have passable models of it. That’s not true with AI (until maybe one year ago, and even now the news doesn’t report on technical safety).
Geopolitical events have fairly clear reference classes that can give you base rates and so on (and this is a tool available to both experts and non-experts) -- this is much harder with AI. That means the outside view is less valuable for AI forecasting.
I think AI is really complex and technical, and especially hard given that we’re dealing with systems that don’t yet exist. Geopolitics is also complex, and geopolitical futures will be different from now, but the basic elements are the same. I think this, too, favours non-experts in geopolitical forecasting more than in AI forecasting.
And quoting Peter McCluskey, a participating superforecaster:
The initial round of persuasion was likely moderately productive. The persuasion phases dragged on for nearly 3 months. We mostly reached drastically diminishing returns on discussion after a couple of weeks.
[...]
The persuasion seemed to be spread too thinly over 59 questions. In hindsight, I would have preferred to focus on core cruxes, such as when AGI would become dangerous if not aligned, and how suddenly AGI would transition from human levels to superhuman levels. That would have required ignoring the vast majority of those 59 questions during the persuasion stages. But the organizers asked us to focus on at least 15 questions that we were each assigned, and encouraged us to spread our attention to even more of the questions.
[...]
Many superforecasters suspected that recent progress in AI was the same kind of hype that led to prior disappointments with AI. I didn’t find a way to get them to look closely enough to understand why I disagreed.
My main success in that area was with someone who thought there was a big mystery about how an AI could understand causality. I pointed him to Pearl, which led him to imagine that problem might be solvable. But he likely had other similar cruxes which he didn’t get around to describing.
That left us with large disagreements about whether AI will have a big impact this century.
I’m guessing that something like half of that was due to a large disagreement about how powerful AI will be this century.
I find it easy to understand how someone who gets their information about AI from news headlines, or from laymen-oriented academic reports, would see a fair steady pattern of AI being overhyped for 75 years, with it always looking like AI was about 30 years in the future. It’s unusual for an industry to quickly switch from decades of overstating progress, to underhyping progress. Yet that’s what I’m saying has happened.
[...]
That superforecaster trend seems to be clear evidence for AI skepticism. How much should I update on it? I don’t know. I didn’t see much evidence that either group knew much about the subject that I didn’t already know. So maybe most of the updates during the tournament were instances of the blind leading the blind.
Scott Alexander points out that the superforecasters have likely already gotten one question pretty wrong: their median prediction for the most expensive training run in 2024 was $35M (the experts’ median for 2024 was $65M), whereas GPT-4 seems to have cost ~$60M, though with ample uncertainty. But bearish predictions will tend to fail earlier than bullish ones, so we’ll see how the two groups compare over the next few years, I guess.
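Taking those point estimates at face value (and keeping in mind that the ~$60M GPT-4 figure is itself a rough estimate, and that the most expensive run by end of 2024 may cost more still), a quick log-scale comparison looks like this:

```python
import math

# Rough point estimates; the GPT-4 figure in particular is quite uncertain.
gpt4_cost = 60e6           # ~$60M, rough public estimate for GPT-4's training run
superforecasters = 35e6    # superforecasters' median for most expensive 2024 run
experts = 65e6             # domain experts' median for the same question

for name, prediction in [("superforecasters", superforecasters),
                         ("domain experts", experts)]:
    error = abs(math.log10(prediction / gpt4_cost))
    print(f"{name}: off by {error:.2f} orders of magnitude so far")
```

On that very provisional comparison the experts’ median is much closer, with the caveat that the question has not actually resolved yet.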
I think you make good points in favour of the AI expert side of the equation. To balance that out, I want to offer one more point in favour of the superforecasters, in addition to my earlier points about anchoring and selection bias (we don’t actually know what the true median of AI expert opinion is or would be if questions were phrased differently).
The primary point I want to make is that AI x-risk forecasting is, at least partly, a geopolitical forecast. Extinction from rogue AI requires some form of war or struggle between humanity and the AI, and you have to estimate the probability that that struggle ends with humanity losing.
An AI expert is an expert in software development, not in geopolitical threat management; neither are they an expert in potential future weapons technology. If someone has worked on the latest bombshell LLM, I will take their predictions about specific AI developments seriously, but if they tell me an AI will be able to build omnipotent nanomachines that take over the planet in a month, I have no hesitation in telling them they’re wrong, because I have more expertise in that realm than they do.
I think the superforecasters have better geopolitical knowledge than the AI experts, and that is reflected in these estimates.
By the way, I feel now that my first reply in this thread was needlessly snarky, and am sorry about that.
My guess is that the crowds are similar, and thus the earlier surveys and the initial tournament forecasts were also similar.
IIRC (?), the report states that there wasn’t much updating of forecasts, so the final and initial averages are naturally close as well.
Besides that, there was also some deference to literature/group averages, and some participants imitated, e.g., the Carlsmith forecast but plugged in their own numbers (I think it was 1/8th of my group, but I’d need to check my notes).
I kinda speculate that Carlsmith’s model may be biased towards producing numbers around ~5% (something about how making long chains of conditional probabilities doesn’t work, because humans fail to imagine each step correctly and thus end up biased towards default probabilities closer to 50% at each step).
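To make that speculation concrete, here is a toy sketch with invented step probabilities (not Carlsmith’s actual premises): if each conditional estimate in a six-step chain gets blended halfway toward 50%, forecasters with very different step-level beliefs end up with bottom lines squeezed into a much narrower range.

```python
import math

def compress_toward_half(p, weight=0.5):
    """Blend a conditional-probability estimate toward 50%, mimicking the
    claimed failure to imagine each conditional step precisely."""
    return (1 - weight) * p + weight * 0.5

# Two hypothetical six-step chains (invented numbers, not Carlsmith's premises).
skeptic = [0.9, 0.3, 0.1, 0.2, 0.5, 0.1]
worried = [0.95, 0.8, 0.7, 0.6, 0.9, 0.7]

for name, steps in [("skeptic", skeptic), ("worried", worried)]:
    raw = math.prod(steps)
    squashed = math.prod(compress_toward_half(p) for p in steps)
    print(f"{name}: raw product {raw:.3%} -> compressed product {squashed:.3%}")
```

In this toy example the two bottom lines start roughly 750x apart and end up only about 15x apart after compression.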