Interesting side-finding: prediction markets seem notably worse than cleverly aggregated prediction pools (at least when liquidity is as low as in the play markets). Not many studies, but see Appendix A for what we’ve found.
Thank you for writing this overview! I think it’s very useful. A few notes on the famous “30%” claim:
Part of the problem with fully understanding the performance of IC analysts is that much of the information about the tournaments and the ICPM is classified.
What originally happened is that someone leaked info about ACE to David Ignatius, who then published it in his column. (The IC never denied the claim.[1]) The document you cite is part of a case study by MITRE that’s been approved for public release.
One under-appreciated takeaway that you hint at is that prediction markets (rather than non-market aggregation platforms) are poorly suited to classified environments. Here’s a quote from a white paper I co-wrote last year:[2]
More broadly, I would like to push back a little against the idea that your point 3(a) (whether supers outperform IC analysts) is really much evidence for or against 3 (whether supers outperform domain experts).
First, the IARPA tournaments asked a wide range of questions, but intelligence analysts tend to be specialized. If you’re looking at the ICPM, are you really looking at the performance of domain experts? Or are you looking at, e.g., an expert on politics in the Horn of Africa trying to forecast the price of the Ruble? On the one hand, since participants self-selected which questions they answered, we might expect domain experts to stick to their domain. On the other, analysts might have seen it as a “game,” a “break,” or “professional development”—in short, an opportunity to try their hand at something outside their expertise. The point is that we simply don’t know whether the ICPM really reflects “expert” opinion.
Second, I am inclined to believe that comparisons between IC analysts and supers may tell us more about the secrecy heuristic than about forecaster performance. From the same white paper:
I personally see much of the promise of forecasting platforms not as a tool for beating experts, but as a tool for identifying them more reliably (more reliably than by the usual signals, like a PhD).
Tetlock discusses this a bit in Chapter 4 of Superforecasting.
Keeping Score: A New Approach to Geopolitical Forecasting, https://global.upenn.edu/sites/default/files/perry-world-house/Keeping%20Score%20Forecasting%20White%20Paper.pdf.
Travers et al., “The Secrecy Heuristic,” https://www.jstor.org/stable/43785861.
This is extremely helpful and a deep cut—thanks Christian. I’ve linked to it in the post.
Yeah, our read is that Goldstein isn’t much evidence against (3); we’re just resetting the table, since previously people used it as strong evidence for (3).
Thanks Gavin! That makes sense on how you view this and (3).
David Manheim’s 2020 viewpoint prefigures some of the above, but goes further in questioning the superforecaster phenomenon (by reducing it to intelligence + open-mindedness + giving a damn).
Thanks for this, it’s really helpful! It seems very plausible to me that “generalist forecasters are the most accurate source for predictions on ~any question” has become too much of a community shibboleth. This is a useful correction.
Given how widely the “forecasters are better than experts!” meme has spread, point 3a seems particularly important to me (emphasis mine):
I would have found a couple more discussion paragraphs helpful. As written, it’s difficult for me to tell which studies you think are most influential in shaping the conclusions you lay out in the summary paragraph at the beginning of the post. The “Summary” section of the post isn’t actually summarizing the rest of the post; instead, that’s just where your discussion and conclusions are being presented.
I’m excited to potentially see more critical analysis of the forecasting literature! Plus ideas for new studies that can help identify the conditions under which forecasters are most accurate/helpful.
Renamed the summary section, thanks
Thank you! We might consider editing the summary. This particular point is mostly supported by our takes on Goldstein et al (2015) and by Appendix A.
These claims about Superforecasting are eye-catching. However, it’s difficult to draw any conclusions when most of the research cited doesn’t in fact include Superforecasters. In our view, it isn’t a matter of Superforecasters versus experts: the operative Boolean is “and”; combine the two as much as possible to get the best results.
For those who are interested in taking a deeper dive into the peer-reviewed literature, though, take a look here:
https://goodjudgment.com/about/the-science-of-superforecasting/
Some of our work on combining forecasters and experts is here:
https://www.foreignaffairs.com/articles/united-states/2020-10-13/better-crystal-ball
https://warontherocks.com/2021/07/did-sino-american-relations-have-to-deteriorate-a-better-way-of-doing-counterfactual-thought-experiments/
Where’s the “delta” upvote when I need it? :)
Appreciate that, Yonatan! :)
In principle, I like the research question, and the comparison above is probably the most you can make out from what is published. That said, it is the year 2022; capabilities and methodology have advanced enormously, at least among those PM firms operating successfully in commercial world markets. So it’s the proverbial apples-to-oranges comparison on several dimensions to talk about how “prediction markets” (sic) perform for whatever task. Different platform implementations have very different capabilities suited to very different tasks. Moreover, like any advanced tool, practical application of the more advanced PM platforms needs a high degree of methodical know-how in using their specific capabilities, based on real experience of what works and what doesn’t.
As a semi-active user of prediction markets and a person who looked up a bunch of studies about them, I don’t see that many innovations or at least anything that crucially changes the picture. I would be excited to be proven wrong, and am curious to know what you would characterize as advances in capability and methodology.
I am partly basing my impression on Mellers & Tetlock (2019), they write “We gradually got better at improving prediction polls with various behavioral and statistical interventions, but it proved stubbornly hard to improve prediction markets.” And my impression is that they experimented quite a bit with them.
So here’s a potentially fatal flaw in this analysis:
You write, “Goldstein et al showed that superforecasters outperformed the intelligence community....”
But the Goldstein paper was not about the Superforecasters. Your analysis, footnote 4, says, “‘All Surveys Logit’ takes the most recent forecasts from a selection of individuals in GJP’s survey elicitation condition....”
Thousands of individuals were in GJP’s survey elicitation condition, of whom only a fraction (a few dozen) were Superforecasters.
So Goldstein did not find that “superforecasters outperformed the intelligence community”; rather, he found that [thousands of regular forecasters + a few dozen Superforecasters] outperformed the intelligence community. That’s an even lower bar.
Please check for yourself. All GJP data is publicly available here: https://dataverse.harvard.edu/dataverse/gjp.
Thanks for engaging with our post!
Here is Mellers et al. (2017) about the study:
(Emphasis mine.)
I believe their assessment of whether it’s fair to call one of the “GJP best methods” “superforecasters” is more authoritative, since the term originated from their research (and they have a better understanding of the methodology).
Anyway, the “GJP best method” used all the Brier-score-boosting adjustments discussed in the literature (maybe excluding teaming), including selecting individuals (see below). And, IIRC, superforecasters are basically forecasters selected based on their performance.
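For readers following along, the Brier score mentioned here is just the mean squared error of probability forecasts against realized 0/1 outcomes. A minimal sketch for intuition; note that conventions differ across papers (the original multi-category Brier score is, for a binary question, twice the half-range value computed below), so treat the scale as illustrative:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error of probability forecasts against 0/1 outcomes.

    Lower is better: a perfect forecaster scores 0, and always saying
    50% scores 0.25 on this binary (half-range) convention.
    """
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Confident and mostly right beats hedged and vague.
sharp = brier_score([0.9, 0.1, 0.8], [1, 0, 1])
vague = brier_score([0.6, 0.4, 0.6], [1, 0, 1])
print(sharp, vague)
```

The “boosting adjustments” in the literature (training, teaming, selection, aggregation) are all just different ways of driving this number down.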
Hi @Misha, Thank you for your patience and sorry for the delay.
I triple-checked. Without any doubt, the “All Surveys Logit” used forecast data from thousands of “regular” forecasters and several dozen Superforecasters.
So it is the case that [regular forecasters + Superforecasters] outperformed U.S. intelligence analysts on the same questions by roughly 30%. It is NOT the case that the ICPM was compared directly and solely against Superforecasters.
It may be true, as you say, that there is a “common misconception...that superforecasters outperformed intelligence analysts by 30%”—but the Goldstein paper does not contain data that permits a direct comparison of intelligence analysts and Superforecasters.
The sentence in the 2017 article you cite contains an error. Simple typo? No idea. But typos happen and it’s not the end of the world. For example, in the table above, in the box with the Goldstein study, we see “N = 193 geopolitical questions.” That’s a typo. It is N = 139.
All Surveys Logit was the best of the many methods the study tried. Their class of methods is flexible enough to include superforecasters, since they tried weighting forecasters by past performance (and, as the research was based on year 3 data, the superforecasters were a salient option). By construction, ASL is at superforecaster level or above.
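For intuition about why the aggregation choice matters so much, here is a sketch of a logit-style pool in the spirit of All Surveys Logit: average forecasts in log-odds space, optionally weighting forecasters (e.g. by past performance) and extremizing. The weights and the extremizing exponent below are illustrative placeholders, not GJP’s actual parameters:

```python
import math

def logit_aggregate(probs, weights=None, extremize=1.0):
    """Pool individual probability forecasts by averaging in log-odds space.

    probs: probabilities in (0, 1) from individual forecasters.
    weights: optional per-forecaster weights (e.g. derived from past accuracy).
    extremize: an exponent > 1 pushes the pooled forecast away from 0.5.
    """
    if weights is None:
        weights = [1.0] * len(probs)
    total = sum(weights)
    # Weighted mean of log-odds, then an optional extremizing exponent.
    mean_logit = sum(w * math.log(p / (1 - p))
                     for w, p in zip(weights, probs)) / total
    return 1 / (1 + math.exp(-extremize * mean_logit))

# Five forecasters who all lean "yes": extremizing sharpens the pool.
pool = [0.6, 0.65, 0.7, 0.6, 0.75]
print(round(logit_aggregate(pool), 3))
print(round(logit_aggregate(pool, extremize=2.0), 3))
```

A market price, by contrast, can’t straightforwardly apply this kind of performance weighting or extremization, which is one candidate explanation for the pool-vs-market gaps discussed elsewhere in this thread.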
Oh my! May I ask, have you actually contacted anyone at Good Judgment to check? Because your assertion is simply not correct.
Upd 2022-03-14: a Good Judgment Inc. representative confirmed that Goldstein et al (2015) didn’t have a superforecaster-only pool. Unfortunately, the citations above are indeed misleading; as of now, we are not aware of any research comparing superforecasters and the ICPM.
Upd 2022-03-08: after some thought, we decided to revisit the post to be more precise. While this study has been referenced multiple times as superforecasters vs ICPM it’s unclear whether one of the twenty algorithms compared used only superforecasters (which seems plausible, see below). We still believe that Goldstein et al bear on how well the best prediction pools do, compared to ICPM. The main question about All Surveys Logit, whether the performance gap is due to the different aggregation algorithms used, also applies to claims about superforecasters.
- Co-investigators of GJP summarize the result that way (comment);
- Good Judgment Inc. uses this study on their page Superforecasters vs. ICPM (comment);
- further, in private communications people assumed that narrative;
- my understanding of the data justifies the claim (comment).
Lastly, even if we assume that claims about superforecaster performance in comparison with the IC haven’t been backed by this (or any other) study[1], the substantive claim still stands: the 30% edge is likely partly due to the different aggregation techniques used.
As I reassert in this comment, everyone refers to this study as a justification; and upon extensive literature search, I haven’t found other comparisons.
Hi again Misha,
Not sure what the finding here is: ”...the 30% edge is likely partly due to the different aggregation techniques used....” [emphasis mine]
How can we know more than likely partly? On what basis can we make a determination? Goldstein et al. posit several hypotheses for the 30% advantage Good Judgment had over the ICPM: 1) GJ folks were paid; 2) a “secrecy heuristic” posited by Travers et al.; 3) aggregation algorithms; 4) etc.
Have you disaggregated these effects such that we can know the extent to which the aggregation techniques boosted accuracy? Maybe the effect was entirely related to the $150 Amazon gift cards that GJ forecasters received for 12 months’ work? Maybe the “secrecy heuristic” explains the delta?
Thank you, Tim! “Likely partly due to” is my impression of what’s going on based on existing research; I think we know that it is “likely partly,” but probably not much more, based on the current literature.
The line of reasoning I find plausible: GJP PM and GJP All Surveys Logit draw on more or less the same pool of people, yet one aggregation algorithm does much better than the other, so it’s plausible that an “IC All Surveys Logit” would improve on the ICPM quite dramatically. And because the difference between GJP PM and ICPM is small, it seems plausible that, if the best aggregation method were applied to the IC, the IC would cut the aforementioned 30% gap.
(I am happy to change my mind upon seeing more research comparing strong forecasters and domain experts.)
Just emailed Good Judgment Inc about it.
Thanks for catching a typo! Appreciate the heads up.
From the conclusion of this new paper https://psyarxiv.com/rm49a/
We checked to see if Tetlock’s 2005 book had anything to tell us about our question.
Despite my own and others’ recollection that it shows that top generalists match experts, the main RFE experiment turns out to compare PhD area experts against PhD experts outside their precise area. The confusion arises because he uses the word “dilettante” for these latter experts, and doesn’t define this until the last appendix.
Be sure to check out the vast chasm between the experts and random undergrads.
One nice little study which was out of scope: ClearerThinking vs Good Judgment Inc vs MTurk on Trump policies. (This has been advertised as superforecasters vs experts, but it isn’t.)
this might be due to a change on EA forum since you initially posted this post, but the left and right columns of the table of studies are quite unreadable for me on desktop, on both Chrome and Edge. see screenshot for what it looks like from my end. is there any other format I can read this post in?
Oh that is annoying, thanks for pointing it out. I’ve just tried to use the new column width feature to fix it, but no luck.
Here’s a slightly more readable gdoc.
Our recent submission to the Cause Exploration Prizes, “Training experts to be forecasters,” may be of interest (I certainly found this post interesting as justification for some of the ideas we experiment with).
https://forum.effectivealtruism.org/posts/WFbf2d4LHjgvWJCus/cause-exploration-prizes-training-experts-to-be-forecasters
Can someone clarify these statements from Summary (3a)? They seem to be at odds....
A: “A common misconception is that superforecasters outperformed intelligence analysts by 30%.”
B: “Instead: Goldstein et al showed that superforecasters outperformed the intelligence community...”[then a table listing the ICPM MMDB as 0.23 versus the GJP Best MMDB as 0.15].
--> Wouldn’t that be 34% better?
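Checking the arithmetic on the rounded table values (the 34.7% figure quoted elsewhere presumably comes from unrounded scores):

```python
icpm, gjp_best = 0.23, 0.15             # mean Brier scores as rounded in the table
improvement = (icpm - gjp_best) / icpm  # relative reduction in mean Brier score
print(f"{improvement:.1%}")
```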
Indeed, but the misconception/lack of nuance is specifically about 30%; here is Wikipedia on the Good Judgment Project. I guess it’s either about looking at preliminary data or rounding.
It is, but we’re talking about the misconception, which became “30 percent” in (e.g.) this article.
Sorry, I’m confused. Do you mean the misconception is that rather than “30%” we should be saying that GJP was “34.7%” better than the ICPM?
It’s indeed the case that GJP was 34.7% better than the ICPM. But it’s not the case that GJP participants were 34.7% better than intelligence analysts. The intelligence analysts used prediction markets, which are generally worse than prediction pools (see Appendix A), so we are not comparing apples to apples.
It would be fair to judge IC for using prediction markets rather than prediction pools after seeing research coming out of GJP. But we don’t know how an intelligence analyst prediction pool would perform compared to the GJP prediction pool. We have reasons to believe that difference might not be that impressive based on ICPM vs GJP PM and based on Sell et al (2021).
There are three things:
1. The true performance difference between forecasters and CIA analysts with classified info (0%??)
2. What Goldstein found about a related but quite different quantity (34.7%)
3. What NPR etc. reported (30%)
The important misconception is using (2) as if it was (1). Sentence A is about misunderstanding the relationship between the above three things, so it seems fine to use the number from (3). We haven’t seen anyone with misconceptions about the precise 34.7% figure and we’re not attributing the error to Goldstein et al.
Curious: You say the 2015 Seth Goldstein “unpublished document” was “used to justify the famous ‘Supers are 30% better than the CIA’ claim.”
But that was reported two years earlier, in 2013: https://www.washingtonpost.com/opinions/david-ignatius-more-chatter-than-needed/2013/11/01/1194a984-425a-11e3-a624-41d661b0bb78_story.html.
So how was the 2015 paper the justification?
The linked story doesn’t cite another paper, so it’s hard to guess their actual source. Generally, academic research takes a while to be written and get published; the 2015 version of the paper seems to be the latest draft in circulation. It’s not uncommon to share and cite papers before they get published.
Thanks for the clarification, @Misha-Yagudin.
So to be clear, in his November 1, 2013 article, David Ignatius had access to forecasting data from the period August 1, 2013 through May 9, 2014!! (See section 5 of the Seth Goldstein paper underlying your analysis).
That, my friend, is quite the feat!!
Good catch, Tim! Well, at least Good Judgment Inc. (and some papers I’ve seen) cite Goldstein et al (2015) straight after David Ignatius’s 30% claim: https://goodjudgment.com/resources/the-superforecasters-track-record/superforecasters-vs-the-icpm/
If you by any chance have another paper[1] or resource in mind regarding the 30% claim, I would love to include it in the review.
Note that Goldstein et al don’t make that claim themselves, their discussion and conclusion are nuanced.
Christian Ruhl confirms that results from ACE were leaked early to Ignatius.