Thanks for engaging with our post!
Here is Mellers et al. (2017) about the study:
Each year, the top 2% of subjects were designated “superforecasters” and were assigned to work together in elite teams. In this richer setting, superforecasters became more accurate and resisted regression to the mean, suggesting that their accuracy was driven at least in part by skill, rather than luck (Mellers, Stone, Atanasov, Rohrbaugh, Metz, Ungar, Bishop, Horowitz, Merkle & Tetlock, 2015b). Indeed, using Brier scores to measure accuracy, Goldstein, Hartman, Comstock and Baumgarten (2016) found that superforecasters outperformed U.S. intelligence analysts on the same questions by roughly 30%.
(Emphasis mine.)
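For readers unfamiliar with the metric: the Brier score is the mean squared error between probability forecasts and binary outcomes, so lower is better, and a “30% edge” on this metric presumably means a roughly 30% lower mean score. A minimal sketch in Python, with made-up numbers rather than data from the study:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes (lower is better).

    Note: GJP papers often report the original two-category Brier score,
    which for a yes/no question is exactly twice this value.
    """
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Three resolved yes/no questions (outcomes: yes, yes, no) and two hypothetical forecasters.
outcomes = [1, 1, 0]
print(brier_score([0.60, 0.70, 0.40], outcomes))  # ~0.137
print(brier_score([0.80, 0.85, 0.20], outcomes))  # ~0.034, i.e. noticeably more accurate
```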
I believe their assessment of whether it’s fair to call one of the “GJP best methods” “superforecasters” is more authoritative, as the term originated from their research (and they have a better understanding of the methodology).
Anyway, the “GJP best method” used all of the Brier-score-boosting adjustments discussed in the literature (possibly excluding teaming), including selecting individuals (see below). And, IIRC, superforecasters are basically forecasters selected based on their past performance.
Finally, we compare ICPM accuracy to that of GJP’s single most accurate CW method for the set of questions being analyzed—a method called “All Surveys Logit.” All Surveys Logit takes the most recent forecasts from a selection of individuals in GJP’s survey elicitation condition, weights them based on a forecaster’s historical accuracy, expertise, and psychometric profile, and then extremizes the aggregate forecast (towards 1 or 0) using an optimized extremization coefficient.
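As an aside, the quoted description maps onto a familiar aggregation pattern: average the selected forecasts in log-odds space with per-forecaster weights, then push the aggregate toward 0 or 1. Here is a rough sketch of that pattern in Python; the weights, the probability clamping, and the extremization coefficient are placeholder assumptions, not GJP’s fitted values:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def logistic(x):
    return 1 / (1 + math.exp(-x))

def weighted_logit_aggregate(probs, weights, extremization=2.5):
    """Weight each forecaster's latest probability, average in log-odds space,
    then extremize. The weights stand in for historical accuracy, expertise,
    and psychometric measures; the coefficient would be fit to past questions."""
    probs = [min(max(p, 0.01), 0.99) for p in probs]  # keep logits finite
    mean_logit = sum(w * logit(p) for p, w in zip(probs, weights)) / sum(weights)
    return logistic(extremization * mean_logit)

# Three forecasters leaning "yes"; the historically strongest gets the largest weight.
print(weighted_logit_aggregate([0.70, 0.65, 0.80], weights=[2.0, 1.0, 1.5]))  # ~0.92
```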
Hi @Misha, Thank you for your patience and sorry for the delay.
I triple-checked. Without any doubt, the “All Surveys Logit” used forecast data from thousands of “regular” forecasters and several dozen Superforecasters.
So it is the case that [regular forecasters + Superforecasters] outperformed U.S. intelligence analysts on the same questions by roughly 30%. It is NOT the case that the ICPM was compared directly and solely against Superforecasters.
It may be true, as you say, that there is a “common misconception...that superforecasters outperformed intelligence analysts by 30%”—but the Goldstein paper does not contain data that permits a direct comparison of intelligence analysts and Superforecasters.
The sentence in the 2017 article you cite contains an error. Simple typo? No idea. But typos happen and it’s not the end of the world. For example, in the table above, in the box with the Goldstein study, we see “N = 193 geopolitical questions.” That’s a typo. It is N = 139.
All Surveys Logit was the best method out of the many methods the study tried. Their class of methods is flexible enough to include superforecasters, as they tried weighting forecasters by past performance (and since the research was based on year 3 data, the superforecasters were a salient option). By construction, ASL is at superforecaster level or above.
Oh my! May I ask, have you actually contacted anyone at Good Judgment to check? Because your assertion is simply not correct.
Upd 2022-03-14: A Good Judgment Inc. representative confirmed that Goldstein et al. (2015) didn’t have a superforecaster-only pool. Unfortunately, the citations above are indeed misleading; as of now, we are not aware of research comparing superforecasters and the ICPM.
Upd 2022-03-08: After some thought, we decided to revise the post to be more precise. While this study has been referenced multiple times as superforecasters vs. ICPM, it’s unclear whether one of the twenty algorithms compared used only superforecasters (which seems plausible, see below). We still believe that Goldstein et al. bear on how well the best prediction pools do compared to the ICPM. The main question about All Surveys Logit, whether the performance gap is due to the different aggregation algorithms used, also applies to claims about superforecasters.
Co-investigators of GJP summarize the result that way (comment);
Good Judgment Inc. uses this study on their page Superforecasters vs. ICPM (comment);
further, in private communications, people assumed that narrative;
my understanding of the data justifies the claim (comment).
Lastly, even if we assume that claims about superforecasters’ performance relative to the IC haven’t been backed by this (or any other) study[1], the substantive claim holds: the 30% edge is likely partly due to the different aggregation techniques used.
As I reassert in this comment, everyone refers to this study as a justification; and upon an extensive literature search, I haven’t found other comparisons.
Hi again Misha,
Not sure what the finding here is: “...the 30% edge is likely partly due to the different aggregation techniques used....” [emphasis mine]
How can we know more than “likely partly”? On what basis can we make a determination? Goldstein et al. posit several hypotheses for the 30% advantage Good Judgment had over the ICPM: 1) GJ folks were paid; 2) a “secrecy heuristic” posited by Travers et al.; 3) aggregation algorithms; 4) etc.
Have you disaggregated these effects such that we can know the extent to which the aggregation techniques boosted accuracy? Maybe the effect was entirely related to the $150 Amazon gift cards that GJ forecasters received for 12 months’ work? Maybe the “secrecy heuristic” explains the delta?
Thank you, Tim! “Likely partly due to” is my impression of what’s going on based on existing research; I think we know that it is “likely partly,” but probably not much more, based on the current literature.
The line of reasoning I find plausible: the GJP PM and GJP All Surveys Logit draw on more or less the same pool of people, but one aggregation algorithm is much better than the other; so it’s plausible that an “IC All Surveys Logit” would improve on the ICPM quite dramatically. And because the difference between the GJP PM and the ICPM is small, it seems plausible that if the best aggregation method were applied to the IC, the IC would cut the aforementioned 30% gap.
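To make that concrete with toy numbers (illustrative only, not data from the study): the same four cautious forecasts scored with a simple mean versus an extremized log-odds mean give very different Brier scores, so the aggregation step alone can move the headline number a lot.

```python
import math

def brier(p, outcome):
    return (p - outcome) ** 2

def extremized_logit_mean(probs, a=2.5):
    mean_logit = sum(math.log(p / (1 - p)) for p in probs) / len(probs)
    return 1 / (1 + math.exp(-a * mean_logit))

# One made-up question that resolved "yes", with a cautious pool of forecasts.
probs, outcome = [0.65, 0.70, 0.60, 0.75], 1

simple_mean = sum(probs) / len(probs)        # 0.675
extremized = extremized_logit_mean(probs)    # ~0.86

print(brier(simple_mean, outcome))  # ~0.106
print(brier(extremized, outcome))   # ~0.018 -> same forecasters, much better score

# Caveat: extremizing helps when the pool leans the right way (as here) and hurts
# when it leans the wrong way; an optimized coefficient trades this off over many questions.
```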
(I am happy to change my mind upon seeing more research comparing strong forecasters and domain experts.)
Just emailed Good Judgment Inc about it.
Thanks for catching a typo! Appreciate the heads up.