Comparing Superforecasting and the Intelligence Community Prediction Market

Note: This post was co-authored by Luis Enrique Urtubey De Césaris, Director at Good Judgment, and Marc Koehler, Senior Vice President at Good Judgment.
The last few weeks have seen some thought-provoking analysis and discussion in the EA Forum on the accuracy of different forecasting approaches, including the Superforecasting® approach Good Judgment (GJ) has been developing for over a decade. Recent posts have included a discussion of the accuracy of GJ’s Superforecasters® versus the US intelligence community in a 2013-2014 study. Some of this discussion has caused confusion—which we ourselves have added to with imprecise language on our website (hopefully all fixed now)!
A key question considered in this thoughtful paper by Arb Research is whether top forecasters are more accurate than domain experts. To be clear, GJ has no data on whether any individual Superforecaster was or was not more accurate than any individual intelligence analyst in the 2013-2014 study. There is also no data on how a Superforecaster Prediction Market performed against the Intelligence Community Prediction Market (ICPM) at that time, as there was not a “Super Market” until year 4 of the US Government “IARPA/ACE” research project [1]. We do know, however, that GJ’s best methods were 34.7% more accurate than those of the ICPM over 139 forecasting questions running from the fall of 2013 to the early summer of 2014 [2].
This is the finding of the study by IARPA Program Manager Dr. Seth Goldstein et al. The conclusion was not surprising, as GJ was also 35-72% more accurate than the four other academic and industry teams taking part in the four-year US Government “IARPA/ACE” research project running during this period.
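For readers unfamiliar with how such comparisons are expressed: accuracy in the ACE program was scored with Brier scores (lower is better), and percentage-improvement figures of this kind are typically the relative reduction in mean Brier score over the common set of questions. The sketch below uses made-up forecasts purely to illustrate that arithmetic; it is not the study's actual scoring code.

```python
# Illustration only (made-up forecasts, not the study's data): Brier scoring
# and the "X% more accurate" arithmetic as a relative reduction in mean score.

def brier_score(prob_assigned, outcome_occurred):
    """Brier score for one binary forecast; 0 is perfect, lower is better."""
    return (prob_assigned - (1.0 if outcome_occurred else 0.0)) ** 2

# Hypothetical forecasts from two methods on the same four resolved questions.
outcomes = [True, False, True, True]
method_a = [0.85, 0.20, 0.70, 0.90]   # e.g. an aggregation algorithm
method_b = [0.70, 0.40, 0.55, 0.75]   # e.g. a prediction market

mean_a = sum(brier_score(p, o) for p, o in zip(method_a, outcomes)) / len(outcomes)
mean_b = sum(brier_score(p, o) for p, o in zip(method_b, outcomes)) / len(outcomes)

improvement = (mean_b - mean_a) / mean_b
print(f"Method A is {improvement:.1%} more accurate (relative Brier reduction)")
```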
In IARPA/ACE Year 3, when Goldstein collected the comparison data, it was not possible to do a strict apples-to-apples comparison of Superforecasters versus Intelligence Community analysts on the same sort of forecasting platform/elicitation method because there was no Superforecaster prediction market in Year 3. Instead, the Goldstein analysis compared the ICPM against all Good Judgment Project (GJP) forecasters who were forecasting on “survey”/opinion pool platforms, similar to Good Judgment Open. This included a mix of both regular forecasters and Superforecasters. GJP was testing several different aggregation algorithms, and the algorithm called “all surveys logit” was used for comparison. For further information, please see our journal articles on aggregation algorithms here.
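The exact “all surveys logit” implementation is described in the journal articles linked above. As a rough, generic sketch of what a logit-style aggregator does (not the project's actual code, and with a made-up extremizing parameter rather than a fitted value): individual probabilities are pooled on the log-odds scale, and the pooled forecast is then pushed away from 0.5.

```python
import math

def logit_aggregate(probs, a=2.0):
    """Pool probability forecasts on the log-odds scale, then extremize.

    A minimal sketch of a logit-style aggregator: average the forecasters'
    log-odds and multiply by `a` (equivalently, raise the pooled odds to the
    power `a`); a > 1 pushes the aggregate away from 0.5. The value a=2.0 is
    illustrative only, not a parameter fitted to any GJP data.
    """
    eps = 1e-6  # keep extreme forecasts (0 or 1) finite on the log-odds scale
    clamped = [min(max(p, eps), 1 - eps) for p in probs]
    mean_log_odds = sum(math.log(p / (1 - p)) for p in clamped) / len(clamped)
    return 1.0 / (1.0 + math.exp(-a * mean_log_odds))

# Five hypothetical forecasters' probabilities on one binary question:
print(round(logit_aggregate([0.60, 0.70, 0.65, 0.80, 0.55]), 3))
```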
The following year, in IARPA/ACE Year 4, GJP did have a prediction market for Superforecasters (data here). Based on this data, we compared Superforecasters working in their own prediction market against the “all surveys logit” aggregation algorithm on the same questions. Our chief data scientist at the time found no statistically significant difference in accuracy between those methods, which is intuitive, since aggregation algorithms have not, in general, been found to significantly boost the accuracy of Superforecasters.
So there was no statistical difference between “all surveys logit” and the “Super Market” in Year 4, and “all surveys logit” was 34.7% more accurate than the ICPM in Year 3. That’s the closest we can come to saying anything about comparing the Superforecaster prediction market to the ICPM with currently available data.
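For readers curious what a significance comparison of this kind can look like in practice, here is a minimal sketch: a paired test on per-question Brier scores for two methods forecasting the same questions. The numbers are made up, and the actual GJP analysis may well have used a different test.

```python
from scipy.stats import ttest_rel  # any paired test on per-question scores would do

# Hypothetical per-question Brier scores for two methods on the same questions.
# (Made-up numbers; the actual GJP analysis may have used a different test.)
brier_super_market =      [0.12, 0.30, 0.05, 0.44, 0.18, 0.26, 0.09, 0.33]
brier_all_surveys_logit = [0.10, 0.35, 0.07, 0.40, 0.20, 0.24, 0.11, 0.31]

result = ttest_rel(brier_super_market, brier_all_surveys_logit)
print(f"paired t = {result.statistic:.2f}, p-value = {result.pvalue:.2f}")
# A large p-value means no detectable accuracy difference between the methods.
```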
Somewhat relatedly, there has been debate on a statement made in a November 1, 2013, article by Washington Post editor David Ignatius. He wrote, “One of the most interesting findings, according to a participant in the project, is that forecasting accuracy doesn’t necessarily improve when analysts have access to highly classified signals intelligence…. In fact, the top forecasters, drawn from universities and elsewhere, performed about 30 percent better than the average for intelligence community analysts who could read intercepts and other secret data.” Some have speculated that Ignatius had a peek at the Year 3 Goldstein data and based his assertion on it. We have no insight into what Ignatius was shown but note that at the time his article was published, fewer than 20% of the 139 Year 3 questions that Goldstein used as data had closed and been scored. Perhaps Ignatius had access to forecasts from Year 2, when Superforecasters worked on over 100 forecast questions? We simply don’t know.
In summary, we have no directly comparable data on whether Superforecasters in a prediction market were better than the ICPM, but Goldstein’s study makes clear that Superforecasting methods employed by GJ yielded forecasts that were 34.7% more accurate than the Intelligence Community Prediction Market.
We welcome further discussion on the above, including on how forecasting experts compare to subject matter experts. Arb Research has brought new attention to an important set of research questions for all parties interested in the forecasting of critical topics.
[1] The ACE Program was a four-year US Government research project whose goal was to “dramatically enhance the accuracy, precision, and timeliness of intelligence forecasts for a broad range of event types, through the development of advanced techniques that elicit, weight, and combine the judgments of many intelligence analysts”.
[2] Of the 161 forecasting questions that ran in IARPA/ACE Year 3, we do not know which 139 were cross-posted on the ICPM.