Comparing top forecasters and domain experts

Arb is a new research consultancy. You can reach us at hi@arbresearch.com. Listen to this post on the Nonlinear Library.

The superforecasting phenomenon—that certain teams of forecasters are better than other prediction mechanisms like large crowds and simple statistical rules—seems sound. But serious interest in superforecasting stems from the reported triumph of forecaster generalists over non-forecaster experts. (Another version says that they also outperform analysts with classified information.)

So distinguish some claims:

  1. “Forecasters > the public”

  2. “Forecasters > simple models”

  3. “Forecasters > experts”

     3a. “Forecasters > experts with classified info”
     3b. “Averaged forecasters > experts”
     3c. “Aggregated forecasters > experts”

Is (3) true? This post reviews all the studies we could find on experts vs forecasters. (We also attempt to cover the related question of prediction markets vs experts.)

Conclusions

First, our conclusions. These look pessimistic, but are mostly pretty uncertain:

  • We think claim (1) is true with 99% confidence[1] and claim (2) is true with 95% confidence. But surprisingly few studies compare experts to generalists (i.e. study claim 3). Of those we found, the analysis quality and transparency leave much to be desired. The best study found that forecasters and health professionals performed similarly. In other studies, experts had goals besides accuracy, or there were too few of them to produce a good aggregate prediction.

  • (3a) A common misconception is that superforecasters outperformed intelligence analysts by 30%. Instead: Goldstein et al. showed that [EDIT: the Good Judgment Project’s best-performing aggregation method][2] outperformed the intelligence community, but this was partly due to the different aggregation technique used (the GJP weighting algorithm performs better than prediction markets, given the apparently low volumes of the ICPM market). The forecaster prediction market performed about as well as the intelligence analyst prediction market; and in general, prediction pools outperform prediction markets in the current market regime (e.g. low subsidies, low volume, perverse incentives, narrow demographics). [85% confidence]

  • (3b) In the same study, the forecaster average was notably worse than the intelligence community.

  • (3c) Ideally, we would pit a crowd of forecasters against a crowd of experts. Only one study, an unpublished extension of Sell et al., manages this; it found a small (~3%) forecaster advantage.

  • The bar may be low. That is: it doesn’t seem that hard to become a top forecaster, at present. Expertise, plus basic forecasting training and active willingness to forecast regularly, were enough to be on par with the best forecasters. [33%][3]

  • In more complex domains, like ML, there could be significant returns to expertise. So it might be better to broaden focus from generalist forecasters to competent ML pros who are excited about forecasting. [40%]

Table of studies

Due to issues with rendering tables on the forum, please see the table in the Google Doc version of this post.

Search criteria

We were given a set of initial studies to branch out from.

And some general suggestions for scholarship:

  • look for review articles

  • look for textbooks and handbooks or companions

  • find key terms

  • go through researchers’ homepages/Google Scholar

Superforecasting began with IARPA’s ACE tournament. (Misha thinks the evidence in Tetlock’s Expert Political Judgment doesn’t fit for our purposes: there were no known skilled-amateur forecasters at that point.)

A Google Scholar search for studies funded by IARPA ACE yielded no studies. We also looked at other IARPA projects (ForeST, HCT, and OSI) that seemed at least remotely relevant to our goals.

We searched Google Scholar for (non-exhaustive list): “good judgment project”, “superforecasters”, “collective intelligence”, “wisdom of crowds”, “crowd prediction”, “judgemental forecasting”, …, and various combinations of these, and “comparison”, “experts”, …

We got niche prediction markets from the Database of Prediction Markets and searched for studies mentioning them. Hollywood SX and FantasySCOTUS paid off as a result. We also searched for things people commonly predict: sports, elections, Oscars, and macroeconomics.

In the process, we read the papers for additional keywords and references. We also looked for other papers from the authors we encountered.

On AI forecasting

In more complex domains, like ML, there could be significant returns to knowledge and expertise. It seems to us that moving from generalist forecasters to competent ML practitioners/researchers might be in order, because:

  • To predict e.g. scaling laws and emerging capabilities, people need to understand them, which requires some ML expertise

  • It’s unclear whether general forecasters actually outperform experts in a legible domain, even though we believe in the phenomenon of superforecasting (that some people are much better forecasters than most). We also liked David Manheim’s take on Superforecasting.

  • We think that this will plausibly reduce ML researchers’ aversion to forecasting proposals — and if we were to execute it, we would be selecting good forecasters based on their performance anyway. It seems potentially feasible.


Finally, we note that the above is heavily limited by a lack of data (both a lack of data collected and a lack of availability). We hope that, at the least, the existing experimental data gets reanalyzed.

This report represents 2.1 person-weeks of effort.

Thanks to Emile Servan-Schreiber, Luke Muehlhauser, and Javier Prieto for comments. These commenters don’t necessarily endorse any of this. Mistakes are our own. Research funded by Open Philanthropy.

Appendix: Table of less relevant studies

Appendix: markets vs pools

Appendix: Expert Political Judgment (2005)

Appendix: by methods compared

Changelog for this post

Brought to you by Arb:
skin clear, crops flourishing, grades up.
  1. ^

    This is almost a trivial claim, since forecasters are by definition more interested in current affairs than average, and much more interested in epistemics than average. So we’d select for the subset of “the public” who should outperform simply through increased effort, even if all the practice and formal calibration training did nothing, and it probably does do something.

  2. ^

    Previously this section said “superforecasters”; after discussion, it seems more prudent to say “the Good Judgment Project’s best-performing aggregation method”. See this comment for details.

  3. ^

    Our exact probability hinges on what counts as a low bar and on how good e.g. Hypermind’s trained forecasters are. This is less obvious than it seems: in the CSET Foretell tournament, the top forecasters were not uniformly good; a team which included Misha finished with a 4x better relative Brier score than the “top forecaster” team. Further, our priors are mixed: (a) common sense makes us favor experts, (b) but common sense also somewhat favors expert forecasters, (c) Tetlock’s work on expert political judgment pushes us away from politics experts, and finally (d) we have first-hand experience of superforecasting being real.

  4. ^

    All Surveys Logit “takes the most recent forecasts from a selection of individuals in GJP’s survey elicitation condition, weights them based on a forecaster’s historical accuracy, expertise, and psychometric profile, and then extremizes the aggregate forecast (towards 1 or 0) using an optimized extremization coefficient.” Note that this method was selected post hoc, which raises the question of multiple comparisons; the authors respond that “several other GJP methods were of similar accuracy (<2% difference in accuracy).”
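For intuition, here is a minimal sketch of this style of aggregation: a weighted average of forecasts in log-odds space, then extremization toward 0 or 1. This is not GJP’s actual All Surveys Logit implementation; the weights and the extremization coefficient below are illustrative placeholders.

```python
import numpy as np

def aggregate_extremized(probs, weights, a=2.5):
    """Weighted average of individual forecasts in log-odds space, then extremized.

    probs   -- individual probability forecasts in (0, 1)
    weights -- non-negative weights (e.g. from past accuracy); illustrative here
    a       -- extremization coefficient; a > 1 pushes the aggregate toward 0 or 1
    """
    probs = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    log_odds = np.log(probs / (1 - probs))
    aggregate = np.dot(weights, log_odds)
    # Multiplying log-odds by a is equivalent to p**a / (p**a + (1 - p)**a).
    return 1.0 / (1.0 + np.exp(-a * aggregate))

# Three hypothetical forecasters, the historically more accurate one weighted higher.
print(aggregate_extremized([0.60, 0.70, 0.55], weights=[2.0, 1.0, 1.0]))
```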

  5. ^

    There is some inconclusive research comparing real- and play-money markets: Servan-Schreiber et al. (2004) find no significant difference for predicting NFL (American football) games; Rosenbloom & Notz (2006) find that real-money markets are more accurate for non-sports events and comparably accurate for sports markets; and Slamka et al. (2008) find real- and play-money prediction markets comparable for UEFA (soccer).

  6. ^

    MMBD is not a proper scoring rule (one incentivizing truthful reporting). If a question has a chance of resolving early (e.g., all questions of the form “will X occur by [date]?”), the rule incentivizes forecasters to report higher probabilities for such outcomes. This could have affected GJP (avg and best) predictors, who were rewarded for it; but it should not have affected ICPM and GJP (PM), as these used the Logarithmic Market Scoring Rule.
    See Sempere & Lawsen (2021) for details.
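As a quick illustration of the “proper” property mentioned above, a calibrated forecaster minimizes their expected two-sided Brier score by reporting their true belief. The sketch below checks this numerically for a single-resolution question; it does not model the by-date early-resolution dynamics that make MMBD improper.

```python
import numpy as np

def expected_brier(report, true_p):
    """Expected two-sided Brier score when the event occurs with probability true_p
    and the forecaster reports `report` for the event."""
    score_if_yes = (report - 1) ** 2 + (1 - report) ** 2   # event happens
    score_if_no = report ** 2 + ((1 - report) - 1) ** 2    # event does not happen
    return true_p * score_if_yes + (1 - true_p) * score_if_no

true_p = 0.7
reports = np.linspace(0.01, 0.99, 99)
best_report = reports[np.argmin([expected_brier(r, true_p) for r in reports])]
print(best_report)  # ~0.70: reporting the true belief minimizes the expected score
```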

  7. ^

    Our understanding is that these were not averaged. On average there were ~2.5 imputed predictions per report.

  8. ^

    It’s unclear if imputers did a reasonable job separating their personal views from their imputations. Mandel (2019) notes that the Pearson correlation between mean Brier scores for personal and imputed forecasts is very high, r(3)=.98, p=.005. Imputers’ average Brier scores ranged from .145 to .362, suggesting that the traditional analysis’s apparent accuracy depends on whether the imputers are better or worse forecasters. Lehner and Stastny (2019) responded.

  9. ^

    “We replicated some of these markets in the ICPM, or identified closely analogous predictions if they existed, so that direct comparisons between the two prediction markets could be made over time.”

  10. ^

    “We repeatedly collected forecasts from our markets and our experts to sample various time horizons, ranging from very near-term forecasts to as long as 4 months before a subject was resolved. All told, we collected 152 individual forecasts from the ICPM, InTrade, and individual IC experts over approximately matching topics and time horizons.”

  11. ^

    A perfectly calibrated forecaster expects on average 2·p·(1−p) Brier points from a prediction of p. So this average Brier suggests that a “typical” InTrade prediction was either <4% or >96%. From experience, this feels too confident, and suggests either that questions were biased towards low noise or that luck is partly responsible for such good performance.
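Making the arithmetic explicit: inverting 2·p·(1−p) recovers the two confidence levels consistent with a given average Brier score. The 0.077 input below is an illustrative value consistent with the <4%/>96% figures, not a number taken from the study.

```python
import math

def implied_confidence(avg_brier):
    """Solve 2 * p * (1 - p) = avg_brier for p (requires avg_brier <= 0.5);
    returns the two calibrated confidence levels consistent with that score."""
    disc = math.sqrt(1 - 2 * avg_brier)
    return (1 - disc) / 2, (1 + disc) / 2

# Illustrative only: an average two-sided Brier of ~0.077 corresponds to
# typical predictions of roughly 4% or 96%.
print(implied_confidence(0.077))  # ~(0.040, 0.960)
```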

  12. ^

    Personal communication with Servan-Schreiber.

  13. ^

    Given N log scores, scaled rank assigns a value of 1/N to the smallest log score, a value of 2/N to the second smallest log score, and so on, assigning a value of 1 to the highest log score. (As with the log scores, which here are computed from probability density functions, the higher the rank the better.)
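A minimal sketch of the scaled-rank transform described above (ties are not handled specially here):

```python
def scaled_ranks(log_scores):
    """Map N log scores to ranks 1/N, 2/N, ..., 1: the smallest log score gets 1/N,
    the largest gets 1. Higher is better; ties are not handled specially."""
    n = len(log_scores)
    order = sorted(range(n), key=lambda i: log_scores[i])
    ranks = [0.0] * n
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position / n
    return ranks

print(scaled_ranks([-2.3, -0.7, -1.5]))  # [0.333..., 1.0, 0.666...]
```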

  14. ^

    It’s unclear to me how well they did compared to a prior based on how often SCOTUS reverses decisions. The historical average is ~70%, with ~80% reversals in 2008, the relevant term.
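For comparison, scoring such a constant base-rate forecast is straightforward. A sketch assuming the two-sided Brier score and the ~80% reversal rate mentioned above (FantasySCOTUS’s actual scoring may differ):

```python
def brier_two_sided(p, outcome):
    """Two-sided Brier score for forecast p of 'reverse' against a 0/1 outcome."""
    return (p - outcome) ** 2 + ((1 - p) - (1 - outcome)) ** 2

# Always forecasting the ~80% reversal rate in a term where 80% of cases reverse
# gives an average score of 2 * 0.8 * 0.2 = 0.32.
average = 0.8 * brier_two_sided(0.8, 1) + 0.2 * brier_two_sided(0.8, 0)
print(average)  # 0.32
```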