Challenges in evaluating forecaster performance

Briefly: There are some good reasons to assess how good you or others are at forecasting (alongside some less-good ones). This is much harder than it sounds, even with a fairly long track record on something like Good Judgement Open to review: all the natural candidate metrics are susceptible to distortion (or ‘gaming’). Making comparative judgements more prominent also has other costs. Perhaps all of this should have been obvious at first glance. But it wasn’t to me, hence this post.


I’m a fan of cultivating better forecasting performance in general and in the EA community in particular. Perhaps one can break down the benefits this way.

Training: Predicting how the future will go seems handy for intervening upon it to make it go better. Given forecasting seems to be a skill that can improve with practice, practising it could be a worthwhile activity.

Community accuracy: History augurs poorly for those who claim to know the future. Although even unpracticed forecasters typically beat chance, they tend to be inaccurate and overconfident. My understanding (tendentiously argued elsewhere) is that taking aggregates of these forecasts (ditto all other beliefs we have, ibid.) allows us to fare better than we would each on our own. Forecasting platforms are one useful way to coordinate in such an exercise, and so participation supplies a common epistemic good.[1] Although this good is undersupplied throughout the intellectual terrain, it may be particularly valuable for more ‘in house’ topics of the EA community, given less in the way of ‘outside interest’.

Self-knowledge/‘calibration’: Knowing one’s ability as a forecaster can be a useful piece of self-knowledge. It can inform how heavily we should weigh our own judgement in those rare cases where our opinion comprises a non-trivial proportion of the opinions we are modestly aggregating (ibid. ad nauseam). Sometimes others ask us for forecasts, often under the guise of advice (I have been doing quite a lot of this with the ongoing COVID-19 pandemic): our accuracy (absolute or relative) would be useful to provide alongside our forecast, so our advice can be weighed appropriately by its recipient.

Epistemic peer evaluation: It has been known for some to offer their opinion despite their counsel not being invited. In such cases, public disagreement can result. We may be more accurate in adjudicating these disagreements by weighing the epistemic virtue of the opposing ‘camps’ instead of the balance of argument as it appears to us (ibid. - peccavi).

Alas, direct measures of epistemic accuracy can be elusive: people are apt to better remember (and report) their successes over their failures, and track records from things like prop betting or publicly registered predictions tend to be low-resolution. Other available proxy measures for performance (subject matter expertise, social status, a writing style suffused with fulminant candenzas of melismatic and mellifluous (yet apropos and adroit) limerence of language [sic]) are inaccurate. Forecasting platforms allow people to make a public track record, and paying greater attention to these track records likely improves on whatever rubbish approach is the status quo for judging others’ judgement.


The latter two objectives require some means of comparing forecasters to one another.[2] This evaluation is tricky for a few reasons:

1. Metrics which allow good inter-individual comparison can interfere with the first two objectives, alongside other costs.

2. Probably in principle (and certainly in practice) natural metrics for this introduce various distortions.

3. (In consequence, said metrics are extremely gameable and vulnerable to Goodhart’s law).

Forecasting and the art of slaking one’s fragile and rapacious ego

Suppose every EA started predicting on a platform like Metaculus. Also suppose there was a credible means to rank all of them by their performance (more later). Finally, suppose this ‘Metaculus rank’ became an important metric used in mutual evaluation.

Although it goes without saying effective altruists almost perfectly act to further the common good, all-but-unalloyed with any notion of self-regard, insofar as this collective virtue is not adamantine, perverse incentives arise. Such as:

  • Fear of failure has a mixed reputation as an aid to learning. Prevalent worry about ‘tanking one’s rank’ could slow learning and improvement, and result in poorer individual and collective performance.

  • People can be reluctant to compete when they believe they are guaranteed to lose. Whoever finds themselves in the bottom 10% may find excusing themselves from forecasting more appealing than continuing to broadcast their inferior judgement (even your humble author [sic—ad nau- nvm] may not have written this post if he was miles below-par on Good Judgement Open). This is bad for these forecasters (getting better in absolute terms still matters), and for the forecasting community (relatively poorer forecasters still provide valuable information).

  • Competing over relative rank is zero-sum. To win in zero-sum competition, it is not enough that you succeed: all others must fail. For a competitor, good reasoning techniques and new evidence are better jealously guarded as ‘trade secrets’ than publicly communicated. Yet public communication is what helps everyone get better, and makes the ‘wisdom of the crowd’ wiser.

Places like GJO and Metaculus are aware of these problems, and so do not reward relative accuracy alone, either through separate metrics (badges for giving your rationale, ‘upvotes’ on comments, etc.) or making their ‘ranking’ measures composite metrics of accuracy and other things like activity (more later).[3]

These composite metrics are often better. Alice, who starts off a poor forecaster but through diligent practice becomes a good (but not great) and regular contributor to a prediction platform has typically done something more valuable and praiseworthy than Bob, who was naturally brilliant but only stuck around long enough to demonstrate a track record to substantiate his boasting. Yet, as above, sometimes we really do (and really should) care about performance alone, and would value Bob’s judgement over Alice’s.

Hacking scoring metrics for minimal fun and illusory profit

Even if we ignore the above, constructing a good metric of relative accuracy is much easier said than done. Even if we want to (as Tetlock recommends) ‘keep score’ of our performance, essentially all means of keeping score either introduce distortions, are easy to Goodhart, or are uninterpretable. To illustrate, I’ll use all the metrics available for participating in Good Judgement Open as examples (I’m not on Metaculus, but I believe similar things apply).

Incomplete evaluation and strategic overconfidence: Some measures are only reported for single questions or a small set of questions (‘forecast challenges’ in the GJO). This can inadvertently reward overconfidence. ‘Best performers’ for a single question are typically overconfident (and typically inaccurate) forecasters who maxed out their score by betting 0/​100% the day a question opened and got lucky.

Sets of questions (‘challenges’) do a bit better (good forecasters tend to find themselves frequently near the top of the leader board), but their small number still allows a lot of volatility. My percentile across question sets on GJO varies from top 0.1% to significantly below average. The former was on a set where I was on the ‘right’ side of the crowd for all dozen of the questions in the challenge. Yet for many of these I was at something like 20% whilst the crowd was at 40% - even presuming I had edge rather than overconfidence, I got lucky that none of these happened. Contrariwise, being (rightly) less highly confident than the crowd will pay out in the long run, but the modal result in a small question-set is getting punished. The latter was a set where I ‘beat the crowd’ on most of the low probabilities, but tanked on an intermediate probability one—Brier scoring and non-normalized adding of absolute difference means this question explained most of the variance in performance across the set.[4]

If one kept score by one’s ‘best rankings’, one’s number of ‘top X finishes’, or similar, this measure would reward overconfidence: although overconfidence costs you in the long run, it amplifies good fortune.
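To make the point about amplified luck concrete, here is a minimal sketch with made-up numbers (the 80% true probability and the function names are my own illustrative assumptions, not anything GJO publishes). It uses the simple single-outcome form of the Brier score, and compares a calibrated forecaster with one who always bets 100%.

```python
def brier(forecast: float, outcome: int) -> float:
    """Single-outcome Brier score: squared error of the forecast (0 is perfect)."""
    return (forecast - outcome) ** 2


def expected_brier(forecast: float, true_prob: float) -> float:
    """Long-run average score when the event happens with probability true_prob."""
    return true_prob * brier(forecast, 1) + (1 - true_prob) * brier(forecast, 0)


def chance_of_perfect_set(true_prob: float, n_questions: int) -> float:
    """Chance that an always-100% forecaster scores a perfect 0 on every question."""
    return true_prob ** n_questions


TRUE_PROB = 0.8  # hypothetical: each question resolves 'yes' 80% of the time

calibrated = expected_brier(0.8, TRUE_PROB)     # forecasts the true probability
overconfident = expected_brier(1.0, TRUE_PROB)  # maxes out at 100%

# Best score each forecaster can possibly post on a single question.
best_calibrated = brier(0.8, 1)
best_overconfident = brier(1.0, 1)

print(f"expected Brier, calibrated:    {calibrated:.3f}")
print(f"expected Brier, overconfident: {overconfident:.3f}")
print(f"best single-question score, calibrated:    {best_calibrated:.3f}")
print(f"best single-question score, overconfident: {best_overconfident:.3f}")
print(f"chance overconfident forecaster is perfect on 1 question:   {chance_of_perfect_set(TRUE_PROB, 1):.2f}")
print(f"chance overconfident forecaster is perfect on 12 questions: {chance_of_perfect_set(TRUE_PROB, 12):.3f}")
```

Under these assumptions the overconfident forecaster is worse on average, yet posts an unbeatable score on most single questions, so any ‘best finish’ measure will tend to favour them.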

Activity loading: The leaderboard for challenges isn’t ranked by Brier score (more later), but by ‘accuracy’, essentially your Brier score minus the crowd’s Brier score. GJO evaluates each day, and ‘carries forward’ forecasts made before (i.e. if you say 52% on Monday, you are counted as forecasting 52% on Tuesday, Wednesday, and every day until the question closes unless you change it). Thus, if you are typically beating the crowd, your ‘accuracy score’ is also partly an activity score: answering all the questions, having active forecasts as soon as questions open (etc.) all improve one’s score without being measures of good judgement per se.
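A rough sketch of how carry-forward scoring turns activity into ‘accuracy’ is below. The exact GJO formula is not given in this post, so the scoring rule here (own Brier minus crowd Brier, summed over the days you had an active forecast and divided by the total days the question was open) is an assumption for illustration, as are the function names and numbers.

```python
from typing import Dict, List, Optional


def carry_forward(updates: Dict[int, float], n_days: int) -> List[Optional[float]]:
    """Expand {day: forecast} updates into one forecast per day.

    Days before the first update have no forecast (None); after that the
    latest forecast is carried forward until it is changed.
    """
    daily: List[Optional[float]] = []
    current: Optional[float] = None
    for day in range(n_days):
        if day in updates:
            current = updates[day]
        daily.append(current)
    return daily


def brier(forecast: float, outcome: int) -> float:
    """Single-outcome Brier score (0 is perfect)."""
    return (forecast - outcome) ** 2


def accuracy_score(updates: Dict[int, float], crowd_daily: List[float], outcome: int) -> float:
    """Assumed rule: daily (own Brier - crowd Brier), averaged over all days.

    Days with no active forecast contribute zero. Negative means better
    than the crowd.
    """
    n_days = len(crowd_daily)
    own_daily = carry_forward(updates, n_days)
    total = 0.0
    for mine, crowd in zip(own_daily, crowd_daily):
        if mine is not None:
            total += brier(mine, outcome) - brier(crowd, outcome)
    return total / n_days


# Hypothetical 10-day question that resolves 'yes'; the crowd sits at 60% throughout.
crowd = [0.6] * 10
early = accuracy_score({0: 0.8}, crowd, outcome=1)  # forecasts 80% on day one
late = accuracy_score({8: 0.8}, crowd, outcome=1)   # same 80%, but two days before close

print(f"early forecaster: {early:.3f}")
print(f"late forecaster:  {late:.3f}")
```

Both forecasters held exactly the same view, but the one who entered it on day one gets credit for ten days of beating the crowd rather than two.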

Metaculus ranks all users by a point score which (like this forum’s karma system) rewards a history of activity rather than ‘present performance’: even if Alice was more accurate than all current Metaculus users, if she joined today it would take her a long time to rise to the top.

Raw performance scores are meaningless without a question set: Happily, GJO uses a fairly pure ‘performance metric’ front and centre: Brier score across all of your forecasts. Although there are ways to ‘grind’ activity into accuracy (updating very frequently, having a hair trigger to update on news a couple of days before others get around to it, etc.) it loads much more heavily on performance. It also (at least in the long run) punishes overconfidence.

The problem is that this measure has little meaning on its own. A Brier score of (e.g.) 0.3 may be good or bad depending on how hard it was to forecast your questions—and one can get arbitrarily close to 0 by ‘forecasting the obvious’ (i.e. putting 100% on ‘No’ the day before a question closes and the event has not happened yet). One can fairly safely say your performance isn’t great if you’re underperforming a coin flip, but more than that is hard to say.
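A small worked example may help here. It assumes the two-outcome form of the Brier score (squared error summed over both ‘yes’ and ‘no’, so scores run from 0 to 2 and a 50/50 forecast always scores 0.5); the question sets and forecasts are invented for illustration.

```python
from typing import List, Tuple


def brier_binary(p_yes: float, happened: bool) -> float:
    """Brier score summed over both outcomes of a yes/no question (0 best, 2 worst)."""
    outcome = 1.0 if happened else 0.0
    return (p_yes - outcome) ** 2 + ((1 - p_yes) - (1 - outcome)) ** 2


def mean_brier(forecasts: List[Tuple[float, bool]]) -> float:
    """Average Brier score over (probability of 'yes', whether it happened) pairs."""
    return sum(brier_binary(p, happened) for p, happened in forecasts) / len(forecasts)


# A coin-flip forecast scores 0.5 whatever happens.
coin_flip_yes = brier_binary(0.5, True)
coin_flip_no = brier_binary(0.5, False)

# Hypothetical hard questions: 70% forecasts, of which three in four come true.
hard_questions = [(0.7, True), (0.7, True), (0.7, True), (0.7, False)]

# Hypothetical 'forecasting the obvious': 1% on events that (unsurprisingly) do not happen.
obvious_questions = [(0.01, False)] * 4

hard_score = mean_brier(hard_questions)
obvious_score = mean_brier(obvious_questions)

print(f"coin flip:            {coin_flip_yes:.2f}")
print(f"hard questions:       {hard_score:.2f}")
print(f"forecasting obvious:  {obvious_score:.4f}")
```

The forecaster on the hard set is well calibrated and still scores far ‘worse’ than the one forecasting the obvious, which is why the raw number says little without knowing the questions behind it.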

Comparative performance is messy with disjoint evaluation sets: To help with this, GJO provides a ‘par score’ as a benchmark, composed of median performance across other forecasts on the same questions and time periods as one’s own. This gives a fairly reliable single bit of information: a Brier score of 0.3 is good if it is lower than this median, as it suggests one has outperformed the typical forecasts on these questions (and vice versa).

However, it is hard to say much more than that. If everyone answered the same questions across the same time periods, one could get a sense of (e.g.) ‘how much better than the median’, or which of Alice and Bob (who are both ‘better than average’) is better. But when (as is typically the case on forecasting questions) people forecast disjoint sets of questions, this goes up in the air again. ‘How much you beat the median by’ can depend as much on ‘picking the right questions’ as ‘forecasting well’:

  • If the median forecast is already close to the forecast you would make, making a forecast can harm your ‘average outperformance’ even if you are right and the median is wrong (at the extreme, making a forecast identical to the median inevitably drags one closer to par). This can bite a lot for low likelihood questions: if the market is at 2% but you think it should be 8%, even if you’re right you are a) typically likely to lose in the short run, and b) even in the long run the yield, when aggregated, can make you look more similar to the average.

  • On GJO the more political questions tend to attract less able forecasters, and the niche/​technical ones more able ones. For questions where the comments are often “I know X is a fighter, they will win this!” or “Definitely Y, as [conspiracy theory]”, the bar to ‘beat the crowd’ is much lower than for questions where the comments are more “I produced this mathematical model for volatility to give a base-rate for exceeding the cut-off value which does well back-tested on the last 2 years—of course, back-testing on training data risks overfitting (schoolboy error, I know), but this also corresponds with the probability inferred from the price of this exotic financial instrument”.
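The 2% versus 8% case in the first bullet can be put into numbers. This is only a sketch: it assumes your 8% is exactly right, uses the two-outcome Brier score (0 to 2), and compares you with the market on a single question; the function names are my own.

```python
from typing import Tuple


def brier_binary(p_yes: float, happened: bool) -> float:
    """Brier score summed over both outcomes of a yes/no question (0 best, 2 worst)."""
    outcome = 1.0 if happened else 0.0
    return (p_yes - outcome) ** 2 + ((1 - p_yes) - (1 - outcome)) ** 2


def outperformance(mine: float, market: float, true_prob: float) -> Tuple[float, float]:
    """Return (chance of scoring worse than the market, expected Brier margin).

    The margin is own Brier minus market Brier, so negative means you beat
    the market on average.
    """
    margin_if_yes = brier_binary(mine, True) - brier_binary(market, True)
    margin_if_no = brier_binary(mine, False) - brier_binary(market, False)
    expected_margin = true_prob * margin_if_yes + (1 - true_prob) * margin_if_no
    chance_worse = 0.0
    if margin_if_yes > 0:
        chance_worse += true_prob
    if margin_if_no > 0:
        chance_worse += 1 - true_prob
    return chance_worse, expected_margin


# Market at 2%, you at 8%, and (by assumption) 8% is the true probability.
chance_worse, expected_margin = outperformance(mine=0.08, market=0.02, true_prob=0.08)

print(f"chance you score worse than the market on one question: {chance_worse:.2f}")
print(f"expected margin over the market (negative is better):   {expected_margin:.4f}")
```

On these assumptions you lose to the market on 92% of such questions, and the long-run reward for being right is a margin of well under 0.01 per question, so a pile of these forecasts mostly pulls your average towards par.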

This is partly a good thing, as it incentivises forecasters to prioritise questions where the current consensus forecast is less accurate. But it remains a bad one for the purposes discussed here: even if Alice in some platonic sense is a superior forecaster, Bob may still appear better if he happens to (or strategizes to) benefit from these trends more than her.[5]

Admiring the problem: an Ode to Goodhart’s law

One might have hoped that, as well as describing this problem, I would also have some solutions to propose. Sadly this is beyond me.

The problem seems fundamentally tied to Goodhart’s law (q.v.). People may have a mix of objectives for forecasting, and this mix may differ between people. Even a metric closely addressed to one objective probably will not line up perfectly, and the incentives to ‘game’ would amplify the divergence. With respect to other important objectives, one expects a greater mismatch: a metric that rewards accuracy can discourage participation (see above); a metric that rewards participation can encourage people to ‘do a lot’ without necessarily ‘adding much’ (or trying to get much better). Composite metrics can help with this problem, but provide another one in turn: difficulty isolating and evaluating individual aspects of the composite.

Perhaps best is a mix of practical wisdom and balance—taking the metrics as useful indicators but not as targets for monomaniacal focus. Some may be better at this than others.

[1]: Aside: one skill in forecasting which I think is neglected is formulating good questions. Typically our convictions are vague gestalts rather than particular crisp propositions. Finding useful ‘proxy propositions’ which usefully inform these broader convictions is an under-appreciated art.

[2]: Relative performance is a useful benchmark for peer weighting as well as self-knowledge. If Alice tends to be worse than the average person at forecasting, she would be wise to be upfront about this lest her ‘all things considered’ judgements inadvertently lead others (who would otherwise aggregate better discounting her view rather than giving it presumptive equal weight) astray.

[3]: I imagine it is also why (e.g.) the GJP doesn’t spell out exactly how it selects the best GJO users for superforecaster selection.

[4]: One can argue the toss about whether there are easy improvements. One could make a scoring rule more sensitive to accuracy on rare events (Brier is infamously insensitive), or do some intra-question normalisation of accuracy. The downside would be this is intensely gameable, encouraging a ‘pick the change up in front of the steamroller’ strategy—overconfidently predicting rare events definitely won’t happen will typically net one a lot of points, with the occasional massive bust.

[5]: One superforecaster noted taking a similar approach (but see).