Challenges in evaluating forecaster performance

Briefly: I think forecasting is worth practising, and forecasting track records are worth attending to. But comparing forecasters to one another is trickier than it looks: metrics that allow comparison can cut against practice and participation, the natural candidates introduce distortions, and most are easy to game.

Motivation

I’m a fan of cultivating better forecasting ability in general and in the EA community in particular. Perhaps one can break down the benefits this way.

Training: Predicting how the future will go seems handy for intervening upon it to make it go better. Given that forecasting seems to be a skill one can improve with practice, practising it could be a worthwhile activity.

Community accuracy: History augurs poorly for those who claim to know the future. Although even unpractised forecasters typically beat chance, they tend inaccurate and overconfident. My understanding (tendentiously argued elsewhere) is that taking aggregates of these forecasts - ditto all other beliefs we have (ibid) - allows us to fare better than we would each on our own. Forecasting platforms are one useful way to coordinate in such an exercise, and so participation supplies a common epistemic good.[1] Although this good is undersupplied across the intellectual terrain, it may be particularly valuable for ‘in house’ topics of the EA community, as few outside it contemplate these topics.

Self-knowledge/‘calibration’: Knowing one’s ability as a forecaster is a useful piece of self-knowledge. It can inform how heavily we should weigh our own judgement in those rare cases where our opinion comprises a non-trivial proportion of the opinions we are modestly aggregating (ibid ad nauseam). Sometimes others ask us for forecasts, often under the guise of advice (I have been doing quite a lot of this with the ongoing COVID-19 pandemic): our accuracy (absolute or relative) would be useful to provide alongside our forecast, so our advice can be weighed appropriately by its recipient.

Epistemic peer evaluation: It has been known for some to offer their opinion despite their counsel not being invited. In such cases, public disagreement can result. We may be more accurate in adjudicating these disagreements by weighing the epistemic virtue of the opposing ‘camps’ instead of the balance of argument as it appears to us (ibid—peccavi).

Alas, direct measures of epistemic accuracy can be elusive: people are apt to remember (and report) their successes better than their failures, and track records from things like prop betting or publicly registered predictions tend low-resolution. Other available proxy measures for ‘intellectual clout’ - subject matter expertise, social status, a writing style suffused with fulminant cadenzas of melismatic and mellifluous (yet apropos and adroit) limerence of language - are inaccurate. Forecasting platforms allow people to publicly demonstrate their good judgement, and paying greater attention to these track records likely improves on whatever rubbish approach is the status quo for judging others’ judgement.

Challenges

The latter two objectives require some means of comparing forecasters to one another.[2] This evaluation is tricky for a few reasons:

1. Metrics which allow good inter-individual comparison can interfere with the first two objectives, alongside other costs.

2. Probably in principle (and certainly in practice) natural metrics for this introduce various distortions.

3. (In consequence, said metrics are extremely gameable and vulnerable to Goodhart’s law).

Forecasting and the art of slaking one’s fragile and rapacious ego

Suppose every EA started predicting on a platform like Metaculus. Also, suppose there was a credible means to rank all of them by their performance (more later). Finally, suppose this ‘Metaculus rank’ became an important metric used in mutual evaluation.

Although it goes without saying that effective altruists act almost perfectly to further the common good, all but unalloyed with any notion of self-regard, insofar as this collective virtue is not adamantine, perverse incentives arise. Such as:

  • Fear of failure has a mixed reputation as an aid to learning. Prevalent worry about ‘tanking one’s rank’ could slow learning and improvement, and result in poorer collective performance.

  • People can be reluctant to compete when they believe they are guaranteed to lose. Whoever finds themselves in the bottom 10% may find excusing themselves from forecasting more appealing than continuing to broadcast their inferior judgement (even your humble author might not have written this post had he been miles below par on Good Judgement Open). This is bad for these forecasters (getting better in absolute terms still matters), and for the forecasting community (relatively poorer forecasters still provide useful information).

  • Competing over relative rank is zero-sum. To win a zero-sum competition, it is not enough that you succeed: all others must fail. Good reasoning techniques and new evidence are better jealously guarded than publicly shared. Yet sharing them helps everyone get better, and makes the ‘wisdom of the crowd’ wiser.

Places like GJO and Metaculus are aware of these problems, and so do not reward relative accuracy alone, either through separate metrics (badges for giving your rationale, ‘upvotes’ on comments, etc.) or by making their ‘ranking’ measures composites of accuracy and other things like activity (more later).[3]

These composite metrics are often better. Alice, who starts off a poor forecaster but through diligent practice becomes a good (but not great) and prolific contributor to a prediction platform, has done something more valuable and praiseworthy than Bob, who was naturally brilliant but only stuck around long enough to demonstrate a track record to substantiate his boasting. Yet, as above, sometimes we really do (and really should) care about relative accuracy alone, and would value Bob’s judgement over Alice’s.

Hacking scoring metrics for minimal fun and illusory profit

Even if we ignore the above, constructing a good metric of relative accuracy is much easier said than done. Although we may want to (as Tetlock recommends) ‘keep score’ of our performance, essentially all means of keeping score either introduce distortions, are easy to Goodhart, or are uninterpretable. To illustrate, I’ll use the metrics available to participants on Good Judgement Open as examples (I’m not on Metaculus, but I believe similar things apply).

Incomplete evaluation and strategic overconfidence: Some measures are only reported for single questions or small sets of questions (‘forecast challenges’ on GJO). The problems here are variance and inadvertently rewarding overconfidence. ‘Best performers’ for a single question are invariably overconfident (and typically inaccurate) forecasters who maxed out their score by betting 0/100% the day the question opened and got lucky.

Sets of questions do a bit better (good forecasters tend to find themselves near the top of the leaderboard fairly often), but their small number still allows a lot of volatility. My percentile across sets varies from the top 0.1% to the 60th or so. The former was on a set where I was on the ‘right’ side of the crowd for all of the dozen or so questions. Yet for many of these I was at something like 20% whilst the crowd was at 40% - even presuming I had edge rather than overconfidence, I got lucky that none of these events happened. Contrariwise, being (rightly) less confident than the crowd will pay out in the long run, but the modal result over a small question set is getting punished. The latter was a set where I ‘beat the crowd’ on most of the low-probability questions, but tanked on an intermediate-probability one: Brier scoring, and simply adding up per-question score differences without any normalisation, meant this single question explained most of the variance in my performance across the set.[4]
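
To make that concrete, here is a toy calculation (made-up numbers, not my actual GJO set, and assuming the two-component Brier convention) showing how a single intermediate-probability miss can swamp an edge accumulated over many low-probability questions:

```python
# Toy numbers (assumed for illustration). Two-component Brier for a binary question:
# 0 is a perfect score, 2 is maximally wrong. 'Edge' = crowd Brier - my Brier,
# so positive means I did better than the crowd.
def brier(p, outcome):
    return (p - outcome) ** 2 + ((1 - p) - (1 - outcome)) ** 2

# Nine 'easy' questions: I forecast 5%, the crowd forecasts 15%, none of the events happen.
edge_easy = sum(brier(0.15, 0) - brier(0.05, 0) for _ in range(9))

# One intermediate question: I forecast 30%, the crowd forecasts 60%, the event happens.
edge_hard = brier(0.60, 1) - brier(0.30, 1)

print(f"Edge gained over nine easy questions: {edge_easy:+.2f}")           # +0.36
print(f"Edge lost on the one mid-range miss:  {edge_hard:+.2f}")           # -0.66
print(f"Net edge over the crowd for the set:  {edge_easy + edge_hard:+.2f}")  # -0.30
```

Despite beating the crowd on nine questions out of ten, the single big miss leaves me behind for the set as a whole.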

If one kept score by one’s ‘best rankings’, one’s number of ‘top ten finishes’, or similar, this measure would reward overconfidence: although overconfidence costs you in the long run, over the short run it can amplify good fortune.
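
A quick simulation makes the point (a deliberately crude model with assumed numbers, not GJO’s actual mechanics): the extremising forecaster has the worse average score, but the better ‘best finish’, so score-keeping by best finishes flatters them.

```python
# Crude simulation (assumed setup): 200 binary questions, each with a true
# probability of 20%. One forecaster always reports the true 20%; the other
# always reports 0%. Lower Brier is better.
import random

def brier(p, outcome):
    return (p - outcome) ** 2 + ((1 - p) - (1 - outcome)) ** 2

random.seed(0)
outcomes = [random.random() < 0.2 for _ in range(200)]

calibrated    = [brier(0.2, o) for o in outcomes]
overconfident = [brier(0.0, o) for o in outcomes]

for name, scores in [("calibrated", calibrated), ("overconfident", overconfident)]:
    print(f"{name:>13}: mean Brier {sum(scores) / len(scores):.2f}, "
          f"best single-question Brier {min(scores):.2f}")
# The calibrated forecaster has the better mean (~0.32 vs ~0.40), but the
# overconfident one has the better 'best finish' (0.00 vs 0.08).
```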

Activity loading: The leaderboard for challenges isn’t ranked by Brier score (more later), but by accuracy: essentially your Brier score minus the crowd’s Brier score. GJO evaluates each day, and ‘carries forward’ forecasts made before (i.e. if you say 52% on Monday, you are counted as forecasting 52% on Tuesday, Wednesday, and every day until the question closes unless you change it). Thus, if you are beating the crowd, your ‘accuracy score’ is also partly an activity score: answering all the questions and having active forecasts as soon as they open both improve one’s rank without being measures of good judgement.[5]
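
Here is a back-of-envelope model of that effect. It is my own reconstruction for illustration rather than GJO’s documented formula: I assume a constant per-day edge over the crowd while a forecast is active, zero edge on days before one’s first forecast, and zero edge on questions one never answers.

```python
# Rough model (assumptions as above, not GJO's actual scoring): a challenge of
# n_questions, each open for days_open days. 'Accuracy' here is the mean of
# (crowd Brier - your Brier) across all questions and days.
def accuracy_score(n_questions, n_answered, days_open, days_late, daily_edge=0.02):
    fraction_of_days_active = max(days_open - days_late, 0) / days_open
    per_question_edge = daily_edge * fraction_of_days_active
    return n_answered * per_question_edge / n_questions

# Two forecasters with identical per-day skill, different activity levels:
print(accuracy_score(20, n_answered=20, days_open=60, days_late=0))   # 0.02
print(accuracy_score(20, n_answered=10, days_open=60, days_late=30))  # 0.005
```

The diligent forecaster ends up with four times the ‘accuracy’ of the tardy one, despite identical judgement on any given day.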

Metaculus ranks all users by a point score which (like this forum’s karma system) rewards a history of activity rather than ‘present performance’: even if Alice were more accurate than every current Metaculus user, if she joined today it would take her a very long time to overtake them.

Raw scores are meaningless without a question set: Happily, GJO puts a fairly pure ‘performance metric’ front and centre, namely your Brier score across all of your forecasts.
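
For reference, and as I understand it (this is GJO’s convention rather than something I can vouch for in detail), the score reported is the original two-component Brier score, averaged over questions and over the days each question is open:

$$\text{Brier} = \frac{1}{N}\sum_{i=1}^{N}\sum_{c}\left(f_{i,c}-o_{i,c}\right)^{2}$$

where f is the probability assigned to outcome c of question i, and o is 1 if that outcome occurred and 0 otherwise. On this convention a binary question scores between 0 (certain and correct) and 2 (certain and wrong), and an unwavering 50/50 forecast scores 0.5.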


Raw Brier score: [screenshot as displayed on GJO]

[1]: Aside: one skill in forecasting which I think is neglected is formulating good questions. Typically our convictions are vague gestalts rather than particular crisp propositions. Finding crisp ‘proxy propositions’ which usefully inform these broader convictions is an under-appreciated art.

[2]: Relative performance is a useful benchmark for self-knowledge as well as peer evaluation. Raw measures of absolute accuracy tend uninformative (much more later). If Alice tends worse than the average person at forecasting, she would be wise to be upfront about this lest her ‘all things considered’ judgements inadvertently lead others (who will typically aggregate better) astray.

[3]: I imagine this is also why GJP doesn’t spell out exactly how it selects the best GJO users as candidate superforecasters.

[4]: One can argue the toss about whether there are easy improvements. One could make a scoring rule more sensitive to accuracy on rare events (Brier is infamously insensitive), or do some intra-question normalisation of accuracy. The downside is that this would be intensely gameable for small question sets, encouraging a ‘pick the change up in front of the steamroller’ strategy: overconfidently predicting that rare events definitely won’t happen will typically net one a lot of points, with the occasional massive bust.

[5]: The ‘carry forward’ feature also means there are other ways to improve one’s score which are more ‘grinding’ than ‘talent’, such as steadily reducing a forecast for an event happening within a given time period as that period elapses. These are pretty minor, though.