How does forecast quantity impact forecast quality on Metaculus?

Introduction

Many people, myself included, worry about how much we can trust aggregate forecasts (e.g., the Metaculus community median) that are based on the predictions of only a small number of individuals. This consideration also came up in my recent post analysing predictions of future grants by Open Philanthropy, where having few predictors left me unsure of how much we could really trust the aggregate predictions[1]. How justified is this worry? In other words, to what extent is the number of individual predictors on a question correlated with the accuracy of the aggregate forecast on that question?

And to what extent does increasing the number of predictors on a question itself cause the aggregate forecast on that question to be more accurate?

The first question is relevant to whether and how to use the number of predictors as a proxy for how much to trust an aggregate forecast. The second question is relevant to how much question writers and others should aim to increase the number of predictors on a question, such as by creating binary rather than continuous questions (since binary questions generally receive ~twice as many predictions as continuous ones [2]), highlighting how forecasts on the question will inform decisions and thus be impactful, or increasing how many people use Metaculus.

This post mostly attempts to answer the first correlational question, but also discusses the second causal question.

Key points

  • Accuracy of an aggregate prediction (as measured by Brier scores) improves as the number of predictors rises, with the marginal improvement from an X% increase in the number of predictors (for any X) looking larger when there are only single-digit numbers of predictors to begin with.

  • The evidence suggests that improvement in Brier scores after reaching ~10 predictors is slow. In practice, I will now have a very similar level of confidence in a prediction whether it has 10 or 30 predictors, whereas before I might have more easily dismissed the former.

  • Criticisms have been raised of Metaculus in particular in the past that its points system incentivises lazy predictors to "herd" around the existing median, accumulating points without needing to spend much time forecasting. I investigated this, and there does not appear to be much evidence of herding among Metaculus users to any noticeable extent; or, if there is herding, it does not seem to increase as the number of predictors rises. This supports the notion that adding more predictors should continue to improve forecast accuracy at least to some extent.

The data

In this post I will look at binary forecasts from Metaculus. I focus on binary forecasts rather than continuous as they are easier to score, and the incomplete historic log of continuous forecasts available through the Metaculus API does not allow for easy scoring of them.

Previously I found that Metaculus was poorly calibrated on resolved questions with a >1 year time horizon. Restricting our set to exclude those gives a more calibrated dataset to start with and avoids us having to control for this, so I have excluded predictions with a time horizon of greater than 1 year.

I have used the Metaculus "community median" prediction throughout. This gives more weight to recently updated predictions than to older ones. I used it because it is the prediction most users will see in practice when looking at a Metaculus question.
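Metaculus does not document its exact weighting here, so the following is only an illustrative sketch of a recency-weighted median; the linearly increasing weights are my own assumption, not the real algorithm:

```python
def recency_weighted_median(preds):
    """Weighted median of predictions given in chronological order.
    Weights increase linearly with recency as a stand-in for the
    (unknown to me) real Metaculus weighting scheme."""
    weights = list(range(1, len(preds) + 1))  # later predictions weigh more
    pairs = sorted(zip(preds, weights))       # sort by predicted probability
    half = sum(weights) / 2
    acc = 0
    for p, w in pairs:
        acc += w
        if acc >= half:
            return p

# The two most recent 0.8s outweigh the earlier, lower predictions:
print(recency_weighted_median([0.2, 0.5, 0.8, 0.8]))  # 0.8
```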

I collected data from 626 resolved binary questions on Metaculus with a time horizon of less than 1 year. For each question, I restricted my analysis to the first 25% of the question lifetime (that is, ¼ of the way through the period [question open date, question resolution date]). This was to exclude situations where new information came to light affecting question outcomes (and possibly also leading more predictors to make predictions on the question), in an attempt to isolate just the effect of increasing the number of predictors.
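The cutoff itself is simple to compute. Here is a minimal sketch (the function and argument names are my own, not the Metaculus API's):

```python
from datetime import datetime

def first_quarter_cutoff(open_time, resolve_time):
    """Timestamp 25% of the way through [open date, resolution date]."""
    return open_time + (resolve_time - open_time) / 4

# A question open for 100 days gets a cutoff 25 days after opening:
cutoff = first_quarter_cutoff(datetime(2020, 1, 1), datetime(2020, 4, 10))
print(cutoff)  # 2020-01-26 00:00:00
```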

At this point the questions had the following distribution of unique predictors: (note log scale)

It looks approximately log-linear, though it blows up at the high end, with the top 5% of questions receiving much more attention. Looking at these, there are some common themes (several 2020 election season questions, Covid questions, AI milestone questions, and questions related to technological development), but the set is reasonably diverse. You can see what the most popular resolved questions were here; these are sorted by "interest", but this is very highly correlated with the number of forecasts.

Analysis

Between-question analysis

I scored the predictions using Brier scores. The Brier score is a scoring rule for probabilistic forecasts. For a set of binary forecasts, the Brier score is equivalent to the mean squared error of the forecasts, so a lower score is better. I took logs of the numbers of predictors, as it seemed a natural unit to use given the distribution of predictor counts and my strong prior that increasing your number of predictors by some fixed number matters more when you have fewer predictors to start with.
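As a concrete illustration, the Brier score for a set of binary forecasts can be computed as:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error of probabilistic forecasts against 0/1 outcomes.
    Lower is better; an unwavering 50% forecast always scores 0.25."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

print(brier_score([0.5, 0.5], [1, 0]))            # 0.25
print(round(brier_score([0.9, 0.2], [1, 0]), 3))  # 0.025: confident and correct scores lower
```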

Taking Brier scores and plotting them vs log₂(predictors) we get the following.

So it looks like a linear fit shows a decreasing (improving) Brier score. We get a slope of -0.012 with a 95% confidence interval of [-0.026, 0.002].

            Estimate    95% Confidence Interval
Intercept   0.2330      [0.154, 0.312]
Slope       -0.0120     [-0.026, 0.002]

What does a coefficient of -0.012 mean? It means that for every doubling of the number of predictors, we expect a 0.012 decrease in Brier score. For example, going from 2^3 to 2^7 (8 to 128) predictors, i.e. four doublings, would be expected to decrease the Brier score by 0.048. To frame it in terms of prediction discrimination, that is the equivalent of going from a perfectly calibrated 73% prediction to a perfectly calibrated 82% prediction. To give some context, this approximately 25% decrease in Brier score is similar to the 23% by which Tetlock claimed teams of superforecasters outperformed individual superforecasters [3].
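The conversion above relies on the fact that a perfectly calibrated forecast at probability p has an expected Brier score of p(1-p), since the outcome is 1 with probability p:

```python
def calibrated_brier(p):
    """Expected Brier score of a perfectly calibrated forecast at probability p:
    E = p*(1-p)**2 + (1-p)*p**2 = p*(1-p)."""
    return p * (1 - p)

print(round(calibrated_brier(0.73), 4))  # 0.1971
print(round(calibrated_brier(0.82), 4))  # 0.1476
# The gap, ~0.0495, matches the ~0.048 improvement expected from four doublings.
```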

And we indeed see approximately that pattern when looking at the raw numbers, grouping questions into bins by how many predictors they had:

Number of predictors (inclusive)    Avg Brier for the aggregate forecast    Number of questions in this bin
2-7                                 0.1961                                  11
8-11                                0.1956                                  14
12-15                               0.1799                                  19
16-22                               0.1981                                  64
23-31                               0.1744                                  109
32-45                               0.1697                                  173
46-63                               0.1682                                  177
64-90                               0.1717                                  144
91-127                              0.1325                                  83
128-181                             0.1497                                  31

I then decided to exclude questions for which the aggregate prediction was below 10% or above 90% when the question was ¼ of the way through its time horizon. This was in order to exclude questions with very confident predictions, since such questions might be less likely than other questions to see significant changes in the prediction, which would make it harder to detect any possible benefit of a question receiving additional predictors. When such questions are excluded, we see a somewhat larger, though noisy, effect:

Number of predictors (inclusive)    Avg Brier for the aggregate forecast    Number of questions in this bin
2-7                                 0.2157                                  10
8-11                                0.2375                                  10
12-15                               0.1894                                  18
16-22                               0.2044                                  62
23-31                               0.1842                                  98
32-45                               0.1777                                  127
46-63                               0.1742                                  116
64-90                               0.2255                                  60
91-127                              0.1475                                  34
128-181                             0.1515                                  18

It looks like the clearest differences here are between the first two buckets (<12 predictors) and those after, and between the last two buckets (>90 predictors) and those before, though the data is noisy enough that this could potentially be an artefact.

Within-question analysis

We are vulnerable here to confounding effects: what if questions which garnered more predictions also tended to be easier or more difficult to forecast than those which garnered fewer? In that case, we would get a false picture of the relationship between predictor numbers and accuracy.

One way to potentially get around this is to look at how the Brier score of the aggregate forecast on a given question changes as that question gains more individual predictors. I looked at every question which ended up with over 90 unique predictors and which had an aggregate forecast ¼ of the way through the question lifetime within the 10%-90% range, and took their Brier scores at the point where they had N unique predictors, for various N. Unfortunately Metaculus only keeps a limited subset of historic prediction values, so our sample size here was reduced to 30, and I had to drop the N=2 and N=4 fields as there were no values for many questions. I also threw out one question which clearly received a meaningful informational update during the period we are looking at, though this doesn't mean I caught all such instances.

We see the following:

Predictions    Brier
8              0.1976
16             0.1779
32             0.1926
64             0.1801
90             0.1734


If instead we look at questions with between 32 and 90 predictors, we get the following (sample size: 163; here I could bring back the smaller values of N):

Predictions    Brier
2              0.2357
4              0.2323
8              0.1968
16             0.1932
32             0.1787


So it seems like the strongest effect comes from going from very few to 10+ predictors. But this data may exaggerate the effect of an increased number of predictors, for reasons including that the higher predictor counts occurred later in the question's lifetime than the lower counts, and new information may have come to light in the interim for some questions. It doesn't clearly contradict our earlier finding that predictions do not improve a great deal after the first ~11 predictors.

Why might this be? I can think of a couple of plausible reasons:

  • Perhaps many predictions are relatively easy, and settling on a good enough answer is something which can be done with few predictors

  • Perhaps later Metaculus predictors are largely herding, and contribute little to the overall estimate accuracy

How might we test these hypotheses?

For one thing, we can look at how much the overall prediction changes from the early predictions to the later ones. If the aggregate prediction remains fairly stable, that's some evidence for both the 'predictions are easy' hypothesis and the herding hypothesis, whereas if it isn't very stable, that would be evidence against both hypotheses. We can also see if the interquartile range narrows over time from the early predictions to the later ones; if so, that is some evidence for the herding hypothesis (since if a bunch of near-median predictions are made, that should raise the 25th percentile prediction and lower the 75th percentile).

Looking more closely, it doesn't seem like predictions are very stable, though they are slightly more stable with respect to log(predictors) as time goes on. The mean (median) prediction moves 11.61 (8) percentage points between the time the question has 2 predictors and the time it has 4, and this goes to 9.43 (6) percentage points as the question goes from 4 to 8 predictors, 7.22 (6) points from 8 to 16 predictors, and 7.42 (5) points from 16 to 32 predictors. There was no autocorrelation between the moves [4].

An example for illustration:

This question had a median of 50% at 2 predictors, then 40% at 4, 50% at 8, 55% at 16, and 65% at 32 predictors. It hit 16 predictors within a day of opening in Jan 2019, but didn't hit 32 until 4 days later.
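The percentage-point moves in that example can be computed directly:

```python
def milestone_moves(medians):
    """Absolute change, in percentage points, between consecutive
    snapshots of a question's community median."""
    return [round(abs(b - a) * 100, 1) for a, b in zip(medians, medians[1:])]

# Medians at 2, 4, 8, 16 and 32 predictors for the example question:
print(milestone_moves([0.50, 0.40, 0.50, 0.55, 0.65]))  # [10.0, 10.0, 5.0, 10.0]
```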

Interquartile ranges [5] also do not vary significantly at different points, suggesting there is not much obvious herding going on, or at least that the probability a marginal predictor will be herding is not a function of how many predictors have already predicted on the question.

Predictions    Mean IQR width
2              0.1863
4              0.2223
8              0.2108
16             0.2052
32             0.2087
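The IQR width at any snapshot can be computed from the individual predictions; Python's `statistics.quantiles` gives the quartile cut points:

```python
from statistics import quantiles

def iqr_width(preds):
    """Width of the interquartile range of individual predictions
    (needs at least two data points)."""
    q1, _, q3 = quantiles(preds, n=4)  # 25th, 50th, 75th percentile cut points
    return q3 - q1

# Evenly spread predictions give a wide IQR; clustered ones a narrow IQR.
print(round(iqr_width([0.2, 0.4, 0.6, 0.8]), 3))    # 0.5
print(round(iqr_width([0.45, 0.5, 0.5, 0.55]), 3))  # 0.075
```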


I don't think this analysis supported either of my hypotheses, and I think this should make us somewhat more confident that more predictors are expected to improve the aggregate forecasts, though to what degree probably depends on the difficulty of the question.

Further research ideas

One could attempt to study performance across continuous questions to see if these conclusions hold there too. Doing this would require more access to historic data on distributions for Metaculus' questions than is currently possible with Metaculus' public API.

If one had access to the individual predictions, one could also try to take 1000 random bootstrap samples of size 1 of all the predictions, then 1000 random bootstrap samples of size 2, and so on and measure how accuracy changes with larger random samples. This might also be possible with data from other prediction sites.
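The bootstrap idea can be sketched as follows; the individual predictions here are made up for illustration, since Metaculus does not expose them:

```python
import random
from statistics import median

def bootstrap_brier(predictions, outcome, k, n_samples=1000, seed=0):
    """Average Brier score of the median of k predictions drawn with
    replacement from one question's individual predictions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        sample = [rng.choice(predictions) for _ in range(k)]
        total += (median(sample) - outcome) ** 2
    return total / n_samples

# Hypothetical predictions on a question that resolved Yes (outcome = 1);
# accuracy of the sampled median should tend to improve as k grows.
preds = [0.4, 0.55, 0.6, 0.65, 0.7, 0.7, 0.75, 0.8]
for k in (1, 2, 4, 8):
    print(k, round(bootstrap_brier(preds, 1, k), 3))
```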

Other factors, such as comments on questions from people sharing their reasoning or information sources, could affect aggregate prediction accuracy on a question. One could look into this.

Footnotes

[1] But note that a question having received a low number of forecasts is only one of several reasons why one might not trust the aggregate prediction on that question.

[2] I looked at all the questions on Metaculus from 2019 and 2020, and the median number of unique predictors on a binary question was 75, vs 38 for a continuous one. The mean was 97 vs 46. There were 942 continuous questions over the time window and 727 binary questions. This does not control for the questions being different; for example, perhaps the average binary question is more interesting? Metaculus also has an "interest" feature, where people can upvote questions they find interesting. This allowed me to look at the number of predictors per "interested" user. This was 5.52 for binary questions and 4.27 for continuous questions, a difference of 1.3x, suggesting that the 2x difference is probably in large part down to the binary questions being more interesting, though what makes a question interesting here is up to individual users.
[3] Tetlock, Superforecasting: The Art and Science of Prediction, 2015

[4] That is, the direction of an update did not predict the direction of the next update.

[5] The interquartile range (IQR) for a binary question is the range [X,Y] where ¼ of predictors gave a lower probability than X and ¼ of predictors gave a higher probability than Y on the question.

This post is a project of Rethink Priorities.

It was written by Charles Dillon, a volunteer for Rethink Priorities. Thanks to Michael Aird, David Rhys Bernard, Linch Zhang, and Peter Wildeford for comments and feedback on this post. If you like our work, please consider subscribing to our newsletter. You can see all our work to date here.