Exploring Metaculus’ community predictions

Disclaimer: this is not a project from Arb Research.

Summary

  • I really like Metaculus!

  • I have collected and analysed in this Sheet (see tab “TOC”) metrics about Metaculus’ questions outside of question groups, and the respective Metaculus community predictions. The Colab to extract the data and calculate the metrics is here.

  • The mean metrics vary a lot across categories, and the same is seemingly true for correlations among metrics. So one should not assume the performance across all questions is representative of that within each of Metaculus’ categories. To illustrate:

    • Across categories, the 5th and 95th percentiles of the mean normalised outcome are 0 and 0.669[1], and of the mean Brier score are 0.0367 and 0.300. For context, the Brier score is 0.25 (= 0.5^2) for the maximally uncertain probability of 0.5.

    • According to Metaculus’ track record page, the mean Brier score for Metaculus’ community predictions evaluated at all times is 0.126 for all questions, but 0.237 for those of the category of artificial intelligence. So Metaculus’ community predictions about probabilities[2] look good in general, but they perform close to random predictions for the category of artificial intelligence. However, note there are other categories with questions about artificial intelligence, like AI and machine learning.

  • There can be significant differences between Metaculus community predictions and Metaculus’ predictions. For instance, the mean Brier score of the latter for the category of artificial intelligence is 0.168, which is way more accurate than the 0.237 of the former.

  • According to my results, Metaculus’ community predictions are:

    • In general (i.e. considering all questions), less accurate for questions:

      • Whose predictions are more extreme under Bayesian updating (correlation coefficient R = 0.346, and p-value p = 0[3]).

      • With a greater amount of updating (R = 0.262, and p = 0).

      • With a greater difference between amount of updating and uncertainty reduction (R = 0.256, and p = 0).

    • For the category of artificial intelligence, less accurate for questions with:

      • Greater difference between amount of updating and uncertainty reduction (R = 0.361, and p = 0.0387).

      • More predictions (R = 0.316, and p = 0.0729).

      • A greater amount of updating (R = 0.282, and p = 0.111).

    • Compatible with Bayesian updating in general, in the sense that I failed to reject it for predictions during the 2nd half of the period during which each question was or has been open (mean p-value of 0.425).

  • If you want to know how much to trust a given prediction from Metaculus, I think it is sensible to check Metaculus’ track record for similar past questions (more here).

Acknowledgements

Thanks to Charles Dillon, Misha Yagudin from Arb Research, Peter Mühlbacher, and Ryan Beck.

Dark crystal ball in a bright foggy galaxy. Generated by OpenAI’s DALL-E.

Introduction

I really like Metaculus!

Methods

I believe it would be important to better understand how much to trust Metaculus’ predictions. To that end, I have determined in this Sheet (see tab “TOC”) metrics about all Metaculus’ questions outside of question groups with an ID from 1 to 15000 on 13 March 2023[4], and the respective Metaculus community predictions. The metrics for each question are:

  • Tags, which identify the Metaculus categories.

  • Publish time (year).

  • Close time (year).

  • Resolve time (year).

  • Time from publish to close (year).

  • Time from close to resolve (year).

  • Time from publish to resolve (year).

  • Number of forecasters.

  • Number of predictions.

  • Number of analysed dates, which is the number of instances at which the predictions were assessed.

  • Total belief movement, which is a measure of the amount of updating, and is the sum of the belief movements, which are the squared differences between 2 consecutive beliefs.

    • The values of the beliefs range from 0 to 1, and can refer to a:

      • Probability.

      • Ratio between (i) the difference between an expectation and the minimum allowed by Metaculus, and (ii) the difference between the maximum and minimum allowed by Metaculus.

    • To illustrate, the belief movement from a probability of 0.5 to 0.8 is 0.09 (= (0.8 − 0.5)^2).

  • Total uncertainty reduction, which is the difference between the initial and final uncertainties, where the uncertainty linked to a belief value p equals p (1 − p). This is 0 for probabilities of 0 and 1, and maximal (0.25) for a probability of 0.5.

  • Total excess belief movement, which is the difference between the total belief movement and total uncertainty reduction.

  • Normalised excess belief movement, which is the ratio between the total belief movement and total uncertainty reduction.

  • Absolute value of normalised excess belief movement.

  • Z-score for the null hypothesis that the beliefs are Bayesian.

  • P-value for the null hypothesis that the beliefs are Bayesian.

  • Normalised outcome, which is, for questions about:

    • Probabilities, 0 if the question resolves as “no”, and 1 if as “yes”.

    • Expectations, the ratio between (i) the difference between the outcome and the minimum allowed by Metaculus, and (ii) the difference between the maximum and minimum allowed by Metaculus.

  • Brier score, which is the mean squared difference between the predicted probability and outcome (0 or 1). Note the Brier score does not apply to questions about expectations, whose accuracy I did not assess.

Augenblick 2021 shows that, under Bayesian updating, the total belief movement should match the total uncertainty reduction in expectation (see “Proposition 1”), in which case the total excess belief movement should be 0, and the normalised excess belief movement should be 1. I suppose Metaculus’ community predictions are less reliable early on. So, for the metrics regarding belief movement and uncertainty reduction, I only analysed predictions concerning the 2nd half of the period during which each question was or has been open.
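As a concrete sketch of these definitions, the snippet below computes the movement-related metrics from a series of belief values. The function names are mine, not from the Colab, and the example series is made up:

```python
def total_belief_movement(beliefs):
    # Sum of squared differences between consecutive beliefs (values in [0, 1]).
    return sum((b1 - b0) ** 2 for b0, b1 in zip(beliefs, beliefs[1:]))

def uncertainty(p):
    # Uncertainty linked to a belief value p: 0 at p = 0 or 1, maximal (0.25) at p = 0.5.
    return p * (1 - p)

def total_uncertainty_reduction(beliefs):
    # Difference between the initial and final uncertainties.
    return uncertainty(beliefs[0]) - uncertainty(beliefs[-1])

def excess_belief_movement(beliefs):
    # Should be 0 in expectation under Bayesian updating (Augenblick 2021).
    return total_belief_movement(beliefs) - total_uncertainty_reduction(beliefs)

def normalised_excess_belief_movement(beliefs):
    # Ratio of movement to uncertainty reduction; should be 1 under Bayesian updating.
    return total_belief_movement(beliefs) / total_uncertainty_reduction(beliefs)

beliefs = [0.5, 0.8, 0.9]
# Movement: (0.8 - 0.5)^2 + (0.9 - 0.8)^2 = 0.09 + 0.01 = 0.10.
# Uncertainty reduction: 0.5*0.5 - 0.9*0.1 = 0.25 - 0.09 = 0.16.
# Excess movement: 0.10 - 0.16 = -0.06; normalised: 0.10/0.16 = 0.625.
```
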

The Colab to extract the data and calculate the metrics is here[5].

Results

The tables below have results for:

  • The mean, and 5th and 95th percentiles across categories of the number of questions, number of resolved questions, and mean metrics (1st table).

  • Mean metrics for all questions and those of the category of artificial intelligence (2nd table).

  • Correlations among metrics for all questions and those of the category of artificial intelligence (3rd table).

The results in the 2nd and 3rd tables for the other categories are in the Sheet.

Mean metrics

Statistics across categories:

| Metric | Mean | 5th percentile | 95th percentile |
| --- | --- | --- | --- |
| Number of questions | 64.8 | 3.00 | 179 |
| Number of resolved questions | 27.4 | 0 | 68.0 |
| Mean publish time (year) | 2020 | 2017 | 2022 |
| Mean close time (year) | 2039 | 2019 | 2077 |
| Mean resolve time (year) | 2062 | 2020 | 2161 |
| Mean time from publish to close (year) | 18.8 | 0.0530 | 56.2 |
| Mean time from close to resolve (year) | 23.0 | 2.04*10^-7 | 72.9 |
| Mean time from publish to resolve (year) | 41.8 | 0.159 | 141 |
| Mean number of forecasters | 82.1 | 23.5 | 166 |
| Mean number of predictions | 172 | 50.0 | 357 |
| Mean number of analysed dates | 86.5 | 56.4 | 104 |
| Mean total belief movement | 0.0191 | 2.15*10^-3 | 0.0461 |
| Mean total uncertainty reduction | 0.0130 | -0.0108 | 0.0491 |
| Mean total excess belief movement | 6.10*10^-3 | -0.0253 | 0.0394 |
| Mean normalised excess belief movement | -43.6 | -7.09 | 7.77 |
| Mean absolute value of normalised excess belief movement | 49.0 | 0.213 | 18.5 |
| Mean z-score for the null hypothesis that the beliefs are Bayesian | 0.103 | -0.711 | 0.811 |
| Mean p-value for the null hypothesis that the beliefs are Bayesian | 0.456 | 0.306 | 0.638 |
| Mean normalised outcome | 0.328 | 0 | 0.669 |
| Mean Brier score | 0.162 | 0.0367 | 0.300 |

Mean metrics by category:

| Metric | Any | Artificial intelligence |
| --- | --- | --- |
| Number of questions | 5,335 | 199 |
| Number of resolved questions | 2,337 | 50 |
| Mean publish time (year) | 2021 | 2020 |
| Mean close time (year) | 2036 | 2043 |
| Mean resolve time (year) | 2048 | 2050 |
| Mean time from publish to close (year) | 15.3 | 22.9 |
| Mean time from close to resolve (year) | 12.2 | 7.07 |
| Mean time from publish to resolve (year) | 27.6 | 30.0 |
| Mean number of forecasters | 88.2 | 104.5 |
| Mean number of predictions | 206 | 200 |
| Mean number of analysed dates | 90.4 | 91.0 |
| Mean total belief movement | 0.0238 | 0.0219 |
| Mean total uncertainty reduction | 0.0191 | 0.0144 |
| Mean total excess belief movement | 4.70*10^-3 | 7.53*10^-3 |
| Mean normalised excess belief movement | -43.1 | -3.92 |
| Mean absolute value of normalised excess belief movement | 47.2 | 5.52 |
| Mean z-score for the null hypothesis that the beliefs are Bayesian | -6.78*10^-3 | 0.105 |
| Mean p-value for the null hypothesis that the beliefs are Bayesian | 0.425 | 0.413 |
| Mean normalised outcome | 0.365 | 0.381 |
| Mean Brier score | 0.151 | 0.230 |

Correlations among metrics

R is the correlation coefficient, and the p-value is for the null hypothesis that there is no correlation[3].

| Correlation between Brier score and... | Any (N = 1,374): R | Any: p-value | Artificial intelligence (N = 33): R | Artificial intelligence: p-value |
| --- | --- | --- | --- | --- |
| Publish time (year) | -0.143 | 9.82*10^-8 | 0.179 | 0.319 |
| Close time (year) | -0.117 | 1.40*10^-5 | 0.172 | 0.339 |
| Resolve time (year) | -0.146 | 5.68*10^-8 | 0.184 | 0.305 |
| Time from publish to close (year) | 0.0319 | 0.238 | 7.82*10^-3 | 0.966 |
| Time from close to resolve (year) | -0.0193 | 0.476 | 0.0341 | 0.850 |
| Time from publish to resolve (year) | 0.0102 | 0.705 | 0.0318 | 0.861 |
| Number of forecasters | -0.0776 | 4.02*10^-3 | 0.0680 | 0.707 |
| Number of predictions | -0.0366 | 0.175 | 0.316 | 0.0729 |
| Number of analysed dates | -0.107 | 6.57*10^-5 | 0.198 | 0.270 |
| Total belief movement | 0.262 | 0 | 0.282 | 0.111 |
| Total uncertainty reduction | -0.136 | 4.61*10^-7 | -0.150 | 0.405 |
| Total excess belief movement | 0.256 | 0 | 0.361 | 0.0387 |
| Normalised excess belief movement | -4.63*10^-3 | 0.864 | 0.0708 | 0.695 |
| Absolute value of normalised excess belief movement | 0.0893 | 9.17*10^-4 | 0.110 | 0.542 |
| Z-score for the null hypothesis that the beliefs are Bayesian | 0.346 | 0 | 0.241 | 0.176 |
| P-value for the null hypothesis that the beliefs are Bayesian | 0.0296 | 0.273 | -0.0269 | 0.882 |
| Normalised outcome | 0.102 | 1.60*10^-4 | 0.112 | 0.535 |

Discussion

Mean metrics

The mean metrics vary a lot across categories. For example, the 5th and 95th percentiles of the mean normalised outcome are 0 and 0.669, and of the mean Brier score are 0.0367 and 0.300.

I computed mean normalised excess belief movements of −43.1 and −3.92 for all questions and those of the category of artificial intelligence, but these are not statistically significant, as the mean p-values are 0.425 and 0.413. So it is not possible to reject Bayesian updating for Metaculus’ community predictions during the 2nd half of the period during which each question was or has been open. To contextualise, Table III of Augenblick 2021 presents normalised excess belief movements pretty close to 1 (and the p-values for the null hypothesis of Bayesian updating are all lower than 0.001):

  • 1.20 for “a large data set, provided by and explored previously in Mellers (2014) and Moore (2017), that tracks individual probabilistic beliefs over an extended period of time”.

  • 0.931 for “predictions of a popular baseball statistics website called Fangraphs”.

  • 1.046 for “Betfair, a large British prediction market that matches individuals who wish to make opposing financial bets about a binary event”.

I estimated mean normalised outcomes of 0.365 and 0.381 for all questions and those of the category of artificial intelligence. If we assume these values apply to questions about both probabilities and expectations:

  • The likelihood of a question about probabilities resolving as “yes” is 36.5 % for all questions, and 38.1 % for those of the category of artificial intelligence.

  • The outcome of a question about expectations is expected to equal the allowed minimum plus 36.5 % of the distance between the allowed minimum and maximum for all questions, and 38.1 % for those of the category of artificial intelligence.
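Under this reading, the conversion back to the raw scale is just a linear interpolation between the allowed minimum and maximum. A minimal sketch (the question range in the example is made up, not from any real Metaculus question):

```python
def denormalise(normalised_outcome, allowed_min, allowed_max):
    # Invert the normalisation: outcome = min + normalised outcome * (max - min).
    return allowed_min + normalised_outcome * (allowed_max - allowed_min)

# Hypothetical expectation question with an allowed range of 0 to 200:
# a mean normalised outcome of 0.381 corresponds to a raw outcome of 76.2.
denormalise(0.381, 0, 200)
```
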

I got mean Brier scores of 0.151 and 0.230 for all questions and those of the category of artificial intelligence, which are 19.5 % higher and 2.86 % lower than the mean Brier scores of 0.126 and 0.237 shown in Metaculus’ track record page[6]. I believe the differences are explained by my results:

  • Excluding group questions.

  • Approximating the mean Brier score based on a set of dates which covers the whole lifetime of the question (in uniform time steps[7]), but does not encompass all community predictions[8].

I think the 1st of these considerations is much more important than the 2nd. The category of artificial intelligence does not include probabilistic group questions, so it is only affected by the 2nd consideration, and the discrepancy is much smaller than for all questions (2.86 % < 19.5 %).

In any case, according to Metaculus’ track record page, Metaculus’ community predictions for questions of the category of artificial intelligence perform close to randomly, as 0.237 is pretty close to 0.25. However, Metaculus’ predictions and postdictions[9] for the same category perform considerably better, with mean Brier scores of 0.168 and 0.146. These are also lower than the mean Brier score of 0.232 achieved for predictions matching the mean outcome of 0.365[10] for probabilistic questions of the category of artificial intelligence[11]. In addition, I should note Metaculus’ predictions for the category of AI and machine learning have a mean Brier score of 0.149 (< 0.168).
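The 0.232 benchmark is the expected Brier score of always predicting the base rate p when “yes” occurs with frequency p, which simplifies to p (1 − p). A quick check of the arithmetic in footnote 11 (my own helper function, not from the Colab):

```python
def base_rate_brier(p):
    # Expected Brier score of always predicting p when "yes" occurs with frequency p:
    # p * (1 - p)^2 + (1 - p) * p^2, which simplifies to p * (1 - p).
    return p * (1 - p) ** 2 + (1 - p) * p ** 2

base_rate_brier(0.365)  # ~0.232, matching footnote 11
base_rate_brier(0.5)    # 0.25, the Brier score of the maximally uncertain prediction
```
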

In contrast, among all questions, the mean Brier score of Metaculus’ community predictions of 0.126 is similar to that of 0.120 for Metaculus’ predictions. So, overall, Metaculus’ community predictions perform roughly as well as Metaculus’ predictions, although there can be important differences between them within categories, as illustrated above for the category of artificial intelligence.

It would also be nice to see the mean accuracy of the predictions of questions about expectations, but I have not done that here.

Correlations among metrics

The 3 metrics which correlate most strongly with the Brier score are, listed by descending strength of the correlation (correlation coefficient; p-value):

  • For all questions:

    • Z-score for the null hypothesis that the beliefs are Bayesian (0.346; 0), i.e. predictions are less accurate (higher Brier score) for questions whose predictions are more extreme under Bayesian updating.

    • Total belief movement (0.262; 0), i.e. predictions are less accurate for questions with a greater amount of updating. This is surprising, as one would expect predictions to converge to the truth as they are updated.

    • Total excess belief movement (0.256; 0), i.e. predictions are less accurate for questions with greater difference between amount of updating and uncertainty reduction.

  • For the category of artificial intelligence:

    • Total excess belief movement (0.361; 0.0387), i.e. predictions are less accurate for questions with greater difference between amount of updating and uncertainty reduction.

    • Number of predictions (0.316; 0.0729), i.e. predictions are less accurate for questions with more predictions. Maybe more popular questions attract worse forecasters?

    • Total belief movement (0.282; 0.111), i.e. predictions are less accurate for questions with a greater amount of updating. This is surprising, but connected to the correlation above. The community prediction moves each time a new prediction is made.

The correlations with the normalised excess belief movement are weak (correlation coefficients of −4.63*10^-3 and 0.0708), and not statistically significant (p-values of 0.864 and 0.695). So it is not possible to reject the null hypothesis that there is no correlation between accuracy and Bayesian updating, and in any case the correlations I obtained are quite weak.
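I assume the correlation coefficients above are Pearson’s R (as computed by Sheets’ CORREL function). A dependency-free sketch of the definition:

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product of the
    # (unnormalised) standard deviations; the 1/n factors cancel out.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related series have R = 1 (up to floating point).
pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```
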

Comparing the correlations for all questions with those of the category of artificial intelligence shows one should not extrapolate the results from all questions to each of the categories. The signs of the correlations differ for 52.9 % (= 9/17) of the metrics, although some of those of the category of artificial intelligence are not statistically significant. I guess the same applies to other categories. Feel free to check the correlations among metrics for each of the categories in tab “Correlations among metrics within categories”, selecting the category in the drop-down at the top.

Finally, correlations with accuracy for questions about expectations may differ from the ones I have discussed above for ones about probabilities.

My recommendation on how to use Metaculus

If you want to know how much to trust a given prediction from Metaculus, I think it is sensible to check Metaculus’ track record for similar past questions, filtering by:

  • The type of prediction you are seeing, either Metaculus’ community prediction or Metaculus’ prediction.

  • The categories to which that question belongs (often more than one). The relevant menus show up when you click on “Show Filter”.

  • The type of question. If it is about:

    • Probabilities, select “Brier score” or “Log score (discrete)”. I think the latter is especially important if small differences in probabilities close to 0 or 1 matter for your purpose.

    • Expectations, select “Log score (continuous)”.

  • The time which most closely matches your conditions. To do this, you can select “other time…” after clicking on the dropdown after “evaluated at”.

    • This is relevant because, even if the track record as evaluated at “all times” is good, it may not be so early in the question lifetime.

    • The “other time” can be defined as a fraction of the question lifetime, or time before resolution.

I am glad Metaculus has made available all these options, and I really appreciate the transparency!

  1. ^

    I define the normalised outcome of questions about expectations such that it ranges from 0 to 1, so that its lower and upper bounds match the possible outcomes (0 and 1) of questions about probabilities.

  2. ^

    The Brier score does not apply to expectations.

  3. ^

    All p-values of 0 I present here are actually positive, but are so small they were rounded to 0 in Sheets.

  4. ^

    The pages of Metaculus’ questions have the format “https://www.metaculus.com/questions/ID/”.

  5. ^

    The running time is about 20 min.

  6. ^

    To see the 1st of these Brier scores, you have to select “Brier score”, for the “community prediction”, evaluated at “all times”. To see the 2nd, you have to additionally click on “Show filter”, and select “Artificial intelligence” below “Categories include”.

  7. ^

    Metaculus considers all predictions, which are not uniformly distributed in time (unlike the ones I retrieved), and therefore have different weights in the mean Brier score.

  8. ^

    The mean number of analysed dates is 43.9 % (= 90.4/206) of the mean number of predictions.

  9. ^

    From here, Metaculus’ postdictions refer to “what our [Metaculus’] current algorithm would have predicted if it and its calibration data were available at the question’s close”.

  10. ^

    Mean of column T of tab “Metrics by question” for the questions of the category of artificial intelligence with normalised outcome of 0 or 1.

  11. ^

    0.232 = 0.365*(1 − 0.365)^2 + (1 − 0.365)*(0.365)^2.
