My current best guess on how to aggregate forecasts

Jaime SevillaOct 6, 2021, 8:33 AM

39 points

In short: I still recommend using the geometric mean of odds as the default aggregation method, but I give my best guess on exceptions to this rule.

Since writing about how the geometric mean of odds compares to other forecasting aggregation methods I have received many comments asking for a more nuanced approach to choosing how to aggregate forecasts. I do not yet have a full answer to this question, but here I am going to outline my current best guess to help people with their research and give a chance to commenters to prove me wrong.

In short, here is my current best guess in the form of a flowchart:

Flowchart to choose a forecast aggregation method. Start from the yellow box and follow the arrows. You can right click the image and select “open image in new tab” for a bigger version.

Some explanations are in order:

I currently believe that the geometric mean of odds should be the default option for aggregating forecasts. In the two large scale empirical evaluations I am aware of [1] [2], it surpasses the mean of probabilities and the median (*). It is also the only method that makes the group aggregate behave as a Bayesian, and (in my opinion) it behaves well with extreme predictions.
If you are not aggregating all-considered views of experts, but rather aggregating models with mutually exclusive assumptions, use the mean of probabilities. For example, this will come up if you first compute your timelines for Transformative AI assuming it will be derived from transformer-like methods, and then assuming it will come from emulated beings, etc. In this case, $P (Event) = P (A s s u m p t i o n_{1}) P (E v e n t | A s s u m p t i o n_{1}) +$ $P (A s s u m p t i o n_{2}) P (E v e n t | A s s u m p t i o n_{2}) + . . .$ .
When the data includes poorly calibrated outliers, if it’s possible exclude them and take the geometric mean. If not, we should use a pooling method resistant to outliers. The median is one such popular aggregation method.
If there is a known bias in the community of predictors you are polling for predicting positive resolution of binary questions, you can consider correcting for this. One correction that worked on metaculus data is taking the geometric mean of the probabilities (this pulls the aggregate towards zero compared to the geometric mean of odds). Better corrections are likely to exist. [EDIT: I rerun the test recently and the geo mean of probabilities does no longer outcompete the geo mean of odds. Consequently, I no longer endorse using the geometric mean of probabilities]
If there is a track record of underconfidence in past aggregate predictions from the community, consider extremizing the final outcome. This has been common practice in academia for a while. For example, (Satopää et al, 2014) have found good performance using extremized logodds. To choose an extremizing factor I suggest experimenting with what extremizing factors would have given you good performance in past predictions from the same community (EDIT: Simon M lays out a case against extremizing, EDIT 2: I explain here a more robust way of choosing an extremizing factor).
Lastly, it seems that empirically the weighting you use for the predictions matters much more than the aggregation method. I do not have yet great recommendations on how to do weighting, but it seems that weighting by recency of the prediction and by track record of the predictor works well at least in some cases.

There are reasons to believe I will have a better idea of which aggregation methods work best in a given context in a year. For example, it is not clear to me how to detect and deal with outliers, none of the current aggregation methods give consistent answers when annualizing probabilities and there is a huge unexplored space of aggregation functions that we might tap into with machine learning methods.

In conclusion and repeating myself: for the time being I would recommend people to stick to the geometric mean of odds as a default aggregate. I also encourage emphasizing the 10%, 50% and 90% percentile of predictions as well as the number of predictions to summarize the spread.

If you have a good pulse on a problem with the data, above I suggest some solutions you can try. But beware applying them blindly and choosing the outcome you like best.

Acknowledgements:

Thanks to Simon M for his analysis of aggregation and weighting methods on Metaculus data. I thank Eric Neyman, Ozzie Gooen and Peter Wildeford for discussion and feedback.

Footnotes

(*) (Seaver, 1978) also performs different experiments comparing different pooling methods, and founds similar performance between the mean of probabilities and geometric mean of odds. However I suspect this is because the aggregated probabilities in the experiments were in a range where both methods give similar results.

What links here?

Jaime SevillaOct 6, 2021, 8:33 AM

39 points

19 comments3 min readEA link

Models Forecasting

eca Oct 6, 2021, 1:52 PM
10 points
0 ∶ 0

I wonder how these compare with fitting a Beta distribution and using one of its statistics? I’m imagining treating each forecast (assuming they are probabilities) as an observation, and maximizing the Beta likelihood. The resulting Beta is your best guess distribution over the forecasted variable.

It would be nice to have an aggregation method which gave you info about the spread of the aggregated forecast, which would be straightforward here.
- Simon_M Oct 6, 2021, 5:00 PM
  23 points
  0 ∶ 0
  Parent
  
  It’s not clear to me that “fitting a Beta distribution and using one of it’s statistics” is different from just taking the mean of the probabilities.
  I fitting a beta distribution to Metaculus forecasts and looked at:
  - Median forecast
  - Mean forecast
  - Mean log-odds / Geometric mean of odds
  - Fitted beta median
  - Fitted beta mean
  Scattering these 5 values against each other I get:
  We can see fitted values are closely aligned with the mean and mean-log-odds, but not with the median. (Unsurprising when you consider the ~parametric formula for the mean / median).
  The performance is as follows:
  brier log_score questions
  geo_mean_odds_weighted 0.116 0.37 856
  beta_median_weighted 0.118 0.378 856
  median_weighted 0.121 0.38 856
  mean_weighted 0.122 0.391 856
  beta_mean_weighted 0.123 0.396 856
  My intuition for what is going on here is that the beta-median is an extremized form of the beta-mean / mean, which is an improvement
  Looking more recently (as the community became more calibrated), the beta-median’s performance edge seems to have reduced:
  brier log_score questions
  geo_mean_odds_weighted 0.09 0.29 330
  median_weighted 0.091 0.294 330
  beta_median_weighted 0.091 0.297 330
  mean_weighted 0.094 0.31 330
  beta_mean_weighted 0.095 0.314 330
  What links here?
  - Jaime Sevilla's comment on When pooling forecasts, use the geometric mean of odds by Jaime Sevilla (Oct 22, 2023, 1:18 PM; 4 points)
  - Vasco Grilo🔸's comment on Intermediate Report on Abrupt Sunlight Reduction Scenarios by Stan Pinsent (Oct 31, 2023, 11:24 AM; 3 points)
- Jaime Sevilla Oct 6, 2021, 4:23 PM
  7 points
  0 ∶ 0
  Parent
  
  Hmm good question.
  For a quick foray into this we can see what would happen if we use our estimate the mean of the max likelihood beta distribution implied by the sample of forecasts $p_{1}, . . ., p_{N}$ .
  The log-likelihood to maximize is then
  $log L (α, β) = (α - 1) \sum_{i} log p_{i} + (β - 1) \sum_{i} log (1 - p_{i}) - N log B (α, β)$
  The wikipedia article on the Beta distribution discusses this maximization problem in depth, pointing out that albeit no closed form exists if $α$ and $β$ can be assumed to be not too small the max likelihood estimate can be approximated as $^α \approx \frac{1}{2} + \frac{{^G}_{X}}{2 (1 - {^G}_{X} - {^G}_{1 - X})}$ and $^β \approx \frac{1}{2} + \frac{{^G}_{1 - X}}{2 (1 - {^G}_{X} - {^G}_{1 - X})}$ , where $G_{X} = \prod_{i} p_{i}^{1 / N}$ and $G_{1 - X} = \prod_{i} (1 - p_{i})^{1 / N}$ .
  The mean of a beta with these max likelihood parameters is $\frac{^α}{^α +^β} = \frac{(1 - G_{1 - X})}{(1 - G_{X}) + (1 - G_{1 - X})}$ .
  By comparison, the geometric mean of odds estimate is:
  $p = \frac{\prod_{i = 1}^{N} p_{i}^{1 / N}}{\prod_{i = 1}^{N} p_{i}^{1 / N} + \prod_{i = 1}^{N} (1 - p_{i})^{1 / N}} = \frac{G_{X}}{G_{X} + G_{1 - X}}$
  Here are two examples of how the two methods compare aggregating five forecasts
  I originally did this to convince myself that the two aggregates were different. And they seem to be! The method seems to be close to the arithmetic mean in this example. Let’s see what happens when we extremize one of the predictions:
  We have made p3 one hundred times smaller. The geometric mean is suitable affected. The maximum likelihood beta mean stays close to the arithmetic mean, unperturbed.
  This makes me a bit less excited about this method, but I would be excited about people poking around with this method and related ones!
MichaelA🔸Oct 7, 2021, 2:03 PM
9 points
0 ∶ 0

Thanks for this post—I think this was a very useful conversation to have started (at least for my own work!), even if I’m less confident than you in some of these conclusions (both because I just feel confused and because I’ve heard other people give good-sounding arguments for other conclusions).
In the two large scale empirical evaluations I am aware of [1] [2], it surpasses the mean of probabilities and the median (*).
But it seems worth noting that in one of those cases the geometric mean of probabilities outperformed the geometric mean of odds.
You later imply that you think this is at least partly because of a specific bias among Metaculus forecasts. But I’m not sure if you think it’s fully because of that or whether that’s the right explanation (I only skimmed the linked thread). And in any case the basic fact that geometric mean of probabilities performed best in this dataset seems worth noting if you’re using performance in that dataset as evidence for some other aggregation method.
- Jaime Sevilla Oct 7, 2021, 2:56 PM
  10 points
  0 ∶ 0
  Parent
  
  Thanks for this post—I think this was a very useful conversation to have started (at least for my own work!), even if I’m less confident than you in some of these conclusions
  Thank you for your kind words! To dismiss any impression of confidence, this represents my best guesses. I am also quite confused.
  I’ve heard other people give good-sounding arguments for other conclusions
  I’d be really curious if you can dig these up!
  You later imply that you think [the geo mean of probs outperforming the geo mean of odds] is at least partly because of a specific bias among Metaculus forecasts. But I’m not sure if you think it’s fully because of that or whether that’s the right explanation
  I am confident that the geometric mean of probs outperformed the geo mean of odds because of this bias. If you change the coding of all binary questions so that True becomes False and viceversa then you are going to get worse performance that the geo mean of odds.
  This is because the geometric mean of probabilities does not map consistently predictions and their complements. With a basic example, suppose that we have $p_{1} = 0.01, p_{2} = 0.3$ . Then $\sqrt{p_{1} * p_{2}} + \sqrt{(1 - p_{1}) * (1 - p_{2})} \approx 0.89 < 1$ .
  So the geometric mean of probabilities in this sense it’s not a consistent probability—it doesn’t map the complement of probabilities to the the complement of the geometric mean as we would expect (the geometric mean of odds, the mean of probabilities and the median all satisfy this basic property).
  So I would recommend viewing the geometric mean of probabilities as a hack to adjust the geometric mean of odds down. This is also why I think better adjustments likely exist, since this isn’t a particularly well motivated adjustment. It does however seem to slighly improve Metaculus predictions, so I included it in the flowchart.
  To drill this point even more, here is what we would get if we aggregated the predictions in the last 860 resolved metaculus binary questions by mapping each prediction to their complement, taking the geo mean of probs and taking the complement again:
  The complement of the geometric mean of complement probabilities is called comp_geo_mean
  As you can see, this change (that would not affect the other aggregates) significantly weakens the geo mean of probs.
  - Nathan Young Sep 25, 2024, 9:40 AM
    2 points
    0 ∶ 0
    Parent
    
    Is there a map which doesn’t have a discontinuity at .5?
    - Jaime Sevilla Sep 26, 2024, 7:58 PM
      2 points
      0 ∶ 0
      Parent
      
      What do you mean exactly? None of these maps have a discontinuity at .5
Vasco Grilo🔸Jan 28, 2024, 2:39 PM
5 points
1 ∶ 0

Hi Jaime,
How would you aggregate forecasts for extinction risk, spanning many orders of magnitude, which are based on fitting the same data to different probability distributions? I suspect your diagram would suggest using the mean, as each distribution relies on mutually exclusive assumptions, but I am not confident. I am asking in the context of this and this analyses, where I used the median (see discussion). Feel free to share your thoughts there.
What links here?
- Vasco Grilo🔸's comment on Can a war cause human extinction? Once again, not on priors by Vasco Grilo🔸 (Jan 29, 2024, 10:34 AM; 7 points)
- Jaime Sevilla Jan 29, 2024, 1:59 PM
  7 points
  0 ∶ 0
  Parent
  
  Interesting case. I can see the intuitive case for the median.
  
  I think the mean is more appropriate—in this case, what this is telling you is that your uncertainty is dominated by the possibility of a fat tail, and the priority is ruling it out.
  
  I’d still report both for completeness sake, and to illustrate the low resilience of the guess.
  
  Very much enjoyed the posts btw
  What links here?
  - Vasco Grilo🔸's comment on Can a war cause human extinction? Once again, not on priors by Vasco Grilo🔸 (Jan 30, 2024, 1:33 AM; 3 points)
  - Vasco Grilo🔸Jan 30, 2024, 1:34 AM
    2 points
    0 ∶ 0
    Parent
    
    Thanks for the feedback! I am still standing by the median, but it would be nice to empirically investigate similar cases, and see which method performs better!
    - Ryan Greenblatt Jan 30, 2024, 1:38 AM
      3 points
      0 ∶ 0
      Parent
      
      (I think it seems in principle promising to try to setup toy models of tail prediction, but I’m unsure what the right toy model is here.)
Jaime Sevilla Oct 10, 2021, 8:45 AM
3 points
0 ∶ 0

I found some revelant discussion in the EA Forum about extremizing in footnote 5 of this post.
The aggregation algorithm was elitist, meaning that it weighted more heavily forecasters with good track-records who had updated their forecasts more often. In these slides, Tetlock describes the elitism differently: He says it gives weight to higher-IQ, more open-minded forecasters. The extremizing step pushes the aggregated judgment closer to 1 or 0, to make it more confident. The degree to which they extremize depends on how diverse and sophisticated the pool of forecasters is. The academic papers on this topic can be found here and here. Whether extremizing is a good idea is controversial; according to one expert I interviewed, more recent data suggests that the successes of the extremizing algorithm during the forecasting tournament were a fluke. After all, a priori one would expect extremizing to lead to small improvements in accuracy most of the time, but big losses in accuracy some of the time.
The post in general is quite good, and I recommend it.
Vasco Grilo🔸Feb 28, 2024, 10:58 AM
2 points
0 ∶ 0

Hi Jaime,
If you are not aggregating all-considered views of experts, but rather aggregating models with mutually exclusive assumptions, use the mean of probabilities.
Models can have more or less mutually exclusive assumptions? I guess the less they do, the more it makes sense to rely on the median or geometric mean of odds instead of the mean. In practice, how do you decide? I am asking following a discussion about how to aggregate welfare ranges.
In addition, there is not a strong distinction between all-considered views and the outputs of quantitative models, as the judgements of people are models themselves. Moreover, one should presumably prefer the all-considered views of the modellers over the models, as the former account for more information?
What links here?
- Vasco Grilo🔸's comment on Rethink Priorities’ Welfare Range Estimates by Bob Fischer (Feb 28, 2024, 12:10 PM; 4 points)
- Jaime Sevilla Mar 4, 2024, 7:46 PM
  4 points
  0 ∶ 0
  Parent
  
  (speculating) The key property you are looking for IMO is to which degree people are looking at different information when making forecasts. Models that parcel reality into neat little mutually exclusive packages are more amenable , while forecasts that obscurely aggregate information from independent sources will work better with geomeans.
  In any case, this has little bearing on aggregating welfare IMO. You may want to check out geometric rationality as an account that lends itself more to using geometric aggregation of welfare.
Vasco Grilo🔸Nov 5, 2022, 4:30 PM
1 point
0 ∶ 0

Hi Jaime,
Do you have any thoughts on the best way to aggregate forecasts of quantities which are not between 0 and 1 (e.g. global number of deaths during 2030)?
- Jaime Sevilla Nov 6, 2022, 2:36 PM
  3 points
  0 ∶ 0
  Parent
  
  Depends on whether you are aggregating distributions or point estimates.
  If you are aggregating distributions, I would follow the same procedure outlined in this post, and use the continuous version of the geometric mean of odds I outline in footnote 1 of this post.
  If you are aggregating point estimates, at this point I would use the procedure explained in this paper, which is taking a sort of extremized average. I would consider a log transform depending on the quantity you are aggregating. (though note that I have not spent as much time thinking about how to aggregate point estimates)
  - Vasco Grilo🔸Nov 6, 2022, 4:18 PM
    1 point
    0 ∶ 0
    Parent
    
    Thanks!
    I am aggregating arrays of Monte Carlo samples which have N samples each. There is a sense in which each sample is one point estimate, but for large N (I am using 10^7) I guess I can fit a distribution to each of the arrays.
    - Jaime Sevilla Nov 6, 2022, 4:39 PM
      3 points
      0 ∶ 0
      Parent
      
      Without more context, I’d say that fit a distribution to each array and then aggregate them using a weighted linear aggregate of the resulting CDFs, assigning a weight proportional to your confidence on the assumptions that produced the array.
      - Vasco Grilo🔸Nov 6, 2022, 6:02 PM
        1 point
        0 ∶ 0
        Parent
        
        Thank you. Feel free to check this for more context.

	brier	log_score	questions
geo_mean_odds_weighted	0.116	0.37	856
beta_median_weighted	0.118	0.378	856
median_weighted	0.121	0.38	856
mean_weighted	0.122	0.391	856
beta_mean_weighted	0.123	0.396	856

	brier	log_score	questions
geo_mean_odds_weighted	0.09	0.29	330
median_weighted	0.091	0.294	330
beta_median_weighted	0.091	0.297	330
mean_weighted	0.094	0.31	330
beta_mean_weighted	0.095	0.314	330