tl;dr The conclusions of this article hold up in an empirical test with Metaculus data
Looking at resolved binary Metaculus questions, I used 5 different methods to pool the community estimate (a rough sketch of these pooling rules follows the list):
Geometric mean of probabilities
Geometric mean of odds / Arithmetic mean of log-odds
Median of odds (current Metaculus forecast)
Arithmetic mean of odds
Proprietary Metaculus forecast
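As a rough illustration (a minimal sketch, not the exact script used for this analysis), the first four pooling rules could be implemented as follows, where probs is the list of individual forecasts for one question:

```python
import numpy as np

def pool(probs, method="geo_mean_odds"):
    """Pool a list of individual probability forecasts for one binary question."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)
    if method == "geo_mean_probs":
        # Geometric mean of probabilities (note: not renormalised against the complement)
        return float(np.exp(np.mean(np.log(p))))
    if method == "geo_mean_odds":
        # Geometric mean of odds, i.e. arithmetic mean of log-odds
        pooled_odds = np.exp(np.mean(np.log(p / (1 - p))))
        return float(pooled_odds / (1 + pooled_odds))
    if method == "median":
        # Median (the median of probabilities and the median of odds give the same forecast)
        return float(np.median(p))
    if method == "arith_mean":
        # Arithmetic mean of the probabilities
        return float(np.mean(p))
    raise ValueError(f"unknown method: {method}")
```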
Also, looking at two different scoring rules (Brier and log), I find the following rankings (smaller is better in my table; a minimal sketch of both scoring rules follows the ranking):
Metaculus prediction is currently the best[2]
Geometric mean of probabilities
Geometric mean of odds / Arithmetic mean of log-odds
Median
Arithmetic mean of probabilities
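And a minimal sketch of the two scoring rules (reported here so that smaller is better; the exact conventions in the analysis may differ):

```python
import numpy as np

def brier_score(pred, outcome):
    # Squared error between the forecast and the 0/1 outcome; 0 is a perfect score
    return (pred - outcome) ** 2

def log_score(pred, outcome):
    # Log score reported as a loss (negative log-likelihood), so smaller is better
    pred = np.clip(pred, 1e-6, 1 - 1e-6)
    return float(-(outcome * np.log(pred) + (1 - outcome) * np.log(1 - pred)))
```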
Another conclusion which follows from this is that weighting is much more important than how you aggregate your probabilities. Roughly speaking:
Weighting by quality of predictor beats...
Weighting by how recently they updated beats...
No weighting at all
(I also did this analysis for both weighted[1] and unweighted odds)
(Analysis on ~850 questions; predictors per question: [34, 51, 78, 122, 188] at the 10th, 25th, 50th, 75th, and 90th percentiles.)
[1] Metaculus weights its predictions by recency: each predicting player is marked with a number n (starting at 1) that orders them from oldest active prediction to newest prediction. The individual predictions are given weights w(n) ∝ e^√n and combined to form a weighted community distribution function.
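A minimal sketch of that weighting scheme (an illustration, not Metaculus code):

```python
import numpy as np

def recency_weights(num_predictors):
    """Weights w(n) proportional to e^sqrt(n), with n = 1 for the oldest active prediction."""
    n = np.arange(1, num_predictors + 1)
    w = np.exp(np.sqrt(n))
    return w / w.sum()  # normalise so the weights sum to 1
```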
[2] This doesn’t actually hold up more recently, where the Metaculus prediction has been underperforming.
META: Do you think you could edit this comment to include...
The number of questions, and aggregated predictions per question?
The information on extremized geometric mean you computed below (I think it is not receiving as much attention due to being buried in the replies)?
Possibly a code snippet to reproduce the results?
Thanks in advance!
Cool, that’s really useful to know. Can you also check how extremizing the odds with different parameters performs?
Thank you for the superb analysis!
This increases my confidence in the geo mean of the odds, and decreases my confidence in the extremization bit.
I find it very interesting that the extremized version was consistently below by a narrow margin. I wonder if this means that there is a subset of questions where it works well, and another where it underperforms.
One question / nitpick: what do you mean by geometric mean of the probabilities? If you just take the geometric mean of probabilities then you do not get a valid probability: the sum of the pooled ps and (1−p)s does not equal 1. You need to rescale them, at which point you end up with the geometric mean of odds.
Unexpected Values explains this better than I do here.
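A toy example of the rescaling point (a minimal sketch with three hypothetical forecasts): the two geometric means sum to less than 1, and rescaling them recovers exactly the geometric mean of odds.

```python
import numpy as np

probs = np.array([0.1, 0.5, 0.9])      # three hypothetical forecasts
comps = 1 - probs

g_p = np.exp(np.mean(np.log(probs)))   # geometric mean of the probabilities (~0.356)
g_q = np.exp(np.mean(np.log(comps)))   # geometric mean of the complements (~0.356)

print(g_p + g_q)                       # ~0.71 < 1, so not a valid probability pair
print(g_p / (g_p + g_q))               # rescaled: 0.5

# The geometric mean of odds gives the same 0.5 directly:
pooled_odds = np.exp(np.mean(np.log(probs / comps)))
print(pooled_odds / (1 + pooled_odds))
```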
I think it’s actually that historically the Metaculus community was underconfident (see track record here before 2020 vs after 2020).
Extremizing fixes that underconfidence, but the average predictor improving their ability also fixed that underconfidence.
Metaculus has a known bias towards questions resolving positive; or rather, Metaculus users have a known bias towards overestimating the probabilities of questions resolving positive. (Again, see the track record.) Taking a geometric mean of the probabilities of the events happening will give a number between 0 and 1 (that is, a valid probability). It will be inconsistent with the estimate you’d get if you flipped the question; HOWEVER, Metaculus users also seem to be inconsistent in that way, so I thought it was a neat way to attempt to fix that bias. I should have made it more explicit, that’s fair.
Edit: Updated for clarity based on comments below.
“but the average predictor improving their ability also fixed that underconfidence”
What do you mean by this?
Oh I see!
It is very cool that this works.
One thing that confuses me: when you take the geometric mean of probabilities you end up with p_pooled + (1−p)_pooled < 1. So the pooled probability gets slightly nudged towards 0 in comparison to what you would get with the geometric mean of odds. Doesn’t that mean that it should be less accurate, given the bias towards questions resolving positively?
What am I missing?
I mean in the past people were underconfident (so extremizing would make their predictions better). Since then they’ve stopped being underconfident. My assumption is that this is because the average predictor is now more skilled or because more predictors improves the quality of the average.
“Doesn’t that mean that it should be less accurate, given the bias towards questions resolving positively?”
The bias isn’t that more questions resolve positively than users expect. The bias is that users expect more questions to resolve positive than actually resolve positive. Shifting probabilities lower fixes this.
Basically lots of questions on Metaculus are “Will X happen?” where X is some interesting event people are talking about, but the base rate is perhaps low. People tend to overestimate the probability of X relative to what actually occurs.
I don’t get what the difference between these is.
“more questions resolve positively than users expect”
Users expect 50 to resolve positively, but actually 60 resolve positive.
“users expect more questions to resolve positive than actually resolve positive”
Users expect 50 to resolve positive, but actually 40 resolve positive.
I have now edited the original comment to be clearer?
Cheers
Gotcha!
Oh I see!
(I note these scores are very different than in the first table; I assume these were meant to be the Brier scores instead?)
Yes—copy and paste fail—now corrected
Thanks for the analysis, Simon!
I think it would be valuable to repeat it specifically for questions where there is large variance across predictions, where the choice of the aggregation method is especially relevant. Under these conditions, I suspect methods like the median or geometric mean will be even better than methods like the mean, because the latter ignore information from extremely low predictions and overweight outliers.
I was curious about why the extremized geo mean of odds didn’t seem to beat other methods. Eric Neyman suggested trying a smaller extremization factor, so I did that.
I tried an extremizing factor of 1.5, and reused your script to score the performance on recent binary questions. The result is that the extremized prediction comes on top.
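For concreteness, a minimal sketch of extremization as discussed in this thread, assuming it means raising the pooled odds to the power d (so d = 1.5 pushes the aggregate away from 50%):

```python
import numpy as np

def extremized_geo_mean_odds(probs, d=1.5):
    """Geometric mean of odds, extremized by raising the pooled odds to the power d.

    d = 1 recovers the plain geometric mean of odds; d > 1 pushes the pooled
    forecast away from 50%. (Assumed form of extremization for this sketch.)
    """
    p = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)
    pooled_odds = np.exp(np.mean(np.log(p / (1 - p)))) ** d
    return float(pooled_odds / (1 + pooled_odds))
```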
This has restored my faith in extremization. In hindsight, recommending a fixed extremization factor was silly, since the correct extremization factor is going to depend on the predictors being aggregated and the topics they are talking about.
Going forward I would recommend people who want to apply extremization to study what extremization factors would have made sense in past questions from the same community.
I talk more about this in my new post.
I think this is the wrong way to look at this.
Metaculus was way underconfident originally (prior to 2020: 22%, using their metric). Recently it has been much better calibrated (2020 to now: 4%, using their metric).
Of course if they are underconfident then extremizing will improve the forecast, but the question is what is most predictive going forward. Given that before 2020 they were 22% underconfident and more recently only 4% underconfident, it seems foolhardy to expect them to be underconfident going forward.
I would NOT advocate extremizing the Metaculus community prediction going forward.
More than this, you will ALWAYS be able to find an extremization parameter which will improve the forecasts unless they are perfectly calibrated. This will give you better predictions in hindsight but not better predictions going forward. If you have a reason to expect forecasts to be underconfident, by all means extremize them, but I think that’s a strong claim which requires strong evidence.
I get what you are saying, and I also harbor doubts about whether extremization is just pure hindsight bias or if there is something else to it.
Overall I still think it’s probably justified in cases like Metaculus to extremize based on the extremization factor that would optimize the last 100 resolved questions, and I would expect the extremized geo mean with such a factor to outperform the unextremized geo mean in the next 100 binary questions to resolve (if pressed to put a number on it, maybe ~70% confidence without thinking too much).
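Roughly, that procedure could look like the following sketch (a hypothetical helper, reusing the extremized_geo_mean_odds function sketched above; past_questions would be the last 100 resolved binary questions as (forecasts, outcome) pairs, and the log score is just one possible objective):

```python
import numpy as np

def fit_extremization_factor(past_questions, candidate_ds=np.linspace(1.0, 3.0, 41)):
    """Grid-search the extremization factor with the best mean log score on past questions.

    past_questions: list of (probs, outcome) pairs for recently resolved binary
    questions (e.g. the last 100). Any proper scoring rule could be used instead.
    """
    def log_loss(pred, outcome):
        pred = np.clip(pred, 1e-6, 1 - 1e-6)
        return -(outcome * np.log(pred) + (1 - outcome) * np.log(1 - pred))

    def mean_score(d):
        return np.mean([log_loss(extremized_geo_mean_odds(probs, d), outcome)
                        for probs, outcome in past_questions])

    return min(candidate_ds, key=mean_score)
```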
My reasoning here is something like:
There seems to be a long tradition of extremizing in the academic literature (see the reference in the post above). Though on the other hand empirical studies have been sparse, and eg Satopaa et al are cheating by choosing the extremization factor with the benefit of hindsight.
In this case I didn’t try too hard to find an extremization factor that would work, just two attempts. I didn’t need to mine for a factor that would work. But obviously we cannot generalize from just one example.
Extremizing has an intuitive meaning as accounting for the different pieces of information across experts, which gives it weight (pun not intended). On the other hand, every extra parameter in the aggregation is a chance to shoot ourselves in the foot.
Intuitively it seems like the overall confidence of a community should be roughly continuous over time? So the level of underconfidence in recent questions should be a good indicator of its confidence for the next few questions.
So overall I am not super convinced, and a big part of my argument is an appeal to authority.
Also, it seems to be the case that extremization by 1.5 also works when looking at the last 330 questions.
I’d be curious about your thoughts here. Do you think that a 1.5-extremized geo mean will outperform the unextremized geo mean in the next 100 questions? What if we choose a finetuned extremization factor that would optimize the last 100?
Looking at the rolling performance of your method (optimize on last 100 and use that to predict), median and geo mean odds, I find they have been ~indistinguishable over the last ~200 questions. If I look at the exact numbers, extremized_last_100 does win marginally, but looking at that chart I’d have a hard time saying “there’s a 70% chance it wins over the next 100 questions”. If you’re interested in betting at 70% odds I’d be interested.
“There seems to be a long tradition of extremizing in the academic literature (see the reference in the post above). Though on the other hand empirical studies have been sparse, and eg Satopaa et al are cheating by choosing the extremization factor with the benefit of hindsight.”
No offense, but the academic literature can do one.
“In this case I didn’t try too hard to find an extremization factor that would work, just two attempts. I didn’t need to mine for a factor that would work. But obviously we cannot generalize from just one example.”
Again, I don’t find this very persuasive, given what I already knew about the history of Metaculus’ underconfidence.
“Extremizing has an intuitive meaning as accounting for the different pieces of information across experts, which gives it weight (pun not intended). On the other hand, every extra parameter in the aggregation is a chance to shoot ourselves in the foot.”
I think extremizing might make sense if the other forecasts aren’t public (since then the forecasts might be slightly more independent). When the other forecasts are public, I think extremizing makes less sense. This goes doubly so when the forecasts are coming from a betting market.
“Intuitively it seems like the overall confidence of a community should be roughly continuous over time? So the level of underconfidence in recent questions should be a good indicator of its confidence for the next few questions.”
I find this the most persuasive. I think it ultimately depends on how you think people adjust for their past calibration. It’s taken the community ~5 years to reduce its underconfidence, so maybe it’ll take another 5 years. If people immediately update, I would expect this to be very unpredictable.