I was curious about why the extremized geo mean of odds didn’t seem to beat other methods. Eric Neyman suggested trying a smaller extremization factor, so I did that.
I tried an extremizing factor of 1.5, and reused your script to score the performance on recent binary questions. The result is that the extremized prediction comes out on top.
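For concreteness, here is a minimal sketch of the aggregation method in question (the forecaster probabilities are made up and the helper name is mine, not the script's):

```python
import numpy as np

def extremized_geo_mean_odds(probs, d=1.5):
    """Aggregate binary-question probabilities by taking the geometric
    mean of the odds and raising it to the extremization factor d
    (d=1 recovers the plain geometric mean of odds)."""
    probs = np.asarray(probs, dtype=float)
    odds = probs / (1 - probs)                      # probabilities -> odds
    agg_odds = np.exp(np.mean(np.log(odds))) ** d   # geo mean, then extremize
    return agg_odds / (1 + agg_odds)                # odds -> probability

# Five forecasters leaning "yes": extremizing pushes the pooled
# forecast further from 50%.
print(extremized_geo_mean_odds([0.6, 0.7, 0.65, 0.8, 0.7], d=1.0))  # ~0.69
print(extremized_geo_mean_odds([0.6, 0.7, 0.65, 0.8, 0.7], d=1.5))  # ~0.77
```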
This has restored my faith in extremization. In hindsight, recommending a fixed extremization factor was silly, since the correct extremization factor is going to depend on the predictors being aggregated and the topics they are talking about.
Going forward, I would recommend that people who want to apply extremization study which extremization factors would have made sense on past questions from the same community.

I talk more about this in my new post.
I think this is the wrong way to look at this.

Metaculus was way underconfident originally (prior to 2020: 22% underconfident, using their metric). Recently it has been much better calibrated (2020 to now: 4%, using their metric).
Of course, if they are underconfident then extremizing will improve the forecast, but the question is what is most predictive going forward. Given that they were 22% underconfident before 2020 and only 4% underconfident more recently, it seems foolhardy to expect them to be underconfident going forward.
I would NOT advocate extremizing the Metaculus community prediction going forward.
More than this, you will ALWAYS be able to find an extremization parameter which improves the forecasts, unless they are perfectly calibrated. This gives you better predictions in hindsight, but not better predictions going forward. If you have a reason to expect forecasts to be underconfident, by all means extremize them, but I think that's a strong claim which requires strong evidence.
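To see why (a toy demonstration on simulated data, not Metaculus questions): the unextremized forecast corresponds to d = 1, so any in-sample search over d that includes 1 is guaranteed to match or beat it, which is exactly why an improvement found this way is weak evidence on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated resolved questions: true probabilities and 0/1 outcomes,
# plus a noisy (but roughly calibrated) community forecast of each.
true_p = rng.uniform(0.05, 0.95, size=200)
outcomes = rng.binomial(1, true_p)
forecasts = np.clip(true_p + rng.normal(0, 0.05, size=200), 0.01, 0.99)

def extremize(p, d):
    odds = (p / (1 - p)) ** d
    return odds / (1 + odds)

def mean_log_score(p, y):
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Grid search over d. Since d = 1.0 is in the grid, the best
# in-sample score can never be worse than the unextremized one.
grid = np.linspace(0.5, 2.5, 81)
scores = [mean_log_score(extremize(forecasts, d), outcomes) for d in grid]
print(f"best in-sample d:  {grid[np.argmax(scores)]:.2f}")
print(f"log score, d=1:    {mean_log_score(forecasts, outcomes):.4f}")
print(f"log score, best d: {max(scores):.4f}")  # >= the d=1 score by construction
```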
I get what you are saying, and I also harbor doubts about whether extremization is just pure hindsight bias or whether there is something else to it.
Overall I still think it's probably justified, in cases like Metaculus, to extremize based on the extremization factor that would have optimized the last 100 resolved questions, and I would expect the extremized geo mean with such a factor to outperform the unextremized geo mean over the next 100 binary questions to resolve (if pressed to put a number on it, maybe ~70% confidence, without thinking about it too much).
My reasoning here is something like:
- There seems to be a long tradition of extremizing in the academic literature (see the reference in the post above). Though, on the other hand, empirical studies have been sparse, and e.g. Satopaa et al. are cheating by choosing the extremization factor with the benefit of hindsight.
- In this case I didn't try too hard to find an extremization factor that would work; it took just two attempts, so I didn't need to mine for one. But obviously we cannot generalize from just one example.
- Extremizing has an intuitive meaning as accounting for the different pieces of information spread across experts, which gives it some weight (pun not intended); see the worked example after this list. On the other hand, every extra parameter in the aggregation is another chance to shoot ourselves in the foot.
- Intuitively, it seems like the overall confidence of a community should be roughly continuous over time? So the level of underconfidence on recent questions should be a good indicator of its confidence on the next few questions.
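To spell out the third point (a standard derivation, under assumptions of mine: even prior odds and fully independent evidence): if each of $n$ forecasters updates the shared prior on a private piece of evidence, their reported odds $o_i$ equal the prior odds times their likelihood ratio, and with prior odds of 1 the Bayesian aggregate is

$$o_{\text{agg}} = \prod_{i=1}^{n} o_i = \left( \Big( \prod_{i=1}^{n} o_i \Big)^{1/n} \right)^{n},$$

that is, the geometric mean of odds extremized with factor $d = n$. With partially shared evidence (e.g. when forecasts are public), the right factor lies somewhere between 1 and $n$, which is why it has to be estimated rather than fixed a priori.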
So overall I am not super convinced, and a big part of my argument is an appeal to authority.
Also, it seems that extremization by 1.5 still works when looking at the last 330 questions.
I’d be curious about your thoughts here. Do you think that a 1.5-extremized geo mean will outperform the unextremized geo mean over the next 100 questions? What about an extremization factor fine-tuned to optimize the last 100?
Looking at the rolling performance of your method (optimize the factor on the last 100 questions and use that to predict the next one), the median, and the geo mean of odds, I find they have been roughly indistinguishable over the last ~200 questions. If I look at the exact numbers, extremized_last_100 does win marginally, but looking at that chart I’d have a hard time saying “there’s a 70% chance it wins over the next 100 questions”. If you want to bet at 70% odds, I’d be interested.
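For reference, here is roughly the rolling scheme I backtested (a sketch under my assumptions, not the exact script: forecasts and outcomes are chronologically ordered arrays of pooled probabilities and 0/1 resolutions, and the window and grid are illustrative):

```python
import numpy as np

def extremize(p, d):
    odds = (p / (1 - p)) ** d
    return odds / (1 + odds)

def log_scores(p, y):
    return y * np.log(p) + (1 - y) * np.log(1 - p)

def rolling_extremized(forecasts, outcomes, window=100):
    """For each question, fit d on the previous `window` resolved
    questions, then apply that d out-of-sample to the next forecast."""
    grid = np.linspace(0.5, 2.5, 81)
    preds = []
    for t in range(window, len(forecasts)):
        past_f = forecasts[t - window:t]
        past_y = outcomes[t - window:t]
        best_d = max(grid, key=lambda d: log_scores(extremize(past_f, d), past_y).mean())
        preds.append(extremize(forecasts[t], best_d))
    return np.array(preds)  # compare these against the unextremized forecasts
```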
> There seems to be a long tradition of extremizing in the academic literature (see the reference in the post above). Though, on the other hand, empirical studies have been sparse, and e.g. Satopaa et al. are cheating by choosing the extremization factor with the benefit of hindsight.
No offense, but the academic literature can do one.
> In this case I didn’t try too hard to find an extremization factor that would work; it took just two attempts, so I didn’t need to mine for one. But obviously we cannot generalize from just one example.
Again, I don’t find this very persuasive, given what I already knew about the history of Metaculus’ underconfidence.
> Extremizing has an intuitive meaning as accounting for the different pieces of information spread across experts, which gives it some weight (pun not intended). On the other hand, every extra parameter in the aggregation is another chance to shoot ourselves in the foot.
I think extremizing might make sense if the other forecasts aren’t public (since then the forecasts might be slightly more independent). When the other forecasts are public, I think extremizing makes less sense. This goes double when the forecasts are coming from a betting market.
> Intuitively, it seems like the overall confidence of a community should be roughly continuous over time? So the level of underconfidence on recent questions should be a good indicator of its confidence on the next few questions.
I find this the most persuasive. I think it ultimately depends on how you think people adjust for their past calibration. It’s taken the community ~5 years to reduce its underconfidence, so maybe it’ll take another 5 years. If people immediately update, I would expect this to be very unpredictable.