I get what you are saying, and I also harbor doubts about whether extremization is pure hindsight bias or whether there is something more to it.
Overall I still think it's probably justified, in cases like Metaculus, to extremize using the factor that would have optimized the last 100 resolved questions, and I would expect the geometric mean of odds extremized with such a factor to outperform the unextremized geometric mean over the next 100 binary questions to resolve (if pressed to put a number on it, maybe ~70% confidence without thinking about it too much).
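For concreteness, here is a minimal sketch of the aggregation under discussion (Python; the helper names are mine and purely illustrative): pool the individual forecasts as a geometric mean of odds, then extremize by raising the pooled odds to a power d.

```python
import math

def geo_mean_odds(probs):
    """Pool probabilities by taking the geometric mean of their odds."""
    log_odds = [math.log(p / (1 - p)) for p in probs]
    pooled_odds = math.exp(sum(log_odds) / len(log_odds))
    return pooled_odds / (1 + pooled_odds)

def extremize(p, d):
    """Raise the odds of p to the power d; d > 1 pushes the forecast away from 50%."""
    odds = (p / (1 - p)) ** d
    return odds / (1 + odds)

# Three forecasters at 60%, 70% and 80%:
pooled = geo_mean_odds([0.6, 0.7, 0.8])  # ~0.71
extremized = extremize(pooled, 1.5)      # ~0.79, pushed away from 50%
```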
My reasoning here is something like:
There seems to be a long tradition of extremizing in the academic literature (see the reference in the post above). On the other hand, empirical studies have been sparse, and e.g. Satopaa et al. are cheating by choosing the extremization factor with the benefit of hindsight.
In this case I didn't try very hard to find an extremization factor that would work; it took just two attempts, so I didn't have to mine for one. But obviously we cannot generalize from a single example.
Extremizing has an intuitive interpretation, as accounting for the different pieces of information spread across experts, which gives it some weight (pun not intended). On the other hand, every extra parameter in the aggregation is another chance to shoot ourselves in the foot.
Intuitively, it seems like the overall confidence of a community should be roughly continuous over time? So the level of underconfidence on recent questions should be a good indicator of its confidence on the next few questions.
So overall I am not super convinced, and a big part of my argument is an appeal to authority.
It also seems that extremizing by a factor of 1.5 works when looking at the last 330 questions.
I'd be curious about your thoughts here. Do you think that a 1.5-extremized geometric mean will outperform the unextremized geometric mean over the next 100 questions? What about a fine-tuned extremization factor chosen to optimize the last 100?
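As a sketch of what fitting such a factor could look like (Python; this is my own illustrative search, assuming the pooled forecasts and binary outcomes of the trailing window are at hand), a simple grid search over d that minimizes the average log score:

```python
import numpy as np

def extremize(probs, d):
    """Raise the odds of each probability to the power d."""
    odds = (probs / (1 - probs)) ** d
    return odds / (1 + odds)

def mean_log_loss(probs, outcomes):
    """Average negative log likelihood of binary outcomes (lower is better)."""
    return -np.mean(outcomes * np.log(probs) + (1 - outcomes) * np.log(1 - probs))

def best_factor(pooled, outcomes, grid=np.linspace(1.0, 3.0, 41)):
    """Grid-search the extremization factor that would have minimized
    log loss on a trailing window of resolved questions (e.g. the last 100)."""
    pooled = np.asarray(pooled, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return min(grid, key=lambda d: mean_log_loss(extremize(pooled, d), outcomes))
```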
Looking at the rolling performance of your method (optimize on the last 100 and use that factor to predict), the median, and the geometric mean of odds, I find they have been roughly indistinguishable over the last ~200 questions. If I look at the exact numbers, extremized_last_100 does win marginally, but looking at that chart I'd have a hard time saying "there's a 70% chance it wins over the next 100 questions". If you're willing to bet at 70% odds, I'd be interested.
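A walk-forward sketch of that rolling evaluation (Python; illustrative only, reusing the extremize, mean_log_loss and best_factor helpers from the sketch above, and assuming questions are ordered by resolution date):

```python
import numpy as np

def rolling_comparison(pooled, outcomes, window=100):
    """Walk forward through resolved questions: fit the extremization factor
    on the previous `window` questions, apply it to the next question, and
    score it against the unextremized pooled forecast."""
    pooled = np.asarray(pooled, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    fitted_losses, plain_losses = [], []
    for t in range(window, len(pooled)):
        d = best_factor(pooled[t - window:t], outcomes[t - window:t])
        fitted_losses.append(mean_log_loss(extremize(pooled[t:t + 1], d), outcomes[t:t + 1]))
        plain_losses.append(mean_log_loss(pooled[t:t + 1], outcomes[t:t + 1]))
    return float(np.mean(fitted_losses)), float(np.mean(plain_losses))
```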
> There seems to be a long tradition of extremizing in the academic literature (see the reference in the post above). On the other hand, empirical studies have been sparse, and e.g. Satopaa et al. are cheating by choosing the extremization factor with the benefit of hindsight.
No offense, but the academic literature can do one.
> In this case I didn't try very hard to find an extremization factor that would work; it took just two attempts, so I didn't have to mine for one. But obviously we cannot generalize from a single example.
Again, I don’t find this very persuasive, given what I already knew about the history of Metaculus’ underconfidence.
> Extremizing has an intuitive interpretation, as accounting for the different pieces of information spread across experts, which gives it some weight (pun not intended). On the other hand, every extra parameter in the aggregation is another chance to shoot ourselves in the foot.
I think extremizing might make sense if the other forecasts aren't public, since then the forecasts might be slightly more independent. When the other forecasts are public, I think extremizing makes less sense. This goes doubly so when the forecasts come from a betting market.
> Intuitively, it seems like the overall confidence of a community should be roughly continuous over time? So the level of underconfidence on recent questions should be a good indicator of its confidence on the next few questions.
I find this the most persuasive. I think it ultimately depends on how you think people adjust for their past calibration. It's taken the community ~5 years to reduce its underconfidence, so maybe it'll take another 5 years. If people update immediately, I would expect this to be very unpredictable.