> I am more hesitant to recommend the more complex extremization method where we use the historical baseline resolution log-odds
It’s the other way around for me. Historical baseline may be somewhat arbitrary and unreliable, but so is 1:1 odds. If the motivation for extremizing is that different forecasters have access to independent sources of information to move them away from a common prior, but that common prior is far from 1:1 odds, then extremizing away from 1:1 odds shouldn’t work very well, and historical baseline seems closer to a common prior than 1:1 odds does.
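To make the disagreement concrete, here is a toy Python sketch of the extremization step under discussion: average the forecasts in log-odds space, then push the average away from a baseline by a factor d. The only thing that differs between the two variants is the baseline being pushed away from, 1:1 odds versus a historical base rate; the forecasts, the 10% base rate, and d = 2 are all made-up numbers.

```python
import math

def logit(p):
    """Probability -> log-odds."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Log-odds -> probability."""
    return 1 / (1 + math.exp(-x))

def extremized_aggregate(forecasts, baseline_prob, d):
    """Average the forecasts in log-odds space, then push the average
    away from the baseline's log-odds by a factor d."""
    mean_lo = sum(logit(p) for p in forecasts) / len(forecasts)
    base_lo = logit(baseline_prob)
    return inv_logit(base_lo + d * (mean_lo - base_lo))

forecasts = [0.20, 0.25, 0.30]  # individual forecasters (made up)
d = 2.0                         # extremization factor (made up)

# Pushing away from 1:1 odds drags the aggregate further down:
print(round(extremized_aggregate(forecasts, baseline_prob=0.5, d=d), 2))  # ~0.10

# Pushing away from a ~10% historical base rate drags it up instead,
# because all the forecasters sit above that prior:
print(round(extremized_aggregate(forecasts, baseline_prob=0.1, d=d), 2))  # ~0.49
```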
I’m interested in how to get better-justified odds ratios to use as a baseline. One idea is to use past estimates of the same question. For example, suppose Metaculus asks “Does X happen in 2030?”, the question closes at the end of 2021, and then the exact same question is asked again at the beginning of 2022. Then the aggregated odds at which the first question closed can be used as a baseline for the second question. Perhaps you could do something more sophisticated: instead of closing the question and opening an identical one, keep the question open, but use the odds that experts gave it at some point in the past as a baseline with which to interpret more recent odds estimates provided by experts. Of course, none of this works if there hasn’t been an identical question asked previously and the question has only been open for a short amount of time.
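A quick sketch of what this could look like, with made-up numbers: the aggregate at which the earlier identical question closed (or the community estimate from an earlier snapshot of the same open question) becomes the baseline, and the newer forecasts are extremized away from it. The extremization factor d = 2 is again purely illustrative.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

# Aggregate probability at which the identical 2021 question closed
# (or the community estimate from an earlier snapshot of the same
# open question). Made-up number.
previous_aggregate = 0.30
base_lo = logit(previous_aggregate)

# Fresh individual forecasts on the re-opened 2022 question (made up).
new_forecasts = [0.40, 0.45, 0.35]
mean_lo = sum(logit(p) for p in new_forecasts) / len(new_forecasts)

d = 2.0  # extremization factor, ideally tuned on past resolved questions
print(round(inv_logit(base_lo + d * (mean_lo - base_lo)), 2))  # ~0.51
```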
Another possibility is to use two pools of forecasters, both of which have done calibration training, but one of which consists of subject-matter experts, and the other of which consists of people with little specialized knowledge on the subject matter, and ask the latter group not to do much research before answering. Then the aggregated odds of the non-experts can be used as a baseline when aggregating odds given by the experts, on the theory that the non-experts can give you a well-calibrated prior because of their calibration training, but won’t be taking into account the independent sources of knowledge that the experts have.
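And a sketch of how that two-pool setup could be wired together, again with made-up numbers: the calibrated non-experts are combined with a plain average of log-odds to form the prior, and the experts are then extremized away from that prior.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def mean_logodds(forecasts):
    return sum(logit(p) for p in forecasts) / len(forecasts)

# Calibrated non-experts answering without research: their plain
# (unextremized) aggregate stands in for a well-calibrated common prior.
non_experts = [0.15, 0.20, 0.25, 0.20]   # made-up numbers
base_lo = mean_logodds(non_experts)

# Subject-matter experts, each bringing partly independent evidence.
experts = [0.35, 0.45, 0.40]             # made-up numbers

d = 2.0  # extremization factor (made up)
print(round(inv_logit(base_lo + d * (mean_logodds(experts) - base_lo)), 2))  # ~0.64
```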
Thanks for chipping in, Alex!

> It’s the other way around for me. Historical baseline may be somewhat arbitrary and unreliable, but so is 1:1 odds.
Agreed! To give some nuance to my recommendation: the reason I am hesitant is mainly the lack of academic precedent (as far as I know).
> If the motivation for extremizing is that different forecasters have access to independent sources of information to move them away from a common prior, but that common prior is far from 1:1 odds, then extremizing away from 1:1 odds shouldn’t work very well.
Note that the data backs this up! Using “pseudo-historical” odds works noticeably better than using 1:1 odds. See the appendix for more details.
> [...] use past estimates of the same question.
> [...] use the odds that experts gave it at some point in the past as a baseline with which to interpret more recent odds estimates provided by experts.
I’d be interested in seeing the results of such experiments using Metaculus data!
> Another possibility is to use two pools of forecasters [...]
This one is trippy, I like it!