Presumably, when choosing your X, there is a trade-off between “having better forecasters” and “having more forecasters” (see this and this analysis on why more forecasters might be good).
FWIW, here, I found a correlation of -0.0776 between number of forecasters and Brier score. So more forecasters does seem to help, but not that much.
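For reference, a minimal sketch of how such a correlation could be computed, assuming a hypothetical table with one row per question (all column names and numbers below are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical per-question data (column names and values are made up):
# the number of forecasters on each question and the Brier score of the
# aggregate forecast on that question.
questions = pd.DataFrame({
    "n_forecasters": [12, 45, 130, 8, 300, 60],
    "brier_score": [0.21, 0.18, 0.12, 0.25, 0.10, 0.16],
})

# Pearson correlation; a negative value means that questions with more
# forecasters tend to have lower (better) Brier scores.
corr = np.corrcoef(questions["n_forecasters"], questions["brier_score"])[0, 1]
print(f"Correlation: {corr:.4f}")
```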
If you have the choice to ask a large crowd OR a small group of accomplished forecasters, you should maybe consider the crowd. This is especially true if you have access to past performance and can do something more sophisticated than Metaculus’ Community Prediction.
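As one illustration of “something more sophisticated” than a simple average (just a sketch with made-up numbers; not how the Community Prediction or the Metaculus Prediction actually work), you could weight each forecaster’s current prediction by the inverse of their past mean Brier score:

```python
import numpy as np

# Hypothetical current forecasts (probabilities for the same binary question)
# and the forecasters' past mean Brier scores (all numbers are made up).
forecasts = np.array([0.60, 0.72, 0.55, 0.80])
past_brier = np.array([0.15, 0.10, 0.30, 0.20])

# Weight each forecaster by the inverse of their past Brier score,
# so that historically more accurate forecasters count for more.
weights = 1 / past_brier
weights /= weights.sum()

print(f"Performance-weighted forecast: {np.dot(weights, forecasts):.3f}")
print(f"Unweighted mean:               {forecasts.mean():.3f}")
```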
Mannes 2014 found a select crowd to be better, although not by much, looking at 90 data sets.
Note they scored performance in terms of the mean absolute error, which is not a proper scoring rule, but I guess they would get qualitatively similar results if they had used a proper rule.
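To see why the mean absolute error is not proper: for a binary event with true probability 0.7, the expected absolute error is minimised by reporting 1, whereas the expected Brier score is minimised by reporting 0.7 itself. A quick illustrative check (the 0.7 is just an assumed example, not a number from the paper):

```python
import numpy as np

p_true = 0.7                      # assumed true probability of a binary event
reports = np.linspace(0, 1, 101)  # candidate reported probabilities

# Expected scores under the true probability (lower is better for both).
expected_brier = p_true * (1 - reports) ** 2 + (1 - p_true) * reports ** 2
expected_mae = p_true * (1 - reports) + (1 - p_true) * reports

print(f"Report minimising expected Brier score:    {reports[np.argmin(expected_brier)]:.2f}")  # 0.70
print(f"Report minimising expected absolute error: {reports[np.argmin(expected_mae)]:.2f}")    # 1.00
```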
I used the Metaculus reputation scores for my analysis to select the top forecasters. Reputation scores are used internally to compute the Metaculus Prediction and track performance relative to other forecasters. Using average Brier scores or log scores might yield very different results. Really: this entire analysis hinges on whether or not you think the reputation score is a good proxy for past performance. And it may be, but it might also be flawed.
I think it makes more sense to measure reputation according to the metric being used for performance, i.e. with the Brier/log score, as Mannes 2014 did (albeit using the mean absolute error). You could also try measuring reputation based on performance on questions of the same category, such that you get the best forecasters of each category.
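A minimal sketch of that per-category selection, assuming a hypothetical table of resolved questions with one row per forecaster-question pair (column names, scores and the top-k cutoff are all made up):

```python
import pandas as pd

# Hypothetical past performance: one row per (forecaster, resolved question),
# with the question's category and the forecaster's Brier score on it.
past = pd.DataFrame({
    "forecaster": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "category":   ["econ", "tech", "econ", "tech", "econ", "tech", "econ", "tech"],
    "brier":      [0.10, 0.30, 0.25, 0.12, 0.18, 0.20, 0.40, 0.15],
})

TOP_K = 2  # size of the select crowd per category (arbitrary choice)

# Mean Brier score per forecaster within each category, then keep the TOP_K
# forecasters with the lowest (best) mean score in each category.
mean_brier = past.groupby(["category", "forecaster"], as_index=False)["brier"].mean()
select_crowd = mean_brier.sort_values("brier").groupby("category").head(TOP_K)
print(select_crowd)
```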
Interesting, thanks for sharing the paper. Yeah, I agree that using the Brier score / log score might change the results, and it would definitely be good to check that as well.
Nice analysis!