I don’t think the forecaster needs 2^10 accounts if they pick a set of problems with mutually correlated outcomes. For example, you can make two accounts for AI forecasting, and have one account bet consistently more AI-skeptical than the community average and the other more AI-doomy. You could do more than 2, too, like very skeptical, skeptical, average, doomy, very doomy. One of them could end up with a good track record in AI forecasting.
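To make that concrete, here’s a toy simulation (entirely my own construction; the single latent “AI goes fast/slow” factor, the 50% community forecast, and the fixed offsets are assumptions for illustration, not anyone’s actual strategy). Because one shared factor drives all the correlated outcomes, whichever sock-puppet account happens to lean the right way ends up with a flattering Brier score:

```python
import random

random.seed(0)

N_QUESTIONS = 50
COMMUNITY_P = 0.5                        # assumed community forecast on every question
BIASES = [-0.3, -0.15, 0.0, 0.15, 0.3]   # very skeptical ... very doomy sock-puppet accounts

# One latent factor drives all outcomes: if "AI goes fast", most questions
# resolve yes; if not, most resolve no. That shared factor is what makes
# the outcomes mutually correlated.
fast_world = random.random() < 0.5
p_yes = 0.8 if fast_world else 0.2
outcomes = [1 if random.random() < p_yes else 0 for _ in range(N_QUESTIONS)]

def brier(p, outcome):
    return (p - outcome) ** 2

community = sum(brier(COMMUNITY_P, o) for o in outcomes) / N_QUESTIONS
for bias in BIASES:
    p = min(max(COMMUNITY_P + bias, 0.01), 0.99)
    score = sum(brier(p, o) for o in outcomes) / N_QUESTIONS
    print(f"bias {bias:+.2f}: Brier {score:.3f} vs community {community:.3f}")
```

With only a handful of accounts spread across the plausible range, at least one of them beats the community almost by construction.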
If doing well across domains is rewarded much more than similar performance within a domain, it would be harder to get away with this (assuming problems across domains have relatively uncorrelated outcomes, though you could probably find sources of correlation across some domains, like government competence). But then someone could cherry-pick only easy questions across domains to build their track record, so maybe there’s a balance to strike. Also, rather than an absolute score over possibly different sets of questions, like the Brier score, you should measure performance relative to peers on each question and average that. Maybe something like relative returns on investment in prediction markets, with a large number of bets across a large number of domains.
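A minimal sketch of that kind of peer-relative metric (the dict layout and the choice of the median peer as the baseline are my assumptions, not any platform’s actual scoring rule):

```python
from statistics import mean, median

def brier(p, outcome):
    return (p - outcome) ** 2

def peer_relative_score(forecasts, peer_forecasts, outcomes):
    """Average, over questions, of (your Brier) minus (median peer Brier) on
    the same question. Negative means you beat the typical peer. The data
    layout here is illustrative, not any platform's actual API."""
    diffs = []
    for q, p in forecasts.items():
        o = outcomes[q]
        median_peer = median(brier(pp, o) for pp in peer_forecasts[q])
        diffs.append(brier(p, o) - median_peer)
    return mean(diffs)

# toy usage with made-up numbers
forecasts = {"q1": 0.7, "q2": 0.2}
peers = {"q1": [0.4, 0.5, 0.6], "q2": [0.3, 0.5, 0.6]}
outcomes = {"q1": 1, "q2": 0}
print(peer_relative_score(forecasts, peers, outcomes))  # negative: better than the median peer
```

Because each question is scored against peers who answered that same question, picking only easy questions no longer inflates the number by itself.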
Good point on the correlated outcomes. I think you’re right that cross-domain performance could be a good measure, especially since performance in a single domain could be driven by having a single foundational prior that turned out to be right, rather than genuine forecasting skill.
On the second point, I’m pretty sure the Metaculus results already just compare your Brier to the community’s on the same set of questions. So you could base inter-forecaster comparisons on that difference (weakly).