I think you point to some potential for scepticism, but I don’t think this is convincing. Selective disclosure is unlikely to be a problem where a user can only point to summary statistics for their whole activity, like on Metaculus. An exception might be if only a subset of stats were presented, like ranking in past 3/6/12 months without giving Briers or other periods etc. But you could just ask for all the relevant stats.
The uncorrelated betting isn’t a problem if you just require a decent volume of questions in the track record. If you basically want at least 100 binary questions to form a track record, and say 20 of them were hard enough such that the malicious user wanted to hedge on them, you’d need 2^20 accounts to cover all possible answer sets. If they just wanted good performance on half of them, you’d still need 2^10 accounts.
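To make the arithmetic above concrete, here is a minimal sketch. It assumes every hedged question is binary and that outcomes are independent, so each account has to commit to one full yes/no pattern. The function names are made up for illustration.

```python
from itertools import product


def accounts_needed(num_hedged_questions: int) -> int:
    """Accounts required to cover every yes/no pattern on the hedged questions."""
    return 2 ** num_hedged_questions


def covering_answer_sets(num_hedged_questions: int):
    """List one answer pattern per account (only practical for small inputs)."""
    return list(product([0, 1], repeat=num_hedged_questions))


print(accounts_needed(20))  # hedging all 20 hard questions
print(accounts_needed(10))  # hedging only half of them

# With 3 hedged questions, 8 accounts cover every possible resolution.
small_cover = covering_answer_sets(3)
print(len(small_cover))
```

The count doubles with every extra hedged question, which is why a volume requirement makes this kind of hedging expensive when outcomes are independent.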
A more realistic reason for scepticism is that points and ranking on Metaculus are basically a function of activity over time. You can be only a so-so forecaster but have an impressive Metaculus record just by following the crowd on loads of questions or picking probabilities that guarantee points. But Brier scores, especially relative to the community, should reveal this kind of chicanery.
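A rough sketch of why a community-relative Brier score exposes crowd-following. This uses the standard Brier score for binary questions (mean squared error between forecast and outcome). The forecasts and outcomes are invented for illustration, and it is not a reproduction of how Metaculus itself computes its scores.

```python
def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def relative_brier(user_probs, community_probs, outcomes):
    """User Brier minus community Brier; negative means better than the crowd."""
    return brier(user_probs, outcomes) - brier(community_probs, outcomes)


# Hypothetical data: four resolved binary questions.
outcomes = [1, 0, 1, 1]
community = [0.7, 0.2, 0.6, 0.9]

crowd_follower = list(community)      # copies the community forecast
sharper = [0.9, 0.1, 0.8, 0.95]       # adds information beyond the crowd

print(relative_brier(crowd_follower, community, outcomes))
print(relative_brier(sharper, community, outcomes))
```

The crowd-follower may collect plenty of points through sheer activity, but their difference from the community is exactly zero, so the relative score shows they added nothing.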
The biggest reason for scepticism regarding forecasting as it’s used in EA is generalisation across domains. How confident should we be that the forecasters/heuristics/approaches that are good for U.S. political outcomes or Elon Musk activity translate successfully to predicting the future of AI or catastrophic pandemics or whatever? Michael Aird’s talk mentions some good reasons why some translation is reasonable to expect, but this is an open and ongoing question.
I don’t think the forecaster needs 2^10 accounts if they pick a set of problems with mutually correlated outcomes. For example, you can make two accounts for AI forecasting, and have one bet consistently more AI skeptical than the average and the other more AI doomy than the average. You could do more than 2, too, like very skeptical, skeptical, average, doomy, very doomy. One of them could end up with a good track record in AI forecasting.
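A toy sketch of the correlated case. It assumes the extreme version, where all 20 questions resolve the same way because they hang on one underlying fact. The account names and the 0.1/0.9 probabilities are invented for illustration.

```python
def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


NUM_QUESTIONS = 20

# Each account forecasts the same probability on every question.
accounts = {"skeptical": 0.1, "doomy": 0.9}


def best_account_score(latent_outcome):
    """Return (name, Brier) of the better account when all questions share one outcome."""
    outcomes = [latent_outcome] * NUM_QUESTIONS
    scores = {
        name: brier([p] * NUM_QUESTIONS, outcomes)
        for name, p in accounts.items()
    }
    best = min(scores, key=scores.get)
    return best, scores[best]


for latent in (0, 1):
    print(latent, best_account_score(latent))
```

Whichever way the underlying fact turns out, one of the two accounts ends up with a Brier score of about 0.01 across all 20 questions. With weaker correlation the effect is smaller, but far fewer than 2^10 accounts would still be needed.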
If doing well across domains is rewarded much more than similar performance within a domain, it would be harder to get away with this (assuming problems across domains have relatively uncorrelated outcomes, but you could probably find sources of correlation across some domains, like government competence). But then someone could look only for easy questions across domains to build their track record. So, maybe there’s a balance to strike. Also, rather than absolute performance across possibly different questions like the Brier score, you should measure performance relative to peers on each question and average that. Maybe something like relative returns on investment in prediction markets, with a large number of bets and across a large number of domains.
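One way the "relative to peers on each question, then average" idea could be sketched. It assumes squared error as the per-question score and the mean of the other forecasters on the same question as the baseline. The data layout and names are hypothetical, and this is not how any particular platform scores.

```python
def peer_relative_scores(forecasts, outcomes):
    """forecasts: {user: {question: prob}}; outcomes: {question: 0 or 1}.

    For each user, average over their questions of
    (mean peer squared error - own squared error).
    Positive means better than peers on the same questions.
    """
    results = {}
    for user, user_forecasts in forecasts.items():
        diffs = []
        for question, prob in user_forecasts.items():
            peer_errors = [
                (other[question] - outcomes[question]) ** 2
                for name, other in forecasts.items()
                if name != user and question in other
            ]
            if not peer_errors:
                continue  # nobody to compare against on this question
            own_error = (prob - outcomes[question]) ** 2
            diffs.append(sum(peer_errors) / len(peer_errors) - own_error)
        results[user] = sum(diffs) / len(diffs) if diffs else None
    return results


# Hypothetical data: one easy question, one hard question.
outcomes = {"q_easy": 1, "q_hard": 0}
forecasts = {
    "easy_picker": {"q_easy": 0.95},
    "generalist": {"q_easy": 0.95, "q_hard": 0.3},
    "peer": {"q_easy": 0.9, "q_hard": 0.6},
}

scores = peer_relative_scores(forecasts, outcomes)
print(scores)
```

In this toy data the easy-question picker has the best raw Brier score, but once each question is compared against peers, the generalist who took on the hard question comes out ahead.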
Good point on the correlated outcomes. I think you’re right that cross-domain performance could be a good measure, especially since performance in a single domain could be driven by having a single foundational prior that turned out to be right, rather than genuine forecasting skill.
On the second point, I’m pretty sure the Metaculus results already just compare your Brier to the community on the same set of questions. So you could (weakly) base inter-forecaster comparisons on that difference.