I think there are some tricky questions about comparing different forecasters and their predictions. If you simply take the Brier score, it can be Goodharted: people can choose the “easiest” questions and get far better scores than those taking on difficult questions.
I can think of a few ways to attack this:
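For concreteness, a minimal sketch of the failure mode (the probabilities and outcomes here are invented):

```python
# Brier score for binary questions: mean squared error between the forecast
# probability and the 0/1 outcome; lower is better.
def brier(forecasts, outcomes):
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Someone who only answers near-certain questions...
easy_picker = brier([0.97, 0.95, 0.98], [1, 1, 1])   # ≈ 0.001
# ...looks far better than a well-calibrated forecaster tackling genuinely uncertain ones.
hard_taker = brier([0.70, 0.60, 0.65], [1, 0, 1])    # ≈ 0.19
```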
Ranking forecasters:
Any two forecasters get ranked against each other according to their Brier scores on the questions they have both forecast on. I fear that this will lead to cyclical rankings, which could be dealt with using the Smith set or a Hodge decomposition (see the sketch below).
Forecasters are ranked according to their performance relative to all other forecasters on each question, making easier questions less impactful on a forecaster's score.
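A minimal sketch of the pairwise idea, assuming binary questions and a `{forecaster: {question_id: probability}}` data layout of my own choosing; when ties or missing overlaps prevent a clean cut, it just returns everyone:

```python
import itertools

def pairwise_beats(forecasts, outcomes):
    """'a beats b' if a has a strictly lower mean Brier score on the questions both answered."""
    beats = {}
    for a, b in itertools.permutations(forecasts, 2):
        shared = forecasts[a].keys() & forecasts[b].keys() & outcomes.keys()
        if not shared:
            continue  # no overlap, treat as incomparable

        def mean_brier(f):
            return sum((forecasts[f][q] - outcomes[q]) ** 2 for q in shared) / len(shared)

        beats[(a, b)] = mean_brier(a) < mean_brier(b)
    return beats

def smith_set(candidates, beats):
    """Smallest group whose members pairwise beat everyone outside it.
    Sorting by pairwise win count works because every Smith-set member
    necessarily has more pairwise wins than any outsider."""
    wins = {c: sum(beats.get((c, d), False) for d in candidates if d != c) for c in candidates}
    order = sorted(candidates, key=lambda c: -wins[c])
    for k in range(1, len(order) + 1):
        top, rest = order[:k], order[k:]
        if all(beats.get((t, r), False) for t in top for r in rest):
            return set(top)
    return set(order)  # ties / missing overlaps: no clean cut, so keep everyone
```

Anything inside the Smith set would still need a tie-breaker, e.g. the Hodge-decomposition idea.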
I’d also like to look into credibility theory to see whether it offers insights into ranking with different sample sizes, since IMDb uses it for ranking movies.
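If it helps, the movie-ranking trick is essentially a Bayesian average, which is the simplest form of credibility weighting. A sketch, where the prior weight `k` is an arbitrary choice of mine:

```python
def credibility_score(own_scores, population_mean, k=20):
    """Shrink a forecaster's mean Brier score towards the population mean;
    the fewer questions they have answered, the stronger the shrinkage.
    This mirrors the weighted-rating formula IMDb has described for its Top 250."""
    n = len(own_scores)
    if n == 0:
        return population_mean
    credibility = n / (n + k)  # classical credibility factor Z = n / (n + k)
    return credibility * (sum(own_scores) / n) + (1 - credibility) * population_mean

# Two lucky, near-perfect forecasts barely move someone away from the crowd average,
# while a long track record mostly speaks for itself.
credibility_score([0.001, 0.002], population_mean=0.15)  # ≈ 0.137
credibility_score([0.05] * 200, population_mean=0.15)    # ≈ 0.059
```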
I agree with your concerns about using a pure Brier score on open platforms. I expect that currently it makes the most sense within “tournaments” where participants answer every question. Technically, I think some sort of objective, proper scoring rule is a prerequisite for a more advanced scoring system that conveys more useful information in open contexts.
I’ve seen some sort of “relative Brier score” referenced frequently in associated research (definitely in the Good Judgment Project papers, at a minimum) that scores forecasters based on the difficulty of each question, as determined by the performance of others who forecasted it. This seems promising, and I expect there are a lot of options in that direction.
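I don’t know the exact construction those papers used, but a minimal sketch of one way a relative score could work, using the median Brier score of everyone who answered a question as the difficulty baseline (the data layout is my assumption, not taken from the papers):

```python
from statistics import median

def relative_brier(forecasts, outcomes):
    """forecasts: {forecaster: {question_id: probability}}, outcomes: {question_id: 0 or 1}.
    Each forecast is scored against the median Brier score of everyone who answered
    the same question, so easy questions (where everyone does well) count for little."""
    relative = {}
    for q, outcome in outcomes.items():
        scores = {f: (ps[q] - outcome) ** 2 for f, ps in forecasts.items() if q in ps}
        if len(scores) < 2:
            continue  # no peers, so no way to estimate difficulty
        baseline = median(scores.values())
        for f, s in scores.items():
            relative.setdefault(f, []).append(s - baseline)
    # Negative means better than the median peer on the questions they answered.
    return {f: sum(v) / len(v) for f, v in relative.items()}
```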
I like this idea :-)