We created a function that estimates reputability as "stars" on a 1-5 system using the forecasting platform, forecast count, and liquidity for prediction markets. The estimation came from volunteers acquainted with the various forecasting platforms. We're very curious for feedback here, both on what the function should be, and how to best explain and show the results.
[...] The ratings should reflect accuracy over time, and as data becomes available on prediction track records, aggregation and scoring can become less subjective.
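For concreteness, here is a minimal sketch of the kind of function described in the excerpt above, mapping (platform, forecast count, liquidity) to a 1-5 star rating. Every baseline, threshold, and adjustment below is a made-up placeholder for illustration, not the function actually used:

```python
# Illustrative sketch only: the platform baselines, thresholds, and adjustments
# below are invented for this example; they are not the actual function.
def estimate_stars(platform: str, forecast_count: int, liquidity: float = 0.0) -> int:
    """Map a few platform-level signals to a 1-5 star reputability guess."""
    # Hypothetical per-platform starting points.
    baselines = {
        "prediction_market": 3,
        "forecasting_tournament": 3,
        "one_off_survey": 2,
    }
    stars = baselines.get(platform, 2)

    # More forecasts / more liquidity -> trust the aggregate a bit more.
    if forecast_count >= 100 or liquidity >= 1000:
        stars += 1
    elif forecast_count < 5 and liquidity < 10:
        stars -= 1

    return max(1, min(5, stars))  # clamp to the 1-5 scale
```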
To check I understand, are the following statements roughly accurate:
"Ideally, you'd want the star ratings to be based on the calibration and resolution that now-resolved questions from that platform (or of that type, or similar) have tended to have in the past. But there's not yet enough data to allow that. So you asked people who know about each platform to give their best guess as to how each platform has historically compared in calibration and resolution."
Or maybe people gave their best guess as to how the platforms will compare on those fronts, based on who uses each platform, what incentives it has, etc.?
"Ideally, you'd want the star ratings to be based on the calibration and resolution that now-resolved questions from that platform (or of that type, or similar) have tended to have in the past. But there's not yet enough data to allow that. So you asked people who know about each platform to give their best guess as to how each platform has historically compared in calibration and resolution."
Yes. Note that I didn't actually ask them for their guess as such; I asked them to guess a function from various parameters to stars (e.g. "3 stars, but 2 stars when the probability is higher than 90% or less than 10%").
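That example rule, written out as a tiny function (the function name is hypothetical; the thresholds are exactly the ones from the example):

```python
# The elicited rule from the example above: "3 stars, but 2 stars when the
# probability is higher than 90% or less than 10%". Function name is made up.
def stars_for_example_platform(probability: float) -> int:
    if probability > 0.90 or probability < 0.10:
        return 2  # extreme probabilities get downgraded on this platform
    return 3
```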
Or maybe people gave their best guess as to how the platforms will compare on those fronts, based on who uses each platform, what incentives it has, etc.?
Also yes, past performance is highly indicative of future performance.
Also, unlike in some other uses for platform comparison, if one platform systematically had much easier questions which it almost always got right, I'd want to give it a higher score (but perhaps show its questions later in the search results, because they might be somewhat trivial).
This whole thing is a somewhat tricky issue, and one I'm surprised hasn't been discussed much before, to my knowledge.
But there's not yet enough data to allow that.
One issue here is that measurement is very tricky, because the questions are all over the place. Different platforms have very different questions of different difficulties. We don't yet really have metrics that compare forecasts among different sets of questions. I imagine historical data will be very useful, but extra assumptions would be needed.
We're trying to get at some question-general stat: basically, "expected score (which includes calibration + accuracy) adjusted for question difficulty."
One question this would be answering is, "If Question A is on two platforms, you should trust the one with more stars."
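One possible way (and only one; nothing in this thread commits to it) to formalize "expected score adjusted for question difficulty" is to score each forecast relative to a per-question reference forecast, such as a base rate or an all-platform aggregate, so a platform earns credit for beating the reference rather than for having easy questions:

```python
# Illustrative sketch: difficulty-adjusted accuracy as mean Brier improvement over a
# per-question reference forecast (e.g. a base rate or cross-platform aggregate).
# The reference forecasts are assumed to exist; the thread doesn't specify one.
def brier(p: float, outcome: int) -> float:
    return (p - outcome) ** 2

def difficulty_adjusted_score(forecasts, references, outcomes) -> float:
    """Mean improvement over the reference forecast; higher is better."""
    diffs = [
        brier(ref, y) - brier(p, y)
        for p, ref, y in zip(forecasts, references, outcomes)
    ]
    return sum(diffs) / len(diffs)

# Per-question improvements: narrowly beating an easy reference earns little,
# while clearly beating a hard (50/50) reference earns a lot.
easy = brier(0.95, 1) - brier(0.97, 1)   # ~0.0016
hard = brier(0.50, 1) - brier(0.80, 1)   # 0.21
print(easy, hard, difficulty_adjusted_score([0.97, 0.80], [0.95, 0.50], [1, 1]))
```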
Ah, that makes sense, thanks.
I guess now, given that, I'm a bit confused by the statement "as data becomes available on prediction track records, aggregation and scoring can become less subjective." Do you think data will naturally become available on the issue of differences in question difficulty across platforms? Or is it that you think that (a) there'll at least be more data on calibration + accuracy, and (b) people will think more (though without much new data) about how to deal with the question difficulty issue, and together (a) and (b) will reduce this issue?