We created a function that estimates reputability as "stars" on a 1-5 system using the forecasting platform, forecast count, and liquidity for prediction markets. The estimation came from volunteers acquainted with the various forecasting platforms. We're very curious for feedback here, both on what the function should be, and how to best explain and show the results.
[...] The ratings should reflect accuracy over time, and as data becomes available on prediction track records, aggregation and scoring can become less subjective.
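For concreteness, here is a minimal sketch of the kind of function described in the excerpt above, mapping (platform, forecast count, liquidity) to a 1-5 star rating. Every baseline, threshold, and adjustment below is a made-up placeholder for illustration, not the function actually used:

```python
# Illustrative sketch only: the platform baselines, thresholds, and adjustments
# below are invented for this example; they are not the actual function.
def estimate_stars(platform: str, forecast_count: int, liquidity: float = 0.0) -> int:
    """Map a few platform-level signals to a 1-5 star reputability guess."""
    # Hypothetical per-platform starting points.
    baselines = {
        "prediction_market": 3,
        "forecasting_tournament": 3,
        "one_off_survey": 2,
    }
    stars = baselines.get(platform, 2)

    # More forecasts / more liquidity -> trust the aggregate a bit more.
    if forecast_count >= 100 or liquidity >= 1000:
        stars += 1
    elif forecast_count < 5 and liquidity < 10:
        stars -= 1

    return max(1, min(5, stars))  # clamp to the 1-5 scale
```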
To check I understand, are the following statements roughly accurate:
"Ideally, you'd want the star ratings to be based on the calibration and resolution that now-resolved questions from that platform (or of that type, or similar) have tended to have in the past. But there's not yet enough data to allow that. So you asked people who know about each platform to give their best guess as to how each platform has historically compared in calibration and resolution."
Or maybe people gave their best guess as to how the platforms will compare on those fronts, based on who uses each platform, what incentives it has, etc.?
"Ideally, you'd want the star ratings to be based on the calibration and resolution that now-resolved questions from that platform (or of that type, or similar) have tended to have in the past. But there's not yet enough data to allow that. So you asked people who know about each platform to give their best guess as to how each platform has historically compared in calibration and resolution."
Yes. Note that I didn't actually ask them for their guess as such; I asked them to guess a function from various parameters to stars (e.g. "3 stars, but 2 stars when the probability is higher than 90% or less than 10%").
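That example rule, written out as a tiny function (the function name is hypothetical; the thresholds are exactly the ones from the example):

```python
# The elicited rule from the example above: "3 stars, but 2 stars when the
# probability is higher than 90% or less than 10%". Function name is made up.
def stars_for_example_platform(probability: float) -> int:
    if probability > 0.90 or probability < 0.10:
        return 2  # extreme probabilities get downgraded on this platform
    return 3
```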
Or maybe people gave their best guess as to how the platforms will compare on those fronts, based on who uses each platform, what incentives it has, etc.?
Also yes, past performance is highly indicative of future performance.
Also, unlike in some other uses for platform comparison, if one platform systematically had much easier questions which it almost always got right, I'd want to give it a higher score (but perhaps show its questions later in the search results, because they might be somewhat trivial).
This whole thing is a somewhat tricky issue, and one I'm surprised hasn't been discussed much before, to my knowledge.
But there's not yet enough data to allow that.
One issue here is that measurement is very tricky, because the questions are all over the place. Different platforms have very different questions of different difficulties. We don't yet really have metrics that compare forecasts among different sets of questions. I imagine historical data will be very useful, but extra assumptions would be needed.
We're trying to get at some question-general stat: basically, "expected score (which includes calibration + accuracy) adjusted for question difficulty."
One question this would be answering is, "If Question A is on two platforms, you should trust the one with more stars."
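One possible way (and only one; nothing in this thread commits to it) to formalize "expected score adjusted for question difficulty" is to score each forecast relative to a per-question reference forecast, such as a base rate or an all-platform aggregate, so a platform earns credit for beating the reference rather than for having easy questions:

```python
# Illustrative sketch: difficulty-adjusted accuracy as mean Brier improvement over a
# per-question reference forecast (e.g. a base rate or cross-platform aggregate).
# The reference forecasts are assumed to exist; the thread doesn't specify one.
def brier(p: float, outcome: int) -> float:
    return (p - outcome) ** 2

def difficulty_adjusted_score(forecasts, references, outcomes) -> float:
    """Mean improvement over the reference forecast; higher is better."""
    diffs = [
        brier(ref, y) - brier(p, y)
        for p, ref, y in zip(forecasts, references, outcomes)
    ]
    return sum(diffs) / len(diffs)

# Per-question improvements: narrowly beating an easy reference earns little,
# while clearly beating a hard (50/50) reference earns a lot.
easy = brier(0.95, 1) - brier(0.97, 1)   # ~0.0016
hard = brier(0.50, 1) - brier(0.80, 1)   # 0.21
print(easy, hard, difficulty_adjusted_score([0.97, 0.80], [0.95, 0.50], [1, 1]))
```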
Ah, that makes sense, thanks.
I guess now, given that, I'm a bit confused by the statement "as data becomes available on prediction track records, aggregation and scoring can become less subjective." Do you think data will naturally become available on the issue of differences in question difficulty across platforms? Or is it that you think that (a) there'll at least be more data on calibration + accuracy, and (b) people will think more (though without much new data) about how to deal with the question difficulty issue, and together (a) and (b) will reduce this issue?