This whole thing is a tricky issue, and one that, to my knowledge, hasn't been discussed much before.
But there’s not yet enough data to allow that.
One issue here is that measurement is very tricky, because the questions are all over the place. Different platforms pose very different questions, of very different difficulties, and we don't yet really have metrics that compare forecasts across different sets of questions. I imagine historical data will be very useful here, but extra assumptions would be needed.
We're trying to get at some question-general statistic: roughly, "expected score (which includes both calibration and accuracy), adjusted for question difficulty."
One question this would help answer is: "If Question A is on two platforms, which platform's forecast should you trust?" (e.g., "trust the one with more stars").
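The "score adjusted for question difficulty" idea above could be operationalized in many ways. As a minimal sketch (my own crude construction, not an established metric): score each binary forecast with the Brier score, treat each question's average Brier score across all forecasters as a rough proxy for its difficulty, and report how much better or worse each forecaster did than that baseline on the questions they actually answered.

```python
from collections import defaultdict

# Toy data: (question_id, forecaster, forecast probability, outcome).
# These names and values are illustrative, not from any real platform.
forecasts = [
    ("q1", "alice", 0.9, 1),
    ("q1", "bob",   0.6, 1),
    ("q2", "alice", 0.2, 0),
    ("q2", "bob",   0.5, 0),
]

def brier(p, outcome):
    """Brier score for a binary forecast: (p - outcome)^2, lower is better."""
    return (p - outcome) ** 2

# Average Brier score per question, as a crude difficulty proxy:
# questions everyone scored badly on count as "hard".
per_question = defaultdict(list)
for q, _, p, o in forecasts:
    per_question[q].append(brier(p, o))
difficulty = {q: sum(v) / len(v) for q, v in per_question.items()}

# Each forecaster's mean (Brier - question difficulty): negative means
# better than the field on the questions they answered.
adjusted = defaultdict(list)
for q, who, p, o in forecasts:
    adjusted[who].append(brier(p, o) - difficulty[q])
for who, scores in sorted(adjusted.items()):
    print(who, round(sum(scores) / len(scores), 3))
```

This sidesteps the cross-platform comparison problem only when forecasters overlap on the same questions; comparing disjoint question sets would still need the extra assumptions mentioned above.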
I guess now, given that, I'm a bit confused by the statement "as data becomes available on prediction track records, aggregation and scoring can become less subjective." Do you think data will naturally become available on differences in question difficulty across platforms? Or is it that you think (a) there'll at least be more data on calibration and accuracy, and (b) people will think more (though without much new data) about how to deal with the question-difficulty issue, and together (a) and (b) will reduce this problem?