Great investigation! Now I’m slightly less salty that your post is exclusively cited when it comes to the relation of range and accuracy (though I may still bask in the glory of second-hand citation :-p).
Few users updated their predictions, and updating was not associated with lower Brier scores overall, though there was not enough data to infer much here. Of 9230 updates, 3141 (34%) were performed by the most frequent individual predictor and 4710 (51%) were due to the top 3 most frequent updaters.
Matches my experience, though I think Metaculus is slightly better in this regard. Should still give observers pause about how suboptimal those platforms are.
The models were:
<1y: 0.9400*Prediction − 0.0154
1-3y: 0.9122*Prediction − 0.1066
3-5y: 0.8927*Prediction − 0.0837
5+y: 0.8587*Prediction − 0.1089
This is super cool!
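For my own reference, here's a minimal sketch of how I read these fits, assuming "Prediction" is the stated probability and the output is the implied resolution frequency (the function and variable names below are my own, not from the post):

```python
# Horizon-specific linear calibration models quoted above.
# Coefficients are from the post; names are my own.
HORIZON_MODELS = {
    "<1y":  (0.9400, -0.0154),
    "1-3y": (0.9122, -0.1066),
    "3-5y": (0.8927, -0.0837),
    "5+y":  (0.8587, -0.1089),
}

def implied_frequency(prediction: float, horizon: str) -> float:
    """Map a stated probability to the model's implied resolution frequency,
    clipped to [0, 1] since a linear fit can leave that range."""
    slope, intercept = HORIZON_MODELS[horizon]
    return min(1.0, max(0.0, slope * prediction + intercept))

# Example: a stated 70% on a 5+ year question implies roughly a 49% resolution frequency.
print(implied_frequency(0.70, "5+y"))
```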
I wanted to look into whether forecasters appeared to get better over time. For this, I took those forecasters with >100 predictions, and compared their performance on their first 50 predictions to their last 50.
The answer appeared to be “maybe”. There was no improvement in Brier scores or overconfidence, but it is possible that they attempted more difficult questions later on.
I think that 100 predictions just isn’t enough, especially if you’re not doing deliberate practice. I think my predictions started getting okay after having experienced ~100 question resolutions, which would imply several hundred predictions. Surprised to hear the reviewer had the opposite opinion!
It should be possible to test this by performing a similar analysis, but looking at predictions made after a certain number of resolutions for that user and checking whether there is an improvement. I think resolutions should be the focus here: You can learn very little from predictions that you don’t know the outcome of yet (though I’ve found it helpful to predict Metaculus with the community prediction hidden and then check against the community). I’m not sure it would be worth the effort to perform this analysis, but I’ll put it on my todo list.
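To make it concrete, here's a rough sketch of what I have in mind, assuming a dataframe with one row per prediction and hypothetical columns user, p (stated probability), outcome (0/1), created_at, and resolved_at (all names are my placeholders, not the actual dataset schema):

```python
import pandas as pd

def brier_by_experience(df: pd.DataFrame, threshold: int = 100) -> pd.DataFrame:
    """Compare mean Brier scores before vs. after a user has experienced
    `threshold` resolutions of their own questions."""
    df = df.sort_values("created_at").copy()

    def n_resolved_before(group: pd.DataFrame) -> pd.Series:
        # How many of this user's own questions had already resolved
        # at the time each prediction was made.
        resolved = group["resolved_at"].dropna().sort_values().values
        return group["created_at"].apply(lambda t: int((resolved < t).sum()))

    df["n_resolved_before"] = df.groupby("user", group_keys=False).apply(n_resolved_before)
    df["experienced"] = df["n_resolved_before"] >= threshold
    df["brier"] = (df["p"] - df["outcome"]) ** 2

    # Mean Brier score per user, split by whether the experience threshold was crossed.
    return df.groupby(["user", "experienced"])["brier"].mean().unstack()
```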
For the Metaculus data I could glean less information, as there were fewer questions, and no user level data available.
FWIW Metaculus now makes their user-level data available to researchers if you ask nicely.
Since we now know that 41% of things happen ;-), it’d be interesting to see whether things that are far off happen more rarely (or, in plain English, do questions with longer horizons resolve positively less often?). I don’t think you looked into this here, right?
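If the data is handy, something like this would answer it (again a sketch, assuming a per-question dataframe with made-up columns opened_at, resolved_at, and resolution, where 1 means the question resolved positively):

```python
import pandas as pd

def positive_rate_by_horizon(questions: pd.DataFrame) -> pd.Series:
    """Fraction of questions resolving positively, bucketed by question horizon."""
    horizon_years = (questions["resolved_at"] - questions["opened_at"]).dt.days / 365.25
    buckets = pd.cut(
        horizon_years,
        bins=[0, 1, 3, 5, float("inf")],
        labels=["<1y", "1-3y", "3-5y", "5+y"],
    )
    return questions.groupby(buckets, observed=True)["resolution"].mean()
```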
As for data sources, I’ve started working on a collection of forecasting datasets, but my funding for that ran out and wasn’t renewed :-/ Maybe I’ll find a way to finish it.