Great work and great writing, thank you. I wonder if there’s anything better powered than t-tests in this setting though?
ETA: is “which forecaster is best?” actually the right question to be answering? If the forecasts are close enough that we can’t tell the difference after 100 questions, maybe we don’t care about the difference?
Can’t think of anything better than a t-test, but open to suggestions.
If a forecaster is consistently off by something like 10 percentage points, I think that is a difference that matters. But even in that extreme scenario, where the (simulated) difference between two forecasters is in fact quite large, we have a hard time picking it up using standard significance tests.
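For anyone who wants to poke at this, here is a rough sketch of that kind of simulation. The setup is my own assumption and not necessarily the one used in the post: true probabilities are uniform on [0.05, 0.95], the better forecaster reports the truth, the worse one is off by 10 percentage points in a random direction, both are scored with the Brier score, and the test is a paired t-test over 100 questions. All function names are made up for illustration.

```python
import math
import random
import statistics

# Two-sided 5% critical value of Student's t with 99 degrees of freedom.
# Only valid for n_questions=100; change it if you change the sample size.
T_CRIT_DF99 = 1.984


def brier(forecast, outcome):
    """Brier score for one binary question (lower is better)."""
    return (forecast - outcome) ** 2


def simulate_tournament(rng, n_questions=100, offset=0.10):
    """Return per-question Brier differences (worse minus better forecaster).

    Assumed setup: true probabilities are uniform on [0.05, 0.95]; the better
    forecaster reports the truth; the worse one is off by `offset` in a random
    direction, clipped to [0.01, 0.99].
    """
    diffs = []
    for _ in range(n_questions):
        p = rng.uniform(0.05, 0.95)
        outcome = 1 if rng.random() < p else 0
        shifted = p + rng.choice((-offset, offset))
        worse = min(0.99, max(0.01, shifted))
        diffs.append(brier(worse, outcome) - brier(p, outcome))
    return diffs


def paired_t(diffs):
    """Paired t statistic for H0: the mean score difference is zero."""
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se


def run(n_sims=2000, seed=0):
    """Return (rejection rate at the 5% level, share where the worse forecaster wins)."""
    rng = random.Random(seed)
    rejected = 0
    worse_wins = 0
    for _ in range(n_sims):
        diffs = simulate_tournament(rng)
        if abs(paired_t(diffs)) > T_CRIT_DF99:
            rejected += 1
        if statistics.mean(diffs) < 0:  # worse forecaster has the lower mean Brier score
            worse_wins += 1
    return rejected / n_sims, worse_wins / n_sims


if __name__ == "__main__":
    rejection_rate, worse_wins = run()
    print(f"share of tournaments with p < 0.05: {rejection_rate:.2f}")
    print(f"share where the worse forecaster scores better: {worse_wins:.2f}")
```

The exact numbers depend heavily on the assumed distribution of true probabilities and on the scoring rule, so treat the output as illustrative only.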
Afraid I don’t have good ideas here.
Intuitively, I think there should be a way to take advantage of the fact that the outcomes are heavily structured. You have predictions on the same questions and they have a binary outcome.
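One piece of that structure is easy to illustrate: because both forecasters answer the same questions, their scores are strongly correlated, and a paired test exploits that. The sketch below compares a paired t statistic with one that ignores the pairing, on simulated Brier scores. The simulation setup and the helper names are my own assumptions, and if the t-test under discussion is already paired, this gain is already being used.

```python
import math
import random
import statistics


def t_statistics(scores_a, scores_b):
    """Return (paired_t, unpaired_t) for two score lists on the same questions."""
    n = len(scores_a)
    mean_diff = statistics.mean(scores_a) - statistics.mean(scores_b)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    # Paired standard error: based on the spread of per-question differences.
    paired_se = statistics.stdev(diffs) / math.sqrt(n)
    # Welch-style standard error that ignores the question-level pairing.
    unpaired_se = math.sqrt(
        (statistics.variance(scores_a) + statistics.variance(scores_b)) / n
    )
    return mean_diff / paired_se, mean_diff / unpaired_se


def simulate_scores(n_questions=100, offset=0.10, seed=0):
    """Brier scores for a worse and a better forecaster (assumed setup).

    True probabilities are uniform on [0.05, 0.95]; the better forecaster
    reports the truth, the worse one is off by `offset` in a random direction.
    """
    rng = random.Random(seed)
    worse_scores, better_scores = [], []
    for _ in range(n_questions):
        p = rng.uniform(0.05, 0.95)
        outcome = 1 if rng.random() < p else 0
        q = min(0.99, max(0.01, p + rng.choice((-offset, offset))))
        worse_scores.append((q - outcome) ** 2)
        better_scores.append((p - outcome) ** 2)
    return worse_scores, better_scores


if __name__ == "__main__":
    worse, better = simulate_scores()
    paired, unpaired = t_statistics(worse, better)
    print(f"paired t: {paired:.2f}, unpaired t: {unpaired:.2f}")
```

The paired statistic is larger in magnitude whenever the two score series are positively correlated, which is the typical case when both forecasters face the same binary outcomes. Whether something exists that clearly beats the paired t-test is a separate question this sketch does not answer.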
OTOH, if the worse forecaster comes out ahead on average in 20% of cases, that suggests there is just a hard bound on how much signal we can extract.