Happy to see this focus. I still find it quite strange how little attention the general issue has gotten from other groups, and how few decent studies exist.
I feel like one significant distinction for these discussions is calibration vs. resolution. It was mentioned in the footnotes (with a useful table), but I think it deserves more attention here.
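In case it helps to make the distinction concrete, here is a minimal sketch of the standard Murphy decomposition of the Brier score into reliability (calibration), resolution, and uncertainty; the function name, binning choice, and simulated forecaster are my own, not anything from the post:

```python
import numpy as np

def murphy_decomposition(forecasts, outcomes, n_bins=10):
    """Split the Brier score as BS = REL - RES + UNC (Murphy 1973).
    REL measures miscalibration; RES measures how well the forecasts
    separate events from non-events; UNC is the base-rate variance."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()
    bins = np.clip((forecasts * n_bins).astype(int), 0, n_bins - 1)

    rel = res = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        f_bar = forecasts[mask].mean()       # average forecast in this bin
        o_bar = outcomes[mask].mean()        # observed frequency in this bin
        w = mask.mean()                      # fraction of forecasts in this bin
        rel += w * (f_bar - o_bar) ** 2      # calibration error
        res += w * (o_bar - base_rate) ** 2  # separation from the base rate
    unc = base_rate * (1 - base_rate)
    return rel, res, unc

# A simulated, well-calibrated forecaster: forecasts sit at bin centers
# and outcomes are drawn with exactly those probabilities.
rng = np.random.default_rng(0)
p = (rng.integers(0, 10, size=5000) + 0.5) / 10
y = rng.binomial(1, p)
rel, res, unc = murphy_decomposition(p, y)
print(rel, res, unc, rel - res + unc, np.mean((p - y) ** 2))  # rel ~ 0; rel - res + unc ~ Brier
```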
If long-term calibration is expected to be reasonable, then I would assume we could get much of the information we care about regarding forecasting ability from the resolution numbers. If forecasters are confident in their predictions over a 5-20+ year time frame, this would show up as correspondingly high-resolution forecasts. If we want to compare these to baselines, we could set the baselines up now and compare resolution numbers.
We could also have forecasters do meta-forecasts: forecasts about forecasts. I believe the straightforward resolution numbers should provide the main important data, but there could be other things you may be interested in; for example, “What average level of resolution could we get on this set of questions if we spent X resources forecasting them?” If the forecasters were decently calibrated, the main way this could go poorly is if the predictions to these questions turned out to have low resolution, and if so, that would become apparent quickly.
The much trickier thing seems to be calibration. If we cannot trust forecasters to be calibrated over long time horizons, then the resolution of their forecasts is likely to be misleading, possibly in a highly systematic and deceptive way.
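As a toy illustration of that failure mode (made-up numbers, purely my own sketch): an overconfident forecaster with the same underlying information looks much sharper from the forecasts alone, but scores worse once the questions resolve, which for long-term questions we would only learn decades later.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = rng.uniform(0.3, 0.7, 5000)             # the real, modest signal on each question
outcomes = rng.binomial(1, true_p)

calibrated = true_p                              # reports exactly what it knows
overconfident = np.clip(0.5 + 3 * (true_p - 0.5), 0.01, 0.99)  # same info, pushed toward 0/1

for name, f in [("calibrated", calibrated), ("overconfident", overconfident)]:
    implied_sharpness = np.var(f)                # what we'd read off the forecasts today;
                                                 # equals resolution only if calibration holds
    brier = np.mean((f - outcomes) ** 2)         # what we'd only find out after resolution
    print(f"{name:14s} implied sharpness {implied_sharpness:.3f}  Brier {brier:.3f}")
```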
However, long-term calibration seems like a relatively constrained question to me, and one with a possibly quite positive outlook. My impression from the table and spreadsheet is that, in general, calibration was quite similar for short-term and long-term forecasts. Also, it’s not clear to me why calibration would be dramatically worse on long-term questions than on specific short-term questions that we could test cheaply. For instance, if we expected forecasters to be poorly calibrated on long-term questions because the incentives are poor, we could have them forecast very short-term questions with similarly poor incentives. I recall Anthony Aguirre speculating that he didn’t expect Metaculus forecasters’ incentives to change much for long-term questions, but I forget where this was mentioned (it may have been a podcast).
Having some long-term studies seems quite safe as well, but I’m not sure how much extra benefit they would give us compared to more rapid short-term studies combined with large sets of long-term predictions by calibrated forecasters (which should come with resolution numbers).
Separately, I missed the footnotes on my first read-through, but I think they may have been my favorite part. The link is a bit small (though clicking on the citation numbers brings it up).