Thanks, Peter!
To your questions:
I'm fairly confident (let's say 80%) that Metaculus has underestimated progress on benchmarks so far. This doesn't mean it will keep doing so in the future, because (i) forecasters may have learned from this experience to be more bullish and/or (ii) AI progress might slow down. I wouldn't bet on (ii), but I expect (i) has already happened to some extent; it has certainly happened to me!
The other categories have fewer questions, and some have special circumstances that make the evidence of bias much weaker in my view. Specifically, the biggest misses in "compute" came from GPU price spikes that can probably be explained by post-COVID supply chain disruptions and increased demand from crypto miners. Both of these factors were transient.
I like your example with the two independent dice. My takeaway is that, if you have access to a prior that's more informative than a uniform distribution (in this case, "both dice are fair, so their sum must follow a triangular distribution"), then you should compare your performance against that. My assumption when writing this was that a (log-)uniform prior over the relevant range was the best we could do for these questions. This is in line with the fact that Metaculus's log score on continuous questions is normalized using a (log-)uniform distribution.
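To make the dice example concrete, here's a minimal sketch in Python (my own illustration, not Metaculus's actual scoring code) comparing the expected log score of the informative triangular prior against a uniform baseline over the same outcomes:

```python
import numpy as np

# Sum of two fair dice: outcomes 2..12 with a triangular pmf.
outcomes = np.arange(2, 13)
triangular = (6 - np.abs(7 - outcomes)) / 36
uniform = np.full(len(outcomes), 1 / len(outcomes))

def expected_log_score(forecast, truth=triangular):
    """Expected log score of `forecast` when outcomes are truly triangular."""
    return np.sum(truth * np.log(forecast))

# The informative prior scores strictly better in expectation,
# so it is the fairer baseline to normalize against when available.
print(f"uniform prior:    {expected_log_score(uniform):.3f}")     # ~ -2.398
print(f"triangular prior: {expected_log_score(triangular):.3f}")  # ~ -2.270
```

The gap between the two numbers is the KL divergence of the triangular distribution from the uniform one, i.e., how much score a forecaster leaves on the table by being compared against the less informative baseline.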
That's a good point re: different time horizons. I didn't bother to check the average time between close and resolution for all questions on the platform, but, assuming it's <<1 year as you suggest, I agree it's an important caveat. If you know that number off the top of your head, I'll add it to the post.