This 25% forecast is an order of magnitude different from the Metaculus estimate of 2-2.5%
I don’t get why two different groups of forecasters’ aggregated results end up more than an order of magnitude apart.
Any idea why? Have I misunderstood something? Is one group known to be better? Or is one group more likely to be biased? Or is forecasting risks just really super unreliable and not a thing to put much weight on?
https://www.metaculus.com/questions/2568/ragnar%C3%B6k-seriesresults-so-far/
In terms of forecasting accuracy on Metaculus, Eli’s individual performance is comparable[1] to the community aggregate, despite his having optimised for volume (he’s 10th on the heavily volume-weighted leaderboard). I expect that had he pushed less hard for volume, he’d have significantly outperformed the community aggregate even as an individual.[2]
Assuming the other Samotsvety forecasters are comparably good, I’d expect the aggregated forecasts from the group to very comfortably outperform the community aggregate, even if they weren’t paying unusual attention to the questions (which they are).
[1] Comparing ‘score at resolution time’, Eli looks slightly worse than the community. Comparing ‘score across all times’, Eli looks better than the community. Score across all times is a better measure of skill when comparing individuals, but does disadvantage the community prediction, because at earlier times questions have fewer predictors.
[2] As some independent evidence of this, I comfortably outperform the community aggregate, having tried less hard than Eli to optimise for volume. Eli has beaten me in more than one competition, and I think he’s a better forecaster.
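For anyone unfamiliar with the two metrics in footnote [1], here’s a minimal sketch of the difference, assuming Metaculus’s baseline-relative binary log score (log2(p/0.5), higher is better) and a simple daily average; the numbers are made up and this isn’t Metaculus’s exact implementation:

```python
import math

def log_score(p: float, resolved_yes: bool) -> float:
    """Baseline-relative binary log score (higher is better): log2(p/0.5)
    if the question resolves YES, log2((1-p)/0.5) if it resolves NO."""
    return math.log2(p / 0.5) if resolved_yes else math.log2((1 - p) / 0.5)

# Illustrative daily forecasts over a question's life, oldest first.
daily_forecasts = [0.55, 0.60, 0.70, 0.80, 0.85]
resolved_yes = True

score_at_resolution = log_score(daily_forecasts[-1], resolved_yes)
score_across_all_times = sum(log_score(p, resolved_yes)
                             for p in daily_forecasts) / len(daily_forecasts)

print(f"at resolution:    {score_at_resolution:.3f}")     # only the final forecast counts
print(f"across all times: {score_across_all_times:.3f}")  # being right early counts too
```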
In addition to the points above, there have been a few jokes on questions like that about the scoring rule not being proper (if the world ends, you don’t get the negative points for being wrong!). Not sure how much of a factor that is, though, and I could imagine it being minimal.
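To make the joke concrete: the log score is strictly proper, so reporting your true probability maximises your expected score; but if the ‘world ends’ branch is never scored (nobody is around to collect the penalty), the expectation runs over the surviving branch only, and the optimum slides toward zero. A toy sketch of my own, reusing the baseline-relative log score from above:

```python
import math

def expected_score(report: float, true_p: float, scored_if_doom: bool) -> float:
    """Expected log score for reporting `report` on 'will the world end?',
    given true probability `true_p`. If `scored_if_doom` is False, the doom
    branch contributes nothing, because no one is around to be scored."""
    doom = true_p * math.log2(report / 0.5) if scored_if_doom else 0.0
    survive = (1 - true_p) * math.log2((1 - report) / 0.5)
    return doom + survive

true_p = 0.25
reports = [0.01, 0.05, 0.25, 0.50]
for scored_if_doom in (True, False):
    best = max(reports, key=lambda r: expected_score(r, true_p, scored_if_doom))
    print(f"doom branch scored={scored_if_doom}: best report = {best}")
# Scored normally, the honest 0.25 maximises expected score (the rule is
# proper); with the doom branch unscored, lower reports always do better.
```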
My take is that we should give little weight to Metaculus. From footnote 2 of the post:
Why do I give little weight to Metaculus’s views on AI? Primarily because of the incentives to make very shallow forecasts on a ton of questions (e.g. probably <20% of Metaculus AI forecasters have done the equivalent work of reading the Carlsmith report), and secondarily that forecasts aren’t aggregated from a select group of high performers but instead from anyone who wants to make an account and predict on that question.
(Edited to add: I see the post you linked also includes the “Metaculus prediction”, which theoretically performs significantly better than the community prediction by weighting stronger predictors more heavily. But if you look at its actual track record, it doesn’t do much better than the community. For binary questions at resolve time, it has a log score of 0.438 vs. 0.426 for the community (higher is better). At all times, it gets 0.280 vs. 0.261. For continuous questions at resolve time, it has a log score of 2.19 vs. 2.12. At all times, it gets 1.57 vs. 1.55.)
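For a rough sense of scale on those binary numbers: inverting an average score through the baseline-relative formula log2(p/0.5) gives the ‘equivalent’ probability placed on the correct outcome. This is only a back-of-the-envelope illustration (an average of log scores doesn’t really correspond to a single probability):

```python
# Back-of-the-envelope: invert an average binary log score s = log2(p/0.5)
# into the "equivalent" probability assigned to the correct outcome.
def implied_p(avg_log_score: float) -> float:
    return 0.5 * 2 ** avg_log_score

for label, score in [("Metaculus prediction", 0.438), ("community", 0.426)]:
    print(f"{label}: ~{implied_p(score):.1%}")
# ~67.7% vs. ~67.2% on the correct side: a real but small edge.
```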
That said:
Or is forecasting risks just really super unreliable and not a thing to put much weight on?
I wouldn’t want people to overestimate the precision of the estimates in this post! Take them as a few data points among many. I also think it’s very healthy for the community if many people are forming inside views about AI risk, though I understand it’s difficult, and I had a hard time with it myself for a while.
Ah the answer was in the footnotes all along. Silly me. Thank you for the reply!