More generally, though probably especially on AI. I’m not exactly sure how to handle some of the evidence for why I’m pretty confident on this, but I can gesture at a few data points that seem fine to share:
Nuño, Misha and I each did much better on INFER than at least a few superforecasters (and we were 3 of the top 4 on the leaderboard; no superforecasters did better than us).
I’ve seen a few superforecaster aggregate predictions that seemed pretty obviously bad in advance; in two cases I pre-registered my disagreement, and in both I was right:
I said on Twitter that I thought supers were overconfident on Russia’s invasion of Ukraine; supers were at 84% no invasion, I was at 55% (and Metaculus was similar, to be fair). Unfortunately, Russia went on to invade Ukraine a few weeks later.
I noticed that superforecasters might be overconfident about the rise of the Delta COVID variant, and it seems they likely were: they predicted a 14% chance that the 7-day median would rise above 140k cases and a 2% (!) chance that it would rise above 210k. It ended up peaking at about 150k, so this is weak evidence, but 2% still seems crazy low (see the scoring sketch below).
I’m hesitant to make it seem like I’m bashing superforecasters; I think they’re amazing overall relative to the general public. But I think Samotsvety is even more amazing :D I also think some superforecasters are much better forecasters than others.
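To give a feel for why a 2% forecast is so aggressive, here is a minimal scoring sketch in Python using the standard Brier and log scoring rules. The probabilities are the ones quoted above; the resolutions are my assumption that the ~150k peak means the 140k question resolved yes and the 210k question resolved no:

```python
import math

def brier(p, outcome):
    """Squared error between the forecast probability and the 0/1 outcome (lower is better)."""
    return (p - outcome) ** 2

def log_score(p, outcome):
    """Log of the probability assigned to what actually happened (closer to 0 is better)."""
    return math.log(p if outcome == 1 else 1 - p)

# Forecasts quoted above, scored against the assumed resolutions:
# 7-day median above 140k -> happened (peak ~150k); above 210k -> did not happen.
forecasts = [
    ("7-day median > 140k", 0.14, 1),
    ("7-day median > 210k", 0.02, 0),
]

for name, p, outcome in forecasts:
    print(f"{name}: Brier {brier(p, outcome):.4f}, log score {log_score(p, outcome):.3f}")

# The asymmetry that makes 2% look aggressive: when the event does not happen,
# 2% barely beats a more cautious 20% (log scores of about -0.02 vs -0.22),
# but if the event does happen, the 2% forecast takes a huge penalty.
print("log score if a 2% event happens:", round(math.log(0.02), 2))  # about -3.91
```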
This doesn’t sound like an outlandish claim to me. Still, I’m not yet convinced.
I was really into Covid forecasting at the time, so I went back through my comment history and noticed that this seemed like an extremely easy call at the time. (I made this comment 15 days before yours, where I predicted >100,000 cases with 98% confidence, saying I’d probably go to 99% after more checking of my assumptions. Admittedly, >100,000 cases in a single day is significantly less than >140,000 cases for the 7-day average. Still, a confidence level of 98%+ suggests that I’d definitely have put a lot more than 14% on the latter.) This makes me suspect that maybe that particular question was quite unrepresentative of the average track record of superforecasters? Relatedly, if we only focus on instances where it’s obvious that some group’s consensus is wrong, it’s probably somewhat easy to find such instances (even for elite groups) because of the favorable selection effect at work. A thorough analysis would look at the track record on a pre-registered selection of questions.
Edit: The particular Covid question is strong evidence for “sometimes superforecasters don’t seem to be trying as much as they could.” So maybe your point is something like “On questions where we try as hard as possible, I trust us more than the average superforecaster prediction.” I think that stance might be reasonable.
As a superforecaster, I’m going to strongly agree with “sometimes superforecasters don’t seem to be trying as much as they could”; they aren’t incentivized to do deep dives into every question.
I’d say they are individually somewhere between Metaculus and a more ideal group, which Samotsvety seems to be close to, but I’m not an insider and have limited knowledge of how you manage epistemic issues like independent elicitation before discussion. One thing Samotsvety does not have, unfortunately, is the kind of more sophisticated aggregation algorithm used by Metaculus and GJ, nor the same level of diversity as either, though overall I see those as less important than more effort by properly calibrated forecasters.
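For concreteness, here is a minimal sketch of one aggregation scheme in that family: the geometric mean of odds with an extremization step. This is only an illustrative example of the general idea, not the actual (and more involved) algorithms Metaculus or GJ use:

```python
import math

def extremized_geo_mean_odds(probs, extremization=1.5):
    """Pool forecasts by taking the geometric mean of their odds,
    then extremize (push the pooled forecast away from 0.5) by raising
    the pooled odds to a power. One common recipe, not the Metaculus/GJ algorithm."""
    odds = [p / (1 - p) for p in probs]
    pooled_odds = math.prod(odds) ** (1 / len(odds))
    extremized = pooled_odds ** extremization
    return extremized / (1 + extremized)

# Example: five forecasters who mostly lean "yes".
individual = [0.60, 0.70, 0.65, 0.80, 0.55]
print("simple mean:       ", round(sum(individual) / len(individual), 3))  # 0.66
print("geo mean of odds:  ", round(extremized_geo_mean_odds(individual, 1.0), 3))
print("extremized (d=1.5):", round(extremized_geo_mean_odds(individual), 3))
```

The usual rationale for the extremization step is that individual forecasters each see only part of the available evidence, so an unextremized pool tends to be underconfident; the exponent is typically tuned on past data.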
Really appreciate this deep dive!

This doesn’t sound like an outlandish claim to me. Still, I’m not yet convinced.
Yeah, I think the evidence I feel comfortable sharing right now is enough to get to some confidence but perhaps not high confidence, so this is fair. The INFER point is probably stronger than the two bad predictions, which is why I put it first.
I was really into Covid forecasting at the time, so I went back through my comment history and noticed that this seemed like an extremely easy call at the time… Relatedly, if we only focus on instances where it’s obvious that some group’s consensus is wrong, it’s probably somewhat easy to find such instances (even for elite groups) because of the favorable selection effect at work. A thorough analysis would look at the track record on a pre-registered selection of questions.
I agree that a more thorough analysis looking at the track record on a pre-registered selection of questions would be great. It’s pretty hard to assess, though, because the vast majority of superforecaster predictions are private and not on their public dashboard. Speaking for myself, I’d be pretty excited about a Samotsvety vs. supers vs. [any other teams who were interested] tournament happening.
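As a toy illustration of what scoring such a head-to-head could look like: both teams forecast the same pre-registered questions, and you compare mean Brier scores once everything resolves. The teams, probabilities, and outcomes below are invented purely for illustration:

```python
# Toy scoring of a head-to-head on a shared, pre-registered question set.
# All numbers below are invented for illustration.

def brier(p, outcome):
    """Squared error between the forecast probability and the 0/1 outcome (lower is better)."""
    return (p - outcome) ** 2

# question -> (team A's probability, team B's probability, resolution as 0/1)
questions = {
    "Q1": (0.80, 0.60, 1),
    "Q2": (0.10, 0.25, 0),
    "Q3": (0.55, 0.40, 1),
    "Q4": (0.30, 0.30, 0),
}

def mean_brier(team_index):
    scores = [brier(vals[team_index], vals[2]) for vals in questions.values()]
    return sum(scores) / len(scores)

print("Team A mean Brier:", round(mean_brier(0), 3))  # lower is better
print("Team B mean Brier:", round(mean_brier(1), 3))
```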
That being said, I’m confused about how you seem to be taking “I was really into Covid forecasting at the time, so I went back through my comment history and noticed that this seemed like an extremely easy call at the time” as an update toward superforecasters being better? If anything, this feels like an update against superforecasters. The point I was trying to make was that it was a foreseeably wrong prediction, and you further confirmed it.
I’d also say, on the cherry-picking point, that I wasn’t exactly checking the superforecaster public dashboard very often over the last few years (maybe I’ve checked it on ~25-50 days total), and there are only ~5 predictions up at a time.
Edit: The particular Covid question is strong evidence for “sometimes superforecasters don’t seem to be trying as much as they could.” So maybe your point is something like “On questions where we try as hard as possible, I trust us more than the average superforecaster prediction.” I think that stance might be reasonable.
I think it’s fair to interpret the Covid question to some extent as superforecasters not trying, but I’m confused about how you seem to be attributing little of it to prediction error? It could be a combination of both.
I think it’s fair to interpret the Covid question to some extent as superforecasters not trying, but I’m confused about how you seem to be attributing little of it to prediction error? It could be a combination of both.
Good point. I over-updated on my feeling of “this particular question felt so easy at the time,” to the point where I couldn’t imagine why anyone who put serious time into it would get it badly wrong.
However, on reflection, I think it’s most plausible that different types of information were salient to different people, which could have caused superforecasters to make prediction errors even if they were trying seriously. (Specifically, the question felt easy to me because I happened to have a lot of detailed info on the UK situation, which presented one of the best available examples to use for forming a reference class.)
You’re right that I essentially gave even more evidence for the claim you were making.