From what I understand, effect size is one of the better predictors of whether a study will replicate. For example, this paper found that 77% of the reported replication effect sizes fell within a 95% prediction interval based on the original effect size.
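For concreteness, here is a minimal sketch of how such a prediction interval can be computed for a standardized mean difference (Cohen's d). This is not necessarily the exact procedure used in that paper; it just applies the usual large-sample variance approximation for d, and the sample sizes are invented for illustration.

```python
import math
from scipy import stats

def d_variance(d, n1, n2):
    # Large-sample variance approximation for Cohen's d (two independent groups).
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

def replication_prediction_interval(d_orig, n_orig, n_rep, level=0.95):
    # Interval in which the replication's estimate is expected to fall,
    # assuming the original estimate is an unbiased estimate of a common true effect.
    total_var = d_variance(d_orig, n_orig, n_orig) + d_variance(d_orig, n_rep, n_rep)
    z = stats.norm.ppf(0.5 + level / 2)
    half_width = z * math.sqrt(total_var)
    return d_orig - half_width, d_orig + half_width

# Invented numbers: original d = 0.5 with 40 per group, replication with 80 per group.
lo, hi = replication_prediction_interval(0.5, 40, 80)
print(f"95% prediction interval for the replication estimate: ({lo:.2f}, {hi:.2f})")
```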
As a spot check, you say that brain training has massive purported effects. I looked at the research page of Lumosity, a company which sells brain training software. I expect their estimates of the effectiveness of brain training to be among the most optimistic, but their highlighted effect size is only d = 0.255.
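For a sense of scale, one hedged way to read a d like that (assuming roughly normal scores with equal variance in both groups) is as the probability that a randomly chosen trained person outscores a randomly chosen control; the comparison values below are arbitrary reference points, not figures from any particular study.

```python
import math
from scipy.stats import norm

def probability_of_superiority(d):
    # P(a random treated person scores higher than a random control),
    # assuming normal scores with equal variance in both groups.
    return norm.cdf(d / math.sqrt(2))

for d in (0.255, 0.8, 2.0):  # Lumosity's figure, a conventionally 'large' d, a huge d
    print(f"d = {d:.3f} -> P(trained > control) ~ {probability_of_superiority(d):.2f}")
```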
A caveat is that if an effect size seems implausibly large, it might have arisen due to methodological error. (The one brain training study I found with a large effect size has been subject to methodological criticism.) Here is a blog post by Daniel Lakens where he discusses a study which found that judges hand out much harsher sentences before lunch:
If hunger had an effect on our mental resources of this magnitude, our society would fall into minor chaos every day at 11:45. Or at the very least, our society would have organized itself around this incredibly strong effect of mental depletion… we would stop teaching in the time before lunch, doctors would not schedule surgery, and driving before lunch would be illegal.
However, I think psychedelic drugs arguably do pass this test. During the '60s, before they became illegal, a lot of people were, in a sense, talking about how society would reorganize itself around them. And forget about performing surgery or driving while you are tripping.
The way I see it, if you want to argue that an effect isn’t real, there are two ways to do it. You can argue that the supposed effect arose through random chance/p-hacking/etc., or you can argue that it arose through methodological error.
The random chance argument is harder to make if the studies have large effect sizes. If the true effect is 0, it’s unlikely we’ll observe a large effect by chance. If researchers are trying to publish papers based on noise, you’d expect p-values to cluster just below the p < 0.05 threshold (see p-curve analysis)… they’re essentially going to publish the smallest effect size they can get away with.
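To spell out that step: for a fixed sample size, a p-value just under .05 corresponds to an observed effect just barely large enough to clear the significance bar. Below is a rough simulation sketch of the two regimes the argument contrasts (a true effect of zero with only significant results getting published, versus a genuinely large true effect); the sample size and number of simulated studies are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_published(true_d, n_per_group, n_studies=5000):
    # Simulate two-group studies and keep only the 'publishable' ones (p < .05).
    pvals, ds = [], []
    for _ in range(n_studies):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(true_d, 1.0, n_per_group)
        res = stats.ttest_ind(treated, control)
        if res.pvalue < 0.05:
            pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
            pvals.append(res.pvalue)
            ds.append((treated.mean() - control.mean()) / pooled_sd)
    return np.array(pvals), np.array(ds)

for true_d in (0.0, 1.0):
    p, d = simulate_published(true_d, n_per_group=20)
    just_under = np.mean((p > 0.01) & (p < 0.05))
    print(f"true d = {true_d}: mean published |d| = {np.abs(d).mean():.2f}, "
          f"share of published p-values between .01 and .05 = {just_under:.0%}")
```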
The methodological error argument can hold even for a large effect size, but in that case confirmatory research won't necessarily help, because a confirmatory study can share the same problem. So at that point your time is best spent trying to pinpoint the actual methodological flaw.
The random chance argument is harder to make if the studies have large effect sizes. If the true effect is 0, it’s unlikely we’ll observe a large effect by chance.
This is exactly what p-values are designed for, so you are probably better off looking at p-values rather than effect size if that’s the scenario you’re trying to avoid.
I suppose you could imagine that p-values are always going to be just around 0.05, and that for a real and large effect people use a smaller sample because that's all that's needed to get p < 0.05, but this feels less likely to me. I would expect that with a real, large effect you very quickly get p < 0.01, and that researchers would in fact collect enough data to report it.
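As a rough illustration, here is a standard two-sample t-test power calculation (via statsmodels) for an effect near the Lumosity figure versus a large effect; it is only meant to show how cheap aiming for p < .01 becomes once the true effect is big, not to model any particular study.

```python
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
for d in (0.25, 1.0):             # a small effect vs. a large one
    for alpha in (0.05, 0.01):
        # Per-group sample size needed for 80% power in a two-sided t-test.
        n = power_calc.solve_power(effect_size=d, alpha=alpha, power=0.8)
        print(f"d = {d:<4} alpha = {alpha:<5} -> ~{n:.0f} people per group for 80% power")
```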
(I don't necessarily disagree with the rest of your comment; I'm just more unsure about the other points.)
This is exactly what p-values are designed for, so you are probably better off looking at p-values rather than effect size if that’s the scenario you’re trying to avoid.
This comment is a wonderful crystallisation of the ‘defensive statistics’ of Andrew Gelman, James Heathers and other great epistemic policemen. Thanks!
Yes, looking at p-values rather than effect sizes for that scenario is a better idea.