Wouldn’t 1.2), 1.3), and 1.4) point towards funding more psychedelic research?
(To prove or disprove the benefits found in the early-stage trials?)
It does, but while that’s enough to make it worthwhile on the margin of existing medical research, it is not enough to make it a priority for the EA community.
Are you saying that EA shouldn’t fund confirmatory research, in general?
Or are you saying that there’s something in particular about this research, such that EA shouldn’t fund confirmatory research in this case?
The latter. EA shouldn’t fund most research, but whether it is confirmatory or not is irrelevant. Psychedelics shouldn’t make the cut if, as I argue above, we expect a lot of failure to replicate and regression to the mean, and the true effect to be unexceptional in the context of existing mental health treatment.
Got it, thanks!
I feel confused about why you think psychedelics shouldn’t make the cut. The present state of research (several small-n studies finding very large effect sizes) seems consistent with both:
The world in which psychedelics are in fact a promising intervention
The world in which the current promise of psychedelics is an artifact of our academic knowledge-generating process
It seems like the only way to know which world we’re in is to do confirmatory research.
That sounds a bit like the argument ‘either this claim is right, or it’s wrong, so there’s a 50% chance it’s true.’
One needs to attend to base rates. Our bad academic knowledge-generating process throws up many, many illusory interventions with purported massive effects for each amazing intervention we actually find, and the amazing interventions we do find are disproportionately those that were easier to show (visible to the naked eye, clear macro-correlations, consistent effects in well-powered studies, etc.).
People are making similar arguments about cold fusion, psychic powers (of many different varieties), many environmental and nutritional contaminants, brain training, carbon dioxide levels, diets, polyphasic sleep, assorted purported nootropics, many psychological/parenting/educational interventions, etc.
Testing how your prior applies across a spectrum of other cases (past and present) is helpful for model checking. If psychedelics are a promising EA cause, how many of those others qualify? If many do, then any one of them isn’t so individually special, although one might want a systematic program of rigorously testing all the wacky claims of large impact that can be tested cheaply.
If not, then it would be good to explain what exactly makes psychedelics different from the rest.
I think the case the OP has made for psychedelics doesn’t pass this test yet, and so doesn’t meet the bar for an EA cause area.
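To make the base-rate point above concrete, here is a minimal Python sketch of how the prior probability that a tested intervention really works interacts with power and the significance threshold; all of the numbers in it are illustrative assumptions, not estimates drawn from this discussion:

```python
# Minimal sketch: how often a 'significant' finding reflects a real effect,
# given a base rate of true hypotheses. Numbers are illustrative assumptions.

def prob_finding_is_true(base_rate, power, alpha):
    """P(effect is real | study reports p < alpha), by Bayes' rule."""
    true_positives = base_rate * power          # real effects that reach significance
    false_positives = (1 - base_rate) * alpha   # null effects that reach significance anyway
    return true_positives / (true_positives + false_positives)

for base_rate in (0.05, 0.2, 0.5):   # fraction of tested interventions that truly work
    for power in (0.3, 0.8):         # chance a study detects a real effect
        ppv = prob_finding_is_true(base_rate, power, alpha=0.05)
        print(f"base rate {base_rate:.0%}, power {power:.0%} -> "
              f"P(real | significant) = {ppv:.0%}")
```

If the base rate of wacky claims of large impact that actually pan out is low, even nominally significant findings are more likely than not to be illusory, which is the worry being expressed here. (This simple calculation ignores p-hacking and other biases, which only push in the same direction.)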
From what I understand, effect size is one of the better ways to predict whether a study will replicate. For example, this paper found that 77% of replication effect sizes reported were within a 95% prediction interval based on the original effect size.
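For readers unfamiliar with the prediction-interval idea mentioned above, here is a rough sketch of one common way such an interval is built; the effect size and sample sizes below are my own illustrative assumptions, and the standard-error formula is the usual large-sample approximation for Cohen's d rather than anything taken from the cited paper:

```python
import math

def se_cohens_d(d, n1, n2):
    # Usual large-sample approximation to the standard error of Cohen's d.
    return math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))

# Illustrative numbers: a small original study with a big effect,
# and a replication planned with the same sample size.
d_orig, n1, n2 = 1.0, 12, 12
se_orig = se_cohens_d(d_orig, n1, n2)
se_rep = se_cohens_d(d_orig, n1, n2)

# A 95% prediction interval for the replication estimate has to absorb
# sampling error in both the original study and the replication.
half_width = 1.96 * math.sqrt(se_orig**2 + se_rep**2)
print(f"replication d expected roughly in "
      f"[{d_orig - half_width:.2f}, {d_orig + half_width:.2f}]")
```

With a small original study the interval is quite wide, so whether a replication lands inside it depends heavily on the original sample size.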
As a spot check, you say that brain training has massive purported effects. I looked at the research page of Lumosity, a company which sells brain training software. I expect their estimates of the effectiveness of brain training to be among the most optimistic, but their highlighted effect size is only d = 0.255.
A caveat is that if an effect size seems implausibly large, it might have arisen due to methodological error. (The one brain training study I found with a large effect size has been subject to methodological criticism.) Here is a blog post by Daniel Lakens where he discusses a study which found that judges hand out much harsher sentences before lunch:
> If hunger had an effect on our mental resources of this magnitude, our society would fall into minor chaos every day at 11:45. Or at the very least, our society would have organized itself around this incredibly strong effect of mental depletion… we would stop teaching in the time before lunch, doctors would not schedule surgery, and driving before lunch would be illegal.
However, I think psychedelic drugs arguably do pass this test. During the 60s, before they became illegal, a lot of people were kind of talking about how society would reorganize itself around them. And forget about performing surgery or driving while you are tripping.
The way I see it, if you want to argue that an effect isn’t real, there are two ways to do it. You can argue that the supposed effect arose through random chance/p-hacking/etc., or you can argue that it arose through methodological error.
The random chance argument is harder to make if the studies have large effect sizes. If the true effect is 0, it’s unlikely we’ll observe a large effect by chance. If researchers are trying to publish papers based on noise, you’d expect p-values to cluster just below the p < 0.05 threshold (see p-curve analysis)… they’re essentially going to publish the smallest effect size they can get away with.
The methodological error argument could be valid for a large effect size, but if this is the case, confirmatory research is not necessarily going to help, because confirmatory research could have the same issue. So at that point your time is best spent trying to pinpoint the actual methodological flaw.
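As a rough illustration of the "unlikely to observe a large effect by chance" point, here is a quick simulation with an assumed small sample size (not a re-analysis of any actual psychedelic study):

```python
import numpy as np

rng = np.random.default_rng(0)

def observed_d(n_per_group):
    """Cohen's d from one simulated two-group study with no true effect."""
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

n_sims, n_per_group = 20_000, 15   # small-n study, true effect = 0
ds = np.array([observed_d(n_per_group) for _ in range(n_sims)])
print(f"P(|d| >= 0.8 under a true null, n = {n_per_group}/group): "
      f"{np.mean(np.abs(ds) >= 0.8):.1%}")
```

Even with only 15 participants per group, pure sampling noise rarely produces effects as large as d = 0.8, which is why a literature full of large observed effects is hard to explain by chance alone (though selective reporting and methodological problems remain on the table, as discussed above).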
> The random chance argument is harder to make if the studies have large effect sizes. If the true effect is 0, it’s unlikely we’ll observe a large effect by chance.
This is exactly what p-values are designed for, so you are probably better off looking at p-values rather than effect size if that’s the scenario you’re trying to avoid.
I suppose you could imagine that p-values will always sit just around 0.05 because, even for a real and large effect size, people use the smallest sample that’s sufficient to get p < 0.05, but this feels less likely to me. I would expect that with a real, large effect you very quickly get p < 0.01, and that researchers would in fact do that.
(I don’t necessarily disagree with the rest of your comment; I’m just more unsure about the other points.)
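One way to sanity-check the "a real, large effect very quickly gets you p < 0.01" intuition is a standard power calculation; the formula below is the usual normal approximation for a two-sample comparison, and the effect sizes are illustrative rather than taken from any particular study:

```python
from scipy.stats import norm

def n_per_group(d, alpha=0.01, power=0.9):
    """Approximate per-group n for a two-sample test (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired power
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

for d in (0.5, 0.8, 1.2):
    print(f"d = {d}: ~{n_per_group(d):.0f} per group for 90% power at p < 0.01")
```

The required per-group sample shrinks quickly as the true effect size grows, so genuinely large effects should clear p < 0.01 even in fairly modest studies, whereas moderate effects need substantially larger samples.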
Yes, this is a better idea.
> This is exactly what p-values are designed for, so you are probably better off looking at p-values rather than effect size if that’s the scenario you’re trying to avoid.
This comment is a wonderful crystallisation of the ‘defensive statistics’ of Andrew Gelman, James Heathers and other great epistemic policemen. Thanks!
> That sounds a bit like the argument ‘either this claim is right, or it’s wrong, so there’s a 50% chance it’s true.’
I’m not claiming this. I’m claiming that given the research to date, more psychedelic research would be very impactful in expectation. (I’m at like 30-40% that the beneficial effects are real.)
> If not, then it would be good to explain what exactly makes psychedelics different from the rest.
I haven’t read the literatures for all the examples you gave. For psychic powers & cold fusion, my impression is that confirmatory research was done and the initial results didn’t replicate.
So one difference is that the main benefits of psychedelic therapy haven’t yet failed to replicate.
> (I’m at like 30-40% that the beneficial effects are real.)
Right, so you would want to show that 30-40% of interventions with similar literatures pan out. I think the figure is less.
Scott referred to [edit: one] failure to replicate in his post.
Scott referred to one failure to replicate, for a finding that a psychedelic experience increased trait openness. This isn’t one of the benefits cited by the OP.
More on psychedelics & Openness:
Erritzoe et al. 2018 found that psilocybin increased Openness in a population of depressed people, which SSRIs do not do. Maclean et al. 2011, an analysis of psilocybin given to healthy participants, also found a persisting increase in Openness. However, Griffiths et al. 2017, also psilocybin in healthy participants, found no persisting increase in Openness. So maybe psilocybin causes greater Openness, but only sometimes? As always, more research is needed.
Also:
Why would increasing Big-Five Openness matter? Erritzoe et al. 2018 engages with that too:
“… the facets Openness to Actions and to Values significantly increased in our study. The facet Openness to Actions pertains to not being set in one’s way, and instead, being ready to try and do new things. Openness to Values is about valuing permissiveness, open-mindedness, and tolerance. These two facets therefore reflect an active approach on the part of the individual to try new ways of doing things and consider other peoples’ values and/or worldviews.”
And:
“It is well established that trait Openness correlates reliably with liberal political perspective… The apparent link between Openness and a generally liberal worldview may be attributed to the notion that people who are more open to new experiences are also less personally constrained by convention and that this freedom of attitude extends into every aspect of a person’s life, including their political orientation.”
I think we have a disagreement about what the appropriate reference class here is.
The reference class I’m using is something like “results which are supported by 2-3 small-n studies with large effect sizes.”
I’d expect roughly 30-40% of such results to hold up after confirmatory research.
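As a minimal sketch of how much a single well-run confirmatory study could move that 30-40% estimate, assuming (purely for illustration) a replication with 90% power run at a 5% significance threshold:

```python
def posterior_if_positive(prior, power=0.9, alpha=0.05):
    """P(effect is real | the confirmatory study is positive)."""
    return power * prior / (power * prior + alpha * (1 - prior))

def posterior_if_null(prior, power=0.9, alpha=0.05):
    """P(effect is real | the confirmatory study comes back null)."""
    return (1 - power) * prior / ((1 - power) * prior + (1 - alpha) * (1 - prior))

for prior in (0.3, 0.4):
    print(f"prior {prior:.0%}: positive result -> {posterior_if_positive(prior):.0%}, "
          f"null result -> {posterior_if_null(prior):.0%}")
```

On these assumptions a single confirmatory result is highly informative in either direction, which is the sense in which more research is "very impactful in expectation" under a 30-40% prior.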
Somewhat related: 62% of results assessed by Camerer et al. 2018 replicated.
It’s a bit complicated to think about replication re: psychedelics because the intervention is showing promise as a treatment for multiple indications: there are a couple of studies showing large effect sizes for depression, a couple for anxiety, and a couple for addictive disorders.
Could you say a little more about what reference class you’re using here?