I have previously let HLI have the last word, but this is too egregious.
Study quality: Publication bias (a property of the literature as a whole) and risk of bias (particular to each individual study which comprises it) are two different things.[1] Accounting for the former does not account for the latter. This is why the Cochrane handbook, the three meta-analyses HLI mentions here, and HLI’s own protocol distinguish the two.
Neither Cuijpers et al. 2023 nor Tong et al. 2023 further adjust their low risk of bias subgroup for publication bias.[2] I tabulate the relevant figures from both studies below:
So HLI indeed gets similar initial results and publication bias adjustments to the two other meta-analyses they find. Yet, although these are not like-for-like, these other two meta-analyses find similarly substantial effect reductions when accounting for study quality as they do when assessing publication bias of the literature as a whole.
There is ample cause for concern here:[3]
Although neither of these studies ‘adjust for both’, one later mentioned (Cuijpers et al. 2020) does. It finds an additional discount to effect size when doing so.[4] This suggests that ‘accounting for’ publication bias does not adequately account for risk of bias en passant.
Tong et al. 2023 (the meta-analysis expressly on PT in LMICs rather than PT generally) finds a higher prevalence of indicators of lower study quality in LMICs, and notes this as a competing explanation for the outsized effects.[5]
As previously mentioned, in the previous meta-analysis, unregistered trials had a 3x greater effect size than registered ones. None of the trials on Strongminds published so far has been registered. Baird et al., which is registered, is anticipated to report disappointing results.
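The distinction drawn at the start of this section can be illustrated with a toy simulation. This is only a sketch under invented assumptions: the true effect, the standard error, the size of the inflation in high-risk-of-bias studies, and the 20% publication rate for non-significant results are all hypothetical and are not taken from any of the meta-analyses discussed.

```python
import random
import statistics

TRUE_EFFECT = 0.3    # hypothetical true standardized effect
SE = 0.2             # hypothetical standard error, identical for every study
ROB_INFLATION = 0.3  # hypothetical inflation in a high-risk-of-bias study


def mean_published_effect(n, share_high_rob, selective_publication, seed=0):
    """Average effect across the studies that end up in the literature."""
    rng = random.Random(seed)
    published = []
    for _ in range(n):
        high_rob = rng.random() < share_high_rob
        study_mean = TRUE_EFFECT + (ROB_INFLATION if high_rob else 0.0)
        estimate = rng.gauss(study_mean, SE)
        significant = estimate / SE > 1.96
        # Under selective publication only ~20% of non-significant results appear.
        if selective_publication and not significant and rng.random() > 0.2:
            continue
        published.append(estimate)
    return statistics.fmean(published)


clean = mean_published_effect(20000, 0.0, False)
pub_only = mean_published_effect(20000, 0.0, True)
rob_only = mean_published_effect(20000, 0.5, False)
both = mean_published_effect(20000, 0.5, True)

print(f"no bias:               {clean:.2f}")
print(f"publication bias only: {pub_only:.2f}")
print(f"risk of bias only:     {rob_only:.2f}")
print(f"both:                  {both:.2f}")
```

In the "risk of bias only" scenario every study is published, so even a perfect publication bias correction would have nothing to remove, yet the pooled effect is still inflated.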
Evidentiary standards: Indeed, the report drew upon a large number of studies. Yet even a synthesis of 72 million (or whatever) studies can be misleading if issues of publication bias, risk of bias in individual studies (and so on) are not appropriately addressed. That an area has 72 (or whatever) studies upon it does not mean it is well-studied, nor would this number (nor any number) be sufficient, by itself, to satisfy any evidentiary standard.
Outlier exclusion: The report’s approach to outlier exclusion is dissimilar to both Cuijpers et al. 2020 and Tong et al. 2023, and further is dissimilar with respect to features I highlighted as major causes for concern re. HLI’s approach in my original comment.[6] Specifically:
Both of these studies present the analysis with the full data first in their results. Contrast HLI’s report, where only the results with outliers excluded are presented in the main results, and the analysis without exclusion is found only in the appendix.[7]
Both these studies also report the results with the full data as their main findings (e.g. in their respective abstracts). Cuijpers et al. mentions their outlier excluded results primarily in passing (“outliers” appears once in the main text); Tong et al. relegates a lot of theirs to the appendix. HLI’s report does the opposite. (cf. fn 7 above)
Only Tong et al. does further sensitivity analysis on the ‘outliers excluded’ subgroup. As Jason describes, this is done alongside the analysis where all data are included, and the qualitative and quantitative differences which result from this analysis choice are prominently highlighted to the reader and extensively discussed. In HLI’s report, by contrast, the factor of 3 reduction to ultimate effect size when outliers are not excluded is only alluded to qualitatively in a footnote (fn 33)[8] of the main report’s section (3.2) arguing why outliers should be excluded, is not included in the report’s sensitivity analysis, and is only found in the appendix.[9]
Both studies adjust for publication bias only on all data, not on data with outliers excluded, and these are the publication bias findings they present. Contrast HLI’s report.
The Cuijpers et al. 2023 meta-analysis previously mentioned also differs in its approach to outlier exclusion from HLI’s report in the ways highlighted above. The Cochrane handbook also supports my recommendations on what approach should be taken, which is what the meta-analyses HLI cites approvingly as examples of “sensible practice” actually do, but what HLI’s own work does not.
The report’s (non-)presentation of the stark quantitative sensitivity of its analysis (material to its bottom line recommendations) to whether outliers are excluded is clearly inappropriate. It is indefensible if, as I have suggested may be the case, the analysis with outliers included was indeed the analysis first contemplated and conducted.[10] It is even worse if the publication bias corrections on the full data were what in fact prompted HLI to start making alternative analysis choices which happened to substantially increase the bottom line figures.
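As a sketch of the reporting practice argued for in this section, the snippet below pools a set of made-up studies twice: first on all data (the primary analysis), then with ‘outliers’ excluded (the sensitivity analysis), and prints both side by side. The study values, the cut-off of g > 2, and the use of a simple fixed-effect inverse-variance mean are all assumptions for illustration; none of this reproduces HLI’s model, and the direction and size of the gap are artefacts of the invented numbers.

```python
# Hypothetical (effect size g, standard error) pairs; not taken from any real study.
studies = [(0.2, 0.10), (0.4, 0.15), (0.5, 0.20), (0.3, 0.10), (2.5, 0.40), (3.0, 0.50)]
OUTLIER_THRESHOLD = 2.0  # hypothetical cut-off for calling a study an outlier


def pooled_effect(rows):
    """Fixed-effect inverse-variance weighted mean of the effect sizes."""
    weights = [1.0 / se**2 for _, se in rows]
    return sum(w * g for w, (g, _) in zip(weights, rows)) / sum(weights)


full = pooled_effect(studies)
trimmed = pooled_effect([(g, se) for g, se in studies if g <= OUTLIER_THRESHOLD])

# Report the full-data result first, then the sensitivity result and the ratio.
print(f"primary analysis, all data:     g = {full:.2f}")
print(f"sensitivity, outliers excluded: g = {trimmed:.2f}")
print(f"ratio (excluded / all data):    {trimmed / full:.2f}")
```

Printing the ratio alongside both estimates puts the quantitative effect of the exclusion decision in front of the reader.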
Bayesian analysis: Bayesian methods notoriously do not avoid subjective inputs—most importantly here, what information we attend to when constructing an ‘informed prior’ (or, if one prefers, how to weigh the results with a particular prior stipulated).
In any case, they provide no protection from misunderstanding the calculation being performed, and so misinterpreting the results. The Bayesian method in the report is actually calculating the (adjusted) average effect size of psychotherapy interventions in general, not the expected effect of a given psychotherapy intervention. Although a trial on Strongminds which shows it is relatively ineffectual should not update our view much on the efficacy of psychotherapy interventions (/similar to Strongminds) as a whole, it should update us dramatically on the efficacy of Strongminds itself.
Although as a methodological error this is a subtle one (at least, subtle enough for me not to initially pick up on it), the results it gave are nonsense to the naked eye (e.g. SM would still be held as a GiveDirectly-beating intervention even if there were multiple high quality RCTs on Strongminds giving flat or negative results). HLI should have seen this themselves, should have stopped to think after I highlighted these facially invalid outputs of their method in early review, and definitely should not be doubling down on these conclusions even now.
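To make the distinction in this section concrete, here is a minimal normal-normal sketch. All numbers are hypothetical and this is not the report’s actual calculation. It compares a prior whose spread is the standard error of the pooled mean (confidence-interval logic, i.e. uncertainty about the average effect) with a prior whose spread also includes between-study heterogeneity (prediction-interval logic, i.e. uncertainty about the effect of one given programme), and then updates each on a flat trial result for that programme.

```python
import math


def normal_update(prior_mean, prior_sd, obs_mean, obs_se):
    """Conjugate normal-normal update; returns (posterior mean, posterior sd)."""
    w_prior = 1.0 / prior_sd**2
    w_obs = 1.0 / obs_se**2
    post_var = 1.0 / (w_prior + w_obs)
    post_mean = post_var * (w_prior * prior_mean + w_obs * obs_mean)
    return post_mean, math.sqrt(post_var)


# All values below are hypothetical, chosen only for illustration.
pooled_mean = 0.50  # pooled effect across the literature
pooled_se = 0.05    # standard error of that pooled mean
tau = 0.40          # between-study standard deviation (heterogeneity)
trial_mean, trial_se = 0.00, 0.10  # a flat result from a trial of one programme

# Prior about the AVERAGE effect: spread is just the standard error of the mean.
narrow_sd = pooled_se
# Prior about ONE programme's effect: spread must also include heterogeneity.
wide_sd = math.sqrt(tau**2 + pooled_se**2)

post_narrow, _ = normal_update(pooled_mean, narrow_sd, trial_mean, trial_se)
post_wide, _ = normal_update(pooled_mean, wide_sd, trial_mean, trial_se)

print(f"prior from confidence-interval spread: posterior mean {post_narrow:.2f}")
print(f"prior from prediction-interval spread: posterior mean {post_wide:.2f}")
```

With the narrow prior the flat trial barely moves the estimate, while with the wide prior the posterior for the specific programme falls most of the way to the trial result.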
Making recommendations: Although there are other problems, those I have repeated here make the recommendations of the report unsafe. This is why I recommended against publication. Specifically:
1. Although I don’t think the Bayesian method the report uses would be appropriate, if it was calculated properly on its own terms (e.g. prediction intervals not confidence intervals to generate the prior, etc.), and leaving everything else the same, the SM bottom line would drop (I’m pretty sure) by a factor of a bit more than 2.
2. The results are already essentially sensitive to whether outliers are excluded in analysis or not: SM goes from 3.7x → ~1.1x GD on the back of my envelope, again leaving all else equal.
3. (1) and (2) combined should net out to SM < GD; (1) or (2) combined with some of the other sensitivity analyses (e.g. spillovers) will also likely net out to SM < GD. Even if one still believes the bulk of (appropriate) analysis paths still support a recommendation, this sensitivity should be made transparent.
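The back-of-envelope arithmetic behind these points can be sketched as follows. The 3.7x and ~1.1x figures and the "a bit more than 2" factor come from the text above; treating the two adjustments as stacking multiplicatively, and rounding the Bayesian factor down to exactly 2, are simplifying assumptions made only for this illustration.

```python
headline_multiple = 3.7         # SM as a multiple of GD in the report (from the text)
no_exclusion_multiple = 1.1     # approximate multiple with outliers kept (from the text)
bayes_correction_factor = 2.0   # "a bit more than 2"; 2.0 is a rounded-down assumption

# (1) alone: correct the Bayesian calculation, leave everything else the same.
only_bayes = headline_multiple / bayes_correction_factor

# (2) alone: keep outliers in, leave everything else the same.
only_outliers = no_exclusion_multiple
outlier_discount = 1 - no_exclusion_multiple / headline_multiple

# (1) and (2) together, ASSUMING the two adjustments multiply.
both = no_exclusion_multiple / bayes_correction_factor

print(f"(1) alone:        {only_bayes:.2f}x GD")
print(f"(2) alone:        {only_outliers:.2f}x GD (a discount of about {outlier_discount:.0%})")
print(f"(1) and (2):      {both:.2f}x GD")
```

Under these assumptions either adjustment alone leaves SM above GD, but the two together put it below 1x.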
E.g. Even if all studies in the field are conducted impeccably, if journals only accept positive results the literature may still show publication bias. Contrariwise, even if all findings get published, failures in allocation/blinding/etc. could lead to systemic inflation of effect sizes across the literature. In reality—and here—you often have both problems, and they only partially overlap.
Jason correctly interprets Tong et al. 2023: the number of studies included in their publication bias corrections (117 [+36 w/ trim and fill]) equals the number of all studies, not the low risk of bias subgroup (36 - see table 3). I do have access to Cuijpers et al. 2023, which has a very similar results table, with parallel findings (i.e. they do their publication bias corrections on the whole set of studies, not on a low risk of bias subgroup).
Me, previously:
HLI’s report does not assess the quality of its included studies, although it plans to do so. I appreciate GRADEing 90 studies or whatever is tedious and time consuming, but skipping this step to crack on with the quantitative synthesis is very unwise: any such synthesis could be hugely distorted by low quality studies.
From their discussion (my emphasis):
Risk of bias is another important problem in research on psychotherapies for depression. In 70% of the trials (92/309) there was at least some risk of bias. And the studies with low risk of bias, clearly indicated smaller effect sizes than the ones that had (at least some) risk of bias. Only four of the 15 specific types of therapy had 5 or more trials without risk of bias. And the effects found in these studies were more modest than what was found for all studies (including the ones with risk of bias). When the studies with low risk of bias were adjusted for publication bias, only two types of therapy remained significant (the “Coping with Depression” course, and self-examination therapy).
E.g. from the abstract (my emphasis):
The larger effect sizes found in non-Western trials were related to the presence of wait-list controls, high risk of bias, cognitive-behavioral therapy, and clinician-diagnosed depression (p < 0.05). The larger treatment effects observed in non-Western trials may result from the high heterogeneous study design and relatively low validity. Further research on long-term effects, adolescent groups, and individual-level data are still needed.
Apparently, all that HLI really meant with “Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach” [my emphasis] was merely “[C]onditional on removing outliers, they identify a similar or greater range of effect sizes as outliers as we do.” (see).
Yeah, right.
I also had the same impression as Jason that HLI’s reply repeatedly strawmans me. The passive aggressive sniping sprinkled throughout and subsequent backpedalling (in fairness, I suspect by people who were not at the keyboard of the corporate account) is less than impressive too. But it’s nearly Christmas, so beyond this footnote I’ll let all this slide.
Me again (my [re-?]emphasis):
Received opinion is typically that outlier exclusion should be avoided without a clear rationale why the ‘outliers’ arise from a clearly discrepant generating process. If it is to be done, the results of the full data should still be presented as the primary analysis.
Said footnote:
If we didn’t first remove these outliers, the total effect for the recipient of psychotherapy would be much larger (see Section 4.1) but some publication bias adjustment techniques would over-correct the results and suggest the completely implausible result that psychotherapy has negative effects (leading to a smaller adjusted total effect). Once outliers are removed, these methods perform more appropriately. These methods are not magic detectors of publication bias. Instead, they make inferences based on patterns in the data, and we do not want them to make inferences on patterns that are unduly influenced by outliers (e.g., conclude that there is no effect – or, more implausibly, negative effects – of psychotherapy because of the presence of unreasonable effects sizes of up to 10 gs are present and creating large asymmetric patterns). Therefore, we think that removing outliers is appropriate. See Section 5 and Appendix B for more detail.
The sentence in the main text this is a footnote to says:
Removing outliers this way reduced the effect of psychotherapy and improves the sensibility of moderator and publication bias analyses.
Me again:
[W]ithout excluding data, SM drops from ~3.6x GD to ~1.1x GD. Yet it doesn’t get a look in for the sensitivity analysis, where HLI’s ‘less favourable’ outlier method involves taking an average of the other methods (discounting by ~10%), but not doing no outlier exclusion at all (discounting by ~70%).
My remark about “Even if you didn’t pre-specify, presenting your first cut as the primary analysis helps for nothing up my sleeve reasons”, which Dwyer mentions elsewhere, was a reference to ‘nothing up my sleeve numbers’ in cryptography. In the same way that picking the initial digits of pi or e for arbitrary constants reassures readers that the author didn’t pick numbers with some ulterior purpose they are not revealing, reporting what one’s first analysis showed means readers can compare it to where you ended up after making all the post-hoc judgement calls in the garden of forking paths. “Our first intention analysis would give x, but we ended up convincing ourselves the most appropriate analysis gives a bottom line of 3x” would rightly arouse a lot of scepticism.
I’ve already mentioned I suspect this is indeed what has happened here: HLI’s first cut included all data, but it argued itself into making the choice to exclude, which gave a 3x higher ‘bottom line’. Beyond “You didn’t say you’d exclude outliers in your protocol” and “basically all of your discussion in the appendix re. outlier exclusion concerns the results of publication bias corrections on the bottom line figures”, I kinda feel HLI not denying it is beginning to invite an adverse inference from silence. If I’m right about this, HLI should come clean.
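For readers unfamiliar with the cryptographic analogy in this footnote, here is a small sketch of a ‘nothing up my sleeve’ construction. It uses a different example from the pi or e digits mentioned above: the SHA-256 standard defines its eight initial hash values as the first 32 bits of the fractional parts of the square roots of the first eight primes, so anyone can re-derive them and see that they were not hand-picked. The helper names are made up for this illustration.

```python
import math


def first_primes(n):
    """Return the first n primes by trial division (fine for tiny n)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def sqrt_fraction_bits(p):
    """First 32 bits of the fractional part of sqrt(p), using exact integer math."""
    # isqrt(p * 2**64) equals floor(sqrt(p) * 2**32); the low 32 bits are the fraction.
    return math.isqrt(p << 64) & 0xFFFFFFFF


constants = [sqrt_fraction_bits(p) for p in first_primes(8)]
print([f"{c:08x}" for c in constants])
```

The analogy to analysis reporting is that the derivation is fixed before the result is seen, so readers can check that nothing was tuned afterwards.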
I’m feeling confused by these two statements:
Although there are other problems, those I have repeated here make the recommendations of the report unsafe.
Even if one still believes the bulk of (appropriate) analysis paths still support a recommendation, this sensitivity should be made transparent.
The first statement says HLI’s recommendation is unsafe, but the second implies it is reasonable as long as the sensitivity is clearly explained. I’m grateful to Greg for presenting the analysis paths which lead to SM < GD, but it’s unclear to me how much those paths should be weighted compared to all the other paths which lead to SM > GD.
It’s notable that Cuijpers (who has done more than anyone in the field to account for publication bias and risk of bias) is still confident that psychotherapy is effective.
I was also surprised by the use of ‘unsafe’. Less cost-effective maybe, but ‘unsafe’ implies harm and I haven’t seen any evidence to support that claim.