Thank you for your comments, Gregory. We’re aware you have strong views on the subject and we appreciate your conscientious contributions. We discussed your previous comments internally but largely concluded revisions weren’t necessary as we (a) had already considered them in the report and appendix, (b) will return to them in later versions and didn’t expect they would materially affect the results, or (c) simply don’t agree with these views. To unpack:
Study quality. We conclude the data set does contain bias, but we account for it (sections 3.2 and 5; it’s an open question among academics how best to do this). We don’t believe that the entire field of LMIC psychotherapy should be considered bunk, compromised, or uninformative. Our results are in line with existing meta-analyses of psychotherapy considered to have low risk of bias (see footnote).[1]
Evidentiary standards. We drew on a large number of RCTs for our systematic reviews and meta-analyses of cash transfers and psychotherapy (42 and 74, respectively). If one holds that the evidence for something as well-studied as psychotherapy is too weak to justify any recommendations, charity evaluators could recommend very little.
Outlier exclusion. The issues regarding outlier exclusion were discussed in some depth (3.2 in the main report and in Appendix B). Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach. It’s consistent with not taking the entire literature at face value but also not taking guilt by association too far. If one excludes outliers, the specific way one does this has a minor effect (e.g., a 10% decline in effectiveness, see appendix). Our analysis necessarily makes analytic choices: some were pre-registered, some made on reflection, many were discussed in our sensitivity analysis. If one insisted only on using charity evaluations that had every choice pre-registered, there would be none to choose from.
Bayesian analysis: The method we use (‘grid approximation’, see 8.3 and 9.3) avoids subjective inputs. It is not this Bayesian analysis itself that ‘stacks the deck’ in favour of psychotherapy, but the evidence. Given that over 70 studies form the prior, it would be surprising if adding one study, as we did for StrongMinds, would radically alter the conclusions. [Edit 5/12/2023: on the point that StrongMinds could be more cost-effective than GiveDirectly, even if StrongMinds only has the small effect we assume it does in our hypothetical placeholder studies, it doesn’t seem inconceivable that a small, less effective intervention can still be more cost-effective than a big, expensive one. For context, we estimate it costs StrongMinds $63 per intervention—providing one person with a course of therapy—whereas it costs GiveDirectly $1,221 per intervention—a $1,000 cash transfer which costs $221 in overheads. As the therapy is about 20x cheaper, it can be far less effective yet still more cost-effective.]
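To make the arithmetic explicit, here is a minimal sketch using only the per-intervention costs quoted above; the break-even fraction is purely illustrative and not a figure from the report.

```python
# Rough cost-effectiveness arithmetic using only the per-intervention
# costs quoted above. Effects are left abstract; the point is the ratio.

cost_strongminds = 63     # $ per course of therapy delivered
cost_givedirectly = 1221  # $ per transfer: $1,000 transferred + $221 overheads

cost_ratio = cost_givedirectly / cost_strongminds
print(f"Therapy is ~{cost_ratio:.1f}x cheaper per intervention")  # ~19.4x

# Cost-effectiveness = effect per intervention / cost per intervention,
# so therapy beats the transfer whenever its per-person effect exceeds
# this fraction of the transfer's per-person effect:
break_even_fraction = cost_strongminds / cost_givedirectly
print(f"Break-even at >{break_even_fraction:.0%} of the transfer's effect")  # ~5%
```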
Making recommendations: we aim to recommend the most cost-effective ways of increasing WELLBYs we’ve found so far. While we have intuitions about how good different interventions are, our perspective as an organisation is that conclusions about what’s cost-effective should be led heavily by the evidence rather than by our pre-evidential beliefs (‘priors’). Given the evidence we’ve considered, we don’t see a strong case for recommending cash transfers over psychotherapy.
This is a working report, and we’ll be reflecting on how to incorporate the above, similarly psychotherapy-sceptical perspectives, and other views in the process of preparing it for academic review. In the interests of transparency, we don’t plan to engage beyond our comments above so as to preserve team resources.
We find an initial effect of 0.70 SDs, reduced to 0.46 SDs after publication bias adjustments. Cuijpers et al. 2023 find an effect of psychotherapy of 0.49 SDs for studies with low risk of bias (RoB) in low, middle, and high income countries (comparisons = 218), which reduces to between 0.27 and 0.57 after publication adjustment. Tong et al. 2023 find an effect of 0.69 SDs for studies with low RoB in non-Western countries (primarily low and middle income; comparisons = 36), which adjusts to between 0.42 and 0.60 after publication correction. Hence, our initial and adjusted numbers are similar.
Epistemic status: tentative; it’s been a long time since reading social science papers was a significant part of my life. Happy to edit/retract this preliminary view as appropriate if someone is able to identify mistakes.
Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach.
I can’t access Cuijpers et al., but I don’t read Tong et al. as supporting what HLI has done here.
In their article, Tong et al. provide the effect size with no exclusions, then with outliers excluded, then with “extreme outliers” excluded (the latter of which seems to track HLI’s removal criterion). They also provide effect sizes with various publication-bias measures employed. See PDF at 5-6. If I’m not mistaken, the publication bias measures are applied to the no-exclusions version, not a version with outliers removed or limited to those with lower RoB. See id. at 6 tbl.2 (n = 117 for combined and 2 of 3 publication-bias effect sizes; 153 with trim-and-fill adding 36 studies; n = 74 for outliers removed & n = 104 for extreme outliers removed; the effect sizes after publication-bias measures, which range from 0.42 to 0.60, seem to be those mentioned in HLI’s footnote above).
Tong et al. “conducted sensitive analyses comparing the results with and without the inclusion of extreme outliers,” PDF at 5, discussing the results without exclusion first and then the results with exclusion. See id. at 5-6. Tables 3-5 are based on data without exclusion of extreme outliers; the versions of Tables 4 and 5 that exclude extreme outliers are relegated to the supplemental tables (not in PDF). See id. at 6. This reads to my eyes as treating both the all-inclusive and extreme-outliers-excluded data seriously, with some pride of place to the all-inclusive data.
I don’t read Tong et al. as having reached a conclusion that either the all-inclusive or extreme-outliers-excluded results were more authoritative, saying things like:
Lastly, we were unable to explain the different findings in the analyses with vs. without extreme outliers. The full analyses that included extreme outliers may reflect the true differences in study characteristics, or they may imply the methodological issues raised by studies with effect sizes that were significantly higher than expected.
and
Therefore, the larger treatment effects observed in non-Western trials may not necessarily imply superior treatment outcomes. On the other hand, it could stem from variations in study design and quality.
and
Further research is required to explain the reasons for the differences in study design and quality between Western and non-Western trials, as well as the different results in the analyses with and without extreme outliers.
PDF at 10.
Of course, “further research needed” is an almost inevitable conclusion of the majority of academic papers, and Tong et al. have the luxury of not needing to reach any conclusions to inform the recommended distribution of charitable dollars. But I don’t read the article by Tong et al. as supporting the proposition that it is appropriate to just run with the outliers-excluded data. Rather, I read the article as suggesting that—at least in the absence of compelling reasons to the contrary—one should take both analyses seriously, but neither definitively.
I lack confidence in what “taking both analyses seriously, but neither definitively” would mean for purposes of conducting a cost-effectiveness analysis. But I speculate that it would likely involve some sort of weighting of the two views.
When we said “Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach”—I can see that what we meant by “similar approach” was unclear. We meant that, conditional on removing outliers, they identify a similar or greater range of effect sizes as outliers as we do.
This was primarily meant to address the question raised by Gregory about whether to include outliers: “The cut data by and large doesn’t look visually ‘outlying’ to me.”
To rephrase, I think that Cuijpers et al. and Tong et al. would agree that the data we cut looks outlying. Obviously, this is a milder claim than our comment could be interpreted as making.
Turning to the wider implications of these meta-analyses: as you rightly point out, they don’t have a “preferred specification” and are mostly presenting the options for doing the analysis. They present analyses with and without outlier removal in their main analysis, and they adjust for publication bias without outliers removed (which is not what we do). The first analytic choice doesn’t clearly support including or excluding outliers, and the second, if it supports any option, favours Greg’s proposed approach of correcting for publication bias without outliers removed.
I think one takeaway is that we should consider surveying the literature and some experts in the field, in a non-leading way, about what choices they’d make if they didn’t have “the luxury of not having to reach a conclusion”.
I think it seems plausible to give some weight to analyses with and without excluding outliers – if we are able to find a reasonable way to treat the 2 out of 7 publication bias correction methods that produce results suggesting that the effect of psychotherapy is in fact sizably negative. We’ll look into this more before our next update.
Cutting the outliers here was part of our first-pass attempt at minimising the influence of dubious effects, which we’ll follow up with a Risk of Bias analysis in the next version. Our working assumption was that effects greater than ~2 standard deviations are suspect on theoretical grounds (that is, if they behave anything like SDs in a normal distribution), and seemed more likely to be the result of some error-generating process (e.g. data-entry error, bias) than a genuine effect.
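As a rough illustration of this kind of cutoff (the effect sizes below are made-up placeholders, and the 2.0 threshold mirrors the working assumption above rather than the report’s exact implementation):

```python
# Illustrative only: exclude implausibly large effects before pooling.
# Effect sizes are made-up placeholders (Hedges' g); the 2.0 cutoff
# mirrors the working assumption above, not the report's exact rule.

effect_sizes = [0.21, 0.38, 0.45, 0.91, 1.30, 2.40, 3.80, 9.60]
THRESHOLD_G = 2.0  # effects beyond ~2 SDs treated as suspect

retained = [g for g in effect_sizes if g <= THRESHOLD_G]
excluded = [g for g in effect_sizes if g > THRESHOLD_G]

print(f"Retained {len(retained)} effects: {retained}")
print(f"Excluded {len(excluded)} outliers: {excluded}")

# Even a crude unweighted mean shows how much a few extreme values
# can pull the pooled estimate upward.
print(f"Mean with outliers:    {sum(effect_sizes) / len(effect_sizes):.2f}")
print(f"Mean without outliers: {sum(retained) / len(retained):.2f}")
```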
We’ll look into this more in our next pass, but for this version we felt outlier removal was the most sensible choice.
I recently discovered that GiveWell decided to exclude an outlier in their water chlorination meta-analysis. I’m not qualified to judge their reasoning, but maybe others with sufficient expertise will weigh in?
We excluded one RCT that meets our other criteria because we think the results are implausibly high such that we don’t believe they represent the true effect of chlorination interventions (more in footnote).[4] It’s unorthodox to exclude studies for this reason when conducting a meta-analysis, but we chose to do so because we think it gives us an overall estimate that is more likely to represent the true effect size.
Evidentiary standards. We drew on a large number of RCTs for our systematic reviews and meta-analyses of cash transfers and psychotherapy (42 and 74, respectively). If one holds that the evidence for something as well-studied as psychotherapy is too weak to justify any recommendations, charity evaluators could recommend very little.
A comparatively minor point, but it doesn’t seem to me that the claims in Greg’s post [more] are meaningfully weakened by whether or not psychotherapy is well-studied (as measured by how many RCTs HLI has found on it, noting that you already push back on some object level disagreement on study quality in point 1, which feels more directly relevant).
It also seems pretty unlikely to be true that psychotherapy being well studied necessarily means that StrongMinds is a cost-effective intervention comparable to current OP / GW funding bars (which is one main point of contention), or that charity evaluators need 74+ RCTs in an area before recommending a charity. Is the implicit claim being made here that the evidence for StrongMinds being a top charity is stronger than that of AMF, which is (AFAIK) based on less than 74 RCTs?[1]
I have previously let HLI have the last word, but this is too egregious.
Study quality: Publication bias (a property of the literature as a whole) and risk of bias (particular to the individual studies which comprise it) are two different things.[1] Accounting for the former does not account for the latter. This is why the Cochrane handbook, the three meta-analyses HLI mentions here, and HLI’s own protocol all distinguish the two.
Neither Cuijpers et al. 2023 nor Tong et al. 2023 further adjust their low risk of bias subgroup for publication bias.[2] I tabulate the relevant figures from both studies below:
So HLI indeed gets similar initial results and publication bias adjustments to the two other meta-analyses they find. Yet—although these are not like-for-like—these other two meta-analyses find similarly substantial effect reductions when accounting for study quality as they do when assessing publication bias of the literature as a whole.
There is ample cause for concern here:[3]
Although neither of these studies ‘adjust for both’, one mentioned later, Cuijpers et al. 2020, does. It finds an additional discount to effect size when doing so.[4] So it suggests that indeed ‘accounting for’ publication bias does not adequately account for risk of bias en passant.
Tong et al. 2023 (the meta-analysis expressly on PT in LMICs rather than PT generally) finds a higher prevalence of indicators of lower study quality in LMICs, and notes this as a competing explanation for the outsized effects.[5]
As previously mentioned, in the previous meta-analysis unregistered trials had a 3x greater effect size than registered ones. None of the trials on Strongminds published so far have been registered. Baird et al., which is registered, is anticipated to report disappointing results.
Evidentiary standards: Indeed, the report drew upon a large number of studies. Yet even a synthesis of 72 million (or whatever) studies can be misleading if issues of publication bias, risk of bias in individual studies (and so on) are not appropriately addressed. That an area has 72 (or whatever) studies upon it does not mean it is well-studied, nor would this number (nor any number) be sufficient, by itself, to satisfy any evidentiary standard.
Outlier exclusion: The report’s approach to outlier exclusion is dissimilar to both Cuijpers et al. 2020 and Tong et al. 2023, and is dissimilar precisely with respect to the features I highlighted as major causes for concern re. HLI’s approach in my original comment.[6] Specifically:
Both of these studies present the analysis with the full data first in their results. Contrast HLI’s report, where only the results with outliers excluded are presented in the main results, and the analysis without exclusion is found only in the appendix.[7]
Both these studies also report the results with the full data as their main findings (e.g. in their respective abstracts). Cuijpers et al. mentions their outlier excluded results primarily in passing (“outliers” appears once in the main text); Tong et al. relegates a lot of theirs to the appendix. HLI’s report does the opposite. (cf. fn 7 above)
Only Tong et al. does further sensitivity analysis on the ‘outliers excluded’ subgroup. As Jason describes, this is done alongside the analysis where all data is included, and the qualitative and quantitative differences which result from this analytic choice are prominently highlighted to the reader and extensively discussed. In HLI’s report, by contrast, the factor of 3 reduction to the ultimate effect size when outliers are not excluded is only alluded to qualitatively in a footnote (fn 33)[8] of the main report’s section (3.2) arguing why outliers should be excluded, is not included in the report’s sensitivity analysis, and is only found in the appendix.[9]
Both studies adjust for publication bias only on all data, not on data with outliers excluded, and these are the publication bias findings they present. Contrast HLI’s report.
The Cuijpers et al. 2023 meta-analysis previously mentioned also differs in its approach to outlier exclusion from HLI’s report in the ways highlighted above. The Cochrane handbook likewise supports my recommendations on what approach should be taken: the approach the meta-analyses HLI cites approvingly as examples of “sensible practice” actually take, but which HLI’s own work does not.
The report’s (non-)presentation of the stark quantitative sensitivity of its analysis (material to its bottom-line recommendations) to whether outliers are excluded is clearly inappropriate. It is indefensible if, as I have suggested may be the case, the analysis with outliers included was indeed the analysis first contemplated and conducted.[10] It is even worse if the publication bias corrections on the full data were what in fact prompted HLI to start making alternative analysis choices which happened to substantially increase the bottom-line figures.
Bayesian analysis: Bayesian methods notoriously do not avoid subjective inputs—most importantly here, what information we attend to when constructing an ‘informed prior’ (or, if one prefers, how to weigh the results with a particular prior stipulated).
In any case, they provide no protection from misunderstanding the calculation being performed, and so misinterpreting the results. The Bayesian method in the report is actually calculating the (adjusted) average effect size of psychotherapy interventions in general, not the expected effect of a given psychotherapy intervention. Although a trial on Strongminds which shows it is relatively ineffectual should not update our view much on the efficacy of psychotherapy interventions (or those similar to Strongminds) as a whole, it should update us dramatically on the efficacy of Strongminds itself.
Although as a methodological error this is a subtle one (at least, subtle enough for me not to initially pick up on it), the results it gave are nonsense to the naked eye (e.g. SM would still be held as a GiveDirectly-beating intervention even if there were multiple high quality RCTs on Strongminds giving flat or negative results). HLI should have seen this themselves, should have stopped to think after I highlighted these facially invalid outputs of their method in early review, and definitely should not be doubling down on these conclusions even now.
Making recommendations: Although there are other problems, those I have repeated here make the recommendations of the report unsafe. This is why I recommended against publication. Specifically:
Although I don’t think the Bayesian method the report uses would be appropriate, if it were calculated properly on its own terms (e.g. prediction intervals not confidence intervals to generate the prior, etc.), and leaving everything else the same, the SM bottom line would drop (I’m pretty sure) by a factor of a bit more than 2.
The results are already essentially sensitive to whether outliers are excluded in analysis or not: SM goes from 3.7x → ~1.1x GD on the back of my envelope, again leaving all else equal.
(1) and (2) combined should net out to SM < GD; (1) or (2) combined with some of the other sensitivity analyses (e.g. spillovers) will also likely net out to SM < GD. Even if one still believes the bulk of (appropriate) analysis paths still support a recommendation, this sensitivity should be made transparent.
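A rough restatement of the arithmetic in (1) and (2), using only the multipliers stated above; the correction factor in (1) is approximate, as flagged:

```python
# Back-of-the-envelope combination of points (1) and (2) above.
# Inputs are the multipliers stated in this comment; the ">2" prior
# correction factor is approximate.

sm_vs_gd_reported = 3.7           # report's headline: SM ~3.7x GD
outliers_included_multiple = 1.1  # (2): SM vs GD if outliers are not excluded
prior_correction_factor = 2.0     # (1): "a bit more than 2" from fixing the prior

print(f"(1) alone: {sm_vs_gd_reported / prior_correction_factor:.2f}x GD")           # ~1.85x
print(f"(1) + (2): {outliers_included_multiple / prior_correction_factor:.2f}x GD")  # ~0.55x, i.e. SM < GD
```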
E.g. even if all studies in the field are conducted impeccably, if journals only accept positive results the literature may still show publication bias. Contrariwise, even if all findings get published, failures in allocation/blinding/etc. could lead to systematic inflation of effect sizes across the literature. In reality—and here—you often have both problems, and they only partially overlap.
Jason correctly interprets Tong et al. 2023: the number of studies included in their publication bias corrections (117 [+36 w/ trim and fill]) equals the number of all studies, not the low risk of bias subgroup (36 - see table 3). I do have access to Cuijpers et al. 2023, which has a very similar results table, with parallel findings (i.e. they do their publication bias corrections on the whole set of studies, not on a low risk of bias subgroup).
HLI’s report does not assess the quality of its included studies, although it plans to do so. I appreciate GRADEing 90 studies or whatever is tedious and time consuming, but skipping this step to crack on with the quantitative synthesis is very unwise: any such synthesis could be hugely distorted by low quality studies.
Risk of bias is another important problem in research on psychotherapies for depression. In 70% of the trials (92/309) there was at least some risk of bias. And the studies with low risk of bias, clearly indicated smaller effect sizes than the ones that had (at least some) risk of bias. Only four of the 15 specific types of therapy had 5 or more trials without risk of bias. And the effects found in these studies were more modest than what was found for all studies (including the ones with risk of bias). When the studies with low risk of bias were adjusted for publication bias, only two types of therapy remained significant (the “Coping with Depression” course, and self-examination therapy).
The larger effect sizes found in non-Western trials were related to the presence of wait-list controls, high risk of bias, cognitive-behavioral therapy, and clinician-diagnosed depression (p < 0.05). The larger treatment effects observed in non-Western trials may result from the high heterogeneous study design and relatively low validity. Further research on long-term effects, adolescent groups, and individual-level data are still needed.
Apparently, all that HLI really meant with “Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach” [my emphasis] was merely “[C]onditional on removing outliers, they identify a similar or greater range of effect sizes as outliers as we do.” (see).
Yeah, right.
I also had the same impression as Jason that HLI’s reply repeatedly strawmans me. The passive aggressive sniping sprinkled throughout and subsequent backpedalling (in fairness, I suspect by people who were not at the keyboard of the corporate account) is less than impressive too. But it’s nearly Christmas, so beyond this footnote I’ll let all this slide.
Received opinion is typically that outlier exclusion should be avoided without a clear rationale for why the ‘outliers’ arise from a clearly discrepant generating process. If it is to be done, the results of the full data should still be presented as the primary analysis.
If we didn’t first remove these outliers, the total effect for the recipient of psychotherapy would be much larger (see Section 4.1) but some publication bias adjustment techniques would over-correct the results and suggest the completely implausible result that psychotherapy has negative effects (leading to a smaller adjusted total effect). Once outliers are removed, these methods perform more appropriately. These methods are not magic detectors of publication bias. Instead, they make inferences based on patterns in the data, and we do not want them to make inferences on patterns that are unduly influenced by outliers (e.g., conclude that there is no effect – or, more implausibly, negative effects – of psychotherapy because of the presence of unreasonable effects sizes of up to 10 gs are present and creating large asymmetric patterns). Therefore, we think that removing outliers is appropriate. See Section 5 and Appendix B for more detail.
The sentence in the main text to which this is a footnote says:
Removing outliers this way reduced the effect of psychotherapy and improves the sensibility of moderator and publication bias analyses.
[W]ithout excluding data, SM drops from ~3.6x GD to ~1.1x GD. Yet it doesn’t get a look in for the sensitivity analysis, where HLI’s ‘less favourable’ outlier method involves taking an average of the other methods (discounting by ~10%), but not doing no outlier exclusion at all (discounting by ~70%).
My remark about “Even if you didn’t pre-specify, presenting your first cut as the primary analysis helps for nothing up my sleeve reasons”, which Dwyer mentions elsewhere, was a reference to ‘nothing up my sleeve numbers’ in cryptography. In the same way that picking the initial digits of pi or e for arbitrary constants reassures readers that the author didn’t pick numbers with some ulterior purpose they are not revealing, reporting what one’s first analysis showed means readers can compare it to where you ended up after making all the post-hoc judgement calls in the garden of forking paths. “Our first intention analysis would give x, but we ended up convincing ourselves the most appropriate analysis gives a bottom line of 3x” would rightly arouse a lot of scepticism.
I’ve already mentioned I suspect this is indeed what has happened here: HLI’s first cut was including all data, but argued itself into making the choice to exclude, which gave a 3x higher ‘bottom line’. Beyond “You didn’t say you’d exclude outliers in your protocol” and “basically all of your discussion in the appendix re. outlier exclusion concerns the results of publication bias corrections on the bottom line figures”, I kinda feel HLI not denying it is beginning to invite an adverse inference from silence. If I’m right about this, HLI should come clean.
Although there are other problems, those I have repeated here make the recommendations of the report unsafe.
Even if one still believes the bulk of (appropriate) analysis paths still support a recommendation, this sensitivity should be made transparent.
The first statement says HLI’s recommendation is unsafe, but the second implies it is reasonable as long as the sensitivity is clearly explained. I’m grateful to Greg for presenting the analysis paths which lead to SM < GD, but it’s unclear to me how much those paths should be weighted compared to all the other paths which lead to SM > GD.
It’s notable that Cuijpers (who has done more than anyone in the field to account for publication bias and risk of bias) is still confident that psychotherapy is effective.
I was also surprised by the use of ‘unsafe’. Less cost-effective maybe, but ‘unsafe’ implies harm and I haven’t seen any evidence to support that claim.
You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this. Your prior here is that there is a 99%+ chance that StrongMinds will work better than GiveDirectly before looking at any actual StrongMinds results; this is a wildly implausible claim.
You also state “If one holds that the evidence for something as well-studied as psychotherapy is too weak to justify any recommendations, charity evaluators could recommend very little.” Nothing in Gregory’s post suggests that he thinks anything like this; he gives a g of ~0.5 in his meta-analysis, which doesn’t remove outliers without good cause. A g of ~0.5 suggests that individuals suffering from depression would likely greatly benefit from seeking therapy. There is a massive difference between “evidence behind psychotherapy is too weak to justify any recommendations” and claiming that “this particular form of therapy is not vastly better than GiveDirectly with a probability higher than 99% before even looking at RCT results”. Trying to throw out Gregory’s claims here over a seemingly false statement about his beliefs seems pretty offensive to me.
[Disclaimer: I worked at HLI until March 2023. I now work at the International Alliance of Mental Health Research Funders]
Gregory says
these problems are sufficiently major I think potential donors are ill-advised to follow the recommendations and analysis in this report.
That is a strong claim to make and it requires him to present a convincing case that GiveDirectly is more cost-effective than StrongMinds. I’ve found his previous methodological critiques to be constructive and well-explained. To their credit, HLI has incorporated many of them in the updated analysis. However, in my opinion, the critiques he presents here do not make a convincing case.
Taking his summary points in turn...
1. The literature on PT in LMICs is a complete mess. Insofar as more sense can be made from it, the most important factors appear to belong to the studies investigating it (e.g. their size) rather than qualities of the PT interventions themselves.
I think this is much too strong. The three meta-analyses (and Gregory’s own calculations) give me confidence that psychotherapy in LMICs is effective, although the effects are likely to be small.
2. Trying to correct the results of a compromised literature is known to be a nightmare. Here, the qualitative evidence for publication bias is compelling. But quantifying what particular value of ‘a lot?’ the correction should be is fraught: numerically, methods here disagree with one another dramatically, and prove highly sensitive to choices on data exclusion.
There is no consensus on the appropriate methodology for adjusting publication bias. I don’t have an informed opinion on this, but HLI’s approach seems reasonable to me and I think it’s reasonable for Greg to take a different view. From my limited understanding, neither approach makes GiveDirectly more cost-effective.
3. Regardless of how PT looks in general, StrongMinds, in particular, is looking less and less promising. Although initial studies looked good, they had various methodological weaknesses, and a forthcoming RCT with much higher methodological quality is expected to deliver disappointing results.
We don’t have any new data on StrongMinds so I’m confused why Greg thinks it’s “less and less promising”. HLI’s Bayesian approach is a big improvement on the subjective weightings they used in the first cost-effectiveness analysis. As with publication bias, it’s reasonable to hold different views on how to construct the prior, but personally, I do believe that any psychotherapy intervention in LMICs, so long as cost per patient is <$100, is a ~certain bet to beat cash transfers. There are no specific models of psychotherapy that perform better than the others, so I don’t find it surprising that training people to talk to other people about their problems is a more cost-effective way to improve wellbeing in LMICs. Cash transfers are much more expensive and the effects on subjective wellbeing are also small.
4. The evidential trajectory here is all too common, and the outlook typically bleak. It is dubious StrongMinds is a good pick even among psychotherapy interventions (picking one at random which doesn’t have a likely-bad-news RCT imminent seems a better bet). Although pricing different interventions is hard, it is even more dubious SM is close to the frontier of “very well evidenced” vs. “has very promising results” plotted out by things like AMF, GD, etc. HLI’s choice to nonetheless recommend SM again this giving season is very surprising. I doubt it will weather hindsight well.
HLI had to start somewhere and I think we should give credit to StrongMinds for being brave enough to open themselves up to the scrutiny they’ve faced. The three meta-analyses and the tentative analysis of Friendship Bench suggest there is ‘altruistic gold’ to be found here and HLI has only just started to dig. The field is growing quickly and I’m optimistic about the trajectories of CE-incubated charities like Vida Plena and Kaya Guides.
In the meantime, although the gap between GiveDirectly and StrongMinds has clearly narrowed, I remain unconvinced that cash is clearly the better option (but I do remain open-minded and open to pushback).
You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this.
A specific therapy treatment is drawn from the distribution of therapy treatments. Our best guess about the distribution of value of a specific therapy treatment, without knowing anything about it, should take into account only that it comes from this distribution of therapy treatments. So I don’t see what’s unreasonable about this.
When running a meta-analysis, you can either use a fixed effect assumption (that all variation between studies is just due to sampling error) or a random effects assumption (that studies differ in terms of their “true effects”). Therapy treatments differ greatly, so you have to use a random effects model in this case. Then the prior you use for StrongMinds’ impact should have a variance that is the sum of the variance in the estimate of the average therapy treatment effect AND the variance among different treatments’ effects; both numbers should be available from a random effects meta-analysis. I’m not quite sure what HLI did exactly to get their prior for StrongMinds here, but for some reason the variance on it seems WAY too low, and I suspect that they neglected the second type of variance that they should have gotten from a random effects meta-analysis.
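A sketch of the variance decomposition being described, with placeholder numbers that are not figures from HLI’s report:

```python
import math

# Placeholder numbers purely for illustration (NOT from HLI's report):
# a pooled mean effect with its standard error, plus between-study
# heterogeneity (tau^2) from a random effects meta-analysis.

mu_hat = 0.64   # pooled mean effect (SDs)
se_mu = 0.05    # standard error of the pooled mean
tau2 = 0.20     # between-treatment variance (tau^2)

# Variance appropriate for the *average* treatment effect:
var_average = se_mu ** 2

# Variance appropriate as a prior for one *specific* new treatment,
# as argued above: uncertainty about the mean PLUS true heterogeneity.
var_specific = se_mu ** 2 + tau2

print(f"SD of prior on the average effect:   {math.sqrt(var_average):.2f}")   # 0.05
print(f"SD of prior on a specific treatment: {math.sqrt(var_specific):.2f}")  # ~0.45
# The second prior is far wider, so trial data on the specific treatment
# should move the posterior much more than the narrower prior allows.
```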
Section 2.2.2 of their report is titled “Choosing a fixed or random effects model”. They discuss the points you make and clearly say that they use a random effects model. In section 2.2.3 they discuss the standard measures of heterogeneity they use. Section 2.2.4 discusses the specific 4-level random effects model they use and how they did model selection.
I reviewed a small section of the report prior to publication but none of these sections, and it only took me 5 minutes now to check what they did. I’d like the EA Forum to have a higher bar (as Gregory’s parent comment exemplifies) before throwing around easily checkable suspicions about what (very basic) mistakes might have been made.
Yes, some of Greg’s examples point to the variance being underestimated, but the problem does not inherently come from the idea of using the distribution of effects as the prior, since that should include both the sampling uncertainty and true heterogeneity. That would be the appropriate approach even under a random effects model (I think; I’m more used to thinking in terms of Bayesian hierarchical models and the equivalence might not hold).
I think the fundamental point (i.e. “You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this.”) is on the right lines, although subsequent discussion of fixed/random effect models might confuse the issue. (Cf. my reply to Jason).
The typical output of a meta-analysis is an (~) average effect size estimate (the diamond at the bottom of the forest plot, etc.). The confidence interval given for that is (very roughly)[1] the interval in which we predict the true average effect likely lies. So for the basic model given in Section 4 of the report, the average effect size is 0.64, 95% CI (0.54 − 0.74). So (again, roughly) our best guess of the ‘true’ average effect size of psychotherapy in LMICs from our data is 0.64, and we’re 95% sure(*) this average is somewhere between 0.54 and 0.74.
Clearly, it is not the case that if we draw another study from the same population, we should be 95% confident(*) the effect size of this new data point will lie between 0.54 and 0.74. This would not be true even in the unicorn case where there’s no between-study heterogeneity (e.g. all the studies are measuring the same effect modulo sampling variance), and it is even less true when heterogeneity is marked, as here. To answer that question, what you want is a prediction interval.[2] This interval is always wider, and almost always significantly so, than the confidence interval for the average effect: in the same analysis with the 0.54-0.74 confidence interval, the prediction interval was −0.27 to 1.55.
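To illustrate the distinction, here is a sketch using the mean and confidence interval quoted above. The between-study SD (tau) is a placeholder chosen so the output lands near the quoted prediction interval, and the formula is the standard Higgins-style approximation rather than necessarily the exact model in the report:

```python
import math
from scipy import stats

# Mean and 95% CI as quoted above; tau is an illustrative placeholder
# (NOT taken from the report), chosen so the result lands near the
# quoted prediction interval of roughly (-0.27, 1.55).

mu = 0.64
ci_low, ci_high = 0.54, 0.74
k = 72          # approximate number of studies, as above
tau = 0.45      # assumed between-study SD (placeholder)

# Back out the standard error of the mean from the 95% CI.
se_mu = (ci_high - ci_low) / (2 * 1.96)

# Approximate 95% prediction interval for a *new* study's effect
# (Higgins et al.): mu +/- t_{k-2} * sqrt(tau^2 + se_mu^2).
t_crit = stats.t.ppf(0.975, df=k - 2)
half_width = t_crit * math.sqrt(tau**2 + se_mu**2)

print(f"95% CI for the average effect: ({ci_low:.2f}, {ci_high:.2f})")
print(f"~95% prediction interval:      ({mu - half_width:.2f}, {mu + half_width:.2f})")
```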
Although the full model HLI uses in constructing informed priors is different from that presented in S4 (e.g. it includes a bunch of moderators), they appear to be constructed with Monte Carlo on the confidence intervals for the average, not the prediction interval for the data. So I believe the informed prior is actually one of the (adjusted) “Average effect of psychotherapy interventions as a whole”, not a prior for (e.g.) “the effect size reported in a given PT study.” The latter would need to use the prediction intervals, and have a much wider distribution.[3]
I think this ably explains exactly why the Bayesian method for (e.g.) Strongminds gives very bizarre results when deployed as the report does, but those results make much more sense if re-interpreted as (in essence) computing the expected effect size of ‘a future Strongminds-like intervention’, rather than the effect size we should believe Strongminds actually has once in receipt of trial data upon it specifically. E.g.:
The histogram of effect sizes shows some comparisons had an effect size < 0, but the ‘informed prior’ suggests P(ES < 0) is extremely low. As a prior for the effect size of the next study, it is much too confident, given the data, that a trial will report positive effects (more than 1 in 72 studies is negative, so surely P(ES < 0) cannot be <1%, etc.). As a prior for the average effect size, this confidence is warranted: given the large number of studies in our sample, most of which report positive effects, we would be very surprised to discover the true average effect size is negative.
The prior doesn’t update very much on the data provided. E.g. when we stipulate that the trials on Strongminds report a near-zero effect of 0.05 WELLBYs, our estimate of 1.49 WELLBYs goes to 1.26: so we should (apparently) believe in such a circumstance that the efficacy of SM is ~25 times greater than the trial data upon it indicates. This is, obviously, absurd. However, such a small update would be appropriate if it were an update to roughly the average of PT interventions as a whole: observing that a new PT intervention has much below average results should cause our average to shift a little towards the new findings, but not much.
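To see why the width of the prior drives this, consider a standard normal-normal update with placeholder numbers loosely modelled on the figures above; the prior and sampling SDs are illustrative assumptions, not the report’s actual parameters:

```python
# Normal-normal conjugate update, to illustrate why prior width matters.
# All SDs are illustrative assumptions; only the 1.49 prior mean and the
# stipulated 0.05 trial result echo the figures discussed above.

def posterior_mean(prior_mu, prior_sd, data_mu, data_sd):
    """Precision-weighted posterior mean for a normal prior and likelihood."""
    w_prior = 1 / prior_sd**2
    w_data = 1 / data_sd**2
    return (w_prior * prior_mu + w_data * data_mu) / (w_prior + w_data)

prior_mu = 1.49  # prior expectation for the intervention (WELLBYs)
data_mu = 0.05   # stipulated near-zero trial result
data_sd = 0.30   # assumed sampling uncertainty of the trial estimate

# Narrow prior, i.e. treating the CI for the *average* intervention as the prior:
print(posterior_mean(prior_mu, prior_sd=0.13, data_mu=data_mu, data_sd=data_sd))
# -> ~1.26: barely moved by the trial, mirroring the behaviour described above

# Wide prior, with width closer to a *prediction interval* for a specific intervention:
print(posterior_mean(prior_mu, prior_sd=0.90, data_mu=data_mu, data_sd=data_sd))
# -> ~0.19: lands much nearer the trial estimate
```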
In essence, the update we are interested in is not “How effective should we expect future interventions like Strongminds are given the data on Strongminds efficacy”, but simply “How effective should we expect Strongminds is given the data on how effective Strongminds is”. Given the massive heterogeneity and wide prediction interval, the (correct) informed prior is pretty uninformative, as it isn’t that surprised by anything in a very wide range of values, and so on finding trial data on SM with a given estimate in this range, our estimate should update to match it pretty closely.[4]
(This also should mean, unlike the report suggests, the SM estimate is not that ‘robust’ to adverse data. Eyeballing it, I’d guess the posterior should be going down by a factor of 2-3 conditional on the stipulated data versus currently reported results).
I’m aware confidence intervals are not credible intervals, and that ‘the 95% CI tells you where the true value is with 95% likelihood’ strictly misinterprets what a confidence interval is, etc. (see) But perhaps ‘close enough’, so I’m going to pretend these are credible intervals, and asterisk each time I assume the strictly incorrect interpretation.
The summary estimate and confidence interval from a random-effects meta-analysis refer to the centre of the distribution of intervention effects, but do not describe the width of the distribution. Often the summary estimate and its confidence interval are quoted in isolation and portrayed as a sufficient summary of the meta-analysis. This is inappropriate. The confidence interval from a random-effects meta-analysis describes uncertainty in the location of the mean of systematically different effects in the different studies. It does not describe the degree of heterogeneity among studies, as may be commonly believed. For example, when there are many studies in a meta-analysis, we may obtain a very tight confidence interval around the random-effects estimate of the mean effect even when there is a large amount of heterogeneity. A solution to this problem is to consider a prediction interval (see Section 10.10.4.3).
Obviously, modulo all the other issues I suggest with both the meta-analysis as a whole, that we in fact would incorporate other sources of information into our actual prior, etc. etc.
Agree that there seem to be some strawmen in HLI’s response:
We don’t believe that the entire field of LMIC psychotherapy should be considered bunk, compromised, or uninformative.
Has anyone suggested that the “entire field of LMIC psychotherapy” is “bunk”?
If one insisted only on using charity evaluations that had every choice pre-registered, there would be none to choose from.
Has anyone suggested that, either? As I understand it, it’s typical to look at debatable choices that happen to support the author’s position with a somewhat more skeptical lens if they haven’t been pre-registered. I don’t think anyone has claimed lack of certain choices being pre-registered is somehow fatal, only a factor to consider.
Our (HLI’s) comment was in reference to these quotes.
The literature on PT in LMICs is a complete mess.
Trying to correct the results of a compromised literature is known to be a nightmare.
I think it is valid to describe these as saying the literature is compromised and (probably) uninformative. I can understand your complaint about the word “bunk”. Apologies to Gregory if this is a mischaracterization.
Regarding our comment:
If one insisted only on using charity evaluations that had every choice pre-registered, there would be none to choose from.
And your comment:
I don’t think anyone has claimed lack of certain choices being pre-registered is somehow fatal, only a factor to consider.
Yeah, I think this is a valid point, and the post should have quoted Gregory directly. The point we were hoping to make here is that we’ve attempted to provide a wide range of sensitivity analyses throughout our report, to an extent that we think goes beyond most charity evaluations. It’s not surprising that we’ve missed some in this draft that others would like to see. Gregory’s comment that “Even if you didn’t pre-specify, presenting your first cut as the primary analysis helps for nothing up my sleeve reasons” seemed to imply that we were deliberately hiding something, but in my view our interpretation was overly pessimistic.
A comparatively minor point, but it doesn’t seem to me that the claims in Greg’s post [more] are meaningfully weakened by whether or not psychotherapy is well-studied (as measured by how many RCTs HLI has found on it, noting that you already push back on some object level disagreement on study quality in point 1, which feels more directly relevant).
It also seems pretty unlikely to be true that psychotherapy being well studied necessarily means that StrongMinds is a cost-effective intervention comparable to current OP / GW funding bars (which is one main point of contention), or that charity evaluators need 74+ RCTs in an area before recommending a charity. Is the implicit claim being made here is that the evidence for StrongMinds being a top charity is stronger than that of AMF, which is (AFAIK) based on less than 74 RCTs?[1]
GiveWell, Cochrane
I have previously let HLI have the last word, but this is too egregious.
Study quality: Publication bias (a property of the literature as a whole) and risk of bias (particular to each individual study which comprise it) are two different things.[1] Accounting for the former does not account for the latter. This is why the Cochrane handbook, the three meta-analyses HLI mentions here, and HLI’s own protocol consider distinguish the two.
Neither Cuijpers et al. 2023 nor Tong et al. 2023 further adjust their low risk of bias subgroup for publication bias.[2] I tabulate the relevant figures from both studies below:
So HLI indeed gets similar initial results and publication bias adjustments to the two other meta-analyses they find. Yet—although these are not like-for-like—these other two meta-analyses find similarly substantial effect reductions when accounting for study quality as they do when assessing at publication bias of the literature as a whole.
There is ample cause for concern here:[3]
Although neither of these studies ‘adjust for both’, one later mentioned—Cuijpers et al. 2020 - does. It finds an additional discount to effect size when doing so.[4] So it suggests that indeed ‘accounting for’ publication bias does not adequately account for risk of bias en passant.
Tong et al. 2023 - the meta-analysis expressly on PT in LMICs rather than PT generally—finds higher prevalence of indicators of lower study quality in LMICs, and notes this as a competing explanation for the outsized effects.[5]
As previously mentioned, in the previous meta-analysis, unregistered trials had a 3x greater effect size than registered ones. All trials on Strongminds published so far have not been registered. Baird et al., which is registered, is anticipated to report disappointing results.
Evidentiary standards: Indeed, the report drew upon a large number of studies. Yet even a synthesis of 72 million (or whatever) studies can be misleading if issues of publication bias, risk of bias in individual studies (and so on) are not appropriately addressed. That an area has 72 (or whatever) studies upon it does not mean it is well-studied, nor would this number (nor any number) be sufficient, by itself, to satisfy any evidentiary standard.
Outlier exclusion: The report’s approach to outlier exclusion is dissimilar to both Cuijpers et al. 2020 and Tong et al. 2023, and it differs precisely on the features I highlighted as major causes for concern about HLI’s approach in my original comment.[6] Specifically:
Both of these studies present the analysis with the full data first in their results. Contrast HLI’s report, where only the results with outliers excluded are presented in the main results, and the analysis without exclusion is found only in the appendix.[7]
Both these studies also report the results with the full data as their main findings (e.g. in their respective abstracts). Cuijpers et al. mentions their outlier excluded results primarily in passing (“outliers” appears once in the main text); Tong et al. relegates a lot of theirs to the appendix. HLI’s report does the opposite. (cf. fn 7 above)
Only Tong et al. does further sensitivity analysis on the ‘outliers excluded’ subgroup. As Jason describes, this is done alongside the analysis where all data are included, and the qualitative and quantitative differences resulting from this analytic choice are prominently highlighted to the reader and extensively discussed. In HLI’s report, by contrast, the factor-of-3 reduction to the ultimate effect size when outliers are not excluded is only alluded to qualitatively in a footnote (fn 33)[8] of the main report’s section (3.2) arguing why outliers should be excluded; it is not included in the report’s sensitivity analysis, and is only found in the appendix.[9]
Both studies adjust for publication bias only on all data, not on data with outliers excluded, and these are the publication bias findings they present. Contrast HLI’s report.
The Cuijpers et al. 2023 meta-analysis previously mentioned also differs in its approach to outlier exclusion from HLI’s report in the ways highlighted above. The Cochrane handbook likewise supports my recommendations on the approach that should be taken: it is what the meta-analyses HLI cites approvingly as examples of “sensible practice” actually do, and what HLI’s own work does not.
The report’s (non-)presentation of the stark quantitative sensitivity of its analysis to whether outliers are excluded, a sensitivity material to its bottom-line recommendations, is clearly inappropriate. It is indefensible if, as I have suggested may be the case, the analysis with outliers included was indeed the analysis first contemplated and conducted.[10] It is even worse if the publication bias corrections on the full data were what in fact prompted HLI to start making alternative analysis choices that happened to substantially increase the bottom-line figures.
Bayesian analysis: Bayesian methods notoriously do not avoid subjective inputs—most importantly here, what information we attend to when constructing an ‘informed prior’ (or, if one prefers, how to weigh the results with a particular prior stipulated).
In any case, they provide no protection against misunderstanding the calculation being performed, and so misinterpreting the results. The Bayesian method in the report actually calculates the (adjusted) average effect size of psychotherapy interventions in general, not the expected effect of a given psychotherapy intervention. Although a trial on StrongMinds showing it is relatively ineffectual should not update our view much on the efficacy of psychotherapy interventions (similar to StrongMinds) as a whole, it should update us dramatically on the efficacy of StrongMinds itself.
Although this is a subtle methodological error (at least, subtle enough for me not to pick up on it initially), the results it gave are nonsense to the naked eye (e.g. SM would still be held to be a GiveDirectly-beating intervention even if there were multiple high-quality RCTs on StrongMinds giving flat or negative results). HLI should have seen this themselves, should have stopped to think after I highlighted these facially invalid outputs of their method in early review, and definitely should not be doubling down on these conclusions even now.
Making recommendations: Although there are other problems, those I have repeated here make the recommendations of the report unsafe. This is why I recommended against publication. Specifically:
Although I don’t think the Bayesian method the report uses would be appropriate, if it were calculated properly on its own terms (e.g. using prediction intervals rather than confidence intervals to generate the prior, etc.), and leaving everything else the same, the SM bottom line would drop (I’m pretty sure) by a factor of a bit more than 2.
The results are already highly sensitive to whether or not outliers are excluded in the analysis: SM goes from 3.7x to ~1.1x GD on the back of my envelope, again leaving all else equal.
(1) and (2) combined should net out to SM < GD; (1) or (2) combined with some of the other sensitivity analyses (e.g. spillovers) will also likely net out to SM < GD. Even if one still believes the bulk of (appropriate) analysis paths still support a recommendation, this sensitivity should be made transparent.
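As a rough sketch of how (1) and (2) combine, treating the two adjustments as roughly multiplicative (an assumption for this sketch; the multipliers are the approximate figures given above, not a re-run of the analysis):

```python
# Back-of-the-envelope combination of the two adjustments described above.
baseline_multiple = 3.7          # SM vs GD in the report's headline analysis
outliers_included = 1.1          # rough multiple if outliers are not excluded
prior_correction = 2.1           # "a bit more than 2" reduction (assumed value)

print(outliers_included / prior_correction)   # ~0.5  => SM < GD
print(baseline_multiple / prior_correction)   # ~1.8  => SM > GD, but far closer than 3.7x
```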
E.g. Even if all studies in the field are conducted impeccably, if journals only accept positive results the literature may still show publication bias. Contrariwise, even if all findings get published, failures in allocation/blinding/etc. could lead to systemic inflation of effect sizes across the literature. In reality—and here—you often have both problems, and they only partially overlap.
Jason correctly interprets Tong et al. 2023: the number of studies included in their publication bias corrections (117 [+36 with trim and fill]) equals the number of all studies, not the low risk of bias subgroup (36; see table 3). I do have access to Cuijpers et al. 2023, which has a very similar results table with parallel findings (i.e. they do their publication bias corrections on the whole set of studies, not on a low risk of bias subgroup).
Me, previously:
From their discussion (my emphasis):
E.g. from the abstract (my emphasis):
Apparently, all that HLI really meant with “Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach” [my emphasis] was merely “[C]onditional on removing outliers, they identify a similar or greater range of effect sizes as outliers as we do.” (see).
Yeah, right.
I also had the same impression as Jason that HLI’s reply repeatedly strawmans me. The passive aggressive sniping sprinkled throughout and subsequent backpedalling (in fairness, I suspect by people who were not at the keyboard of the corporate account) is less than impressive too. But it’s nearly Christmas, so beyond this footnote I’ll let all this slide.
Me again (my [re-?]emphasis)
Said footnote:
The sentence in the main text this is a footnote to says:
Me again:
My remark about “Even if you didn’t pre-specify, presenting your first cut as the primary analysis helps for nothing up my sleeve reasons”, which Dwyer mentions elsewhere, was a reference to ‘nothing up my sleeve numbers’ in cryptography. In the same way that picking the initial digits of pi or e for arbitrary constants reassures readers the author didn’t pick numbers with some ulterior purpose they are not revealing, reporting what one’s first analysis showed lets readers compare it to where you ended up after making all the post-hoc judgement calls in the garden of forking paths. “Our first-intention analysis would give x, but we ended up convincing ourselves the most appropriate analysis gives a bottom line of 3x” would rightly arouse a lot of scepticism.
I’ve already mentioned I suspect this is indeed what has happened here: HLI’s first cut was including all data, but argued itself into making the choice to exclude, which gave a 3x higher ‘bottom line’. Beyond “You didn’t say you’d exclude outliers in your protocol” and “basically all of your discussion in the appendix re. outlier exclusion concerns the results of publication bias corrections on the bottom line figures”, I kinda feel HLI not denying it is beginning to invite an adverse inference from silence. If I’m right about this, HLI should come clean.
I’m feeling confused by these two statements:
The first statement says HLI’s recommendation is unsafe, but the second implies it is reasonable as long as the sensitivity is clearly explained. I’m grateful to Greg for presenting the analysis paths which lead to SM < GD, but it’s unclear to me how much those paths should be weighted compared to all the other paths which lead to SM > GD.
It’s notable that Cuijpers (who has done more than anyone in the field to account for publication bias and risk of bias) is still confident that psychotherapy is effective.
I was also surprised by the use of ‘unsafe’. Less cost-effective maybe, but ‘unsafe’ implies harm and I haven’t seen any evidence to support that claim.
You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there is a large amount of variation between possible therapy treatments that is missed when doing this. Your prior here implies a 99%+ chance that StrongMinds will work better than GiveDirectly before looking at any actual StrongMinds results, which is a wildly implausible claim.
You also state: “If one holds that the evidence for something as well-studied as psychotherapy is too weak to justify any recommendations, charity evaluators could recommend very little.” Nothing in Gregory’s post suggests that he thinks anything like this: his own meta-analysis, which does not remove outliers without good cause, gives a g of ~0.5. A g of ~0.5 suggests that individuals suffering from depression would likely benefit greatly from seeking therapy. There is a massive difference between “the evidence behind psychotherapy is too weak to justify any recommendations” and “this particular form of therapy is not vastly better than GiveDirectly with a probability higher than 99% before even looking at RCT results”. Trying to throw out Gregory’s claims over a seemingly false statement about his beliefs seems pretty offensive to me.
[Disclaimer: I worked at HLI until March 2023. I now work at the International Alliance of Mental Health Research Funders]
Gregory says
That is a strong claim to make and it requires him to present a convincing case that GiveDirectly is more cost-effective than StrongMinds. I’ve found his previous methodological critiques to be constructive and well-explained. To their credit, HLI has incorporated many of them in the updated analysis. However, in my opinion, the critiques he presents here do not make a convincing case.
Taking his summary points in turn...
I think this is much too strong. The three meta-analyses (and Gregory’s own calculations) give me confidence that psychotherapy in LMICs is effective, although the effects are likely to be small.
There is no consensus on the appropriate methodology for adjusting for publication bias. I don’t have an informed opinion on this, but HLI’s approach seems reasonable to me, and I think it’s reasonable for Greg to take a different view. From my limited understanding, neither approach makes GiveDirectly more cost-effective.
We don’t have any new data on StrongMinds, so I’m confused why Greg thinks it’s “less and less promising”. HLI’s Bayesian approach is a big improvement on the subjective weightings they used in the first cost-effectiveness analysis. As with publication bias, it’s reasonable to hold different views on how to construct the prior, but personally I do believe that any psychotherapy intervention in LMICs, so long as cost per patient is <$100, is a ~certain bet to beat cash transfers. No specific models of psychotherapy perform better than the others, so I don’t find it surprising that training people to talk to other people about their problems is a more cost-effective way to improve wellbeing in LMICs than cash transfers. Cash transfers are much more expensive and their effects on subjective wellbeing are also small.
HLI had to start somewhere and I think we should give credit to StrongMinds for being brave enough to open themselves up to the scrutiny they’ve faced. The three meta-analyses and the tentative analysis of Friendship Bench suggest there is ‘altruistic gold’ to be found here and HLI has only just started to dig. The field is growing quickly and I’m optimistic about the trajectories of CE-incubated charities like Vida Plena and Kaya Guides.
In the meantime, although the gap between GiveDirectly and StrongMinds has clearly narrowed, I remain unconvinced that cash is clearly the better option (but I do remain open-minded and open to pushback).
A specific therapy treatment is drawn from the distribution of therapy treatments. Our best guess about the distribution of value of a specific therapy treatment, without knowing anything about it, should take into account only that it comes from this distribution of therapy treatments. So I don’t see what’s unreasonable about this.
When running a meta-analysis, you can use either a fixed-effect assumption (that all variation between studies is just due to sampling error) or a random-effects assumption (that studies differ in their “true effects”). Therapy treatments differ greatly, so you have to use a random-effects model in this case. The prior you use for StrongMinds’ impact should then have a variance that is the sum of the variance in the estimate of the average therapy treatment effect AND the variance among different treatments’ true effects; both numbers should be available from a random-effects meta-analysis. I’m not quite sure what HLI did exactly to get their prior for StrongMinds here, but for some reason its variance seems WAY too low, and I suspect they neglected the second type of variance that they should have gotten from a random-effects meta-analysis.
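A minimal sketch of the variance point (the numbers are placeholders roughly in line with figures quoted elsewhere in this thread, not HLI's actual estimates):

```python
import math

# Random-effects meta-analysis outputs (placeholder values):
mu_hat = 0.64   # estimated average effect
se_mu = 0.05    # standard error of that average
tau = 0.46      # estimated between-study SD ("true" heterogeneity)

# Prior SD for the AVERAGE effect of therapy-like interventions: narrow.
sd_average = se_mu

# Prior SD for ONE SPECIFIC new intervention drawn from that population:
# it must also carry the between-study heterogeneity, so it is much wider.
sd_specific = math.sqrt(se_mu**2 + tau**2)

print(f"SD for the average effect:      {sd_average:.2f}")
print(f"SD for a specific intervention: {sd_specific:.2f}")
```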
Section 2.2.2 of their report is titled “Choosing a fixed or random effects model”. They discuss the points you make and clearly say that they use a random effects model. In section 2.2.3 they discuss the standard measures of heterogeneity they use. Section 2.2.4 discusses the specific 4-level random effects model they use and how they did model selection.
I reviewed a small section of the report prior to publication but none of these sections, and it only took me 5 minutes now to check what they did. I’d like the EA Forum to have a higher bar (as Gregory’s parent comment exemplifies) before throwing around easily checkable suspicions about what (very basic) mistakes might have been made.
Yes, some of Greg’s examples point to the variance being underestimated, but the problem does not inherently come from the idea of using the distribution of effects as the prior, since that should include both the sampling uncertainty and true heterogeneity. That would be the appropriate approach even under a random effects model (I think; I’m more used to thinking in terms of Bayesian hierarchical models and the equivalence might not hold)
(@Burner1989 @David Rhys Bernard @Karthik Tadepalli)
I think the fundamental point (i.e. “You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this.”) is on the right lines, although subsequent discussion of fixed/random effect models might confuse the issue. (Cf. my reply to Jason).
The typical output of a meta-analysis is an (~) average effect size estimate (the diamond at the bottom of the forest plot, etc.). The confidence interval given for that is (very roughly)[1] the interval in which we predict the true average effect likely lies. So for the basic model given in Section 4 of the report, the average effect size is 0.64, 95% CI (0.54 to 0.74). So (again, roughly) our best guess of the ‘true’ average effect size of psychotherapy in LMICs from our data is 0.64, and we’re 95% sure(*) this average is somewhere between 0.54 and 0.74.
Clearly, it is not the case that if we draw another study from the same population, we should be 95% confident(*) the effect size of this new data point will lie between 0.54 and 0.74. This would not be true even in the unicorn case where there is no between-study heterogeneity (e.g. all the studies are measuring the same effect modulo sampling variance), and even less so when heterogeneity is marked, as here. To answer that question, what you want is a prediction interval.[2] This interval is always wider, and almost always significantly wider, than the confidence interval for the average effect: in the same analysis with the 0.54 to 0.74 confidence interval, the prediction interval was −0.27 to 1.55.
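To make the gap between the two intervals concrete, here is a rough normal-approximation sketch backed out from the figures just quoted (the report's exact method will differ in detail, e.g. using a t-distribution):

```python
import math

mu = 0.64
ci_low, ci_high = 0.54, 0.74
se = (ci_high - ci_low) / (2 * 1.96)        # ~0.05: SE of the *average* effect

# Back out the total SD implied by the reported prediction interval:
pi_low, pi_high = -0.27, 1.55
sd_total = (pi_high - pi_low) / (2 * 1.96)  # ~0.46: uncertainty for a *new* study
tau = math.sqrt(sd_total**2 - se**2)        # ~0.46: implied between-study SD

print(f"95% CI for the average effect: ({mu - 1.96*se:.2f}, {mu + 1.96*se:.2f})")
print(f"95% PI for a new study:        ({mu - 1.96*sd_total:.2f}, {mu + 1.96*sd_total:.2f})")
print(f"Implied between-study SD (tau): {tau:.2f}")
```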
Although the full model HLI uses in constructing informed priors is different from that presented in Section 4 (e.g. it includes a bunch of moderators), the priors appear to be constructed by Monte Carlo on the confidence intervals for the average, not the prediction interval for the data. So I believe the informed prior is actually one for the (adjusted) “average effect of psychotherapy interventions as a whole”, not a prior for (e.g.) “the effect size reported in a given PT study”. The latter would need to use the prediction intervals, and would have a much wider distribution.[3]
I think this explains exactly why the Bayesian method gives very bizarre results for (e.g.) StrongMinds when deployed as the report does. The results make much more sense if re-interpreted as (in essence) computing the expected effect size of ‘a future StrongMinds-like intervention’, not the effect size we should believe StrongMinds actually has once in receipt of trial data on it specifically. E.g.:
The histogram of effect sizes shows some comparisons had an effect size < 0, but the ‘informed prior’ suggests P(ES < 0) is extremely low. As a prior for the effect size of the next study, it is much too confident, given the data, that a trial will report positive effects (more than 1 in 72 studies in the sample are negative, so surely it cannot be <1%, etc.). As a prior for the average effect size, this confidence is warranted: given the large number of studies in our sample, most of which report positive effects, we would be very surprised to discover the true average effect size is negative.
The prior doesn’t update very much on the data provided. E.g. when we stipulate that the trials on StrongMinds report a near-zero effect of 0.05 WELLBYs, our estimate of 1.49 WELLBYs goes to 1.26: so we should (apparently) believe in such a circumstance that the efficacy of SM is ~25 times greater than the trial data on it indicates. This is, obviously, absurd. However, such a small update is appropriate if it applies to ~the average of PT interventions as a whole: observing that a new PT intervention has much-below-average results should shift our average a little towards the new findings, but not much.
In essence, the update we are interested in is not “How effective should we expect future interventions like StrongMinds to be, given the data on StrongMinds’ efficacy”, but simply “How effective should we expect StrongMinds to be, given the data on how effective StrongMinds is”. Given the massive heterogeneity and wide prediction interval, the (correct) informed prior is pretty uninformative, as it isn’t that surprised by anything in a very wide range of values; so, on finding trial data on SM with a given estimate in this range, our estimate should update to match it pretty closely.[4]
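A toy normal-normal update (with illustrative numbers on the effect-size scale, not HLI's WELLBY figures) shows how much the width of the prior drives this:

```python
# Toy conjugate normal-normal update: compare a prior as narrow as the CI
# for the average effect with one as wide as the prediction interval.
# All numbers are illustrative assumptions for this sketch.
def posterior_mean(prior_mean, prior_sd, data_mean, data_sd):
    w = (1 / prior_sd**2) / (1 / prior_sd**2 + 1 / data_sd**2)
    return w * prior_mean + (1 - w) * data_mean

data_mean, data_sd = 0.05, 0.10   # hypothetical near-null trial result

# Prior width ~ CI of the average effect: barely moves towards the data.
print(posterior_mean(0.64, 0.05, data_mean, data_sd))   # ~0.52

# Prior width ~ prediction interval: lands close to the trial result.
print(posterior_mean(0.64, 0.46, data_mean, data_sd))   # ~0.08
```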
(This also should mean, unlike the report suggests, the SM estimate is not that ‘robust’ to adverse data. Eyeballing it, I’d guess the posterior should be going down by a factor of 2-3 conditional on the stipulated data versus currently reported results).
I’m aware confidence intervals are not credible intervals, and that ‘the 95% CI tells you where the true value is with 95% likelihood’ strictly misinterprets what a confidence interval is, etc. (see) But perhaps ‘close enough’, so I’m going to pretend these are credible intervals, and asterisk each time I assume the strictly incorrect interpretation.
Cf. Cochrane:
Although I think it has the same mean, so it will give the right ‘best guess’ initial estimates.
Obviously, modulo all the other issues I suggest with both the meta-analysis as a whole, that we in fact would incorporate other sources of information into our actual prior, etc. etc.
Agree that there seem to be some strawmen in HLI’s response:
Has anyone suggested that the “entire field of LMIC psychotherapy” is “bunk”?
Has anyone suggested that, either? As I understand it, it’s typical to look at debatable choices that happen to support the author’s position with a somewhat more skeptical lens if they haven’t been pre-registered. I don’t think anyone has claimed lack of certain choices being pre-registered is somehow fatal, only a factor to consider.
Hey Jason,
Our (HLI’s) comment was in reference to these quotes.
I think it is valid to describe these as saying the literature is compromised and (probably) uninformative. I can understand your complaint about the word “bunk”. Apologies to Gregory if this is a mischaracterization.
Regarding our comment:
And your comment:
Yeah, I think this is a valid point, and the post should have quoted Gregory directly. The point we were hoping to make is that we’ve attempted to provide a wide range of sensitivity analyses throughout our report, to an extent we think goes beyond most charity evaluations. It’s not surprising that we’ve missed some in this draft that others would like to see. Gregory’s comment that “Even if you didn’t pre-specify, presenting your first cut as the primary analysis helps for nothing up my sleeve reasons” seemed to imply that we were deliberately hiding something, but in my view our interpretation of it was overly pessimistic.
Cheers for keeping the discourse civil.