HLI kindly provided me with an earlier draft of this work to review a couple of weeks ago. Although things have gotten better, I noted what I saw as major problems with the draft as-is, and recommended HLI take its time to fix them - even though this would take a while, and likely miss the window of Giving Tuesday.
Unfortunately, HLI went ahead anyway with the problems I identified basically unaddressed. Also unfortunately (notwithstanding laudable improvements elsewhere) these problems are sufficiently major I think potential donors are ill-advised to follow the recommendations and analysis in this report.
In essence:
Issues of study quality loom large over this literature, with a high risk of materially undercutting the results (they did last time). The report's interim attempts to manage these problems are inadequate.
Publication bias corrections are relatively mild, but only when all effects g > 2 are excluded from the analysis - they are much starker (albeit weird) if all data is included. Due to this, the choice to exclude 'outliers' roughly trebles the bottom line efficacy of PT. This analysis choice is dubious on its own merits, was not pre-specified in the protocol, yet is only found in the appendix rather than the sensitivity analysis in the main report.
The Bayesian analysis completely stacks the deck in favour of psychotherapy interventions (i.e. an 'informed prior' which asserts one should be > 99% confident StrongMinds is more effective than GiveDirectly before any data on StrongMinds is contemplated), such that psychotherapy/StrongMinds/etc. getting recommended is essentially foreordained.
Study quality
It perhaps comes as little surprise that different studies on psychotherapy in LMICs report very different results:[1]
The x-axis is a standardized measure of effect size for psychotherapy in terms of wellbeing.[2] Most - but not all - show a positive effect (g > 0), but the range is vast. HLI excludes effect sizes over 2 as outliers (much more on this later), but 2 is already a large effect: to benchmark, it is roughly the height difference between male and female populations.
Something like a '(weighted) average effect size' across this set would look promising (~0.6) - to also benchmark, the effect size of cash transfers on (individual) wellbeing is ~0.1. Yet cash transfers (among many other interventions) have much less heterogeneous results: more like '0.1 +/- 0.1', not '~0.6 multiply-or-divide by an integer'. It seems important to understand what is going on.
One hope would be that this heterogeneity can be explained in terms of the intervention and length of follow-up. Different studies did (e.g.) different sorts of psychotherapy, did more or less of it, and measured the outcomes at different points afterwards. Once we factor these things into our analysis, the wide distribution seen when looking at the impact of psychotherapy in general sharpens into a clearer picture for any particular psychotherapeutic intervention. One can then deploy this knowledge to assess, in particular, the likely efficacy of a charity like StrongMinds.
The report attempts this enterprise in section 4. I think a fair bottom line is that, despite these efforts, the overall picture is still very cloudy: the best model explains ~12% of the variance in effect sizes. But this best model is still better than no model (more on this later), so one can still use it to make a best guess for psychotherapeutic interventions, even if there remains a lot of uncertainty and spread.
But there could be another explanation for why there's so much heterogeneity: there are a lot of low-quality studies, and low-quality studies tend to report inflated effect sizes. In the worst case, the spread of data suggesting psychotherapy's efficacy is instead a mirage, and the effect size melts under proper scrutiny.
This is why most systematic reviews do assess the quality of included studies and their risk of bias. Sometimes this is only used to give a mostly qualitative picture alongside the evidence synthesis (e.g. 'X% of our studies have a moderate to high risk of bias'), and sometimes it is incorporated quantitatively (e.g. a 'quality score' of included studies used as a predictor/moderator, grouping by 'high/moderate/low' risk of bias, etc. - although all of these are controversial).
HLI's report does not assess the quality of its included studies, although it plans to do so. I appreciate GRADEing 90 studies or whatever is tedious and time consuming, but skipping this step to crack on with the quantitative synthesis is very unwise:[3] any such synthesis could be hugely distorted by low quality studies. And it's not like this is a mere possibility: I demonstrated in the previous meta-analysis that study registration status (one indicator of study quality) explained a lot of heterogeneity, and unregistered studies had on average a three times [!] greater effect size than registered ones.
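To make the kind of check I have in mind concrete, here is a minimal sketch (in Python, with made-up effect sizes and standard errors rather than HLI's data) of comparing inverse-variance pooled effects for registered vs. unregistered trials; the same logic applies to any binary study-quality indicator used as a moderator:

```python
# Minimal sketch: subgroup comparison on a binary quality indicator
# (hypothetical effect sizes and standard errors, not HLI's data).
import numpy as np

def pooled(g, se):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    w = 1 / se**2
    return np.sum(w * g) / np.sum(w), np.sqrt(1 / np.sum(w))

g_reg,   se_reg   = np.array([0.20, 0.35, 0.15, 0.40]), np.array([0.10, 0.12, 0.09, 0.15])
g_unreg, se_unreg = np.array([0.90, 1.40, 0.70, 1.10]), np.array([0.25, 0.30, 0.20, 0.28])

est_r, s_r = pooled(g_reg, se_reg)
est_u, s_u = pooled(g_unreg, se_unreg)
z = (est_u - est_r) / np.sqrt(s_r**2 + s_u**2)   # test for a subgroup difference
print(f"registered: {est_r:.2f}, unregistered: {est_u:.2f}, z = {z:.2f}")
```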
The report notes it has done some things to help manage this risk. One is cutting 'outliers' (g > 2, the teal in the earlier histogram), and extensive assessment of publication bias/small study effects. These things do help: all else equal, I'd expect bigger studies to be methodologically better ones, so adjusting for small study effects does partially 'control' for study quality; I'd also expect larger effect sizes to arise from lower-quality work, so cutting them should notch up the average quality of the studies that remain.
But I do not think they help enough[4] - these are loose proxies for what we seek to understand. The findings would be unreliable in virtue of this alone until study quality is properly looked at. Crucially, the risk that these features could confound the earlier moderator analysis has not been addressed:[5] maybe the relationship of (e.g.) 'more sessions given → greater effect' is actually due to studies of such interventions tending to be lower quality than the rest. When I looked last time, things like 'study size' or 'registration status' explained a lot more of the heterogeneity than (e.g.) all of the intervention moderators combined. I suspect the same will be true this time too.
Publication bias
I originally suggested (6m ago?) that correction for publication bias/small study effects could be ~an integer division, so I am surprised the correction was a bit less: ~30%. Here's the funnel plot:[6]
Unsurprisingly, huge amounts of scatter, but the asymmetry, although there, does not leap off the page: the envelope of points is pretty rectangular, but you can persuade yourself it's a bit of a parallelogram, and there's a denser part of it which indeed has a trend going down and to the right (so smaller study → bigger effect).
But this only plots effect sizes g < 2 (those red, not teal, in the histogram). If we include all the studies again, the picture looks a lot clearer - the 'long tail' of higher effects tends to come from smaller studies, and the plot is clearly asymmetric.
This effect, visible to the naked eye, also emerges in the statistics. The report uses a variety of numerical methods to correct for publication bias (some very sophisticated). All of them adjust the results downwards much further, to varying degrees, on the full data than when outliers are excluded (table B1, appendix). This has a stark effect on the results - here's the 'bottom line' result if you take a weighted average of all the different methods, under different approaches to outlier exclusion: red is the full data, green is the outlier exclusion method the report uses.
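For readers unfamiliar with how these corrections work mechanically, here is a toy sketch of one of the simpler members of this family (an Egger/PET-style regression, not necessarily one of the methods in table B1): regress effect sizes on their standard errors with inverse-variance weights; a positive slope indicates small-study effects, and the intercept is sometimes read as a bias-adjusted estimate. The data below are simulated purely for illustration.

```python
# Toy Egger/PET-style small-study-effects check on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
se = rng.uniform(0.05, 0.5, 80)            # smaller studies -> larger standard errors
g = 0.3 + 0.8 * se + rng.normal(0, se)     # simulate inflation tied to study size

X = sm.add_constant(se)                    # column of 1s + SE as predictor
fit = sm.WLS(g, X, weights=1 / se**2).fit()
print(fit.params)    # [intercept ("bias-adjusted" effect), slope (small-study effect)]
print(fit.pvalues)
```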
Needless to say, this choice is highly material to the bottom line results: without excluding data, SM drops from ~3.6x GD to ~1.1x GD. Yet it doesn't get a look in for the sensitivity analysis, where HLI's 'less favourable' outlier method involves taking an average of the other methods (discounting by ~10%), but not doing no outlier exclusion at all (discounting by ~70%).[7]
Perhaps this is fine if outlier inclusion would be clearly unreasonable. But it's not: cutting data is generally regarded as dubious, and the rationale for doing so here is not compelling. Briefly:
Received opinion is typically that outlier exclusion should be avoided without a clear rationale why the 'outliers' arise from a clearly discrepant generating process. If it is to be done, the results of the full data should still be presented as the primary analysis (e.g.).
The cut data by and large doesn't look visually 'outlying' to me. The histogram shows a pretty smooth, albeit skewed, distribution. Cutting off the tail of the distribution at various lengths appears ill-motivated.
Given the interest in assessing small study effects, cutting out the largest effects (which also tend to be the smallest studies) should be expected to attenuate the small study effect (as indeed it does). Yet if our working hypothesis is that these effects are large mainly because the studies are small, their datapoints are informative for plotting this general trend (e.g. for slightly less small studies which have slightly less inflated results).[8]
The strongest argument given is that, in fact, some numerical methods to correct publication bias give absurd results if given the full data: i.e. one gives an adjusted effect size of -0.6, another -0.2. I could buy an adjustment that drives the effect down to roughly zero, but not one which suggests, despite almost all the data being fairly or very positive, we should conclude from these studies the real effect is actually (highly!) negative.
One could have a long argument about what the most appropriate response is: maybe just keep it, as the weighted average across methods is still sensible (albeit disappointing)? Maybe just drop those methods in particular and take an average of those giving sane answers on the full data? Should we keep the g < 2 exclusion but drop p-curve analysis, as it (absurdly?) adjusts the effect slightly upwards? Maybe we should reweight the averaging of different numerical methods by how volatile their results are when you start excluding data? Maybe pick the outlier exclusion threshold which results in the least disagreement between the different methods? Or maybe just abandon numerical correction, and say 'there's clear evidence of significant small study effects, which the current state of the art cannot reliably quantify and correct'?
So a garden of forking paths opens before us. All of these are varying degrees of 'arguable', and they do shift the bottom line substantially. One reason pre-specification is so valuable is that it ties you to a particular path before you get to peek at the results, avoiding any risk of a subconscious finger on the scale pushing one down a path of still-defensible choices that nonetheless favour a particular bottom line. Even if you didn't pre-specify, presenting your first cut as the primary analysis helps for nothing up my sleeve reasons.
It may be that the pre-specified or initial stab doesn't do a good job of expressing the data, and a different approach does better. Yet making it clear this subsequent analysis is post hoc cautions a reader about the potential risk of bias in the analysis.
Happily, HLI did make a protocol for this work before conducting the analysis. Unfortunately, it is silent on whether outlying data would be excluded, or by what criteria. Also unfortunately, because of this (and other things, like the extensive appendix discussion of the value of outlier removal principally in virtue of its impact on publication bias correction), I am fairly sure the analysis with all data included was the first analysis conducted. Only after seeing the initial publication bias corrections did HLI look at the question of whether some data should be excluded. Maybe it should, but if it came second the initial analysis should be presented first (and definitely included in the sensitivity analysis).
There's also a risk the cloud of quantification buries the qualitative lede. Publication bias is known to be very hard to correct, and despite HLI compiling multiple state-of-the-art numerical methods, they starkly disagree on what the correction factor should be (i.e. from <~0 to >100%). So perhaps the right answer is that we basically do not know how much to discount the apparent effects seen in the PT literature, given it also appears to be an extremely compromised one, and if forced to give an overall number, any 'numerical bottom line' should have even wider error bars because of this.[9]
Bayesian methods
I previously complained that the guesstimate/BOTEC-y approach HLI used to integrate information from the meta-analysis and the StrongMinds trial data couldn't be right, as it didn't pass various sanity tests: e.g. still recommending SM as highly effective even if you set the trial data to zero effect. HLI now has a much cleverer Bayesian approach to combining sources of information. On the bright side, this is mechanistically much clearer as well as much cleverer. On the downside, the deck still looks pretty stacked.
Starting at the bottom, here's how HLI's Bayesian method compares SM to GD:
The informed prior (in essence) uses the meta-analysis findings with some Monte Carlo to get an expected effect for an intervention with StrongMinds-like traits (e.g. same number of sessions, same deliverer, etc.). The leftmost point of the solid line gives the expectation for the prior: so the prior is that SM is ~4x GD's cost-effectiveness (dashed line).
The x-axis is how much weight one gives to SM-specific data. Of interest, the line slopes down, so the data gives a negative update on SM's cost-effectiveness. This is because HLI - in anticipation of the Baird/Ozler RCT likely showing disappointing results - discounted the effect derived from the original SM-specific evidence by a factor of 20, so the likelihood is indeed much lower than the prior. Standard theory gives the appropriate weighting of this vs. the prior, so you adjust down a bit, but not a lot, from the prior (dotted line).
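For the mechanics: with a normal prior and a normal likelihood, the 'standard theory' here is just precision weighting - the posterior mean is a weighted average of the prior mean and the data, with weights proportional to the inverse variances. A minimal sketch with placeholder numbers (not HLI's actual figures) shows why a tight prior barely moves:

```python
# Normal-normal conjugate update: the posterior mean is a precision-weighted
# average of the prior mean and the data (all numbers are placeholders).
prior_mean, prior_sd = 4.0, 0.8   # assumed prior: SM at ~4x GD, fairly tight
data_mean,  data_sd  = 1.1, 1.5   # assumed SM-specific estimate and its spread

w_prior, w_data = 1 / prior_sd**2, 1 / data_sd**2
post_mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
post_sd = (w_prior + w_data) ** -0.5
print(post_mean, post_sd)   # the update is small because the prior is so precise
```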
Despite impeccable methods, these results are facially crazy. To illustrate:
The rightmost point on the solid line is the result if you completely discount the prior, and only use the stipulated-to-be-bad SM-specific results. SM is still slightly better than GD on this analysis.[10]
If we 'follow Bayesian updating' as HLI recommends, the recommendation is surprisingly insensitive to the forthcoming Baird/Ozler RCT having disappointing findings. Eyeballing it, you'd need such a result to be replicated half a dozen times for the posterior to update to SM being roughly on a par with GD.
Setting the forthcoming data to show basically zero effect will still return SM as 2-3x GD.[11] I'd guess you'd need the forthcoming RCT to show astonishingly and absurdly negative results (e.g. SM treatment is worse for your wellbeing than bereavement) to get it to approximate equipoise with GD.
You'd need even stronger adverse findings for the model to update all the way down to SM being ineffectual, rather than merely 'less good than GiveDirectly'.
I take it most readers would disagree with the model here too - e.g. if indeed the only RCT on StrongMinds is basically flat, that should be sufficient to demote SM from putative 'top charity' status.
I think I can diagnose the underlying problem: Bayesian methods are very sensitive to the stipulated prior. In this case, the prior is likely too high, and definitely too narrow/overconfident. See this:
Per the dashed and dotted lines in the previous figure, the 'GiveDirectly bar' sits fractionally below the blue dashed line (the point estimate of the stipulated SM data). The prior distribution is given in red. So the expectation (red dashed line) is indeed ~4x further from the origin (see above).
The solid red curve gives the distribution. Eyeballing the integrals reveals the problem: the integral of this distribution from the blue dashed line to infinity gives the model's confidence that psychotherapy interventions would be more cost-effective than GD. This is at least 99% of the area, if not 99.9% - 99.99%+. A fortiori, this prior asserts it is essentially certain the intervention is beneficial (total effect > 0).
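To see how the eyeballed integral cashes out numerically, here is a sketch with placeholder values: a normal prior whose mean sits ~4x above a 'GiveDirectly bar' of 1, with a spread narrow relative to that gap. The numbers are illustrative, not read off HLI's figure.

```python
# How much prior mass lies above the 'GiveDirectly bar', and below zero?
# (Placeholder prior parameters, for illustration only.)
from scipy.stats import norm

prior_mean, prior_sd = 4.0, 0.8   # assumed: prior expectation ~4x GD
gd_bar = 1.0                      # cost-effectiveness parity with GiveDirectly

print(norm.sf(gd_bar, loc=prior_mean, scale=prior_sd))   # P(beats GD): ~0.9999
print(norm.cdf(0.0, loc=prior_mean, scale=prior_sd))     # P(harmful):  ~3e-7
```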
I don't think anyone should hold, as a prior, that any intervention is P > 0.99 more cost-effective than GiveDirectly (or P < 0.0001, or whatever it is, that it is in fact harmful),[12] but if one did, it would indeed take masses of evidence to change one's mind. Hence the very sluggish moves in response to adverse data (the purple line suggests the posterior is also 99%+ confident SM is better than GiveDirectly).
I think I can also explain the underlying problem of this underlying problem. HLI constructs its priors exclusively from its primary meta-analytic model (albeit adapted to match the intervention of interest, and recalculated excluding any studies done on this intervention to avoid double counting). Besides the extra uncertainty (so spread) likely implied by the variety of factors covered in the sensitivity analysis, in real life our prior would be informed by other things too: the prospect that entire literatures can be misguided, a general sense (at least for me) that cash transfers are easy to beat in principle but much harder to beat in practice, and so on.
In reality, our prior-to-seeing-the-meta-analysis prior would be very broad and probably reasonably pessimistic, and (even if I'm wrong about the shortcomings I suggest earlier) the 'update' on reading it would be a bit upwards, and a little narrower, but not by that much. In turn, the 'update' on seeing (e.g.) disappointing RCT results for a given PT intervention would be a larger shift downwards, netting out that this was unlikely to be better than GiveDirectly after all.
If the Bayesian update was meant only to be a neat illustration, I would have no complaint. But instead the bottom line recommendations and assessments rely upon it: readers should indeed adopt the supposed prior the report proposes about the efficacy of PT interventions in general. Crisply, I doubt the typical reader seriously believes that (e.g.) basically any psychotherapy intervention in LMICs, so long as cost per patient is <$100, is a ~certain bet to beat cash transfers. If not, they should question the report's recommendations too.
Summing up
Criticising is easier than doing better. But I think this is a case where a basic qualitative description tells the appropriate story, and the sophisticated numerical methods are essentially a 'bridge too far' given the low quality of what they have to work with, and so confuse rather than clarify the matter. In essence:
The literature on PT in LMICs is a complete mess. Insofar as more sense can be made from it, the most important factors appear to belong to the studies investigating it (e.g. their size) rather than qualities of the PT interventions themselves.
Trying to correct the results of a compromised literature is known to be a nightmare. Here, the qualitative evidence for publication bias is compelling. But quantifying what particular value of 'a lot?' the correction should be is fraught: numerically, methods here disagree with one another dramatically, and prove highly sensitive to choices on data exclusion.
Regardless of how PT looks in general, StrongMinds, in particular, is looking less and less promising. Although initial studies looked good, they had various methodological weaknesses, and a forthcoming RCT with much higher methodological quality is expected to deliver disappointing results.
The evidential trajectory here is all too common, and the outlook typically bleak. It is dubious StrongMinds is a good pick even among psychotherapy interventions (picking one at random which doesn't have a likely-bad-news RCT imminent seems a better bet). Although pricing different interventions is hard, it is even more dubious SM is close to the frontier of 'very well evidenced' vs. 'has very promising results' plotted out by things like AMF, GD, etc. HLI's choice to nonetheless recommend SM again this giving season is very surprising. I doubt it will weather hindsight well.
All of the figures are taken from the report and appendix. The transparency is praiseworthy, although it is a pity that, despite largely looking at the right things, the report often mistakes the right conclusions to draw.
The Cochrane handbook section on meta-analysis is very clear on this (but to make it clearer, I add emphasis)
10.1 Do not start here!
It can be tempting to jump prematurely into a statistical analysis when undertaking a systematic review. The production of a diamond at the bottom of a plot is an exciting moment for many authors, but results of meta-analyses can be very misleading if suitable attention has not been given to formulating the review question; specifying eligibility criteria; identifying and selecting studies; collecting appropriate data; considering risk of bias; planning intervention comparisons; and deciding what data would be meaningful to analyse. Review authors should consult the chapters that precede this one before a meta-analysis is undertaken.
As a WIP, the data and code for this report is not yet out, but in my previous statistical noodling on the last one both study size and registration status significantly moderated the effect downwards when included together, suggesting indeed the former isn't telling you everything re. study quality.
The report does mention later (S10.2) controlling a different analysis for study quality, when looking at the effect of sample size itself:
To test for scaling effects, we add sample size as a moderator into our meta-analysis and find that for every extra 1,000 participants in a study the effect size decreases (non-significantly) by -0.09 (95% CI: -0.206, 0.002) SDs. Naively, this suggests that deploying psychotherapy at scale means its effect will substantially decline. However, when we control for study characteristics and quality, the coefficient for sample size decreases by 45% to -0.055 SDs (95% CI: -0.18, 0.07) per 1,000 increase in sample size. This suggests to us that, beyond this finding being non-significant, the effect of scaling can be controlled away with quality variables, more of which that we haven't considered might be included.
I don't think this analysis is included in the appendix or similar, but later text suggests the 'study quality' correction is a publication bias adjustment. This analysis is least fruitful when applied to study scale, as measures of publication bias are measures of study size: so finding the effects of study scale are attenuated when you control for a proxy of study scale is uninformative.
What would be informative is the impact measures of 'study scale' or publication bias have on the coefficients for the primary moderators. Maybe they too could end up 'controlled away with quality variables, more of which that we haven't considered might be included'?
The report charts a much wiser course on a different 'Outlier?' question: whether to include very long follow-up studies, where exclusion would cut the total effect in half. I also think including everything here is fine, but the report's discussion in S4.2 clearly articulates the reason for concern, displays what impact inclusion vs. exclusion has, and carefully interrogates the outlying studies to see whether they have features (beyond that they report 'outlying' results) which warrant exclusion. They end up going 'half-and-half', but consider both full exclusion and inclusion in sensitivity analysis.
If you are using study size as an (improvised) measure of study quality, excluding the smallest studies because on an informal read they are particularly low quality makes little sense: this is the trend you are interested in.
A similar type of problem crops up when one is looking at the effect of 'dosage' on PT efficacy.
The solid lines are the fit (blue linear, orange log) on the full data, whilst the dashed lines are fits with extreme values of dosage - small or large - excluded (purple). The report freely concedes its choices here are very theory-led rather than data-driven. It is also worth saying that getting more of a trend here makes a rod for SM and Friendship Bench's back, as these deliver smaller numbers of sessions than the average, so adjusting with the dashed lines and not the solid ones reduces the expected effect.
Yet the main message I would take from the scatter plot is that the data indeed looks very flat, and there is no demonstrable dose-response relationship for PT. Qualitatively, this isn't great for face validity.
To its credit, the write-up does highlight this, but does not seem to appreciate the implications are crazy: any PT intervention, so long as it is cheap enough, should be thought better than GD, even if studies upon it show very low effect size (which would usually be reported as a negative result, as almost any study in this field would be underpowered to detect effects as low as are being stipulated):
Therefore, even if the StrongMinds-specific evidence finds a small total recipient effect (as we present here as a placeholder), and we relied solely on this evidence, then it would still result in a cost-effectiveness that is similar or greater than that of GiveDirectly because StrongMinds programme is very cheap to deliver.
The report describes this clearly itself, but seems to think this is a feature rather than a bug (my emphasis):
Now, one might argue that the results of the Baird et al. study could be lower than 0.4 WELLBYs. But - assuming the same weights are given to the prior and the charity-specific data as in our analysis - even if the Baird et al. results were 0.05 WELLBYs (extremely small), then the posterior would still be 1.49 * 0.84 + 0.05 * 0.16 = 1.26 WELLBYs; namely, very close to our current posterior (1.31 WELLBYs).
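To make the quoted arithmetic explicit: the posterior here is a fixed-weight average of 1.49 WELLBYs (the prior contribution) and the charity-specific figure, with weights 0.84 and 0.16 as quoted. Plugging in a few hypothetical trial results shows how little the bottom line can move under those weights:

```python
# Posterior as the fixed-weight average in the quoted passage.
prior_wellbys = 1.49            # prior contribution quoted above
w_prior, w_data = 0.84, 0.16    # weights quoted above

for data_wellbys in (0.40, 0.05, 0.0, -0.5):   # hypothetical charity-specific results
    posterior = w_prior * prior_wellbys + w_data * data_wellbys
    print(f"data = {data_wellbys:+.2f} WELLBYs -> posterior = {posterior:.2f} WELLBYs")
```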
I'm not even sure that 'P > 0.99 better than GD' would be warranted as a posterior even for a GiveWell-recommended top charity, and I'd guess the GW staff who made the recommendation would often agree.
I think I agree with the rest of this analysis (or at least the parts I could understand). However, the following paragraph seems off:
To its credit, the write-up does highlight this, but does not seem to appreciate the implications are crazy: any PT intervention, so long as it is cheap enough, should be thought better than GD, even if studies upon it show very low effect size
Apologies if I'm being naive here, but isn't this just a known problem of first-order cost-effectiveness analysis, not with this particular analysis per se? I mean, since cheapness could be arbitrarily low (or at least down to $0.01), 'better than GD' is a bit of a red herring; the claim is merely that a single (even high-quality) study is not enough for someone to update their prior all the way down to zero, or negative.
And stated in English, this seems eminently reasonable to me. There might be good second-order etc. reasons to not act on a naive first-order analysis (e.g. risk/ambiguity aversion, wanting to promote better studies, etc). But ultimately that literal claim doesn't seem crazy to me, and naively it seems like something that naturally falls out of a direct cost-effectiveness framework.
So the problem I had in mind was in the parenthetical in my paragraph:
To its credit, the write-up does highlight this, but does not seem to appreciate the implications are crazy: any PT intervention, so long as it is cheap enough, should be thought better than GD, even if studies upon it show very low effect size (which would usually be reported as a negative result, as almost any study in this field would be underpowered to detect effects as low as are being stipulated)
To elaborate: the actual data on StrongMinds was an n~250 study by Bolton et al. 2003, then followed up by Bass et al. 2006. HLI models this in table 19:
So an initial effect of g = 1.85, and a total impact of 3.48 WELLBYs. To simulate what the SM data will show once the (anticipated to be disappointing) forthcoming Baird et al. RCT is included, they discount this[1] by a factor of 20.
Thus the simulated effect size of Bolton and Bass is now ~0.1. In this simulated case, the Bolton and Bass studies would be reporting negative results, as they would not be powered to detect an effect size as small as g = 0.1. To benchmark, the forthcoming Baird et al. study is 6x larger than these, and its power calculations have minimum detectable effects of g = 0.1 or greater.
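A crude power calculation illustrates the point (a sketch using the standard two-sample normal approximation; the per-arm n is approximated from the ~250 participants mentioned above, and it ignores design features such as covariate adjustment):

```python
# Approximate power of a two-arm trial to detect standardized effect d
# at two-sided alpha = 0.05, via the usual normal approximation.
from scipy.stats import norm

def power(d, n_per_arm, alpha=0.05):
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - d * (n_per_arm / 2) ** 0.5)

print(power(d=0.10, n_per_arm=125))   # ~0.12: hopeless for an effect of g ~ 0.1
print(power(d=1.85, n_per_arm=125))   # ~1.0: ample for the originally reported effect
```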
Yet, apparently, in such a simulated case we should conclude that StrongMinds is fractionally better than GD purely on the basis of two trials reporting negative findings, because numerically the treatment groups did slightly (but not significantly) better than the control ones.
Even if in general we are happy with 'hey, the effect is small, but it is cheap, so it's a highly cost-effective intervention', we should not accept this at the point when 'small' becomes 'too small to be statistically significant'. Analysis method + negative findings should not net out to 'fractionally better in expectation vs. cash transfers', so I take it as diagnostic that the analysis is going wrong.
I think 'this' must be the initial effect size/intercept, as 3.48 * 0.05 ~ 1.7 not 3.8. I find this counter-intuitive, as I think the drop in total effect should be super- rather than sub-linear with the intercept, but ignore that.
Thank you for your comments, Gregory. We're aware you have strong views on the subject and we appreciate your conscientious contributions. We discussed your previous comments internally but largely concluded revisions weren't necessary as we (a) had already considered them in the report and appendix, (b) will return to them in later versions and didn't expect they would materially affect the results, or (c) simply don't agree with these views. To unpack:
Study quality. We conclude the data set does contain bias, but we account for it (sections 3.2 and 5; it's an open question among academics how best to do this). We don't believe that the entire field of LMIC psychotherapy should be considered bunk, compromised, or uninformative. Our results are in line with existing meta-analyses of psychotherapy considered to have low risk of bias (see footnote).[1]
Evidentiary standards. We drew on a large number of RCTs for our systematic reviews and meta-analyses of cash transfers and psychotherapy (42 and 74, respectively). If one holds that the evidence for something as well-studied as psychotherapy is too weak to justify any recommendations, charity evaluators could recommend very little.
Outlier exclusion. The issues regarding outlier exclusion were discussed in some depth (3.2 in the main report and in Appendix B). Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach. It's consistent with not taking the entire literature at face value but also not taking guilt by association too far. If one excludes outliers, the specific way one does this has a minor effect (e.g., a 10% decline in effectiveness, see appendix). Our analysis necessarily makes analytic choices: some were pre-registered, some made on reflection, many were discussed in our sensitivity analysis. If one insisted only on using charity evaluations that had every choice pre-registered, there would be none to choose from.
Bayesian analysis: The method we use ('grid approximation', see 8.3 and 9.3) avoids subjective inputs. It is not this Bayesian analysis itself that 'stacks the deck' in favour of psychotherapy, but the evidence. Given that over 70 studies form the prior, it would be surprising if adding one study, as we did for StrongMinds, would radically alter the conclusions. [Edit 5/12/2023: on the point that StrongMinds could be more cost-effective than GiveDirectly, even if StrongMinds only has the small effect we assume it does in our hypothetical placeholder studies, it doesn't seem inconceivable that a small, less effective intervention can still be more cost-effective than a big, expensive one. For context, we estimate it costs StrongMinds $63 per intervention - providing one person with a course of therapy - whereas it costs GiveDirectly $1221 per intervention - a $1000 cash transfer which costs $221 in overheads. As the therapy is about 20x cheaper, it can be far less effective yet still more cost-effective.]
Making recommendations: we aim to recommend the most cost-effective ways of increasing WELLBYs we've found so far. While we have intuitions about how good different interventions are, our perspective as an organisation is that conclusions about what's cost-effective should be led heavily by the evidence rather than by our pre-evidential beliefs ('priors'). Given the evidence we've considered, we don't see a strong case for recommending cash transfers over psychotherapy.
This is a working report, and we'll be reflecting on how to incorporate the above, similarly psychotherapy-sceptical perspectives, and other views in the process of preparing it for academic review. In the interests of transparency, we don't plan to engage beyond our comments above so as to preserve team resources.
We find an initial effect of 0.70 SDs, reduced to 0.46 SDs after publication bias adjustments. Cuijpers et al. 2023 find an effect of psychotherapy of 0.49 SDs for studies with low risk of bias (RoB) in low, middle, and high income countries (comparisons = 218), which reduces to between 0.27 and 0.57 after publication adjustment. Tong et al. 2023 find an effect of 0.69 SDs for studies with low RoB in non-western countries (primarily low and middle income; comparisons = 36), which adjusts to between 0.42 and 0.60 after publication correction. Hence, our initial and adjusted numbers are similar.
Epistemic status: tentative, it's been a long time since reading social science papers was a significant part of my life. Happy to edit/retract this preliminary view as appropriate if someone is able to identify mistakes.
Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach.
I can't access Cuijpers et al., but I don't read Tong et al. as supporting what HLI has done here.
In their article, Tong et al. provide the effect size with no exclusions, then with outliers excluded, then with 'extreme outliers' excluded (the latter of which seems to track HLI's removal criterion). They also provide effect size with various publication-bias measures employed. See PDF at 5-6. If I'm not mistaken, the publication bias measures are applied to the no-exclusions version, not a version with outliers removed or limited to those with lower RoB. See id. at 6 tbl.2 (n = 117 for combined and 2 of 3 publication-bias effect sizes; 153 with trim-and-fill adding 36 studies; n = 74 for outliers removed & n = 104 for extreme outliers removed; the effect sizes after publication-bias measures ranging from 0.42 to 0.60 seem to be those mentioned in HLI's footnote above).
Tong et al. 'conducted sensitive analyses comparing the results with and without the inclusion of extreme outliers,' PDF at 5, discussing the results without exclusion first and then the results with exclusion. See id. at 5-6. Tables 3-5 are based on data without exclusion of extreme outliers; the versions of Tables 4 and 5 that exclude extreme outliers are relegated to the supplemental tables (not in PDF). See id. at 6. This reads to my eyes as treating both the all-inclusive and extreme-outliers-excluded data seriously, with some pride of place to the all-inclusive data.
I don't read Tong et al. as having reached a conclusion that either the all-inclusive or extreme-outliers-excluded results were more authoritative, saying things like:
Lastly, we were unable to explain the different findings in the analyses with vs. without extreme outliers. The full analyses that included extreme outliers may reflect the true differences in study characteristics, or they may imply the methodological issues raised by studies with effect sizes that were significantly higher than expected.
and
Therefore, the larger treatment effects observed in non-Western trials may not necessarily imply superior treatment outcomes. On the other hand, it could stem from variations in study design and quality.
and
Further research is required to explain the reasons for the differences in study design and quality between Western and non-Western trials, as well as the different results in the analyses with and without extreme outliers.
PDF at 10.
Of course, 'further research needed' is an almost inevitable conclusion of the majority of academic papers, and Tong et al. have the luxury of not needing to reach any conclusions to inform the recommended distribution of charitable dollars. But I don't read the article by Tong et al. as supporting the proposition that it is appropriate to just run with the outliers-excluded data. Rather, I read the article as suggesting that - at least in the absence of compelling reasons to the contrary - one should take both analyses seriously, but neither definitively.
I lack confidence in what taking both analyses seriously, but neither definitively, would mean for purposes of conducting a cost-effectiveness analysis. But I speculate that it would likely involve some sort of weighting of the two views.
When we said 'Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach', I can see that what we meant by 'similar approach' was unclear. We meant that, conditional on removing outliers, they identify a similar or greater range of effect sizes as outliers as we do.
This was primarily meant to address the question raised by Gregory about whether to include outliers: 'The cut data by and large doesn't look visually "outlying" to me.'
To rephrase, I think that Cuijpers et al. and Tong et al. would agree that the data we cut looks outlying. Obviously, this is a milder claim than our comment could be interpreted as making.
Turning to the wider implications of these meta-analyses: as you rightly point out, they don't have a 'preferred specification' and are mostly presenting the options for doing the analysis. They present analyses with and without outlier removal in their main analysis, and they adjust for publication bias without outliers removed (which is not what we do). The first analytic choice doesn't clearly support including or excluding outliers, and the second - if it supports any option - favors Greg's proposed approach of correcting for publication bias without outliers removed.
I think one takeaway is that we should consider surveying the literature and some experts in the field, in a non-leading way, about what choices they'd make if they didn't have 'the luxury of not having to reach a conclusion'.
I think it seems plausible to give some weight to analyses with and without excluding outliers - if we are able to find a reasonable way to treat the 2 out of 7 publication bias correction methods that produce results suggesting that the effect of psychotherapy is in fact sizably negative. We'll look into this more before our next update.
Cutting the outliers here was part of our first-pass attempt at minimising the influence of dubious effects, which we'll follow up with a Risk of Bias analysis in the next version. Our working assumption was that effects greater than ~2 standard deviations are suspect on theoretical grounds (that is, if they behave anything like SDs in a normal distribution), and seemed more likely to be the result of some error-generating process (e.g. data-entry error, bias) than a genuine effect.
We'll look into this more in our next pass, but for this version we felt outlier removal was the most sensible choice.
I recently discovered that GiveWell decided to exclude an outlier in their water chlorination meta-analysis. I'm not qualified to judge their reasoning, but maybe others with sufficient expertise will weigh in?
We excluded one RCT that meets our other criteria because we think the results are implausibly high such that we don't believe they represent the true effect of chlorination interventions (more in footnote).[4] It's unorthodox to exclude studies for this reason when conducting a meta-analysis, but we chose to do so because we think it gives us an overall estimate that is more likely to represent the true effect size.
Evidentiary standards. We drew on a large number of RCTs for our systematic reviews and meta-analyses of cash transfers and psychotherapy (42 and 74, respectively). If one holds that the evidence for something as well-studied as psychotherapy is too weak to justify any recommendations, charity evaluators could recommend very little.
A comparatively minor point, but it doesn't seem to me that the claims in Greg's post [more] are meaningfully weakened by whether or not psychotherapy is well-studied (as measured by how many RCTs HLI has found on it, noting that you already push back on some object-level disagreement on study quality in point 1, which feels more directly relevant).
It also seems pretty unlikely to be true that psychotherapy being well studied necessarily means that StrongMinds is a cost-effective intervention comparable to current OP/GW funding bars (which is one main point of contention), or that charity evaluators need 74+ RCTs in an area before recommending a charity. Is the implicit claim being made here that the evidence for StrongMinds being a top charity is stronger than that of AMF, which is (AFAIK) based on fewer than 74 RCTs?[1]
I have previously let HLI have the last word, but this is too egregious.
Study quality: Publication bias (a property of the literature as a whole) and risk of bias (particular to each of the individual studies which comprise it) are two different things.[1] Accounting for the former does not account for the latter. This is why the Cochrane handbook, the three meta-analyses HLI mentions here, and HLI's own protocol distinguish the two.
Neither Cuijpers et al. 2023 nor Tong et al. 2023 further adjust their low risk of bias subgroup for publication bias.[2] I tabulate the relevant figures from both studies below:
So HLI indeed gets similar initial results and publication bias adjustments to the two other meta-analyses they find. Yet - although these are not like-for-like - these other two meta-analyses find similarly substantial effect reductions when accounting for study quality as they do when assessing publication bias of the literature as a whole.
Although neither of these studies 'adjusts for both', one mentioned later - Cuijpers et al. 2020 - does. It finds an additional discount to effect size when doing so.[4] So it suggests that indeed 'accounting for' publication bias does not adequately account for risk of bias en passant.
Tong et al. 2023 - the meta-analysis expressly on PT in LMICs rather than PT generally - finds a higher prevalence of indicators of lower study quality in LMICs, and notes this as a competing explanation for the outsized effects.[5]
As previously mentioned, in the previous meta-analysis unregistered trials had a 3x greater effect size than registered ones. None of the trials on StrongMinds published so far were registered. Baird et al., which is registered, is anticipated to report disappointing results.
Evidentiary standards: Indeed, the report drew upon a large number of studies. Yet even a synthesis of 72 million (or whatever) studies can be misleading if issues of publication bias, risk of bias in individual studies (and so on) are not appropriately addressed. That an area has 72 (or whatever) studies upon it does not mean it is well-studied, nor would this number (nor any number) be sufficient, by itself, to satisfy any evidentiary standard.
Outlier exclusion: The report's approach to outlier exclusion is dissimilar to both Cuijpers et al. 2020 and Tong et al. 2023, and further is dissimilar with respect to the features I highlighted as major causes for concern re. HLI's approach in my original comment.[6] Specifically:
Both of these studies present the analysis with the full data first in their results. Contrast HLI's report, where only the results with outliers excluded are presented in the main results, and the analysis without exclusion is found only in the appendix.[7]
Both these studies also report the results with the full data as their main findings (e.g. in their respective abstracts). Cuijpers et al. mentions their outlier-excluded results primarily in passing ('outliers' appears once in the main text); Tong et al. relegates a lot of theirs to the appendix. HLI's report does the opposite. (cf. fn 7 above)
Only Tong et al. does further sensitivity analysis on the 'outliers excluded' subgroup. As Jason describes, this is done alongside the analysis where all data is included, and the qualitative and quantitative differences which result from this analysis choice are prominently highlighted to the reader and extensively discussed. In HLI's report, by contrast, the factor of 3 reduction to the ultimate effect size when outliers are not excluded is only alluded to qualitatively in a footnote (fn 33)[8] of the main report's section (3.2) arguing why outliers should be excluded, is not included in the report's sensitivity analysis, and is only found in the appendix.[9]
Both studies adjust for publication bias only on all data, not on data with outliers excluded, and these are the publication bias findings they present. Contrast HLI's report.
The Cuijpers et al. 2023 meta-analysis previously mentioned also differs in its approach to outlier exclusion from HLI's report in the ways highlighted above. The Cochrane handbook also supports my recommendations on what approach should be taken, which is what the meta-analyses HLI cites approvingly as examples of 'sensible practice' actually do, but what HLI's own work does not.
The report's (non-)presentation of the stark quantitative sensitivity of its analysis - material to its bottom line recommendations - to whether outliers are excluded is clearly inappropriate. It is indefensible if, as I have suggested may be the case, the analysis with outliers included was indeed the analysis first contemplated and conducted.[10] It is even worse if the publication bias corrections on the full data were what in fact prompted HLI to start making alternative analysis choices which happened to substantially increase the bottom line figures.
Bayesian analysis: Bayesian methods notoriously do not avoid subjective inputs - most importantly here, what information we attend to when constructing an 'informed prior' (or, if one prefers, how to weigh the results with a particular prior stipulated).
In any case, they provide no protection from misunderstanding the calculation being performed, and so misinterpreting the results. The Bayesian method in the report is actually calculating the (adjusted) average effect size of psychotherapy interventions in general, not the expected effect of a given psychotherapy intervention. Although a trial on StrongMinds which shows it is relatively ineffectual should not update our view much on the efficacy of psychotherapy interventions (/those similar to StrongMinds) as a whole, it should update us dramatically on the efficacy of StrongMinds itself.
Although as a methodological error this is a subtle one (at least, subtle enough for me not to initially pick up on it), the results it gave are nonsense to the naked eye (e.g. SM would still be held as a GiveDirectly-beating intervention even if there were multiple high-quality RCTs on StrongMinds giving flat or negative results). HLI should have seen this themselves, should have stopped to think after I highlighted these facially invalid outputs of their method in early review, and definitely should not be doubling down on these conclusions even now.
Making recommendations: Although there are other problems, those I have repeated here make the recommendations of the report unsafe. This is why I recommended against publication. Specifically:
Although I don't think the Bayesian method the report uses would be appropriate, if it was calculated properly on its own terms (e.g. prediction intervals, not confidence intervals, to generate the prior, etc.), and leaving everything else the same, the SM bottom line would drop (I'm pretty sure) by a factor of a bit more than 2 (see the sketch after this list).
The results are already essentially sensitive to whether outliers are excluded in analysis or not: SM goes from 3.7x → ~1.1x GD on the back of my envelope, again leaving all else equal.
(1) and (2) combined should net out to SM < GD; (1) or (2) combined with some of the other sensitivity analyses (e.g. spillovers) will also likely net out to SM < GD. Even if one still believes the bulk of (appropriate) analysis paths still support a recommendation, this sensitivity should be made transparent.
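On point (1) above, the distinction matters because a random-effects confidence interval only describes uncertainty about the average effect, whereas a prediction interval also carries the between-study heterogeneity a new intervention would face. A sketch with placeholder values (the pooled effect, its standard error, tau, and the study count are all assumed, not HLI's figures):

```python
# 95% CI vs. 95% prediction interval in a random-effects meta-analysis
# (Higgins et al. 2009 formula; all inputs are placeholder values).
from scipy.stats import t

mu, se_mu, tau, k = 0.5, 0.05, 0.35, 70   # assumed pooled effect, SE, tau, no. of studies

ci = (mu - 1.96 * se_mu, mu + 1.96 * se_mu)
half_pi = t.ppf(0.975, df=k - 2) * (tau**2 + se_mu**2) ** 0.5
pi = (mu - half_pi, mu + half_pi)
print("95% CI:", ci)   # narrow: a prior built from this looks very confident
print("95% PI:", pi)   # much wider: substantial mass below any given benchmark
```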
E.g. Even if all studies in the field are conducted impeccably, if journals only accept positive results the literature may still show publication bias. Contrariwise, even if all findings get published, failures in allocation/blinding/etc. could lead to systemic inflation of effect sizes across the literature. In reality - and here - you often have both problems, and they only partially overlap.
Jason correctly interprets Tong et al. 2023: the number of studies included in their publication bias corrections (117 [+36 w/ trim and fill]) equals the number of all studies, not the low risk of bias subgroup (36 - see table 3). I do have access to Cuijpers et al. 2023, which has a very similar results table, with parallel findings (i.e. they do their publication bias corrections on the whole set of studies, not on a low risk of bias subgroup).
HLI's report does not assess the quality of its included studies, although it plans to do so. I appreciate GRADEing 90 studies or whatever is tedious and time consuming, but skipping this step to crack on with the quantitative synthesis is very unwise: any such synthesis could be hugely distorted by low quality studies.
Risk of bias is another important problem in research on psychotherapies for depression. In 70% of the trials (92/309) there was at least some risk of bias. And the studies with low risk of bias, clearly indicated smaller effect sizes than the ones that had (at least some) risk of bias. Only four of the 15 specific types of therapy had 5 or more trials without risk of bias. And the effects found in these studies were more modest than what was found for all studies (including the ones with risk of bias). When the studies with low risk of bias were adjusted for publication bias, only two types of therapy remained significant (the 'Coping with Depression' course, and self-examination therapy).
The larger effect sizes found in non-Western trials were related to the presence of wait-list controls, high risk of bias, cognitive-behavioral therapy, and clinician-diagnosed depression (p < 0.05). The larger treatment effects observed in non-Western trials may result from the high heterogeneous study design and relatively low validity. Further research on long-term effects, adolescent groups, and individual-level data are still needed.
Apparently, all that HLI really meant with 'Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach' [my emphasis] was merely '[C]onditional on removing outliers, they identify a similar or greater range of effect sizes as outliers as we do.' (see).
Yeah, right.
I also had the same impression as Jason that HLI's reply repeatedly strawmans me. The passive-aggressive sniping sprinkled throughout and the subsequent backpedalling (in fairness, I suspect by people who were not at the keyboard of the corporate account) are less than impressive too. But it's nearly Christmas, so beyond this footnote I'll let all this slide.
Received opinion is typically that outlier exclusion should be avoided without a clear rationale why the 'outliers' arise from a clearly discrepant generating process. If it is to be done, the results of the full data should still be presented as the primary analysis
If we didn't first remove these outliers, the total effect for the recipient of psychotherapy would be much larger (see Section 4.1) but some publication bias adjustment techniques would over-correct the results and suggest the completely implausible result that psychotherapy has negative effects (leading to a smaller adjusted total effect). Once outliers are removed, these methods perform more appropriately. These methods are not magic detectors of publication bias. Instead, they make inferences based on patterns in the data, and we do not want them to make inferences on patterns that are unduly influenced by outliers (e.g., conclude that there is no effect - or, more implausibly, negative effects - of psychotherapy because of the presence of unreasonable effect sizes of up to 10 gs creating large asymmetric patterns). Therefore, we think that removing outliers is appropriate. See Section 5 and Appendix B for more detail.
The sentence in the main text this is a footnote to says:
Removing outliers this way reduced the effect of psychotherapy and improves the sensibility of moderator and publication bias analyses.
[W]ithout excluding data, SM drops from ~3.6x GD to ~1.1x GD. Yet it doesn't get a look in for the sensitivity analysis, where HLI's 'less favourable' outlier method involves taking an average of the other methods (discounting by ~10%), but not doing no outlier exclusion at all (discounting by ~70%).
My remark about 'Even if you didn't pre-specify, presenting your first cut as the primary analysis helps for nothing up my sleeve reasons', which Dwyer mentions elsewhere, was a reference to 'nothing up my sleeve numbers' in cryptography. In the same way that picking the initial digits of pi or e for arbitrary constants reassures readers the author didn't pick numbers with some ulterior purpose they are not revealing, reporting what one's first analysis showed means readers can compare it to where you ended up after making all the post-hoc judgement calls in the garden of forking paths. 'Our first intention analysis would give x, but we ended up convincing ourselves the most appropriate analysis gives a bottom line of 3x' would rightly arouse a lot of scepticism.
I've already mentioned I suspect this is indeed what has happened here: HLI's first cut included all data, but it argued itself into making the choice to exclude, which gave a 3x higher 'bottom line'. Beyond 'You didn't say you'd exclude outliers in your protocol' and 'basically all of your discussion in the appendix re. outlier exclusion concerns the results of publication bias corrections on the bottom line figures', I kinda feel HLI not denying it is beginning to invite an adverse inference from silence. If I'm right about this, HLI should come clean.
Although there are other problems, those I have repeated here make the recommendations of the report unsafe.
Even if one still believes the bulk of (appropriate) analysis paths still support a recommendation, this sensitivity should be made transparent.
The first statement says HLI's recommendation is unsafe, but the second implies it is reasonable as long as the sensitivity is clearly explained. I'm grateful to Greg for presenting the analysis paths which lead to SM < GD, but it's unclear to me how much those paths should be weighted compared to all the other paths which lead to SM > GD.
It's notable that Cuijpers (who has done more than anyone in the field to account for publication bias and risk of bias) is still confident that psychotherapy is effective.
I was also surprised by the use of 'unsafe'. Less cost-effective maybe, but 'unsafe' implies harm and I haven't seen any evidence to support that claim.
You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this. Your prior here is that there is a 99%+ chance that StrongMinds will work better than GiveDirectly before looking at any actual StrongMinds results; this is a wildly implausible claim.
You also state 'If one holds that the evidence for something as well-studied as psychotherapy is too weak to justify any recommendations, charity evaluators could recommend very little.' Nothing in Gregory's post suggests that he thinks anything like this; he gives a g of ~0.5 in his meta-analysis that doesn't improperly remove outliers without good cause. A g of ~0.5 suggests that individuals suffering from depression would likely greatly benefit from seeking therapy. There is a massive difference between 'evidence behind psychotherapy is too weak to justify any recommendations' and claiming that 'this particular form of therapy is not vastly better than GiveDirectly with a probability higher than 99% before even looking at RCT results'. Trying to throw out Gregory's claims here over a seemingly false statement about his beliefs seems pretty offensive to me.
[Disclaimer: I worked at HLI until March 2023. I now work at the International Alliance of Mental Health Research Funders]
Gregory says
these problems are sufficiently major I think potential donors are ill-advised to follow the recommendations and analysis in this report.
That is a strong claim to make and it requires him to present a convincing case that GiveDirectly is more cost-effective than StrongMinds. I've found his previous methodological critiques to be constructive and well-explained. To their credit, HLI has incorporated many of them in the updated analysis. However, in my opinion, the critiques he presents here do not make a convincing case.
Taking his summary points in turn...
The literature on PT in LMICs is a complete mess. Insofar as more sense can be made from it, the most important factors appear to belong to the studies investigating it (e.g. their size) rather than qualities of the PT interventions themselves.
I think this is much too strong. The three meta-analyses (and Gregory's own calculations) give me confidence that psychotherapy in LMICs is effective, although the effects are likely to be small.
2. Trying to correct the results of a compromised literature is known to be a nightmare. Here, the qualitative evidence for publication bias is compelling. But quantifying what particular value of "a lot?" the correction should be is fraught: numerically, methods here disagree with one another dramatically, and prove highly sensitive to choices on data exclusion.
There is no consensus on the appropriate methodology for adjusting for publication bias. I don't have an informed opinion on this, but HLI's approach seems reasonable to me and I think it's reasonable for Greg to take a different view. From my limited understanding, neither approach makes GiveDirectly more cost-effective.
3. Regardless of how PT looks in general, StrongMinds, in particular, is looking less and less promising. Although initial studies looked good, they had various methodological weaknesses, and a forthcoming RCT with much higher methodological quality is expected to deliver disappointing results.
We don't have any new data on StrongMinds, so I'm confused why Greg thinks it's "less and less promising". HLI's Bayesian approach is a big improvement on the subjective weightings they used in the first cost-effectiveness analysis. As with publication bias, it's reasonable to hold different views on how to construct the prior, but personally, I do believe that any psychotherapy intervention in LMICs, so long as cost per patient is <$100, is a ~certain bet to beat cash transfers. There are no specific models of psychotherapy that perform better than the others, so I don't find it surprising that training people to talk to other people about their problems is a more cost-effective way to improve wellbeing in LMICs than cash transfers. Cash transfers are much more expensive and the effects on subjective wellbeing are also small.
4. The evidential trajectory here is all too common, and the outlook typically bleak. It is dubious StrongMinds is a good pick even among psychotherapy interventions (picking one at random which doesn't have a likely-bad-news RCT imminent seems a better bet). Although pricing different interventions is hard, it is even more dubious SM is close to the frontier of "very well evidenced" vs. "has very promising results" plotted out by things like AMF, GD, etc. HLI's choice to nonetheless recommend SM again this giving season is very surprising. I doubt it will weather hindsight well.
HLI had to start somewhere, and I think we should give credit to StrongMinds for being brave enough to open themselves up to the scrutiny they've faced. The three meta-analyses and the tentative analysis of Friendship Bench suggest there is "altruistic gold" to be found here, and HLI has only just started to dig. The field is growing quickly and I'm optimistic about the trajectories of CE-incubated charities like Vida Plena and Kaya Guides.
In the meantime, although the gap between GiveDirectly and StrongMinds has clearly narrowed, I remain unconvinced that cash is clearly the better option (but I do remain open-minded and open to pushback).
You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this.
A specific therapy treatment is drawn from the distribution of therapy treatments. Our best guess about the distribution of value of a specific therapy treatment, without knowing anything about it, should take into account only that it comes from this distribution of therapy treatments. So I don't see what's unreasonable about this.
When running a meta-analysis, you can either use a fixed effect assumption (that all variation between studies is just due to sampling error) or a random effects assumption (that studies differ in terms of their "true effects"). Therapy treatments differ greatly, so you have to use a random effects model in this case. Then the prior you use for StrongMinds' impact should have a variance that is the sum of the variance in the estimate of the average therapy treatment effect AND the variance among different treatments' effects; both numbers should be available from a random effects meta-analysis. I'm not quite sure what HLI did exactly to get their prior for StrongMinds here, but for some reason the variance on it seems WAY too low, and I suspect that they neglected the second type of variance that they should have gotten from a random effects meta-analysis.
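A minimal numeric illustration of that decomposition (the two numbers below are hypothetical, picked only to show the scale of the difference; they are not HLI's estimates):

```python
import math

# Hypothetical inputs for illustration only (not HLI's actual figures):
se_mean = 0.05   # standard error of the meta-analytic average effect
tau = 0.45       # between-intervention standard deviation (heterogeneity)

# Prior sd for the *average* psychotherapy effect: just the SE of the mean.
sd_average = se_mean

# Prior sd for a *specific, as-yet-unexamined* programme: must also include the heterogeneity.
sd_specific = math.sqrt(se_mean**2 + tau**2)

print(f"sd of prior on the average effect:   {sd_average:.3f}")
print(f"sd of prior on a specific programme: {sd_specific:.3f}")  # roughly 9x wider here
```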
Section 2.2.2 of their report is titled "Choosing a fixed or random effects model". They discuss the points you make and clearly say that they use a random effects model. In section 2.2.3 they discuss the standard measures of heterogeneity they use. Section 2.2.4 discusses the specific 4-level random effects model they use and how they did model selection.
I reviewed a small section of the report prior to publication, but none of these sections, and it only took me 5 minutes now to check what they did. I'd like the EA Forum to have a higher bar (as Gregory's parent comment exemplifies) before throwing around easily checkable suspicions about what (very basic) mistakes might have been made.
Yes, some of Greg's examples point to the variance being underestimated, but the problem does not inherently come from the idea of using the distribution of effects as the prior, since that should include both the sampling uncertainty and the true heterogeneity. That would be the appropriate approach even under a random effects model (I think; I'm more used to thinking in terms of Bayesian hierarchical models and the equivalence might not hold).
I think the fundamental point (i.e. "You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this.") is on the right lines, although the subsequent discussion of fixed/random effect models might confuse the issue. (Cf. my reply to Jason.)
The typical output of a meta-analysis is a (~) average effect size estimate (the diamond at the bottom of the forest plot, etc.). The confidence interval given for that is (very roughly)[1] the interval in which we predict the true average effect likely lies. So for the basic model given in Section 4 of the report, the average effect size is 0.64, 95% CI (0.54 to 0.74). So (again, roughly) our best guess of the "true" average effect size of psychotherapy in LMICs from our data is 0.64, and we're 95% sure(*) this average is somewhere between 0.54 and 0.74.
Clearly, it is not the case that if we draw another study from the same population, we should be 95% confident(*) the effect size of this new data point will lie between 0.54 and 0.74. This would not be true even in the unicorn case where there's no between-study heterogeneity (e.g. all the studies are measuring the same effect modulo sampling variance), and even less so when heterogeneity is marked, as here. To answer that question, what you want is a prediction interval.[2] This interval is always wider, and almost always significantly so, than the confidence interval for the average effect: in the same analysis with the 0.54-0.74 confidence interval, the prediction interval was -0.27 to 1.55.
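To make the arithmetic concrete, here is a minimal sketch (Python) that back-solves the between-study heterogeneity from the intervals just quoted and reconstructs the prediction interval. It uses the normal approximation (z = 1.96) rather than the t-quantile a textbook prediction interval would use, so it is illustrative only:

```python
import math

z = 1.96  # normal approximation; a textbook PI would use a t-quantile

mean = 0.64
ci_lo, ci_hi = 0.54, 0.74      # 95% CI for the average effect (report, S4)
pi_lo, pi_hi = -0.27, 1.55     # 95% prediction interval (report, S4)

se = (ci_hi - ci_lo) / (2 * z)             # ~0.05: uncertainty in the mean alone
total_sd = (pi_hi - pi_lo) / (2 * z)       # ~0.46: mean uncertainty plus heterogeneity
tau = math.sqrt(total_sd**2 - se**2)       # ~0.46: implied between-study heterogeneity

print(f"SE of the mean: {se:.3f}, implied tau: {tau:.3f}")
print(f"Reconstructed PI: ({mean - z*total_sd:.2f}, {mean + z*total_sd:.2f})")
```

The implied heterogeneity (~0.46) dwarfs the uncertainty in the mean (~0.05), which is the point of the paragraph above.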
Although the full model HLI uses in constructing informed priors is different from that presented in S4 (e.g. it includes a bunch of moderators), the priors appear to be constructed with Monte Carlo on the confidence intervals for the average, not the prediction interval for the data. So I believe the informed prior is actually one for the (adjusted) "average effect of psychotherapy interventions as a whole", not a prior for (e.g.) "the effect size reported in a given PT study". The latter would need to use the prediction intervals, and have a much wider distribution.[3]
I think this ably explains exactly why the Bayesian method for (e.g.) StrongMinds gives very bizarre results when deployed as the report does, and why those results make much more sense if re-interpreted as (in essence) computing the expected effect size of "a future StrongMinds-like intervention", but not the effect size we should believe StrongMinds actually has once in receipt of trial data upon it specifically. E.g.:
The histogram of effect sizes shows some comparisons had an effect size < 0, but the "informed prior" suggests P(ES < 0) is extremely low. As a prior for the effect size of the next study, it is much too confident, given the data, that a trial will report positive effects (you have >1/72 studies being negative, so surely it cannot be <1%, etc.). As a prior for the average effect size, this confidence is warranted: given the large number of studies in our sample, most of which report positive effects, we would be very surprised to discover the true average effect size is negative.
The prior doesn't update very much on the data provided. E.g. when we stipulate the trials on StrongMinds report a near-zero effect of 0.05 WELLBYs, our estimate of 1.49 WELLBYs goes to 1.26: so we should (apparently) believe in such a circumstance the efficacy of SM is ~25 times greater than the trial data upon it indicates. This is, obviously, absurd. However, such a small update is appropriate if it applied to ~the average of PT interventions as a whole: observing that a new PT intervention has much-below-average results should cause our average to shift a little towards the new findings, but not much.
In essence, the update we are interested in is not "How effective should we expect future interventions like StrongMinds to be, given the data on StrongMinds' efficacy", but simply "How effective should we expect StrongMinds to be, given the data on how effective StrongMinds is". Given the massive heterogeneity and wide prediction interval, the (correct) informed prior is pretty uninformative: it isn't that surprised by anything in a very wide range of values, and so on finding trial data on SM with a given estimate in this range, our estimate should update to match it pretty closely.[4]
(This also should mean, contrary to what the report suggests, the SM estimate is not that "robust" to adverse data. Eyeballing it, I'd guess the posterior should go down by a factor of 2-3 conditional on the stipulated data versus the currently reported results.)
I'm aware confidence intervals are not credible intervals, and that "the 95% CI tells you where the true value is with 95% likelihood" strictly misinterprets what a confidence interval is, etc. (see). But perhaps it is "close enough", so I'm going to pretend these are credible intervals, and asterisk each time I assume the strictly incorrect interpretation.
The summary estimate and confidence interval from a random-effects meta-analysis refer to the centre of the distribution of intervention effects, but do not describe the width of the distribution. Often the summary estimate and its confidence interval are quoted in isolation and portrayed as a sufficient summary of the meta-analysis. This is inappropriate. The confidence interval from a random-effects meta-analysis describes uncertainty in the location of the mean of systematically different effects in the different studies. It does not describe the degree of heterogeneity among studies, as may be commonly believed. For example, when there are many studies in a meta-analysis, we may obtain a very tight confidence interval around the random-effects estimate of the mean effect even when there is a large amount of heterogeneity. A solution to this problem is to consider a prediction interval (see Section 10.10.4.3).
Obviously, this is all modulo the other issues I raise with the meta-analysis as a whole, and the fact that we would in practice incorporate other sources of information into our actual prior, etc.
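To make the "sluggish update" complaint concrete, here is a minimal conjugate normal-normal sketch. The prior mean (1.49) and the stipulated near-zero data (0.05) come from the discussion above; the standard deviations are my assumptions, chosen only to contrast a narrow "average-effect" prior with a wide "prediction-interval" prior:

```python
def posterior(prior_mean, prior_sd, data_mean, data_sd):
    """Conjugate normal-normal update with known variances."""
    w_prior = 1 / prior_sd**2
    w_data = 1 / data_sd**2
    mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
    sd = (w_prior + w_data) ** -0.5
    return mean, sd

data_mean, data_sd = 0.05, 0.5   # stipulated near-zero SM result; the sd is assumed

for label, prior_sd in [("narrow 'average-effect' prior", 0.2),
                        ("wide 'prediction-interval' prior", 1.0)]:
    m, s = posterior(1.49, prior_sd, data_mean, data_sd)
    print(f"{label}: posterior mean ~{m:.2f} (sd ~{s:.2f})")
```

Under the narrow prior the posterior barely moves (~1.3); under the wide prior it lands close to the data (~0.3), which is the behaviour argued for above.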
Agree that there seem to be some strawmen in HLI's response:
We don't believe that the entire field of LMIC psychotherapy should be considered bunk, compromised, or uninformative.
Has anyone suggested that the "entire field of LMIC psychotherapy" is "bunk"?
If one insisted only on using charity evaluations that had every choice pre-registered, there would be none to choose from.
Has anyone suggested that, either? As I understand it, it's typical to look at debatable choices that happen to support the author's position with a somewhat more skeptical lens if they haven't been pre-registered. I don't think anyone has claimed that a lack of pre-registration for certain choices is somehow fatal, only that it is a factor to consider.
Our (HLI's) comment was in reference to these quotes.
The literature on PT in LMICs is a complete mess.
Trying to correct the results of a compromised literature is known to be a nightmare.
I think it is valid to describe these as saying the literature is compromised and (probably) uninformative. I can understand your complaint about the word "bunk". Apologies to Gregory if this is a mischaracterization.
Regarding our comment:
If one insisted only on using charity evaluations that had every choice pre-registered, there would be none to choose from.
And your comment:
I don't think anyone has claimed lack of certain choices being pre-registered is somehow fatal, only a factor to consider.
Yeah, I think this is a valid point, and the post should have quoted Gregory directly. The point we were hoping to make here is that we've attempted to provide a wide range of sensitivity analyses throughout our report, to an extent that we think goes beyond most charity evaluations. It's not surprising that we've missed some in this draft that others would like to see. Gregory's comment that "Even if you didn't pre-specify, presenting your first cut as the primary analysis helps for nothing up my sleeve reasons" seemed to imply that we were deliberately hiding something, but in my view our interpretation was overly pessimistic.
I think I can diagnose the underlying problem: Bayesian methods are very sensitive to the stipulated prior. In this case, the prior is likely too high, and definitely too narrow/overconfident.
Would it have been better to start with a stipulated prior based on evidence of short-course general-purpose[1] psychotherapy's effect size generally, update that prior based on the LMIC data, and then update that on charity-specific data?
One of the objections to HLI's earlier analysis was that it was just implausible in light of what we know of psychotherapy's effectiveness more generally. I don't know that literature well at all, so I don't know how well the effect size in the new stipulated prior compares to the effect size for short-course general-purpose psychotherapy generally. However, given the methodological challenges with measuring effect size in LMICs on the available data, it seems like a more general understanding of the effect size should factor into the informed prior somehow. Of course, the LMIC context is considerably different from the context in which most psychotherapy studies have been done, but I am guessing it would be easier to manage quality-control issues with the much broader research base available. So both knowledge bases would likely inform my prior before turning to charity-specific evidence.
[Edit 6-Dec-23: Greg's response to the remainder of this comment is much better than my musings below. I'd suggest reading that instead!]
To my not-very-well-trained eyes, one hint that there's an issue with the application of Bayesian analysis here is the failure of the LMIC effect-size model to come anywhere close to predicting the effect size suggested by the SM-specific evidence. If the model were sound, it would seem very unlikely that the first organization evaluated to the medium-to-in-depth level would happen to have charity-specific evidence suggesting an effect size that diverged so strongly from what the model predicted. I think most of us, when faced with such a circumstance, would question whether the model was sound and would put it on the shelf until performing other charity-specific evaluations at the medium-to-in-depth level. That would be particularly true to the extent the model's output depended significantly on the methodology used to clean up some problems with the data.[2]
If Greg's analysis is correct, it seems I shouldn't assign the informed prior much more credence than I have credence in HLI's decision to remove outliers (and, to a lesser extent, its choice of a method). So, again to my layperson way of thinking, one partial way of thinking about the crux could be that the reader must assess their confidence in HLI's outlier-treatment decision vs. their confidence in the Baird/Ozler RCT on SM.
What prior to formally pick is tricky: I agree the factors you note would be informative, but how to weigh them (vs. other sources of informative evidence) could be a matter of taste. However, sources of evidence like this could be handy to use as "benchmarks" to see whether the prior (/results of the meta-analysis) is consilient with them, and if not, explore why.
But I think I can now offer a clearer explanation of what is going wrong. The hints you saw point in this direction, although not quite as you describe.
One thing worth being clear on is that HLI is not updating on the actual SM-specific evidence. As they model it, the estimated effect on this evidence is an initial effect of g = 1.8 and a total effect of ~3.48 WELLBYs, so this would lie on the right tail, not the left, of the informed prior.[1] They discount the effect by a factor of 20 to generate the data they feed into their Bayesian method. Stipulating data which would be (according to their prior) very surprisingly bad is in itself a strength, not a concern, of the conservative analysis they are attempting.
Next, we need to distinguish an average effect size from a prediction interval. HLI does report both (Section 4) for a more basic model of PT in LMICs. The (average, random-effects) effect size is 0.64 (95% CI 0.54 to 0.74), whilst the prediction interval is -0.27 to 1.55. The former gives you the best guess of the average effect (with a confidence interval); the latter tells you, if I do another study like those already included, the range I can expect its effect size to fall within. By loose analogy: if I sample 100 people and their average height is roughly 5'7" (95% CI 5'6" to 5'8"), the 95% range of the individual heights will range much more widely (say 5'0" to 6'2").
Unsurprisingly (especially given the marked heterogeneity), the prediction interval is much wider than the confidence interval around the average effect size. Crucially, if our "next study" reports an effect size of (say) 0.1, our interpretation typically should not be: "This study can't be right; the real effect of the intervention it studies must be much closer to 0.6". Rather, as findings are heterogeneous, it is much more likely to be a study which (genuinely) reports a below-average effect.[2] Back to the loose analogy: we would (typically) assume we got it right if we measured some more people at (e.g.) 6'0" and 5'4", even though these are significantly above or below the 95% confidence interval of our average, and only start to doubt measurements well outside our prediction interval (e.g. 3'10", 7'7").
Now the problem with the informed prior becomes clear: it is (essentially) being constructed from the confidence intervals of the average, not the prediction intervals for the data, from its underlying models. As such, it is a prior not for "What is the expected impact of a given PT intervention", but rather "What is the expected average impact of PT interventions as a whole".[3]
With this understanding, the previously bizarre behaviour is made sensible. For the informed prior should assign very little credence to the average impact of PT overall being ~0.4 per the stipulated StrongMinds data, even though it should not be that surprised that a particular intervention (e.g. StrongMinds!) has an impact much below average, as many other PT interventions studied also do (cf. although I shouldn't be surprised if I measure someone at 5'2", I should be very surprised if the true average height is actually 5'2", given my large sample averages 5'7"). Similarly, if we are given a much smaller additional sample reporting a much different effect size, the updated average effect should remain pretty close to the prior (e.g. if a handful of new people have heights < 5'4", my overall average goes down a little, but not by much).
Needless to say, the results of such an analysis, if indeed for the "average effect size of psychotherapy as a whole", are completely inappropriate for the "expected effect size of a given psychotherapy intervention", which is the use they are put to in the report.[4] If the measured effect size of StrongMinds were indeed ~0.4, the fact that psychotherapy interventions average substantially greater effects of ~1.4 gives very little reason to conclude the effect of StrongMinds is in fact much higher (e.g. ~1.3). In the same way, if I measure your height as 5'0", the fact that the average height of other people I've measured is 5'7" does not mean I should conclude you're probably about 5'6".[5]
Minor: it does lie pretty far along the right tail of the prior (< top 1st percentile?), so maybe one could be a little concerned. Not much, though: given HLI was searching the literature for particularly effective PT interventions, it doesn't seem that surprising that this effort could indeed find one at the far-ish right tail of apparent efficacy.
One problem for many types of the examined psychotherapy is that the level of heterogeneity was high, and many of the prediction intervals were broad and included zero. This means that it is difficult to predict the effect size of the next study that is done with this therapy, and that study may just as well find negative effects. The resulting effect sizes differ so much for one type of therapy, that it cannot be reliably predicted what the true effect size is.
Cf. your original point about a low result looking weird given the prior. Perhaps the easiest way to see this is to consider a case where the intervention is harmful. The informed prior says P(ES < 0) is very close to zero. Yet >1/72 studies in the sample did have an effect size < 0. So obviously a prior for an intervention should not be that confident in predicting it will not have a -ve effect. But a prior for the average effect of PT interventions should be that confident this average is not in fact negative, given the great majority of sampled studies show substantially +ve effects.
In a sense, the posterior is not computing the expected effect of StrongMinds, but rather the expected effect of a future intervention like StrongMinds. Somewhat ironically, this (again, simulated) result would be best interpreted as an anti-recommendation: StrongMinds performs much below the average we would expect of interventions similar to it.
It is slightly different for measured height, as we usually have very little pure measurement error (versus studies with more significant sampling variance). So you'd update a little less towards the reported study effects vs. the expected value than you would for height measurements vs. the average. But the key points still stand.
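The P(ES < 0) point is easy to check numerically, using the 0.64 average and the two widths quoted earlier (an SE of ~0.05 implied by the confidence interval, and an sd of ~0.46 implied by the prediction interval):

```python
from scipy.stats import norm

mean = 0.64
sd_average = 0.05      # ~SE of the average effect (from the 0.54-0.74 CI)
sd_prediction = 0.46   # ~sd implied by the -0.27 to 1.55 prediction interval

print(f"P(average effect < 0):          {norm.cdf(0, mean, sd_average):.2e}")   # essentially zero
print(f"P(a given study's effect < 0):  {norm.cdf(0, mean, sd_prediction):.2f}") # roughly 0.08
```

Roughly 8% of draws below zero is unsurprising given that more than 1 in 72 sampled comparisons were negative; a prior that puts essentially zero mass there is a prior about the average, not about a given study.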
"Would it have been better to start with a stipulated prior based on evidence of short-course general-purpose[1] psychotherapy's effect size generally, update that prior based on the LMIC data, and then update that on charity-specific data?"
1. To your first point, I think adding another layer of priors is a plausible way to do things, but given the effects of psychotherapy in general appear to be similar to the estimates we come up with,[1] it's not clear how much this would change our estimates.
There are probably two issues with using HIC RCTs as a prior. First, incentives that could bias results probably differ across countries; I'm not sure how this would pan out. Second, in HICs, the control group ("treatment as usual") is probably a lot better off. In a HIC RCT, there's not much you can do to stop someone in the control group of a psychotherapy trial from going to get prescribed antidepressants. However, the standard of care in LMICs is much lower (antidepressants typically aren't an option), so we shouldn't be terribly surprised if control groups appear to do worse (and the treatment effect is thus larger).
"To my not-very-well-trained eyes, one hint to me that there's an issue with application of Bayesian analysis here is the failure of the LMIC effect-size model to come anywhere close to predicting the effect size suggested by the SM-specific evidence."
2. To your second point, does our model predict charity specific effects?
In general, I think it's a fair test of a model to say it should do a reasonable job of predicting new observations. We can't yet discuss the forthcoming StrongMinds RCT (we will know how well our model works at predicting that RCT when it's released), but for the Friendship Bench (FB) situation, it is true that we predict a considerably lower effect for FB than the FB-specific evidence would suggest. But this is in part because we're using a combination of charity-specific evidence to inform both our prior and the data. Let me explain.
We have two sources of charity-specific evidence. First, we have the RCTs, which are based on a charity programme but not as it's deployed at scale. Second, we have monitoring and evaluation (M&E) data, which can show how well the charity intervention is implemented in the real world. We don't have a psychotherapy charity at present that has RCT evidence of the programme as it's deployed in the real world. This matters because I think placing a very high weight on the charity-specific evidence would require it to have high ecological validity. While the ecological validity of these RCTs is obviously higher than the average study's, we still think it's limited. I'll explain our concern with FB.
For Friendship Bench, the most recent RCT (Haas et al. 2023, n = 516) reports an attendance rate of around 90% to psychotherapy sessions, but the Friendship Bench M&E data reports an attendance rate more like 30%. We discuss this in Section 8 of the report.
So for the Friendship Bench case we have a couple of reasonable-quality RCTs, but it seems, based on the M&E data, that something is wrong with the implementation. This evidence of lower implementation quality should be adjusted for, which we do. But we include this adjustment in the prior. So we're injecting charity-specific evidence into both the prior and the data. Note that this is part of the reason why we don't think it's wild to place a decent amount of weight on the prior. This is something we should probably clean up in a future version.
We can't discuss the details of the Baird et al. RCT until it's published, but we think there may be an analogous situation to Friendship Bench, where the RCT and M&E data tell conflicting stories about implementation quality.
This is all to say, judging how well our predictions fare when predicting the charity-specific effects isn't entirely straightforward, since we are trying to predict the effects of the charity as it is actually implemented (something we don't directly observe), not simply the effects from an RCT.
If we try to predict the RCT effects for Friendship Bench (which have much higher attendance than the "real" programme), then the gap between the predicted RCT effects and the actual RCT effects is much smaller, but it still suggests that we can't completely explain why the Friendship Bench RCTs find their large effects.
So, we think the error in our prediction isn't quite as bad as it seems if we're predicting the RCTs, and stems in large part from the fact that we are actually predicting the charity implementation.
Cuijpers et al. 2023 finds an effect of psychotherapy of 0.49 SDs for studies with low RoB in low-, middle-, and high-income countries (comparisons = 218), and Tong et al. 2023 find an effect of 0.69 SDs for studies with low RoB in non-western countries (primarily low- and middle-income; comparisons = 36). Our estimate of the initial effect is 0.70 SDs (before publication bias adjustments). The results tend to be lower (between 0.27 and 0.57 SDs, or between 0.42 and 0.60 SDs) when the authors of the meta-analyses correct for publication bias. In both meta-analyses (Tong et al. and Cuijpers et al.) the authors present the effects after using three publication bias correction methods: trim-and-fill (0.6; 0.38 SDs), a limit meta-analysis (0.42; 0.28 SDs), and a selection model (0.49; 0.57 SDs). If we averaged their publication-bias-corrected results (which they obtained without removing outliers beforehand), the estimated effect of psychotherapy would be 0.5 SDs and 0.41 SDs for the two meta-analyses. Our estimate of the initial effect (which is most comparable to these meta-analyses), after removing outliers, is 0.70 SDs, and our publication bias correction is 36%, implying a corrected estimate of ~0.46 SDs. You can play around with the data they use on the metapsy website.
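For what it is worth, the averages quoted in this footnote can be checked with a couple of lines (all numbers below are taken from the footnote itself):

```python
cuijpers = {"trim-and-fill": 0.60, "limit meta-analysis": 0.42, "selection model": 0.49}
tong     = {"trim-and-fill": 0.38, "limit meta-analysis": 0.28, "selection model": 0.57}

print(f"Cuijpers et al. average corrected effect: {sum(cuijpers.values()) / 3:.2f}")  # ~0.50
print(f"Tong et al. average corrected effect:     {sum(tong.values()) / 3:.2f}")      # ~0.41

initial, correction = 0.70, 0.36
print(f"HLI initial effect after a 36% correction: {initial * (1 - correction):.2f}")
```

The last line comes out at ~0.45, i.e. the quoted ~0.46 up to rounding of the 36% figure.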
HLI kindly provided me with an earlier draft of this work to review a couple of weeks ago. Although things have gotten better, I noted what I saw as major problems with the draft as-is, and recommended HLI take its time to fix them, even though this would take a while, and likely miss the window of Giving Tuesday.
Unfortunately, HLI went ahead anyway with the problems I identified basically unaddressed. Also unfortunately (notwithstanding laudable improvements elsewhere), these problems are sufficiently major that I think potential donors are ill-advised to follow the recommendations and analysis in this report.
In essence:
Issues of study quality loom large over this literature, with a high risk of materially undercutting the results (they did last time). The report's interim attempts to manage these problems are inadequate.
Publication bias corrections are relatively mild, but only when all effects g > 2 are excluded from the analysis; they are much starker (albeit weird) if all data is included. Due to this, the choice to exclude "outliers" roughly trebles the bottom-line efficacy of PT. This analysis choice is dubious on its own merits, was not pre-specified in the protocol, and yet is only found in the appendix rather than the sensitivity analysis in the main report.
The Bayesian analysis completely stacks the deck in favour of psychotherapy interventions (i.e. an "informed prior" which asserts one should be >99% confident StrongMinds is more effective than GiveDirectly before any data on StrongMinds is contemplated), such that psychotherapy/StrongMinds/etc. getting recommended is essentially foreordained.
Study quality
It perhaps comes as little surprise that different studies on psychotherapy in LMICs report very different results:[1]
The x-axis is a standardized measure of effect size for psychotherapy in terms of wellbeing.[2] Most, but not all, show a positive effect (g > 0), but the range is vast. HLI excludes effect sizes over 2 as outliers (much more on this later), but 2 is already a large effect: to benchmark, it is roughly the height difference between male and female populations.
Something like a "(weighted) average effect size" across this set would look promising (~0.6); to also benchmark, the effect size of cash transfers on (individual) wellbeing is ~0.1. Yet cash transfers (among many other interventions) have much less heterogeneous results: more like "0.1 +/- 0.1", not ~"0.6 multiply-or-divide by an integer". It seems important to understand what is going on.
One hope would be that this heterogeneity can be explained in terms of the intervention and length of follow-up. Different studies did (e.g.) different sorts of psychotherapy, did more or less of it, and measured the outcomes at different points afterwards. Once we factor these things into our analysis, the wide distribution seen when looking at the impact of psychotherapy in general sharpens into a clearer picture for any particular psychotherapeutic intervention. One can then deploy this knowledge to assess, in particular, the likely efficacy of a charity like StrongMinds.
The report attempts this enterprise in section 4. I think a fair bottom line is that, despite these efforts, the overall picture is still very cloudy: the best model explains ~12% of the variance in effect sizes. But this best model is still better than no model (more on this later), so one can still use it to make a best guess for psychotherapeutic interventions, even if there remains a lot of uncertainty and spread.
But there could be another explanation for why there's so much heterogeneity: there are a lot of low-quality studies, and low-quality studies tend to report inflated effect sizes. In the worst case, the spread of data suggesting psychotherapy's efficacy is instead a mirage, and the effect size melts under proper scrutiny.
Hence why most systematic reviews do assess the quality of included studies and their risk of bias. Sometimes this is only used to give a mostly qualitative picture alongside the evidence synthesis (e.g. "X% of our studies have a moderate to high risk of bias"), and sometimes it is incorporated quantitatively (e.g. a "quality score" of included studies used as a predictor/moderator, grouping by high/moderate/low risk of bias, etc.), although all of these approaches are controversial.
HLI's report does not assess the quality of its included studies, although it plans to do so. I appreciate GRADEing 90 studies or whatever is tedious and time-consuming, but skipping this step to crack on with the quantitative synthesis is very unwise:[3] any such synthesis could be hugely distorted by low-quality studies. And it's not like this is a mere possibility: I demonstrated in the previous meta-analysis that study registration status (one indicator of study quality) explained a lot of heterogeneity, and unregistered studies had on average a three times [!] greater effect size than registered ones.
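To illustrate the kind of moderator check being asked for, here is a toy meta-regression on simulated data. It is deliberately simple (a fixed-effect weighted regression, not HLI's four-level random-effects model), and every number in it is invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Entirely simulated studies: unregistered studies are drawn with ~3x the true effect.
k = 60
unregistered = rng.integers(0, 2, size=k)
se = rng.uniform(0.05, 0.4, size=k)           # per-study standard errors
true = np.where(unregistered, 0.9, 0.3)       # built-in 3x gap
effect = rng.normal(true, se)

X = sm.add_constant(unregistered.astype(float))
fit = sm.WLS(effect, X, weights=1 / se**2).fit()   # inverse-variance weighted meta-regression
print(fit.params)   # roughly: intercept ~0.3 (registered), +~0.6 for unregistered
```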
The report notes it has done some things to help manage this risk. One is cutting "outliers" (g > 2, the teal in the earlier histogram); another is extensive assessment of publication bias/small-study effects. These things do help: all else equal, I'd expect bigger studies to be methodologically better ones, so adjusting for small-study effects does partially "control" for study quality; I'd also expect larger effect sizes to arise from lower-quality work, so cutting them should notch up the average quality of the studies that remain.
But I do not think they help enough:[4] these are loose proxies for what we seek to understand. Thus the findings would be unreliable in virtue of this alone until quality is properly looked at. Crucially, the risk that these features could confound the earlier moderator analysis has not been addressed:[5] maybe the relationship of (e.g.) "more sessions given → greater effect" is actually due to studies of such interventions tending to be lower quality than the rest. When I looked last time, things like study size or registration status explained a lot more of the heterogeneity than (e.g.) all of the intervention moderators combined. I suspect the same will be true this time too.
Publication bias
I originally suggested (~6 months ago?) that correction for publication bias/small-study effects could be ~an integer division, so I am surprised the correction was a bit less: ~30%. Here's the funnel plot:[6]
Unsurprisingly, huge amounts of scatter, but the asymmetry, although there, does not leap off the page: the envelope of points is pretty rectangular, but you can persuade yourself it's a bit of a parallelogram, and there's a denser part of it which indeed has a trend going down and to the right (so smaller study → bigger effect).
But this only plots effect sizes g < 2 (those red, not teal, in the histogram). If we include all the studies again, the picture looks a lot clearer: the "long tail" of higher effects tends to come from smaller studies, and the asymmetry is now plain to see.
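The "smaller study, bigger effect" trend is what Egger-type regression tests quantify. A minimal sketch on simulated data (the data and the built-in bias are invented for illustration; the real test would be run on the report's dataset):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated effects and standard errors with a built-in small-study effect.
se = rng.uniform(0.05, 0.5, size=70)
effect = rng.normal(0.3 + 1.2 * se, se)   # smaller studies (bigger SE) report larger effects

# Egger's test: regress the standardised effect (effect/SE) on precision (1/SE);
# an intercept clearly different from zero indicates funnel-plot asymmetry.
X = sm.add_constant(1 / se)
fit = sm.OLS(effect / se, X).fit()
print(f"Egger intercept: {fit.params[0]:.2f} (p = {fit.pvalues[0]:.4f})")
```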
This effect, visible to the naked eye, also emerges in the statistics. The report uses a variety of numerical methods to correct for publication bias (some very sophisticated). All of them adjust the results much further downwards on the full data than when outliers are excluded, to varying degrees (table B1, appendix). This has a stark effect on the results: here's the "bottom line" result if you take a weighted average of all the different methods, with different approaches to outlier exclusion (red is the full data, green is the outlier exclusion method the report uses).
Needless to say, this choice is highly material to the bottom-line results: without excluding data, SM drops from ~3.6x GD to ~1.1x GD. Yet it doesn't get a look in for the sensitivity analysis, where HLI's "less favourable" outlier method involves taking an average of the other methods (discounting by ~10%), but not doing no outlier exclusion at all (discounting by ~70%).[7]
Perhaps this would be fine if outlier inclusion were clearly unreasonable. But it's not: cutting data is generally regarded as dubious, and the rationale for doing so here is not compelling. Briefly:
Received opinion is typically that outlier exclusion should be avoided without a clear rationale for why the "outliers" arise from a clearly discrepant generating process. If it is to be done, the results of the full data should still be presented as the primary analysis (e.g.).
The cut data by and large doesn't look visually "outlying" to me. The histogram shows a pretty smooth, albeit skewed, distribution. Cutting off the tail of the distribution at various lengths appears ill-motivated.
Given the interest in assessing small-study effects, cutting out the largest effects (which also tend to come from the smallest studies) should be expected to attenuate the small-study effect (as indeed it does). Yet if our working hypothesis is that these effects are large mainly because the studies are small, their datapoints are informative for plotting this general trend (e.g. for slightly less small studies which have slightly less inflated results).[8]
The strongest argument given is that, in fact, some numerical methods to correct publication bias give absurd results if given the full data: i.e. one gives an adjusted effect size of -0.6, another -0.2. I could buy an adjustment that drives the effect down to roughly zero, but not one which suggests, despite almost all the data being fairly or very positive, we should conclude from these studies that the real effect is actually (highly!) negative.
One could have a long argument about the most appropriate response: maybe just keep it, as the weighted average across methods is still sensible (albeit disappointing)? Maybe just drop those particular methods and average those giving sane answers on the full data? Should we keep the g < 2 exclusion but drop p-curve analysis, as it (absurdly?) adjusts the effect slightly upwards? Maybe we should reweight the averaging of different numerical methods by how volatile their results are when you start excluding data? Maybe pick the outlier exclusion threshold which results in the least disagreement between the different methods? Or maybe just abandon numerical correction, and say "there's clear evidence of significant small-study effects, which the current state of the art cannot reliably quantify and correct"?
So a garden of forking paths opens before us. All of these are varying degrees of "arguable", and they shift the bottom line substantially. One reason pre-specification is so valuable is that it ties you to a particular path before you get to peek at the results, avoiding any risk of a subconscious finger on the scale pushing one down a path of still-defensible choices which nonetheless favour a particular bottom line. Even if you didn't pre-specify, presenting your first cut as the primary analysis helps for nothing-up-my-sleeve reasons.
It may be that the pre-specified or initial stab doesn't do a good job of expressing the data, and a different approach does better. Yet making it clear this subsequent analysis is post hoc cautions the reader about the potential risk of bias in the analysis.
Happily, HLI did make a protocol for this work before they conducted the analysis. Unfortunately, it is silent on whether outlying data would be excluded, or by what criteria. Also unfortunately, because of this (and other things, like the extensive discussion in the appendix of the value of outlier removal principally in virtue of its impact on publication bias correction), I am fairly sure the analysis with all data included was the first analysis conducted. Only after seeing the initial publication bias corrections did HLI look at the question of whether some data should be excluded. Maybe it should, but if it came second, the initial analysis should be presented first (and definitely included in the sensitivity analysis).
There's also a risk the cloud of quantification buries the qualitative lede. Publication bias is known to be very hard to correct, and despite HLI compiling multiple state-of-the-art numerical methods, they starkly disagree on what the correction factor should be (i.e. from <~0% to >100%). So perhaps the right answer is that we basically do not know how much to discount the apparent effects seen in the PT literature, given it also appears to be an extremely compromised one, and if forced to give an overall number, any "numerical bottom line" should have even wider error bars because of this.[9]
Bayesian methods
I previously complained that the guesstimate/BOTEC-y approach HLI used to integrate information from the meta-analysis and the StrongMinds trial data couldn't be right, as it didn't pass various sanity tests: e.g. it still recommended SM as highly effective even if you set the trial data to zero effect. HLI now has a much cleverer Bayesian approach to combining sources of information. On the bright side, this is mechanistically much clearer as well as much cleverer. On the downside, the deck still looks pretty stacked.
Starting at the bottom, here's how HLI's Bayesian method compares SM to GD:
The informed prior (in essence) uses the meta-analysis findings, with some Monte Carlo, to get an expected effect for an intervention with StrongMinds-like traits (e.g. same number of sessions, same deliverer, etc.). The leftmost point of the solid line gives the expectation for the prior: so the prior is that SM is ~4x GD's cost-effectiveness (dashed line).
The x-axis is how much weight one gives to the SM-specific data. Of interest, the line slopes down, so the data gives a negative update on SM's cost-effectiveness. This is because HLI, in anticipation of the Baird/Ozler RCT likely showing disappointing results, discounted the effect derived from the original SM-specific evidence by a factor of 20, so the likelihood is indeed much lower than the prior. Standard theory gives the appropriate weighting of this vs. the prior, so you adjust down a bit, but not a lot, from the prior (dotted line).
Despite impeccable methods, these results are facially crazy. To illustrate:
The rightmost point on the solid line is the result if you completely discount the prior and only use the stipulated-to-be-bad SM-specific results. SM is still slightly better than GD on this analysis.[10]
If we "follow Bayesian updating" as HLI recommends, the recommendation is surprisingly insensitive to the forthcoming Baird/Ozler RCT having disappointing findings. Eyeballing it, you'd need such a result to be replicated half a dozen times for the posterior to update to "SM is roughly on a par with GD".
Setting the forthcoming data to show basically zero effect still returns SM as 2-3x GD.[11] I'd guess you'd need the forthcoming RCT to show astonishingly and absurdly negative results (e.g. SM treatment is worse for your wellbeing than bereavement) to get it to approximate equipoise with GD.
You'd need even stronger adverse findings for the model to update all the way down to SM being ineffectual, rather than merely "less good than GiveDirectly".
I take it most readers would disagree with the model here too: e.g. if indeed the only RCT on StrongMinds is basically flat, that should be sufficient to demote SM from putative "top charity" status.
I think I can diagnose the underlying problem: Bayesian methods are very sensitive to the stipulated prior. In this case, the prior is likely too high, and definitely too narrow/overconfident. See this:
Per the dashed and dotted lines in the previous figure, the "GiveDirectly bar" is fractionally below the blue dashed line (the point estimate of the stipulated SM data). The prior distribution is given in red. So the expectation (red dashed line) is indeed ~4x further from the origin (see above).
The solid red curve gives the distribution. Eyeballing the integrals reveals the problem: the integral of this distribution from the blue dashed line to infinity gives the model's confidence that psychotherapy interventions would be more cost-effective than GD. This is at least 99% of the area, if not 99.9%-99.99%+. A fortiori, this prior asserts it is essentially certain the intervention is beneficial (total effect > 0).
I don't think anyone should believe, as a prior, that any intervention is P > 0.99 more cost-effective than GiveDirectly (or P < 0.0001, or whatever it is, that it is in fact harmful),[12] but if one did, it would indeed take masses of evidence to change one's mind. Hence the very sluggish moves in response to adverse data (the purple line suggests the posterior is also 99%+ confident SM is better than GiveDirectly).
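To see how insensitive such a prior is, here is a minimal grid-approximation sketch. The report says HLI use grid approximation (8.3, 9.3), but everything below (the prior and likelihood shapes, their SDs, and the "GD bar") is my illustrative assumption rather than their actual model:

```python
import numpy as np
from scipy.stats import norm

grid = np.linspace(-2, 6, 4001)                      # candidate total effects (WELLBYs)

# Illustrative shapes only: the means/SDs are assumptions, not HLI's model.
prior = norm.pdf(grid, loc=1.49, scale=0.3)          # narrow "informed" prior
likelihood = norm.pdf(0.05, loc=grid, scale=0.5)     # stipulated near-zero SM data
posterior = prior * likelihood

prior /= prior.sum()                                 # normalise on the (uniform) grid
posterior /= posterior.sum()

gd_bar = 0.4                                         # hypothetical GiveDirectly-equivalent threshold
print("P(SM > GD) under the prior:    ", prior[grid > gd_bar].sum())
print("P(SM > GD) under the posterior:", posterior[grid > gd_bar].sum())
print("Posterior mean:", (grid * posterior).sum())
```

Both tail probabilities stay well above 99%, and the posterior mean barely moves: exactly the sluggishness described above.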
I think I can also explain the problem underlying this underlying problem. HLI constructs its priors exclusively from its primary meta-analytic model (albeit adapted to match the intervention of interest, and recalculated excluding any studies done on this intervention to avoid double-counting). Besides the extra uncertainty (so spread) likely implied by the variety of factors covered in the sensitivity analysis, in real life our prior would be informed by other things too: the prospect that entire literatures can be misguided, a general sense (at least for me) that cash transfers are easy to beat in principle but much harder to beat in practice, and so on.
In reality, our prior-to-seeing-the-meta-analysis prior would be very broad and probably reasonably pessimistic, and (even if I'm wrong about the shortcomings I suggest earlier) the "update" on reading it would be a bit upwards, and a little narrower, but not by that much. In turn, the "update" on seeing (e.g.) disappointing RCT results for a given PT intervention would be a larger shift downwards, netting out that this intervention was unlikely to be better than GiveDirectly after all.
If the Bayesian update were meant only as a neat illustration, I would have no complaint. But instead the bottom-line recommendations and assessments rely upon it: that readers should indeed adopt the supposed prior the report proposes about the efficacy of PT interventions in general. Crisply, I doubt the typical reader seriously believes (e.g.) that basically any psychotherapy intervention in LMICs, so long as cost per patient is <$100, is a ~certain bet to beat cash transfers. If not, they should question the report's recommendations too.
Summing up
Criticising is easier than doing better. But I think this is a case where a basic qualitative description tells the appropriate story, and the sophisticated numerical methods are essentially a "bridge too far" given the low quality of what they have to work with, and so confuse rather than clarify the matter. In essence:
The literature on PT in LMICs is a complete mess. Insofar as more sense can be made from it, the most important factors appear to belong to the studies investigating it (e.g. their size) rather than qualities of the PT interventions themselves.
Trying to correct the results of a compromised literature is known to be a nightmare. Here, the qualitative evidence for publication bias is compelling. But quantifying what particular value of "a lot?" the correction should be is fraught: numerically, methods here disagree with one another dramatically, and prove highly sensitive to choices on data exclusion.
Regardless of how PT looks in general, StrongMinds in particular is looking less and less promising. Although initial studies looked good, they had various methodological weaknesses, and a forthcoming RCT with much higher methodological quality is expected to deliver disappointing results.
The evidential trajectory here is all too common, and the outlook typically bleak. It is dubious StrongMinds is a good pick even among psychotherapy interventions (picking one at random which doesn't have a likely-bad-news RCT imminent seems a better bet). Although pricing different interventions is hard, it is even more dubious SM is close to the frontier of "very well evidenced" vs. "has very promising results" plotted out by things like AMF, GD, etc. HLI's choice to nonetheless recommend SM again this giving season is very surprising. I doubt it will weather hindsight well.
All of the figures are taken from the report and appendix. The transparency is praiseworthy, although it is a pity that, despite largely looking at the right things, the report often mistakes which conclusions to draw from them.
With all the well-worn caveats about measuring well-being.
The Cochrane handbook section on meta-analysis is very clear on this (but to make it clearer, I add emphasis):
As a WIP, the data and code for this report are not yet out, but in my previous statistical noodling on the last one, both study size and registration status significantly moderated the effect downwards when included together, suggesting the former isn't telling you everything re. study quality.
The report does mention later (S10.2) controlling a different analysis for study quality, when looking at the effect of sample size itself:
I don't think this analysis is included in the appendix or similar, but later text suggests the "study quality" correction is a publication bias adjustment. This analysis is least fruitful when applied to study scale, as measures of publication bias are measures of study size: finding that the effects of study scale are attenuated when you control for a proxy of study scale is uninformative.
What would be informative is the impact that measures of "study scale" or publication bias have on the coefficients for the primary moderators. Maybe they too could end up "controlled away with quality variables, more of which that we haven't considered might be included"?
There are likely better explanations of funnel plots etc. online, but my own attempt is here.
The report charts a much wiser course on a different "outlier?" question: whether to include very-long-follow-up studies, where exclusion would cut the total effect in half. I also think including everything here is fine, but the report's discussion in S4.2 clearly articulates the reason for concern, displays what impact inclusion vs. exclusion has, and carefully interrogates the outlying studies to see whether they have features (beyond reporting "outlying" results) which warrant exclusion. They end up going half-and-half, but consider both full exclusion and inclusion in the sensitivity analysis.
If you are using study size as an (improvised) measure of study quality, excluding the smallest studies because, on an informal read, they are particularly low quality makes little sense: this is the trend you are interested in.
A similar type of problem crops up when one is looking at the effect of "dosage" on PT efficacy.
The solid lines are the fit (blue linear, orange log) on the full data, whilst the dashed (purple) lines are fits with extreme values of dosage, small or large, excluded. The report freely concedes its choices here are very theory-led rather than data-driven. It is also worth saying that getting more of a trend here makes a rod for SM's and Friendship Bench's backs, as these deliver smaller numbers of sessions than the average, so adjusting with the dashed lines rather than the solid ones reduces the expected effect.
Yet the main message I would take from the scatter plot is that the data indeed looks very flat, and there is no demonstrable dose-response relationship for PT. Qualitatively, this isn't great for face validity.
To its credit, the write-up does highlight this, but does not seem to appreciate that the implications are crazy: any PT intervention, so long as it is cheap enough, should be thought better than GD, even if studies upon it show a very low effect size (which would usually be reported as a negative result, as almost any study in this field would be underpowered to detect effects as low as are being stipulated):
The report describes this clearly itself, but seems to think this is a feature rather than a bug (my emphasis):
I'm not even sure that "P > 0.99 better than GD" would be warranted as a posterior even for a GiveWell-recommended top charity, and I'd guess the GW staff who made the recommendation would often agree.
I think I agree with the rest of this analysis (or at least the parts I could understand). However, the following paragraph seems off:
Apologies if I'm being naive here, but isn't this just a known problem of first-order cost-effectiveness analysis, not with this particular analysis per se? I mean, since cheapness could be arbitrarily low (or at least down to $0.01), "better than GD" is a bit of a red herring; the claim is merely that a single (even high-quality) study is not enough for someone to update their prior all the way down to zero, or negative.
And stated in English, this seems eminently reasonable to me. There might be good second-order etc. reasons not to act on a naive first-order analysis (e.g. risk/ambiguity aversion, wanting to promote better studies, etc.). But ultimately that literal claim doesn't seem crazy to me, and naively seems like something that naturally falls out of a direct cost-effectiveness framework.
So the problem I had in mind was in the parenthetical in my paragraph:
To elaborate: the actual data on StrongMinds was an n~250 study by Bolton et al. 2003, followed up by Bass et al. 2006. HLI models this in Table 19:
So an initial effect of g = 1.85, and a total impact of 3.48 WELLBYs. To simulate what the SM data will show once the (anticipated to be disappointing) forthcoming Baird et al. RCT is included, they discount this[1] by a factor of 20.
Thus the simulated effect size of Bolton and Bass is now ~0.1. In this simulated case, the Bolton and Bass studies would be reporting negative results, as they would not be powered to detect an effect size as small as g = 0.1. To benchmark, the forthcoming Baird et al. study is 6x larger than these, and its power calculations give minimal detectable effects of g = 0.1 or greater.
Yet, apparently, in such a simulated case we should conclude that StrongMinds is fractionally better than GD purely on the basis of two trials reporting negative findings, because numerically the treatment groups did slightly (but not significantly) better than the control ones.
Even if in general we are happy with "hey, the effect is small, but it is cheap, so it's a highly cost-effective intervention", we should not accept this at the point where "small" becomes "too small to be statistically significant". Analysis method + negative findings ≠ "fractionally better in expectation than cash transfers", so I take it as diagnostic that the analysis is going wrong.
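A rough power calculation backs up the claim that a simulated g ~0.1 would read as a negative result in studies of this size. The even split across two arms, and the "6x larger" figure, are assumptions for illustration:

```python
import math
from scipy.stats import norm

def mde(n_per_arm, alpha=0.05, power=0.8):
    """Approximate minimal detectable effect (in SDs) for a two-arm comparison of means."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * math.sqrt(2 / n_per_arm)

print(f"n ~250 total (125/arm):  MDE ~ g = {mde(125):.2f}")   # ~0.35: far above a true g ~0.1
print(f"n ~1500 total (750/arm): MDE ~ g = {mde(750):.2f}")   # ~0.14: near the stated g ~0.1
```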
I think "this" must be the initial effect size/intercept, as 3.48 * 0.05 ≈ 0.17, not ~0.38. I find this counter-intuitive, as I think the drop in total effect should be superlinear, not sublinear, with the intercept, but ignore that.
Thank you for your comments, Gregory. We're aware you have strong views on the subject and we appreciate your conscientious contributions. We discussed your previous comments internally but largely concluded revisions weren't necessary as we (a) had already considered them in the report and appendix, (b) will return to them in later versions and didn't expect they would materially affect the results, or (c) simply don't agree with these views. To unpack:
Study quality. We conclude the data set does contain bias, but we account for it (sections 3.2 and 5; it's an open question among academics how best to do this). We don't believe that the entire field of LMIC psychotherapy should be considered bunk, compromised, or uninformative. Our results are in line with existing meta-analyses of psychotherapy considered to have low risk of bias (see footnote).[1]
Evidentiary standards. We drew on a large number of RCTs for our systematic reviews and meta-analyses of cash transfers and psychotherapy (42 and 74, respectively). If one holds that the evidence for something as well-studied as psychotherapy is too weak to justify any recommendations, charity evaluators could recommend very little.
Outlier exclusion. The issues regarding outlier exclusion were discussed in some depth (3.2 in the main report and in Appendix B). Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach. It's consistent with not taking the entire literature at face value but also not taking guilt by association too far. If one excludes outliers, the specific way one does this has a minor effect (e.g., a 10% decline in effectiveness, see appendix). Our analysis necessarily makes analytic choices: some were pre-registered, some made on reflection, many were discussed in our sensitivity analysis. If one insisted only on using charity evaluations that had every choice pre-registered, there would be none to choose from.
Bayesian analysis: The method we use ("grid approximation", see 8.3 and 9.3) avoids subjective inputs. It is not this Bayesian analysis itself that "stacks the deck" in favour of psychotherapy, but the evidence. Given that over 70 studies form the prior, it would be surprising if adding one study, as we did for StrongMinds, would radically alter the conclusions. [Edit 5/12/2023: on the point that StrongMinds could be more cost-effective than GiveDirectly even if StrongMinds only has the small effect we assume it does in our hypothetical placeholder studies, it doesn't seem inconceivable that a small, less effective intervention can still be more cost-effective than a big, expensive one. For context, we estimate it costs StrongMinds $63 per intervention (providing one person with a course of therapy), whereas it costs GiveDirectly $1,221 per intervention (a $1,000 cash transfer which costs $221 in overheads). As the therapy is about 20x cheaper, it can be far less effective yet still more cost-effective.]
Making recommendations: we aim to recommend the most cost-effective ways of increasing WELLBYs we've found so far. While we have intuitions about how good different interventions are, our perspective as an organisation is that conclusions about what's cost-effective should be led heavily by the evidence rather than by our pre-evidential beliefs ("priors"). Given the evidence we've considered, we don't see a strong case for recommending cash transfers over psychotherapy.
This is a working report, and we'll be reflecting on how to incorporate the above, similarly psychotherapy-sceptical perspectives, and other views in the process of preparing it for academic review. In the interests of transparency, we don't plan to engage beyond our comments above, so as to preserve team resources.
We find an initial effect of 0.70 SDs, reduced to 0.46 SDs after publication bias adjustments. Cuijpers et al. 2023 find an effect of psychotherapy of 0.49 SDs for studies with low risk of bias (RoB) in low-, middle-, and high-income countries (comparisons = 218), which reduces to between 0.27 and 0.57 after publication adjustment. Tong et al. 2023 find an effect of 0.69 SDs for studies with low RoB in non-Western countries (primarily low and middle income; comparisons = 36), which adjusts to between 0.42 and 0.60 after publication correction. Hence, our initial and adjusted numbers are similar.
Epistemic status: tentative; it's been a long time since reading social science papers was a significant part of my life. Happy to edit/retract this preliminary view as appropriate if someone is able to identify mistakes.
I can't access Cuijpers et al., but I don't read Tong et al. as supporting what HLI has done here.
In their article, Tong et al. provide the effect size with no exclusions, then with outliers excluded, then with "extreme outliers" excluded (the latter of which seems to track HLI's removal criterion). They also provide the effect size with various publication-bias measures employed. See PDF at 5-6. If I'm not mistaken, the publication bias measures are applied to the no-exclusions version, not a version with outliers removed or limited to those with lower RoB. See id. at 6 tbl.2 (n = 117 for the combined and 2 of 3 publication-bias effect sizes; 153 with trim-and-fill adding 36 studies; n = 74 for outliers removed and n = 104 for extreme outliers removed; the effect sizes after publication-bias measures ranging from 0.42 to 0.60 seem to be those mentioned in HLI's footnote above).
Tong et al. "conducted sensitive analyses comparing the results with and without the inclusion of extreme outliers," PDF at 5, discussing the results without exclusion first and then the results with exclusion. See id. at 5-6. Tables 3-5 are based on data without exclusion of extreme outliers; the versions of Tables 4 and 5 that exclude extreme outliers are relegated to the supplemental tables (not in PDF). See id. at 6. This reads to my eyes as treating both the all-inclusive and extreme-outliers-excluded data seriously, with some pride of place to the all-inclusive data.
I don't read Tong et al. as having reached a conclusion that either the all-inclusive or extreme-outliers-excluded results were more authoritative, saying things like:
Lastly, we were unable to explain the different findings in the analyses with vs. without extreme outliers. The full analyses that included extreme outliers may reflect the true differences in study characteristics, or they may imply the methodological issues raised by studies with effect sizes that were significantly higher than expected.
and
Therefore, the larger treatment effects observed in non-Western trials may not necessarily imply superior treatment outcomes. On the other hand, it could stem from variations in study design and quality.
and
Further research is required to explain the reasons for the differences in study design and quality between Western and non-Western trials, as well as the different results in the analyses with and without extreme outliers.
PDF at 10.
Of course, "further research needed" is an almost inevitable conclusion of the majority of academic papers, and Tong et al. have the luxury of not needing to reach any conclusions to inform the recommended distribution of charitable dollars. But I don't read the article by Tong et al. as supporting the proposition that it is appropriate to just run with the outliers-excluded data. Rather, I read the article as suggesting that, at least in the absence of compelling reasons to the contrary, one should take both analyses seriously, but neither definitively.
I lack confidence in what "taking both analyses seriously, but neither definitively" would mean for purposes of conducting a cost-effectiveness analysis. But I speculate that it would likely involve some sort of weighting of the two views.
Hi again Jason,
When we said "Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach", I can see that what we meant by "similar approach" was unclear. We meant that, conditional on removing outliers, they identify a similar or greater range of effect sizes as outliers as we do.
This was primarily meant to address the question raised by Gregory about whether to include outliers: "The cut data by and large doesn't look visually 'outlying' to me."
To rephrase, I think that Cuijpers et al. and Tong et al. would agree that the data we cut looks outlying. Obviously, this is a milder claim than our comment could be interpreted as making.
Turning to the wider implications of these meta-analyses: as you rightly point out, they don't have a "preferred specification" and are mostly presenting the options for doing the analysis. They present analyses with and without outlier removal in their main analysis, and they adjust for publication bias without outliers removed (which is not what we do). The first analytic choice doesn't clearly support including or excluding outliers, and the second, if it supports any option, favors Greg's proposed approach of correcting for publication bias without outliers removed.
I think one takeaway is that we should consider surveying the literature and some experts in the field, in a non-leading way, about what choices they'd make if they didn't have "the luxury of not having to reach a conclusion".
I think it seems plausible to give some weight to analyses with and without excluding outliers, if we are able to find a reasonable way to treat the 2 out of 7 publication bias correction methods that produce results suggesting the effect of psychotherapy is in fact sizably negative. We'll look into this more before our next update.
Cutting the outliers here was part of our first-pass attempt at minimising the influence of dubious effects, which we'll follow up with a Risk of Bias analysis in the next version. Our working assumption was that effects greater than ~2 standard deviations are suspect on theoretical grounds (that is, if they behave anything like SDs in a normal distribution), and seemed more likely to be the result of some error-generating process (e.g. data-entry error, bias) than a genuine effect.
We'll look into this more in our next pass, but for this version we felt outlier removal was the most sensible choice.
I recently discovered that GiveWell decided to exclude an outlier in their water chlorination meta-analysis. I'm not qualified to judge their reasoning, but maybe others with sufficient expertise will weigh in?
It looks like the same comment got posted several times?
Thanks Rebecca. I will delete the duplicates.
A comparatively minor point, but it doesn't seem to me that the claims in Greg's post [more] are meaningfully weakened by whether or not psychotherapy is well-studied (as measured by how many RCTs HLI has found on it, noting that you already push back on some object-level disagreement on study quality in point 1, which feels more directly relevant).
It also seems pretty unlikely that psychotherapy being well studied necessarily means that StrongMinds is a cost-effective intervention comparable to current OP/GW funding bars (which is one main point of contention), or that charity evaluators need 74+ RCTs in an area before recommending a charity. Is the implicit claim being made here that the evidence for StrongMinds being a top charity is stronger than that for AMF, which is (AFAIK) based on fewer than 74 RCTs?[1]
GiveWell, Cochrane
I have previously let HLI have the last word, but this is too egregious.
Study quality: Publication bias (a property of the literature as a whole) and risk of bias (particular to each of the individual studies which comprise it) are two different things.[1] Accounting for the former does not account for the latter. This is why the Cochrane handbook, the three meta-analyses HLI mentions here, and HLI's own protocol all distinguish the two.
Neither Cuijpers et al. 2023 nor Tong et al. 2023 further adjust their low risk of bias subgroup for publication bias.[2] I tabulate the relevant figures from both studies below:
So HLI indeed gets similar initial results and publication bias adjustments to the two other meta-analyses they cite. Yet, although these are not like-for-like, these other two meta-analyses find similarly substantial effect reductions when accounting for study quality as they do when assessing publication bias of the literature as a whole.
There is ample cause for concern here:[3]
Although neither of these studies "adjusts for both", one mentioned later (Cuijpers et al. 2020) does. It finds an additional discount to effect size when doing so.[4] So it suggests that indeed "accounting for" publication bias does not adequately account for risk of bias en passant.
Tong et al. 2023 (the meta-analysis expressly on PT in LMICs rather than PT generally) finds a higher prevalence of indicators of lower study quality in LMICs, and notes this as a competing explanation for the outsized effects.[5]
As previously mentioned, in the earlier meta-analysis unregistered trials had a 3x greater effect size than registered ones. None of the trials on Strongminds published so far were registered. Baird et al., which is registered, is anticipated to report disappointing results.
Evidentiary standards: Indeed, the report drew upon a large number of studies. Yet even a synthesis of 72 million (or whatever) studies can be misleading if issues of publication bias, risk of bias in individual studies (and so on) are not appropriately addressed. That an area has 72 (or whatever) studies upon it does not mean it is well-studied, nor would this number (nor any number) be sufficient, by itself, to satisfy any evidentiary standard.
Outlier exclusion: The report's approach to outlier exclusion is dissimilar to both Cuijpers et al. 2020 and Tong et al. 2023, and further is dissimilar with respect to features I highlighted as major causes for concern re. HLI's approach in my original comment.[6] Specifically:
Both of these studies present the analysis with the full data first in their results. Contrast HLI's report, where only the results with outliers excluded are presented in the main results, and the analysis without exclusion is found only in the appendix.[7]
Both these studies also report the results with the full data as their main findings (e.g. in their respective abstracts). Cuijpers et al. mentions their outlier-excluded results primarily in passing ("outliers" appears once in the main text); Tong et al. relegates a lot of theirs to the appendix. HLI's report does the opposite. (cf. fn 7 above)
Only Tong et al. does further sensitivity analysis on the "outliers excluded" subgroup. As Jason describes, this is done alongside the analysis where all data is included, and the qualitative and quantitative differences which result from this analysis choice are prominently highlighted to the reader and extensively discussed. In HLI's report, by contrast, the factor-of-3 reduction to the ultimate effect size when outliers are not excluded is only alluded to qualitatively in a footnote (fn 33)[8] of the main report's section (3.2) arguing why outliers should be excluded; it is not included in the report's sensitivity analysis, and is only found in the appendix.[9]
Both studies adjust for publication bias only on all data, not on data with outliers excluded, and these are the publication bias findings they present. Contrast HLI's report.
The Cuijpers et al. 2023 meta-analysis previously mentioned also differs in its approach to outlier exclusion from HLI's report in the ways highlighted above. The Cochrane handbook also supports my recommendations on what approach should be taken: which is what the meta-analyses HLI cites approvingly as examples of "sensible practice" actually do, but HLI's own work does not.
The report's (non-)presentation of the stark quantitative sensitivity of its analysis (material to the report's bottom-line recommendations) to whether outliers are excluded is clearly inappropriate. It is indefensible if, as I have suggested may be the case, the analysis with outliers included was indeed the analysis first contemplated and conducted.[10] It is even worse if it was the publication bias corrections on the full data that in fact prompted HLI to start making alternative analysis choices which happened to substantially increase the bottom-line figures.
Bayesian analysis: Bayesian methods notoriously do not avoid subjective inputs: most importantly here, what information we attend to when constructing an "informed prior" (or, if one prefers, how to weigh the results with a particular prior stipulated).
In any case, they provide no protection from misunderstanding the calculation being performed, and so misinterpreting the results. The Bayesian method in the report is actually calculating the (adjusted) average effect size of psychotherapy interventions in general, not the expected effect of a given psychotherapy intervention. Although a trial on Strongminds which shows it is relatively ineffectual should not update our view much on the efficacy of psychotherapy interventions (or those similar to Strongminds) as a whole, it should update us dramatically on the efficacy of Strongminds itself.
Although as a methodological error this is a subtle one (at least, subtle enough for me not to initially pick up on it), the results it gave are nonsense to the naked eye (e.g. SM would still be held as a GiveDirectly-beating intervention even if there were multiple high quality RCTs on Strongminds giving flat or negative results). HLI should have seen this themselves, should have stopped to think after I highlighted these facially invalid outputs of their method in early review, and definitely should not be doubling down on these conclusions even now.
Making recommendations: Although there are other problems, those I have repeated here make the recommendations of the report unsafe. This is why I recommended against publication. Specifically:
Although I don't think the Bayesian method the report uses would be appropriate, if it was calculated properly on its own terms (e.g. prediction intervals rather than confidence intervals to generate the prior, etc.), and leaving everything else the same, the SM bottom line would drop (I'm pretty sure) by a factor of a bit more than 2.
The results are already essentially sensitive to whether outliers are excluded in the analysis or not: SM goes from 3.7x to ~1.1x GD on the back of my envelope, again leaving all else equal.
(1) and (2) combined should net out to SM < GD; (1) or (2) combined with some of the other sensitivity analyses (e.g. spillovers) will also likely net out to SM < GD. Even if one still believes the bulk of (appropriate) analysis paths still support a recommendation, this sensitivity should be made transparent.
E.g. even if all studies in the field are conducted impeccably, if journals only accept positive results the literature may still show publication bias. Contrariwise, even if all findings get published, failures in allocation/blinding/etc. could lead to systematic inflation of effect sizes across the literature. In reality (and here) you often have both problems, and they only partially overlap.
Jason correctly interprets Tong et al. 2023: the number of studies included in their publication bias corrections (117 [+36 w/ trim-and-fill]) equals the number of all studies, not the low risk of bias subgroup (36 - see table 3). I do have access to Cuijpers et al. 2023, which has a very similar results table, with parallel findings (i.e. they do their publication bias corrections on the whole set of studies, not on a low risk of bias subgroup).
Me, previously:
From their discussion (my emphasis):
E.g. from the abstract (my emphasis):
Apparently, all that HLI really meant with "Excluding outliers is thought sensible practice here; two related meta-analyses, Cuijpers et al., 2020c; Tong et al., 2023, used a similar approach" [my emphasis] was merely "[C]onditional on removing outliers, they identify a similar or greater range of effect sizes as outliers as we do." (see)
Yeah, right.
I also had the same impression as Jason that HLI's reply repeatedly strawmans me. The passive-aggressive sniping sprinkled throughout, and the subsequent backpedalling (in fairness, I suspect by people who were not at the keyboard of the corporate account), are less than impressive too. But it's nearly Christmas, so beyond this footnote I'll let all this slide.
Me again (my [re-?]emphasis)
Said footnote:
The sentence in the main text to which this is a footnote says:
Me again:
My remark about "Even if you didn't pre-specify, presenting your first cut as the primary analysis helps for nothing up my sleeve reasons", which Dwyer mentions elsewhere, was a reference to "nothing up my sleeve numbers" in cryptography. In the same way that picking the initial digits of pi or e for arbitrary constants reassures readers that the author didn't pick numbers with some ulterior purpose they are not revealing, reporting what one's first analysis showed means readers can compare it to where you ended up after making all the post-hoc judgement calls in the garden of forking paths. "Our first-intention analysis would give x, but we ended up convincing ourselves the most appropriate analysis gives a bottom line of 3x" would rightly arouse a lot of scepticism.
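(For readers unfamiliar with the cryptography reference, here is a minimal illustration of the idea: constants derived from well-known mathematical values that anyone can recompute, so the designer cannot have smuggled in specially chosen numbers. The sketch uses the fractional parts of square roots of small primes, as SHA-256 does for its initial hash values; it is only meant to illustrate the analogy.)

```python
# "Nothing up my sleeve" constants: anyone can re-derive these from sqrt(2),
# sqrt(3), ..., so there is no room for hidden choices by the designer.
from math import isqrt

def frac_sqrt_bits(p, bits=32):
    """First `bits` bits of the fractional part of sqrt(p), as hex."""
    scaled = isqrt(p << (2 * bits))          # floor(sqrt(p) * 2**bits)
    return hex(scaled - (isqrt(p) << bits))  # drop the integer part

for prime in (2, 3, 5, 7):
    print(prime, frac_sqrt_bits(prime))
# sqrt(2) yields 0x6a09e667, which is SHA-256's first initial hash word
```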
I've already mentioned I suspect this is indeed what has happened here: HLI's first cut included all the data, but they argued themselves into making the choice to exclude, which gave a 3x higher "bottom line". Beyond "You didn't say you'd exclude outliers in your protocol" and "basically all of your discussion in the appendix re. outlier exclusion concerns the results of publication bias corrections on the bottom-line figures", I kinda feel HLI not denying it is beginning to invite an adverse inference from silence. If I'm right about this, HLI should come clean.
I'm feeling confused by these two statements:
The first statement says HLI's recommendation is unsafe, but the second implies it is reasonable as long as the sensitivity is clearly explained. I'm grateful to Greg for presenting the analysis paths which lead to SM < GD, but it's unclear to me how much those paths should be weighted compared to all the other paths which lead to SM > GD.
It's notable that Cuijpers (who has done more than anyone in the field to account for publication bias and risk of bias) is still confident that psychotherapy is effective.
I was also surprised by the use of "unsafe". Less cost-effective, maybe, but "unsafe" implies harm and I haven't seen any evidence to support that claim.
You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this. Your prior here is that there is a 99%+ chance that StrongMinds will work better than GiveDirectly before looking at any actual StrongMinds results; this is a wildly implausible claim.
You also state "If one holds that the evidence for something as well-studied as psychotherapy is too weak to justify any recommendations, charity evaluators could recommend very little." Nothing in Gregory's post suggests that he thinks anything like this; he gives a g of ~0.5 in his own meta-analysis, which does not remove outliers without good cause. A g of ~0.5 suggests that individuals suffering from depression would likely greatly benefit from seeking therapy. There is a massive difference between "the evidence behind psychotherapy is too weak to justify any recommendations" and claiming that "this particular form of therapy is not vastly better than GiveDirectly with a probability higher than 99% before even looking at RCT results". Trying to throw out Gregory's claims here over a seemingly false statement about his beliefs seems pretty offensive to me.
[Disclaimer: I worked at HLI until March 2023. I now work at the International Alliance of Mental Health Research Funders]
Gregory says
That is a strong claim to make and it requires him to present a convincing case that GiveDirectly is more cost-effective than StrongMinds. I've found his previous methodological critiques to be constructive and well-explained. To their credit, HLI has incorporated many of them in the updated analysis. However, in my opinion, the critiques he presents here do not make a convincing case.
Taking his summary points in turn...
I think this is much too strong. The three meta-analyses (and Gregory's own calculations) give me confidence that psychotherapy in LMICs is effective, although the effects are likely to be small.
There is no consensus on the appropriate methodology for adjusting for publication bias. I don't have an informed opinion on this, but HLI's approach seems reasonable to me and I think it's reasonable for Greg to take a different view. From my limited understanding, neither approach makes GiveDirectly more cost-effective.
We don't have any new data on StrongMinds so I'm confused why Greg thinks it's "less and less promising". HLI's Bayesian approach is a big improvement on the subjective weightings they used in the first cost-effectiveness analysis. As with publication bias, it's reasonable to hold different views on how to construct the prior, but personally, I do believe that any psychotherapy intervention in LMICs, so long as cost per patient is <$100, is a ~certain bet to beat cash transfers. There are no specific models of psychotherapy that perform better than the others, so I don't find it surprising that training people to talk to other people about their problems is a more cost-effective way to improve wellbeing in LMICs. Cash transfers are much more expensive and the effects on subjective wellbeing are also small.
HLI had to start somewhere and I think we should give credit to StrongMinds for being brave enough to open themselves up to the scrutiny they've faced. The three meta-analyses and the tentative analysis of Friendship Bench suggest there is "altruistic gold" to be found here and HLI has only just started to dig. The field is growing quickly and I'm optimistic about the trajectories of CE-incubated charities like Vida Plena and Kaya Guides.
In the meantime, although the gap between GiveDirectly and StrongMinds has clearly narrowed, I remain unconvinced that cash is clearly the better option (but I do remain open-minded and open to pushback).
A specific therapy treatment is drawn from the distribution of therapy treatments. Our best guess about the distribution of value of a specific therapy treatment, without knowing anything about it, should take into account only that it comes from this distribution of therapy treatments. So I don't see what's unreasonable about this.
When running a meta-analysis, you can either use a fixed effect assumption (that all variation between studies is just due to sampling error) or a random effects assumption (that studies differ in terms of their "true effects"). Therapy treatments differ greatly, so you have to use a random effects model in this case. Then the prior you use for StrongMinds' impact should have a variance that is the sum of the variance in the estimate of the average therapy treatment effect AND the variance among different treatments' effects; both numbers should be available from a random effects meta-analysis. I'm not quite sure what exactly HLI did to get their prior for StrongMinds here, but for some reason the variance on it seems WAY too low, and I suspect that they neglected the second type of variance that they should have gotten from a random effects meta-analysis.
Section 2.2.2 of their report is titled "Choosing a fixed or random effects model". They discuss the points you make and clearly say that they use a random effects model. In section 2.2.3 they discuss the standard measures of heterogeneity they use. Section 2.2.4 discusses the specific 4-level random effects model they use and how they did model selection.
I reviewed a small section of the report prior to publication but none of these sections, and it only took me 5 minutes now to check what they did. I'd like the EA Forum to have a higher bar (as Gregory's parent comment exemplifies) before throwing around easily checkable suspicions about what (very basic) mistakes might have been made.
Yes, some of Greg's examples point to the variance being underestimated, but the problem does not inherently come from the idea of using the distribution of effects as the prior, since that should include both the sampling uncertainty and the true heterogeneity. That would be the appropriate approach even under a random effects model (I think; I'm more used to thinking in terms of Bayesian hierarchical models and the equivalence might not hold).
(@Burner1989 @David Rhys Bernard @Karthik Tadepalli)
I think the fundamental point (i.e. "You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this.") is on the right lines, although the subsequent discussion of fixed/random effects models might confuse the issue. (Cf. my reply to Jason.)
The typical output of a meta-analysis is an (~) average effect size estimate (the diamond at the bottom of the forest plot, etc.). The confidence interval given for that is (very roughly)[1] the interval in which we predict the true average effect likely lies. So for the basic model given in Section 4 of the report, the average effect size is 0.64, 95% CI (0.54 to 0.74). So (again, roughly) our best guess of the "true" average effect size of psychotherapy in LMICs from our data is 0.64, and we're 95% sure(*) this average is somewhere between (0.54, 0.74).
Clearly, it is not the case that if we draw another study from the same population, we should be 95% confident(*) the effect size of this new data point will lie between 0.54 and 0.74. This would not be true even in the unicorn case where there's no between-study heterogeneity (e.g. all the studies are measuring the same effect modulo sampling variance), and even less so when heterogeneity is marked, as here. To answer that question, what you want is a prediction interval.[2] This interval is always wider, and almost always significantly so, than the confidence interval for the average effect: in the same analysis with the 0.54-0.74 confidence interval, the prediction interval was -0.27 to 1.55.
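(A numerical sketch of the gap between the two intervals: the standard error below is backed out of the quoted confidence interval, the between-study SD tau is a value I have assumed so as to roughly reproduce the quoted prediction interval, and 1.96 is used in place of the exact t-quantile.)

```python
# Confidence interval for the *average* effect vs. an approximate prediction
# interval for a *new* study, using the figures quoted above.
import numpy as np

mu = 0.64
se_mu = (0.74 - 0.54) / (2 * 1.96)      # ~0.05, backed out of the 95% CI

tau = 0.46                               # assumed between-study SD
half = 1.96 * np.sqrt(tau**2 + se_mu**2)

print(f"95% CI for the average effect:   {mu - 1.96*se_mu:.2f} to {mu + 1.96*se_mu:.2f}")
print(f"Approx. 95% prediction interval: {mu - half:.2f} to {mu + half:.2f}")
# -> roughly -0.27 to 1.55: a new study can plausibly land anywhere in this
#    much wider range, even though the average itself is tightly estimated.
```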
Although the full model HLI uses in constructing informed priors is different from that presented in S4 (e.g. it includes a bunch of moderators), the priors appear to be constructed with Monte Carlo on the confidence intervals for the average, not the prediction interval for the data. So I believe the informed prior is actually one of the (adjusted) "average effect of psychotherapy interventions as a whole", not a prior for (e.g.) "the effect size reported in a given PT study". The latter would need to use the prediction intervals, and have a much wider distribution.[3]
I think this ably explains exactly why the Bayesian method for (e.g.) Strongminds gives very bizarre results when deployed as the report does, but these results make much more sense if re-interpreted as (in essence) computing the expected effect size of "a future Strongminds-like intervention", rather than the effect size we should believe Strongminds actually has once in receipt of trial data upon it specifically. E.g.:
The histogram of effect sizes shows some comparisons had an effect size < 0, but the "informed prior" suggests P(ES < 0) is extremely low. As a prior for the effect size of the next study, it is much too confident, given the data, that a trial will report positive effects (you have >1/72 studies being negative, so surely the probability cannot be <1%, etc.). As a prior for the average effect size, this confidence is warranted: given the large number of studies in our sample, most of which report positive effects, we would be very surprised to discover the true average effect size is negative.
The prior doesn't update very much on the data provided. E.g. when we stipulate the trials upon Strongminds report a near-zero effect of 0.05 WELLBYs, our estimate of 1.49 WELLBYs goes to 1.26: so we should (apparently) believe in such a circumstance that the efficacy of SM is ~25 times greater than the trial data upon it indicates. This is, obviously, absurd. However, such a small update is appropriate if it applied to ~the average of PT interventions as a whole: observing that a new PT intervention has much-below-average results should cause our average to shift a little towards the new findings, but not much.
In essence, the update we are interested in is not "How effective should we expect future interventions like Strongminds to be, given the data on Strongminds' efficacy", but simply "How effective should we expect Strongminds to be, given the data on how effective Strongminds is". Given the massive heterogeneity and wide prediction interval, the (correct) informed prior is pretty uninformative, as it isn't that surprised by anything in a very wide range of values, and so on finding trial data on SM with a given estimate in this range, our estimate should update to match it pretty closely.[4]
(This also should mean, contrary to what the report suggests, that the SM estimate is not that "robust" to adverse data. Eyeballing it, I'd guess the posterior should go down by a factor of 2-3 conditional on the stipulated data versus the currently reported results.)
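(A toy conjugate normal-normal update illustrates these mechanics. The 1.49 prior mean and the 0.05 stipulated result are the figures discussed above; the standard deviations are made up purely to show how the width of the prior drives the answer, and this is not HLI's actual model.)

```python
def normal_update(prior_mean, prior_sd, data_mean, data_se):
    """Posterior mean and sd for a normal prior with a normal likelihood."""
    w_prior, w_data = 1 / prior_sd**2, 1 / data_se**2
    post_mean = (prior_mean * w_prior + data_mean * w_data) / (w_prior + w_data)
    return post_mean, (w_prior + w_data) ** -0.5

prior_mean, data_mean, data_se = 1.49, 0.05, 0.2   # illustrative numbers

# Narrow prior (akin to one built on the CI of the *average* effect):
print(normal_update(prior_mean, 0.1, data_mean, data_se))   # mean ~1.2: data nearly ignored
# Wide prior (akin to one built on a prediction-interval-like spread):
print(normal_update(prior_mean, 1.0, data_mean, data_se))   # mean ~0.1: data dominates
```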
I'm aware confidence intervals are not credible intervals, and that "the 95% CI tells you where the true value is with 95% likelihood" strictly misinterprets what a confidence interval is, etc. (see) But perhaps "close enough", so I'm going to pretend these are credible intervals, and asterisk each time I assume the strictly incorrect interpretation.
Cf. Cochrane:
Although I think it would have the same mean, so it will give the right "best guess" initial estimates.
Obviously, this is modulo all the other issues I raise with the meta-analysis as a whole, the fact that we would in practice incorporate other sources of information into our actual prior, etc.
Agree that there seem to be some strawmen in HLI's response:
Has anyone suggested that the "entire field of LMIC psychotherapy" is "bunk"?
Has anyone suggested that, either? As I understand it, it's typical to look at debatable choices that happen to support the author's position with a somewhat more skeptical lens if they haven't been pre-registered. I don't think anyone has claimed that the lack of pre-registration of certain choices is somehow fatal, only that it is a factor to consider.
Hey Jason,
Our (i.e. HLI's) comment was in reference to these quotes.
I think it is valid to describe these as saying the literature is compromised and (probably) uninformative. I can understand your complaint about the word "bunk". Apologies to Gregory if this is a mischaracterization.
Regarding our comment:
And your comment:
Yeah, I think this is a valid point, and the post should have quoted Gregory directly. The point we were hoping to make here is that we've attempted to provide a wide range of sensitivity analyses throughout our report, to an extent that we think goes beyond most charity evaluations. It's not surprising that we've missed some in this draft that others would like to see. Gregory's comment that "Even if you didn't pre-specify, presenting your first cut as the primary analysis helps for nothing up my sleeve reasons" seemed to imply that we were deliberately hiding something, but in my view that interpretation was overly pessimistic.
Cheers for keeping the discourse civil.
Would it have been better to start with a stipulated prior based on evidence of short-course general-purpose[1] psychotherapy's effect size generally, update that prior based on the LMIC data, and then update that on charity-specific data?
One of the objections to HLI's earlier analysis was that it was just implausible in light of what we know of psychotherapy's effectiveness more generally. I don't know that literature well at all, so I don't know how well the effect size in the new stipulated prior compares to the effect size for short-course general-purpose psychotherapy generally. However, given the methodological challenges with measuring effect size in LMICs on the available data, it seems like a more general understanding of the effect size should factor into the informed prior somehow. Of course, the LMIC context is considerably different from the context in which most psychotherapy studies have been done, but I am guessing it would be easier to manage quality-control issues with the much broader research base available. So both knowledge bases would likely inform my prior before turning to charity-specific evidence.
[Edit 6-Dec-23: Greg's response to the remainder of this comment is much better than my musings below. I'd suggest reading that instead!]
To my not-very-well-trained eyes, one hint that there's an issue with the application of Bayesian analysis here is the failure of the LMIC effect-size model to come anywhere close to predicting the effect size suggested by the SM-specific evidence. If the model were sound, it would seem very unlikely that the first organization evaluated to the medium-to-in-depth level would happen to have charity-specific evidence suggesting an effect size that diverged so strongly from what the model predicted. I think most of us, when faced with such a circumstance, would question whether the model was sound and would put it on the shelf until performing other charity-specific evaluations at the medium-to-in-depth level. That would be particularly true to the extent the model's output depended significantly on the methodology used to clean up some problems with the data.[2]
By which I mean not psychotherapy for certain narrow problems (e.g., CBT-I for insomnia, exposure therapy for phobias).
If Greg's analysis is correct, it seems I shouldn't assign the informed prior much more credence than I have in HLI's decision to remove outliers (and, to a lesser extent, its choice of method). So, again to my layperson way of thinking, one partial way of thinking about the crux could be that the reader must assess their confidence in HLI's outlier-treatment decision vs. their confidence in the Baird/Ozler RCT on SM.
What prior to formally pick is tricky: I agree the factors you note would be informative, but how to weigh them (vs. other sources of informative evidence) could be a matter of taste. However, sources of evidence like this could be handy to use as "benchmarks" to see whether the prior (or the results of the meta-analysis) is consilient with them, and if not, to explore why.
But I think I can now offer a clearer explanation of what is going wrong. The hints you saw point in this direction, although not quite as you describe.
One thing worth being clear on is that HLI is not updating on the actual SM-specific evidence. As they model it, the estimated effect on this evidence is an initial effect of g = 1.8 and a total effect of ~3.48 WELLBYs, so this would lie on the right tail, not the left, of the informed prior.[1] They discount the effect by a factor of 20 to generate the data they feed into their Bayesian method. Stipulating data which would be (according to their prior) very surprisingly bad would in itself be a strength, not a concern, of the conservative analysis they are attempting.
Next, we need to distinguish the average effect size from a prediction interval. HLI does report both (Section 4) for a more basic model of PT in LMICs. The (average, random effects) effect size is 0.64 (95% CI 0.54 to 0.74), whilst the prediction interval is -0.27 to 1.55. The former gives you the best guess of the average effect (with a confidence interval); the latter tells you, if I do another study like those already included, the range I can expect its effect size to be within. By loose analogy: if I sample 100 people and their average height is roughly 5'7" (95% CI 5'6" to 5'8"), the individual heights will range much more widely (say, a 95% range of 5'0" to 6'2").
Unsurprisingly (especially given the marked heterogeneity), the prediction interval is much wider than the confidence interval around the average effect size. Crucially, if our "next study" reports an effect size of (say) 0.1, our interpretation typically should not be: "This study can't be right; the real effect of the intervention it studies must be much closer to 0.6". Rather, as findings are heterogeneous, it is much more likely to be a study which (genuinely) reports a below-average effect.[2] Back to the loose analogy: we would (typically) assume we got it right if we measured some more people at (e.g.) 6'0" and 5'4", even though these are significantly above or below the 95% confidence interval of our average, and only start to doubt measurements much outside our prediction interval (e.g. 3'10", 7'7").
Now the problem with the informed prior becomes clear: it is (essentially) being constructed with confidence intervals of the average, not prediction intervals for the data, from its underlying models. As such, it is a prior not of "What is the expected impact of a given PT intervention", but rather "What is the expected average impact of PT interventions as a whole".[3]
With this understanding, the previously bizarre behaviour is made sensible. The informed prior should assign very little credence to the average impact of PT overall being ~0.4 per the stipulated Strongminds data, even though it should not be that surprised that a particular intervention (e.g. Strongminds!) has an impact much below average, as many other PT interventions studied also do (cf. although I shouldn't be surprised if I measure someone as 5'2", I should be very surprised if the true average height is actually 5'2", given my large sample averages 5'7"). Similarly, if we are given a much smaller additional sample reporting a much different effect size, the updated average effect should remain pretty close to the prior (e.g. if a handful of new people have heights < 5'4", my overall average goes down a little, but not by much).
Needless to say, the results of such an analysis, if indeed for the "average effect size of psychotherapy as a whole", are completely inappropriate for the "expected effect size of a given psychotherapy intervention", which is the use it is put to in the report.[4] If the measured effect size of Strongminds were indeed ~0.4, the fact that psychotherapy interventions average substantially greater effects of ~1.4 gives very little reason to conclude the effect of Strongminds is in fact much higher (e.g. ~1.3). In the same way, if I measure your height as 5'0", the fact that the average height of other people I've measured is 5'7" does not mean I should conclude you're probably about 5'6".
Minor: it does lie pretty far along the right tail of the prior (<top 1st percentile?), so maybe one could be a little concerned. Not much, though: given HLI was searching for particularly effective PT interventions in the literature, it doesn't seem that surprising that this effort could indeed find one at the far-ish right tail of apparent efficacy.
Cf. Cuijpers et al. 2020
Cf. your original point about a low result looking weird given the prior. Perhaps the easiest way to see this is to consider a case where the intervention is harmful. The informed prior says P(ES < 0) is very close to zero. Yet >1/72 studies in the sample did have an effect size < 0. So obviously a prior for an individual intervention should not be that confident in predicting it will not have a -ve effect. But a prior for the average effect of PT interventions should be that confident this average is not in fact negative, given the great majority of sampled studies show substantially +ve effects.
In a sense, the posterior is not computing the expected effect of StrongMinds, but rather the expected effect of a future intervention like StrongMinds. Somewhat ironically, this (again, simulated) result would be best interpreted as an anti-recommendation: Strongminds performs much below the average we would expect of interventions similar to it.
It is slightly different for measured height, as we usually have very little pure measurement error (versus studies with more significant sampling variance). So you'd update a little less towards the reported study effects vs. the expected value than you would for height measurements vs. the average. But the key points still stand.
Hi Jason,
1. To your first point, I think adding another layer of priors is a plausible way to do things, but given the effects of psychotherapy in general appear to be similar to the estimates we come up with,[1] it's not clear how much this would change our estimates.
There are probably two issues with using HIC RCTs as a prior. First, incentives that could bias results probably differ across countries. I'm not sure how this would pan out. Second, in HICs, the control group ("treatment as usual") is probably a lot better off. In a HIC RCT, there's not much you can do to stop someone in the control group of a psychotherapy trial from going and getting prescribed antidepressants. However, the standard of care in LMICs is much lower (antidepressants typically aren't an option), so we shouldn't be terribly surprised if control groups appear to do worse (and the treatment effect is thus larger).
2. To your second point: does our model predict charity-specific effects?
In general, I think it's a fair test of a model to say it should do a reasonable job of predicting new observations. We can't yet discuss the forthcoming StrongMinds RCT (we will know how well our model works at predicting that RCT when it's released), but for the Friendship Bench (FB) situation, it is true that we predict a considerably lower effect for FB than the FB-specific evidence would suggest. But this is in part because we're using charity-specific evidence to inform both our prior and the data. Let me explain.
We have two sources of charity-specific evidence. First, we have the RCTs, which are based on a charity programme but not as it's deployed at scale. Second, we have monitoring and evaluation data, which can show how well the charity intervention is implemented in the real world. We don't have a psychotherapy charity at present that has RCT evidence of the programme as it's deployed in the real world. This matters because I think placing a very high weight on the charity-specific evidence would require that it has high ecological validity. While the ecological validity of these RCTs is obviously higher than that of the average study, we still think it's limited. I'll explain our concern with FB.
For Friendship Bench, the most recent RCT (Haas et al. 2023, n = 516) reports an attendance rate at psychotherapy sessions of around 90%, but the Friendship Bench M&E data reports an attendance rate more like 30%. We discuss this in Section 8 of the report.
So in the Friendship Bench case we have a couple of reasonable-quality RCTs, but it seems, based on the M&E data, that something is wrong with the implementation. This evidence of lower implementation quality should be adjusted for, which we do. But we include this adjustment in the prior. So we're injecting charity-specific evidence into both the prior and the data. Note that this is part of the reason why we don't think it's wild to place a decent amount of weight on the prior. This is something we should probably clean up in a future version.
We can't discuss the details of the Baird et al. RCT until it's published, but we think there may be an analogous situation to Friendship Bench where the RCT and M&E data tell conflicting stories about implementation quality.
This is all to say that judging how well our predictions fare when predicting the charity-specific effects isn't straightforward, since we are trying to predict the effects of the charity as it is actually implemented (something we don't directly observe), not simply the effects from an RCT.
If we try to predict the RCT effects for Friendship Bench (which have much higher attendance than the "real" programme), then the gap between the predicted RCT effects and the actual RCT effects is much smaller, but it still suggests that we can't completely explain why the Friendship Bench RCTs find their large effects.
So, we think the error in our prediction isn't quite as bad as it seems if we're predicting the RCTs, and stems in large part from the fact that we are actually predicting the charity implementation.
Cuijpers et al. 2023 finds an effect of psychotherapy of 0.49 SDs for studies with low RoB in low-, middle-, and high-income countries (comparisons = 218), and Tong et al. 2023 find an effect of 0.69 SDs for studies with low RoB in non-Western countries (primarily low and middle income; comparisons = 36). Our estimate of the initial effect is 0.70 SDs (before publication bias adjustments). The results tend to be lower (between 0.27 and 0.57 SDs, or 0.42 and 0.60 SDs) when the authors of the meta-analyses correct for publication bias. In both meta-analyses (Tong et al. and Cuijpers et al.) the authors present the effects after using three publication bias correction methods: trim-and-fill (0.60; 0.38 SDs), a limit meta-analysis (0.42; 0.28 SDs), and a selection model (0.49; 0.57 SDs). If we averaged their publication bias corrected results (which they produced without removing outliers beforehand), the estimated effect of psychotherapy would be 0.50 SDs and 0.41 SDs for the two meta-analyses, respectively. Our estimate of the initial effect (which is most comparable to these meta-analyses), after removing outliers, is 0.70 SDs, and our publication bias correction is 36%, implying an adjusted effect of 0.46 SDs. You can play around with the data they use on the metapsy website.
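(For what it's worth, the averaging and the adjustment described above are easy to check; values as quoted, with small discrepancies due to rounding.)

```python
# Quick arithmetic check of the figures in this footnote.
tong = [0.60, 0.42, 0.49]        # trim-and-fill, limit meta-analysis, selection model
cuijpers = [0.38, 0.28, 0.57]

print(round(sum(tong) / 3, 2))       # 0.5
print(round(sum(cuijpers) / 3, 2))   # 0.41
print(round(0.70 * (1 - 0.36), 2))   # 0.45, i.e. roughly the 0.46 quoted above
```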