I think we can be pretty sure (cf.) the forthcoming Strongminds RCT (the one not conducted by Strongminds themselves, which allegedly found an effect size of d = 1.72 [!?]) will give dramatically worse results than HLI's evaluation would predict - i.e. somewhere between "null" and "2x cash transfers" rather than "several times better than cash transfers, and credibly better than GW top charities." [I'll donate 5k USD if the Ozler RCT reports an effect size greater than d = 0.4, i.e. 2x smaller than HLI's estimate of ~0.8, and below the bottom 0.1% of their Monte Carlo runs.]
This will not, however, surprise those who have criticised the many grave shortcomings in HLI's evaluation - mistakes HLI should not have made in the first place, and definitely should not have maintained once they were made aware of them. See e.g. Snowden on spillovers, me on statistics (1, 2, 3, etc.), and GiveWell generally.
Among other things, this would confirm a) SimonM produced a more accurate and trustworthy assessment of Strongminds in their spare time as a non-subject-matter expert than HLI managed as the centrepiece of their activity; b) the ~$250,000 HLI has moved to SM should be counted on the "negative" rather than "positive" side of the ledger, as I expect this will be seen as a significant and preventable misallocation of charitable donations.
Regrettably, it is hard to square this with an unfortunate series of honest mistakes. A better explanation is that HLI's institutional agenda corrupts its ability to conduct fair-minded and even-handed assessment of an intervention where some results were much better for that agenda than others (cf.). I am sceptical this only applies to the SM evaluation, and I am pessimistic this will improve with further financial support.
I'll donate 5k USD if the Ozler RCT reports an effect size greater than d = 0.4, i.e. 2x smaller than HLI's estimate of ~0.8, and below the bottom 0.1% of their Monte Carlo runs.
This RCT (which I should have called the Baird RCT - my apologies for mistakenly substituting Sarah Baird with her colleague Berk Ozler as first author previously) is now out.
I was not specific on which effect size would count, but all relevant[1] effect sizes reported by this study are much lower than d = 0.4 - around d = 0.1. I roughly[2] calculate the figures below.
In terms of "SD-years of depression averted" or similar, there are a few different ways you could slice it (e.g. which outcome you use, whether you linearly interpolate, whether you extend the effects out to 5 years, etc.). But when I play with the numbers I get results around 0.1-0.25 SD-years of depression averted per person (as a sense check, this lines up with an initial effect of ~0.1 which seems to last between 1 and 2 years).
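To make that sense check explicit (illustrative arithmetic only, assuming the effect either decays linearly to zero or stays roughly constant over a follow-up period $T$ of about 2 years):

$$\text{SD-years} \approx \int_0^T d(t)\,dt \approx \tfrac{1}{2} d_0 T = \tfrac{1}{2}(0.1)(2) = 0.1 \ \text{(linear decay)}, \qquad \text{or} \quad d_0 T = (0.1)(2) = 0.2 \ \text{(constant effect)}.$$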
These are indeed "dramatically worse results than HLI's [2021] evaluation would predict". They are also substantially worse than HLI's (much lower) updated 2023 estimates of Strongminds. The immediate effects of 0.07-0.16 are ~>5x lower than HLI's (2021) estimate of an immediate effect of 0.8; they are 2-4x lower than HLI's (2023) informed prior for Strongminds having an immediate effect of 0.39. My calculations of the total effect over time from Baird et al. of 0.1-0.25 SD-years of depression averted are ~10x lower than HLI's 2021 estimate of 1.92 SD-years averted, and ~3x lower than their most recent estimate of ~0.6.
Baird et al. also comment on the cost-effectiveness of the intervention in their discussion (p18):
Unfortunately, the IPT-G impacts on depression in this trial are too small to pass a cost-effectiveness test. We estimate the cost of the program to have been approximately USD 48 per individual offered the program (the cost per attendee was closer to USD 88). Given impact estimates of a reduction in the prevalence of mild depression of 0.054 pp for a period of one year, it implies that the cost of the program per case of depression averted was nearly USD 916, or 2,670 in 2019 PPP terms. An oft-cited reference point estimates that a health intervention can be considered cost-effective if it costs approximately one to three times the GDP per capita of the relevant country per Disability Adjusted Life Year (DALY) averted (Kazibwe et al., 2022; Robinson et al., 2017). We can then convert a case of mild depression averted into its DALY equivalent using the disability weights calculated for the Global Burden of Disease, which equates one year of mild depression to 0.145 DALYs (Salomon et al., 2012, 2015). This implies that ultimately the program cost USD PPP (2019) 18,413 per DALY averted. Since Uganda had a GDP per capita USD PPP (2019) of 2,345, the IPT-G intervention cannot be considered cost-effective using this benchmark.
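As a quick check of the arithmetic in that passage (a rough sketch using only the figures quoted above; the per-case figure does not reproduce exactly from the rounded inputs, but the per-DALY conversion and the benchmark do):

```python
# Rough reproduction of Baird et al.'s cost-effectiveness arithmetic, using only the quoted figures.
cost_per_offered = 48          # USD per individual offered the program
prevalence_reduction = 0.054   # reduction in prevalence of mild depression (as a proportion), ~1 year
print(cost_per_offered / prevalence_reduction)   # ~889; the paper reports "nearly USD 916" (unrounded inputs?)

cost_per_case_ppp = 2670       # the paper's per-case figure in 2019 PPP terms
daly_per_case_year = 0.145     # GBD disability weight for a year of mild depression
print(cost_per_case_ppp / daly_per_case_year)    # ~18,414, matching the paper's USD PPP (2019) 18,413 per DALY

gdp_per_capita_ppp = 2345      # Uganda, 2019 PPP
print(3 * gdp_per_capita_ppp)  # ~7,035: the "1-3x GDP per capita per DALY" bar, far exceeded by 18,413
```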
I'm not sure anything more really needs to be said at this point. But much more could be, and I fear I'll feel obliged to return to these topics before long regardless.
The primary mental health outcomes consist of two binary indicators: (i) having a Patient Health Questionnaire 8 (PHQ-8) score ≤ 4, which is indicative of showing no or minimal depression (Kroenke et al., 2009); and (ii) having a General Health Questionnaire 12 (GHQ-12) score < 3, which indicates one is not suffering from psychological distress (Goldberg and Williams, 1988). We supplement these two indicators with five secondary outcomes: (i) The PHQ-8 score (range: 0-24); (ii) the GHQ-12 score (0-12); (iii) the score on the Rosenberg self-esteem scale (0-30) (Rosenberg, 1965); (iv) the score on the Child and Youth Resilience Measure-Revised (0-34) (Jefferies et al., 2019); and (v) the locus of control score (1-10). The discrete PHQ-8 and GHQ-12 scores allow the assessment of impact on the severity of distress in the sample, while the remaining outcomes capture several distinct dimensions of mental health (Shah et al., 2024).
Measurements were taken following treatment completion ("Rapid resurvey"), then at 12m and 24m thereafter (midline and endline respectively).
I use both primary indicators and the discrete values of the underlying scores they are derived from. I haven't carefully looked at the other secondary outcomes nor the human capital variables, but besides being less relevant, I do not think these showed much greater effects.
I.e. I took the figures from Table 6 (comparing IPT-G vs. control) for these measures and plugged them into a webtool for Cohen's h or d as appropriate (see the sketch below). This is rough and ready, although my calculations agree with the effect sizes either mentioned or described in the text. They also pass an "eye test" of comparing them to the CDFs of the scores in figure 3 - these distributions are very close to one another, consistent with a small-to-no effect (one surprising result of this study is that IPT-G + cash led to worse outcomes than either control or IPT-G alone):
One of the virtues of this study is that it includes a reproducibility package, so I'd be happy to produce a more rigorous calculation directly from the provided data if folks remain uncertain.
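For readers unfamiliar with these conversions, a minimal sketch of the two formulas involved - the function names and the numbers below are placeholders for illustration, not the Table 6 values:

```python
import math

def cohens_h(p1, p2):
    """Effect size for the difference between two proportions (used for the binary indicators)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardised mean difference using the pooled SD (used for the discrete scores)."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

print(cohens_h(0.55, 0.50))                     # ~0.10 for a 5-point difference in proportions
print(cohens_d(5.0, 4.0, 300, 5.4, 4.0, 300))   # ~-0.10 for a 0.4-point difference on a scale with SD 4
```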
My view is that HLI[1], GWWC[2], Founders Pledge[3], and other EA / effective giving orgs that recommend or provide StrongMinds as a donation option should ideally at least update their page on StrongMinds to include relevant considerations from this RCT, and do so well before Thanksgiving / Giving Tuesday in Nov/Dec this year, so donors looking to decide where to spend their dollars most cost-effectively can make an informed choice.[4]
Thanks Bruce, would you still think this if Strongminds ditched their adolescent programs as a result of this study and continued with their core groups with older women?
1) I think this RCT is an important proxy for StrongMinds (SM)'s performance "in situ", and worth updating on - in part because it is currently the only completed RCT of SM. Uninformed readers who read what is currently on e.g. the GWWC[1]/FP[2]/HLI websites might reasonably get the wrong impression of the evidence base behind the recommendation around SM (i.e. that there are no concerns sufficiently noteworthy to merit inclusion as a caveat). I think the effective giving community should have a higher bar for being proactively transparent here - it is much better to include (at minimum) a relevant disclaimer like this than to be asked questions by donors and make a claim that there wasn't capacity to include it.[3]
2) If a SM recommendation is justified as a result of SM's programme changes, this should still be communicated for trust-building purposes (e.g. "We are recommending SM despite [Baird et al RCT results], because …"), both for those who are on the fence about deferring, and for those who now have a reason to re-affirm their existing trust in EA org recommendations.[4]
3) Help potential donors make more informed decisions - for example, informed readers who may be unsure about HLI's methodology and wanted to wait for the RCT results should not have to go search this up themselves or look for a fairly buried comment thread on a post from >1 year ago in order to make this decision when looking at EA recommendations / links to donate. I don't think it's an unreasonable amount of effort compared to how much it may help. This line of reasoning may also apply to other evaluators (e.g. GWWC evaluator investigations).[5]
The GWWC website currently says it only includes recommendations after they review them through their Evaluating Evaluators work, and their evaluation of HLI did not include any quality checks of HLI's work itself nor finalise a conclusion. Similarly, they say: "we don't currently include StrongMinds as one of our recommended programs but you can still donate to it via our donation platform".
We recommend StrongMinds because IPT-G has shown significant promise as an evidence-backed intervention that can durably reduce depression symptoms. Crucial to our analysis are previous RCTs
I'm not suggesting at all that they should have done this by now, only ~2 weeks after the Baird RCT results were made public. But I do think three months is a reasonable timeframe for this.
If there was an RCT that showed malaria chemoprevention cost more than $6,000 per DALY averted in Nigeria (GDP/capita * 3), rather than per life saved (ballpark), I would want to know about it. And I would want to know about it even if Malaria Consortium decided to drop their work in Nigeria, and EA evaluators continued to recommend Malaria Consortium as a result. And how organisations go about communicating updates like this does impact my personal view on how much I should defer to them wrt charity recommendations.
Of course, based on HLI's current analysis/approach, the ?disappointing/?unsurprising result of this RCT (even if it was on the adult population) would not have meaningfully changed the outcome of the recommendation, even if SM did not make this pivot (pg 66):
Therefore, even if the StrongMinds-specific evidence finds a small total recipient effect (as we present here as a placeholder), and we relied solely on this evidence, then it would still result in a cost-effectiveness that is similar or greater than that of GiveDirectly because StrongMinds programme is very cheap to deliver.
And while I think this is a conversation that has already been hashed out enough on the forum, I do think the point stands - potential donors who disagree with or are uncertain about HLI's methodology here would benefit from knowing the results of the RCT, and it's not an unreasonable ask for organisations doing charity evaluations / recommendations to include this information.
Acknowledging that this is DALYs, not WELLBYs! OTOH, this conclusion is not based on the GiveWell or GiveDirectly bar, but on a ~mainstream global health cost-effectiveness standard of ~3x GDP per capita per DALY averted (in this case, the ~$18k USD PPP per DALY averted for SM is well above the ~$7k USD PPP/DALY bar for Uganda, i.e. it fails this benchmark).
Nice one Bruce. I think I agree that it should be communicated like you say, for reasons 2 and 3.
I don't think this is a good proxy for their main programs though, as this RCT looks at a very different thing than their regular programming. I think other RCTs on group therapy in adult women from the region are better proxies than this study on adolescents.
Why do you think it's a particularly good proxy? In my mind it's the same org doing a different treatment (one that seems to work, but only a little and for a short-ish time), with many similarities to their regular treatment of course.
Like I said a year ago, I would have much rather this had been an RCT on StrongMinds' regular programs rather than this one on a very different program for adolescents. I understand though that "does similar group psychotherapy also work for adolescents" is a more interesting question from a researcher's perspective, although less useful for all of us deciding just how good regular StrongMinds group psychotherapy is.
It sounds like you're interpreting my claim to be "the Baird RCT is a particularly good proxy (or possibly even better than other RCTs on group therapy in adult women) for the SM adult programme effectiveness", but this isn't actually my claim here; and while I think one could reasonably make some different, stronger (donor-relevant) claims based on the discussions on the forum and the Baird RCT results, mine are largely just: "it's an important proxy", "it's worth updating on", and "the relevant considerations/updates should be easily accessible on various recommendation pages". I definitely agree that an RCT on the adult programme would have been better for understanding the adult programme.
(I'll probably check out of the thread here for now, but good chatting as always Nick! Hope you're well)
Thanks for this Gregory, I think it's an important result and I have updated my views. I'm not sure why HLI were so optimistic about this. I have a few comments here.
This study was performed on adolescents, which is not the core group of women that StrongMinds and other group IPT programs treat. This study might update me slightly negatively against the effect of their core programming with groups of older women, but not by much.
As the study said, "this marked the first time SMU (i) delivered therapy to out-of-school adolescent females, (ii) used youth mentors, and (iii) delivered therapy through a partner organization."
This result then doesn't surprise me, as (high uncertainty) I think it's generally harder to move the needle with adolescent mental health than with adults.
The therapy still worked, even though the effect sizes were much smaller than in other studies and the intervention was not cost-effective.
As you've said before, if this kind of truly independent research was done on a lot of interventions, the results might not look nearly as good as the original studies.
I think Strongminds should probably stop their adolescent programs based on this study. Why keep doing it, when your work with adult women currently seems far more cost effective?
Even with the Covid caveat, I'm stunned at the null/negative effect of the cash transfer arm. Interesting stuff and not sure what to make of it.
I would still love a similar independent study on the regular group IPT programs with older women, and these RCTs should be pretty cheap on the scale of things. I doubt we'll get that though, as it will probably be seen as too similar and not interesting enough for researchers, which is fair enough.
Thanks for this post, and for expressing your views on our work. Point by point:
I agree that StrongMinds' own study had a surprisingly large effect size (1.72), which was why we never put much weight on it. Our assessment was based on a meta-analysis of psychotherapy studies in low-income countries, in line with academic best practice of looking at the wider sweep of evidence, rather than relying on a single study. You can see how, in table 2 below, reproduced from our analysis of StrongMinds, StrongMinds' own studies are given relatively little weight in our assessment of the effect size, which we concluded was 0.82 based on the available data. Of course, we'll update our analysis when new evidence appears, and we're particularly interested in the Ozler RCT. However, we think it's preferable to rely on the existing evidence to draw our conclusions, rather than on forecasts of as-yet unpublished work. We are preparing our psychotherapy meta-analysis to submit it for academic peer review so it can be independently evaluated but, as you know, academia moves slowly.
We are a young, small team with much to learn, and of course, we'll make mistakes. But I wouldn't characterise these as "grave shortcomings", so much as the typical, necessary, and important back and forth between researchers: A claims P, B disputes P, A replies to B, and so it goes on. Even excellent researchers overlook things: GiveWell notably awarded us a prize for our reanalysis of their deworming research. We've benefitted enormously from the comments we've got from others, and it shows the value of having a range of perspectives and experts. Scientific progress is the result of productive disagreements.
I think it's worth adding that SimonM's critique of StrongMinds did not refer to our meta-analytic work, but focused on concerns about StrongMinds' own study and analysis done outside HLI. As I noted in 1., we share the concerns about the earlier StrongMinds study, which is why we took the meta-analytic approach. Hence, I'm not sure SimonM's analysis told us much, if anything, we hadn't already incorporated. With hindsight, I think we should have communicated far more prominently how small a part StrongMinds' own studies played in our analysis, and been quicker off the mark to reply to SimonM's post (it came out during the Christmas holidays and I didn't want to order the team back to their (virtual) desks). Naturally, if you aren't convinced by our work, you will be sceptical of our recommendations.
You suggest we are engaged in motivated reasoning, setting out to prove what we already wanted to believe. This is a challenging accusation to disprove. The more charitable and, I think, true explanation is that we had a hunch about something important being missed and set out to do further research. We do complex interdisciplinary work to discover the most cost-effective interventions for improving the world. We have done this in good faith, facing an entrenched and sceptical status quo, with no major institutional backing or funding. Naturally, we won't convince everyone; we're happy the EA research space is a broad church. Yet it's disheartening to see you treat us as acting in bad faith, especially given our fruitful interactions, and we hope that you will continue to engage with us as our work progresses.
HLI has, in fact, put a lot of weight on the d = 1.72 Strongminds RCT. As table 2 shows, you give a weight of 13% to it - joint highest out of the 5 pieces of direct evidence. As there are ~45 studies in the meta-analytic results, this means this RCT is being given equal or (substantially) greater weight than any other study you include. For similar reasons, the Strongminds phase 2 trial is accorded the third highest weight out of all studies in the analysis.
HLI's analysis explains the rationale behind the weighting as "using an appraisal of its risk of bias and relevance to StrongMinds' present core programme". Yet table 1A notes the quality of the 2020 RCT is "unknown" - presumably because Strongminds has "only given the results and some supporting details of the RCT". I don't think it can be reasonable to assign the highest weight to an (as far as I can tell) unpublished, non-peer-reviewed, unregistered study conducted by Strongminds on its own effectiveness, reporting an astonishing effect size, before it has even been read in full. It should be dramatically downweighted or wholly discounted until then, rather than included at face value with a promise HLI will follow up later.
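To get a feel for what that weighting implies, a toy illustration - the "everything else" weight and effect size below are purely hypothetical, chosen only to show the mechanics, and are not HLI's actual inputs:

```python
# Toy example: how much a 13% weight on a d = 1.72 study can move a pooled estimate.
weights_and_effects = [
    (0.13, 1.72),  # the unpublished Strongminds 2020 RCT, at the weight given in table 2
    (0.87, 0.50),  # everything else lumped together at a hypothetical d = 0.5
]
pooled = sum(w * d for w, d in weights_and_effects)
print(pooled)  # ~0.66: this single study adds ~0.16 to the bottom line on its own
```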
Risk of bias in this field in general is massive: effect sizes commonly melt with improving study quality. Assigning ~40% of a weighted average of effect size to a collection of 5 studies, 4 [actually 3, more later] of which are (marked) outliers in effect size, and 2 of which were conducted by the charity itself, is unreasonable. This can be dramatically demonstrated from HLI's own data:
One thing I didn't notice last time I looked is that HLI did code variables on study quality for the included studies, although none of them seem to be used for any of the published analysis. I have some good news, and some very bad news.
The good news is the first such variable I looked at, ActiveControl, is a significant predictor of greater effect size. Studies with better controls report greater effects (roughly 0.6 versus 0.3). This effect is significant (p = 0.03) although small (10% of the variance) and difficult - at least for me - to explain: I would usually expect worse controls to widen the gap between them and the intervention group, not narrow it. In any case, this marker of study quality definitely does not explain away HLI's findings.
The second variable I looked at was "UnpubOr(pre?)reg".[1] As far as I can tell, coding 1 means something like "the study was publicly registered" and 0 means it wasn't (I'm guessing 0.5 means something intermediate like retrospective registration or similar) - in any case, this variable correlates extremely closely (>0.95) with my own coding of whether a study mentions being registered or not, after reviewing all of them myself. If so, using it as a moderator makes devastating reading:[2]
To orientate: in "Model results" the intercept value gives the estimated effect size when the "unpub" variable is zero (as I understand it, ~unregistered studies), so d ~ 1.4 (!) for this set of studies. The row below gives the change in effect if you move from "unpub = 0" to "unpub = 1" (i.e. from ~unregistered to registered studies): this drops the effect size by 1, so registered studies give effects of ~0.3. In other words, unregistered and registered studies give dramatically different effects: study registration reduces the expected effect size by a factor of 3. [!!!]
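For concreteness, a minimal sketch of this kind of moderator regression - the effect sizes, variances and coding below are made up for illustration, not HLI's data, and a full analysis would use a random-effects meta-regression rather than this simple inverse-variance weighted fit:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative study-level data: effect size d, its sampling variance, and a 0/1 "registered" indicator.
d   = np.array([1.5, 1.3, 1.6, 1.2, 0.4, 0.3, 0.2, 0.35])
var = np.array([0.04, 0.05, 0.06, 0.04, 0.02, 0.03, 0.02, 0.03])
reg = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X = sm.add_constant(reg)                   # intercept = expected d for unregistered studies
fit = sm.WLS(d, X, weights=1 / var).fit()  # inverse-variance weighted least squares
print(fit.params)                          # ~[1.39, -1.08]: unregistered ~1.4 vs. registered ~0.3
```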
The other statistics provided deepen the concern. The included studies have a very high level of heterogeneity (~their effect sizes vary much more than they should by chance). Although HLI attempted to explain this variation with various meta-regressions using features of the intervention, follow-up time, etc., these models left the great bulk of the variation unexplained. Although not like-for-like, here a single indicator of study quality provides compelling explanation for why effect sizes differ so much: it explains three-quarters of the initial variation.[3]
This is easily seen in a grouped forest plot - the top group is the non-registered studies, the second group the registered ones:
This pattern also perfectly fits the 5 pieces of direct evidence: Bolton 2003 (ES = 1.13), Strongminds RCT (1.72), and Strongminds P2 (1.09) are, as far as I can tell, unregistered. Thurman 2017 (0.09) was registered. Bolton 2007 is also registered, and in fact has an effect size of ~0.5, not 1.79 as HLI reports.[4]
To be clear, I do not think HLI knew of this before I found it out just now. But results like this indicate i) the appraisal of the literature in this analysis is gravely off the mark - study quality provides the best available explanation for why some trials report dramatically higher effects than others; ii) the result of this oversight is a dramatic over-estimation of the likely efficacy of Strongminds (as a ready explanation for the large effects reported in the most "relevant to Strongminds" studies is that these studies were not registered, and thus prone to ~200%+ inflation of effect size); iii) this is a very surprising mistake for a diligent and impartial evaluator to make: one would expect careful assessment of study quality - and very sceptical evaluation where this appears to be lacking - to be foremost, especially given the subfield and prior reporting from Strongminds both heavily underline it. This pattern, alas, will prove repetitive.
I also think a finding like this should prompt an urgent withdrawal of both the analysis and recommendation pending further assessment. In honesty, if this doesn't, I'm not sure what ever could.
2:
Indeed excellent researchers overlook things, and although I think both the frequency and severity of the things HLI mistakes or overlooks are less than excellent, one could easily attribute this to things like "inexperience", "trying to do a lot in a hurry", "limited staff capacity", and so on.
Yet this cannot account for how starkly asymmetric the impact of these mistakes and oversights is. HLI's mistakes are consistently to Strongminds' benefit rather than its detriment, and while HLI rarely misses a consideration which could enhance the "multiple", it frequently misses causes for concern which undermine both the strength and reliability of this recommendation. HLI's award from GiveWell deepens my concerns here, as it is consistent with a very selective scepticism: HLI can carefully scrutinise charity evaluations by others it wants to beat, but fails to mete out remotely comparable measure to its own work, which it intends for triumph.
I think this can also explain how HLI responds to criticism, which I have found by turns concerning and frustrating. HLI makes some splashy claim (cf. "mission accomplished", "confident recommendation", etc.). Someone else (eventually) takes a closer look, and finds the surprising splashy claim, rather than basically checking out "most reasonable ways you slice it", is highly non-robust, and only follows given HLI slicing it heavily in favour of their bottom line in terms of judgement or analysis - the latter of which often has errors which further favour said bottom line. HLI reliably responds, but the tenor of this response is less "scientific discourse" and more "lawyer for the defence": where it can, HLI will too often double down further on calls which I aver the typical reasonable spectator would deem at best dubious, and at worst tendentious; where it can't, HLI acknowledges the shortcoming but asserts (again, usually very dubiously) that it isn't that big a deal, so it will deprioritise addressing it versus producing yet more work with the shortcomings familiar from what came before.
3:
HLI's meta-analysis in no way allays or rebuts the concerns SimonM raised re. Strongminds - indeed, appropriate analysis would enhance many of them. Nor is it the case that the meta-analytic work makes HLI's recommendation robust to shortcomings in the Strongminds-specific evidence - indeed, the cost-effectiveness calculator will robustly recommend Strongminds as superior (commonly, several times superior) to GiveDirectly almost no matter what efficacy results (meta-analytic or otherwise) are fed into it. On each:
a) Meta-analysis could help contextualize the problems SimonM identifies in the Strongminds-specific data. For example, a funnel plot which is less of a "funnel" and more of a ski-slope (i.e. massive small-study effects/risk of publication bias), and a contour/p-curve suggestive of p-hacking, would suggest the field's literature needs to be handled with great care. Finding that "Strongminds-relevant" studies and direct evidence are marked outliers even relative to this pathological literature should raise alarm, given this complements the object-level concerns SimonM presented.
This is indeed true, and these features were present in the studies HLI collected, but HLI failed to recognise it. It may never have done so had I not gotten curious and done these analyses myself. Said analysis is (relative to the much more elaborate techniques used in HLI's meta-analysis) simple to conduct - my initial "work" was taking the spreadsheet and plugging it into a webtool out of idle curiosity.[5] Again, this is a significant mistake, adds a directional bias in favour of Strongminds, and is surprising for a diligent and impartial evaluator to make.
b) In general, incorporating meta-analytic results into what is essentially a weighted average alongside direct evidence does not clean either it or the direct evidence of object-level shortcomings. If (as here) both are severely compromised, the result remains unreliable.
The particular approach HLI took also doesn't make the finding more robust, as the qualitative bottom line of the cost-effectiveness calculation is insensitive to the meta-analytic result. As-is, the calculator gives Strongminds as roughly 12x better than GiveDirectly.[6] If you set both meta-analytic effect sizes to zero, the calculator gives Strongminds as ~7x better than GiveDirectly. So the five pieces of direct evidence are (apparently) sufficient to conclude SM is an extremely effective charity. Obviously this is - and HLI has previously accepted as much - facially invalid output.
It is not the only example. It is extremely hard for any reduction of the efficacy inputs to the model to give a result that Strongminds is worse than GiveDirectly. If we instead leave the meta-analytic results as they were but set all the effect sizes of the direct evidence to zero (in essence discounting them entirely - which I think is approximately what should have been done from the start), we get ~5x better than GiveDirectly. If we set all the effect sizes of both meta-analysis and direct evidence to 0.4 (i.e. the expected effects of registered studies noted before), we get ~6x better than GiveDirectly. If we set the meta-analytic results to 0.4 and set all the direct evidence to zero, we get ~3x GiveDirectly. Only when one sets all the effect sizes to 0.1 - lower than all but ~three of the studies in the meta-analysis - does one approach equipoise.
This result should not surprise on reflection: the CEA's result is roughly proportional to the ~weighted average of input effect sizes, so an initial finding of "10x GiveDirectly" or similar would require ~a factor of 10 cut to this average to drag it down to equipoise. Yet this "feature" should be seen as a bug: in the same way there should be some non-zero value of the meta-analytic results which should reverse a "many times better than GiveDirectly" finding, there should be some non-tiny value of effect sizes for a psychotherapy intervention (or psychotherapy interventions in general) which results in it not being better than GiveDirectly at all.
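A toy model of that proportionality - a sketch, not HLI's actual CEA; the baseline numbers are placeholders and the only point is the scaling behaviour:

```python
# Toy model: if the bottom-line "multiple of GiveDirectly" scales ~linearly with the average
# effect size fed in, a ~10x starting multiple needs a ~10x cut in effect size to reach 1x.
baseline_effect = 0.5     # placeholder pooled effect size
baseline_multiple = 10.0  # placeholder starting multiple of GiveDirectly

def multiple(effect_size):
    return baseline_multiple * effect_size / baseline_effect

print(multiple(0.5))   # 10.0
print(multiple(0.3))   # 6.0  - a 40% cut in effect size still leaves "several times GiveDirectly"
print(multiple(0.05))  # 1.0  - only a ~10x cut reaches equipoise
```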
This does help explain the somewhat surprising coincidence that the first charity HLI fully assessed would be one it subsequently announced as the most promising intervention in global health and wellbeing so far found: rather than a discovery from the data, this finding was largely preordained by how the CEA stacks the deck. To be redundant (and repetitive): i) the cost-effectiveness model HLI is using is unfit for purpose, given it can produce these absurd results; ii) this introduces a large bias in favour of Strongminds; iii) it is a very surprising mistake for a diligent and impartial evaluator to make - these problems are not hard to find.
They're even easier for HLI to find once they've been alerted to them. I did so months ago, alongside other problems, and suggested the cost-effectiveness analysis and Strongminds recommendation be withdrawn. Although it should have happened then, perhaps if I repeat myself it might happen now.
4:
Accusations of varying types of bad faith/motivated reasoning/intellectual dishonesty should indeed be made with care - besides the difficulty in determination, pragmatic considerations raise the bar still higher. Yet I think the evidence of HLI having less of a finger and more of a fist on the scale throughout its work overwhelms even charitable presumptions made by a saint on its behalf. In footballing terms, I don't think HLI is a player cynically diving to win a penalty, but it is like the manager after the game insisting "their goal was offside, and my player didn't deserve a red, and... (etc.)" - highly inaccurate and highly biased. This is a problem when HLI claims to be an impartial referee, especially when it does things akin to awarding fouls every time a particular player gets tackled.
This is even more of a problem precisely because of the complex and interdisciplinary analysis HLI strives to do. No matter the additional analytic arcana, work like this will be largely Fermi estimates, with variables plugged in with little more to inform them than intuitive guesswork. The high degree of complexity provides a vast garden of forking paths. Although random errors would tend to cancel out, consistent directional bias in model choice, variable selection, and numerical estimates leads to greatly inflated "bottom lines".
Although the transparency in (e.g.) data is commendable, the complex analysis also makes scrutiny harder. I expect very few have both the expertise and perseverance to carefully vet HLI's analysis themselves; I also expect the vast majority of money HLI has moved has come from those largely taking its results on trust. This trust is ill-placed: HLI's work weathers scrutiny extremely poorly; my experience is very much "the more you see, the worse it looks". I doubt many donors following HLI's advice, if they took a peek behind the curtain, would be happy with what they would discover.
If HLI is falling foul of an entrenched status quo, it is not one of particular presumptions around interventions, nor of philosophical abstracta around population ethics, but rather the presumption that work in this community (whether published elsewhere or not) should be even-handed, intellectually honest and trustworthy in all cases; rigorous and reliable commensurate to its expected consequence; and transparently and fairly communicated. Going against this grain underlies, I suspect, why I am not alone in my concerns, and why HLI has not had the warmest reception. The hope this all changes for the better is not entirely forlorn. But things would have to change a lot, and quickly - and the track record thus far does not spark joy.
Given I will be making complaints about publication bias, file-drawer effects, and garden-of-forking-paths issues later in the show, one might wonder how much of this applies to my own criticism. How much time did I spend dredging through HLI's work looking for something juicy? Is my file drawer stuffed with analyses I hoped would show HLI in a bad light, but which actually showed it in a good one, and so go unmentioned?
Depressingly, the answer is "not much" and "no" respectively. Regressing against publication registration was the second analysis I did on booting up the data again (regressing on active control was the first, mentioned in the text). My file drawer subsequent to this is full of checks and double-checks for alternative (and better for HLI) explanations for the startling result. Specifically, and in order:
- I used the no_FU (no follow-ups) data initially for convenience - the full data can include multiple results from the same study at different follow-up points, and these clustered findings are inappropriate to ignore in a simple random effects model. So I checked both by doing this anyway, and by using a multi-level model to appropriately manage this structure in the data. No change to the key finding.
- Worried that (somehow) I was messing up or misinterpreting the meta-regression, I (re)constructed a simple forest plot of all the studies, and confirmed that indeed the unregistered ones were visibly off to the right. I then grouped a forest plot by the registration variable to ensure it closely agreed with the meta-regression (in the main text). It does.
- I then checked the first 10 studies coded by the variable I think is trial registration, to check that the registration status of those studies matched the codes. Although all fit, I thought the residual risk that I was misunderstanding the variable was unacceptably high for a result significant enough to warrant a retraction demand. So I checked and coded all 46 studies by "registered or not?" to make sure this agreed with my presumptive interpretation of the variable (in text). It does.
- Adding multiple variables to explain an effect geometrically expands researcher degrees of freedom, thus any unprincipled ad hoc investigation by adding or removing them has very high false discovery rates (I suspect this is a major problem with HLI's own meta-regression work, but compared to everything else it merits only a passing mention here). But I wanted to check whether I could find ways (even if unprincipled and ad hoc) to attenuate a result as stark as "unregistered studies have 3x the effects of registered ones".
- I first tried to replicate HLI's meta-regression work (exponential transformations and all) to see if the registration effect would be attenuated by intervention variables. Unfortunately, I was unable to replicate HLI's regression results from the information provided (perhaps my fault). In any case, simpler versions I constructed did not give evidence for this.
- I also tried throwing in permutations of IPT-or-not (these studies tend to be unregistered, so maybe this is the real cause of the effect?), active control-or-not (given it had a positive effect size, maybe it cancels out registration?) and study standard error (a proxy - albeit a controversial one - for study size/precision/quality, so if registration were confounded by it, this would slightly challenge the interpretation). The worst result across all the variations I tried was to drop the effect size of registration by 20% (~ -1 to -0.8), typically via substitution with SE. Omitted variable bias and multiple comparisons mean any further interpretation would be treacherous, but insofar as it provides further support: adding in more proxies for study quality increases explanatory power, and tends to even greater absolute and relative drops in effect size comparing "highest" versus "lowest" quality studies.
That said, the effect size is so dramatic as to be essentially immune to file-drawer worries. Even if I had a hundred null results I forgot to mention, this finding would survive a Bonferroni correction.
Obviously "is the study registered or not?" is a crude indicator of overall quality. Typically, one would expect better measurement (perhaps by including further proxies for underlying study quality) to further increase the explanatory power of this factor. In other words, although these results look really bad, in reality it is likely to be even worse.
HLI's write-up on Bolton 2007 links to this paper (I did double-check to make sure there wasn't another Bolton et al. 2007 which could have been confused with this - no other match I could find). It has a sample size of 314, not 31 as HLI reports - I presume a data-entry error, although it is less than reassuring that this erroneous figure is repeated and subsequently discussed in the text as part of the appraisal of the study: one reason given for weighing it so lightly is its "very small" sample size.
Speaking of erroneous figures, here's the table of results from this study:
I see no way to arrive at an effect size of d = 1.79 from these numbers. The right comparison should surely be the pre-post difference of GIP versus control in the intention-to-treat analysis. These numbers give a Cohen's d ~ 0.5.
I don't think any other reasonable comparison gets much higher numbers, and definitely not >3x higher numbers - the differences between any of the groups are lower than the standard deviations, so estimates like Cohen's d should be bounded to < 1.
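In symbols, the comparison I take to be the right one here (a sketch; $SD_{\text{pooled}}$ is the pooled standard deviation):

$$d \approx \frac{(\bar{X}^{\text{post}}_{\text{GIP}} - \bar{X}^{\text{pre}}_{\text{GIP}}) - (\bar{X}^{\text{post}}_{\text{control}} - \bar{X}^{\text{pre}}_{\text{control}})}{SD_{\text{pooled}}} \approx 0.5$$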
[Re. the file drawer, I guess this counts as a spot check (this is the only study whose data extraction I carefully checked), but not a random one: I looked at this study in particular because it didn't fit the "only unregistered studies report crazy-high effects" pattern - an ES of 1.79 is ~2x any other registered study.]
Re. my worries about selective scepticism: HLI did apply these methods in their meta-analysis of cash transfers, where no statistical suggestion of publication bias or p-hacking was evident.
This does depend a bit on whether spillover effects are being accounted for. This seems to cut the multiple by ~20%, but doesn't change the qualitative problems with the CEA. Happy to calculate precisely if someone insists.
Hello Gregory. With apologies, I'm going to pre-commit to making this my last reply to you on this post. This thread has been very costly in terms of my time and mental health, and your points below are, as far as I can tell, largely restatements of your earlier ones. As briefly as I can, and point by point again.
1.
A casual reader looking at your original comment might mistakenly conclude that we only used StrongMinds' own study, and no other data, for our evaluation. Our point was that SM's own work has relatively little weight, and that we rely on many other sources. At this point, your argument seems rather "motte-and-bailey". I would agree with you that there are different ways to do a meta-analysis (your point 3), and we plan to publish our new psychotherapy meta-analysis in due course so that it can be reviewed.
2.
Here, you are restating your prior suggestion that HLI should be assumed to be acting in bad faith. Your claim is that HLI is good at spotting errors in others' work, but not its own. But there is an obvious explanation: "survivorship" effects. If you spot errors in your own research, you strip them out. Hence, by the time you publish, you've found all the ones you're going to find. This is why peer review is important: external reviewers will spot the errors that authors have missed themselves. Hence, there's nothing odd about having errors in your own work but also finding them in others'. This is the normal stuff of academia!
3.
I'm afraid I don't understand your complaint. I think your point is that "any way you slice the meta-analysis, psychotherapy looks more cost-effective than cash transfers", but you then conclude this shows the meta-analysis must be wrong, rather than that it's sensible to conclude psychotherapy is better. You're right that you would have to deflate all the effect sizes by a large proportion to reverse the result. This should give you confidence in psychotherapy being better! It's worth pointing out that if psychotherapy costs about $150pp, but cash transfers cost about $1,100pp ($1,000 transfer + delivery costs), therapy will be more cost-effective per intervention unless its per-intervention effect is much smaller.
The explanation behind finding a new charity on our first go is not complicated or sinister. In earlier work, including my PhD, I had suggested that, on a SWB analysis, mental health was likely to be relatively neglected compared to status quo prioritisation methods. I explained this in terms of the existing psychological literature on affective forecasting errors: we're not very good at imagining internal suffering, we probably overstate the badness of material circumstances due to focusing illusions, and our forecasts don't account for hedonic adaptation (which doesn't occur for mental health). So the simple explanation is that we were "digging" where we thought we were most likely to find "altruistic gold", which seems sensible given limited resources.
4.
As much as I enjoyed your football analogies, here also you're restating, rather than further substantiating, your earlier accusations. You seem to conclude, from the fact that you found some problems with HLI's analysis, that HLI, but only HLI, should be distrusted, and that we should retain our confidence in all the other charity evaluators. This seems unwarranted. Why not conclude you would find mistakes elsewhere too? I am reminded of the expression, "if you knew how the sausage was made, you wouldn't want to eat the sausage". What I think is true is that HLI is a second-generation charity evaluator, we are aiming to be extremely transparent, and we are proposing novel priorities. As a result, I think we have come in for a far higher level of public scrutiny than others have, so more of our errors have been found, but I don't know that we have made more and worse errors. Quite possibly, where errors have been noticed in others' work, they have been quietly and privately identified, and corrected with less fanfare.
we think it's preferable to rely on the existing evidence to draw our conclusions, rather than on forecasts of as-yet unpublished work.
I sense this is wrong: if I think the unpublished work will change my conclusions a lot, I change my conclusions some of the way now, though I understand that's a weird thing to do and perhaps hard to justify. Nonetheless I think it's the right move.
Could you say a bit more about what you mean by "should not have maintained once they were made aware of them" in point 2? As you characterize below, this is an org "making a funding request in a financially precarious position," and in that context I think it's even more important than usual to be clear about how HLI has "maintained" its "mistakes" "once they were made aware of them." Furthermore, I think the claim that HLI has "maintained" them is an important crux for your final point.
Example: I do not like that HLI's main donor advice page lists the 77 WELLBY per $1,000 estimate with only a very brief and neutral statement that "Note: we plan to update our analysis of StrongMinds by the end of 2023." There is a known substantial, near-typographical error underlying that analysis:
The first thing worth acknowledging is that he pointed out a mistake that substantially changes our results. [ . . . .] He pointed out that Kemp et al., (2009) finds a negative effect, while we recorded its effect as positive - meaning we coded the study as having the wrong sign.
[ . . . .]
This correction would reduce the spillover effect from 53% to 38% and reduce the cost-effectiveness comparison from 9.5 to 7.5x, a clear downwards correction.
While I'm sympathetic to HLI's small size and desire to produce a more comprehensive updated analysis, I don't think it's appropriate to be quoting numbers from an unpatched version of the CEA over four months after the error was discovered. (I'd be somewhat more flexible if this were based on new information rather than HLI's coding error, and/or if the difference didn't flip the recommendation for a decent percentage of would-be donors: deprivationists who believe the neutral point is less than 1.56 or so.)
With apologies for the delay. I agree with you that I am asserting HLI's mistakes have further "aggravating factors" which, I also assert, invite highly adverse inference. I had hoped the links I provided gave clear substantiation, but demonstrably not (my bad). Hopefully my reply to Michael makes them somewhat clearer, but in case not, I give a couple of examples below with as good an explanation as I can muster.
I will also be linking and quoting extensively from the Cochrane handbook for systematic reviews - so hopefully, even if my attempt to clearly explain the issues fails, a reader can satisfy themselves that my view on them agrees with expert consensus (rather than, say, "cantankerous critic with idiosyncratic statistical tastes flexing his expertise to browbeat the laity into acquiescence").
0) Per your remarks, there are various background issues around reasonableness, materiality, timeliness etc. I think my views basically agree with yours. In essence: I think HLI is significantly "on the hook" for work (such as the meta-analysis) it relies upon to make recommendations to donors - who will likely be taking HLI's representations on its results and reliability (cf. HLI's remarks about its "academic research", "rigour" etc.) on trust. Discoveries which threaten the "bottom line numbers" or overall reliability of this work should be addressed with urgency and robustness appropriate to their gravity. "We'll put checking this on our to-do list" seems fine for an analytic choice which might be dubious but is of unclear direction and small expected magnitude. As you say, a typo which, when corrected, reduces the bottom-line efficacy by ~20% should be fixed promptly.
The two problems I outlined 6 months ago should each have prompted withdrawal/suspension of both the work and the recommendation unless and until they were corrected.[1] Instead, HLI has not made appropriate corrections, and persists in misguiding donations and misrepresenting the quality of its research on the basis of work it has partly acknowledged (and which reasonable practitioners would overwhelmingly concur) was gravely compromised.[2]
1.0) Publication bias / small-study effects
It is commonplace in the literature for smaller studies to show different (typically larger) effect sizes than large studies. This is typically attributed to a mix of factors which differentially inflate effect size in smaller studies (see), perhaps the main one being publication bias: although big studies are likely to be published "either way", investigators may not finish (or journals may not publish) smaller studies reporting negative results.
It is extremely well recognised that these effects can threaten the validity of meta-analysis results. If you are producing something (very roughly) like an "average effect size" from your included studies, the studies being selected for showing a positive effect means the average is inflated upwards. This bias is very difficult to reliably adjust for or "patch" (more later), but it can easily be large enough to mean "Actually, the treatment has no effect, and your meta-analysis is basically summarizing methodological errors throughout the literature".
Hence why most work on this topic stresses the importance of arduous efforts in prevention (e.g. trying really hard to find "unpublished" studies) and diagnosis (i.e. carefully checking for statistical evidence of this problem) rather than "cure" (see e.g.). If a carefully conducted analysis nonetheless finds stark small-study effects, this - rather than the supposed ~"average" effect - would typically be (and should definitely be) the main finding: "The literature is a complete mess - more, and much better, research needed".
As in many statistical matters, a basic look at your data can point you in the right direction. For meta-analysis, the standard first look is a forest plot:
To orientate: each row is a study (presented in order of increasing effect size), and the horizontal scale is effect size (where further to the right = greater effect size favouring the intervention). The horizontal bar for each study gives the confidence interval for the effect size, with the middle square marking the central estimate (also given in the rightmost column). The diamond right at the bottom is the pooled effect size - the (~~)[3] average effect across studies mentioned earlier.
Here, the studies are all over the map, many of which do not overlap with one another, nor with the pooled effect size estimate. In essence, dramatic heterogeneity: the studies are reporting very different effect sizes from one another. Heterogeneity is basically a fact of life in meta-analysis, but a forest plot like this invites curiosity (or concern) about why effects are varying quite this much. [I'm going to skip discussion of formal statistical tests/metrics for things like this for clarity - you can safely assume a) yes, you can provide more rigorous statistical assessment of "how much" besides "eyeballing it", although visually obvious things are highly informative; b) the things I mention you can see are indeed (highly) statistically significant, etc.]
There are some hints from this forest plot that small-study effects could have a role to play. Although very noisy, larger studies (those with narrower horizontal lines, because bigger study ~ less uncertainty in effect size) tend to be higher up the plot and to have smaller effects. There is another plot designed to look at this better - a funnel plot.
To orientate: each study is now a point on a scatterplot, with effect size again on the x-axis (right = greater effect). The y-axis is now the standard error: bigger studies have greater precision, and so lower sampling error, so are plotted higher on the y-axis. Each point is a single study - all being well, the scatter should look like a (symmetrical) triangle or funnel like those being drawn on the plot.
All is not well here. The scatter is clearly asymmetric and sloping to the right - smaller studies (towards the bottom of the graph) tend towards greater effect sizes. The lines being drawn on the plot make this even clearer. Briefly:
The leftmost "funnel" with shaded wings is centered on an effect size of zero (i.e. zero effect). The white middle triangle contains findings which would give a p-value > 0.05, and the shaded wings correspond to a p-value between 0.05 ("statistically significant") and 0.01: it is an upward-pointing triangle because bigger studies can detect smaller differences from zero as "statistically significant" than smaller ones can. There appears to be clustering in the shaded region, suggesting that studies may be being tweaked to get them "across the threshold" of statistically significant effects.
The rightmost "funnel" without shading is centered on the pooled effect estimate (0.5). Within the triangle is where you would expect 95% of the scatter of studies to fall in the absence of heterogeneity (i.e. if there were just one true effect size, and the studies varied from it only through sampling error). Around half are outside this region.
The red dashed line is the best-fit line through the scatter of studies. If there weren't small-study effects, it would be basically vertical. Instead, it slopes off heavily to the right.
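For those who want the mechanics, a minimal sketch of how such a funnel plot (and the fitted regression line) can be constructed - the effect sizes and standard errors below are generated for illustration and are not HLI's data:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Made-up study-level data with a built-in small-study effect for illustration.
rng = np.random.default_rng(0)
se = rng.uniform(0.05, 0.45, 40)                 # standard errors (smaller = bigger study)
d = 0.15 + 2.0 * se + rng.normal(0, 0.1, 40)     # effect sizes that grow as studies get smaller

fig, ax = plt.subplots()
ax.scatter(d, se)
ax.invert_yaxis()                                # bigger (more precise) studies plotted at the top

# Null-centred "funnel": the region within +/- 1.96*SE of zero (p > 0.05 against no effect).
se_grid = np.linspace(0, se.max(), 100)
ax.plot(1.96 * se_grid, se_grid, "k--")
ax.plot(-1.96 * se_grid, se_grid, "k--")

# Egger-style regression of effect size on standard error: a vertical line would mean no
# small-study effect; the intercept is the predicted effect of an infinitely precise study.
fit = sm.WLS(d, sm.add_constant(se), weights=1 / se**2).fit()
ax.plot(fit.params[0] + fit.params[1] * se_grid, se_grid, "r--")
ax.set_xlabel("Effect size (d)")
ax.set_ylabel("Standard error")
plt.show()
```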
Although a very asymmetric funnel plot is not proof positive of publication bias, findings like this demand careful investigation and cautious interpretation (see generally). It is challenging to assess exactly "how big a deal is it, though?": statistical adjustment for biases in the original data is extremely fraught.
But we are comfortably in "big deal" territory: this finding credibly up-ends HLI's entire analysis:
a) There are different ways of getting a "pooled estimate" (~~average, or ~~typical, effect size): random effects (where you assume the true effect is a distribution of effects from which each study samples), vs. fixed effects (where there is a single value for the true effect size). Random effects are commonly preferred as - in reality - one expects the true effect to vary, but the results are much more vulnerable to any small-study effects/publication bias (see generally). Comparing the random effects vs. the fixed effect estimate can give a quantitative steer on the possible scale of the problem, as well as guide subsequent analysis.[4] Here, the random effects estimate is 0.52, whilst the fixed one is less than half the size: 0.18.
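A minimal sketch of that comparison (a simple inverse-variance fixed-effect pool vs. a DerSimonian-Laird random-effects pool; the effect sizes and variances are made up, chosen so that the small studies report the large effects):

```python
import numpy as np

def fixed_and_random_effects(d, var):
    """Inverse-variance fixed-effect pool and DerSimonian-Laird random-effects pool."""
    w = 1 / var
    fe = np.sum(w * d) / np.sum(w)              # fixed-effect estimate
    q = np.sum(w * (d - fe) ** 2)               # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(d) - 1)) / c)     # between-study variance
    w_re = 1 / (var + tau2)
    return fe, np.sum(w_re * d) / np.sum(w_re)  # (fixed, random)

# Made-up data where the small (high-variance) studies report the big effects:
d   = np.array([0.1, 0.15, 0.2, 0.9, 1.0, 1.1, 1.2, 1.3])
var = np.array([0.01, 0.01, 0.02, 0.09, 0.09, 0.09, 0.09, 0.09])
print(fixed_and_random_effects(d, var))  # ~(0.31, 0.66): the gap between the two flags small-study effects
```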
b) There are other statistical methods you could use (more later). One of the easier to understand (but also one of the most conservative) goes back to the red dashed line in the funnel plot. You could extrapolate from it to the point where standard error = 0: the predicted effect of an infinitely large (so infinitely precise) study - and so also where the "small-study effect" is zero. There are a few different variants of these sorts of "regression methods", but the ones I tried predict effect sizes for such a hypothetical study between 0.17 and 0.05. So, quantitatively, 70-90% cuts of effect size are on the table here.
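In symbols, this is the regression sometimes called "PET" in the publication-bias literature (a sketch; $\hat{\beta}_0$ is the extrapolated effect of a hypothetical zero-standard-error study, typically fit with weights proportional to $1/SE_i^2$):

$$d_i = \beta_0 + \beta_1 SE_i + \varepsilon_i, \qquad \hat{d}_{SE=0} = \hat{\beta}_0 \approx 0.05 \text{ to } 0.17$$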
c) A reason why regression methods are conservative is that they will attribute as much variation in reported results as possible to differences in study size. Yet there could be alternative explanations for this besides publication bias: maybe smaller studies have different patient populations with (genuinely) greater efficacy, etc.
However, this statistical confounding can go the other way. HLI is not presenting simple meta-analytic results, but rather meta-regressions: the differences in reported effect sizes are being predicted by differences between and within the studies (e.g. follow-up time, how much therapy was provided, etc.). One of HLI's findings from this work is that psychotherapy with Strongminds-like traits is ~70% more effective than psychotherapy in general (0.8 vs. 0.46). If this is because factors like "group or individual therapy" correlate with study size, the real story could simply be: "Strongminds-like traits are indicators of methodological weaknesses which greatly inflate reported effect size, rather than of a more effective therapeutic modality." In HLI's analysis, the latter is presumed, giving about a ~10% uplift to the bottom-line results.[5]
1.2) A major issue, and a major mistake to miss
So this is a big issue, and one which would be revealed by standard approaches. HLI instead used a very non-standard approach (see), novel - as far as I can tell - to existing practice and, unfortunately, inappropriate (cf., point 5): it gives ~a 10-15% discount (although I'm not sure this has been used in the Strongminds assessment; it is in the psychotherapy one).
I came across these problems ~6 months ago, prompted by a question from Ryan Briggs (someone with considerably greater expertise than my own) asking after the forest and funnel plots. I also started digging into the data in general at the same time, and noted the same key points explained laboriously above: it looks like there is marked heterogeneity and small-study effects, they look very big, and they call the analysis results into question. Long story short, they said they would take a look at it urgently and then report back.
This response is fine, but as my comments then indicated, I did have (and I think reasonably had) HLI on pretty thin ice / "epistemic probation" after finding these things out. You have to make a lot of odd choices to end up this far from normal practice, and nonetheless still make some surprising oversights too, to end up missing problems which would appear to greatly undermine a positive finding for Strongminds.[6]
1.3) Maintaining this major mistake
HLI fell through this thin ice after its follow-up. Their approach was to try a bunch of statistical techniques to adjust for publication bias (excellent), do the same for their cash transfers meta-analysis (sure), then use the relative discounts between them to get an adjustment for psychotherapy vs. cash transfers (good, esp. as adding directly into the multi-level meta-regressions would be difficult). Further, they provided full code and data for replication (great). But the results made no sense whatsoever:
To orientate: each row is a different statistical technique applied to the two meta-analyses (more later). The x-axis is the "multiple" of Strongminds vs. cash transfers, and the black line is at 9.4x, the previous "status quo value". Bars shorter than this mean that adjusting for publication bias results in an overall discount for Strongminds, and vice-versa.
The cash transfers funnel plot looks like this:
Compared to the psychotherapy one, it basically looks fine: the scatter looks roughly like a funnel, and there is no massive trend towards smaller studies = bigger effects. So how could so many statistical methods discount the "obvious small study effect" meta-analysis less than the "no apparent small study effect" meta-analysis, giving an increased multiple? As I said at the time, the results look like nonsense to the naked eye.
One problem was a coding error in two of the statistical methods (blue and pink bars). The bigger problem is that the way the comparisons are being done is highly misleading.
Take a step back from all the dividing going on and just look at the effect sizes. The basic, nothing fancy, random effects model applied to the psychotherapy data gives an effect size of 0.5. If you take the average across all the other model variants, you get ~0.3, a 40% drop. For the cash transfers meta-analysis, the basic model gives 0.1, and the average of all the other models is ~0.09, a 10% drop. So in fact you are seeing, as you should, bigger discounts when adjusting the psychotherapy meta-analysis than the cash transfers meta-analysis. This is lost by how the divisions are being done, which largely "play off" multiple adjustments against one another (see, pt. 2). What the graph should look like is this:
Two things are notable: 1) the different models tend to point to a significant drop (~30-40% on average) in effect size; 2) there is a lot of variation in the discount, from ~0 to ~90% (a visual illustration of why it is known to be very hard to reliably "adjust"). I think these results oblige something like the following:
Re. write-up: At least including the forest and funnel plots, alongside a description of why they are concerning. Should also include some "best guess" correction from the above, noting this has a (very) wide range. Probably warrants "back to the drawing board" given the reliability issues.
Re. overall recommendation: At least a very heavy asterisk placed beside the recommendation. Should also highlight both the adjustment and the uncertainty in front-facing materials (e.g. "tentative suggestion" vs. "recommendation"). Probably warrants withdrawal.
Re. general reflection: I think a reasonable evaluator, beyond directional effects, would be concerned about the "near"(?) miss property of having a major material issue not spotted before pushing a strong recommendation, "phase 1 complete/mission accomplished", etc., especially when found by a third party many months after initial publication. They might also be concerned about the direction of travel. When published, the multiplier was 12x; with spillovers, it falls to 9.5x; with spillovers and the typo corrected, it falls to 7.5x; with a 30% best guess correction for publication bias, we're now at 5.3x. Maybe any single adjustment is not recommendation-reversing, but in concert they are, and the track record suggests the next one is more likely to be further down rather than back up.
What happened instead 5 months ago was that HLI would read some more and discuss among themselves whether my take on the comparators was the right one (it was; this is not reasonably controversial, e.g. 1, 2, cf. fn4). Although "looking at publication bias" is part of their intended "refining" of the Strongminds assessment, there's been nothing concrete done yet.
Maybe I should have chased, but the exchange on this (alongside the other thing) made me lose faith that HLI was capable of reasonably assessing and appropriately responding to criticisms of their work when material to their bottom line.
2) The cost effectiveness guestimate.
[Readers will be relieved ~no tricky stats here]
As I was looking at the meta-analysis, I added my attempt at "adjusted" effect sizes for the same into the CEA to see what impact they had on the results. To my surprise, not very much. Hence my previous examples about "Even if the meta-analysis has zero effect the CEA still recommends Strongminds as several times GD", and "You only get to equipoise with GD if you set all the effect sizes in the CEA to near-zero."
I noted this alongside my discussion around the meta-analysis 6m ago. Earlier remarks from HLI suggested they accepted these were diagnostic of something going wrong with how the CEA is aggregating information (fixing it would be done, but not as a priority); more recent ones suggest more "doubling down".
In any case, they are indeed diagnostic of a lack of face validity. You would, obviously, be highly sceptical of a claim that a particular psychotherapy intervention was extremely effective if the meta-analysis of psychotherapy in general found a zero (or harmful!) effect. The (pseudo-)Bayesian gloss on why is that the distribution of reported effect sizes gives additional information on the likely size of the "real" effects underlying them (cf. the heterogeneity discussed above). A bunch of weird discrepancies among them, if hard to explain by intervention characteristics, increases the suspicion that weird distortions, rather than true effects, underlie the observations. So increasing discrepancy between indirect and direct evidence should reduce the effect size beyond its impact on any weighted average.
It does not help that the findings as-is are highly discrepant and generally weird. Among many examples:
Why do the Strongminds-like trials in the direct evidence have among the greatest effects of any of the studies included, and ~1.5x-2x the effect of a regression prediction for studies with Strongminds-like traits?
Why are the most Strongminds-y studies included in the meta-analysis marked outliers, even after "correction" for small study effects?
What happened between the original Strongminds Phase 2 and the Strongminds RCT to up the intervention efficacy by 80%?
How come the only study which compares psychotherapy to a cash transfer comparator is also the only study which gives a negative effect size?
I don't know what the magnitude of the directional "adjustment" would be, as this relies on a specific understanding of the likelier explanations for the odd results (I'd guess a 10%+ downward correction assuming I'm wrong about everything else; obviously, much more if indeed the vast bulk of the variation in effects can be explained by sample size +/- registration status of the study). Alone, I think it mainly points to the quantitative engine needing an overhaul, and the analysis being known-unreliable until it gets one.
In any case, it seems urgent and important to understand and fix. The numbers are being widely used and relied upon (probably all of them need at least a big public asterisk pending the development of more reliable techniques). It seems particularly unwise to be reassured by "Well sure, this is a downward correction, but the CEA still gives a good bottom line multiple", as the bottom line number may not be reasonable, especially conditioned on different inputs. Even more so to persist in doing so 6m after being made aware of the problem.
These are mentioned in 3a and 3b of my reply to Michael. Point 1 there (kind of related to 3a) would on its own warrant immediate retraction, but that is not a case (yet) of "maintained" error.
One quote from the Cochrane handbook feels particularly apposite:
Do not start here!
It can be tempting to jump prematurely into a statistical analysis when undertaking a systematic review. The production of a diamond at the bottom of a plot is an exciting moment for many authors, but results of meta-analyses can be very misleading if suitable attention has not been given to formulating the review question; specifying eligibility criteria; identifying and selecting studies; collecting appropriate data; considering risk of bias; planning intervention comparisons; and deciding what data would be meaningful to analyse. Review authors should consult the chapters that precede this one before a meta-analysis is undertaken.
In the presence of heterogeneity, a random-effects meta-analysis weights the studies relatively more equally than a fixed-effect analysis (see Chapter 10, Section 10.10.4.1). It follows that in the presence of small-study effects, in which the intervention effect is systematically different in the smaller compared with the larger studies, the random-effects estimate of the intervention effect will shift towards the results of the smaller studies. We recommend that when review authors are concerned about the influence of small-study effects on the results of a meta-analysis in which there is evidence of between-study heterogeneity (I2 > 0), they compare the fixed-effect and random-effects estimates of the intervention effect. If the estimates are similar, then any small-study effects have little effect on the intervention effect estimate. If the random-effects estimate has shifted towards the results of the smaller studies, review authors should consider whether it is reasonable to conclude that the intervention was genuinely different in the smaller studies, or if results of smaller studies were disseminated selectively. Formal investigations of heterogeneity may reveal other explanations for funnel plot asymmetry, in which case presentation of results should focus on these. If the larger studies tend to be those conducted with more methodological rigour, or conducted in circumstances more typical of the use of the intervention in practice, then review authors should consider reporting the results of meta-analyses restricted to the larger, more rigorous studies.
This is not the only problem in HLI's meta-regression analysis. Analyses here should be pre-specified (especially if intended as the primary result rather than some secondary exploratory analysis), to limit risks of inadvertently cherry-picking a model which gives a preferred result. Cochrane (see):
Authors should, whenever possible, pre-specify characteristics in the protocol that later will be subject to subgroup analyses or meta-regression. The plan specified in the protocol should then be followed (data permitting), without undue emphasis on any particular findings (see MECIR Box 10.11.b). Pre-specifying characteristics reduces the likelihood of spurious findings, first by limiting the number of subgroups investigated, and second by preventing knowledge of the studies' results influencing which subgroups are analysed. True pre-specification is difficult in systematic reviews, because the results of some of the relevant studies are often known when the protocol is drafted. If a characteristic was overlooked in the protocol, but is clearly of major importance and justified by external evidence, then authors should not be reluctant to explore it. However, such post-hoc analyses should be identified as such.
HLI does not mention any pre-specification, and there is good circumstantial evidence of a lot of this work being ad hoc re. "Strongminds-like traits". HLI's earlier analysis of psychotherapy in general, using most (?all) of the same studies as in their Strongminds CEA (4.2, here), had different variables used in a meta-regression on intervention properties (table 2). It seems likely the change of model happened after study data was extracted (the lack of significant prediction, and the inclusion of a large number of variables for a relatively small number of studies, would be further concerns). This modification seems to favour the intervention: I think the earlier model, if applied to Strongminds, gives an effect size of ~0.6.
I really appreciate you putting in the work and being so diligent Gregory. I did very little here, though I appreciate your kind words. Without you seriously digging in, we'd have a very distorted picture of this important area.
Hello Jason. FWIW, I've drafted a reply to your other comment and I'm getting it checked internally before I post it.
On this comment about you not liking that we hadn't updated our website to include the new numbers: we all agree with you! It's a reasonable complaint. The explanation is fairly boring: we have been working on a new charity recommendations page for the website, at which point we were going to update the numbers and add a note, so we could do it all in one go. (We still plan to do a bigger reanalysis later this year.) However, that has gone slower than expected and hasn't happened yet. Because of your comment, we'll add a "hot fix" update in the next week, and hopefully have the new charity recommendations page live in a couple of weeks.
I think we'd have moved faster on this if it had substantially changed the results. On our numbers, StrongMinds is still the best life-improving intervention (it's several times better than cash and we're not confident deworming has a long-term effect). You're right it would slightly change the crossover point for choosing between life-saving and life-improving interventions, but we've got the impression that donors weren't making much use of our analysis anyway; even if they were, it's a pretty small difference, and well within the margin of uncertainty.
(Looking back at the comment, I see the example actually ended up taking more space than the lead point! Although I definitely agree that the hot fix should happen, I hope the example didn't overshadow the comment's main intended point: that people who have concerns about HLI's response to recent criticisms should raise their concerns with a degree of specificity, and explain why they have those concerns, to allow HLI an opportunity to address them.)
Meta-note as a casual lurker in this thread: this comment being down-voted to oblivion while Jason's comment is not is pretty bizarre to me. The only explanation I can think of is that people who have provided criticism think Michael is saying they shouldn't criticise? It is blatantly obvious to me that this is not what he is saying; he is simply agreeing with Jason that specific, actionable criticism is better.
Fun meta-meta note I just realized after writing the above: this does mean I am potentially criticising some critics who are critical of how Michael is criticising their criticism.
Okkkk, that's enough internet for me. Peace and love, y'all.
Michael's comment has 14 non-author up/downvotes and 10 non-author agree/disagree votes; mine has one of each. This is possibly due to the potential to ascribe a comment by HLI's director several meanings that are not plausible to give a comment by a disinterested observer, e.g., "Org expresses openness to changes to address concerns," "Org is critical of critics," etc.
I'm not endorsing any potential meaning, although I have an upvote on his comment.
The more disappointing meta-note to me is that helpful, concrete suggestions have been relatively sparse on this post as a whole. I wrote some suggestions for future epistemic practices, and someone else called for withdrawing the SM recommendation and report. But overall, there seemed to be much more energy invested in litigating than in figuring out a path forward.
...helpful, concrete suggestions have been relatively sparse on this post as a whole.
I don't really share this sense (I think that even most of Gregory Lewis' posts in this thread have had concretely useful advice for HLI, e.g. this one), but let's suppose for the moment that it's true. Should we care?
In the last round of posts, four to six months ago, HLI got plenty of concrete and helpful suggestions. A lot of them were unpleasant, stuff like "you should withdraw your cost-effectiveness analysis" and "here are ~10 easy-to-catch problems with the stats you published", but highly specific and actionable. What came of that? What improvements has HLI made? As far as I can tell, almost nothing has changed, and they're still fundraising off of the same flawed analyses. There wasn't even any movement on this unambiguous blunder until you called it out. It seems to me that giving helpful, concrete suggestions to HLI has been tried, and shown to be low impact.
One thing people can do in a thread like this one is talk to HLI, to praise them, ask them questions, or try to get them to do things differently. But another thing they can do is talk to each other, to try and figure out whether they should donate to HLI or not. For that, criticism of HLI is valuable, even if it's not directed to HLI. This, too, counts as "figuring out a path forward".
Edited so that I only had a couple of comments rather than 4.
I am confident those involved really care about doing good and work really hard. And I don't want that to be lost in this confusion. Something is going on here, but I think "it is confusing" is better than "HLI are baddies".
For clarity, being 2x better than cash transfers would still provide it with good reason to be on GWWC's top charity list, right? Since GiveDirectly is?
I guess the most damning claim seems to be about dishonesty, which I find hard to square with the caliber of the team. So, what's going on here? If, as seems likely, the forthcoming RCT downgrades SM a lot and the HLI team should have seen this coming, why didn't they act? Or do they still believe that the RCT will return very positive results? What happens when, as seems likely, they are very wrong?
Among other things, this would confirm a) SimonM produced a more accurate and trustworthy assessment of Strongminds in their spare time as a non-subject matter expert than HLI managed as the centrepiece of their activity
Note that SimonM is a quant by day and was for a time top on Metaculus, so I am less surprised that he can produce such high caliber work in his spare time[1].
I don't know how to say this, but it doesn't surprise me that top individuals are able to do work comparable with research teams. In fact I think it's one of the best cases for the forum. Sometimes talented generalists compete toe to toe with experts.
Finally, it seems possible to me that criticisms can be true but HLI can still have done work we want to fund. The world is ugly and complicated like this. I think we should aim to make the right call in this case. For me the key question is: why haven't they updated in light of StrongMinds likely being worse than they thought?
Simon worked as a crypto quant and has since lost his job (cos of the crash caused by FTX) so is looking for work including EA work. You can message him if interested.
+1 regarding extending the principle of charity towards HLI. Anecdotally it seems very common for initial CEA estimates to be revised down as the analysis is critiqued. I think HLI has done an exceptional job at being transparent and open regarding their methodology and the source of disagreements, e.g. see Joel's comment outlining the sources of disagreement between HLI and GiveWell, which I thought was really exceptional (https://forum.effectivealtruism.org/posts/h5sJepiwGZLbK476N/assessment-of-happier-lives-institute-s-cost-effectiveness?commentId=LqFS5yHdRcfYmX9jw). Obviously I haven't spent as much time digging into the results as Gregory has, but the mistakes he points to don't seem like the kind that should be treated too harshly.
As a separate point, I think it's generally a lot easier to critique and build upon an analysis after the initial work has been done. E.g. even if it is the case that SimonM's assessment of Strong Minds is more reliable than HLI's (HLI seem to dispute that the critiques he levies are all that important, as they only assign a 13% weight to that RCT), this isn't necessarily evidence that SimonM is more competent than the HLI team. When the heavy lifting has been done, it's easier to focus in on particular mistakes (and of course valuable to do so!).
For clarity, being 2x better than cash transfers would still provide it with good reason to be on GWWC's top charity list, right? Since GiveDirectly is?
I think GiveDirectly gets special privilege because "just give the money to the poorest people" is such a safe bet for how to spend money altruistically.
Like if a billionaire wanted to spend a million dollars making your life better, they could either:
just give you the million dollars directly, or
spend the money on something that they personally think would be best for you
You'd want them to set a pretty high bar of "I have high confidence that the thing I chose to spend the money on will be much better than whatever you would spend the money on yourself."
GiveDirectly does not have the "top-rated" label on GWWC's list, while SM does as of this morning.
I can't find the discussion, but my understanding is that "top-rated" means that an evaluator GWWC trusts (in SM's case, that was Founders Pledge) thinks that a charity is at a certain multiple (was it like 4x?) over GiveDirectly.
However, on this post, Matt Lerner @ FP wrote that "We disagree with HLI about SM's rating - we use HLI's work as a starting point and arrive at an undiscounted rating of 5-6x; subjective discounts place it between 1-2x, which squares with GiveWell's analysis."
So it seems that GWWC should withdraw the "top-rated" flag because none of its trusted evaluation partners currently rate SM at better than 2.3x cash. It should not, however, remove SM from the GWWC platform, as it meets the criteria for inclusion.
Hmm, this feels a bit off. I don't think GiveDirectly should get special privilege. Though I agree the out-of-model factors seem to go better for GD than for others, so I would kind of bump it up.
Hello Nathan. Thanks for the comment. I think the only key place where I would disagree with you is what you said here
If, as seems likely, the forthcoming RCT downgrades SM a lot and the HLI team should have seen this coming, why didn't they act?
As I said in response to Greg (to which I see you've replied), we use the conventional scientific approach of relying on the sweep of existing data, rather than on our predictions of what future evidence (from a single study) will show. Indeed, I'm not sure how easily these would come apart: I would base my predictions substantially on the existing data, which we've already gathered in our meta-analysis (obviously, it's a matter of debate as to how to synthesise data from different sources, and opinions will differ). I don't have any reason to assume the new RCT will show effects substantially lower than the existing evidence, but perhaps others are aware of something we're not.
HLI - but if, for whatever reason, they're unable or unwilling to receive the donation at resolution, Strongminds.
The "resolution criteria" are also potentially ambiguous (my bad). I intend to resolve any ambiguity stringently against me, but you are welcome to be my adjudicator.
[To add: I'd guess a ~30-something% chance I end up paying out: d = 0.4 is at or below pooled effect estimates for psychotherapy generally. I am banking on significant discounts with increasing study size and quality (as well as other things I mention above I take as adverse indicators), but even if I price these right, I expect high variance.
I set the bar this low (versus, say, d = 0.6, at the ~5th percentile of HLI's estimate) primarily to make a strong rod for my own back. Mordantly criticising an org whilst they are making a funding request in a financially precarious position should not be done lightly. Although I'd stand by my criticism of HLI even if the trial found Strongminds was even better than HLI predicted, I would regret being quite as strident if the results were any less than dramatically discordant.
If so, me retreating to something like "Meh, they got lucky"/"Sure I was (kinda) wrong, but you didn't deserve to be right" seems craven after over-cooking remarks potentially highly adverse to HLI's fundraising efforts. Fairer would be that I suffer some financial embarrassment, which helps compensate HLI for their injury from my excess.
Perhaps I could have (or should have) done something better. But in fairness to me, I think this is all supererogatory on my part: I do not think my comment is the only example of stark criticism on this forum, but it might be unique in its author levying an expected cost of over $1000 on themselves for making it.]
[Own views]
An update:
I'm not sure anything more really needs to be said at this point. But much more could be, and I fear I'll feel obliged to return to these topics before long regardless.
The report describes the outcomes on p.10:
Measurements were taken following treatment completion ("Rapid resurvey"), then at 12m and 24m thereafter (midline and endline respectively).
I use both the primary indicators and the discrete values of the underlying scores they are derived from. I haven't carefully looked at the other secondary outcomes nor the human capital variables, but besides being less relevant, I do not think these showed much greater effects.
I.e. I took the figures from Table 6 (comparing IPT-G vs. control) for these measures and plugged them into a webtool for Cohen's h or d as appropriate. This is rough and ready, although my calculations agree with the effect sizes either mentioned or described in the text. They also pass an "eye test" of comparing them to the cumulative distributions of the scores in figure 3: these distributions are very close to one another, consistent with a small-to-no effect (one surprising result of this study is that IPT-G + cash led to worse outcomes than either control or IPT-G alone).
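For readers who want to reproduce this sort of back-of-envelope conversion, the formulas behind such a webtool are straightforward; the numbers below are placeholders for illustration, not values taken from Table 6:

```python
import math

def cohens_d(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Cohen's d from group means and SDs, using the pooled SD."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

def cohens_h(p_t, p_c):
    """Cohen's h for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p_t)) - 2 * math.asin(math.sqrt(p_c))

# Placeholder inputs -- substitute the group means/SDs or prevalences
# actually reported in Table 6 of Baird et al.:
print(round(cohens_d(11.2, 5.0, 1200, 11.8, 5.1, 1200), 2))  # continuous score
print(round(cohens_h(0.25, 0.30), 2))                        # binary indicator
```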
One of the virtues of this study is that it includes a reproducibility package, so I'd be happy to produce a more rigorous calculation directly from the provided data if folks remain uncertain.
My view is that HLI[1], GWWC[2], Founders Pledge[3], and other EA / effective giving orgs that recommend or provide StrongMinds as a donation option should ideally at least update their page on StrongMinds to include relevant considerations from this RCT, and do so well before Thanksgiving / Giving Tuesday in Nov/Dec this year, so donors looking to decide where to spend their dollars most cost-effectively can make an informed choice.[4]
Listed as a top recommendation
Not currently a recommendation (but included as an option to donate)
Currently tagged as an "active recommendation"
Acknowledging that HLI's current schedule is "By Dec 2024", though this may only give donors 3 days before Giving Tuesday.
Thanks Bruce, would you still think this if Strongminds ditched their adolescent programs as a result of this study and continued with their core groups with older women?
Yes, because:
1) I think this RCT is an important proxy for StrongMinds (SM)'s performance "in situ", and worth updating on, in part because it is currently the only completed RCT of SM. Uninformed readers who read what is currently on e.g. the GWWC[1]/FP[2]/HLI websites might reasonably get the wrong impression of the evidence base behind the recommendation around SM (i.e. that there are no concerns sufficiently noteworthy to merit inclusion as a caveat). I think the effective giving community should have a higher bar for being proactively transparent here: it is much better to include (at minimum) a relevant disclaimer like this than to be asked questions by donors and claim that there wasn't capacity to include one.[3]
2) If a SM recommendation is justified as a result of SM's programme changes, this should still be communicated for trust-building purposes (e.g. "We are recommending SM despite [Baird et al RCT results], because ..."), both for those who are on the fence about deferring, and for those who now have a reason to re-affirm their existing trust in EA org recommendations.[4]
3) Help potential donors make more informed decisions. For example, informed readers who may be unsure about HLI's methodology and wanted to wait for the RCT results should not have to search this up themselves, or look for a fairly buried comment thread on a post from >1 year ago, in order to make this decision when looking at EA recommendations / links to donate. I don't think it's an unreasonable amount of effort compared to how much it may help. This line of reasoning may also apply to other evaluators (e.g. GWWC evaluator investigations).[5]
The GWWC website currently says it only includes recommendations after reviewing them through their Evaluating Evaluators work, and their evaluation of HLI did not include any quality checks of HLI's work itself nor finalise a conclusion. Similarly, they say: "we don't currently include StrongMinds as one of our recommended programs but you can still donate to it via our donation platform".
Founders Pledge's current website says:
I'm not suggesting at all that they should have done this by now, only ~2 weeks after the Baird RCT results were made public. But I do think three months is a reasonable timeframe for this.
If there was an RCT that showed malaria chemoprevention cost more than $6000 per DALY averted in Nigeria (GDP/capita * 3), rather than per life saved (ballpark), I would want to know about it. And I would want to know about it even if Malaria Consortium decided to drop their work in Nigeria, and EA evaluators continued to recommend Malaria Consortium as a result. And how organisations go about communicating updates like this does impact my personal view on how much I should defer to them wrt charity recommendations.
Of course, based on HLI's current analysis/approach, the ?disappointing/?unsurprising result of this RCT (even if it was on the adult population) would not have meaningfully changed the outcome of the recommendation, even if SM did not make this pivot (pg 66):
And while I think this is a conversation that has already been hashed out enough on the forum, I do think the point stands: potential donors who disagree with or are uncertain about HLI's methodology here would benefit from knowing the results of the RCT, and it's not an unreasonable ask for organisations doing charity evaluations / recommendations to include this information.
Based on Nigeria's GDP/capita * 3
Acknowledging that this is DALYs not WELLBYs! OTOH, this conclusion is not based on the GiveWell or GiveDirectly bar, but on a ~mainstream global health cost-effectiveness standard of ~3x GDP per capita per DALY averted (in this case, the ~$18k USD PPP/DALY averted of SM is well above the ~$7k USD PPP/DALY bar for Uganda, i.e. it fails the test).
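The benchmark arithmetic these footnotes refer to is simple enough to sketch (rounded figures quoted in this thread, not the paper's exact numbers):

```python
# 1-3x GDP-per-capita cost-effectiveness test, using the rounded figures above.
cost_per_daly = 18_000   # ~USD PPP (2019) per DALY averted for IPT-G
bar_per_daly = 7_000     # ~3x Uganda's GDP per capita, USD PPP (2019)
print(cost_per_daly <= bar_per_daly)   # False: the intervention fails the test
```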
Nice one Bruce. I think I agree that it should be communicated like you say for reasons 2 and 3
I don't think this is a good proxy for their main programs though, as this RCT looks at a very different thing from their regular programming. I think other RCTs on group therapy in adult women from the region are better proxies than this study on adolescents.
Why do you think it's a particularly good proxy? In my mind it's the same org doing a different treatment (that seems to work, but only a little and for a short-ish time), albeit with many similarities to their regular treatment, of course.
Like I said a year ago, I would much rather this had been an RCT on Strongminds' regular programs than this one on a very different program for adolescents. I understand though that "does similar group psychotherapy also work for adolescents" is a more interesting question from a researcher's perspective, although less useful for all of us deciding just how good regular StrongMinds group psychotherapy is.
It sounds like you're interpreting my claim to be "the Baird RCT is a particularly good proxy (or possibly even better than other RCTs on group therapy in adult women) for the SM adult programme effectiveness", but this isn't actually my claim here; and while I think one could reasonably make some different, stronger (donor-relevant) claims based on the discussions on the forum and the Baird RCT results, mine are largely just: "it's an important proxy", "it's worth updating on", and "the relevant considerations/updates should be easily accessible on various recommendation pages". I definitely agree that an RCT on the adult programme would have been better for understanding the adult programme.
(I'll probably check out of the thread here for now, but good chatting as always Nick! Hope you're well)
Nice one 100% agree no need to check in again!
Thanks for this Gregory, I think it's an important result and have updated my views. I'm not sure why HLI were so optimistic about this. I have a few comments here.
This study was performed on adolescents, which is not the core group of women that Strong Minds and other group IPT programs treat. This study might update me slightly negatively on the effect of their core programming with groups of older women, but not by much.
As the study said, "this marked the first time SMU (i) delivered therapy to out-of-school adolescent females, (ii) used youth mentors, and (iii) delivered therapy through a partner organization."
This result then doesn't surprise me, as (high uncertainty) I think it's generally harder to move the needle with adolescent mental health than with adults.
The therapy still worked, even though the effect sizes were much smaller than in other studies and the intervention was not cost-effective.
As you've said before, if this kind of truly independent research was done on a lot of interventions, the results might not look nearly as good as the original studies.
I think Strongminds should probably stop their adolescent programs based on this study. Why keep doing it, when your work with adult women currently seems far more cost effective?
Even with the Covid caveat, I'm stunned at the null/negative effect of the cash transfer arm. Interesting stuff, and not sure what to make of it.
I would still love a similar independent study on the regular group IPT programs with older women, and these RCTs should be pretty cheap on the scale of things. I doubt we'll get that though, as it will probably be seen as too similar and not interesting enough for researchers, which is fair enough.
Hi Greg,
Thanks for this post, and for expressing your views on our work. Point by point:
I agree that StrongMinds' own study had a surprisingly large effect size (1.72), which was why we never put much weight on it. Our assessment was based on a meta-analysis of psychotherapy studies in low-income countries, in line with academic best practice of looking at the wider sweep of evidence, rather than relying on a single study. You can see how, in table 2 below, reproduced from our analysis of StrongMinds, StrongMinds' own studies are given relatively little weight in our assessment of the effect size, which we concluded was 0.82 based on the available data. Of course, we'll update our analysis when new evidence appears, and we're particularly interested in the Ozler RCT. However, we think it's preferable to rely on the existing evidence to draw our conclusions, rather than on forecasts of as-yet unpublished work. We are preparing our psychotherapy meta-analysis to submit for academic peer review so it can be independently evaluated but, as you know, academia moves slowly.
We are a young, small team with much to learn, and of course, we'll make mistakes. But I wouldn't characterise these as "grave shortcomings" so much as the typical, necessary, and important back and forth between researchers. A claims P, B disputes P, A replies to B, B replies to A, and so it goes on. Even excellent researchers overlook things: GiveWell notably awarded us a prize for our reanalysis of their deworming research. We've benefitted enormously from the comments we've got from others, and it shows the value of having a range of perspectives and experts. Scientific progress is the result of productive disagreements.
I think it's worth adding that SimonM's critique of StrongMinds did not refer to our meta-analytic work, but focused on concerns about StrongMinds' own study and analysis done outside HLI. As I noted in 1., we share the concerns about the earlier StrongMinds study, which is why we took the meta-analytic approach. Hence, I'm not sure SimonM's analysis told us much, if anything, we hadn't already incorporated. With hindsight, I think we should have communicated far more prominently how small a part StrongMinds' own studies played in our analysis, and been quicker off the mark to reply to SimonM's post (it came out during the Christmas holidays and I didn't want to order the team back to their (virtual) desks). Naturally, if you aren't convinced by our work, you will be sceptical of our recommendations.
You suggest we are engaged in motivated reasoning, setting out to prove what we already wanted to believe. This is a challenging accusation to disprove. The more charitable and, I think, the true explanation is that we had a hunch about something important being missed and set out to do further research. We do complex interdisciplinary work to discover the most cost-effective interventions for improving the world. We have done this in good faith, facing an entrenched and sceptical status quo, with no major institutional backing or funding. Naturally, we won't convince everyone; we're happy the EA research space is a broad church. Yet it's disheartening to see you treat us as acting in bad faith, especially given our fruitful interactions, and we hope that you will continue to engage with us as our work progresses.
Table 2.
Hello Michael,
Thanks for your reply. In turn:
1:
HLI has, in fact, put a lot of weight on the d = 1.72 Strongminds RCT. As table 2 shows, you give a weight of 13% to it, joint highest out of the 5 pieces of direct evidence. As there are ~45 studies in the meta-analytic results, this means this RCT is being given equal or (substantially) greater weight than any other study you include. For similar reasons, the Strongminds phase 2 trial is accorded the third highest weight out of all studies in the analysis.
HLI's analysis explains the rationale behind the weighting as "using an appraisal of its risk of bias and relevance to StrongMinds' present core programme". Yet table 1A notes the quality of the 2020 RCT is "unknown", presumably because Strongminds has "only given the results and some supporting details of the RCT". I don't think it can be reasonable to assign the highest weight to an (as far as I can tell) unpublished, non-peer-reviewed, unregistered study conducted by Strongminds on its own effectiveness, reporting an astonishing effect size, before it has even been read in full. It should be dramatically downweighted or wholly discounted until then, rather than included at face value with a promise HLI will follow up later.
Risk of bias in this field in general is massive: effect sizes commonly melt with improving study quality. Assigning ~40% of a weighted average of effect size to a collection of 5 studies, 4 [actually 3, more later] of which are (marked) outliers in effect size, and 2 of which were conducted by the charity itself, is unreasonable. This can be dramatically demonstrated from HLI's own data:
One thing I didn't notice last time I looked is that HLI did code variables on study quality for the included studies, although none of them seem to be used in any of the published analysis. I have some good news, and some very bad news.
The good news is the first such variable I looked at, ActiveControl, is a significant predictor of greater effect size. Studies with better controls report greater effects (roughly 0.6 versus 0.3). This effect is significant (p = 0.03) although small (10% of the variance) and difficult, at least for me, to explain: I would usually expect worse controls to widen the gap between control and intervention groups, not narrow it. In any case, this marker of study quality definitely does not explain away HLI's findings.
The second variable I looked at was "UnpubOr(pre?)reg".[1] As far as I can tell, coding 1 means something like "the study was publicly registered" and 0 means it wasn't (I'm guessing 0.5 means something intermediate like retrospective registration or similar). In any case, this variable correlates extremely closely (>0.95) with my own coding of whether a study mentions being registered or not, after reviewing all of them myself. If so, using it as a moderator makes devastating reading:[2]
To orientate: in "Model results" the intercept value gives the estimated effect size when the "unpub" variable is zero (as I understand it, ~unregistered studies), so d ~ 1.4 (!) for this set of studies. The row below gives the change in effect if you move from "unpub = 0" to "unpub = 1" (i.e. ~ registered vs. unregistered studies): this drops the effect size by 1, so registered studies give effects of ~0.3. In other words, unregistered and registered studies give dramatically different effects: study registration reduces expected effect size by a factor of 3. [!!!]
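For concreteness, a stripped-down analogue of that moderator analysis: a weighted least-squares meta-regression with a binary "registered" dummy. The real model is a mixed-effects meta-regression (between-study variance is omitted here for brevity), and the data below are invented to mimic the qualitative pattern, not HLI's:

```python
import numpy as np

def meta_regress_binary(d, se, registered):
    """Weighted least-squares regression of effect size on a 0/1 moderator;
    weights are inverse sampling variances."""
    d = np.asarray(d, float)
    se = np.asarray(se, float)
    x = np.asarray(registered, float)
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(1.0 / se**2)
    intercept, slope = np.linalg.solve(X.T @ W @ X, X.T @ W @ d)
    return intercept, slope   # intercept: unregistered; intercept + slope: registered

# Toy data with the qualitative pattern described above:
d          = [1.6, 1.4, 1.3, 1.2, 0.4, 0.35, 0.3, 0.25]
se         = [0.40, 0.35, 0.30, 0.30, 0.15, 0.12, 0.10, 0.10]
registered = [0, 0, 0, 0, 1, 1, 1, 1]
b0, b1 = meta_regress_binary(d, se, registered)
print(f"unregistered: {b0:.2f}, registered: {b0 + b1:.2f}")
```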
The other statistics provided deepen the concern. The included studies have a very high level of heterogeneity (~their effect sizes vary much more than they should by chance). Although HLI attempted to explain this variation with various meta-regressions using features of the intervention, follow-up time, etc., these models left the great bulk of the variation unexplained. Although not like-for-like, here a single indicator of study quality provides a compelling explanation for why effect sizes differ so much: it explains three-quarters of the initial variation.[3]
This is easily seen in a grouped forest plot: the top group is the non-registered studies, the second group the registered ones:
This pattern also perfectly fits the 5 pieces of direct evidence: Bolton 2003 (ES = 1.13), Strongminds RCT (1.72), and Strongminds P2 (1.09) are, as far as I can tell, unregistered. Thurman 2017 (0.09) was registered. Bolton 2007 is also registered, and in fact has an effect size of ~0.5, not 1.79 as HLI reports.[4]
To be clear, I do not think HLI knew of this before I found it out just now. But results like this indicate i) the appraisal of the literature in this analysis is gravely off the mark: study quality provides the best available explanation for why some trials report dramatically higher effects than others; ii) the result of this oversight is a dramatic over-estimation of the likely efficacy of Strongminds (as a ready explanation for the large effects reported in the most "relevant to Strongminds" studies is that these studies were not registered, and thus prone to ~200%+ inflation of effect size); iii) this is a very surprising mistake for a diligent and impartial evaluator to make: one would expect careful assessment of study quality, and very sceptical evaluation where this appears to be lacking, to be foremost, especially given the subfield and prior reporting from Strongminds both heavily underline it. This pattern, alas, will prove repetitive.
I also think a finding like this should prompt an urgent withdrawal of both the analysis and the recommendation pending further assessment. In honesty, if this doesn't, I'm not sure what ever could.
2:
Indeed excellent researchers overlook things, and although I think both the frequency and severity of HLI's mistakes and oversights are less-than-excellent, one could easily attribute this to things like "inexperience", "trying to do a lot in a hurry", "limited staff capacity", and so on.
Yet this cannot account for how starkly asymmetric the impact of these mistakes and oversights is. HLI's mistakes are consistently to Strongminds' benefit rather than its detriment, and while HLI rarely misses a consideration which could enhance the "multiple", it frequently misses causes of concern which undermine both the strength and reliability of this recommendation. HLI's award from GiveWell deepens my concerns here, as it is consistent with a very selective scepticism: HLI can carefully scrutinize charity evaluations by others it wants to beat, but fails to mete out remotely comparable measure to its own work, which it intends for triumph.
I think this can also explain how HLI responds to criticism, which I have found by turns concerning and frustrating. HLI makes some splashy claim (cf. "mission accomplished", "confident recommendation", etc.). Someone else (eventually) takes a closer look, and finds the surprising splashy claim, rather than basically checking out "most reasonable ways you slice it", is highly non-robust, and only follows if HLI slices things heavily in favour of its bottom line in terms of judgement or analysis, the latter of which often has errors which further favour said bottom line. HLI reliably responds, but the tenor of this response is less "scientific discourse" and more "lawyer for the defence": where it can, HLI will too often double down on calls it makes which I aver the typical reasonable spectator would deem at best dubious, and at worst tendentious; where it can't, HLI acknowledges the shortcoming but asserts (again, usually very dubiously) that it isn't that big a deal, so it will deprioritise addressing it versus producing yet more work with the shortcomings familiar from those which came before.
3:
HLI's meta-analysis in no way allays or rebuts the concerns SimonM raised re. Strongminds; indeed, appropriate analysis would enhance many of them. Nor is it the case that the meta-analytic work makes HLI's recommendation robust to shortcomings in the Strongminds-specific evidence; indeed, the cost-effectiveness calculator will robustly recommend Strongminds as superior (commonly, several times superior) to GiveDirectly almost no matter what efficacy results (meta-analytic or otherwise) are fed into it. On each:
a) Meta-analysis could help contextualize the problems SimonM identifies in the Strongminds-specific data. For example, a funnel plot which is less of a "funnel" and more of a ski-slope (i.e. massive small study effects/risk of publication bias), and a contour/p-curve suggestive of p-hacking, would suggest the field's literature needs to be handled with great care. Finding that "Strongminds-relevant" studies and direct evidence are marked outliers even relative to this pathological literature should raise alarm, given this complements the object-level concerns SimonM presented.
This is indeed true, and these features were present in the studies HLI collected, but HLI failed to recognise it. It may never have done so if I hadn't gotten curious and done these analyses myself. Said analysis is (relative to the much more elaborate techniques used in HLI's meta-analysis) simple to conduct: my initial "work" was taking the spreadsheet and plugging it into a webtool out of idle curiosity.[5] Again, this is a significant mistake, adds a directional bias in favour of Strongminds, and is surprising for a diligent and impartial evaluator to make.
b) In general, incorporating meta-analytic results into what is essentially a weighted average alongside direct evidence does not clean either it or the direct evidence of object level shortcomings. If (as here) both are severely compromised, the result remains unreliable.
The particular approach HLI took also doesn't make the finding more robust, as the qualitative bottom line of the cost-effectiveness calculation is insensitive to the meta-analytic result. As-is, the calculator gives Strongminds as roughly 12x better than GiveDirectly.[6] If you set both meta-analytic effect sizes to zero, the calculator gives Strongminds as ~7x better than GiveDirectly. So the five pieces of direct evidence are (apparently) sufficient to conclude SM is an extremely effective charity. Obviously this is, and HLI has previously accepted it is, facially invalid output.
It is not the only example. It is extremely hard for any reduction of the efficacy inputs to the model to give a result that Strongminds is worse than GiveDirectly. If we instead leave the meta-analytic results as they were but set all the effect sizes of the direct evidence to zero (in essence discounting them entirely, which I think is approximately what should have been done from the start), we get ~5x better than GiveDirectly. If we set all the effect sizes of both the meta-analysis and the direct evidence to 0.4 (i.e. the expected effects of registered studies noted before), we get ~6x better than GiveDirectly. If we set the meta-analytic results to 0.4 and all the direct evidence to zero, we get ~3x GiveDirectly. Only when one sets all the effect sizes to 0.1, lower than all but ~three of the studies in the meta-analysis, does one approach equipoise.
This result should not surprise on reflection: the CEA's result is roughly proportional to the ~weighted average of input effect sizes, so an initial finding of "10x GiveDirectly" or similar would require ~a factor of 10 cut to this average to drag it down to equipoise. Yet this "feature" should be seen as a bug: in the same way there should be some non-zero value of the meta-analytic results which would reverse a "many times better than GiveDirectly" finding, there should be some non-tiny value of effect sizes for a psychotherapy intervention (or psychotherapy interventions in general) which results in it not being better than GiveDirectly at all.
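A toy model makes the proportionality problem explicit; the weights and numbers here are stylised stand-ins for illustration, not HLI's actual CEA:

```python
# Hypothetical weighted-average CEA: the multiple vs. GiveDirectly scales
# roughly linearly with the weighted average of the effect-size inputs.
def toy_multiple(meta_effect, direct_effect,
                 w_direct=0.4, baseline_effect=0.5, baseline_multiple=10.0):
    weighted = w_direct * direct_effect + (1 - w_direct) * meta_effect
    return baseline_multiple * weighted / baseline_effect

print(toy_multiple(meta_effect=0.5, direct_effect=1.2))    # large multiple
print(toy_multiple(meta_effect=0.0, direct_effect=1.2))    # meta zeroed: still well above 1x
print(toy_multiple(meta_effect=0.05, direct_effect=0.05))  # only near-zero inputs reach ~1x
```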
This does help explain the somewhat surprising coincidence that the first charity HLI fully assessed would be one it subsequently announced as among the most promising interventions in global health and wellbeing so far found: rather than a discovery from the data, this finding is largely preordained by how the CEA stacks the deck. To be redundant (and repetitive): i) the cost-effectiveness model HLI is making is unfit for purpose, given it can produce these absurd results; ii) this introduces a large bias in favour of Strongminds; iii) it is a very surprising mistake for a diligent and impartial evaluator to make: these problems are not hard to find.
They're even easier for HLI to find once they've been alerted to them. I did alert them, months ago, alongside other problems, and suggested the cost-effectiveness analysis and Strongminds recommendation be withdrawn. Although it should have happened then, perhaps if I repeat myself it might happen now.
4:
Accusations of varying types of bad faith/motivated reasoning/intellectual dishonesty should indeed be made with care: besides the difficulty of determination, pragmatic considerations raise the bar still higher. Yet I think the evidence of HLI having not so much a finger as a fist on the scale throughout its work overwhelms even charitable presumptions made by a saint on its behalf. In footballing terms, I don't think HLI is a player cynically diving to win a penalty, but it is like the manager after the game insisting "their goal was offside, and my player didn't deserve a red, and... (etc.)" - highly inaccurate and highly biased. This is a problem when HLI claims to be an impartial referee, especially when it does things akin to awarding fouls every time a particular player gets tackled.
This is even more of a problem precisely because of the complex and interdisciplinary analysis HLI strives to do. No matter the additional analytic arcana, work like this will be largely Fermi estimates, with variables being plugged in with little more to inform them than intuitive guesswork. The high degree of complexity provides a vast garden of forking paths. Although random errors would tend to cancel out, consistent directional bias in model choice, variable selection, and numerical estimates leads to greatly inflated "bottom lines".
Although the transparency in (e.g.) data is commendable, the complex analysis also makes scrutiny harder. I expect very few have both the expertise and perseverance to carefully vet HLI's analysis themselves; I also expect the vast majority of money HLI has moved has come from those largely taking its results on trust. This trust is ill-placed: HLI's work weathers scrutiny extremely poorly; my experience is very much "the more you see, the worse it looks". I doubt many donors following HLI's advice, if they took a peek behind the curtain, would be happy with what they would discover.
If HLI is falling foul of an entrenched status quo, it is not one of particular presumptions around interventions, nor of philosophical abstracta around population ethics, but rather the expectation that work in this community (whether published elsewhere or not) should be even-handed, intellectually honest and trustworthy in all cases; rigorous and reliable commensurate to its expected consequence; and transparently and fairly communicated. Going against this grain underlies (I suspect) why I am not alone in my concerns, and why HLI has not had the warmest reception. The hope this all changes for the better is not entirely forlorn. But things would have to change a lot, and quickly, and the track record thus far does not spark joy.
Really surprised I missed this last time, to be honest. Especially because it is the only column title in the spreadsheet highlighted in red.
Given I will be making complaints about publication bias, file drawer effects, and garden-of-forking-paths issues later in the show, one might wonder how much of this applies to my own criticism. How much time did I spend dredging through HLI's work looking for something juicy? Is my file drawer stuffed with analyses I hoped would show HLI in a bad light, but actually showed it in a good one, so I don't mention them?
Depressingly, the answers are "not much" and "no" respectively. Regressing against publication registration was the second analysis I did on booting up the data again (regressing on active control was the first, mentioned in text). My file drawer subsequent to this is full of checks and double-checks for alternative (and better for HLI) explanations for the startling result. Specifically, and in order (a minimal sketch of the basic registration check follows after this list):
- I used the no_FU (no follow-ups) data initially for convenience - the full data can include multiple results of the same study at different follow-up points, and these clustered findings are inappropriate to ignore in a simple random effects model. So I checked both ways: doing this anyway, and then using a multi-level model to appropriately handle this structure of the data. No change to the key finding.
- Worried that (somehow) I was messing up or misinterpreting the meta-regression, I (re)constructed a simple forest plot of all the studies, and confirmed that the unregistered ones were indeed visibly off to the right. I then grouped a forest plot by the registration variable to ensure it closely agreed with the meta-regression (in main text). It does.
- I then checked the first 10 studies coded by the variable I think is trial registration, to check the registration status of those studies matched the codes. Although all fit, I thought the residual risk that I was misunderstanding the variable was unacceptably high for a result significant enough to warrant a retraction demand. So I checked and coded all 46 studies by "registered or not?" to make sure this agreed with my presumptive interpretation of the variable (in text). It does.
- Adding multiple variables to explain an effect geometrically expands researcher degrees of freedom, thus any unprincipled ad hoc investigation by adding or removing them has very high false discovery rates (I suspect this is a major problem with HLI's own meta-regression work, but compared to everything else it merits only a passing mention here). But I wanted to check whether I could find ways (even if unprincipled and ad hoc) to attenuate a result as stark as "unregistered studies have 3x the effect size of registered ones".
- I first tried to replicate HLI's meta-regression work (exponential transformations and all) to see if the registration effect would be attenuated by intervention variables. Unfortunately, I was unable to replicate HLI's regression results from the information provided (perhaps my fault). In any case, simpler versions I constructed did not give evidence for this.
- I also tried throwing in permutations of IPT-or-not (these studies tend to be unregistered, maybe this is the real cause of the effect?), active control-or-not (given it had a positive effect size, maybe it cancels out registration?) and study standard error (a proxy - albeit a controversial one - for study size/precision/quality, so if registration was confounded by it, this slightly challenges interpretation). The worst result across all the variations I tried was to drop the effect size of registration by 20% (~ -1 to -0.8), typically via substitution with SE. Omitted variable bias and multiple comparisons mean any further interpretation would be treacherous, but insofar as it provides further support: adding in more proxies for study quality increases explanatory power, and tends to even greater absolute and relative drops in effect size comparing "highest" versus "lowest" quality studies.
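As referenced above, here is a minimal sketch of the basic registration check: an inverse-variance-weighted meta-regression with a 0/1 "registered" moderator. The numbers are made up for illustration (this is not HLI's dataset), it assumes statsmodels is available, and for simplicity it ignores between-study heterogeneity (no tau-squared term):

```python
import numpy as np
import statsmodels.api as sm

# Made-up study-level data (not HLI's dataset): effect size, its standard error,
# and whether the trial was prospectively registered.
effect     = np.array([0.15, 0.20, 0.10, 0.25, 0.30, 0.90, 1.10, 0.75, 1.30, 0.95])
se         = np.array([0.08, 0.10, 0.07, 0.12, 0.09, 0.25, 0.30, 0.22, 0.35, 0.28])
registered = np.array([1,    1,    1,    1,    1,    0,    0,    0,    0,    0])

# Inverse-variance-weighted meta-regression: effect ~ intercept + registered.
X = sm.add_constant(registered)
fit = sm.WLS(effect, X, weights=1 / se**2).fit()

# params[0] ~ mean effect of unregistered studies; params[1] ~ change for registered
# studies (strongly negative here = registered studies report much smaller effects).
print(fit.params)
print(fit.pvalues)
```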
That said, the effect size is so dramatic as to be essentially immune to file-drawer worries. Even if I had a hundred null results I forgot to mention, this finding would survive a Bonferroni correction.
Obviously "is the study registered or not?" is a crude indicator of overall quality. Typically, one would expect better measurement (perhaps by including further proxies for underlying study quality) to further increase the explanatory power of this factor. In other words, although these results look really bad, in reality things are likely to be even worse.
HLI's write-up on Bolton 2007 links to this paper (I did double-check to make sure there wasn't another Bolton et al. 2007 which could have been confused with this - no other match I could find). It has a sample size of 314, not 31 as HLI reports - I presume a data entry error, although it is less than reassuring that this erroneous figure is repeated and subsequently discussed in the text as part of the appraisal of the study: one reason given for weighing it so lightly is its "very small" sample size.
Speaking of erroneous figures, here's the table of results from this study:
I see no way to arrive at an effect size of d = 1.79 from these numbers. The right comparison should surely be the pre-post difference of GIP versus control in the intention-to-treat analysis. These numbers give a Cohen's d of ~0.5.
I don't think any other reasonable comparison gets much higher numbers, and definitely not >3x higher numbers - the differences between any of the groups are lower than the standard deviations, so estimates like Cohen's d should be bounded below 1.
[Re. file drawer, I guess this counts as a spot check (this is the only study whose data extraction I carefully checked), but not a random one: I did indeed look at this study in particular because it didn't match the "only unregistered studies report crazy-high effects" pattern - an ES of 1.79 is ~2x any other registered study.]
Re. my worries of selective scepticism, HLI did apply these methods in their meta-analysis of cash transfers, where no statistical suggestion of publication bias or p-hacking was evident.
This does depend a bit on whether spillover effects are being accounted for. This seems to cut the multiple by ~20%, but doesn't change the qualitative problems with the CEA. Happy to calculate precisely if someone insists.
Hello Gregory. With apologies, I'm going to pre-commit to making this my last reply to you on this post. This thread has been very costly in terms of my time and mental health, and your points below are, as far as I can tell, largely restatements of your earlier ones. As briefly as I can, and point by point again.
1.
A casual reader looking at your original comment might mistakenly conclude that we only used StrongMinds' own study, and no other data, for our evaluation. Our point was that SM's own work has relatively little weight, and we rely on many other sources. At this point, your argument seems rather "motte-and-bailey". I would agree with you that there are different ways to do a meta-analysis (your point 3), and we plan to publish our new psychotherapy meta-analysis in due course so that it can be reviewed.
2.
Here, you are restating your prior suggestions that HLI should be presumed to be acting in bad faith. Your claim is that HLI is good at spotting errors in others' work, but not its own. But there is an obvious explanation about "survivorship" effects. If you spot errors in your own research, you strip them out. Hence, by the time you publish, you've found all the ones you're going to find. This is why peer review is important: external reviewers will spot the errors that authors have missed themselves. Hence, there's nothing odd about having errors in your own work but also finding them in others'. This is the normal stuff of academia!
3.
I'm afraid I don't understand your complaint. I think your point is that "any way you slice the meta-analysis, psychotherapy looks more cost-effective than cash transfers", but then you conclude this shows the meta-analysis must be wrong, rather than that it's sensible to conclude psychotherapy is better. You're right that you would have to deflate all the effect sizes by a large proportion to reverse the result. This should give you confidence in psychotherapy being better! It's worth pointing out that if psychotherapy costs about $150pp, but cash transfers cost about $1,100pp ($1,000 transfer + delivery costs), therapy will be more cost-effective per intervention unless its per-intervention effect is much smaller.
The explanation behind finding a new charity on our first go is not complicated or sinister. In earlier work, including my PhD, I had suggested that, on a SWB analysis, mental health was likely to be relatively neglected compared to status quo prioritisation methods. I explained this in terms of the existing psychological literature on affective forecasting errors: we're not very good at imagining internal suffering, we probably overstate the badness of material circumstances due to focusing illusions, and our forecasts don't account for hedonic adaptation (which doesn't occur for mental health). So the simple explanation is that we were "digging" where we thought we were most likely to find "altruistic gold", which seems sensible given limited resources.
4.
As much as I enjoyed your football analogies, here too you're restating, rather than further substantiating, your earlier accusations. You seem to conclude, from the fact that you found some problems with HLI's analysis, that HLI - but only HLI - should be distrusted, while we retain our confidence in all the other charity evaluators. This seems unwarranted. Why not conclude you would find mistakes elsewhere too? I am reminded of the expression, "if you knew how the sausage was made, you wouldn't want to eat the sausage". What I think is true is that HLI is a second-generation charity evaluator, we are aiming to be extremely transparent, and we are proposing novel priorities. As a result, I think we have come in for a far higher level of public scrutiny than others have, so more of our errors have been found, but I don't know that we have made more and worse errors. Quite possibly, where errors have been noticed in others' work, they have been quietly and privately identified, and corrected with less fanfare.
Props on the clear and gracious reply.
I sense this is wrong: if I think the unpublished work will change my conclusions a lot, I change my conclusions some of the way now, though I understand that's a weird thing to do and perhaps hard to justify. Nonetheless I think it's the right move.
Could you say a bit more about what you mean by "should not have maintained once they were made aware of them" in point 2? As you characterize below, this is an org "making a funding request in a financially precarious position," and in that context I think it's even more important than usual to be clear about how HLI has "maintained" its "mistakes" "once they were made aware of them." Furthermore, I think the claim that HLI has "maintained" them is an important crux for your final point.
Example: I do not like that HLI's main donor advice page lists the 77 WELLBY per $1,000 estimate with only a very brief and neutral statement that "Note: we plan to update our analysis of StrongMinds by the end of 2023." There is a known substantial, near-typographical error underlying that analysis:
While I'm sympathetic to HLI's small size and desire to produce a more comprehensive updated analysis, I don't think it's appropriate to be quoting numbers from an unpatched version of the CEA over four months after the error was discovered. (I'd be somewhat more flexible if this were based on new information rather than HLI's coding error, and/or if the difference didn't flip the recommendation for a decent percentage of would-be donors: deprivationists who believe the neutral point is less than 1.56 or so.)
Hello Jason,
With apologies for the delay. I agree with you that I am asserting HLI's mistakes have further "aggravating factors", which I also assert invite highly adverse inference. I had hoped the links I provided offered clear substantiation, but demonstrably not (my bad). Hopefully my reply to Michael makes them somewhat clearer, but in case not, I give a couple of examples below with as good an explanation as I can muster.
I will also be linking and quoting extensively from the Cochrane handbook for systematic reviews - so hopefully, even if my attempt to clearly explain the issues fails, a reader can satisfy themselves that my view on them agrees with expert consensus. (Rather than, say, "Cantankerous critic with idiosyncratic statistical tastes flexing his expertise to browbeat the laity into acquiescence".)
0) Per your remarks, there are various background issues around reasonableness, materiality, timeliness, etc. I think my views basically agree with yours. In essence: I think HLI is significantly "on the hook" for work (such as the meta-analysis) it relies upon to make recommendations to donors - who will likely be taking HLI's representations on its results and reliability (cf. HLI's remarks about its "academic research", "rigour", etc.) on trust. Discoveries which threaten the "bottom line numbers" or the overall reliability of this work should be addressed with urgency and robustness appropriate to their gravity. "We'll put checking this on our to-do list" seems fine for an analytic choice which might be dubious but of unclear direction and small expected magnitude. As you say, a typo which, when corrected, reduces the bottom-line efficacy by ~20% should be fixed promptly.
The two problems I outlined 6 months ago should each have prompted withdrawal/suspension of both the work and the recommendation unless and until they were corrected.[1] Instead, HLI has not made appropriate corrections, and persists in misdirecting donations and misrepresenting the quality of its research on the basis of work it has partly acknowledged (and which reasonable practitioners would overwhelmingly concur) was gravely compromised.[2]
1.0) Publication bias / small study effects
It is commonplace in the literature for smaller studies to show different (typically larger) effect sizes than large studies. This is typically attributed to a mix of factors which differentially inflate effect size in smaller studies (see), perhaps the main one being publication bias: although big studies are likely to be published "either way", investigators may not finish (or journals may not publish) smaller studies reporting negative results.
It is extremely well recognised that these effects can threaten the validity of meta-analysis results. If you are producing something (very roughly) like an "average effect size" from your included studies, the studies being selected for showing a positive effect means the average is inflated upwards. This bias is very difficult to reliably adjust for or "patch" (more later), but it can easily be large enough to mean "Actually, the treatment has no effect, and your meta-analysis is basically summarizing methodological errors throughout the literature".
Hence why most work on this topic stresses the importance of arduous efforts in prevention (e.g. trying really hard to find "unpublished" studies) and diagnosis (i.e. carefully checking for statistical evidence of this problem) rather than "cure" (see e.g.). If a carefully conducted analysis nonetheless finds stark small study effects, this - rather than the supposed ~"average" effect - would typically be (and should definitely be) the main finding: "The literature is a complete mess - more, and much better, research needed".
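A toy simulation (illustrative only, not calibrated to this literature) of why selective publication inflates a naive pooled estimate: with a true effect of exactly zero, publishing only the studies that happen to report a significant positive result yields a pooled "average" comfortably above zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 0.0
published_d, published_se = [], []

for _ in range(2000):
    n = rng.integers(20, 200)              # per-arm sample size of this (simulated) study
    se = np.sqrt(2 / n)                    # rough standard error of Cohen's d
    d = rng.normal(true_effect, se)        # the effect the study happens to observe
    p = 2 * stats.norm.sf(abs(d) / se)
    if p < 0.05 and d > 0:                 # only "positive and significant" results get written up
        published_d.append(d)
        published_se.append(se)

d, se = np.array(published_d), np.array(published_se)
pooled = np.average(d, weights=1 / se**2)  # naive inverse-variance pooled estimate
print(f"pooled effect of the published studies: {pooled:.2f}")  # well above the true effect of 0
```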
As in many statistical matters, a basic look at your data can point you in the right direction. For meta-analysis, the standard such look is a forest plot:
To orientate: each row is a study (presented in order of increasing effect size), and the horizontal scale is effect size (where further to the right = greater effect size favouring the intervention). The horizontal bar for each study gives the confidence interval for the effect size, with the middle square marking the central estimate (also given in the rightmost column). The diamond right at the bottom is the pooled effect size - the (~~)[3] average effect across studies mentioned earlier.
Here, the studies are all over the map, many of which do not overlap with one another, nor with the pooled effect size estimate. In essence, dramatic heterogeneity: the studies are reporting very different effect sizes from one another. Heterogeneity is basically a fact of life in meta-analysis, but a forest plot like this invites curiosity (or concern) about why effects are varying quite this much. [I'm going to skip discussion of formal statistical tests/metrics for things like this for clarity - you can safely assume a) yes, you can provide a more rigorous statistical assessment of "how much" besides "eyeballing it", although visually obvious things are highly informative; b) the things I mention you can see are indeed (highly) statistically significant, etc. etc.]
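For readers who want to see how such a plot is built, a minimal matplotlib sketch with made-up effect sizes and confidence intervals (not the actual studies in HLI's meta-analysis) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up effect sizes and 95% CIs for a handful of studies, ordered by effect size.
labels = ["Study A", "Study B", "Study C", "Study D", "Study E"]
effect = np.array([0.05, 0.30, 0.55, 1.10, 1.80])
lower  = np.array([-0.10, 0.10, 0.25, 0.60, 1.00])
upper  = np.array([0.20, 0.50, 0.85, 1.60, 2.60])
pooled, pooled_lo, pooled_hi = 0.50, 0.35, 0.65   # pooled estimate and its CI

y = np.arange(len(labels), 0, -1)                 # one row per study, pooled estimate at y = 0
plt.errorbar(effect, y, xerr=[effect - lower, upper - effect], fmt="s", color="black")
plt.scatter([pooled], [0], marker="D", s=80, color="black")   # diamond = pooled effect
plt.hlines(0, pooled_lo, pooled_hi, colors="black")
plt.yticks(list(y) + [0], labels + ["Pooled"])
plt.axvline(0, linestyle=":", color="grey")                   # line of no effect
plt.xlabel("Effect size (right = favours intervention)")
plt.tight_layout()
plt.show()
```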
There are some hints from this forest plot that small study effects could have a role to play. Although very noisy, larger studies (those with narrower horizontal lines, because bigger study ~ less uncertainty in effect size) tend to be higher up the plot and have smaller effects. There is another plot designed to look at this better - a funnel plot.
To orientate: each study is now a point on a scatterplot, with effect size again on the x-axis (right = greater effect). The y-axis is now the standard error: bigger studies have greater precision, and so lower sampling error, so are plotted higher on the y-axis. Each point is a single study - all being well, the scatter should look like a (symmetrical) triangle or funnel like those drawn on the plot.
All is not well here. The scatter is clearly asymmetric and sloping to the right - smaller studies (towards the bottom of the graph) tend towards greater effect sizes. The lines drawn on the plot make this even clearer. Briefly:
The leftmost "funnel" with shaded wings is centered on an effect size of zero (i.e. zero effect). The white middle triangle covers findings which would give a p-value of > 0.05, and the shaded wings correspond to a p-value between 0.05 ("statistically significant") and 0.01: it is an upward-pointing triangle because bigger studies can detect smaller differences from zero as "statistically significant" than smaller ones can. There appears to be clustering in the shaded region, suggestive that studies may be being tweaked to get them "across the threshold" of statistically significant effects.
The rightmost "funnel" without shading is centered on the pooled effect estimate (0.5). Within the triangle is where you would expect 95% of the scatter of studies to fall in the absence of heterogeneity (i.e. if there were just one true effect size, and the studies varied from it only thanks to sampling error). Around half are outside this region.
The red dashed line is the best-fit line through the scatter of studies. If there weren't small study effects, it would be basically vertical. Instead, it slopes off heavily to the right.
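That red dashed line is essentially an Egger-type (PET-style) regression of effect size on standard error. A minimal sketch with made-up numbers (assuming statsmodels is available) of how such a line is fitted, and what it predicts for a hypothetical study with SE = 0:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data with a visible small-study effect: higher-SE (smaller) studies
# report systematically larger effects.
se     = np.array([0.05, 0.08, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35])
effect = np.array([0.18, 0.30, 0.36, 0.52, 0.63, 0.82, 0.93, 1.12])

# Egger/PET-style regression: effect ~ intercept + slope * SE, weighted by inverse
# variance. The fitted line is the analogue of the red dashed line in the funnel plot.
X = sm.add_constant(se)
fit = sm.WLS(effect, X, weights=1 / se**2).fit()
intercept, slope = fit.params

print(f"slope (funnel asymmetry):   {slope:.2f}")      # far from 0 = strong small-study effect
print(f"predicted effect at SE = 0: {intercept:.2f}")   # the 'infinitely large study' estimate,
                                                        # much smaller than the naive pooled average
```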
Although a very asymmetric funnel plot is not proof positive of publication bias, findings like this demand careful investigation and cautious interpretation (see generally). It is challenging to assess exactly "how big a deal is it, though?": statistical adjustment for biases in the original data is extremely fraught.
But we are comfortably in "big deal" territory: this finding credibly up-ends HLI's entire analysis:
a) There are different ways of getting a "pooled estimate" (~~average, or ~~typical effect size): random effects (where you assume the true effect is a distribution of effects from which each study samples), vs. fixed effects (where there is a single value for the true effect size). Random effects are commonly preferred as - in reality - one expects the true effect to vary, but the results are much more vulnerable to any small study effects/publication bias (see generally). Comparing the random effect vs. the fixed effect estimate can give a quantitative steer on the possible scale of the problem, as well as guide subsequent analysis (a numerical sketch of this comparison follows after point c).[4] Here, the random effect estimate is 0.52, whilst the fixed one is less than half the size: 0.18.
b) There are other statistical methods you could use (more later). One of the easier to understand (but one of the most conservative) goes back to the red dashed line in the funnel plot. You could extrapolate from it to the point where standard error = 0: the predicted effect of an infinitely large (so infinitely precise) study - and so also where the "small study effect" is zero. There are a few different variants of these sorts of "regression methods", but the ones I tried predict effect sizes for such a hypothetical study between 0.17 and 0.05. So, quantitatively, 70-90% cuts of effect size are on the table here.
c) One reason regression methods are conservative is that they will attribute as much variation in reported results as possible to differences in study size. Yet there could be alternative explanations for this besides publication bias: maybe smaller studies have different patient populations with (genuinely) greater efficacy, etc.
However, this statistical confounding can go the other way. HLI is not presenting simple meta-analytic results, but rather meta-regressions: the differences in reported effect sizes are being predicted by differences between and within the studies (e.g. follow-up time, how much therapy was provided, etc.). One of HLI's findings from this work is that psychotherapy with Strongminds-like traits is ~70% more effective than psychotherapy in general (0.8 vs. 0.46). If this is because factors like "group or individual therapy" correlate with study size, the real story could simply be: "Strongminds-like traits are indicators for methodological weaknesses which greatly inflate true effect size, rather than for a more effective therapeutic modality." In HLI's analysis, the latter is presumed, giving about a ~10% uplift to the bottom-line results.[5]
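As flagged in point (a), here is a numerical sketch of the fixed-effect vs. random-effects comparison, using the same made-up data as the funnel-plot sketch above (so not HLI's actual numbers): under small-study effects, the random-effects estimate sits well above the fixed-effect one.

```python
import numpy as np

def pool(effect, se):
    """Fixed-effect and DerSimonian-Laird random-effects pooled estimates."""
    w = 1 / se**2
    fixed = np.sum(w * effect) / np.sum(w)

    # DerSimonian-Laird estimate of between-study variance tau^2
    q = np.sum(w * (effect - fixed) ** 2)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(effect) - 1)) / c)

    w_re = 1 / (se**2 + tau2)
    random_eff = np.sum(w_re * effect) / np.sum(w_re)
    return fixed, random_eff

# Same made-up data as the funnel-plot sketch above (strong small-study effect).
se     = np.array([0.05, 0.08, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35])
effect = np.array([0.18, 0.30, 0.36, 0.52, 0.63, 0.82, 0.93, 1.12])

fixed, random_eff = pool(effect, se)
print(f"fixed-effect pooled estimate:   {fixed:.2f}")      # dominated by the big, precise studies
print(f"random-effects pooled estimate: {random_eff:.2f}") # pulled upwards by the small, extreme studies
```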
1.2) A major issue, and a major mistake to miss
So this is a big issue, and one that would be revealed by standard approaches. HLI instead used a very non-standard approach (see), novel - as far as I can tell - to existing practice and, unfortunately, inappropriate (cf., point 5): it gives ~ a 10-15% discount (although I'm not sure this has been used in the Strongminds assessment, it is in the psychotherapy one).
I came across these problems ~6 months ago, prompted by a question from Ryan Briggs (someone with considerably greater expertise than my own) asking after the forest and funnel plots. I also started digging into the data in general at the same time, and noted the same key points explained laboriously above: there looks to be marked heterogeneity and small study effects, they look very big, and they call the analysis results into question. Long story short, they said they would take a look at it urgently then report back.
This response is fine, but as my comments then indicated, I did have (and I think reasonably had) HLI on pretty thin ice/"epistemic probation" after finding these things out. You have to make a lot of odd choices to end up this far from normal practice, and nonetheless make some surprising oversights too, to end up missing problems which would appear to greatly undermine a positive finding for Strongminds.[6]
1.3) Maintaining this major mistake
HLI fell through this thin ice after its follow-up. Their approach was to try a bunch of statistical techniques to adjust for publication bias (excellent technique), do the same for their cash transfers meta-analysis (sure), then use the relative discounts between them to get an adjustment for psychotherapy vs. cash transfers (good, esp. as adding directly into the multi-level meta-regressions would be difficult). Further, they provided full code and data for replication (great). But the results made no sense whatsoever:
To orientate: each row is a different statistical technique applied to the two meta-analyses (more later). The x-axis is the "multiple" of Strongminds vs. cash transfers, and the black line is at 9.4x, the previous "status quo value". Bars shorter than this mean adjusting for publication bias results in an overall discount for Strongminds, and vice versa.
The cash transfers funnel plot looks like this:
Compared to the psychotherapy one, it basically looks fine: the scatter looks roughly like a funnel, and there is no massive trend towards smaller studies = bigger effects. So how could so many statistical methods discount the "obvious small study effect" meta-analysis less than the "no apparent small study effect" meta-analysis, giving an increased multiple? As I said at the time, the results look like nonsense to the naked eye.
One problem was a coding error in two of the statistical methods (blue and pink bars). The bigger problem is that the way the comparisons are being done is highly misleading.
Take a step back from all the dividing going on to just look at the effect sizes. The basic, nothing-fancy random effects model applied to the psychotherapy data gives an effect size of 0.5. If you take the average across all the other model variants, you get ~0.3, a 40% drop. For the cash transfers meta-analysis, the basic model gives 0.1, and the average of all the other models is ~0.09, a 10% drop. So in fact you are seeing - as you should - bigger discounts when adjusting the psychotherapy analysis than the cash transfers meta-analysis. This is lost by how the divisions are being done, which largely "play off" multiple adjustments against one another (see, pt. 2). What the graph should look like is this:
Two things are notable: 1) the different models tend to point to a significant drop (~30-40% on average) in effect size; 2) there is a lot of variation in the discount - from ~0 to ~90% (a visual illustration of why this is known to be very hard to reliably "adjust" for). I think these results oblige something like the following:
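To make the arithmetic concrete (a back-of-envelope illustration using the rounded figures above, not HLI's or my exact calculation): the discount that should flow through to the "times better than cash" multiple is the ratio of the two adjustment factors, not a play-off of separate divisions.

```python
# Rounded figures from the text above (illustrative back-of-envelope only).
psych_basic, psych_adjusted = 0.50, 0.30   # ~40% drop after publication-bias adjustment
cash_basic,  cash_adjusted  = 0.10, 0.09   # ~10% drop
old_multiple = 9.4

# The relative discount that should flow through to the "times better than cash" multiple
# is the ratio of the two adjustment factors.
relative_discount = (psych_adjusted / psych_basic) / (cash_adjusted / cash_basic)
print(f"relative discount: {relative_discount:.2f}")                   # ~0.67, i.e. roughly a third off
print(f"adjusted multiple: {old_multiple * relative_discount:.1f}x")   # ~6.3x, not something above 9.4x
```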
Re. the write-up: at least include the forest and funnel plots, alongside a description of why they are concerning. It should also include some "best guess" correction from the above, noting this has a (very) wide range. Probably warrants "back to the drawing board" given the reliability issues.
Re. the overall recommendation: at least a very heavy asterisk placed beside the recommendation. It should also highlight both the adjustment and the uncertainty in front-facing materials (e.g. "tentative suggestion" vs. "recommendation"). Probably warrants withdrawal.
Re. general reflection: I think a reasonable evaluator - beyond directional effects - would be concerned about the "near"(?) miss property of having a major material issue not spotted before pushing a strong recommendation, "phase 1 complete/mission accomplished", etc. - especially when it is found by a third party many months after initial publication. They might also be concerned about the direction of travel. When published, the multiplier was 12x; with spillovers, it falls to 9.5x; with spillovers and the typo corrected, it falls to 7.5x; with a 30% best-guess correction for publication bias, we're now at 5.3x. Maybe no single adjustment is recommendation-reversing, but in concert they are, and the track record suggests the next one is more likely to be further down rather than back up.
What happened instead, 5 months ago, was that HLI would read some more and discuss among themselves whether my take on the comparators was the right one (I am right, and this is not reasonably controversial, e.g. 1, 2, cf. fn4). Although looking at publication bias is part of their intended "refining" of the Strongminds assessment, there's been nothing concrete done yet.
Maybe I should have chased, but the exchange on this (alongside the other thing) made me lose faith that HLI was capable of reasonably assessing and appropriately responding to criticisms of their work when material to their bottom line.
2) The cost effectiveness guestimate.
[Readers will be relieved ~no tricky stats here]
As I was looking at the meta-analysis, I added my attempt at "adjusted" effect sizes of the same into the CEA to see what impact they had on the results. To my surprise, not very much. Hence my previous examples about "Even if the meta-analysis has zero effect the CEA still recommends Strongminds as several times GD", and "You only get to equipoise with GD if you set all the effect sizes in the CEA to near-zero."
I noted this alongside my discussion of the meta-analysis 6 months ago. Earlier remarks from HLI suggested they accepted these were diagnostic of something going wrong with how the CEA is aggregating information (but that fixing it would be done, though not as a priority); more recent ones suggest more "doubling down".
In any case, they are indeed diagnostic of a lack of face validity. You would obviously, in fact, be highly sceptical that a particular psychotherapy intervention was extremely effective if the meta-analysis of psychotherapy in general showed zero (or harmful!) effects. The (pseudo-)Bayesian gloss on why is that the distribution of reported effect sizes gives additional information on the likely size of the "real" effects underlying them (cf. the heterogeneity discussed above). A bunch of weird discrepancies among them, if hard to explain by intervention characteristics, increases the suspicion that weird distortions, rather than true effects, underlie the observations. So increasing discrepancy between indirect and direct evidence should reduce the effect size beyond its impact on any weighted average.
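A toy version of that (pseudo-)Bayesian gloss, with purely illustrative numbers (not HLI's actual Bayesian method): treat the wider, bias-adjusted literature as a prior and the Strongminds-specific evidence as the likelihood in a simple normal-normal update. The posterior is pulled well below the direct estimate, and the pull is stronger the less trustworthy - effectively, the noisier - the direct evidence is taken to be.

```python
import numpy as np

def posterior(prior_mean, prior_sd, direct_mean, direct_sd):
    """Normal-normal conjugate update: a precision-weighted average of prior and data."""
    w_prior, w_direct = 1 / prior_sd**2, 1 / direct_sd**2
    mean = (w_prior * prior_mean + w_direct * direct_mean) / (w_prior + w_direct)
    sd = np.sqrt(1 / (w_prior + w_direct))
    return mean, sd

# Illustrative numbers only: a modest prior from the wider (bias-adjusted) literature,
# versus a much larger direct estimate from the Strongminds-specific studies.
prior, direct = (0.2, 0.15), (0.9, 0.25)

print(posterior(*prior, *direct))          # posterior mean ~0.39: pulled well below the direct 0.9
# If the direct evidence looks methodologically suspect, its effective SD should be
# inflated, which pulls the posterior even closer to the modest prior:
print(posterior(*prior, direct[0], 0.5))   # posterior mean ~0.26
```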
It does not help that the findings as-is are highly discrepant and generally weird. Among many examples:
Why are the Strongminds-like trials in the direct evidence showing among the greatest effects of any of the studies included - and ~1.5x-2x the effect of a regression prediction for studies with Strongminds-like traits?
Why are the most Strongminds-y studies included in the meta-analysis marked outliers - even after "correction" for small study effects?
What happened between the original Strongminds Phase 2 and the Strongminds RCT to raise the intervention efficacy by 80%?
How come the only study which compares psychotherapy to a cash transfer comparator is also the only study which gives a negative effect size?
I don't know what the magnitude of the directional "adjustment" would be, as this relies on specific understanding of the likelier explanations for the odd results (I'd guess a 10%+ downward correction assuming I'm wrong about everything else - obviously, much more if indeed "the vast bulk of effect variation can be explained by sample size +/- registration status of the study"). Alone, I think it mainly points to the quantitative engine needing an overhaul and the analysis being known-unreliable until it gets one.
In any case, it seems urgent and important to understand and fix. The numbers are being widely used and relied upon (probably all of them need at least a big public asterisk pending development of a more reliable technique). It seems particularly unwise to be reassured by "Well sure, this is a downward correction, but the CEA still gives a good bottom line multiple", as the bottom line number may not be reasonable, especially conditioned on different inputs. Even more so to persist in doing so 6 months after being made aware of the problem.
These are mentioned in 3a and 3b of my reply to Michael. Point 1 there (kind of related to 3a) would on its own warrant immediate retraction, but that is not a case (yet) of "maintained" error.
So in terms of "epistemic probation", I think this was available 6 months ago, but closed after flagrant and ongoing "violations".
One quote from the Cochrane handbook feels particularly apposite:
Cochrane
This is not the only problem in HLI's meta-regression analysis. Analyses here should be pre-specified (especially if intended as the primary result rather than some secondary exploratory analysis), to limit the risk of inadvertently cherry-picking a model which gives a preferred result. Cochrane (see):
HLI does not mention any pre-specification, and there is good circumstantial evidence of a lot of this work being ad hoc re. "Strongminds-like traits". HLI's earlier analysis of psychotherapy in general, using most (?all) of the same studies as in their Strongminds CEA (4.2, here), had different variables used in a meta-regression on intervention properties (table 2). It seems likely the change of model happened after study data was extracted (the lack of significant prediction and the inclusion of a large number of variables for a relatively small number of studies would be further concerns). This modification seems to favour the intervention: I think the earlier model, if applied to Strongminds, gives an effect size of ~0.6.
Briggs' comments have a similar theme, suggesting that my attitude does not arise solely from particular cynicism on my part.
I really appreciate you putting in the work and being so diligent, Gregory. I did very little here, though I appreciate your kind words. Without you seriously digging in, we'd have a very distorted picture of this important area.
Hello Jason. FWIW, I've drafted a reply to your other comment and I'm getting it checked internally before I post it.
On this comment, about you not liking that we hadn't updated our website to include the new numbers: we all agree with you! It's a reasonable complaint. The explanation is fairly boring: we have been working on a new charity recommendations page for the website, at which point we were going to update the numbers and add a note, so we could do it all in one go. (We still plan to do a bigger reanalysis later this year.) However, that has gone slower than expected and hasn't happened yet. Because of your comment, we'll add a "hot fix" update in the next week, and hopefully have the new charity recommendations page live in a couple of weeks.
I think we'd have moved faster on this if it had substantially changed the results. On our numbers, StrongMinds is still the best life-improving intervention (it's several times better than cash, and we're not confident deworming has a long-term effect). You're right that it would slightly change the crossover point for choosing between life-saving and life-improving interventions, but we've got the impression that donors weren't making much use of our analysis anyway; even if they were, it's a pretty small difference, and well within the margin of uncertainty.
Thanks, I appreciate that.
(Looking back at the comment, I see the example actually ended up taking more space than the lead point! Although I definitely agree that the hot fix should happen, I hope the example didn't overshadow the comment's main intended point - that people who have concerns about HLI's response to recent criticisms should raise their concerns with a degree of specificity, and explain why they have those concerns, to allow HLI an opportunity to address them.)
Oh yes. I agree with you that it would be good if people could make helpful suggestions as to what we could do, rather than just criticise.
Meta-note as a casual lurker in this thread: this comment being downvoted to oblivion while Jason's comment is not is pretty bizarre to me. The only explanation I can think of is that people who have provided criticism think Michael is saying they shouldn't criticise? It is blatantly obvious to me that this is not what he is saying: he is simply agreeing with Jason that specific, actionable criticism is better.
Fun meta-meta note I just realized after writing the above: this does mean I am potentially criticising some critics who are critical of how Michael is criticising their criticism.
Okkkk, that's enough internet for me. Peace and love, y'all.
Michael's comment has 14 non-author up/downvotes and 10 non-author agree/disagree votes; mine has one of each. This is possibly due to the potential to ascribe to a comment by HLI's director several meanings that are not plausible for a comment by a disinterested observer - e.g., "Org expresses openness to changes to address concerns," "Org is critical of critics," etc.
I'm not endorsing any potential meaning, although I have an upvote on his comment.
The more disappointing meta-note to me is that helpful, concrete suggestions have been relatively sparse on this post as a whole. I wrote some suggestions for future epistemic practices, and someone else called for withdrawing the SM recommendation and report. But overall, there seemed to be much more energy invested in litigating than in figuring out a path forward.
I don't really share this sense (I think that even most of Gregory Lewis' posts in this thread have had concretely useful advice for HLI, e.g. this one), but let's suppose for the moment that it's true. Should we care?
In the last round of posts, four to six months ago, HLI got plenty of concrete and helpful suggestions. A lot of them were unpleasant, stuff like "you should withdraw your cost-effectiveness analysis" and "here are ~10 easy-to-catch problems with the stats you published", but highly specific and actionable. What came of that? What improvements has HLI made? As far as I can tell, almost nothing has changed, and they're still fundraising off of the same flawed analyses. There wasn't even any movement on this unambiguous blunder until you called it out. It seems to me that giving helpful, concrete suggestions to HLI has been tried, and shown to be low impact.
One thing people can do in a thread like this one is talk to HLI: to praise them, ask them questions, or try to get them to do things differently. But another thing they can do is talk to each other, to try and figure out whether they should donate to HLI or not. For that, criticism of HLI is valuable, even if it's not directed to HLI. This, too, counts as "figuring out a path forward".
edited so that I only had a couple of comments rather than 4
I am confident those involved really care about doing good and work really hard. And I don't want that to be lost in this confusion. Something is going on here, but I think "it is confusing" is better than "HLI are baddies".
For clarity, being 2x better than cash transfers would still provide it with good reason to be on GWWC's top charity list, right? Since GiveDirectly is?
I guess the most damning claim seems to be about dishonesty, which I find hard to square with the caliber of the team. So, what's going on here? If, as seems likely, the forthcoming RCT downgrades SM a lot and the HLI team should have seen this coming, why didn't they act? Or do they still believe that the RCT will return very positive results? What happens when, as seems likely, they are very wrong?
Note that SimonM is a quant by day and was for a time top on Metaculus, so I am less surprised that he can produce such high-caliber work in his spare time[1].
I don't know how to say this, but it doesn't surprise me that top individuals are able to do work comparable with research teams. In fact I think it's one of the best cases for the forum. Sometimes talented generalists compete toe to toe with experts.
Finally, it seems possible to me that the criticisms can be true but HLI can still have done work we want to fund. The world is ugly and complicated like this. I think we should aim to make the right call in this case. For me the key question is: why haven't they updated in light of StrongMinds likely being worse than they thought?
I'd be curious, Gregory, about your thoughts on this comment by Matt Lerner that responds to yours: https://forum.effectivealtruism.org/posts/g4QWGj3JFLiKRyxZe/the-happier-lives-institute-is-funding-constrained-and-needs?commentId=Bd9jqxAR6zfg8z4Wy
Simon worked as a crypto quant and has since lost his job (cos of the crash caused by FTX) so is looking for work including EA work. You can message him if interested.
+1 regarding extending the principle of charity towards HLI. Anecdotally, it seems very common for initial CEA estimates to be revised down as the analysis is critiqued. I think HLI has done an exceptional job at being transparent and open regarding their methodology and the source of disagreements; e.g. see Joel's comment outlining the sources of disagreement between HLI and GiveWell, which I thought was really exceptional (https://forum.effectivealtruism.org/posts/h5sJepiwGZLbK476N/assessment-of-happier-lives-institute-s-cost-effectiveness?commentId=LqFS5yHdRcfYmX9jw). Obviously I haven't spent as much time digging into the results as Gregory has, but the mistakes he points to don't seem like the kind that should be treated too harshly.
As a separate point, I think it's generally a lot easier to critique and build upon an analysis after the initial work has been done. E.g. even if it is the case that SimonM's assessment of Strong Minds is more reliable than HLI's (HLI seem to dispute that the critiques he levies are all that important, as they only assign a 13% weight to that RCT), this isn't necessarily evidence that SimonM is more competent than the HLI team. When the heavy lifting has been done, it's easier to focus in on particular mistakes (and of course valuable to do so!).
I think GiveDirectly gets special privilege because "just give the money to the poorest people" is such a safe bet for how to spend money altruistically.
Like if a billionaire wanted to spend a million dollars making your life better, they could either:
just give you the million dollars directly, or
spend the money on something that they personally think would be best for you
You'd want them to set a pretty high bar of "I have high confidence that the thing I chose to spend the money on will be much better than whatever you would spend the money on yourself."
GiveDirectly does not have the "top-rated" label on GWWC's list, while SM does as of this morning.
I can't find the discussion, but my understanding is that "top-rated" means that an evaluator GWWC trusts - in SM's case, that was Founders Pledge - thinks that a charity is at a certain multiple (was it like 4x?) over GiveDirectly.
However, on this post, Matt Lerner @ FP wrote that "We disagree with HLI about SM's rating - we use HLI's work as a starting point and arrive at an undiscounted rating of 5-6x; subjective discounts place it between 1-2x, which squares with GiveWell's analysis."
So it seems that GWWC should withdraw the "top-rated" flag because none of its trusted evaluation partners currently rates SM at better than 2.3x cash. It should not, however, remove SM from the GWWC platform, as it meets the criteria for inclusion.
Hmm, this feels a bit off. I don't think GiveDirectly should get special privilege. Though I agree the out-of-model factors seem to go better for GD than others, so I would kind of bump it up.
Hello Nathan. Thanks for the comment. I think the only key place where I would disagree with you is what you said here
As I said in response to Greg (to which I see you've replied), we use the conventional scientific approach of relying on the sweep of existing data - rather than on our predictions of what future evidence (from a single study) will show. Indeed, I'm not sure how easily these would come apart: I would base my predictions substantially on the existing data, which we've already gathered in our meta-analysis (obviously, it's a matter of debate how to synthesise data from different sources, and opinions will differ). I don't have any reason to assume the new RCT will show effects substantially lower than the existing evidence, but perhaps others are aware of something we're not.
Yeah, for what it's worth, it wasn't clear to me until later that this was only like 10% of the weighting in your analysis.
Man, why don't images resize properly. I've deleted it because it was too obnoxious when huge.
Here is a manifold market for Gregory's claim if you want to bet on it.
Is your 5K donation promised to Strongminds or HLI?
HLI - but if for whatever reason they're unable or unwilling to receive the donation at resolution, Strongminds.
The "resolution criteria" are also potentially ambiguous (my bad). I intend to resolve any ambiguity stringently against me, but you are welcome to be my adjudicator.
[To add: I'd guess a ~30-something% chance I end up paying out: d = 0.4 is at or below pooled effect estimates for psychotherapy generally. I am banking on significant discounts with increasing study size and quality (as well as the other things I mention above that I take as adverse indicators), but even if I price these right, I expect high variance.
I set the bar this low (versus, say, d = 0.6 - at the ~5th percentile of HLI's estimate) primarily to make a strong rod for my own back. Mordantly criticising an org whilst they are making a funding request in a financially precarious position should not be done lightly. Although I'd stand by my criticism of HLI even if the trial found Strongminds was even better than HLI predicted, I would regret being quite as strident if the results were any less than dramatically discordant.
If so, me retreating to something like "Meh, they got lucky"/"Sure, I was (kinda) wrong, but you didn't deserve to be right" seems craven after over-cooking remarks potentially highly adverse to HLI's fundraising efforts. Fairer would be that I suffer some financial embarrassment, which helps compensate HLI for their injury from my excess.
Perhaps I could have (or should have) done something better. But in fairness to me, I think this is all supererogatory on my part: I do not think my comment is the only example of stark criticism on this forum, but it might be unique in its author levying an expected cost of over $1000 on themselves for making it.]
Would you happen to have a prediction of the likelihood of d ≥ 0.6? (No money involved, you've put more than enough $ on the line already!)
8%, but perhaps expected drift of a factor of two either way if I thought about it for a few hours vs. a few minutes.