Hello Michael,
Thanks for your reply. In turn:
1:
HLI has, in fact, put a lot of weight on the d = 1.72 StrongMinds RCT. As table 2 shows, you give a weight of 13% to it - joint highest out of the 5 pieces of direct evidence. As there are ~45 studies in the meta-analytic results, this means this RCT is being given equal or (substantially) greater weight than any other study you include. For similar reasons, the StrongMinds phase 2 trial is accorded the third-highest weight out of all studies in the analysis.
HLI's analysis explains the rationale behind the weighting as "using an appraisal of its risk of bias and relevance to StrongMinds' present core programme". Yet table 1A notes the quality of the 2020 RCT is "unknown" - presumably because StrongMinds has "only given the results and some supporting details of the RCT". I don't think it can be reasonable to assign the highest weight to an (as far as I can tell) unpublished, non-peer-reviewed, unregistered study conducted by StrongMinds on its own effectiveness, reporting an astonishing effect size, before it has even been read in full. It should be dramatically downweighted or wholly discounted until then, rather than included at face value with a promise that HLI will follow up later.
Risk of bias in this field is generally massive: effect sizes commonly melt as study quality improves. Assigning ~40% of a weighted average effect size to a collection of 5 studies - 4 [actually 3, more later] of which are (marked) outliers in effect size, and 2 of which were conducted by the charity itself - is unreasonable. This can be dramatically demonstrated from HLI's own data:
One thing I didn't notice last time I looked is that HLI did code variables on study quality for the included studies, although none of them seem to have been used in any of the published analysis. I have some good news, and some very bad news.
The good news is that the first such variable I looked at, ActiveControl, is a significant predictor of greater effect size. Studies with better controls report greater effects (roughly 0.6 versus 0.3). This effect is significant (p = 0.03) although small (10% of the variance) and difficult - at least for me - to explain: I would usually expect a worse control to widen the gap between it and the intervention group, not narrow it. In any case, this marker of study quality definitely does not explain away HLI's findings.
The second variable I looked at was "UnpubOr(pre?)reg".[1] As far as I can tell, a code of 1 means something like "the study was publicly registered" and 0 means it wasn't (I'm guessing 0.5 means something intermediate, like retrospective registration). In any case, this variable correlates extremely closely (>0.95) with my own coding of whether a study mentions being registered, after reviewing all of them myself. If so, using it as a moderator makes devastating reading:[2]
To orientate: in "Model results" the intercept value gives the estimated effect size when the "unpub" variable is zero (as I understand it, ~unregistered studies), so d ~ 1.4 (!) for this set of studies. The row below gives the change in effect if you move from "unpub = 0" to "unpub = 1" (i.e. comparing registered with unregistered studies): this drops the effect size by 1, so registered studies give effects of ~0.3. In other words, unregistered and registered studies give dramatically different effects: study registration reduces expected effect size by a factor of 3. [!!!]
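(For readers who want the mechanics: below is a minimal sketch of this kind of random-effects meta-regression with a binary registration moderator. The study data, the DerSimonian-Laird variance estimate, and the weighted least-squares fit are all illustrative stand-ins rather than HLI's actual dataset or tooling; the point is only how the intercept and moderator coefficient are read off.)

```python
import numpy as np

# Hypothetical per-study data (NOT the real HLI dataset): observed effect
# sizes d_i, their sampling variances v_i, and a 0/1 registration flag.
d = np.array([1.6, 1.3, 1.5, 1.2, 0.4, 0.3, 0.2, 0.35])
v = np.array([0.05, 0.04, 0.06, 0.05, 0.02, 0.03, 0.02, 0.025])
registered = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Crude DerSimonian-Laird estimate of between-study variance (tau^2) from
# the intercept-only model, then random-effects weights.
w_fe = 1 / v
d_bar = np.sum(w_fe * d) / np.sum(w_fe)
Q = np.sum(w_fe * (d - d_bar) ** 2)
c = np.sum(w_fe) - np.sum(w_fe ** 2) / np.sum(w_fe)
tau2 = max(0.0, (Q - (len(d) - 1)) / c)
w = 1 / (v + tau2)

# Weighted least squares with an intercept and the registration moderator.
X = np.column_stack([np.ones_like(d), registered])
W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ d)

print(f"intercept (unregistered studies): d ~ {beta[0]:.2f}")
print(f"registration coefficient:         {beta[1]:+.2f}")
# Predicted effect for registered studies = intercept + coefficient.
```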
The other statistics provided deepen the concern. The included studies have a very high level of heterogeneity (i.e. their effect sizes vary much more than they should by chance). Although HLI attempted to explain this variation with various meta-regressions using features of the intervention, follow-up time, etc., these models left the great bulk of the variation unexplained. Although not like-for-like, here a single indicator of study quality provides a compelling explanation for why effect sizes differ so much: it explains three-quarters of the initial variation.[3]
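(The "proportion of variation explained" in a meta-regression is usually reported as the R-squared analogue, i.e. the relative reduction in the estimated between-study variance once the moderator is added; something of this kind presumably underlies the three-quarters figure:

$$R^2_{\text{analog}} = \frac{\hat\tau^2_{\text{without moderator}} - \hat\tau^2_{\text{with moderator}}}{\hat\tau^2_{\text{without moderator}}}.$$

)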
This is easily seen in a grouped forest plot - the top group is the unregistered studies, the second group the registered ones:
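(A minimal matplotlib sketch of this kind of grouped forest plot, with made-up studies standing in for the real ones, in case anyone wants to reproduce the general picture:)

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical studies (label, d, 95% CI half-width, registered?) - purely
# illustrative, not the HLI dataset.
studies = [
    ("Study A", 1.6, 0.45, False), ("Study B", 1.4, 0.40, False),
    ("Study C", 1.2, 0.50, False), ("Study D", 0.35, 0.20, True),
    ("Study E", 0.30, 0.25, True), ("Study F", 0.25, 0.15, True),
]
studies.sort(key=lambda s: s[3])          # unregistered group first
labels = [s[0] for s in studies]
d = np.array([s[1] for s in studies])
ci = np.array([s[2] for s in studies])
y = np.arange(len(studies))[::-1]         # plot top-to-bottom

fig, ax = plt.subplots(figsize=(6, 4))
ax.errorbar(d, y, xerr=ci, fmt="s", color="black", capsize=3)
ax.axvline(0, color="grey", linewidth=0.8)          # null effect line
ax.axhline(2.5, color="grey", linestyle="--", linewidth=0.8)  # group divider
ax.set_yticks(y)
ax.set_yticklabels(labels)
ax.set_xlabel("Effect size (Cohen's d)")
ax.set_title("Forest plot grouped by registration status")
plt.tight_layout()
plt.show()
```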
This pattern also perfectly fits the 5 pieces of direct evidence: Bolton 2003 (ES = 1.13), the StrongMinds RCT (1.72), and StrongMinds P2 (1.09) are, as far as I can tell, unregistered. Thurman 2017 (0.09) was registered. Bolton 2007 is also registered, and in fact has an effect size of ~0.5, not 1.79 as HLI reports.[4]
To be clear, I do not think HLI knew of this before I found it out just now. But results like this indicate: i) the appraisal of the literature in this analysis is gravely off the mark - study quality provides the best available explanation for why some trials report dramatically higher effects than others; ii) the result of this oversight is a dramatic over-estimation of the likely efficacy of StrongMinds (a ready explanation for the large effects reported in the most "relevant to StrongMinds" studies is that these studies were not registered, and thus prone to ~200%+ inflation of effect size); iii) this is a very surprising mistake for a diligent and impartial evaluator to make: one would expect careful assessment of study quality - and very sceptical evaluation where this appears to be lacking - to be foremost, especially given both the subfield and prior reporting from StrongMinds heavily underline it. This pattern, alas, will prove repetitive.
I also think a finding like this should prompt an urgent withdrawal of both the analysis and the recommendation pending further assessment. In honesty, if this doesn't, I'm not sure what ever could.
2:
Indeed, excellent researchers overlook things, and although I think both the frequency and severity of the things HLI mistakes or overlooks are less-than-excellent, one could easily attribute this to "inexperience", "trying to do a lot in a hurry", "limited staff capacity", and so on.
Yet this cannot account for how starkly asymmetric the impact of these mistakes and oversights is. HLI's mistakes are consistently to StrongMinds' benefit rather than its detriment, and while HLI rarely misses a consideration which could enhance the "multiple", it frequently misses causes for concern which undermine both the strength and the reliability of this recommendation. HLI's award from GiveWell deepens my concerns here, as it is consistent with a very selective scepticism: HLI can carefully scrutinize charity evaluations by others it wants to beat, but fails to mete out remotely comparable measure to its own, which it intends for triumph.
I think this also explains how HLI responds to criticism, which I have found by turns concerning and frustrating. HLI makes some splashy claim (cf. "mission accomplished", "confident recommendation", etc.). Someone else (eventually) takes a closer look, and finds that the surprising splashy claim, rather than basically checking out "most reasonable ways you slice it", is highly non-robust, and only follows given HLI slicing it heavily in favour of their bottom line in terms of judgement or analysis - the latter of which often contains errors which further favour said bottom line. HLI reliably responds, but the tenor of this response is less "scientific discourse" and more "lawyer for the defence": where it can, HLI will too often double down on calls it makes which I aver the typical reasonable spectator would deem at best dubious and at worst tendentious; where it can't, HLI acknowledges the shortcoming but asserts (again, usually very dubiously) that it isn't that big a deal, so it will deprioritise addressing it versus producing yet more work with the shortcomings familiar from that which came before.
3:
HLI's meta-analysis in no way allays or rebuts the concerns SimonM raised re. StrongMinds - indeed, appropriate analysis would enhance many of them. Nor is it the case that the meta-analytic work makes HLI's recommendation robust to shortcomings in the StrongMinds-specific evidence - indeed, the cost-effectiveness calculator will robustly recommend StrongMinds as superior (commonly, several times superior) to GiveDirectly almost no matter what efficacy results (meta-analytic or otherwise) are fed into it. On each:
a) Meta-analysis could help contextualize the problems SimonM identifies in the StrongMinds-specific data. For example, a funnel plot which is less of a "funnel" and more of a ski-slope (i.e. massive small-study effects / risk of publication bias), and a contour/p-curve analysis suggestive of p-hacking, would suggest the field's literature needs to be handled with great care. Finding that "StrongMinds-relevant" studies and direct evidence are marked outliers even relative to this pathological literature should raise alarm, given it complements the object-level concerns SimonM presented.
This is indeed true, and these features were present in the studies HLI collected, but HLI failed to recognise it. It may never have done so had I not gotten curious and run these analyses myself. Said analysis is (relative to the much more elaborate techniques used in HLI's meta-analysis) simple to conduct - my initial "work" was taking the spreadsheet and plugging it into a webtool out of idle curiosity.[5] Again, this is a significant mistake, it adds a directional bias in favour of StrongMinds, and it is surprising for a diligent and impartial evaluator to make.
b) In general, incorporating meta-analytic results into what is essentially a weighted average alongside direct evidence does not cleanse either of their object-level shortcomings. If (as here) both are severely compromised, the result remains unreliable.
The particular approach HLI took also doesn't make the finding more robust, as the qualitative bottom line of the cost-effectiveness calculation is insensitive to the meta-analytic result. As-is, the calculator gives StrongMinds as roughly 12x better than GiveDirectly.[6] If you set both meta-analytic effect sizes to zero, the calculator gives StrongMinds as ~7x better than GiveDirectly. So the five pieces of direct evidence are (apparently) sufficient to conclude SM is an extremely effective charity. Obviously this is - and HLI has previously accepted it is - facially invalid output.
It is not the only example. It is extremely hard for any reduction of the efficacy inputs to the model to give a result that StrongMinds is worse than GiveDirectly. If we instead leave the meta-analytic results as they were but set all the effect sizes of the direct evidence to zero (in essence discounting them entirely - which I think is approximately what should have been done from the start), we get ~5x better than GiveDirectly. If we set all the effect sizes of both meta-analysis and direct evidence to 0.4 (i.e. the expected effect of registered studies noted before), we get ~6x better than GiveDirectly. If we set the meta-analytic results to 0.4 and all the direct evidence to zero, we get ~3x GiveDirectly. Only when one sets all the effect sizes to 0.1 - lower than all but ~three of the studies in the meta-analysis - does one approach equipoise.
This result should not surprise on reflection: the CEA's result is roughly proportional to the ~weighted average of input effect sizes, so an initial finding of "10x GiveDirectly" or similar would require roughly a factor-of-10 cut to this average to drag it down to equipoise. Yet this "feature" should be seen as a bug: in the same way there should be some non-zero value of the meta-analytic results which would reverse a "many times better than GiveDirectly" finding, there should be some non-tiny value of effect sizes for a psychotherapy intervention (or psychotherapy interventions in general) which results in it not being better than GiveDirectly at all.
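(To make the structural point concrete, here is a stylised toy version - not HLI's actual spreadsheet, with placeholder weights, pooled effect and scaling - of a bottom line that is proportional to a weighted average of effect-size inputs; zeroing one subset of inputs only removes that subset's share of the multiple, so the "x GiveDirectly" figure never gets near 1:)

```python
# Stylised sketch of the structural point (NOT HLI's actual model).
# ES values for the direct evidence as given in the text; 1.79 is the
# disputed Bolton 2007 figure HLI used.
direct_es = {"Bolton 2003": 1.13, "SM RCT": 1.72, "SM P2": 1.09,
             "Bolton 2007": 1.79, "Thurman 2017": 0.09}
direct_weight = 0.40   # ~40% of the weighted average goes to direct evidence
meta_es = 0.5          # placeholder pooled meta-analytic effect
scale = 20             # arbitrary constant converting avg ES into "x GiveDirectly"

def multiple(direct, meta, w=direct_weight, k=scale):
    avg = w * sum(direct.values()) / len(direct) + (1 - w) * meta
    return k * avg

print(multiple(direct_es, meta_es))                    # baseline multiple
print(multiple(direct_es, 0.0))                        # meta-analytic effects zeroed
print(multiple({k: 0.0 for k in direct_es}, meta_es))  # direct evidence zeroed
```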
This does help explain the somewhat surprising coincidence that the first charity HLI fully assessed would be one it subsequently announces as the most promising intervention in global health and wellbeing found so far: rather than a discovery from the data, this finding is largely preordained by how the CEA stacks the deck. To be redundant (and repetitive): i) the cost-effectiveness model HLI is making is unfit for purpose, given it can produce these absurd results; ii) this introduces a large bias in favour of StrongMinds; iii) it is a very surprising mistake for a diligent and impartial evaluator to make - these problems are not hard to find.
They're even easier for HLI to find once it has been alerted to them. I did so months ago, alongside noting other problems, and suggested the cost-effectiveness analysis and StrongMinds recommendation be withdrawn. Although it should have happened then, perhaps if I repeat myself it might happen now.
4:
Accusations of varying types of bad faith, motivated reasoning, or intellectual dishonesty should indeed be made with care - besides the difficulty of determination, pragmatic considerations raise the bar still higher. Yet I think the evidence of HLI having less a finger and more a fist on the scale throughout its work overwhelms even charitable presumptions made by a saint on its behalf. In footballing terms, I don't think HLI is a player cynically diving to win a penalty, but it is like the manager after the game insisting "their goal was offside, and my player didn't deserve a red, and... (etc.)" - highly inaccurate and highly biased. This is a problem when HLI claims to be an impartial referee, especially when it does things akin to awarding fouls every time a particular player gets tackled.
This is even more of a problem precisely because of the complex and interdisciplinary analysis HLI strives to do. No matter the additional analytic arcana, work like this will be largely Fermi estimates, with variables plugged in with little more to inform them than intuitive guesswork. The high degree of complexity provides a vast garden of forking paths. Although random errors would tend to cancel out, consistent directional bias in model choice, variable selection, and numerical estimates leads to greatly inflated "bottom lines".
Although the transparency in (e.g.) data is commendable, the complex analysis also makes scrutiny harder. I expect very few have both the expertise and the perseverance to carefully vet HLI's analysis themselves; I also expect the vast majority of the money HLI has moved has come from those largely taking its results on trust. This trust is ill-placed: HLI's work weathers scrutiny extremely poorly; my experience is very much "the more you see, the worse it looks". I doubt many donors following HLI's advice, if they took a peek behind the curtain, would be happy with what they would discover.
If HLI is falling foul of an entrenched status quo, it is not particular presumptions around interventions, nor philosophical abstracta around population ethics, but rather the presumption that work in this community (whether published elsewhere or not) should be even-handed, intellectually honest and trustworthy in all cases; rigorous and reliable commensurate with its expected consequence; and transparently and fairly communicated. Going against this grain is, I suspect, why I am not alone in my concerns, and why HLI has not had the warmest reception. The hope this all changes for the better is not entirely forlorn. But things would have to change a lot, and quickly - and the track record thus far does not spark joy.
Really surprised I missed this last time, to be honest. Especially because it is the only column title in the spreadsheet highlighted in red.
Given I will be making complaints about publication bias, file-drawer effects, and garden-of-forking-paths issues later in the show, one might wonder how much of this applies to my own criticism. How much time did I spend dredging through HLI's work looking for something juicy? Is my file drawer stuffed with analyses I hoped would show HLI in a bad light, but which actually showed it in a good one, and so go unmentioned?
Depressingly, the answers are "not much" and "no" respectively. Regressing against publication registration was the second analysis I did on booting up the data again (regressing on active control was the first, mentioned in the text). My file drawer subsequent to this is full of checks and double-checks for alternative (and better for HLI) explanations for the startling result. Specifically, and in order:
- I used the no_FU (no follow-ups) data initially for convenience - the full data can include multiple results from the same study at different follow-up points, and it is inappropriate to ignore this clustering in a simple random-effects model. So I checked both by doing this anyway and by using a multi-level model to appropriately handle this structure in the data (a crude alternative, collapsing follow-ups within each study before pooling, is sketched after this list). No change to the key finding.
- Worried that (somehow) I was messing up or misinterpreting the meta-regression, I (re)constructed a simple forest plot of all the studies, and confirmed that the unregistered ones were indeed visibly off to the right. I then grouped the forest plot by the registration variable to ensure it closely agreed with the meta-regression (in the main text). It does.
- I then checked the first 10 studies coded by the variable I take to be trial registration, to check that the registration status of those studies matched the codes. Although all fit, I thought the residual risk that I was misunderstanding the variable was unacceptably high for a result significant enough to warrant a retraction demand. So I checked and coded all 46 studies by "registered or not?" to make sure this agreed with my presumptive interpretation of the variable (in the text). It does.
- Adding multiple variables to explain an effect geometrically expands researcher degrees of freedom, so any unprincipled ad hoc investigation that adds or removes them has a very high false discovery rate (I suspect this is a major problem with HLI's own meta-regression work, but compared to everything else it merits only a passing mention here). But I wanted to check whether I could find ways (even if unprincipled and ad hoc) to attenuate a result as stark as "unregistered studies report 3x the effects of registered ones".
- I first tried to replicate HLI's meta-regression work (exponential transformations and all) to see whether the registration effect would be attenuated by intervention variables. Unfortunately, I was unable to replicate HLI's regression results from the information provided (perhaps my fault). In any case, simpler versions I constructed did not give evidence for this.
- I also tried throwing in permutations of IPT-or-not (these studies tend to be unregistered - maybe this is the real cause of the effect?), active-control-or-not (given it had a positive effect size, maybe it cancels out registration?) and study standard error (a proxy - albeit a controversial one - for study size/precision/quality, so if registration were confounded by it, this would slightly challenge the interpretation). The worst result across all the variations I tried was to drop the effect size of registration by 20% (~ -1 to -0.8), typically via substitution with SE. Omitted variable bias and multiple comparisons mean any further interpretation would be treacherous, but insofar as it provides further support: adding in more proxies for study quality increases explanatory power, and tends to give even greater absolute and relative drops in effect size comparing "highest" versus "lowest" quality studies.
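(As flagged in the first bullet above, here is a minimal sketch of the cruder alternative to a multi-level model: collapsing each study's multiple follow-ups into a single inverse-variance-weighted estimate before pooling. Toy numbers, and it naively ignores within-study correlation, but it shows the idea.)

```python
import numpy as np

# Toy data: several follow-up results per study (NOT the real dataset).
# Each row is (study_id, effect size d, sampling variance v).
rows = [
    ("s1", 1.5, 0.05), ("s1", 1.3, 0.06),   # two follow-ups of study s1
    ("s2", 0.4, 0.03), ("s2", 0.35, 0.04),
    ("s3", 1.1, 0.05),
]

# Group the follow-ups by study.
studies = {}
for sid, d, v in rows:
    studies.setdefault(sid, []).append((d, v))

# Collapse each study's follow-ups into one inverse-variance-weighted
# estimate, so clustered results are not treated as independent studies.
collapsed = []
for sid, vals in studies.items():
    w = np.array([1 / v for _, v in vals])
    d = np.array([d for d, _ in vals])
    d_study = np.sum(w * d) / np.sum(w)
    v_study = 1 / np.sum(w)   # naive: ignores within-study correlation
    collapsed.append((sid, d_study, v_study))

print(collapsed)  # one (id, d, variance) triple per study, ready for RE pooling
```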
That said, the effect size is so dramatic as to be essentially immune to file-drawer worries. Even if I had a hundred null results I forgot to mention, this finding would survive a Bonferroni correction.
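(For concreteness: a Bonferroni correction across a hypothetical 100 additional comparisons just means requiring $p < 0.05/100 = 5\times 10^{-4}$ rather than $p < 0.05$.)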
Obviously "is the study registered or not?" is a crude indicator of overall quality. Typically, one would expect that better measurement (perhaps by including further proxies for underlying study quality) would further increase the explanatory power of this factor. In other words, although these results look really bad, the reality is likely to be even worse.
HLI's write-up on Bolton 2007 links to this paper (I did double-check to make sure there wasn't another Bolton et al. 2007 which could have been confused with this - no other match I could find). It has a sample size of 314, not 31 as HLI reports - I presume a data entry error, although it is less than reassuring that this erroneous figure is repeated and subsequently discussed in the text as part of the appraisal of the study: one reason given for weighting it so lightly is its "very small" sample size.
Speaking of erroneous figures, here's the table of results from this study:
I see no way to arrive at an effect size of d = 1.79 from these numbers. The right comparison should surely be the pre-post difference of GIP versus control in the intention-to-treat analysis. These numbers give a Cohen's d of ~0.5.
I don't think any other reasonable comparison gets much higher numbers, and definitely not >3x higher - the differences between any of the groups are smaller than the standard deviations, so estimates like Cohen's d should be bounded below 1.
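(A sketch of the arithmetic, with placeholder numbers rather than the actual table values: the point is just how a difference-in-changes Cohen's d is computed, and why it sits well below 1 whenever the group difference is smaller than the standard deviations.)

```python
import math

# Placeholder summary statistics (NOT the actual Bolton 2007 figures, which
# are in the paper's results table): mean pre-post change and SD per arm.
n_t, change_t, sd_t = 157, -9.0, 12.0   # intervention (GIP), ITT
n_c, change_c, sd_c = 157, -3.0, 12.0   # control, ITT

# Pooled SD and Cohen's d for the difference in pre-post changes.
sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                      / (n_t + n_c - 2))
d = (change_c - change_t) / sd_pooled   # positive = larger improvement in GIP
print(round(d, 2))  # 0.5 with these placeholder numbers
```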
[Re. the file drawer, I guess this counts as a spot check (this is the only study whose data extraction I carefully checked), but not a random one: I looked at this study in particular because it didn't match the pattern of "only unregistered studies report crazy-high effects" - an ES of 1.79 is ~2x that of any other registered study.]
Re. my worries of selective scepticism, HLI did apply these methods in their meta-analysis of cash transfers, where no statistical suggestion of publication bias or p-hacking was evident.
This does depend a bit on whether spillover effects are being accounted for. Doing so seems to cut the multiple by ~20%, but doesn't change the qualitative problems with the CEA. Happy to calculate precisely if someone insists.
Hello Gregory. With apologies, I'm going to pre-commit to making this my last reply to you on this post. This thread has been very costly in terms of my time and mental health, and your points below are, as far as I can tell, largely restatements of your earlier ones. As briefly as I can, and point by point again.
1.
A casual reader looking at your original comment might mistakenly conclude that we only used StrongMinds' own study, and no other data, for our evaluation. Our point was that SM's own work has relatively little weight, and we rely on many other sources. At this point, your argument seems rather "motte-and-bailey". I would agree with you that there are different ways to do a meta-analysis (your point 3), and we plan to publish our new psychotherapy meta-analysis in due course so that it can be reviewed.
2.
Here, you are restating your prior suggestion that HLI should be taken to be acting in bad faith. Your claim is that HLI is good at spotting errors in others' work, but not its own. But there is an obvious explanation: "survivorship" effects. If you spot errors in your own research, you strip them out. Hence, by the time you publish, you've found all the ones you're going to find. This is why peer review is important: external reviewers will spot the errors that authors have missed themselves. Hence, there's nothing odd about having errors in your own work while also finding them in others'. This is the normal stuff of academia!
3.
I'm afraid I don't understand your complaint. I think your point is that "any way you slice the meta-analysis, psychotherapy looks more cost-effective than cash transfers", but you then conclude this shows the meta-analysis must be wrong, rather than that it's sensible to conclude psychotherapy is better. You're right that you would have to deflate all the effect sizes by a large proportion to reverse the result. This should give you confidence in psychotherapy being better! It's worth pointing out that if psychotherapy costs about $150pp, while cash transfers cost about $1,100pp ($1,000 transfer + delivery costs), therapy will be more cost-effective unless its per-intervention effect is much smaller.
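(Spelling out the arithmetic behind this: per dollar, therapy beats cash transfers whenever

$$\frac{E_{\text{therapy}}}{150} > \frac{E_{\text{cash}}}{1100} \iff E_{\text{therapy}} > \tfrac{150}{1100}\,E_{\text{cash}} \approx 0.14\,E_{\text{cash}},$$

i.e. unless therapy's per-person effect is less than roughly one-seventh of the cash transfer's.)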
The explanation behind finding a new charity on our first go is not complicated or sinister. In earlier work, including my PhD, I had suggested that, on a SWB analysis, mental health was likely to be relatively neglected compared to status quo prioritisation methods. I explained this in terms of the existing psychological literature on affective forecasting errors: we're not very good at imagining internal suffering, we probably overstate the badness of material circumstances due to focusing illusions, and our forecasts don't account for hedonic adaptation (which doesn't occur for mental health). So the simple explanation is that we were "digging" where we thought we were most likely to find "altruistic gold", which seems sensible given limited resources.
4.
As much as I enjoyed your football analogies, here too you are restating, rather than further substantiating, your earlier accusations. You seem to conclude, from the fact that you found some problems with HLI's analysis, that HLI - but only HLI - should be distrusted, while we retain our confidence in all the other charity evaluators. This seems unwarranted. Why not conclude you would find mistakes elsewhere too? I am reminded of the expression, "if you knew how the sausage was made, you wouldn't want to eat the sausage". What I think is true is that HLI is a second-generation charity evaluator, we are aiming to be extremely transparent, and we are proposing novel priorities. As a result, I think we have come in for a far higher level of public scrutiny than others have, so more of our errors have been found, but I don't know that we have made more and worse errors. Quite possibly, where errors have been noticed in others' work, they have been quietly and privately identified, and corrected with less fanfare.