0) My bad re rma.rv output, sorry. I’ve corrected the offending section. (I’ll return to some second order matters later).
1) I imagine climbing in Mexico is more pleasant than arguing statistical methods on the internet, so I’ve attempted to save you at least some time on the latter by attempting to replicate your analysis myself.
This attempt was only partially successful: I took the ‘Lay or Group cleaner’ sheet and (per previous comments) flipped the signs where necessary so only Houshofer et al. shows a negative effect. Plugging this into R means I get basically identical results for the forest plot (RE mean 0.50 versus 0.51) and funnel plot (Eggers lim value 0.2671 vs. 0.2670). I get broadly similar but discordant values for the univariate linear and exp decay, as well as model 1 in table 2 [henceforth ‘model 3’] (intercepts and coefficients ~ within a standard error of the write-up’s figures), and much more discordant values for the others in table 2.
I expect this ‘failure to fully replicate’ is mostly owed to a mix of i) very small discrepancies between the datasets we are working off are likely to be amplified in more complex analysis than simpler forest plots etc. ii) I’d guess the covariates would be much more discrepant, and there are more degrees of freedom in how they could be incorporated, so it is much more likely we aren’t doing exactly the same thing (e.g. ‘Layness’ in my sheet seems to be ordinal—values of 0-3 depending on how well trained the provider was, whilst the table suggests it was coded as categorical (trained or not)in the original analysis. Hopefully it is ‘close enough’ for at least some indicative trends not to be operator error. In the spirit of qualified reassurance here’s my funnel plot:
2) Per above, one of the things I wanted to check is whether indeed you see large drops in effect size when you control for small studies/publication bias/etc. You can’t neatly merge (e.g.) Egger into meta-regression (at least, I can’t), but I can add in study standard error as a moderator. Although there would be many misgivings of doing this vs. (e.g.) some transformation (although I expect working harder to linearize etc. would accentuate any effects), there are two benefits: i) extremely simple, ii) it also means the intercept value is where SE = 0, and so gives an estimate of what a hypothetical maximally sized study would suggest.
Adding in SE as a moderator reduces the intercept effect size by roughly half (model 1: 0.51 → 0.25; model 2: 0.42 → 0.23; model 3: 0.69 ->0.36). SE inclusion has ~no effect on the exponential model time decay coefficient, but does seem to confound the linear decay coefficient (effect size down by a third, so no longer a significant predictor) and the single group or individual variable I thought I could helpfully look at (down by ~20%). I take this as suggestive there is significant confounding of results by small study effects, and bayesian best guess correction is somewhere around a 50% discount.
3) As previously mentioned, if you plug this into the guestimate you do not materially change the CEA (roughly 12x to 9x if you halve the effect sizes), but this is because this CEA will return strongminds at least seven times better than cash transfers even if the effect size in the MRAs are set to zero. I did wonder how negative the estimate would have to be to change the analysis, but the gears in the guestimate include logs so a lot of it errors if you feed in negative values. I fear though, if it were adapted, it would give absurd results (e.g. still recommending strongminds even if the MRAs found psychotherapy exacerbated depression more than serious adverse life events).
4) To have an empty file-drawer, I also looked at ‘source’ to see whether cost survey studies gave higher effects due to the selection bias noted above. No: non-significantly numerically lower.
5) So it looks like the publication bias is much higher than estimated in the write-up: more 50% than 15%. I fear part of the reason for this discrepancy is the approach taken in Table A.2 is likely methodologically and conceptually unsound. I’m not aware of a similar method in the literature, but it sounds like what you did is linearly (?meta)regress g on N for the metaPsy dataset (at least, I get similar figures when I do this, although my coefficient is 10x larger). If so, this doesn’t make a lot of sense to me—SE is non-linear in N, the coefficient doesn’t limit appropriately (e.g. an infinitely large study has +inf or -inf effects depending on which side of zero the coefficient is), and you’re also extrapolating greatly out of sample for the correction between average study sizes. The largest study in MetaPsy is ~800 (I see two points on my plot above 650), but you are taking the difference of N values at ~630 and ~2700.
Even more importantly, it is very odd to use a third set of studies to make the estimate versus the two literatures you are evaluating (given an objective is to compare the evidence bases, why not investigate them directly?) Treating them alike also assumes they share the same degree of small study effects—there are just at different points ‘along the line’ because one tends to have bigger studies than the other. It would seem reasonable to consider that the fields may differ in their susceptibility to publication bias and p-hacking, so—controlling for N—cash transfer studies are less biased than psychotherapy ones. As we see from the respective forest plots, this is clearly the case—the regression slope for psychotherapy is like 10x or something as slope-y as the one for cash transfers.
(As a side note, MetaPsy lets you shove all of their studies into a forest plot, which looks approximately as asymmetric as the one from the present analysis:)
6) Back to the meta stuff.
I don’t suspect either you or HLI of nefarious or deceptive behaviour (besides priors, this is strongly ruled against by publishing data that folks could analyse for themselves). But I do suspect partiality and imperfect intellectual honesty. By loose analogy, rather than a football referee who is (hopefully) unbiased but perhaps error prone, this is more like the manager of one of the teams claiming “obviously” their side got the rough end of the refereeing decisions (maybe more error prone in general, definitely more likely to make mistakes favouring one ‘side’, but plausibly/probably sincere), but not like (e.g.) a player cynically diving to try and win a penalty. In other words, I suspect—if anything—you mostly pulled the wool over your own eyes, without really meaning to.
One reason this arises is, unfortunately, the more I look into things the more cause for concern I find. Moreover, the direction of concern re. these questionable-to-dubious analysis choices strongly tend in the direction of favouring the intervention. Maybe I see what I want to, but can’t think of many cases where the analysis was surprisingly incurious about a consideration which would likely result in the effect size being adjusted upwards, nor where a concern about accuracy and generalizability could be further allayed with an alternative statistical technique (one minor example of the latter—it looks like you coded Mid and Low therapy as categoricals when testing sensitivity to therapyness: if you ordered them I expect you’d get a significant test for trend).
I’m sorry again for mistaking the output you were getting, but—respectfully—it still seems a bit sus. It is not like one should have had a low index of suspicion for lots of heterogeneity given how permissively you were including studies; although Q is not an oracular test statistic, P<0.001 should be an prompt to look at this further (especially as you can look at how Q changes when you add in covariates, and lack of great improvement when you do is a further signal); and presumably the very low R2 values mentioned earlier would be another indicator.
Although meta-analysis as a whole is arduous, knocking up a forest and funnel plot to have a look (e.g. whether one should indeed use random vs. fixed effects, given one argument for the latter is they are less sensitive to small study effects) is much easier: I would have no chance of doing any of this statistical assessment without all your work getting the data in the first place; with it, I got the (low-quality, but informative) plots in well under an hour, and do what you’ve read above took a morning.
I had the luxury of not being on a deadline, but I’m afraid a remark like “I didn’t feel like I had time to put everything in both CEAs, explain it, and finish both CEAs before 2021 ended (which we saw as important for continuing to exist)” inspires sympathy but not reassurance on objectivity. I would guess HLI would have seen not only the quality and timeliness of the CEAs as important to its continued existence, but also the substantive conclusions they made: “We find the intervention we’ve discovered is X times better than cash transfers, and credibly better than Givewell recs” seems much better in that regard than (e.g.) “We find the intervention we previously discovered and recommended, now seems inferior to cash transfers—leave alone Givewell top charities—by the lights of our own further assessment”.
Besides being less pleasant, speculating over intentions is much less informative than the actual work itself. I look forward to any further thoughts you have on whether I am on the right track re. correction for small study effects, and I hope future work will indeed show this intervention is indeed as promising as your original analysis suggests.
Hello Joel,
0) My bad re rma.rv output, sorry. I’ve corrected the offending section. (I’ll return to some second order matters later).
1) I imagine climbing in Mexico is more pleasant than arguing statistical methods on the internet, so I’ve attempted to save you at least some time on the latter by attempting to replicate your analysis myself.
This attempt was only partially successful: I took the ‘Lay or Group cleaner’ sheet and (per previous comments) flipped the signs where necessary so only Houshofer et al. shows a negative effect. Plugging this into R means I get basically identical results for the forest plot (RE mean 0.50 versus 0.51) and funnel plot (Eggers lim value 0.2671 vs. 0.2670). I get broadly similar but discordant values for the univariate linear and exp decay, as well as model 1 in table 2 [henceforth ‘model 3’] (intercepts and coefficients ~ within a standard error of the write-up’s figures), and much more discordant values for the others in table 2.
I expect this ‘failure to fully replicate’ is mostly owed to a mix of i) very small discrepancies between the datasets we are working off are likely to be amplified in more complex analysis than simpler forest plots etc. ii) I’d guess the covariates would be much more discrepant, and there are more degrees of freedom in how they could be incorporated, so it is much more likely we aren’t doing exactly the same thing (e.g. ‘Layness’ in my sheet seems to be ordinal—values of 0-3 depending on how well trained the provider was, whilst the table suggests it was coded as categorical (trained or not)in the original analysis. Hopefully it is ‘close enough’ for at least some indicative trends not to be operator error. In the spirit of qualified reassurance here’s my funnel plot:
2) Per above, one of the things I wanted to check is whether indeed you see large drops in effect size when you control for small studies/publication bias/etc. You can’t neatly merge (e.g.) Egger into meta-regression (at least, I can’t), but I can add in study standard error as a moderator. Although there would be many misgivings of doing this vs. (e.g.) some transformation (although I expect working harder to linearize etc. would accentuate any effects), there are two benefits: i) extremely simple, ii) it also means the intercept value is where SE = 0, and so gives an estimate of what a hypothetical maximally sized study would suggest.
Adding in SE as a moderator reduces the intercept effect size by roughly half (model 1: 0.51 → 0.25; model 2: 0.42 → 0.23; model 3: 0.69 ->0.36). SE inclusion has ~no effect on the exponential model time decay coefficient, but does seem to confound the linear decay coefficient (effect size down by a third, so no longer a significant predictor) and the single group or individual variable I thought I could helpfully look at (down by ~20%). I take this as suggestive there is significant confounding of results by small study effects, and bayesian best guess correction is somewhere around a 50% discount.
3) As previously mentioned, if you plug this into the guestimate you do not materially change the CEA (roughly 12x to 9x if you halve the effect sizes), but this is because this CEA will return strongminds at least seven times better than cash transfers even if the effect size in the MRAs are set to zero. I did wonder how negative the estimate would have to be to change the analysis, but the gears in the guestimate include logs so a lot of it errors if you feed in negative values. I fear though, if it were adapted, it would give absurd results (e.g. still recommending strongminds even if the MRAs found psychotherapy exacerbated depression more than serious adverse life events).
4) To have an empty file-drawer, I also looked at ‘source’ to see whether cost survey studies gave higher effects due to the selection bias noted above. No: non-significantly numerically lower.
5) So it looks like the publication bias is much higher than estimated in the write-up: more 50% than 15%. I fear part of the reason for this discrepancy is the approach taken in Table A.2 is likely methodologically and conceptually unsound. I’m not aware of a similar method in the literature, but it sounds like what you did is linearly (?meta)regress g on N for the metaPsy dataset (at least, I get similar figures when I do this, although my coefficient is 10x larger). If so, this doesn’t make a lot of sense to me—SE is non-linear in N, the coefficient doesn’t limit appropriately (e.g. an infinitely large study has +inf or -inf effects depending on which side of zero the coefficient is), and you’re also extrapolating greatly out of sample for the correction between average study sizes. The largest study in MetaPsy is ~800 (I see two points on my plot above 650), but you are taking the difference of N values at ~630 and ~2700.
Even more importantly, it is very odd to use a third set of studies to make the estimate versus the two literatures you are evaluating (given an objective is to compare the evidence bases, why not investigate them directly?) Treating them alike also assumes they share the same degree of small study effects—there are just at different points ‘along the line’ because one tends to have bigger studies than the other. It would seem reasonable to consider that the fields may differ in their susceptibility to publication bias and p-hacking, so—controlling for N—cash transfer studies are less biased than psychotherapy ones. As we see from the respective forest plots, this is clearly the case—the regression slope for psychotherapy is like 10x or something as slope-y as the one for cash transfers.
(As a side note, MetaPsy lets you shove all of their studies into a forest plot, which looks approximately as asymmetric as the one from the present analysis:)
6) Back to the meta stuff.
I don’t suspect either you or HLI of nefarious or deceptive behaviour (besides priors, this is strongly ruled against by publishing data that folks could analyse for themselves). But I do suspect partiality and imperfect intellectual honesty. By loose analogy, rather than a football referee who is (hopefully) unbiased but perhaps error prone, this is more like the manager of one of the teams claiming “obviously” their side got the rough end of the refereeing decisions (maybe more error prone in general, definitely more likely to make mistakes favouring one ‘side’, but plausibly/probably sincere), but not like (e.g.) a player cynically diving to try and win a penalty. In other words, I suspect—if anything—you mostly pulled the wool over your own eyes, without really meaning to.
One reason this arises is, unfortunately, the more I look into things the more cause for concern I find. Moreover, the direction of concern re. these questionable-to-dubious analysis choices strongly tend in the direction of favouring the intervention. Maybe I see what I want to, but can’t think of many cases where the analysis was surprisingly incurious about a consideration which would likely result in the effect size being adjusted upwards, nor where a concern about accuracy and generalizability could be further allayed with an alternative statistical technique (one minor example of the latter—it looks like you coded Mid and Low therapy as categoricals when testing sensitivity to therapyness: if you ordered them I expect you’d get a significant test for trend).
I’m sorry again for mistaking the output you were getting, but—respectfully—it still seems a bit sus. It is not like one should have had a low index of suspicion for lots of heterogeneity given how permissively you were including studies; although Q is not an oracular test statistic, P<0.001 should be an prompt to look at this further (especially as you can look at how Q changes when you add in covariates, and lack of great improvement when you do is a further signal); and presumably the very low R2 values mentioned earlier would be another indicator.
Although meta-analysis as a whole is arduous, knocking up a forest and funnel plot to have a look (e.g. whether one should indeed use random vs. fixed effects, given one argument for the latter is they are less sensitive to small study effects) is much easier: I would have no chance of doing any of this statistical assessment without all your work getting the data in the first place; with it, I got the (low-quality, but informative) plots in well under an hour, and do what you’ve read above took a morning.
I had the luxury of not being on a deadline, but I’m afraid a remark like “I didn’t feel like I had time to put everything in both CEAs, explain it, and finish both CEAs before 2021 ended (which we saw as important for continuing to exist)” inspires sympathy but not reassurance on objectivity. I would guess HLI would have seen not only the quality and timeliness of the CEAs as important to its continued existence, but also the substantive conclusions they made: “We find the intervention we’ve discovered is X times better than cash transfers, and credibly better than Givewell recs” seems much better in that regard than (e.g.) “We find the intervention we previously discovered and recommended, now seems inferior to cash transfers—leave alone Givewell top charities—by the lights of our own further assessment”.
Besides being less pleasant, speculating over intentions is much less informative than the actual work itself. I look forward to any further thoughts you have on whether I am on the right track re. correction for small study effects, and I hope future work will indeed show this intervention is indeed as promising as your original analysis suggests.