Hello Joel,
0) My bad re the rma.mv output, sorry. I've corrected the offending section. (I'll return to some second-order matters later.)
1) I imagine climbing in Mexico is more pleasant than arguing statistical methods on the internet, so I've tried to save you at least some time on the latter by attempting to replicate your analysis myself.
This attempt was only partially successful: I took the "Lay or Group cleaner" sheet and (per previous comments) flipped the signs where necessary so only Haushofer et al. shows a negative effect. Plugging this into R gives basically identical results for the forest plot (RE mean 0.50 versus 0.51) and funnel plot (Egger's limit estimate 0.2671 vs. 0.2670). I get broadly similar but discordant values for the univariate linear and exponential decay models, as well as model 1 in table 2 [henceforth "model 3"] (intercepts and coefficients roughly within a standard error of the write-up's figures), and much more discordant values for the others in table 2.
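For concreteness, the pooling step behind the "RE mean" figures can be sketched as below. This is a minimal DerSimonian-Laird estimator in Python (metafor's rma defaults to REML, so the numbers will differ slightly), run on made-up effect sizes and standard errors rather than the actual sheet:

```python
# Minimal random-effects pooling sketch (DerSimonian-Laird tau^2).
# The g/se values are illustrative only, not the real dataset.
import numpy as np

def dersimonian_laird(g, se):
    """Return the random-effects pooled mean of effects g with standard errors se."""
    g, se = np.asarray(g, float), np.asarray(se, float)
    w = 1.0 / se**2                          # fixed-effect (inverse-variance) weights
    mu_fe = np.sum(w * g) / np.sum(w)        # fixed-effect mean
    q = np.sum(w * (g - mu_fe) ** 2)         # Cochran's Q (heterogeneity statistic)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(g) - 1)) / c)  # between-study variance estimate
    w_re = 1.0 / (se**2 + tau2)              # random-effects weights
    return np.sum(w_re * g) / np.sum(w_re)

g  = [0.9, 0.6, 0.4, 0.3, 0.2]
se = [0.30, 0.25, 0.15, 0.10, 0.08]
print(round(dersimonian_laird(g, se), 3))
```

Note how tau² flattens the weights relative to fixed effects, upweighting small studies; this is also why random effects are more sensitive to small study effects.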
I expect this "failure to fully replicate" is mostly owed to a mix of: i) very small discrepancies between the datasets we are working from, which are likely to be amplified in more complex analyses than in simpler forest plots etc.; ii) the covariates, which I'd guess are much more discrepant, and for which there are more degrees of freedom in how they could be incorporated, so it is much more likely we aren't doing exactly the same thing (e.g. "Layness" in my sheet seems to be ordinal, with values of 0-3 depending on how well trained the provider was, whilst the table suggests it was coded as categorical (trained or not) in the original analysis). Hopefully it is "close enough" for at least some indicative trends not to be operator error. In the spirit of qualified reassurance, here's my funnel plot:
2) Per above, one of the things I wanted to check is whether you do indeed see large drops in effect size when you control for small studies/publication bias/etc. You can't neatly merge (e.g.) Egger's test into a meta-regression (at least, I can't), but I can add study standard error as a moderator. Although there are reasonable misgivings about doing this versus (e.g.) some transformation (though I expect working harder to linearize etc. would accentuate any effects), there are two benefits: i) it is extremely simple; ii) the intercept is the predicted value where SE = 0, and so gives an estimate of what a hypothetical maximally sized study would suggest.
Adding in SE as a moderator reduces the intercept effect size by roughly half (model 1: 0.51 → 0.25; model 2: 0.42 → 0.23; model 3: 0.69 → 0.36). SE inclusion has ~no effect on the exponential model's time decay coefficient, but does seem to confound the linear decay coefficient (effect size down by a third, so no longer a significant predictor) and the single group-or-individual variable I thought I could helpfully look at (down by ~20%). I take this as suggestive that there is significant confounding of results by small study effects, and a Bayesian best-guess correction is somewhere around a 50% discount.
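Mechanically, "SE as a moderator" amounts to something like the PET estimator: a weighted regression of effect size on its own standard error, where the intercept is the fitted effect at SE = 0. A minimal sketch, using toy data with small-study inflation deliberately built in (the real models were fitted with rma, which also handles between-study heterogeneity):

```python
# PET-style sketch: WLS of effect size on standard error; the intercept
# estimates the effect of a hypothetical infinitely large (SE = 0) study.
# Toy data only: the "true" effect is 0.25, inflated by small-study bias.
import numpy as np

def pet_fit(g, se):
    """Fit g ~ 1 + se by weighted least squares (weights 1/se^2)."""
    g, se = np.asarray(g, float), np.asarray(se, float)
    X = np.column_stack([np.ones_like(se), se])
    W = np.diag(1.0 / se**2)
    intercept, slope = np.linalg.solve(X.T @ W @ X, X.T @ W @ g)
    return intercept, slope

se = np.array([0.30, 0.25, 0.15, 0.10, 0.08])
g = 0.25 + 1.0 * se      # small studies (big SE) report bigger effects
b0, b1 = pet_fit(g, se)
print(b0, b1)            # intercept recovers ~0.25, the "SE = 0" effect
```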
3) As previously mentioned, if you plug this into the Guesstimate model you do not materially change the CEA (roughly 12x to 9x if you halve the effect sizes), but this is because this CEA will return StrongMinds as at least seven times better than cash transfers even if the effect sizes in the MRAs are set to zero. I did wonder how negative the estimate would have to be to change the analysis, but the gears in the Guesstimate include logs, so a lot of it errors if you feed in negative values. I fear, though, that if it were adapted it would give absurd results (e.g. still recommending StrongMinds even if the MRAs found psychotherapy exacerbated depression more than serious adverse life events).
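(The failure is just the domain of the logarithm rather than anything subtle: any node that takes the log of an effect size errors once that effect size goes non-positive. A trivial illustration, not the Guesstimate internals:)

```python
# Why negative effect sizes break log-based gears: log is undefined for x <= 0.
import math

try:
    math.log(-0.1)   # e.g. a negative effect size fed into a log node
except ValueError as err:
    print("log fails on negative input:", err)
```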
4) In the interests of an empty file drawer, I also looked at "source" to see whether cost survey studies gave higher effects due to the selection bias noted above. No: they were non-significantly, numerically lower.
5) So it looks like the publication bias is much higher than estimated in the write-up: closer to 50% than 15%. I fear part of the reason for this discrepancy is that the approach taken in Table A.2 is likely methodologically and conceptually unsound. I'm not aware of a similar method in the literature, but it sounds like what you did is linearly (?meta)regress g on N for the MetaPsy dataset (at least, I get similar figures when I do this, although my coefficient is 10x larger). If so, this doesn't make a lot of sense to me: SE is non-linear in N; the coefficient doesn't limit appropriately (e.g. an infinitely large study has +inf or -inf effect depending on which side of zero the coefficient is); and you're also extrapolating greatly out of sample for the correction between average study sizes. The largest study in MetaPsy is ~800 (I see two points on my plot above 650), but you are taking the difference of N values at ~630 and ~2700.
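To put numbers on the non-linearity: for a balanced two-arm trial of total size n, the standard error of a standardized mean difference is roughly sqrt(4/n) (ignoring the small g²/(2n) term), so SE shrinks with the square root of N and flattens out, which a linear-in-N correction cannot mimic. A quick sketch of the scaling at the sizes in question:

```python
# SE of Cohen's d/g for a balanced two-arm trial scales roughly as sqrt(4/n),
# so it is convex and flattening in total N, not linear. The N values below
# are the ones at issue (~630 and ~2700), plus endpoints for contrast.
import math

def approx_se(n):
    """Approximate SE of a standardized mean difference, n/2 per arm."""
    return math.sqrt(4.0 / n)

for n in (100, 630, 2700, 100000):
    print(n, round(approx_se(n), 4))
```

On this scaling, going from N ≈ 630 to N ≈ 2700 only roughly halves SE (≈0.08 → ≈0.04), whereas a linear-in-N correction keeps subtracting a fixed amount per extra participant without limit.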
Even more importantly, it is very odd to use a third set of studies to make the estimate rather than the two literatures you are evaluating (given an objective is to compare the evidence bases, why not investigate them directly?). Treating them alike also assumes they share the same degree of small study effects, and are just at different points "along the line" because one tends to have bigger studies than the other. It would seem reasonable to consider that the fields may differ in their susceptibility to publication bias and p-hacking, such that, controlling for N, cash transfer studies are less biased than psychotherapy ones. As we see from the respective funnel plots, this is clearly the case: the regression slope for psychotherapy is something like 10x as steep as the one for cash transfers.
(As a side note, MetaPsy lets you shove all of their studies into a funnel plot, which looks approximately as asymmetric as the one from the present analysis:)
6) Back to the meta stuff.
I don't suspect either you or HLI of nefarious or deceptive behaviour (besides priors, this is strongly ruled against by publishing data that folks could analyse for themselves). But I do suspect partiality and imperfect intellectual honesty. By loose analogy: rather than a football referee who is (hopefully) unbiased but perhaps error-prone, this is more like the manager of one of the teams claiming "obviously" their side got the rough end of the refereeing decisions (maybe more error-prone in general, definitely more likely to make mistakes favouring one "side", but plausibly/probably sincere), but not like (e.g.) a player cynically diving to try and win a penalty. In other words, I suspect, if anything, you mostly pulled the wool over your own eyes, without really meaning to.
One reason this arises is that, unfortunately, the more I look into things, the more cause for concern I find. Moreover, these questionable-to-dubious analysis choices strongly tend in the direction of favouring the intervention. Maybe I see what I want to, but I can't think of many cases where the analysis was surprisingly incurious about a consideration which would likely result in the effect size being adjusted upwards, nor where a concern about accuracy and generalizability could be further allayed with an alternative statistical technique (one minor example of the latter: it looks like you coded Mid and Low therapy as categoricals when testing sensitivity to "therapyness"; if you ordered them, I expect you'd get a significant test for trend).
I'm sorry again for mistaking the output you were getting, but, respectfully, it still seems a bit sus. One should not have had a low index of suspicion for substantial heterogeneity, given how permissively you were including studies; although Q is not an oracular test statistic, p < 0.001 should be a prompt to look at this further (especially as you can examine how Q changes when you add in covariates, and the lack of great improvement when you do is a further signal); and presumably the very low R² values mentioned earlier would be another indicator.
Although meta-analysis as a whole is arduous, knocking up a forest and funnel plot to have a look (e.g. at whether one should indeed use random vs. fixed effects, given one argument for the latter is that they are less sensitive to small study effects) is much easier. I would have had no chance of doing any of this statistical assessment without all your work getting the data in the first place; with it, I got the (low-quality, but informative) plots in well under an hour, and doing what you've read above took a morning.
I had the luxury of not being on a deadline, but I'm afraid a remark like "I didn't feel like I had time to put everything in both CEAs, explain it, and finish both CEAs before 2021 ended (which we saw as important for continuing to exist)" inspires sympathy but not reassurance on objectivity. I would guess HLI would have seen not only the quality and timeliness of the CEAs as important to its continued existence, but also the substantive conclusions they reached: "We find the intervention we've discovered is X times better than cash transfers, and credibly better than GiveWell recs" seems much better in that regard than (e.g.) "We find the intervention we previously discovered and recommended now seems inferior to cash transfers, let alone GiveWell top charities, by the lights of our own further assessment".
Besides being less pleasant, speculating over intentions is much less informative than the actual work itself. I look forward to any further thoughts you have on whether I am on the right track re. correction for small study effects, and I hope future work will show this intervention is indeed as promising as your original analysis suggests.