Hi Gregory, I wanted to respond quickly on a few points. A longer response about what I see as the biggest issue (is our analysis overestimating the effects of psychotherapy and StrongMinds by as much as 2x?) may take a bit longer as I think about this and run some analyses as wifi permits (I’m currently climbing in Mexico).
This is really useful stuff, and I think I understand where you’re coming from.
I’d take this episode as a qualified defence of the ‘old fashioned way of doing things’.
FWIW, as I think I’ve expressed elsewhere, I went too far trying to build a newer, better wheel for this analysis, and we’ve been planning to do a traditional systematic review and meta-analysis of psychotherapy in LMICs since the fall.
It is also odd to have an extensive discussion of publication bias (up to and including one’s own attempt to make a rubric to correct for it) without doing the normal funnel plot +/- tests for small study effects.
I get it, and while I could do some more self-flagellation over my former hubris in pursuing this rubric, I’ll temporarily refrain and point out that small study effects were incorporated as a discount against psychotherapy—they just didn’t end up being very big.
Even if you didn’t look for it, metareg in R will confront you with heterogeneity estimates for all your models in its output (cf.). One should naturally expect curiosity (or alarm) on finding >90% heterogeneity, which I suspect stays at or above 90% even with the most expansive meta-regressions. Not only are these not reported in the write-up, but in the R outputs provided (e.g. here) these parts of the results have been cropped out.
But it doesn’t do that if you 1. aren’t using metareg or 2. are using multi-level models. Here’s the full output from the metafor::rma.mv() call I was hiding.
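(For concreteness, the call looks roughly like the sketch below. This is not our exact script, and the column names are hypothetical: yi is the effect size, vi its sampling variance, study the study ID, es_id a unique effect-size ID, and months the follow-up time.)

```r
library(metafor)

# Three-level model: multiple follow-up timepoints nested within studies.
mod <- rma.mv(yi, vi,
              mods   = ~ months,           # time-decay moderator
              random = ~ 1 | study/es_id,  # between-study and within-study variance
              data   = dat)
summary(mod)  # prints the sigma^2 components and the Q-test for residual heterogeneity
```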
It contains a Q test for heterogeneity, which flags statistically significant heterogeneity. What does this mean? I’ll quote from the text we’ve referenced.
Cochran’s Q increases both when the number of studies increases, and when the precision (i.e. the sample size of a study) increases.
Therefore, Q, and whether it is significant, highly depends on the size of your meta-analysis, and thus its statistical power. We should therefore not only rely on Q, and particularly the Q-test, when assessing between-study heterogeneity.
It also reports sigma^2, which should be equivalent to the tau^2 / tau statistic that “quantifies the variance of the true effect sizes underlying our data.” We can use it to create a 95% CI for the true effect of the intercept, which is:
> 0.58 - (1.96 * 0.3996) = −0.203216
> 0.58 + (1.96 * 0.3996) = 1.363216
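(The same interval can also be read off directly from metafor; a sketch, assuming the fitted object above is called mod. Note that predict() also folds in the uncertainty of the pooled estimate itself, which is why its interval comes out slightly wider than the hand calculation.)

```r
# 95% prediction interval from the multi-level model fitted above.
# pi.lb / pi.ub bound where the true effect of a new, comparable study should fall.
predict(mod, newmods = 0)  # moderator (months) set to 0, i.e. the intercept
```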
This is similar to what we find when we calculate the prediction intervals (-0.2692, 1.4225). Quoting the text again regarding prediction intervals:
Prediction intervals give us a range into which we can expect the effects of future studies to fall based on present evidence.
Say that our prediction interval lies completely on the “positive” side favoring the intervention. This means that, despite varying effects, the intervention is expected to be beneficial in the future across the contexts we studied. If the prediction interval includes zero, we can be less sure about this, although it should be noted that broad prediction intervals are quite common.
Commenting on the emphasized section, the key thing I’ve tried to keep in mind is “how does the psychotherapy evidence base / meta-analysis compare to the cash transfer evidence base / meta-analysis / CEA?”. So while the prediction interval for psychotherapy contains negative values, which is typically seen as a sign of high heterogeneity, the same was true in the cash transfers meta-analysis. So I’m not quite sure what to make of this, since what I’ve assumed is the relevant feature is the difference in heterogeneity between the two, whether in magnitude or in kind.
I guess a general point is that calculating and assessing heterogeneity is not straightforward, especially for multi-level models. Now, while one could argue we used multi-level models as part of our nefarious plan to pull the wool over folks’ eyes, that’s just not the case. It just seems like the appropriate way to account for the dependency introduced by including multiple timepoints per study, which seems necessary to avoid basing our estimates of how long the effects last on guesswork.
That something is up (i.e. huge heterogeneity, huge small study effects) with the data can be seen on the forest plot (and definitely in the funnel plot). It is odd to skip these figures and basic assessment before launching into a much more elaborate multi-level metaregression.
Understandable, but for a bit of context—we also didn’t get into the meta-analytic diagnostics in our CEA of cash transfers. While my co-authors and I did this work in the meta-analysis that CEA was based on, I didn’t feel like I had time to put everything in both CEAs, explain it, and finish both CEAs before 2021 ended (which we saw as important for continuing to exist), especially after wasting precious time on my quest to be clever (see the bias rubric in appendix C). Doing the full meta-analysis for cash transfers took up the better part of a year, and we couldn’t afford to do that again. So I thought that broadly mirroring the CEA I did for cash transfers was a way to “cut to the chase”. I saw the meta-analysis as a way to get an input to the CEA, and I was trying to do the 20% of the work that gets most of the value (a meta-analysis in ~3 months rather than a year). I’m not saying that this absolves me, but it’s certainly context for the tunnel vision.
Mentioning prior sensitivity analyses which didn’t make the cut for the write-up invites wondering what else got left in the file-drawer.
Fair point! This is an omission I hope to remedy in due course. In the meantime, I’ll try to respond with some more detailed comments about correcting for publication bias—which I expect is also not as straightforward as it may sound.
Hello Joel,

0) My bad re the rma.mv output, sorry. I’ve corrected the offending section. (I’ll return to some second-order matters later.)
1) I imagine climbing in Mexico is more pleasant than arguing statistical methods on the internet, so I’ve tried to save you at least some time on the latter by attempting to replicate your analysis myself.
This attempt was only partially successful: I took the ‘Lay or Group cleaner’ sheet and (per previous comments) flipped the signs where necessary so only Haushofer et al. shows a negative effect. Plugging this into R, I get basically identical results for the forest plot (RE mean 0.50 versus 0.51) and funnel plot (Egger’s limit value 0.2671 vs. 0.2670). I get broadly similar but discordant values for the univariate linear and exponential decay models, as well as model 1 in table 2 [henceforth ‘model 3’] (intercepts and coefficients roughly within a standard error of the write-up’s figures), and much more discordant values for the others in table 2.
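(For anyone wanting to follow along, these checks can be run with something like the sketch below, again with hypothetical column names; regtest() reports the ‘limit estimate’, i.e. the predicted effect as the standard error goes to zero.)

```r
library(metafor)

res <- rma(yi, vi, data = dat)  # simple random-effects model (REML by default)
forest(res)                     # forest plot of study effects and the pooled estimate
funnel(res)                     # funnel plot to eyeball small-study asymmetry
regtest(res, model = "lm")      # Egger's regression test, with the limit estimate
```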
I expect this ‘failure to fully replicate’ is mostly owed to a mix of: i) very small discrepancies between the datasets we are working from, which are likely to be amplified in more complex analyses than in simpler forest plots etc.; ii) the covariates, which I’d guess are much more discrepant, and for which there are more degrees of freedom in how they could be incorporated, so it is much more likely we aren’t doing exactly the same thing (e.g. ‘Layness’ in my sheet seems to be ordinal, with values of 0-3 depending on how well trained the provider was, whilst the table suggests it was coded as categorical (trained or not) in the original analysis). Hopefully it is ‘close enough’ for at least some indicative trends not to be operator error. In the spirit of qualified reassurance, here’s my funnel plot:
2) Per above, one of the things I wanted to check is whether you do indeed see large drops in effect size when you control for small studies/publication bias/etc. You can’t neatly merge (e.g.) Egger’s test into a meta-regression (at least, I can’t), but I can add study standard error as a moderator. Although there are many possible misgivings about doing this versus (e.g.) some transformation (although I expect working harder to linearize etc. would accentuate any effects), there are two benefits: i) it is extremely simple; ii) the intercept is then the value where SE = 0, and so gives an estimate of what a hypothetical maximally sized study would suggest.
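(A sketch of this kind of adjustment, in the PET style, using the same hypothetical columns as before; these are not the exact models from the write-up.)

```r
dat$sei <- sqrt(dat$vi)  # study standard error

adj <- rma.mv(yi, vi,
              mods   = ~ months + sei,     # original moderator(s) plus SE
              random = ~ 1 | study/es_id,
              data   = dat)
summary(adj)  # the intercept now estimates the effect at SE = 0 (and months = 0)
```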
Adding SE as a moderator reduces the intercept effect size by roughly half (model 1: 0.51 → 0.25; model 2: 0.42 → 0.23; model 3: 0.69 → 0.36). SE inclusion has ~no effect on the exponential model’s time-decay coefficient, but does seem to confound the linear decay coefficient (effect size down by a third, so no longer a significant predictor) and the single group-or-individual variable I thought I could helpfully look at (down by ~20%). I take this as suggesting there is significant confounding of the results by small study effects, and that a Bayesian best-guess correction is somewhere around a 50% discount.
3) As previously mentioned, if you plug this into the Guesstimate model you do not materially change the CEA (roughly 12x to 9x if you halve the effect sizes), but this is because this CEA will return StrongMinds as at least seven times better than cash transfers even if the effect sizes in the MRAs are set to zero. I did wonder how negative the estimate would have to be to change the analysis, but the gears in the Guesstimate include logs, so a lot of it errors if you feed in negative values. I fear, though, that if it were adapted, it would give absurd results (e.g. still recommending StrongMinds even if the MRAs found psychotherapy exacerbated depression more than serious adverse life events).
4) In the interest of an empty file-drawer: I also looked at ‘source’ to see whether cost survey studies gave higher effects due to the selection bias noted above. They did not: the effects were non-significantly numerically lower.
5) So it looks like the publication bias correction should be much higher than estimated in the write-up: more like 50% than 15%. I fear part of the reason for this discrepancy is that the approach taken in Table A.2 is likely methodologically and conceptually unsound. I’m not aware of a similar method in the literature, but it sounds like what you did is linearly (?meta)regress g on N for the MetaPsy dataset (at least, I get similar figures when I do this, although my coefficient is 10x larger). If so, this doesn’t make a lot of sense to me: SE is non-linear in N, the coefficient doesn’t limit appropriately (e.g. an infinitely large study has +inf or -inf effects depending on which side of zero the coefficient is), and you’re also extrapolating greatly out of sample for the correction between average study sizes. The largest study in MetaPsy is ~800 (I see two points on my plot above 650), but you are taking the difference of N values at ~630 and ~2700.
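(To illustrate the non-linearity point: for a two-arm trial split evenly, the standard error of a standardised mean difference scales roughly with 1/sqrt(N), so a straight line in N is a poor model for it. A rough sketch:)

```r
# Approximate SE of a standardised mean difference g for a trial with equal arms and total N.
se_smd <- function(g, N) sqrt(4 / N + g^2 / (2 * N))

N <- c(50, 100, 200, 400, 800, 1600, 3200)
round(se_smd(g = 0.5, N = N), 3)
# Each doubling of N only shrinks the SE by a factor of ~1/sqrt(2),
# so the relationship flattens out rather than continuing linearly towards zero.
```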
Even more importantly, it is very odd to use a third set of studies to make the estimate rather than the two literatures you are evaluating (given an objective is to compare the evidence bases, why not investigate them directly?). Treating them alike also assumes they share the same degree of small study effects, and are just at different points ‘along the line’ because one tends to have bigger studies than the other. It would seem reasonable to consider that the fields may differ in their susceptibility to publication bias and p-hacking, such that, controlling for N, cash transfer studies are less biased than psychotherapy ones. As we see from the respective funnel plots, this is clearly the case: the regression slope for psychotherapy is something like 10x as steep as the one for cash transfers.
(As a side note, MetaPsy lets you shove all of their studies into a funnel plot, which looks approximately as asymmetric as the one from the present analysis:)
6) Back to the meta stuff.
I don’t suspect either you or HLI of nefarious or deceptive behaviour (besides priors, this is strongly ruled against by publishing data that folks could analyse for themselves). But I do suspect partiality and imperfect intellectual honesty. By loose analogy, rather than a football referee who is (hopefully) unbiased but perhaps error prone, this is more like the manager of one of the teams claiming “obviously” their side got the rough end of the refereeing decisions (maybe more error prone in general, definitely more likely to make mistakes favouring one ‘side’, but plausibly/probably sincere), but not like (e.g.) a player cynically diving to try and win a penalty. In other words, I suspect—if anything—you mostly pulled the wool over your own eyes, without really meaning to.
One reason this arises is that, unfortunately, the more I look into things, the more cause for concern I find. Moreover, the direction of concern re. these questionable-to-dubious analysis choices strongly tends towards favouring the intervention. Maybe I see what I want to, but I can’t think of many cases where the analysis was surprisingly incurious about a consideration which would likely result in the effect size being adjusted upwards, nor where a concern about accuracy and generalizability could be further allayed with an alternative statistical technique (one minor example of the latter—it looks like you coded Mid and Low therapy as categoricals when testing sensitivity to therapyness: if you ordered them, I expect you’d get a significant test for trend).
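(For what it’s worth, a sketch of that test for trend, with hypothetical variable names: score the levels of therapy-ness as an ordered numeric variable so a single coefficient captures the monotone trend.)

```r
# Hypothetical coding: therapyness takes the values "Low", "Mid", "High".
dat$therapyness_score <- as.numeric(factor(dat$therapyness,
                                           levels = c("Low", "Mid", "High")))

trend <- rma.mv(yi, vi,
                mods   = ~ therapyness_score,
                random = ~ 1 | study/es_id,
                data   = dat)
summary(trend)  # the therapyness_score coefficient is the linear test for trend
```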
I’m sorry again for mistaking the output you were getting, but—respectfully—it still seems a bit sus. It is not as if one should have had a low index of suspicion for lots of heterogeneity given how permissively you were including studies; although Q is not an oracular test statistic, p < 0.001 should be a prompt to look at this further (especially as you can look at how Q changes when you add in covariates, and the lack of great improvement when you do is a further signal); and presumably the very low R^2 values mentioned earlier would be another indicator.
Although meta-analysis as a whole is arduous, knocking up a forest and funnel plot to have a look (e.g. at whether one should indeed use random vs. fixed effects, given one argument for the latter is that they are less sensitive to small study effects) is much easier: I would have had no chance of doing any of this statistical assessment without all your work getting the data in the first place, but with it, I got the (low-quality, but informative) plots in well under an hour, and doing what you’ve read above took a morning.
I had the luxury of not being on a deadline, but I’m afraid a remark like “I didn’t feel like I had time to put everything in both CEAs, explain it, and finish both CEAs before 2021 ended (which we saw as important for continuing to exist)” inspires sympathy but not reassurance about objectivity. I would guess HLI would have seen not only the quality and timeliness of the CEAs as important to its continued existence, but also the substantive conclusions they reached: “We find the intervention we’ve discovered is X times better than cash transfers, and credibly better than GiveWell recs” seems much better in that regard than (e.g.) “We find the intervention we previously discovered and recommended now seems inferior to cash transfers—let alone GiveWell top charities—by the lights of our own further assessment”.
Besides being less pleasant, speculating over intentions is much less informative than the actual work itself. I look forward to any further thoughts you have on whether I am on the right track re. correction for small study effects, and I hope future work will show this intervention is indeed as promising as your original analysis suggests.