Thanks. I’ve taken the liberty of quickly meta-analysing (rather, quickly plugging your spreadsheet into metamar). I have further questions.
1. My forest plot (ignoring repeated measures; more later) shows studies with effect sizes >0 (i.e. disfavouring intervention) and <-2 (i.e. greatly favouring intervention). Yet fig 1 (and subsequent figures) suggests the effect sizes of the included studies are between 0 and −2. Appendix B also says the same: what am I missing?
2. My understanding is it is an error to straightforwardly include multiple results from the same study (i.e. F/U at t1, t2, etc.) into meta-analysis (see Cochrane handbook here): naively, one would expect doing so would overweight these studies versus those which report outcomes only once. How did the analysis account for this?
3. Are the meta-regression results fixed or random effects? I’m pretty sure metareg in R does random effects by default, but it is intuitively surprising you would get the impact halved if one medium-sized study is excluded (Baranov et al. 2020). Perhaps what is going on is the overall calculated impact is much more sensitive to the regression coefficient for time decay than the pooled effect size, so the lone study with longer follow-up exerts a lot of weight dragging this upwards.
4. On the external validity point, it is notable that Baranov et al. was a study of pre-natal psychotherapy in Pakistan: it looks dubious that the results of this study would really double our estimates of effect persistence—particularly of, as I understand it, more general provision in sub-Saharan Africa. There seem facially credible reasons why the effects in this population could be persistent in a non-generalising way: e.g. that better maternal mental health post-partum means better economic decision making at a pivotal time (which then improves material circumstances thereafter).
In general inclusion seems overly permissive: by analogy, it is akin to doing a meta-analysis of the efficacy of aspirin on all cause mortality where you pool all of its indications, and are indifferent to whether it is mono-, primary or adjunct Tx. I grant efficacy findings in one subgroup are informative re. efficacy in another, but not so informative that results can be weighed equally versus studies performed in the subgroup of interest (ditto including studies which only partly or tangentially involve any form of psychotherapy: inclusion looks dubious given the degree to which outcomes can be attributed to the intervention of interest is uncertain). Typical meta-analyses have much more stringent criteria (cf. PICO), and for good reason.
5. You elect for exp decay over linear decay in part as the former model has a higher R2 than the latter. What were the R2s? By visual inspection I guess both figures are pretty low. Similarly, it would be useful to report these or similar statistics for all of the metaregressions reported: if the residual heterogeneity remains very high, this supplies caution to the analysis: effects vary a lot, and we do not have good explanations why.
6. A general challenge here is that meta-regression tends to be insensitive, and may struggle to ably disentangle between-study heterogeneity, especially when, as here, there’s a pile of plausible confounds owed to the permissive inclusion criteria (e.g. besides clinical subpopulation, what about location?). This is particularly pressing if the overall results are sensitive to strong assumptions made of the presumptive drivers of said heterogeneity, given the high potential for unaccounted-for confounders distorting the true effects.
7. The write-up notes one potential confounder to apparent time decay: better studies have more extensive followup, but perhaps better studies also report lesser effects. It is unfortunate small study effects were not assessed, as these appear substantial:
Note both the marked asymmetry (Egger’s P < 0.001), as well as a large number of intervention-favouring studies finding themselves in the P 0.01 to 0.05 band. Quantitative correction would be far from straightforward, but plausibly an integer divisor. It may also be worth controlling for this effect in the other metaregressions.
8. Given the analysis is atypical (re. inclusion, selection/search, analysis, etc.) ‘analysing as you go’ probably is not the best way of managing researcher degrees of freedom. Although it is perhaps a little too late to make a prior analysis plan, a multiverse analysis could be informative.
I regret my hunch is this would find the presented analysis is pretty out on the tail of ‘psychotherapy favouring results’: most other reasonable ways of slicing it lead to weaker or more uncertain conclusions.
The data we use is from the tab “Before 23.02.2022 Edits Data”. The “LayOrGroup Cleaner” is another tab that we used to do specific exploratory tests. So the selection of studies changes a bit.
1. We also clean the data in our code so the effects are set to positive in our analysis (i.e., all of the studies find reductions in depression/increases in wellbeing), except for Haushofer et al., which is the only study showing a decline in wellbeing.
2. We attempt to control for this problem by using a multi-level model (with random intercepts clustered at the level of the authors), but this type of meta-analysis is not super common.
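To illustrate the overweighting concern raised in point 2, here is a minimal sketch of how a study's share of the total inverse-variance weight grows when each of its follow-ups is entered as a separate row. The study labels and variances are hypothetical, and this is plain inverse-variance pooling, not the multi-level model described above.

```python
def weight_shares(rows):
    """rows: list of (study_id, variance) pairs.

    Returns each study's share of the total inverse-variance weight.
    """
    totals = {}
    for study, variance in rows:
        totals[study] = totals.get(study, 0.0) + 1.0 / variance
    grand_total = sum(totals.values())
    return {study: w / grand_total for study, w in totals.items()}


# Hypothetical studies with equal precision; study "A" reports three follow-ups.
once_each = [("A", 0.04), ("B", 0.04), ("C", 0.04)]
with_follow_ups = [("A", 0.04), ("A", 0.04), ("A", 0.04), ("B", 0.04), ("C", 0.04)]

shares_once = weight_shares(once_each)
shares_naive = weight_shares(with_follow_ups)

print(shares_once["A"], shares_naive["A"])
```

With one row per study, "A" carries a third of the weight; entered three times, it carries three fifths. A model that treats rows from the same study as dependent is one way to avoid this.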
3. We are using random effects. We are planning on exploring how best to set the model in our next analysis, and how using different models changes our analysis. Our aim is to do something more in the spirit of a multiverse analysis than our present analysis.
Perhaps what is going on is the overall calculated impact is much more sensitive to the regression coefficient for time decay than the pooled effect size, so the lone study with longer follow-up exerts a lot of weight dragging this upwards.
Yes, Baranov et al. has an especially strong effect on the time decay coefficient, not the pooled effect size. I’m less concerned this was a fluke as Bhat et al. (2021) has since been published, which also found very durable effects of lay delivered psychotherapy primarily delivered to women.
4. I think you raise some fair challenges regarding the permissiveness of inclusion. Ideally, we’d include many studies that are at least somewhat relevant, and then weight the study by its precision and relevance. But there isn’t a clear way to find out which characteristics of a study may drive the difference in its effect without including a wide evidence base and running a lot of moderating analyses. I think many meta-analyses throw the baby out with the bathwater because of the strictness of their PICOs, and miss answering some very important questions because of it, e.g. “how do the effects decay over time?”.
5. As we say in the report:
We prefer an exponential model because it fits our data better (it has a higher R^2) and it matches the pattern found in other studies of psychotherapy’s trajectory. (footnote: The only two studies we have found that have tracked the trajectory of psychotherapy with sufficient time granularity also find that the effects decay at a diminishing rate (Ali et al., 2017; Bastiaansen et al., 2020)).
So R^2 wasn’t the only reason, but yes it was very low. I agree that it would be a good idea to report more statistics including the residual heterogeneity in future reports.
6. I think this is fair, and that more robustness checks are warranted in the next version of the analysis.
7. We planned on quantitatively comparing the publication bias / small study effects between psychotherapy and cash transfers, as psychotherapy does appear to have more risk as you pointed out.
8. At the risk of sounding like a broken record, we plan on doing many more robustness checks in the flavor of a multiverse analysis when we update the analysis. If we find that our previous analyses appeared to have been unusually optimistic, we will adjust it until we think it’s sensible.
These are good points, and I think they make me realize we could have framed our analysis differently. I saw this meta-analysis as:
An attempt to push EA analyses away from using one or a couple of studies and towards using larger bodies of evidence.
To try to point out that how the effects change over time is an important parameter and we should try to estimate it.
A way to form a prior on the size and persistence of the effects of psychotherapy in low income countries.
That the quality of this analysis was an attempt to be more rigorous than most shallow EA analyses, but definitely less rigorous than a quality peer-reviewed academic paper.
I think this last point is not something we clearly communicated.
Thanks for the forest and funnel plots—much more accurate and informative than my own (although it seems the core upshots are unchanged)
I’ll return to the second order matters later in the show, but on the merits, surely the discovery of marked small study effects should call the results of this analysis (and subsequent recommendation of Strongminds) into doubt?
Specifically:
The marked small study effect is difficult to control for, but it seems my remark of an ‘integer division’ re. effect size is in the right ballpark. I would expect* (more later) real effects 2x-4x lower than thought could change the bottom lines.
Heterogeneity remains vast, but the small study effect is likely the best predictor of it versus time decay, intervention properties similar to strongminds, etc. It seems important to repeat the analysis controlling for small study effects, as overall impact calculation is much more sensitive to coefficient estimates which are plausibly confounded by this currently unaccounted for effect.
The discovery that the surveyed studies appear riven with publication bias and p-hacking should prompt further scepticism of outliers (like the SM-specific studies heavily relied upon).
Re. each in turn:
1. I think the typical ‘Cochrane-esque’ norms would say the pooled effects and metaregression results are essentially meaningless given profound heterogeneity and marked small study effects. From your other comments, I presume you more favour a ‘Bayesian Best Guess’ approach: rather than throwing up our hands if noise and bias loom large, we should do our best to correct for them and give the best estimate on the data.
In this spirit of statistical adventure, we could use Egger’s regression slope to infer the effect size the perfectly precise study would have (I agree with Briggs this is a dubious technique, but it seems one of the better available quantitative ‘best guesses’). Reading your funnel plot, the limit value is around 0.15, ~4x lower than the random effects estimate. Your output suggests it is higher (0.26), which I guess is owed to a multilevel model rather than the simpler one in the forest and funnel plots, but either way is ~2x lower than the previous ‘t=0’ intercept values.
These are substantial corrections, and probably should be made urgently to the published analysis (given donors may be relying upon it for donation decisions).
2. As it looks like ‘study size’ is the best predictor of heterogeneity so far discovered, there’s a natural fear that previous coefficient estimates for time decay and SM-intervention-like properties are confounded by it. So the overall correction to calculated impact could be greater than a flat 50-75% discount, if the less resilient coefficients ‘go the wrong way’ when this factor is controlled. I would speculate adding this in would give a further discount, albeit a (relatively) mild one: it is plausible that study size collides with time decay (so controlling results in somewhat greater persistence), but I would suspect the SM-trait coefficients go down markedly, so the MR including them would no longer give ~80% larger effects.
Perhaps the natural thing would be including study size/precision as a coefficient in the metaregressions (e.g. adding on to model 5), and using these coefficients (rather than the univariate analysis previously done for time decay) in the analysis (again, pace the health warnings Briggs would likely provide). Again, this seems a matter of some importance, given the material risk of upending the previously published analysis.
3. As perhaps goes without saying, seeing a lot of statistical evidence for publication bias and p-hacking in the literature probably should lead one to regard outliers with even greater suspicion—both because they are even greater outliers versus the (best guess) ‘real’ average effect, and because the prior analysis gives an adverse prior of what is really driving the impressive results.
It is worth noting that the strongminds recommendation is surprisingly insensitive to the MR results, despite these comprising the bulk of the analysis. With the guestimate as-is, SM removes roughly 12SDs (SD-years, I take it) of depression for 1k. When I set the effect sizes of the metaregressions to zero, the guestimate still spits out an estimate that SM removes 7.1SDs for 1k (so roughly 7x more effective than GiveDirectly). This suggests that the ~5 individual small studies are sufficient for the evaluation to give the nod to SM even if (e.g.) the metaanalysis found no impact of psychotherapy.
I take this to be diagnostic that the integration of information in the evaluation is not working as it should. Perhaps the Bayesian thing to do is to further discount these studies given they are increasingly discordant from the (corrected) metaregression results, and their apparently high risk of bias given the literature they emerge from. There should surely be some non-negative value of the meta-analysis effect size which reverses the recommendation.
#
Back to the second order stuff. I’d take this episode as a qualified defence of the ‘old fashioned way of doing things’. There are two benefits in aiming towards higher standards of rigour.
First, sometimes the conventions are valuable guard rails. Shortcuts may not just add expected noise, but add expected bias. Or, another way of looking at it, the evidential value of the work could be very concave with ‘study quality’.
These things can be subtle. One example I haven’t previously mentioned on inclusion was the sampling/extraction was incomplete. The first shortcut you took (i.e. culling references from prior meta-analyses) was a fair one—sure, there might be more data to find, but there’s not much reason to think this would introduce directional selection with effect size.
Unfortunately, the second source—references from your attempts to survey the literature on the cost of psychotherapy—we would expect to be biased towards positive effects: the typical study here is a cost-effectiveness assessment, and such assessment is only relevant if the intervention is effective in the first place (if no effect, the cost-effectiveness is zero by definition). Such studies would be expected to ~uniformly report significant positive effects, and thus including this source biases the sample used in the analysis. (And hey, maybe a meta-regression doesn’t find ‘from this source versus that one’ is a significant predictor, but if so I would attribute it more to the literature being so generally pathological rather than cost-effectiveness studies are unbiased samples of effectiveness simpliciter).
Second, following standard practice is a good way of demonstrating you have ‘nothing up your sleeve’: that you didn’t keep re-analysing until you found results you liked, or selectively reporting results to favour a pre-written bottom line. Although I appreciate this analysis was written before the Simeon’s critique, prior to this one may worry that HLI, given its organisational position on wellbeing etc. would really like to find an intervention that ‘beats’ orthodox recommendations, and this could act as a finger on the scale of their assessments. (cf. ACE’s various shortcomings back in the day)
It is unfortunate that this analysis is not so much ‘avoiding even the appearance of impropriety’ but ‘looking a bit sus’. My experience so far has been further investigation into something or other in the analysis typically reveals a shortcoming (and these shortcomings tend to point in the ‘favouring psychotherapy/SM’ direction).
To give some examples:
That something is up (i.e. huge heterogeneity, huge small study effects) with the data can be seen on the forest plot (and definitely in the funnel plot). It is odd to skip these figures and basic assessment before launching into a much more elaborate multi-level metaregression.
It is also odd to have an extensive discussion of publication bias (up to and including one’s own attempt to make a rubric to correct for it) without doing the normal funnel plot +/- tests for small study effects.
Even if you didn’t look for it, metareg in R will confront you with heterogeneity estimates for all your models in its output (cf.). One should naturally expect curiosity (or alarm) on finding >90% heterogeneity, which I suspect stays around or >90% even with the most expansive meta-regressions. Not only are these not reported in the write-up, but in the R outputs provided (e.g. here) these parts of the results have been cropped out. This was mistaken; mea maxima culpa.
Mentioning prior sensitivity analyses which didn’t make the cut for the write-up invites wondering what else got left in the file-drawer.
Hi Gregory, I wanted to respond quickly on a few points. A longer response about what I see as the biggest issue (is our analysis overestimating the effects of psychotherapy and StrongMinds by =< 2x??) may take a bit longer as I think about this and run some analyses as wifi permits (I’m currently climbing in Mexico).
This is really useful stuff, and I think I understand where you’re coming from.
I’d take this episode as a qualified defence of the ‘old fashioned way of doing things’.
FWIW, as I think I’ve expressed elsewhere, I think I went too far trying to build a newer better wheel for this analysis, and we’ve planned on doing a traditional systematic review and meta-analysis of psychotherapy in LMICs since the fall.
It is also odd to have an extensive discussion of publication bias (up to and including one’s own attempt to make a rubric to correct for it) without doing the normal funnel plot +/- tests for small study effects.
I get it, and while I could do some more self flagellation on behalf of my former hubris at pursuing this rubric, I’ll temporarily refrain and point out that small study effects were incorporated as a discount against psychotherapy—they just didn’t end up being very big.
Even if you didn’t look for it, metareg in R will confront you with heterogeneity estimates for all your models in its output (cf.). One should naturally expect curiosity (or alarm) on finding >90% heterogeneity, which I suspect stays around or >90% even with the most expansive meta-regressions. Not only are these not reported in the write-up, but in the R outputs provided (e.g. here) these parts of the results have been cropped out.
But it doesn’t do that if you 1. aren’t using metareg or 2. are using multi-level models. Here’s the full output from the metafor::rma.mv() call I was hiding.
It contains a Q test for heterogeneity, which flags statistically significant heterogeneity. What does this mean? I’ll quote from the text we’ve referenced.
Cochran’s Q increases both when the number of studies increases, and when the precision (i.e. the sample size of a study) increases.
Therefore, Q and whether it is significant highly depends on the size of your meta-analysis, and thus its statistical power. We should therefore not only rely on Q, and particularly the Q-test, when assessing between-study heterogeneity.
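The quoted point can be checked with a minimal sketch. The effect sizes and variances below are hypothetical (they are not taken from either analysis), and the pooling is a simple fixed-effect one rather than the multi-level model. The sketch also computes I^2, which rescales Q by its degrees of freedom.

```python
def cochran_q(effects, variances):
    """Return (Q, degrees of freedom, I^2 in percent) under fixed-effect pooling."""
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i2


# Hypothetical effect sizes and sampling variances, for illustration only.
effects = [0.10, 0.35, 0.60, 0.90, 1.40]
variances = [0.02, 0.03, 0.05, 0.08, 0.12]

q_small, df_small, i2_small = cochran_q(effects, variances)

# Same spread of effects, but four times as many studies.
q_big, df_big, i2_big = cochran_q(effects * 4, variances * 4)

# Same five studies, but each four times more precise.
q_precise, df_precise, i2_precise = cochran_q(effects, [v / 4 for v in variances])

print(q_small, q_big, q_precise)
print(i2_small, i2_big, i2_precise)
```

In this toy setup Q quadruples in both variations even though the spread of the point estimates is unchanged, which is why a significant Q-test on its own says little about how large the heterogeneity is.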
It also reports sigma^2 which should be equivalent to the tau^2 / tau statistic which “quantifies the variance of the true effect sizes underlying our data.” We can use it to create a 95% CI for the true effect of the intercept, which is:
> 0.58 - (1.96 * 0.3996) = −0.203216
> 0.58 + (1.96 * 0.3996) = 1.363216
This is similar to what we find when we calculate the prediction intervals (-0.2692, 1.4225). Quoting the text again regarding prediction intervals:
Prediction intervals give us a range into which we can expect the effects of future studies to fall based on present evidence.
Say that our prediction interval lies completely on the “positive” side favoring the intervention. This means that, despite varying effects, the intervention is expected to be beneficial in the future across the contexts we studied. If the prediction interval includes zero, we can be less sure about this, although it should be noted that broad prediction intervals are quite common.
Commenting on the emphasized section, the key thing I’ve tried to keep in mind “is how does the psychotherapy evidence base / meta-analysis compare to the cash transfer evidence base / meta-analysis / CEA?”. So while the prediction interval for psychotherapy contains negative values, which is typically seen as a sign of high heterogeneity, it also did so in the cash transfers meta-analysis. So I’m not quite sure what to make of the magnitude or qualitative difference in heterogeneity, which I’ve assumed is the relevant feature.
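For readers following the interval arithmetic earlier in this comment, here is a minimal sketch of the two calculations. It treats 0.3996 as the heterogeneity standard deviation, as that arithmetic does. The standard error of the pooled estimate (`se_mu`) is not quoted in this thread, so the value used below is purely hypothetical, and the sketch uses the normal quantile 1.96 where software may use a t-distribution.

```python
import math


def true_effect_range(mu, tau, z=1.96):
    """Range expected to cover ~95% of true effects, ignoring uncertainty in mu."""
    return mu - z * tau, mu + z * tau


def prediction_interval(mu, tau, se_mu, z=1.96):
    """Approximate 95% prediction interval: combines tau with the SE of mu."""
    half_width = z * math.sqrt(tau ** 2 + se_mu ** 2)
    return mu - half_width, mu + half_width


lo, hi = true_effect_range(0.58, 0.3996)

# se_mu=0.10 is a hypothetical value, chosen only to show the interval widening.
pi_lo, pi_hi = prediction_interval(0.58, 0.3996, se_mu=0.10)

print(lo, hi)
print(pi_lo, pi_hi)
```

The first function reproduces the hand calculation; the second is always at least as wide, because it adds the uncertainty in the pooled estimate itself.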
I guess a general point is that calculating and assessing heterogeneity is not straightforward, especially for multi-level models. Now, while one could argue we used multi-level models as part of our nefarious plan to pull the wool over folks’ eyes, that’s just not the case. It just seems like the appropriate way to account for the dependency introduced by including multiple timepoints in a study, which seems necessary to avoid basing our estimates of how long the effects last on guesswork.
That something is up (i.e. huge heterogeneity, huge small study effects) with the data can be seen on the forest plot (and definitely in the funnel plot). It is odd to skip these figures and basic assessment before launching into a much more elaborate multi-level metaregression.
Understandable, but for a bit of context: we also didn’t get into the meta-analytic diagnostics in our CEA of cash transfers. While my co-authors and I did this stuff in the meta-analysis the CEA was based on, I didn’t feel like I had time to put everything in both CEAs, explain it, and finish both CEAs before 2021 ended (which we saw as important for continuing to exist), especially after wasting precious time on my quest to be clever (see bias rubric in appendix C). Doing the full meta-analysis for cash transfers took up the better part of a year, and we couldn’t afford to do that again. So I thought that broadly mirroring the CEA I did for cash transfers was a way to “cut to the chase”. I saw the meta-analysis as a way to get an input to the CEA, and I was trying to do the 20% (with a meta-analysis in ~3 months rather than a year). I’m not saying that this absolves me, but it’s certainly context for the tunnel vision.
Mentioning prior sensitivity analyses which didn’t make the cut for the write-up invites wondering what else got left in the file-drawer.
Fair point! This is an omission I hope to remedy in due course. In the mean time, I’ll try and respond with some more detailed comments about correcting for publication bias—which I expect is also not as straightforward as it may sound.
0) My bad re rma.mv output, sorry. I’ve corrected the offending section. (I’ll return to some second order matters later).
1) I imagine climbing in Mexico is more pleasant than arguing statistical methods on the internet, so I’ve attempted to save you at least some time on the latter by attempting to replicate your analysis myself.
This attempt was only partially successful: I took the ‘Lay or Group cleaner’ sheet and (per previous comments) flipped the signs where necessary so only Haushofer et al. shows a negative effect. Plugging this into R means I get basically identical results for the forest plot (RE mean 0.50 versus 0.51) and funnel plot (Egger’s lim value 0.2671 vs. 0.2670). I get broadly similar but discordant values for the univariate linear and exp decay, as well as model 1 in table 2 [henceforth ‘model 3’] (intercepts and coefficients ~ within a standard error of the write-up’s figures), and much more discordant values for the others in table 2.
I expect this ‘failure to fully replicate’ is mostly owed to a mix of i) very small discrepancies between the datasets we are working off, which are likely to be amplified more in complex analyses than in simpler forest plots etc.; ii) I’d guess the covariates would be much more discrepant, and there are more degrees of freedom in how they could be incorporated, so it is much more likely we aren’t doing exactly the same thing (e.g. ‘Layness’ in my sheet seems to be ordinal, with values of 0-3 depending on how well trained the provider was, whilst the table suggests it was coded as categorical (trained or not) in the original analysis). Hopefully it is ‘close enough’ for at least some indicative trends not to be operator error. In the spirit of qualified reassurance here’s my funnel plot:
2) Per above, one of the things I wanted to check is whether indeed you see large drops in effect size when you control for small studies/publication bias/etc. You can’t neatly merge (e.g.) Egger into meta-regression (at least, I can’t), but I can add in study standard error as a moderator. Although there would be many misgivings of doing this vs. (e.g.) some transformation (although I expect working harder to linearize etc. would accentuate any effects), there are two benefits: i) extremely simple, ii) it also means the intercept value is where SE = 0, and so gives an estimate of what a hypothetical maximally sized study would suggest.
Adding in SE as a moderator reduces the intercept effect size by roughly half (model 1: 0.51 → 0.25; model 2: 0.42 → 0.23; model 3: 0.69 → 0.36). SE inclusion has ~no effect on the exponential model time decay coefficient, but does seem to confound the linear decay coefficient (effect size down by a third, so no longer a significant predictor) and the single group or individual variable I thought I could helpfully look at (down by ~20%). I take this as suggestive there is significant confounding of results by small study effects, and Bayesian best guess correction is somewhere around a 50% discount.
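As a rough illustration of the idea in point 2 (standard error as a moderator, with the intercept read as the effect at SE = 0), here is a minimal sketch using an inverse-variance weighted straight-line fit. The data are synthetic and built to have a known answer; this is not the multi-level model or the dataset discussed here, and a real meta-regression would also estimate between-study variance.

```python
def wls_line(x, y, weights):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / sw
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / sw
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Synthetic data: an underlying effect of 0.25 plus inflation proportional to SE.
std_errors = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40]
effects = [0.25 + 1.2 * se for se in std_errors]
weights = [1.0 / se ** 2 for se in std_errors]

# Pooled estimate that ignores the small-study trend.
naive_mean = sum(w * g for w, g in zip(weights, effects)) / sum(weights)

# Intercept = predicted effect for a hypothetical study with SE = 0.
intercept, slope = wls_line(std_errors, effects, weights)

print(naive_mean, intercept, slope)
```

On this constructed example the fit recovers the underlying 0.25 while the naive pooled mean sits above it; with real, noisy data the intercept is only a best guess and inherits all the caveats mentioned above.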
3) As previously mentioned, if you plug this into the guestimate you do not materially change the CEA (roughly 12x to 9x if you halve the effect sizes), but this is because this CEA will return strongminds at least seven times better than cash transfers even if the effect size in the MRAs are set to zero. I did wonder how negative the estimate would have to be to change the analysis, but the gears in the guestimate include logs so a lot of it errors if you feed in negative values. I fear though, if it were adapted, it would give absurd results (e.g. still recommending strongminds even if the MRAs found psychotherapy exacerbated depression more than serious adverse life events).
4) To have an empty file-drawer, I also looked at ‘source’ to see whether cost survey studies gave higher effects due to the selection bias noted above. No: non-significantly numerically lower.
5) So it looks like the publication bias is much higher than estimated in the write-up: more 50% than 15%. I fear part of the reason for this discrepancy is the approach taken in Table A.2 is likely methodologically and conceptually unsound. I’m not aware of a similar method in the literature, but it sounds like what you did is linearly (?meta)regress g on N for the metaPsy dataset (at least, I get similar figures when I do this, although my coefficient is 10x larger). If so, this doesn’t make a lot of sense to me—SE is non-linear in N, the coefficient doesn’t limit appropriately (e.g. an infinitely large study has +inf or -inf effects depending on which side of zero the coefficient is), and you’re also extrapolating greatly out of sample for the correction between average study sizes. The largest study in MetaPsy is ~800 (I see two points on my plot above 650), but you are taking the difference of N values at ~630 and ~2700.
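To make the non-linearity point concrete, here is a minimal sketch using the usual large-sample approximation for the standard error of a standardized mean difference with two equal arms. The effect size of 0.5 and the linear coefficients are made-up values for illustration, not figures from Table A.2.

```python
import math


def smd_standard_error(g, n_total):
    """Approximate SE of a standardized mean difference with two equal arms."""
    n_arm = n_total / 2
    variance = (2 / n_arm) + g ** 2 / (2 * n_total)
    return math.sqrt(variance)


sizes = [100, 630, 2700, 100000]
ses = {n: smd_standard_error(0.5, n) for n in sizes}

# A hypothetical linear rule g = a + b * N, with made-up coefficients.
a, b = 0.6, -0.0001
linear_prediction = {n: a + b * n for n in sizes}

for n in sizes:
    print(n, round(ses[n], 4), round(linear_prediction[n], 3))
```

SE shrinks with the square root of N and flattens out towards zero, whereas any straight line in N keeps going: with these made-up coefficients it predicts a strongly negative effect for a very large study, which is the "doesn't limit appropriately" problem.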
Even more importantly, it is very odd to use a third set of studies to make the estimate versus the two literatures you are evaluating (given an objective is to compare the evidence bases, why not investigate them directly?) Treating them alike also assumes they share the same degree of small study effects: they are just at different points ‘along the line’ because one tends to have bigger studies than the other. It would seem reasonable to consider that the fields may differ in their susceptibility to publication bias and p-hacking, so, controlling for N, cash transfer studies are less biased than psychotherapy ones. As we see from the respective forest plots, this is clearly the case: the regression slope for psychotherapy is like 10x or something as slope-y as the one for cash transfers.
(As a side note, MetaPsy lets you shove all of their studies into a forest plot, which looks approximately as asymmetric as the one from the present analysis:)
6) Back to the meta stuff.
I don’t suspect either you or HLI of nefarious or deceptive behaviour (besides priors, this is strongly ruled against by publishing data that folks could analyse for themselves). But I do suspect partiality and imperfect intellectual honesty. By loose analogy, rather than a football referee who is (hopefully) unbiased but perhaps error prone, this is more like the manager of one of the teams claiming “obviously” their side got the rough end of the refereeing decisions (maybe more error prone in general, definitely more likely to make mistakes favouring one ‘side’, but plausibly/probably sincere), but not like (e.g.) a player cynically diving to try and win a penalty. In other words, I suspect—if anything—you mostly pulled the wool over your own eyes, without really meaning to.
One reason this arises is, unfortunately, the more I look into things the more cause for concern I find. Moreover, the direction of concern re. these questionable-to-dubious analysis choices strongly tend in the direction of favouring the intervention. Maybe I see what I want to, but can’t think of many cases where the analysis was surprisingly incurious about a consideration which would likely result in the effect size being adjusted upwards, nor where a concern about accuracy and generalizability could be further allayed with an alternative statistical technique (one minor example of the latter—it looks like you coded Mid and Low therapy as categoricals when testing sensitivity to therapyness: if you ordered them I expect you’d get a significant test for trend).
I’m sorry again for mistaking the output you were getting, but, respectfully, it still seems a bit sus. It is not like one should have had a low index of suspicion for lots of heterogeneity given how permissively you were including studies; although Q is not an oracular test statistic, P<0.001 should be a prompt to look at this further (especially as you can look at how Q changes when you add in covariates, and lack of great improvement when you do is a further signal); and presumably the very low R2 values mentioned earlier would be another indicator.
Although meta-analysis as a whole is arduous, knocking up a forest and funnel plot to have a look (e.g. whether one should indeed use random vs. fixed effects, given one argument for the latter is they are less sensitive to small study effects) is much easier: I would have no chance of doing any of this statistical assessment without all your work getting the data in the first place; with it, I got the (low-quality, but informative) plots in well under an hour, and doing what you’ve read above took a morning.
I had the luxury of not being on a deadline, but I’m afraid a remark like “I didn’t feel like I had time to put everything in both CEAs, explain it, and finish both CEAs before 2021 ended (which we saw as important for continuing to exist)” inspires sympathy but not reassurance on objectivity. I would guess HLI would have seen not only the quality and timeliness of the CEAs as important to its continued existence, but also the substantive conclusions they made: “We find the intervention we’ve discovered is X times better than cash transfers, and credibly better than Givewell recs” seems much better in that regard than (e.g.) “We find the intervention we previously discovered and recommended, now seems inferior to cash transfers—leave alone Givewell top charities—by the lights of our own further assessment”.
Besides being less pleasant, speculating over intentions is much less informative than the actual work itself. I look forward to any further thoughts you have on whether I am on the right track re. correction for small study effects, and I hope future work will show this intervention is indeed as promising as your original analysis suggests.
Thanks. I’ve taken the liberty of quickly meta-analysing (rather, quickly plugging your spreadsheet into metamar). I have further questions.
1. My forest plot (ignoring repeated measures—more later) shows studies with effect sizes >0 (i.e. disfavouring intervention) and <-2 (i.e. greatly favouring intervention). Yet fig 1 (and subsequent figures) suggests the effect sizes of the included studies are between 0 and −2. Appendix B also says the same: what am I missing?
2. My understanding is it is an error to straightforwardly include multiple results from the same study (i.e. F/U at t1, t2, etc.) into meta-analysis (see Cochrane handbook here): naively, one would expect doing so would overweight these studies versus those which report outcomes only once. How did the analysis account for this?
3. Are the meta-regression results fixed or random effects? I’m pretty sure metareg in R does random effects by default, but it is intuitively surprising you would get the impact halved if one medium-sized study is excluded (Baranov et al. 2020). Perhaps what is going on is the overall calculated impact is much more sensitive to the regression coefficient for time decay than the pooled effect size, so the lone study with longer follow-up exerts a lot of weight dragging this upwards.
4. On the external validity point, it is notable that Baranov et al. was a study of pre-natal psychotherapy in Pakistan: it looks dubious that the results of this study would really double our estimates of effect persistence—particularly of, as I understand it, more general provision in sub-Saharan Africa. There seem facially credible reasons why the effects in this population could be persistent in a non-generalising way: e.g. that better maternal mental health post-partum means better economic decision making at a pivotal time (which then improves material circumstances thereafter).
In general inclusion seems overly permissive: by analogy, it is akin to doing a meta-analysis of the efficacy of aspirin on all cause mortality where you pool all of its indications, and are indifferent to whether it is mono-, primary or adjunct Tx. I grant efficacy findings in one subgroup are informative re. efficacy in another, but not so informative that results can be weighed equally versus studies performed in the subgroup of interest (ditto including studies which only partly or tangentially involve any form of psychotherapy—inclusion looks dubious given the degree to which outcomes can be attributed to the intervention of interest is uncertain). Typical meta-analyses have much more stringent criteria (cf. PICO), and for good reason.
5. You elect for exp decay over linear decay in part as the former model has a higher R2 than the latter. What were the R2s? By visual inspection I guess both figures are pretty low. Similarly, it would be useful to report these or similar statistics for all of the metaregressions reported: if the residual heterogeneity remains very high, this supplies caution to the analysis: effects vary a lot, and we do not have good explanations why.
6. A general challenge here is metaregression tends to be insensitive, and may struggle to ably disentangle between-study heterogeneity—especially when, as here, there’s a pile of plausible confounds owed to the permissive inclusion criteria (e.g. besides clinical subpopulation, what about location?). This is particularly pressing if the overall results are sensitive to strong assumptions made of the presumptive drivers of said heterogeneity, given the high potential for unaccounted-for confounders distorting the true effects.
7. The write-up notes one potential confounder to apparent time decay: better studies have more extensive followup, but perhaps better studies also report lesser effects. It is unfortunate small study effects were not assessed, as these appear substantial:
Note both the marked asymmetry (Egger’s P < 0.001), as well as a large number of intervention favouring studies finding themselves in the P 0.01 to 0.05 band. Quantitative correction would be far from straightforward, but plausibly an integer divisor. It may also be worth controlling for this effect in the other metaregressions.
8. Given the analysis is atypical (re. inclusion, selection/search, analysis, etc.) ‘analysing as you go’ probably is not the best way of managing researcher degrees of freedom. Although it is perhaps a little too late to make a prior analysis plan, a multiverse analysis could be informative.
I regret my hunch is this would find the presented analysis is pretty out on the tail of ‘psychotherapy favouring results’: most other reasonable ways of slicing it lead to weaker or more uncertain conclusions.
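For anyone wanting to check the asymmetry claim in point 7 themselves, here is a minimal sketch of Egger's regression test. It is in plain Python rather than R purely so it is self-contained, the data are made-up toy numbers (not the spreadsheet values), and `ols` and `egger` are hypothetical helper names. The idea is to regress each study's standardized effect (effect / SE) on its precision (1 / SE): an intercept far from zero signals funnel asymmetry, and the slope is the effect a perfectly precise study would be predicted to show.

```python
import math

def ols(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, se_a, se_b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    resid_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = resid_ss / (n - 2)  # residual variance
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1.0 / n + mx * mx / sxx))
    return a, b, se_a, se_b

def egger(effects, ses):
    """Egger's test: regress effect/SE on 1/SE and inspect the intercept."""
    z = [e / s for e, s in zip(effects, ses)]
    precision = [1.0 / s for s in ses]
    a, b, se_a, _ = ols(precision, z)
    return {"bias_intercept": a, "t": a / se_a, "limit_effect": b}

# Toy data (hypothetical): smaller studies (larger SE) report larger effects.
ses = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
noise = [0.02, -0.02, 0.02, -0.02, 0.02, -0.02, 0.02, -0.02]
effects = [0.2 + 1.5 * s + e for s, e in zip(ses, noise)]

result = egger(effects, ses)
print(result)
```

The t statistic would be compared against a t distribution with n - 2 degrees of freedom. A real analysis would more likely use an established package than hand-rolled least squares.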
Hi Gregory,
The data we use is from the tab “Before 23.02.2022 Edits Data”. The “LayOrGroup Cleaner” is another tab that we used to do specific exploratory tests. So the selection of studies changes a bit.
1. We also clean the data in our code so the effects are set to positive in our analysis (i.e., all of the studies find reductions in depression/increases in wellbeing). The exception is Haushofer et al., which shows the only decline in wellbeing.
2. We attempt to control for this problem by using a multi-level model (with random intercepts clustered at the level of the authors), but this type of meta-analysis is not super common.
3. We are using random effects. We are planning on exploring how best to set the model in our next analysis, and how using different models changes our analysis. Our aim is to do something more in the spirit of a multiverse analysis than our present analysis.
Yes, Baranov et al. has an especially strong effect on the time decay coefficient, not the pooled effect size. I’m less concerned this was a fluke as Bhat et al. (2021) has since been published, which also found very durable effects of lay delivered psychotherapy primarily delivered to women.
4. I think you raise some fair challenges regarding the permissiveness of inclusion. Ideally, we’d include many studies that are at least somewhat relevant, and then weight the study by its precision and relevance. But there isn’t a clear way to find out which characteristics of a study may drive the difference in its effect without including a wide evidence base and running a lot of moderating analyses. I think many meta-analyses throw the baby out with the bathwater because of the strictness of their PICOs, and miss answering some very important questions because of it, e.g. “how do the effects decay over time?”.
5. As we say in the report.
So R^2 wasn’t the only reason, but yes it was very low. I agree that it would be a good idea to report more statistics including the residual heterogeneity in future reports.
6. I think this is fair, and that more robustness checks are warranted in the next version of the analysis.
7. We planned on quantitatively comparing the publication bias / small study effects between psychotherapy and cash transfers, as psychotherapy does appear to have more risk as you pointed out.
8. At the risk of sounding like a broken record, we plan on doing many more robustness checks in the flavor of a multiverse analysis when we update the analysis. If we find that our previous analyses appeared to have been unusually optimistic, we will adjust it until we think it’s sensible.
These are good points, and I think they make me realize we could have framed our analysis differently. I saw this meta-analysis as:
An attempt to push EA analyses away from using one or a couple of studies and towards using larger bodies of evidence.
To try to point out that how the effects change over time is an important parameter and we should try to estimate it.
A way to form a prior on the size and persistence of the effects of psychotherapy in low income countries.
That the quality of this analysis was an attempt to be more rigorous than most shallow EA analyses, but definitely less rigorous than a quality peer-reviewed academic paper.
I think this last point is not something we clearly communicated.
Thanks for the forest and funnel plots—much more accurate and informative than my own (although it seems the core upshots are unchanged)
I’ll return to the second order matters later in the show, but on the merits, surely the discovery of marked small study effects should call the results of this analysis (and subsequent recommendation of Strongminds) into doubt?
Specifically:
The marked small study effect is difficult to control for, but it seems my remark of an ‘integer division’ re. effect size is in the right ballpark. I would expect* (more later) real effects 2x-4x lower than thought could change the bottom lines.
Heterogeneity remains vast, but the small study effect is likely the best predictor of it versus time decay, intervention properties similar to strongminds, etc. It seems important to repeat the analysis controlling for small study effects, as overall impact calculation is much more sensitive to coefficient estimates which are plausibly confounded by this currently unaccounted for effect.
Discovery the surveyed studies appear riven with publication bias and p hacking should provide further scepticism of outliers (like the SM-specific studies heavily relied upon).
Re. each in turn:
1. I think the typical ‘Cochrane-esque’ norms would say the pooled effects and metaregression results are essentially meaningless given profound heterogeneity and marked small study effects. From your other comments, I presume you more favour a ‘Bayesian Best Guess’ approach: rather than throwing up our hands if noise and bias loom large, we should do our best to correct for them and give the best estimate on the data.
In this spirit of statistical adventure, we could use the Egger’s regression slope to infer the effect size the perfectly precise study would have (I agree with Briggs this is dubious technique, but seems one of the better available quantitative ‘best guesses’). Reading your funnel plot, the limit value is around 0.15 ~ 4x lower than the random effects estimate. Your output suggests it is higher (0.26), which I guess is owed to a multilevel model rather than the simpler one in the forest and funnel plots, but either way is ~2x lower than the previous ‘t=0’ intercept values.
These are substantial corrections, and probably should be made urgently to the published analysis (given donors may be relying upon it for donation decisions).
2. As it looks like ‘study size’ is the best predictor of heterogeneity so far discovered, there’s a natural fear that previous coefficient estimates for time decay and SM-intervention-like properties are confounded by it. So the overall correction to calculated impact could be greater than a flat 50-75% discount, if the less resilient coefficients ‘go the wrong way’ when this factor is controlled. I would speculate adding this in would give a further discount, albeit a (relatively) mild one: it is plausible that study size collides with time decay (so controlling results in somewhat greater persistence), but I would suspect the SM-trait coefficients go down markedly, so the MR including them would no longer give ~80% larger effects.
Perhaps the natural thing would be including study size/precision as a coefficient in the metaregressions (e.g. adding on to model 5), and using these coefficients (rather than the univariate analysis previously done for time decay) in the analysis (again, pace the health warnings Briggs would likely provide). Again, this seems a matter of some importance, given the material risk of upending the previously published analysis.
3. As perhaps goes without saying, seeing a lot of statistical evidence for publication bias and p-hacking in the literature probably should lead one to regard outliers with even greater suspicion—both because they are even greater outliers versus the (best guess) ‘real’ average effect, and because the prior analysis gives an adverse prior of what is really driving the impressive results.
It is worth noting that the strongminds recommendation is surprisingly insensitive to the MR results, despite comprising the bulk of the analysis. With the guestimate as-is, SM removes roughly 12SDs (SD-years, I take it) of depression for 1k. When I set the effect sizes of the metaregressions to zero, the guestimate still spits out an estimate SM removes 7.1SDs for 1k (so roughly 7x more effective than GiveDirectly). This suggests that the ~5 individual small studies are sufficient for the evaluation to give the nod to SM even if (e.g.) the metaanalysis found no impact of psychotherapy.
I take this to be diagnostic the integration of information in evaluation is not working as it should. Perhaps the Bayesian thing to do is to further discount these studies given they are increasingly discordant from the (corrected) metaregression results, and their apparently high risk of bias given the literature they emerge from. There should surely be some non-negative value of the meta-analysis effect size which reverses the recommendation.
#
Back to the second order stuff. I’d take this episode as a qualified defence of the ‘old fashioned way of doing things’. There are two benefits in aiming towards higher standards of rigour.
First, sometimes the conventions are valuable guard rails. Shortcuts may not just add expected noise, but add expected bias. Or, another way of looking at it, the evidential value of the work could be very concave with ‘study quality’.
These things can be subtle. One example I haven’t previously mentioned on inclusion was the sampling/extraction was incomplete. The first shortcut you took (i.e. culling references from prior meta-analyses) was a fair one—sure, there might be more data to find, but there’s not much reason to think this would introduce directional selection with effect size.
Unfortunately, the second source—references from your attempts to survey the literature on the cost of psychotherapy—we would expect to be biased towards positive effects: the typical study here is a cost-effectiveness assessment, and such assessment is only relevant if the intervention is effective in the first place (if no effect, the cost-effectiveness is zero by definition). Such studies would be expected to ~uniformly report significant positive effects, and thus including this source biases the sample used in the analysis. (And hey, maybe a meta-regression doesn’t find ‘from this source versus that one’ is a significant predictor, but if so I would attribute it more to the literature being so generally pathological rather than cost-effectiveness studies are unbiased samples of effectiveness simpliciter).
Second, following standard practice is a good way of demonstrating you have ‘nothing up your sleeve’: that you didn’t keep re-analysing until you found results you liked, or selectively reporting results to favour a pre-written bottom line. Although I appreciate this analysis was written before Simeon’s critique, prior to this, one may worry that HLI, given its organisational position on wellbeing etc., would really like to find an intervention that ‘beats’ orthodox recommendations, and this could act as a finger on the scale of their assessments. (cf. ACE’s various shortcomings back in the day)
It is unfortunate that this analysis is not so much ‘avoiding even the appearance of impropriety’ but ‘looking a bit sus’. My experience so far has been further investigation into something or other in the analysis typically reveals a shortcoming (and these shortcomings tend to point in the ‘favouring psychotherapy/SM’ direction).
To give some examples:
That something is up (i.e. huge heterogeneity, huge small study effects) with the data can be seen on the forest plot (and definitely in the funnel plot). It is odd to skip these figures and basic assessment before launching into a much more elaborate multi-level metaregression.
It is also odd to have an extensive discussion of publication bias (up to and including one’s own attempt to make a rubric to correct for it) without doing the normal funnel plot +/- tests for small study effects.
Even if you didn’t look for it, metareg in R will confront you with heterogeneity estimates for all your models in its output (cf.). One should naturally expect curiosity (or alarm) on finding >90% heterogeneity, which I suspect stays around or >90% even with the most expansive meta-regressions. Not only are these not reported in the write-up, but in the R outputs provided (e.g. here) these parts of the results have been cropped out. This was mistaken; mea maxima culpa.
Mentioning prior sensitivity analyses which didn’t make the cut for the write-up invites wondering what else got left in the file-drawer.
Hi Gregory, I wanted to respond quickly on a few points. A longer response about what I see as the biggest issue (is our analysis overestimating the effects of psychotherapy and StrongMinds by =< 2x??) may take a bit longer as I think about this and run some analyses as wifi permits (I’m currently climbing in Mexico).
This is really useful stuff, and I think I understand where you’re coming from.
FWIW, as I think I’ve expressed elsewhere, I think I went too far trying to build a newer better wheel for this analysis, and we’ve planned on doing a traditional systematic review and meta-analysis of psychotherapy in LMICs since the fall.
I get it, and while I could do some more self flagellation on behalf of my former hubris at pursuing this rubric, I’ll temporarily refrain and point out that small study effects were incorporated as a discount against psychotherapy—they just didn’t end up being very big.
But it doesn’t do that if you 1. aren’t using metareg or 2. are using multi-level models. Here’s the full output from the metafor::rma.mv() call I was hiding.
It contains a Q test for heterogeneity, which flags statistically significant heterogeneity. What does this mean? I’ll quote from the text we’ve referenced.
It also reports sigma^2 which should be equivalent to the tau^2 / tau statistic which “quantifies the variance of the true effect sizes underlying our data.” We can use it to create a 95% CI for the true effect of the intercept, which is:
> 0.58 - (1.96 * 0.3996) = −0.203216
> 0.58 + (1.96 * 0.3996) = 1.363216
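As a quick check, the two lines of arithmetic above can be reproduced with the sketch below. It assumes, as the arithmetic itself does, that 0.3996 is the standard deviation of the true effects and 0.58 is the intercept; `approx_interval` is a hypothetical helper name, not a metafor function.

```python
def approx_interval(estimate, sd, z=1.96):
    """Return (low, high) = estimate -/+ z * sd.

    Assumption: `sd` is the SD of true effects (tau), not its square.
    """
    half_width = z * sd
    return estimate - half_width, estimate + half_width

low, high = approx_interval(0.58, 0.3996)
print(round(low, 6), round(high, 6))
```

Note this uses only the between-study spread. A prediction interval also folds in the uncertainty of the intercept estimate itself, which is probably part of why it comes out a little wider.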
This is similar to what we find when we calculate the prediction intervals (-0.2692, 1.4225). Quoting the text again regarding prediction intervals:
Commenting on the emphasized section, the key thing I’ve tried to keep in mind “is how does the psychotherapy evidence base / meta-analysis compare to the cash transfer evidence base / meta-analysis / CEA?”. So while the prediction interval for psychotherapy contains negative values, which is typically seen as a sign of high heterogeneity, it also did so in the cash transfers meta-analysis. So I’m not quite sure what to make of the magnitude or qualitative difference in heterogeneity, which I’ve assumed is the relevant feature.
I guess a general point is that calculating and assessing heterogeneity is not straightforward, especially for multi-level models. Now, while one could argue we used multi-level models as part of our nefarious plan to pull the wool over folks’ eyes, that’s just not the case. It just seems like the appropriate way to account for the dependency introduced by including multiple timepoints in a study, which seems necessary to avoid basing our estimates of how long the effects last on guesswork.
Understandable, but for a bit of context—we also didn’t get into the meta-analytic diagnostics in our CEA of cash transfers. While my co-authors and I did this stuff in the meta-analysis the CEA was based on, I didn’t feel like I had time to put everything in both CEAs, explain it, and finish both CEAs before 2021 ended (which we saw as important for continuing to exist) -- especially after wasting precious time on my quest to be clever (see bias rubric in appendix C). Doing the full meta-analysis for cash transfers took up the better part of a year, and we couldn’t afford to do that again. So I thought that broadly mirroring the CEA I did for cash transfers was a way to “cut to the chase”. I saw the meta-analysis as a way to get an input to the CEA, and I was trying to do the 20% (with a meta-analysis in ~3 months rather than a year). I’m not saying that this absolves me, but it’s certainly context for the tunnel vision.
Fair point! This is an omission I hope to remedy in due course. In the meantime, I’ll try and respond with some more detailed comments about correcting for publication bias—which I expect is also not as straightforward as it may sound.
Hello Joel,
0) My bad re rma.mv output, sorry. I’ve corrected the offending section. (I’ll return to some second order matters later).
1) I imagine climbing in Mexico is more pleasant than arguing statistical methods on the internet, so I’ve attempted to save you at least some time on the latter by attempting to replicate your analysis myself.
This attempt was only partially successful: I took the ‘Lay or Group cleaner’ sheet and (per previous comments) flipped the signs where necessary so only Haushofer et al. shows a negative effect. Plugging this into R means I get basically identical results for the forest plot (RE mean 0.50 versus 0.51) and funnel plot (Egger’s lim value 0.2671 vs. 0.2670). I get broadly similar but discordant values for the univariate linear and exp decay, as well as model 1 in table 2 [henceforth ‘model 3’] (intercepts and coefficients ~ within a standard error of the write-up’s figures), and much more discordant values for the others in table 2.
I expect this ‘failure to fully replicate’ is mostly owed to a mix of i) very small discrepancies between the datasets we are working off are likely to be amplified in more complex analysis than simpler forest plots etc. ii) I’d guess the covariates would be much more discrepant, and there are more degrees of freedom in how they could be incorporated, so it is much more likely we aren’t doing exactly the same thing (e.g. ‘Layness’ in my sheet seems to be ordinal—values of 0-3 depending on how well trained the provider was, whilst the table suggests it was coded as categorical (trained or not) in the original analysis). Hopefully it is ‘close enough’ for at least some indicative trends not to be operator error. In the spirit of qualified reassurance here’s my funnel plot:
2) Per above, one of the things I wanted to check is whether indeed you see large drops in effect size when you control for small studies/publication bias/etc. You can’t neatly merge (e.g.) Egger into meta-regression (at least, I can’t), but I can add in study standard error as a moderator. Although there would be many misgivings of doing this vs. (e.g.) some transformation (although I expect working harder to linearize etc. would accentuate any effects), there are two benefits: i) extremely simple, ii) it also means the intercept value is where SE = 0, and so gives an estimate of what a hypothetical maximally sized study would suggest.
Adding in SE as a moderator reduces the intercept effect size by roughly half (model 1: 0.51 → 0.25; model 2: 0.42 → 0.23; model 3: 0.69 → 0.36). SE inclusion has ~no effect on the exponential model time decay coefficient, but does seem to confound the linear decay coefficient (effect size down by a third, so no longer a significant predictor) and the single group or individual variable I thought I could helpfully look at (down by ~20%). I take this as suggestive there is significant confounding of results by small study effects, and Bayesian best guess correction is somewhere around a 50% discount.
3) As previously mentioned, if you plug this into the guestimate you do not materially change the CEA (roughly 12x to 9x if you halve the effect sizes), but this is because this CEA will return strongminds at least seven times better than cash transfers even if the effect sizes in the MRAs are set to zero. I did wonder how negative the estimate would have to be to change the analysis, but the gears in the guestimate include logs so a lot of it errors if you feed in negative values. I fear though, if it were adapted, it would give absurd results (e.g. still recommending strongminds even if the MRAs found psychotherapy exacerbated depression more than serious adverse life events).
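The 'SE as a moderator' step in 2) can be sketched roughly as follows. This is a simplified inverse-variance weighted regression on made-up, noiseless toy data, not the multi-level random-effects models discussed above, and `weighted_mean` and `wls_line` are hypothetical helper names. It only illustrates why the intercept can be read as the predicted effect at SE = 0, and how that can sit well below the ordinary pooled mean when small studies report larger effects.

```python
def weighted_mean(values, weights):
    """Inverse-variance style weighted mean."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def wls_line(x, y, weights):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    mx = weighted_mean(x, weights)
    my = weighted_mean(y, weights)
    sxx = sum(w * (xi - mx) ** 2 for xi, w in zip(x, weights))
    sxy = sum(w * (xi - mx) * (yi - my) for xi, yi, w in zip(x, y, weights))
    slope = sxy / sxx
    return my - slope * mx, slope

# Toy data (hypothetical): effect grows linearly with the standard error.
ses = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
effects = [0.25 + 1.0 * se for se in ses]
weights = [1.0 / se ** 2 for se in ses]  # fixed-effect weights, an assumption

pooled = weighted_mean(effects, weights)  # no moderator
intercept_at_zero_se, se_slope = wls_line(ses, effects, weights)

print("pooled mean:", round(pooled, 3))
print("intercept at SE = 0:", round(intercept_at_zero_se, 3))
```

Because the toy data are exactly linear in SE, the intercept recovers the built-in value of 0.25 while the pooled mean is pulled upward by the smaller studies. Real data would be noisy, and the linearity in SE is itself an assumption.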
4) To have an empty file-drawer, I also looked at ‘source’ to see whether cost survey studies gave higher effects due to the selection bias noted above. No: non-significantly numerically lower.
5) So it looks like the publication bias is much higher than estimated in the write-up: more 50% than 15%. I fear part of the reason for this discrepancy is the approach taken in Table A.2 is likely methodologically and conceptually unsound. I’m not aware of a similar method in the literature, but it sounds like what you did is linearly (?meta)regress g on N for the metaPsy dataset (at least, I get similar figures when I do this, although my coefficient is 10x larger). If so, this doesn’t make a lot of sense to me—SE is non-linear in N, the coefficient doesn’t limit appropriately (e.g. an infinitely large study has +inf or -inf effects depending on which side of zero the coefficient is), and you’re also extrapolating greatly out of sample for the correction between average study sizes. The largest study in MetaPsy is ~800 (I see two points on my plot above 650), but you are taking the difference of N values at ~630 and ~2700.
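To illustrate the two objections about regressing on N, here is a small sketch. It uses the standard large-sample approximation for the SE of a standardized mean difference with two equal arms, and the regression coefficients are invented purely for illustration (they are not taken from Table A.2 or from any fitted model). The function names are hypothetical.

```python
import math

def smd_standard_error(n_total, d=0.0):
    """Approximate SE of a standardized mean difference d,
    assuming two equal arms of n_total / 2 participants each."""
    return math.sqrt(4.0 / n_total + d * d / (2.0 * n_total))

def predicted_linear_in_n(n_total, intercept, slope):
    """Effect modelled as linear in sample size (unbounded as N grows)."""
    return intercept + slope * n_total

def predicted_linear_in_se(n_total, intercept, slope):
    """Effect modelled as linear in SE (tends to the intercept as N grows)."""
    return intercept + slope * smd_standard_error(n_total)

# SE shrinks with the square root of N, so it is far from linear in N.
for n in (630, 800, 2700):
    print(n, round(smd_standard_error(n), 4))

# Hypothetical coefficients, chosen only to show the limiting behaviour.
huge = 10 ** 6
print(predicted_linear_in_n(huge, intercept=0.5, slope=-1e-4))
print(predicted_linear_in_se(huge, intercept=0.2, slope=1.5))
```

Going from N of about 630 to about 2700 roughly halves the SE rather than cutting it by a factor of four, and the linear-in-N prediction runs off to an absurd value for very large studies while the linear-in-SE prediction settles near its intercept.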
Even more importantly, it is very odd to use a third set of studies to make the estimate versus the two literatures you are evaluating (given an objective is to compare the evidence bases, why not investigate them directly?) Treating them alike also assumes they share the same degree of small study effects—they are just at different points ‘along the line’ because one tends to have bigger studies than the other. It would seem reasonable to consider that the fields may differ in their susceptibility to publication bias and p-hacking, so—controlling for N—cash transfer studies are less biased than psychotherapy ones. As we see from the respective forest plots, this is clearly the case—the regression slope for psychotherapy is like 10x or something as slope-y as the one for cash transfers.
(As a side note, MetaPsy lets you shove all of their studies into a forest plot, which looks approximately as asymmetric as the one from the present analysis:)
6) Back to the meta stuff.
I don’t suspect either you or HLI of nefarious or deceptive behaviour (besides priors, this is strongly ruled against by publishing data that folks could analyse for themselves). But I do suspect partiality and imperfect intellectual honesty. By loose analogy, rather than a football referee who is (hopefully) unbiased but perhaps error prone, this is more like the manager of one of the teams claiming “obviously” their side got the rough end of the refereeing decisions (maybe more error prone in general, definitely more likely to make mistakes favouring one ‘side’, but plausibly/probably sincere), but not like (e.g.) a player cynically diving to try and win a penalty. In other words, I suspect—if anything—you mostly pulled the wool over your own eyes, without really meaning to.
One reason this arises is, unfortunately, the more I look into things the more cause for concern I find. Moreover, the direction of concern re. these questionable-to-dubious analysis choices strongly tends in the direction of favouring the intervention. Maybe I see what I want to, but I can’t think of many cases where the analysis was surprisingly incurious about a consideration which would likely result in the effect size being adjusted upwards, nor where a concern about accuracy and generalizability could be further allayed with an alternative statistical technique (one minor example of the latter—it looks like you coded Mid and Low therapy as categoricals when testing sensitivity to therapyness: if you ordered them I expect you’d get a significant test for trend).
I’m sorry again for mistaking the output you were getting, but—respectfully—it still seems a bit sus. It is not like one should have had a low index of suspicion for lots of heterogeneity given how permissively you were including studies; although Q is not an oracular test statistic, P<0.001 should be a prompt to look at this further (especially as you can look at how Q changes when you add in covariates, and lack of great improvement when you do is a further signal); and presumably the very low R2 values mentioned earlier would be another indicator.
Although meta-analysis as a whole is arduous, knocking up a forest and funnel plot to have a look (e.g. whether one should indeed use random vs. fixed effects, given one argument for the latter is they are less sensitive to small study effects) is much easier: I would have no chance of doing any of this statistical assessment without all your work getting the data in the first place; with it, I got the (low-quality, but informative) plots in well under an hour, and doing what you’ve read above took a morning.
I had the luxury of not being on a deadline, but I’m afraid a remark like “I didn’t feel like I had time to put everything in both CEAs, explain it, and finish both CEAs before 2021 ended (which we saw as important for continuing to exist)” inspires sympathy but not reassurance on objectivity. I would guess HLI would have seen not only the quality and timeliness of the CEAs as important to its continued existence, but also the substantive conclusions they made: “We find the intervention we’ve discovered is X times better than cash transfers, and credibly better than Givewell recs” seems much better in that regard than (e.g.) “We find the intervention we previously discovered and recommended, now seems inferior to cash transfers—leave alone Givewell top charities—by the lights of our own further assessment”.
Besides being less pleasant, speculating over intentions is much less informative than the actual work itself. I look forward to any further thoughts you have on whether I am on the right track re. correction for small study effects, and I hope future work will show this intervention is indeed as promising as your original analysis suggests.