You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this. Your prior here is that there is a 99%+ chance that StrongMinds will work better than GiveDirectly before looking at any actual StrongMinds results, which is a wildly implausible claim.
You also state, “If one holds that the evidence for something as well-studied as psychotherapy is too weak to justify any recommendations, charity evaluators could recommend very little.” Nothing in Gregory’s post suggests that he thinks anything like this: his own meta-analysis, which doesn’t remove outliers without good cause, gives a g of ~0.5, and a g of ~0.5 suggests that individuals suffering from depression would likely benefit greatly from seeking therapy. There is a massive difference between claiming that “the evidence behind psychotherapy is too weak to justify any recommendations” and claiming that “this particular form of therapy is not vastly better than GiveDirectly with a probability higher than 99% before even looking at RCT results”. Trying to throw out Gregory’s claims here over a seemingly false statement about his beliefs seems pretty offensive to me.
[Disclaimer: I worked at HLI until March 2023. I now work at the International Alliance of Mental Health Research Funders]
Gregory says
these problems are sufficiently major I think potential donors are ill-advised to follow the recommendations and analysis in this report.
That is a strong claim to make and it requires him to present a convincing case that GiveDirectly is more cost-effective than StrongMinds. I’ve found his previous methodological critiques to be constructive and well-explained. To their credit, HLI has incorporated many of them in the updated analysis. However, in my opinion, the critiques he presents here do not make a convincing case.
Taking his summary points in turn...
1. The literature on PT in LMICs is a complete mess. Insofar as more sense can be made from it, the most important factors appear to belong to the studies investigating it (e.g. their size) rather than qualities of the PT interventions themselves.
I think this is much too strong. The three meta-analyses (and Gregory’s own calculations) give me confidence that psychotherapy in LMICs is effective, although the effects are likely to be small.
2. Trying to correct the results of a compromised literature is known to be a nightmare. Here, the qualitative evidence for publication bias is compelling. But quantifying what particular value of ‘a lot?’ the correction should be is fraught: numerically, methods here disagree with one another dramatically, and prove highly sensitive to choices on data exclusion.
There is no consensus on the appropriate methodology for adjusting publication bias. I don’t have an informed opinion on this, but HLI’s approach seems reasonable to me and I think it’s reasonable for Greg to take a different view. From my limited understanding, neither approach makes GiveDirectly more cost-effective.
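For intuition on why such corrections can disagree, here is a toy simulation (purely illustrative: not HLI’s or Gregory’s actual method, and the selection rule and all numbers are invented) in which a naive pooled estimate, PET, and PEESE are applied to the same selectively published literature:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invent a literature: true mean effect 0.2, between-study sd 0.3,
# and a crude selection rule that favours 'significant' results.
k, true_effect, tau = 200, 0.2, 0.3
se = rng.uniform(0.05, 0.5, k)            # study standard errors
theta = rng.normal(true_effect, tau, k)   # each study's true effect
est = rng.normal(theta, se)               # observed effect sizes
published = (est / se > 1.64) | (rng.random(k) < 0.3)
est, se = est[published], se[published]

w = 1 / se**2                             # inverse-variance weights
naive = np.sum(w * est) / np.sum(w)       # pooled estimate, no correction

def corrected_intercept(predictor):
    """Weighted regression of effect size on a predictor; the intercept is the 'corrected' effect."""
    X = np.column_stack([np.ones_like(se), predictor])
    beta, *_ = np.linalg.lstsq(X * np.sqrt(w)[:, None], est * np.sqrt(w), rcond=None)
    return beta[0]

pet = corrected_intercept(se)       # PET: regress effect size on the standard error
peese = corrected_intercept(se**2)  # PEESE: regress effect size on the squared standard error

print(f"true effect {true_effect:.2f} | naive {naive:.2f} | PET {pet:.2f} | PEESE {peese:.2f}")
```

The point is only that different defensible corrections, run on the same data, can land in quite different places, which is why the choice of method and of data exclusions matters so much here.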
3. Regardless of how PT looks in general, StrongMinds, in particular, is looking less and less promising. Although initial studies looked good, they had various methodological weaknesses, and a forthcoming RCT with much higher methodological quality is expected to deliver disappointing results.
We don’t have any new data on StrongMinds, so I’m confused why Greg thinks it’s “less and less promising”. HLI’s Bayesian approach is a big improvement on the subjective weightings they used in the first cost-effectiveness analysis. As with publication bias, it’s reasonable to hold different views on how to construct the prior, but personally, I do believe that any psychotherapy intervention in LMICs, so long as cost per patient is <$100, is a ~certain bet to beat cash transfers. No specific model of psychotherapy consistently outperforms the others, so I don’t find it surprising that training people to talk to other people about their problems is a more cost-effective way to improve wellbeing in LMICs than cash transfers: cash transfers are much more expensive per person, and their effects on subjective wellbeing are also small.
4. The evidential trajectory here is all too common, and the outlook typically bleak. It is dubious StrongMinds is a good pick even among psychotherapy interventions (picking one at random which doesn’t have a likely-bad-news RCT imminent seems a better bet). Although pricing different interventions is hard, it is even more dubious SM is close to the frontier of “very well evidenced” vs. “has very promising results” plotted out by things like AMF, GD, etc. HLI’s choice to nonetheless recommend SM again this giving season is very surprising. I doubt it will weather hindsight well.
HLI had to start somewhere and I think we should give credit to StrongMinds for being brave enough to open themselves up to the scrutiny they’ve faced. The three meta-analyses and the tentative analysis of Friendship Bench suggest there is ‘altruistic gold’ to be found here and HLI has only just started to dig. The field is growing quickly and I’m optimistic about the trajectories of CE-incubated charities like Vida Plena and Kaya Guides.
In the meantime, although the gap between GiveDirectly and StrongMinds has clearly narrowed, I remain unconvinced that cash is clearly the better option (but I do remain open-minded and open to pushback).
You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this.
A specific therapy treatment is drawn from the distribution of therapy treatments. Our best-guess distribution for the value of a specific therapy treatment, before we know anything else about it, should reflect only the fact that it comes from this distribution of therapy treatments. So I don’t see what’s unreasonable about this.
When running a meta-analysis, you can use either a fixed effect assumption (that all variation between studies is just due to sampling error) or a random effects assumption (that studies differ in their “true effects”). Therapy treatments differ greatly, so you have to use a random effects model in this case. The prior you use for StrongMinds’ impact should then have a variance that is the sum of the variance in the estimate of the average therapy treatment effect AND the variance among different treatments’ effects; both numbers should be available from a random effects meta-analysis. I’m not quite sure what exactly HLI did to get their prior for StrongMinds here, but for some reason the variance on it seems WAY too low, and I suspect that they neglected the second type of variance that they should have gotten from a random effects meta-analysis.
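A minimal sketch of the arithmetic I have in mind, with made-up numbers (these are not HLI’s figures): the prior for one specific, as-yet-unstudied intervention should combine the uncertainty about the average effect with the spread of true effects across interventions.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical random-effects meta-analysis outputs (illustrative only):
mu_hat = 0.6    # estimated average effect across therapy interventions
se_mu = 0.05    # standard error of that average
tau = 0.45      # estimated sd of true effects across interventions (heterogeneity)

sd_average = se_mu                          # prior sd for the AVERAGE intervention
sd_specific = np.sqrt(se_mu**2 + tau**2)    # prior sd for ONE specific intervention

print(f"prior sd for the average intervention: {sd_average:.2f}")
print(f"prior sd for a specific intervention:  {sd_specific:.2f}")
print(f"P(effect < 0) under the 'average' prior:  {norm.cdf(0, mu_hat, sd_average):.3f}")
print(f"P(effect < 0) under the 'specific' prior: {norm.cdf(0, mu_hat, sd_specific):.3f}")
```

With numbers like these, the chance that a given programme is ineffective looks negligible under the first prior but is closer to one in ten under the second, which is the gap I’m pointing at.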
Section 2.2.2 of their report is titled “Choosing a fixed or random effects model”. They discuss the points you make and clearly say that they use a random effects model. In section 2.2.3 they discuss the standard measures of heterogeneity they use. Section 2.2.4 discusses the specific 4-level random effects model they use and how they did model selection.
I reviewed a small section of the report prior to publication, but not these sections, and it only took me five minutes now to check what they did. I’d like the EA Forum to have a higher bar (as Gregory’s parent comment exemplifies) before throwing around easily checkable suspicions about what (very basic) mistakes might have been made.
Yes, some of Greg’s examples point to the variance being underestimated, but the problem does not inherently come from the idea of using the distribution of effects as the prior, since that should include both the sampling uncertainty and the true heterogeneity. That would be the appropriate approach even under a random effects model (I think; I’m more used to thinking in terms of Bayesian hierarchical models, and the equivalence might not hold).
(@Burner1989 @David Rhys Bernard @Karthik Tadepalli)
I think the fundamental point (i.e. “You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this.”) is on the right lines, although the subsequent discussion of fixed/random effects models might confuse the issue. (Cf. my reply to Jason.)
The typical output of a meta-analysis is an (~) average effect size estimate (the diamond at the bottom of the forest plot, etc.). The confidence interval given for that is (very roughly)[1] the interval in which we predict the true average effect likely lies. So for the basic model given in Section 4 of the report, the average effect size is 0.64, 95% CI (0.54 to 0.74). So (again, roughly) our best guess of the ‘true’ average effect size of psychotherapy in LMICs from our data is 0.64, and we’re 95% sure(*) this average is somewhere between 0.54 and 0.74.
Clearly, it is not the case that if we draw another study from the same population, we should be 95% confident(*) the effect size of this new data point will lie between 0.54 and 0.74. This would not be true even in the unicorn case where there is no between-study heterogeneity (e.g. all the studies are measuring the same effect modulo sampling variance), and even less so when heterogeneity is as marked as it is here. To answer that question, what you want is a prediction interval.[2] This interval is always wider, and almost always significantly so, than the confidence interval for the average effect: in the same analysis with the 0.54 to 0.74 confidence interval, the prediction interval was −0.27 to 1.55.
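As a rough back-of-the-envelope (a normal approximation; the exact prediction interval uses a t distribution, so treat these as ballpark figures), the two quoted intervals imply the between-study spread dwarfs the uncertainty about the average:

```python
import numpy as np

z = 1.96
ci_lo, ci_hi = 0.54, 0.74      # 95% CI for the average effect (report, Section 4)
pi_lo, pi_hi = -0.27, 1.55     # 95% prediction interval for a new study

se_mean = (ci_hi - ci_lo) / (2 * z)                          # ~0.05
tau = np.sqrt(((pi_hi - pi_lo) / (2 * z))**2 - se_mean**2)   # implied between-study sd, ~0.46

print(f"SE of the average effect:    {se_mean:.3f}")
print(f"implied heterogeneity (tau): {tau:.2f}")
```

So almost all of the width of a correctly specified prior for a single intervention comes from heterogeneity, not from uncertainty about where the average lies.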
Although the full model HLI uses in constructing informed priors is different from that presented in S4 (e.g. it includes a bunch of moderators), they appear to be constructed with Monte Carlo on the confidence intervals for the average, not the prediction interval for the data. So I believe the informed prior is actually one of the (adjusted) “Average effect of psychotherapy interventions as a whole”, not a prior for (e.g.) “the effect size reported in a given PT study.” The latter would need to use the prediction intervals, and have a much wider distribution.[3]
I think this ably explains exactly why the Bayesian method for (e.g.) StrongMinds gives very bizarre results when deployed as the report does, but these results make much more sense if re-interpreted as (in essence) computing the expected effect size of ‘a future StrongMinds-like intervention’, rather than the effect size we should believe StrongMinds actually has once in receipt of trial data upon it specifically. E.g.:
The histogram of effect sizes shows some comparisons had an effect size < 0, but the ‘informed prior’ suggests P(ES < 0) is extremely low. As a prior for the effect size of the next study, it is much too confident, given the data, that a trial will report positive effects (you have >1/72 studies being negative, so surely it cannot be <1%, etc.). As a prior for the average effect size, this confidence is warranted: given the large number of studies in our sample, most of which report positive effects, we would be very surprised to discover the true average effect size is negative.
The prior doesn’t update very much on the data provided. E.g. when we stipulate the trials upon StrongMinds report a near-zero effect of 0.05 WELLBYs, our estimate of 1.49 WELLBYs goes to 1.26: so we should (apparently) believe in such a circumstance that the efficacy of SM is ~25 times greater than the trial data upon it indicates. This is, obviously, absurd. However, such a small update would be appropriate if it were an update to ~the average of PT interventions as a whole: observing that a new PT intervention has much-below-average results should cause our estimate of the average to shift a little towards the new findings, but not much.
In essence, the update we are interested in is not “How effective should we expect future interventions like StrongMinds to be, given the data on StrongMinds’ efficacy?”, but simply “How effective should we expect StrongMinds to be, given the data on how effective StrongMinds is?”. Given the massive heterogeneity and wide prediction interval, the (correct) informed prior is pretty uninformative, as it isn’t that surprised by anything in a very wide range of values, and so on finding trial data on SM with a given estimate in this range, our estimate should update to match it pretty closely.[4]
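To see the difference the prior’s width makes, here is a toy normal-normal update (illustrative numbers only: the prior sds are invented to mimic the two behaviours, and the report’s actual model has more moving parts):

```python
import numpy as np

def normal_update(prior_mean, prior_sd, data_mean, data_sd):
    """Conjugate normal-normal update: a precision-weighted average of prior and data."""
    w_prior, w_data = 1 / prior_sd**2, 1 / data_sd**2
    post_mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
    post_sd = np.sqrt(1 / (w_prior + w_data))
    return post_mean, post_sd

data_mean, data_sd = 0.05, 0.4   # stipulated near-null StrongMinds-specific result (sd invented)

# Narrow prior, in effect built from the CI of the average: the data barely moves it.
print(normal_update(prior_mean=1.49, prior_sd=0.2, data_mean=data_mean, data_sd=data_sd))

# Wide prior, built from something like the prediction interval: the posterior tracks the data.
print(normal_update(prior_mean=1.49, prior_sd=1.0, data_mean=data_mean, data_sd=data_sd))
```

The first case lands around 1.2 (close to the report’s 1.26), while the second lands around 0.25, i.e. the posterior mostly defers to the StrongMinds-specific evidence, which is the behaviour argued for above.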
(This also should mean, unlike the report suggests, the SM estimate is not that ‘robust’ to adverse data. Eyeballing it, I’d guess the posterior should be going down by a factor of 2-3 conditional on the stipulated data versus currently reported results).
[1] I’m aware confidence intervals are not credible intervals, and that ‘the 95% CI tells you where the true value is with 95% likelihood’ strictly misinterprets what a confidence interval is, etc. (see) But perhaps ‘close enough’, so I’m going to pretend these are credible intervals, and asterisk each time I assume the strictly incorrect interpretation.
[2] Cf. Cochrane: “The summary estimate and confidence interval from a random-effects meta-analysis refer to the centre of the distribution of intervention effects, but do not describe the width of the distribution. Often the summary estimate and its confidence interval are quoted in isolation and portrayed as a sufficient summary of the meta-analysis. This is inappropriate. The confidence interval from a random-effects meta-analysis describes uncertainty in the location of the mean of systematically different effects in the different studies. It does not describe the degree of heterogeneity among studies, as may be commonly believed. For example, when there are many studies in a meta-analysis, we may obtain a very tight confidence interval around the random-effects estimate of the mean effect even when there is a large amount of heterogeneity. A solution to this problem is to consider a prediction interval (see Section 10.10.4.3).”
[3] Although, I think, with the same mean, so it will give the right ‘best guess’ initial estimates.
[4] Obviously, modulo all the other issues I suggest with the meta-analysis as a whole, and the fact that we would in practice incorporate other sources of information into our actual prior, etc. etc.
Agree that there seem to be some strawmen in HLI’s response:
We don’t believe that the entire field of LMIC psychotherapy should be considered bunk, compromised, or uninformative.
Has anyone suggested that the “entire field of LMIC psychotherapy” is “bunk”?
If one insisted only on using charity evaluations that had every choice pre-registered, there would be none to choose from.
Has anyone suggested that, either? As I understand it, it’s typical to look at debatable choices that happen to support the author’s position with a somewhat more skeptical lens if they haven’t been pre-registered. I don’t think anyone has claimed that the lack of pre-registration for certain choices is somehow fatal, only that it’s a factor to consider.
Hey Jason,
Our (the HLI) comment was in reference to these quotes:
The literature on PT in LMICs is a complete mess.
Trying to correct the results of a compromised literature is known to be a nightmare.
I think it is valid to describe these as saying the literature is compromised and (probably) uninformative. I can understand your complaint about the word “bunk”. Apologies to Gregory if this is a mischaracterization.
Regarding our comment:
If one insisted only on using charity evaluations that had every choice pre-registered, there would be none to choose from.
And your comment:
I don’t think anyone has claimed lack of certain choices being pre-registered is somehow fatal, only a factor to consider.
Yeah, I think this is a valid point, and the post should have quoted Gregory directly. The point we were hoping to make here is that we’ve attempted to provide a wide range of sensitivity analyses throughout our report, to an extent that we think goes beyond most charity evaluations. It’s not surprising that we’ve missed some in this draft that others would like to see. Gregory’s comment that “Even if you didn’t pre-specify, presenting your first cut as the primary analysis helps for nothing up my sleeve reasons” seemed to imply that we were deliberately hiding something, but in my view our interpretation was overly pessimistic.
Cheers for keeping the discourse civil.