I think I can diagnose the underlying problem: Bayesian methods are very sensitive to the stipulated prior. In this case, the prior is likely too high, and definitely too narrow/overconfident.
Would it have been better to start with a stipulated prior based on evidence of short-course general-purpose[1] psychotherapy’s effect size generally, update that prior based on the LMIC data, and then update that on charity-specific data?
One of the objections to HLI’s earlier analysis was that it was just implausible in light of what we know of psychotherapy’s effectiveness more generally. I don’t know that literature well at all, so I don’t know how well the effect size in the new stipulated prior compares to the effect size for short-course general-purpose psychotherapy generally. However, given the methodological challenges with measuring effect size in LMICs on available data, it seems like a more general understanding of the effect size should factor into the informed prior somehow. Of course, the LMIC context is considerably different than the context in which most psychotherapy studies have been done, but I am guessing it would be easier to manage quality-control issues with the much broader research base available. So both knowledge bases would likely inform my prior before turning to charity-specific evidence.
[Edit 6-Dec-23: Greg’s response to the remainder of this comment is much better than my musings below. I’d suggest reading that instead!]
To my not-very-well-trained eyes, one hint to me that there’s an issue with application of Bayesian analysis here is the failure of the LMIC effect-size model to come anywhere close to predicting the effect size suggested by the SM-specific evidence. If the model were sound, it would seem very unlikely that the first organization evaluated to the medium-to-in-depth level would happen to have charity-specific evidence suggesting an effect size that diverged so strongly from what the model predicted. I think most of us, when faced with such a circumstance, would question whether the model was sound and would put it on the shelf until performing other charity-specific evaluations at the medium-to-in-depth level. That would be particularly true to the extent the model’s output depended significantly on the methodology used to clean up some problems with the data.[2]
If Greg’s analysis is correct, it seems I shouldn’t assign the informed prior much more credence than I have in HLI’s decision to remove outliers (and, to a lesser extent, its choice of method). So, again to my layperson way of thinking, one way to frame part of the crux is that the reader must weigh their confidence in HLI’s outlier-treatment decision against their confidence in the Baird/Ozler RCT on SM.

[1] By which I mean not psychotherapy for certain narrow problems (e.g., CBT-I for insomnia, exposure therapy for phobias).
What prior to formally pick is tricky—I agree the factors you note would be informative, but how to weigh them (vs. other sources of informative evidence) could be a matter of taste. However, sources of evidence like this could be handy to use as ‘benchmarks’ to see whether the prior (/results of the meta-analysis) are consilient with them, and if not, explore why.
But I think I can now offer a clearer explanation of what is going wrong. The hints you saw point in this direction, although not quite as you describe.
One thing worth being clear on is that HLI is not updating on the actual SM-specific evidence. As they model it, the SM-specific evidence gives an initial effect of g = 1.8 and a total effect of ~3.48 WELLBYs, which would lie on the right tail, not the left, of the informed prior.[1] They discount this effect by a factor of 20 to generate the data they feed into their Bayesian method. Stipulating data which would be (according to their prior) very surprisingly bad would in itself be a strength, not a concern, of the conservative analysis they are attempting.
Next, we need to distinguish an average effect size from a prediction interval. HLI does report both (Section 4) for a more basic model of PT in LMICs. The average (random-effects) effect size is 0.64 (95% CI 0.54 to 0.74), whilst the prediction interval is −0.27 to 1.55. The former gives you the best guess of the average effect (with a confidence interval); the latter tells you the range within which you can expect the effect size of another study like those already included to fall. By loose analogy: if I sample 100 people and their average height is roughly 5′7″ (95% CI 5′6″ to 5′8″), the 95% range of the individual heights will be much wider (say 5′0″ to 6′2″).

Unsurprisingly (especially given the marked heterogeneity), the prediction interval is much wider than the confidence interval around the average effect size. Crucially, if our ‘next study’ reports an effect size of (say) 0.1, our interpretation typically should not be: “This study can’t be right; the real effect of the intervention it studies must be much closer to 0.6.” Rather, as findings are heterogeneous, it is much more likely to be a study which genuinely reports a below-average effect.[2] Back to the loose analogy: we would typically assume we got it right if we measured some more people at (e.g.) 6′0″ and 5′4″, even though these are well above or below the 95% confidence interval of our average, and only start to doubt measurements far outside our prediction interval (e.g. 3′10″, 7′7″).
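To make the distinction concrete, here is a minimal Python sketch that reconstructs both intervals from the summary numbers quoted above. The between-study SD (tau) below is backed out so that the result roughly matches the reported prediction interval; it is an illustration of the formula (using a normal approximation), not HLI’s actual model.

```python
from math import sqrt

# Reported summary statistics for the basic LMIC psychotherapy model (Section 4).
mean_effect = 0.64             # random-effects average effect size
ci_low, ci_high = 0.54, 0.74   # 95% CI for the *average* effect

# Back out the standard error of the average from its CI (normal approximation).
se_mean = (ci_high - ci_low) / (2 * 1.96)   # ~0.05

# Illustrative between-study SD (tau), chosen so the prediction interval
# roughly matches the reported -0.27 to 1.55; not HLI's actual estimate.
tau = 0.46

# 95% prediction interval for a *new* study: combines uncertainty in the
# average with genuine between-study variation.
half_width = 1.96 * sqrt(tau**2 + se_mean**2)
print(f"95% CI for the average:   {ci_low:.2f} to {ci_high:.2f}")
print(f"95% prediction interval:  {mean_effect - half_width:.2f} to {mean_effect + half_width:.2f}")
# -> roughly -0.27 to 1.55, far wider than the CI of the average
```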
Now the problem with the informed prior becomes clear: it is (essentially) being constructed from the confidence intervals of the average in its underlying models, not from their prediction intervals. As such, it is a prior not on “What is the expected impact of a given PT intervention?”, but rather on “What is the expected average impact of PT interventions as a whole?”[3]

With this understanding, the previously bizarre behaviour is made sensible. For the informed prior should assign very little credence to the average impact of PT overall being ~0.4 per the stipulated StrongMinds data, even though it should not be that surprised that a particular intervention (e.g. StrongMinds!) has an impact much below average, as many other PT interventions studied also do (cf. although I shouldn’t be surprised if I measure someone at 5′2″, I should be very surprised if the true average height is actually 5′2″, given my large sample averages 5′7″). Similarly, if we are given a much smaller additional sample reporting a very different effect size, the updated average effect should remain pretty close to the prior (e.g. if a handful of new people have heights < 5′4″, my overall average goes down a little, but not by much).

Needless to say, the results of such an analysis, if they are indeed for the “average effect size of psychotherapy as a whole”, are completely inappropriate for the “expected effect size of a given psychotherapy intervention”, which is the use they are put to in the report.[4] If the measured effect size of StrongMinds was indeed ~0.4, the fact that psychotherapy interventions average substantially greater effects of ~1.4 gives very little reason to conclude the effect of StrongMinds is in fact much higher (e.g. ~1.3). In the same way, if I measure your height as 5′0″, the fact that the average height of other people I’ve measured is 5′7″ does not mean I should conclude you’re probably about 5′6″.[5]
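A toy conjugate-normal update shows why the width of the prior drives this behaviour. The ~1.4 prior mean and ~0.4 stipulated StrongMinds figure come from the discussion above; the standard deviations below are made-up, purely illustrative values, not HLI’s.

```python
def normal_update(prior_mean, prior_sd, data_mean, data_sd):
    """Posterior mean/sd for a normal likelihood with a conjugate normal prior."""
    w_prior = 1 / prior_sd**2
    w_data = 1 / data_sd**2
    post_mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
    post_sd = (w_prior + w_data) ** -0.5
    return post_mean, post_sd

prior_mean = 1.4   # ~average total effect of PT interventions (from the discussion above)
data_mean = 0.4    # ~stipulated StrongMinds effect
data_sd = 0.6      # illustrative sampling uncertainty on the stipulated data

# (a) Prior mistakenly built from the CI of the *average*: very narrow.
print(normal_update(prior_mean, prior_sd=0.2, data_mean=data_mean, data_sd=data_sd))
# -> posterior mean ~1.3: the stipulated data barely moves the estimate

# (b) Prior built from the *prediction* distribution for a single intervention: much wider.
print(normal_update(prior_mean, prior_sd=0.9, data_mean=data_mean, data_sd=data_sd))
# -> posterior mean ~0.7: the charity-specific data counts for far more
```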
[1] Minor: it does lie pretty far along the right tail of the prior (<top 1st percentile?), so maybe one could be a little concerned. Not much, though: given HLI was searching for particularly effective PT interventions in the literature, it doesn’t seem that surprising that this effort could indeed find one at the far-ish right tail of apparent efficacy.

[2] One problem for many types of the examined psychotherapy is that the level of heterogeneity was high, and many of the prediction intervals were broad and included zero. This means that it is difficult to predict the effect size of the next study that is done with this therapy, and that study may just as well find negative effects. The resulting effect sizes differ so much for one type of therapy that it cannot be reliably predicted what the true effect size is. Cf. Cuijpers et al. 2020.

[3] Cf. your original point about a low result looking weird given the prior. Perhaps the easiest way to see this is to consider a case where the intervention is harmful. The informed prior says P(ES < 0) is very close to zero. Yet >1/72 studies in the sample did have an effect size < 0. So obviously a prior for an intervention should not be that confident in predicting it will not have a -ve effect. But a prior for the average effect of PT interventions should be that confident this average is not in fact negative, given the great majority of sampled studies show substantially +ve effects.

[4] In a sense, the posterior is not computing the expected effect of StrongMinds, but rather the expected effect of a future intervention like StrongMinds. Somewhat ironically, this (again, simulated) result would be best interpreted as an anti-recommendation: StrongMinds performs much below the average we would expect of interventions similar to it.

[5] It is slightly different for measured height, as we usually have very little pure measurement error (versus studies with more significant sampling variance). So you’d update a little less towards the reported study effects vs. the expected value than you would for height measurements vs. the average. But the key points still stand.
Hi Jason,

“Would it have been better to start with a stipulated prior based on evidence of short-course general-purpose[1] psychotherapy’s effect size generally, update that prior based on the LMIC data, and then update that on charity-specific data?”
1. To your first point, I think adding another layer of priors is a plausible way to do things, but given that the effects of psychotherapy in general appear to be similar to the estimates we come up with,[1] it’s not clear how much this would change our estimates.
There are probably two issues with using HIC RCTs as a prior. First, the incentives that could bias results probably differ across countries; I’m not sure how this would pan out. Second, in HICs, the control group (“treatment as usual”) is probably a lot better off. In a HIC RCT, there’s not much you can do to stop someone in the control group of a psychotherapy trial from getting prescribed antidepressants. However, the standard of care in LMICs is much lower (antidepressants typically aren’t an option), so we shouldn’t be terribly surprised if control groups appear to do worse (and the treatment effect is thus larger).
“To my not-very-well-trained eyes, one hint to me that there’s an issue with application of Bayesian analysis here is the failure of the LMIC effect-size model to come anywhere close to predicting the effect size suggested by the SM-specific evidence.”
2. To your second point, does our model predict charity-specific effects?
In general, I think it’s a fair test of a model to say it should do a reasonable job of predicting new observations. We can’t yet discuss the forthcoming StrongMinds RCT; we will know how well our model predicts that RCT when it’s released. But for the Friendship Bench (FB) situation, it is true that we predict a considerably lower effect for FB than the FB-specific evidence would suggest. This is in part because we’re using charity-specific evidence to inform both our prior and the data. Let me explain.
We have two sources of charity-specific evidence. First, we have the RCTs, which are based on a charity’s programme but not as it’s deployed at scale. Second, we have monitoring and evaluation (M&E) data, which can show how well the charity’s intervention is implemented in the real world. We don’t have a psychotherapy charity at present that has RCT evidence of the programme as it’s deployed in the real world. This matters because I think placing a very high weight on the charity-specific evidence would require it to have high ecological validity. While the ecological validity of these RCTs is obviously higher than that of the average study, we still think it’s limited. I’ll explain our concern with FB.
For Friendship Bench, the most recent RCT (Haas et al. 2023, n = 516) reports attendance at psychotherapy sessions of around 90%, but the Friendship Bench M&E data reports an attendance rate more like 30%. We discuss this in Section 8 of the report.
So for Friendship Bench we have a couple of reasonable-quality RCTs, but it seems, based on the M&E data, that something is wrong with the implementation. This evidence of lower implementation quality should be adjusted for, which we do; but we include this adjustment in the prior. So we’re injecting charity-specific evidence into both the prior and the data. Note that this is part of the reason why we don’t think it’s wild to place a decent amount of weight on the prior. This is something we should probably clean up in a future version.
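As a purely illustrative sketch of why the attendance gap matters, the snippet below scales an RCT effect by relative attendance, assuming a simple linear dose-response. That assumption is mine for exposition only; it is not necessarily how HLI’s actual adjustment works.

```python
# Toy illustration only: assume benefit scales linearly with sessions attended.
# This is NOT necessarily HLI's actual implementation adjustment.
rct_effect = 1.0            # effect observed in the RCT (arbitrary units)
rct_attendance = 0.90       # ~90% session attendance reported in Haas et al. 2023
deployed_attendance = 0.30  # ~30% attendance in Friendship Bench's M&E data

# Scale the RCT effect down to the attendance seen in real-world deployment.
implementation_adjustment = deployed_attendance / rct_attendance   # ~0.33
expected_deployed_effect = rct_effect * implementation_adjustment
print(f"Adjustment factor: {implementation_adjustment:.2f}")
print(f"Expected effect at scale: {expected_deployed_effect:.2f} (vs {rct_effect} in the RCT)")
```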
We can’t discuss the details of the Baird et al. RCT until it’s published, but we think there may be an analogous situation to Friendship Bench where the RCT and M&E data tell conflicting stories about implementation quality.
This is all to say, judging how well our predictions fare when predicting the charity-specific effects isn’t entirely straightforward, since we are trying to predict the effects of the charity as it is actually implemented (something we don’t directly observe), not simply the effects from an RCT.
If we try to predict the RCT effects for Friendship Bench (which have much higher attendance than the “real” programme), then the gap between the predicted RCT effects and the actual RCT effects is much smaller, but it still suggests that we can’t completely explain why the Friendship Bench RCTs find their large effects.
So we think the error in our prediction isn’t quite as bad as it seems when judged against the RCTs, and it stems in large part from the fact that we are actually predicting the charity’s real-world implementation.
[1] Cuijpers et al. 2023 find an effect of psychotherapy of 0.49 SDs for studies with low risk of bias (RoB) in low-, middle-, and high-income countries (comparisons = 218), and Tong et al. 2023 find an effect of 0.69 SDs for studies with low RoB in non-Western countries (primarily low- and middle-income; comparisons = 36). Our estimate of the initial effect is 0.70 SDs (before publication bias adjustments). The results tend to be lower (between 0.27 and 0.57 SDs, or between 0.42 and 0.60 SDs) when the authors of the meta-analyses correct for publication bias. In both meta-analyses (Tong et al. and Cuijpers et al.) the authors present the effects after using three publication bias correction methods: trim-and-fill (0.6; 0.38 SDs), a limit meta-analysis (0.42; 0.28 SDs), and a selection model (0.49; 0.57 SDs). If we averaged their publication-bias-corrected results (which they produced without removing outliers beforehand), the estimated effect of psychotherapy would be 0.5 SDs and 0.41 SDs for the two meta-analyses. Our estimate of the initial effect (which is most comparable to these meta-analyses) after removing outliers is 0.70 SDs, and our publication bias correction is 36%, implying that we estimate our initial effect to be 0.46 SDs. You can play around with the data they use on the metapsy website.
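For readers who want to check the arithmetic, a minimal sketch reproducing the averages and the publication-bias discount quoted above (all values copied from that paragraph; rounding explains the small differences):

```python
# Quick arithmetic check of the figures quoted above.
cuijpers_corrections = [0.60, 0.42, 0.49]  # trim-and-fill, limit meta-analysis, selection model
tong_corrections = [0.38, 0.28, 0.57]

print(sum(cuijpers_corrections) / 3)   # ~0.50 SDs
print(sum(tong_corrections) / 3)       # ~0.41 SDs

# HLI's own initial effect and publication-bias discount:
initial_effect = 0.70
discount = 0.36
print(initial_effect * (1 - discount))  # ~0.45, i.e. roughly the 0.46 SDs quoted
```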