What prior to formally pick is tricky: I agree the factors you note would be informative, but how to weigh them (vs. other sources of informative evidence) could be a matter of taste. However, sources of evidence like this could be handy to use as "benchmarks" to see whether the prior (/results of the meta-analysis) are consilient with them, and if not, explore why.
But I think I can now offer a clearer explanation of what is going wrong. The hints you saw point in this direction, although not quite as you describe.
One thing worth being clear on is that HLI is not updating on the actual SM-specific evidence. As they model it, the estimated effect from this evidence is an initial effect of g = 1.8 and a total effect of ~3.48 WELLBYs, so this would lie on the right tail, not the left, of the informed prior.[1] They discount the effect by a factor of 20 to generate the data they feed into their Bayesian method. Stipulating data which would be (according to their prior) very surprisingly bad would in itself be a strength, not a concern, of the conservative analysis they are attempting.
Next, we need to distinguish an average effect size from a prediction interval. HLI does report both (Section 4) for a more basic model of PT in LMICs. The (average, random-effects) effect size is 0.64 (95% CI 0.54 to 0.74), whilst the prediction interval is -0.27 to 1.55. The former gives you the best guess of the average effect (with a confidence interval); the latter tells you, if I do another study like those already included, the range within which I can expect its effect size to fall. By loose analogy: if I sample 100 people and their average height is roughly 5′7″ (95% CI 5′6″ to 5′8″), the 95% range of the individual heights will be much wider (say 5′0″ to 6′2″).
Unsurprisingly (especially given the marked heterogeneity), the prediction interval is much wider than the confidence interval around the average effect size. Crucially, if our "next study" reports an effect size of (say) 0.1, our interpretation typically should not be: "This study can't be right; the real effect of the intervention it studies must be much closer to 0.6." Rather, as findings are heterogeneous, it is much more likely to be a study which (genuinely) has a below-average effect.[2] Back to the loose analogy: we would (typically) assume we got it right if we measured some more people at (e.g.) 6′0″ and 5′4″, even though these are significantly above or below the 95% confidence interval of our average, and only start to doubt measurements far outside our prediction interval (e.g. 3′10″, 7′7″).
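To make the distinction concrete, here is a minimal sketch (Python, with made-up study effects and standard errors rather than HLI's actual data) of how the two intervals come apart in a standard DerSimonian-Laird random-effects meta-analysis: the confidence interval only reflects uncertainty about the average, while the prediction interval also folds in the between-study heterogeneity (tau²).

```python
import numpy as np
from scipy import stats

def random_effects_summary(effects, ses):
    """DerSimonian-Laird pooling: pooled mean, its 95% CI, and a
    Higgins-style 95% prediction interval for the effect in a new study."""
    y = np.asarray(effects, float)
    v = np.asarray(ses, float) ** 2
    k = len(y)

    # Fixed-effect weights, pooled mean, and Cochran's Q
    w = 1.0 / v
    mu_fe = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - mu_fe) ** 2)

    # DerSimonian-Laird estimate of between-study variance tau^2
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (k - 1)) / c)

    # Random-effects pooled mean and its standard error
    w_re = 1.0 / (v + tau2)
    mu = np.sum(w_re * y) / np.sum(w_re)
    se_mu = np.sqrt(1.0 / np.sum(w_re))

    # 95% CI for the average effect: only uncertainty about the mean
    ci = (mu - 1.96 * se_mu, mu + 1.96 * se_mu)

    # 95% prediction interval for the next study's effect:
    # also includes between-study heterogeneity (Higgins et al. 2009)
    t_crit = stats.t.ppf(0.975, df=k - 2)
    half = t_crit * np.sqrt(tau2 + se_mu ** 2)
    return mu, ci, (mu - half, mu + half)

# Hypothetical, heterogeneous study effects (standardised mean differences)
effects = [0.1, 0.3, 0.5, 0.6, 0.7, 0.9, 1.2, 1.5]
ses     = [0.15, 0.20, 0.10, 0.12, 0.20, 0.15, 0.25, 0.30]
mu, ci, pi = random_effects_summary(effects, ses)
print(f"pooled mean {mu:.2f}; 95% CI {ci[0]:.2f} to {ci[1]:.2f}; "
      f"95% PI {pi[0]:.2f} to {pi[1]:.2f}")   # the PI is much wider than the CI
```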
Now the problem with the informed prior becomes clear: it is (essentially) being constructed with confidence intervals of the average, not prediction intervals, from its underlying models. As such, it is a prior not for "What is the expected impact of a given PT intervention?", but rather "What is the expected average impact of PT interventions as a whole?".[3]
With this understanding, the previously bizarre behaviour makes sense. The informed prior should assign very little credence to the average impact of PT overall being ~0.4, per the stipulated StrongMinds data, even though it should not be that surprised that a particular intervention (e.g. StrongMinds!) has an impact much below average, as many other PT interventions studied also do (cf. although I shouldn't be surprised if I measure someone at 5′2″, I should be very surprised if the true average height is actually 5′2″, given my large sample averages 5′7″). Similarly, if we are given a much smaller additional sample reporting a very different effect size, the updated average effect should remain pretty close to the prior (e.g. if a handful of new people have heights < 5′4″, my overall average goes down a little, but not by much).
Needless to say, the results of such an analysis, if indeed for the "average effect size of psychotherapy as a whole", are completely inappropriate for the "expected effect size of a given psychotherapy intervention", which is the use they are put to in the report.[4] If the measured effect size of StrongMinds was indeed ~0.4, the fact that psychotherapy interventions average substantially greater effects of ~1.4 gives very little reason to conclude the effect of StrongMinds is in fact much higher (e.g. ~1.3). In the same way, if I measure your height as 5′0″, the fact that the average height of the other people I've measured is 5′7″ does not mean I should conclude you're probably about 5′6″.[5]
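To see the mechanics, here is a minimal conjugate normal-normal updating sketch. The prior standard deviations and the observation's standard error below are made up purely for illustration (only the ~1.4 prior mean and ~0.4 observation echo the rough figures above): a prior narrow enough to describe uncertainty about the average effect barely moves when fed a low observation, whereas a prediction-interval-width prior mostly follows the data.

```python
import numpy as np

def normal_update(prior_mean, prior_sd, obs, obs_sd):
    """Conjugate normal-normal update: a precision-weighted average of the
    prior mean and the observation."""
    w_prior, w_obs = 1.0 / prior_sd**2, 1.0 / obs_sd**2
    post_mean = (w_prior * prior_mean + w_obs * obs) / (w_prior + w_obs)
    post_sd = np.sqrt(1.0 / (w_prior + w_obs))
    return post_mean, post_sd

# Illustrative numbers only: a stipulated observed effect of ~0.4 (with a
# made-up standard error) against a prior centred at ~1.4.
obs, obs_sd = 0.4, 0.3

# Narrow prior (uncertainty about the average effect of PT interventions):
print(normal_update(1.4, 0.10, obs, obs_sd))   # posterior mean ~1.3: barely moves

# Wide prior (uncertainty about this particular intervention's effect):
print(normal_update(1.4, 0.90, obs, obs_sd))   # posterior mean ~0.5: mostly follows the data
```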
Minor: it does lie pretty far along the right tail of the prior (within the top percentile?), so maybe one could be a little concerned. Not much, though: given HLI was searching for particularly effective PT interventions in the literature, it doesn't seem that surprising that this effort could indeed find one at the far-ish right tail of apparent efficacy.
One problem for many types of the examined psychotherapy is that the level of heterogeneity was high, and many of the prediction intervals were broad and included zero. This means that it is difficult to predict the effect size of the next study that is done with this therapy, and that study may just as well find negative effects. The resulting effect sizes differ so much for one type of therapy that it cannot be reliably predicted what the true effect size is. Cf. Cuijpers et al. 2020.
Cf. your original point about a low result looking weird given the prior. Perhaps the easiest way to see this is to consider a case where the intervention is harmful. The informed prior says P(ES < 0) is very close to zero. Yet more than 1 of the 72 studies in the sample did have an effect size < 0. So obviously a prior for a given intervention should not be that confident in predicting it will not have a negative effect. But a prior for the average effect of PT interventions should be that confident this average is not in fact negative, given the great majority of sampled studies show substantially positive effects.
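A rough back-of-the-envelope check makes the contrast explicit. This is only a sketch using the Section 4 figures quoted earlier (not the informed prior's actual parameters), with the standard deviations back-derived from the reported intervals and everything treated as normal:

```python
from scipy import stats

# Approximate figures back-derived from the Section 4 numbers quoted above
mean = 0.64
se_mean = (0.74 - 0.54) / (2 * 1.96)      # ~0.05, from the 95% CI of the average
sd_pred = (1.55 - (-0.27)) / (2 * 1.96)   # ~0.46, from the prediction interval

# P(ES < 0) if the distribution describes the average effect of PT overall
print(stats.norm.cdf(0, loc=mean, scale=se_mean))   # ~0: essentially ruled out

# P(ES < 0) if the distribution describes the effect of a given (next) intervention
print(stats.norm.cdf(0, loc=mean, scale=sd_pred))   # ~0.08: entirely plausible
```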
In a sense, the posterior is not computing the expected effect of StrongMinds, but rather the expected effect of a future intervention like StrongMinds. Somewhat ironically, this (again, simulated) result would be best interpreted as an anti-recommendation: StrongMinds performs much below the average we would expect of interventions similar to it.
It is slightly different for measured height, as we usually have very little pure measurement error (versus studies, which have more significant sampling variance). So you'd update a little less towards a reported study effect (vs. the expected value) than you would towards a height measurement (vs. the average). But the key points still stand.