When running a meta-analysis, you can use either a fixed effect assumption (that all variation between studies is just due to sampling error) or a random effects assumption (that studies differ in terms of their “true effects”). Therapy treatments differ greatly, so you have to use a random effects model in this case. Then the prior you use for StrongMinds’ impact should have a variance that is the sum of the variance in the estimate of the average therapy treatment effect AND the variance among the effects of different types of treatment; both numbers should be available from a random effects meta-analysis. I’m not quite sure what HLI did exactly to get their prior for StrongMinds here, but for some reason the variance on it seems WAY too low, and I suspect that they neglected the second type of variance that they should have gotten from a random effects meta-analysis.
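To spell out that sum in symbols (the notation here is an assumption for illustration, not necessarily HLI’s): write $\hat{\mu}$ for the pooled average effect, $\mathrm{SE}(\hat{\mu})$ for its standard error, and $\tau^2$ for the between-study variance. Under the usual normal random effects model, the true effect $\theta_{\text{new}}$ of a treatment not yet studied is then approximately distributed as:

```latex
\theta_{\text{new}} \sim \mathcal{N}\!\left(\hat{\mu},\; \mathrm{SE}(\hat{\mu})^{2} + \tau^{2}\right)
```

Dropping the $\tau^2$ term leaves only the uncertainty about the average, which shrinks as studies accumulate even when the treatments themselves differ a lot.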
Section 2.2.2 of their report is titled “Choosing a fixed or random effects model”. They discuss the points you make and clearly say that they use a random effects model. In section 2.2.3 they discuss the standard measures of heterogeneity they use. Section 2.2.4 discusses the specific 4-level random effects model they use and how they did model selection.
I reviewed a small section of the report prior to publication but none of these sections, and it only took me 5 minutes now to check what they did. I’d like the EA Forum to have a higher bar (as Gregory’s parent comment exemplifies) before throwing around easily checkable suspicions about what (very basic) mistakes might have been made.
Yes, some of Greg’s examples point to the variance being underestimated, but the problem does not inherently come from the idea of using the distribution of effects as the prior, since that should include both the sampling uncertainty and true heterogeneity. That would be the appropriate approach even under a random effects model (I think; I’m more used to thinking in terms of Bayesian hierarchical models and the equivalence might not hold)
I think the fundamental point (i.e. “You cannot use the distribution for the expected value of an average therapy treatment as the prior distribution for a SPECIFIC therapy treatment, as there will be a large amount of variation between possible therapy treatments that is missed when doing this.”) is on the right lines, although subsequent discussion of fixed/random effect models might confuse the issue. (Cf. my reply to Jason).
The typical output of a meta-analysis is an (~) average effect size estimate (the diamond at the bottom of the forest plot, etc.) The confidence interval given for that is (very roughly)[1] the interval within which we predict the true average effect likely lies. So for the basic model given in Section 4 of the report, the average effect size is 0.64, 95% CI (0.54 − 0.74). So (again, roughly) our best guess of the ‘true’ average effect size of psychotherapy in LMICs from our data is 0.64, and we’re 95% sure(*) this average is somewhere between (0.54, 0.74).
Clearly, it is not the case that if we draw another study from the same population, we should be 95% confident(*) the effect size of this new data point will lie between 0.54 and 0.74. This would not be true even in the unicorn case where there’s no between-study heterogeneity (i.e. all the studies are measuring the same effect modulo sampling variance), and even less so when heterogeneity is marked, as here. To answer that question, what you want is a prediction interval.[2] This interval is always wider, and almost always significantly so, than the confidence interval for the average effect: in the same analysis with the 0.54-0.74 confidence interval, the prediction interval was −0.27 to 1.55.
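As a rough sketch of how far apart these two intervals are, the snippet below backs out the implied standard deviations from the reported figures. It assumes plain normal 95% intervals (real prediction intervals typically use a t-distribution, so the numbers are only approximate), and the variable names are mine rather than anything from the report.

```python
import math

Z95 = 1.959963984540054  # two-sided 95% normal quantile


def sd_from_interval(lower, upper, z=Z95):
    """Standard deviation implied by a symmetric normal interval."""
    return (upper - lower) / (2 * z)


# Figures quoted above for the basic model in Section 4 of the report.
mean_effect = 0.64
se_mean = sd_from_interval(0.54, 0.74)    # uncertainty about the average
sd_pred = sd_from_interval(-0.27, 1.55)   # uncertainty about a new study's effect

# Assumption: predictive variance = SE(mean)^2 + tau^2 (normal approximation).
tau = math.sqrt(sd_pred**2 - se_mean**2)  # implied between-study SD

print(f"SE of the average effect:     {se_mean:.3f}")
print(f"SD for predicting a new study: {sd_pred:.3f}")
print(f"Implied between-study SD tau:  {tau:.3f}")
print(f"Predictive SD / SE ratio:      {sd_pred / se_mean:.1f}")
```

On these approximations nearly all of the predictive spread comes from between-study heterogeneity rather than from uncertainty about the average.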
Although the full model HLI uses in constructing informed priors is different from that presented in S4 (e.g. it includes a bunch of moderators), the priors appear to be constructed with Monte Carlo on the confidence intervals for the average, not the prediction interval for the data. So I believe the informed prior is actually one of the (adjusted) “Average effect of psychotherapy interventions as a whole”, not a prior for (e.g.) “the effect size reported in a given PT study.” The latter would need to use the prediction intervals, and have a much wider distribution.[3]
I think this ably explains exactly why the Bayesian method for (e.g.) StrongMinds gives very bizarre results when deployed as the report does. The results make much more sense if re-interpreted as (in essence) computing the expected effect size of ‘a future StrongMinds-like intervention’, rather than the effect size we should believe StrongMinds actually has once in receipt of trial data upon it specifically. E.g.:
The histogram of effect sizes shows some comparisons had an effect size < 0, but the ‘informed prior’ suggests P(ES < 0) is extremely low. As a prior for the effect size of the next study, it is much too confident, given the data, a trial will report positive effects (you have >1/72 studies being negative, so surely it cannot be <1%, etc.). As a prior for the average effect size, this confidence is warranted: given the large number of studies in our sample, most of which report positive effects, we would be very surprised to discover the true average effect size is negative.
The prior doesn’t update very much on data provided. E.g. when we stipulate the trials upon StrongMinds report a near-zero effect of 0.05 WELLBYs, our estimate of 1.49 WELLBYs goes to 1.26: so we should (apparently) believe in such a circumstance the efficacy of SM is ~25 times greater than the trial data upon it indicates. This is, obviously, absurd. However, such a small update is appropriate if it were an update to ~the average of PT interventions as a whole: that we observe a new PT intervention has much below average results should cause our average to shift a little towards the new findings, but not much.
In essence, the update we are interested in is not “How effective should we expect future interventions like StrongMinds to be given the data on StrongMinds’ efficacy”, but simply “How effective should we expect StrongMinds to be given the data on how effective StrongMinds is”. Given the massive heterogeneity and wide prediction interval, the (correct) informed prior is pretty uninformative, as it isn’t that surprised by anything in a very wide range of values, and so on finding trial data on SM with a given estimate in this range, our estimate should update to match it pretty closely.[4]
(This also should mean, unlike the report suggests, the SM estimate is not that ‘robust’ to adverse data. Eyeballing it, I’d guess the posterior should be going down by a factor of 2-3 conditional on the stipulated data versus currently reported results).
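To illustrate why the width of the prior matters so much here, below is a minimal sketch of a conjugate normal-normal update. It is not HLI’s model and not in WELLBY units: the prior mean (0.64) and the two prior SDs (about 0.051 and 0.464) are the approximate values implied by the confidence and prediction intervals quoted above, while the trial standard error of 0.15 is purely a hypothetical number chosen for illustration.

```python
def normal_update(prior_mean, prior_sd, obs, obs_se):
    """Posterior mean and SD for a normal prior and a normal likelihood."""
    prior_prec = 1.0 / prior_sd**2   # precision = 1 / variance
    obs_prec = 1.0 / obs_se**2
    post_var = 1.0 / (prior_prec + obs_prec)
    post_mean = post_var * (prior_prec * prior_mean + obs_prec * obs)
    return post_mean, post_var**0.5


PRIOR_MEAN = 0.64        # pooled average effect quoted above
NARROW_SD = 0.051        # approx. SD implied by the confidence interval
WIDE_SD = 0.464          # approx. SD implied by the prediction interval
TRIAL_EFFECT = 0.05      # stipulated near-zero trial result
TRIAL_SE = 0.15          # HYPOTHETICAL standard error, for illustration only

narrow_mean, narrow_sd = normal_update(PRIOR_MEAN, NARROW_SD, TRIAL_EFFECT, TRIAL_SE)
wide_mean, wide_sd = normal_update(PRIOR_MEAN, WIDE_SD, TRIAL_EFFECT, TRIAL_SE)

print(f"Posterior mean, CI-based prior: {narrow_mean:.2f}")
print(f"Posterior mean, PI-based prior: {wide_mean:.2f}")
```

With these illustrative inputs the CI-based prior barely moves from 0.64 (to roughly 0.58), whereas the PI-based prior ends up close to the trial result (roughly 0.11), which is the qualitative pattern described above.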
I’m aware confidence intervals are not credible intervals, and that ‘the 95% CI tells you where the true value is with 95% likelihood’ strictly misinterprets what a confidence interval is, etc. (see) But perhaps ‘close enough’, so I’m going to pretend these are credible intervals, and asterisk each time I assume the strictly incorrect interpretation.
“The summary estimate and confidence interval from a random-effects meta-analysis refer to the centre of the distribution of intervention effects, but do not describe the width of the distribution. Often the summary estimate and its confidence interval are quoted in isolation and portrayed as a sufficient summary of the meta-analysis. This is inappropriate. The confidence interval from a random-effects meta-analysis describes uncertainty in the location of the mean of systematically different effects in the different studies. It does not describe the degree of heterogeneity among studies, as may be commonly believed. For example, when there are many studies in a meta-analysis, we may obtain a very tight confidence interval around the random-effects estimate of the mean effect even when there is a large amount of heterogeneity. A solution to this problem is to consider a prediction interval (see Section 10.10.4.3).”
Obviously, modulo all the other issues I suggest with the meta-analysis as a whole, the fact that we would in practice incorporate other sources of information into our actual prior, etc. etc.
(@Burner1989 @David Rhys Bernard @Karthik Tadepalli)
Cf. Cochrane, from which the passage quoted above is taken.
Although I think it has the same mean, so it will give the right ‘best guess’ initial estimates.