I have now had a look at the analysis code. Once again, I find significant errors, and, once again, correcting these errors is adverse to HLI's bottom line.
I noted before that the results originally reported do not make much sense (e.g. they generally report increases in effect size when "controlling" for small study effects, despite it being visually obvious on the funnel plot that small studies tend to report larger effects). When you use appropriate comparators (i.e. comparing everything to the original model as the baseline case), the cloud of statistics looks more reasonable: in general, they point towards discounts, not enhancements, to effect size; the red lines are mostly less than 1, whilst the blue ones are all over the place.
However, some findings still look bizarre even after doing this. E.g. Model 13 (PET) and Model 19 (PEESE), which do nothing re. outliers, fixed effects, follow-ups, etc., still report higher effects than the original analysis. Both are closely related to the Egger test noted before: why would it give a substantial discount, yet these give a mild enhancement?
Happily, the code availability means I can have a look directly. All the basic data seems fine, as the various "basic" plots and meta-analyses give the right results. Of interest, the Egger test is still pointing the right way, and even suggests a lower intercept effect size than last time (0.13 versus 0.26):
PET gives highly discordant findings:
You not only get a higher intercept (0.59 versus 0.5 in the basic random effects model), but the coefficient for standard error is negative: i.e. the regression line it draws slopes the opposite way to Egger's, so it predicts smaller studies give smaller, not greater, effects than larger ones. What's going on?
The moderator (i.e. ~independent variable) is "corrected" SE. Unfortunately, this correction is incorrect (line 17 divides (n/2)^2 by itself, where the first bracket should be +, not *), so it "corrects" a lot of studies to SE = 1 exactly:
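For readers without the code to hand, here is a minimal sketch of what that description implies, in Python (the actual script is presumably R; the function names and the exact form of the intended correction are my reconstruction from the error as described, not the original code):

```python
import numpy as np

def se_corrected_buggy(n):
    # What the description of line 17 implies: the first bracket multiplies
    # (n/2) by (n/2) instead of adding, so numerator == denominator and every
    # study it is applied to gets "corrected" to SE = 1 exactly.
    return np.sqrt(((n / 2) * (n / 2)) / ((n / 2) * (n / 2)))

def se_corrected_intended(n):
    # Presumed intent: the usual large-sample SE approximation for a
    # standardised mean difference with two equal arms of n/2 each,
    # sqrt(1/(n/2) + 1/(n/2)) = 2/sqrt(n).
    return np.sqrt(((n / 2) + (n / 2)) / ((n / 2) * (n / 2)))

n = np.array([40, 80, 160, 320, 640])   # hypothetical total sample sizes
print(se_corrected_buggy(n))             # [1. 1. 1. 1. 1.]
print(se_corrected_intended(n))          # ~[0.32 0.22 0.16 0.11 0.08]
```

A moderator pinned at exactly 1 for much of the sample carries essentially no information about study size, which is what sets up the aberrant fits below.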
When you use this in a funnel plot, you get this:
Thus these aberrant results (which happened to be below the mean effect size) explain why the best fit line now points in the opposite direction. All the PET analyses are contaminated by this error, and (given PEESE squares these values) so are all the PEESE analyses. When debugged, PET shows an intercept lower than 0.5, and the coefficient for SE pointing in the right direction:
Here's the table of corrected estimates applied to models 13-24: as you can see, correction reduces the intercept in all models, often to substantial degrees (I only reported to 2 dp, but model 23 was marginally lower). Unlike the original analysis, here the regression slopes generally point in the right direction.
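For concreteness, here is a minimal sketch of what PET and PEESE estimate, and why their slopes should agree in sign with Egger's test, on made-up data (the real analysis is presumably run in R; everything below, including the simulated effects, is illustrative only):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated meta-analytic data with small-study effects: studies with larger
# standard errors report systematically larger effects.
k = 40
se = rng.uniform(0.05, 0.45, k)                  # per-study standard errors
y = 0.2 + 1.0 * se + rng.normal(0, se)           # observed effect sizes

def pet_peese(y, se):
    w = 1 / se**2                                # inverse-variance weights
    # PET: weighted regression of effect on SE. The SE slope is essentially
    # Egger's test; the intercept is the effect extrapolated to SE = 0.
    pet = sm.WLS(y, sm.add_constant(se), weights=w).fit()
    # PEESE: same idea, but regress on the variance (SE squared).
    peese = sm.WLS(y, sm.add_constant(se**2), weights=w).fit()
    return pet, peese

pet, peese = pet_peese(y, se)
print("PET   intercept %.2f, SE slope %.2f" % (pet.params[0], pet.params[1]))
print("PEESE intercept %.2f, SE^2 slope %.2f" % (peese.params[0], peese.params[1]))
```

With genuine small-study effects, both slopes come out positive and both intercepts sit below the naive pooled mean; a negative SE coefficient, which here contradicted both the funnel plot and Egger's test, was the clue that the moderator itself was broken.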
The same error appears to be in the CT analyses. I haven't done the same correction, but I would guess the bizarre readings (e.g. the outliers of 70x or 200x when comparing PT to CT using these models) would vanish once it is corrected.
So, when correcting the PET and PEESE results, and using the appropriate comparator (Model 1; I forgot to do this for models 2-6 last time), we now get this:
Now the interpretation is much clearer. Rather than "all over the place, but most of the models basically keep the estimate the same", it is instead "across most reasonable ways to correct or reduce the impact of small study effects, you see substantial reductions in effect" (the average across the models is ~60% of the original, not a million miles away from my "50%?" eyeball guess). Moreover, the results permit better qualitative explanation.
On the first level, we can make our model fixed or random effects. Fixed effects are more resilient to publication bias (more on this later), and we indeed find that changing from random effects to fixed effects (i.e. Model 1 to Model 4) reduces the effect size by a bit more than half (a toy sketch below illustrates why).
On the second level, we can elect for different inclusion criteria: we could remove outliers, or exclude follow-ups. The former would be expected to partially reduce small study effects (as outliers will tend to be smaller studies reporting surprisingly high effects), whilst the latter has no obvious directional effect: although one should account for nested outcomes, this would be expected to distort the weights rather than introduce a bias in effect size. Neatly enough, we see outlier exclusion does reduce effect size (Model 2 versus Model 1), but excluding follow-ups does not (Model 3 versus Model 1). Another neat example of things lining up: you would expect fixed effects to give a greater correction than outlier removal (as FE strongly discounts smaller studies across the board, rather than removing a few of the most remarkable ones), and this is what we see (Model 2 vs. Model 4).
Finally, one can deploy a statistical technique to adjust for publication bias. There are a bunch of methods to do this: PET, PEESE, Rücker's limit, p-curve, and selection models. All of these besides the p-curve give a discount to the original effect size (models 7, 13, 19, 25, and 37, versus model 31, which does not).
We can also apply these choices in combination, but essentially all combinations point to a significant downgrade in effect size. Furthermore, the combinations allow us to better explain discrepant findings. Only models 3, 31, 33, 35, 36 give numerically higher effect sizes. As mentioned before, model 3 only excludes follow-ups, so would not be expected to be less vulnerable to small study effects. The others are all p-curve analyses, and p-curves are especially sensitive to heterogeneity: the two p-curves which report discounts are those with outliers removed (Models 32 and 35), supporting this interpretation.
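On the fixed versus random effects point above, here is a toy illustration (hypothetical numbers, nothing to do with the actual dataset) of why the fixed effect pool is more resilient to small-study effects: adding the between-study variance tau^2 to every study's weight flattens the weights, so inflated small studies pull a random effects estimate up much more than a fixed effect one.

```python
import numpy as np

# Hypothetical effects and variances: the three small studies (large variances)
# report much bigger effects than the two large studies.
y = np.array([0.9, 0.8, 0.7, 0.3, 0.25])
v = np.array([0.09, 0.08, 0.07, 0.005, 0.004])

# Fixed effect: pure inverse-variance weighting, dominated by the large studies.
w_fe = 1 / v
fe = np.sum(w_fe * y) / np.sum(w_fe)

# Random effects (DerSimonian-Laird): estimate tau^2 and add it to every
# study's variance, which pushes the weights towards equality.
q = np.sum(w_fe * (y - fe) ** 2)
c = np.sum(w_fe) - np.sum(w_fe**2) / np.sum(w_fe)
tau2 = max(0.0, (q - (len(y) - 1)) / c)
w_re = 1 / (v + tau2)
re = np.sum(w_re * y) / np.sum(w_re)

print(f"fixed effect: {fe:.2f}, random effects: {re:.2f}")  # FE well below RE
```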
With that said, onto Joel's points.

1. Discarding (better: investigating) bizarre results

If we had discussed this beforehand and I had said "Okay, you've made some good points; I'm going to run all the typical tests and publish their results", would you have advised me to not even try, and instead make ad hoc adjustments? If so, I'd be surprised, given that's the direction I've taken you to be arguing I should move away from.
You are correct that I would have wholly endorsed permuting all the reasonable adjustments and seeing what picture emerges. Indeed, I would be (and am) happy with "throwing everything in", even if some combinations can't really work or don't really make much sense (e.g. outlier rejection + trim and fill).
But I would also have urged you to actually understand the results you are getting, and to query results which plainly do not make sense. That we're still seeing the pattern of "initial results reported don't make sense, and I have to repeat a lot of the analysis myself to understand why (and, along the way, find the real story is much more adverse than HLI presents)" is getting depressing.
The error itself for PET and PEESE is no big deal: "I pressed the wrong button once when coding and it messed up a lot of my downstream analysis" can happen to anyone. But these results plainly contradicted the naked eye (they give not only weird PT findings but weird CT findings: by inspection the CT data is basically a negative control for publication bias, yet PET-PEESE typically finds statistically significant discounts) and the closely related Egger's test (disagreeing with respect to sign); moreover, the negative coefficients for the models (meaning the regressions slope in the opposite direction) are printed in the analysis code.
I also find myself inclined to less sympathy here because I didn't meticulously inspect every line of analysis code looking for trouble (my file drawer is empty). I knew the results being reported for these analyses could not be right, so I zeroed in on them expecting there was an error. I was right.
2. Comparators
When I do this, and again remove anything that doesn't produce a discount for psychotherapy, the average correction leads to a 6x cost-effectiveness ratio of PT to CT. This is a smaller shift than you seem to imply.
9.4x → ~6x is a drop of about one third; I guess we could argue about what increment is large or small. But more concerning is the direction of travel, taking the "CT (all)" comparator:
If we do not follow my initial reflex and discard the PT-favouring results, then adding the appropriate comparator and fixing the statistical error roughly halves the original multiple. If we continue excluding the "surely not" positive adjustments, we're still seeing a 20% drop with the comparator, and a further 10% increment with the right results for PT PET/PEESE.
How many more increments are there? There's at least one more: the CT PET/PEESE results are wrong, and they're giving bizarre results in the spreadsheet. I would expect diminishing returns to further checking (i.e. if I did scour the other bits of the analysis, I expect the cumulative error to be smaller or neutral), but the "limit value" of what this analysis would show if there were no errors doesn't look great so far.
Maybe it would roughly settle towards the average of ~60%, so 9.4 * 0.6 = 5.6. Of course, this would still be fine by the lights of HLI's assessment.
3. Cost effectiveness analysis
My complete guess is that if StrongMinds went below 7x GiveDirectly we'd qualitatively soften our recommendation of StrongMinds and maybe recommend bednets to more donors. If it was below 4x we'd probably also recommend GiveDirectly. If it was below 1x we'd drop StrongMinds. This would change if/when we find something much more (idk: 1.5-2x?) cost-effective and better evidenced than StrongMinds.
However, I suspect this is beating around the bush, as I think the point Gregory is alluding to is "look at how much their effects appear to wilt with the slightest scrutiny. Imagine what I'd find with just a few more hours."
If that's the case, I understand why, but that's not enough for me to reshuffle our research agenda. I need to think there's a big, clear issue now to ask the team to change our plans for the year. Again, I'll be doing a full re-analysis in a few months.
Thank you for the benchmarks. However, I mean to beat both the bush and the area behind it.
First things first: I have harped on about the CEA because it is bizarre to be sanguine about significant corrections on the grounds that "the CEA still gives a good multiple" when the CEA itself gives bizarre outputs (as noted before). With these benchmarks, it seems this analysis, on its own terms, is already approaching action relevance: unless you want to stand behind cycling comparators (which the spreadsheet only does for PT and not CT, as I noted last time), then this plus the correction gets you below 7x. Further, if you want to take the SM effects as relative to the meta-analytic results (rather than take their massively outlying values), you get towards 4x (e.g. drop the effect size of both meta-analyses by 40%, then put the SM effect sizes at the upper 95% CI). So there's already a clear motive to investigate urgently in terms of what you are already trying to do.
The other reason is the general point of "well, this important input wilts when you look at it closely; maybe this behaviour generalises". Sadly, we don't really need to "imagine" what I would find with a few more hours: I just did (and on work presumably prepared expecting I would scrutinise it), and I think the results speak for themselves.
The other parts of the CEA are non-linear in numerous ways, so it is plausible that drops of 50% in intercept value lead to greater-than-50% drops in the MRA integrated effect sizes if correctly ramified across the analysis. More importantly, the thicket of the Guesstimate model gives a lot of forking paths: given it seems HLI has clearly had a finger on the scale, you may not need many more relatively gentle (i.e. 10%-50%) pushes upwards to get very inflated "bottom line multipliers".
4. Use a fixed effects model instead?
As Ryan notes, fixed effects are unconventional in general, but reasonable in particular when confronted with considerable small study effects. I think that, even if one had seen publication bias prior to embarking on the analysis, sticking with random effects would have been reasonable.
Gregory,

Thank you for pointing out two errors.

First, the coding mistake with the standard error correction calculation.
Second (and I didn't pick this up in the last comment), the CT effect size change calculations were all referencing the same model, while the PT effect size changes were referencing their non-publication-bias analogs.
After correcting these errors, the picture does shift a bit, but the quantitative changes are relatively small.
Here are the results where only the change due to the publication bias correction adjusts the cost-effectiveness comparison. More of the tests indicate a downwards correction, and the average / median test now indicates an adjustment from 9.4x to 8x. However, when we remove all adjustments that favor PT in the comparison (models 19, 25, 23, 21, 17, 27, 15), the average / median ratio of PT to CT is now 7x / 8x. This is the same as it was before the corrections.
Note: I added vertical reference lines to mark the 3x, 7x and 9.44x multiples.
Next, I present the changes where we include the model choices as publication bias adjustments (e.g., any reduction in effect size that comes from using a fixed effect model or removing outliers is counted against PT; Gregory and Ryan support this approach, and while I'm still unsure, it seems plausible and I'll read more about it). The mean / median adjustment leads to a 6x / 7x comparison ratio. Excluding all PT-favorable results leads to an average / median correction of 5.6x / 5.8x, slightly below the 6x I previously reported.
Note: I added vertical reference lines to mark the 3x, 7x and 9.44x multiples.
Since the second approach bites into the cost-effectiveness comparison more, and to a degree that's worth mentioning if true, I'll read more and raise with my colleagues whether using fixed effect models or discarding outliers are appropriate responses to suspicion of publication bias.
If it turns out this is a more appropriate approach, then I should eat my hat re:
My complete guess is that if StrongMinds went below 7x GiveDirectly we'd qualitatively soften our recommendation of StrongMinds and maybe recommend bednets to more donors.
The issue re comparators is less about how good dropping outliers or using fixed effects are as remedies for publication bias (or how appropriate either would be as an analytic choice here, all things considered), and more about the similarity of these models to the original analysis.
We are not, after all, adjusting or correcting the original meta-regression analysis directly, but rather indirectly inferring the likely impact of small study effects on the original analysis by reference to the impact they have in simpler models.
The original analysis, of course, did not exclude outliers or follow-ups, and used random effects, not fixed effects. So of Models 1-6, Model 1 bears the closest similarity to the analysis being indirectly assessed, and so seems the most appropriate baseline.
The point about outlier removal and fixed effects reducing the impact of small study effects is meant to illustrate that cycling comparators introduces a bias into the assessment instead of just adding noise. Of Models 2-6, we would expect 2, 4, 5 and 6 to be more resilient to small study effects than Model 1, because they either remove outliers, use fixed effects, or both (Model 3 should be ~ a wash). The second figure provides some (further) evidence of this, as (e.g.) the random effects models (hatched) strongly tend to report greater effect sizes than the fixed effect ones, regardless of the additional statistical method.
So noting that the discount from a statistical small-study-effect correction is not so large versus comparators which are already less biased (due to analysis choices contrary to those made in the original analysis) misses the mark.
If the original analysis had (somehow) used fixed effects, these worries would (largely) not apply. Of course, if the original analysis had used fixed effects, the effect size would have been a lot smaller in the first place.
--
Perhaps also worth noting: with a discounted effect size, the overall impact of the intervention becomes very sensitive to linear versus exponential decay of the effect, given the definite integral of the linear method scales with the square of the intercept, whilst for exponential decay the integral is ~linear in the intercept. Although these values line up fairly well at the original intercept value of ~0.5, they diverge at lower values. If (e.g.) the intercept is 0.3, over a 5 year period the exponential method (with correction) returns ~1 SD-years (vs. 1.56 originally), whilst the linear method gives ~0.4 SD-years (vs. 1.59 originally).
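To spell out the scaling claim with generic decay models (not HLI's exact parameterisation): with linear decay at a fixed rate k the benefit runs out at t = d_0/k, so total SD-years go with the square of the intercept, whereas with exponential decay at rate r over a horizon T they are proportional to it:

```latex
\int_{0}^{d_0/k} (d_0 - kt)\,dt \;=\; \frac{d_0^{2}}{2k} \;\propto\; d_0^{2},
\qquad
\int_{0}^{T} d_0\,e^{-rt}\,dt \;=\; \frac{d_0}{r}\left(1 - e^{-rT}\right) \;\propto\; d_0 .
```

Under this stylised version, cutting the intercept from 0.5 to 0.3 cuts the exponential total by 40% but the linear total by 64% (1 - 0.3^2/0.5^2); the exact figures in the original model will differ, but the asymmetry is the point.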
(And, for what it is worth, if you plug corrected SE or squared SE values into the original multilevel meta-regressions, PET/PEESE style, you do drop the intercept by around these amounts, either vs. follow-up alone or the later models which add other covariates.)