Hi Ryan,

Our preferred model uses a meta-regression with follow-up time as a moderator, not the typical “average everything” meta-analysis. Because of my experience presenting the cash transfers meta-analysis, I wanted to avoid people fixating on the forest plot and getting confused, since the simple pooled average is not the takeaway result. But in hindsight I think it probably would have been helpful to include the forest plot somewhere.
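For readers less familiar with this setup, here is a minimal sketch of what “follow-up time as a moderator” means in practice. The numbers are made up, not our data, and a full analysis would normally use a random-effects meta-regression (e.g. metafor’s rma() in R), which also estimates between-study heterogeneity; the simple inverse-variance-weighted regression below is just to show the structure.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical inputs (illustrative only, not the actual studies):
# g  = standardized effect size per study (e.g. Hedges' g)
# se = its standard error
# t  = follow-up time in months
g  = np.array([0.80, 0.55, 0.45, 0.30, 0.25, 0.15])
se = np.array([0.20, 0.17, 0.22, 0.14, 0.25, 0.18])
t  = np.array([1, 3, 6, 12, 18, 24])

# Inverse-variance-weighted regression of effect size on follow-up time.
# The intercept is the modelled effect at t = 0 and the slope is the change
# per month, rather than a single pooled average across all follow-up times.
X = sm.add_constant(t)
fit = sm.WLS(g, X, weights=1.0 / se**2).fit()
print(fit.params)  # [effect at t = 0, change in effect per month]
```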
I don’t have a good excuse for the publication bias analysis. Instead of making a funnel plot, I embarked on a quest to find a more general system for adjusting for biases between intervention literatures. This was, perhaps unsurprisingly, an incomplete piece of work that failed to achieve many of its aims (see Appendix C), although it did lead to a discount of psychotherapy’s effects relative to cash transfers. In hindsight, I see the time spent on that mini project as a distraction. In the future I think we will spend more time on existing quantitative methods for adjusting for publication bias.
Part of the reasoning was that we weren’t trying to do a systematic meta-analysis, but a quicker version on a convenience sample of studies. As we said on page 8: “These studies are not exhaustive (footnote: There are at least 24 studies, with an estimated total sample size of 2,310, we did not extract. Additionally, there appear to be several protocols registered to run trials studying the effectiveness and cost of non-specialist-delivered mental health interventions.). We stopped collecting new studies due to time constraints and the perception of diminishing returns.”
I wasn’t sure whether a funnel plot was appropriate when applied to a non-systematically selected sample of studies. As I’ve said elsewhere, I think we could have made the depth (or shallowness) of our analysis clearer.
“so I do think there was enough time to check a funnel plot for publication bias or odd heterogeneity”
While it’s technically true that there was enough time, it certainly doesn’t feel like it! HLI is a very small research organization (from 2020 through 2021 I was pretty much the lone HLI empirical researcher), and we have to constantly balance exploring new cause areas and searching for interventions against updating and improving previous analyses. It feels like I hit publish on this yesterday. I concede that I could have done better, and I plan to do so in the future, but this balancing act is an art. It sometimes takes conversations like this to put items on our agenda.
FWIW, here are some quick plots I cooked up with the cleaner data. Some obvious remarks:
The StrongMinds-relevant studies (Bolton et al., 2003; Bass et al., 2006) appear to be unusually effective (outliers?).
There appears to be more evidence of publication bias than was the case with our cash transfers meta-analysis (see the last plot).
I also added a p-curve. What you don’t want to see is more studies piled up at the 0.05 significance mark than at the 0.04 mark, but that’s what you see here.

Here are the cash transfer plots for reference:
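For anyone who wants to poke at these diagnostics themselves, here is a minimal sketch of how a funnel plot (with an Egger-style regression line) and a crude p-curve can be generated. The effect sizes and standard errors are made-up placeholders, not the extracted study data.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

# Placeholder data: standardized effects (g) and their standard errors.
g = np.array([0.90, 0.75, 0.60, 0.45, 0.35, 0.30, 0.20, 0.15])
se = np.array([0.35, 0.30, 0.25, 0.20, 0.15, 0.12, 0.08, 0.05])

# Funnel plot: effect size vs. standard error, plus an inverse-variance-
# weighted regression of g on se (one common way to draw an Egger-style line).
fit = sm.WLS(g, sm.add_constant(se), weights=1.0 / se**2).fit()
se_grid = np.linspace(0, se.max(), 50)
plt.figure()
plt.scatter(g, se)
plt.plot(fit.params[0] + fit.params[1] * se_grid, se_grid)
plt.gca().invert_yaxis()  # smallest SEs (largest studies) at the top
plt.xlabel("Effect size (g)")
plt.ylabel("Standard error")
plt.title("Funnel plot")

# Crude p-curve: histogram of the statistically significant p-values.
p = 2 * (1 - stats.norm.cdf(np.abs(g / se)))  # two-sided z-test p-values
plt.figure()
plt.hist(p[p < 0.05], bins=np.arange(0, 0.051, 0.01))
plt.xlabel("p-value")
plt.ylabel("Number of studies")
plt.title("p-curve (p < 0.05 only)")
plt.show()
```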
Thank you for sharing these, Joel. You’ve got a lot going on in the comments here, so I’m only going to make a few brief specific comments and one larger one. The larger one relates to something you’ve noted elsewhere in the thread, which is:
“That the quality of this analysis was an attempt to be more rigorous than most shallow EA analyses, but definitely less rigorous than a quality peer-reviewed academic paper. I think this [...] is not something we clearly communicated.”
This work forms part of the evidence base behind some strong claims from HLI about where to give money, so I did expect it to be more rigorous. I wondered if I was alone in being surprised here, so I ran a very informal (n = 23!) Twitter poll in the EA group asking what people expected regarding the rigor of evidence for charity recommendations. (I fixed my stupid Our World in Data autocorrect glitch in a follow-up tweet.)
I don’t want to lean on this too much, but I do think it suggests that I’m not alone in expecting a higher degree of rigor when it comes to where to put charity dollars. This is perhaps mostly a communication issue, but I also think that as the quality of the analysis and evidence becomes less rigorous, claims should be toned down, or at least the uncertainty (in the broad sense) should be more strongly expressed.
On the specifics, first, I appreciate you noting the apparent publication bias. That’s both important and not great.
Second, I think comparing the cash transfer funnel plot to the other one is informative. The cash transfer one looks “right”: it has the expected shape, and it’s comforting to see that the slope of the Egger regression line is basically zero. This is definitely not the case with the StrongMinds meta-analysis. The funnel plot looks incredibly weird. That could be heterogeneity that we can model, but it should regardless make everyone skeptical, because doing that kind of modelling well is very hard. It’s also rough to see that if we project the Egger regression line back to the origin then the predicted effect when the SE is zero is basically zero. In other words, unwinding publication bias in this way would lead us to guess at a true effect of around nothing. Do I believe that? I’m not sure. There are good reasons to be skeptical of Egger-type regressions, but all of this definitely increases my skepticism of the results. While I’m glad it’s public now, I don’t feel great that this wasn’t part of the very public first cut of the results.
Again, I appreciate you responding. I do think going forward it would be worth taking seriously community expectations about what underlies charity recommendations, and if something is tentative or rough then I hope that it gets clearly communicated as such, both originally and in downstream uses.
Interesting poll, Ryan! I’m not sure how much to take away from it, because I think “epistemic / evidentiary standards” is a pretty fuzzy notion in the minds of most readers. But still, point taken that people probably expect high standards.
“It’s also rough to see that if we project the Egger regression line back to the origin then the predicted effect when the SE is zero is basically zero.”
I’m not sure about that. Here’s the output of the Egger test. If I’m interpreting it correctly, it’s smaller, but not zero. I’ll try to figure out what the p-curve-suggested correction says.
Edit: I’m also not sure how much to trust the Egger test to tell me what the corrected effect size should be, so this wasn’t an endorsement of the view that the real effect size should be halved. It seems different ways of making this correction give very different answers. I’ll add a further comment with more details.
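To make the “different methods, different answers” point concrete, here is a minimal sketch (again with made-up numbers, not the extracted data) comparing two common regression-based corrections: PET, which regresses effects on standard errors and takes the intercept as the estimate for a hypothetical study with SE = 0, and PEESE, which regresses on variances instead. Egger’s regression test is essentially the same model as PET, just reported differently.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder effects and standard errors (hypothetical, not the real studies).
g = np.array([0.90, 0.75, 0.60, 0.45, 0.35, 0.30, 0.20, 0.15])
se = np.array([0.35, 0.30, 0.25, 0.20, 0.15, 0.12, 0.08, 0.05])
w = 1.0 / se**2  # inverse-variance weights

# PET: weighted regression of effect on SE; the intercept is the
# bias-corrected estimate, and the slope mirrors Egger's asymmetry test.
pet = sm.WLS(g, sm.add_constant(se), weights=w).fit()

# PEESE: weighted regression of effect on variance (SE^2).
peese = sm.WLS(g, sm.add_constant(se**2), weights=w).fit()

print(f"Naive inverse-variance average: {np.average(g, weights=w):.3f}")
print(f"PET-corrected estimate:         {pet.params[0]:.3f}")
print(f"PEESE-corrected estimate:       {peese.params[0]:.3f}")
```

On real data these (along with trim-and-fill, selection models, and the like) can give noticeably different corrected estimates, which is the divergence I mean.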
“I do think going forward it would be worth taking seriously community expectations about what underlies charity recommendations, and if something is tentative or rough then I hope that it gets clearly communicated as such, both originally and in downstream uses.”
Seems reasonable.
Fair re: Egger. I just eyeballed the figure.