I’ll donate 5k USD if the Ozler RCT reports an effect size greater than d = 0.4 (2x smaller than HLI’s estimate of ~0.8, and below the bottom 0.1% of their Monte Carlo runs).
An update: This RCT (which should have been the Baird RCT—my apologies for mistakenly substituting Sarah Baird with her colleague Berk Ozler as first author previously) is now out.
I was not specific on which effect size would count, but all relevant[1] effect sizes reported by this study are much lower than d = 0.4: around d = 0.1. I roughly[2] calculate the figures below.
In terms of “SD-years of depression averted” or similar, there are a few different ways you could slice it (e.g. which outcome you use, whether you linearly interpolate, do you extend the effects out to 5 years, etc). But when I play with the numbers I get results around 0.1-0.25 SD-years of depression averted per person (as a sense check, this lines up with an initial effect of ~0.1, which seems to last between 1-2 years).
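The “slicing” described above can be made concrete: linearly interpolate the effect size between survey waves and integrate over time (trapezoidal rule). The effect values below are placeholders chosen for illustration, not the study’s actual estimates:

```python
# "SD-years of depression averted" as the area under the effect-size curve,
# with linear interpolation between measurement waves (trapezoidal rule).
# NOTE: placeholder effects for illustration, not the study's actual figures.
times = [0.0, 1.0, 2.0]      # years since treatment completion
effects = [0.15, 0.10, 0.0]  # standardised effect assumed at each wave

sd_years = sum(
    (effects[i] + effects[i + 1]) / 2 * (times[i + 1] - times[i])
    for i in range(len(times) - 1)
)
# With these placeholders: 0.125 + 0.05 = 0.175 SD-years
```

An initial effect of ~0.1 that decays to zero within 1–2 years lands in the same 0.1–0.25 SD-years range via this arithmetic, which is the sense check mentioned above.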
These are indeed “dramatically worse results than HLI’s [2021] evaluation would predict”. They are also substantially worse than HLI’s (much lower) updated 2023 estimates of StrongMinds. The immediate effects of 0.07-0.16 are at least ~5x lower than HLI’s (2021) estimate of an immediate effect of 0.8; they are 2-4x lower than HLI’s (2023) informed prior of StrongMinds having an immediate effect of 0.39. My calculations of the total effect over time from Baird et al. of 0.1-0.25 SD-years of depression averted are ~10x lower than HLI’s 2021 estimate of 1.92 SD-years averted, and ~3x lower than their most recent estimate of ~0.6.
Baird et al. also comment on the cost-effectiveness of the intervention in their discussion (p18):
Unfortunately, the IPT-G impacts on depression in this trial are too small to pass a cost-effectiveness test. We estimate the cost of the program to have been approximately USD 48 per individual offered the program (the cost per attendee was closer to USD 88). Given impact estimates of a reduction in the prevalence of mild depression of 0.054 pp for a period of one year, it implies that the cost of the program per case of depression averted was nearly USD 916, or 2,670 in 2019 PPP terms. An oft-cited reference point estimates that a health intervention can be considered cost-effective if it costs approximately one to three times the GDP per capita of the relevant country per Disability Adjusted Life Year (DALY) averted (Kazibwe et al., 2022; Robinson et al., 2017). We can then convert a case of mild depression averted into its DALY equivalent using the disability weights calculated for the Global Burden of Disease, which equates one year of mild depression to 0.145 DALYs (Salomon et al., 2012, 2015). This implies that ultimately the program cost USD PPP (2019) 18,413 per DALY averted. Since Uganda had a GDP per capita USD PPP (2019) of 2,345, the IPT-G intervention cannot be considered cost-effective using this benchmark.
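As a sanity check on the quoted arithmetic: converting cost per case of mild depression averted into cost per DALY averted is a single division by the GBD disability weight, which can then be compared against the 1–3x GDP-per-capita benchmark the paper cites (figures below are taken from the quoted passage):

```python
# Re-deriving the paper's cost-effectiveness arithmetic from the quoted figures.
cost_per_case_ppp = 2670    # USD PPP (2019) per case of mild depression averted
daly_weight_mild = 0.145    # GBD disability weight for one year of mild depression
gdp_per_capita_ppp = 2345   # Uganda, USD PPP (2019)

cost_per_daly = cost_per_case_ppp / daly_weight_mild  # ~18.4k, matching the
                                                      # paper's 18,413 up to rounding
threshold = 3 * gdp_per_capita_ppp                    # ~7,035: the 3x GDP/capita bar
cost_effective = cost_per_daly <= threshold           # fails the benchmark
```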
I’m not sure anything more really needs to be said at this point. But much more could be, and I fear I’ll feel obliged to return to these topics before long regardless.
The report describes the outcomes on p.10:

The primary mental health outcomes consist of two binary indicators: (i) having a Patient Health Questionnaire 8 (PHQ-8) score ≤ 4, which is indicative of showing no or minimal depression (Kroenke et al., 2009); and (ii) having a General Health Questionnaire 12 (GHQ-12) score < 3, which indicates one is not suffering from psychological distress (Goldberg and Williams, 1988). We supplement these two indicators with five secondary outcomes: (i) The PHQ-8 score (range: 0-24); (ii) the GHQ-12 score (0-12); (iii) the score on the Rosenberg self-esteem scale (0-30) (Rosenberg, 1965); (iv) the score on the Child and Youth Resilience Measure-Revised (0-34) (Jefferies et al., 2019); and (v) the locus of control score (1-10). The discrete PHQ-8 and GHQ-12 scores allow the assessment of impact on the severity of distress in the sample, while the remaining outcomes capture several distinct dimensions of mental health (Shah et al., 2024).
Measurements were taken following treatment completion (‘Rapid resurvey’), then at 12m and 24m thereafter (midline and endline respectively).
I use both primary indicators and the discrete values of the underlying scores they are derived from. I haven’t carefully looked at the other secondary outcomes nor the human capital variables, but besides being less relevant, I do not think these showed much greater effects.
I.e. I took the figures from Table 6 (comparing IPT-G vs. control) for these measures and plugged them into a webtool for Cohen’s h or d as appropriate. This is rough and ready, although my calculations agree with the effect sizes either mentioned or described in the text. They also pass an ‘eye test’ of comparing them to the CDFs of the scores in Figure 3: these distributions are very close to one another, consistent with a small-to-no effect (one surprising result of this study is that IPT-G + cash led to worse outcomes than either control or IPT-G alone).
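For anyone wanting to replicate the rough-and-ready approach without a webtool, Cohen’s h (for the binary indicators) and Cohen’s d (for the discrete scores) are straightforward to compute directly. The inputs below are placeholder values for illustration, not the actual Table 6 figures:

```python
from math import asin, sqrt

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for two proportions: difference of arcsine-transformed rates."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

def cohens_d(m1: float, sd1: float, n1: int,
             m2: float, sd2: float, n2: int) -> float:
    """Standardised mean difference using the pooled standard deviation."""
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Placeholder inputs, NOT the study's actual Table 6 values:
h = cohens_h(0.55, 0.50)                       # e.g. share with PHQ-8 <= 4 by arm
d = cohens_d(5.0, 4.0, 1000, 5.4, 4.1, 1000)   # e.g. mean PHQ-8 score by arm
```

With the study’s reproducibility package, the same two functions could be run on the actual arm-level proportions, means, SDs, and sample sizes.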
One of the virtues of this study is it includes a reproducibility package, so I’d be happy to produce a more rigorous calculation directly from the provided data if folks remain uncertain.
My view is that HLI[1], GWWC[2], Founders Pledge[3], and other EA / effective giving orgs that recommend or provide StrongMinds as a donation option should ideally at least update their pages on StrongMinds to include relevant considerations from this RCT, and do so well before Thanksgiving / Giving Tuesday in Nov/Dec this year, so donors looking to decide where to spend their dollars most cost-effectively can make an informed choice.[4]

[1] Listed as a top recommendation.

[2] Not currently a recommendation (but included as an option to donate).

[3] Currently tagged as an “active recommendation”.

[4] Acknowledging that HLI’s current schedule is “By Dec 2024”, though this may only give donors 3 days before Giving Tuesday.
Thanks Bruce, would you still think this if StrongMinds ditched their adolescent programs as a result of this study and continued with their core groups with older women?
Yes, because:

1) I think this RCT is an important proxy for StrongMinds (SM)’s performance ‘in situ’, and worth updating on—in part because it is currently the only completed RCT of SM. Uninformed readers who read what is currently on e.g. the GWWC[1]/FP[2]/HLI websites might reasonably get the wrong impression of the evidence base behind the recommendation of SM (i.e. that there are no concerns sufficiently noteworthy to merit inclusion as a caveat). I think the effective giving community should have a higher bar for being proactively transparent here—it is much better to include (at minimum) a relevant disclaimer like this than to be asked questions by donors and claim that there wasn’t capacity to include it.[3]
2) If a SM recommendation is justified as a result of SM’s programme changes, this should still be communicated for trust building purposes (e.g. “We are recommending SM despite [Baird et al RCT results], because …), both for those who are on the fence about deferring, and for those who now have a reason to re-affirm their existing trust on EA org recommendations.[4]
3) Help potential donors make more informed decisions—for example, informed readers who may be unsure about HLI’s methodology and wanted to wait for the RCT results should not have to go search this up themselves or look for a fairly buried comment thread on a post from >1 year ago in order to make this decision when looking at EA recommendations / links to donate—I don’t think it’s an unreasonable amount of effort compared to how it may help. This line of reasoning may also apply to other evaluators (e.g. GWWC evaluator investigations).[5]
GWWC’s website currently says it only includes recommendations after reviewing them through their Evaluating Evaluators work, and their evaluation of HLI did not include any quality checks of HLI’s work itself nor finalise a conclusion. Similarly, they say: “we don’t currently include StrongMinds as one of our recommended programs but you can still donate to it via our donation platform”.

Founders Pledge’s current website says: “We recommend StrongMinds because IPT-G has shown significant promise as an evidence-backed intervention that can durably reduce depression symptoms. Crucial to our analysis are previous RCTs […]”
I’m not suggesting at all that they should have done this by now, only ~2 weeks after the Baird RCT results were made public. But I do think three months is a reasonable timeframe for this.
If there was an RCT that showed malaria chemoprevention cost more than $6000 per DALY averted in Nigeria (GDP/capita * 3), rather than per life saved (ballpark), I would want to know about it. And I would want to know about it even if Malaria Consortium decided to drop their work in Nigeria and EA evaluators continued to recommend Malaria Consortium as a result. How organisations go about communicating updates like this does affect my personal view on how much I should defer to them wrt charity recommendations.
Of course, based on HLI’s current analysis/approach, the ?disappointing/?unsurprising result of this RCT (even if it was on the adult population) would not have meaningfully changed the outcome of the recommendation, even if SM did not make this pivot (pg 66):
Therefore, even if the StrongMinds-specific evidence finds a small total recipient effect (as we present here as a placeholder), and we relied solely on this evidence, then it would still result in a cost-effectiveness that is similar or greater than that of GiveDirectly because StrongMinds programme is very cheap to deliver.
And while I think this is a conversation that has already been hashed out enough on the forum, I do think the point stands—potential donors who disagree with or are uncertain about HLI’s methodology here would benefit from knowing the results of the RCT, and it’s not an unreasonable ask for organisations doing charity evaluations / recommendations to include this information.
Acknowledging that this is DALYs, not WELLBYs! OTOH, this conclusion is not based on the GiveWell or GiveDirectly bar, but on a ~mainstream global health cost-effectiveness standard of ~3x GDP per capita per DALY averted (in this case, the ~$18k USD PPP/DALY averted of SM is well above the ~$7k USD PPP/DALY bar for Uganda).
Nice one Bruce. I think I agree that it should be communicated like you say for reasons 2 and 3
I don’t think this is a good proxy for their main programs though, as this RCT looks at a very different thing than their regular programming. I think other RCTs on group therapy in adult women from the region are better proxies than this study on adolescents.
Why do you think it’s a particularly good proxy? In my mind it’s the same org doing a different treatment (one that seems to work, but only a little and for a short-ish time), albeit with many similarities to their regular treatment, of course.
Like I said a year ago, I would have much rather this had been an RCT on StrongMinds’ regular programs rather than this one on a very different program for adolescents. I understand though that “does similar group psychotherapy also work for adolescents?” is a more interesting question from a researcher’s perspective, although less useful for all of us deciding just how good regular StrongMinds group psychotherapy is.
It sounds like you’re interpreting my claim to be “the Baird RCT is a particularly good proxy (or possibly even better than other RCTs on group therapy in adult women) for the SM adult programme effectiveness”, but this isn’t actually my claim here; and while I think one could reasonably make some different, stronger (donor-relevant) claims based on the discussions on the forum and the Baird RCT results, mine are largely just: “it’s an important proxy”, “it’s worth updating on”, and “the relevant considerations/updates should be easily accessible on various recommendation pages”. I definitely agree that an RCT on the adult programme would have been better for understanding the adult programme.
(I’ll probably check out of the thread here for now, but good chatting as always Nick! hope you’re well)

Nice one, 100% agree, no need to check in again!
Thanks for this Gregory, I think it’s an important result and have updated my views. I’m not sure why HLI were so optimistic about this. I have a few comments here.
This study was performed on adolescents, which is not the core group of women that StrongMinds and other group IPT programs treat. This study might update me slightly negatively on the effect of their core programming with groups of older women, but not by much.
As the study said, “this marked the first time SMU (i) delivered therapy to out-of-school adolescent females, (ii) used youth mentors, and (iii) delivered therapy through a partner organization.”
This result then doesn’t surprise me as (high uncertainty) I think it’s generally harder to move the needle with adolescent mental health than with adults.
The therapy still worked, even though the effect sizes were much smaller than in other studies and the intervention was not cost-effective.
As you’ve said before, if this kind of truly independent research was done on a lot of interventions, the results might not look nearly as good as the original studies.
I think StrongMinds should probably stop their adolescent programs based on this study. Why keep doing it, when your work with adult women currently seems far more cost-effective?
Even with the Covid caveat, I’m stunned at the null/negative effect of the cash transfer arm. Interesting stuff and not sure what to make of it.
I would still love a similar independent study of the regular group IPT programs with older women, and these RCTs should be pretty cheap on the scale of things. I doubt we’ll get that though, as it will probably be seen as too similar and not interesting enough for researchers, which is fair enough.