Author: Alex Cohen, GiveWell Senior Researcher
In a nutshell
The Happier Lives Institute (HLI) has argued in a series of posts that GiveWell should use subjective well-being measures in our moral weights and that if we did, we would find StrongMinds, which provides group-based interpersonal psychotherapy (IPT-G), is as or more cost-effective than marginal funding to our top charities.
This report provides our initial thoughts on HLI’s assessment, based on a shallow review of the relevant literature and considerations.
Our best guess is that StrongMinds is approximately 25% as cost-effective as our marginal funding opportunities. This assessment is based on several subjective adjustments, and we’ve identified limited evidence to discipline these adjustments. As a result, we think a wide range of cost-effectiveness is possible (approximately 5% to 80% as cost-effective as our marginal funding opportunities), and additional research could lead to different conclusions.
The main factors that cause us to believe StrongMinds is less cost-effective than HLI believes it is—and that we are unsure about—are:
Spillovers to household members: It’s possible that therapy for women served by StrongMinds benefits other members of their households. Our best guess is that the evidence HLI uses to estimate spillover effects overestimates spillovers, but we think it’s possible that direct evidence on the effect of StrongMinds on non-recipients could lead to a different conclusion.
Adjustments for internal validity of therapy studies: In our analysis, we apply downward adjustments for social desirability bias and publication bias in studies of psychotherapy. Further desk research and conversations with experts could help inform these adjustments.
Lower effects outside of trial contexts: Our general expectation is that programs implemented as part of randomized trials are higher quality than similar programs implemented at scale outside of trial settings. We expect forthcoming results from two randomized controlled trials (RCTs) of StrongMinds could provide an update on this question.
Duration of effects of StrongMinds: There is evidence for long-term effects for some lay-person-delivered psychotherapy programs but not IPT-G, and we’re skeptical that a 4- to 8-week program like StrongMinds would have benefits that persist far beyond a year. We expect one of the forthcoming RCTs of StrongMinds, which has a 2-year follow-up, will provide an update on this question.
Translating improvements in depression to life satisfaction scores: HLI’s subjective well-being approach requires translating effects of StrongMinds into effects on life satisfaction scores, but studies of psychotherapy generally do not report effects on life satisfaction. We think HLI overestimates the extent to which improvements in depression lead to increases in life satisfaction. However, we think direct evidence on the effect of StrongMinds on life satisfaction scores could update that view.
Comparing StrongMinds to our top charities relies heavily on one’s moral views about the trade-off between StrongMinds’ program and averting a death or improving a life. As a result, we also think it’s important to understand what HLI’s estimates imply about these trade-offs at a high level, in addition to considering the specific factors above. HLI’s estimates imply, for example, that a donor would pick offering StrongMinds’ intervention to 20 individuals over averting the death of a child, and that receiving StrongMinds’ program is 80% as good for the recipient as an additional year of healthy life. We’re uncertain about how much weight to put on these considerations, since these trade-offs are challenging to assess, but they seem unintuitive to us and further influence our belief that StrongMinds is less cost-effective than HLI’s estimates.
We may conduct further work on this topic, but we’re uncertain about the timeline because even under optimistic scenarios, StrongMinds is less cost-effective than our marginal funding opportunities. If we did additional work, we would prioritize reviewing two forthcoming RCTs of StrongMinds and conducting additional desk research and conversations with subject matter experts to try to narrow our uncertainty on some of the key questions above.
Background
Group interpersonal psychotherapy (IPT-G) is a time-limited course of group therapy sessions that aims to treat depression. StrongMinds implements an IPT-G program that specifically targets women with depression; participants meet with a facilitator in groups of five to ten, on average, for 90 minutes one to two times per week for four to eight weeks.[1]
The Happier Lives Institute (HLI) has argued in a series of reports and posts that if we were to assess charities we recommend funding based on their impact on subjective well-being measures (like life satisfaction scores), StrongMinds’ interpersonal group psychotherapy program would be competitive with Against Malaria Foundation (AMF), one of our top charities. [2]
This report is intended to share our view of HLI’s assessment and StrongMinds’ cost-effectiveness, based on a shallow review of the relevant literature and considerations. It incorporates feedback we’ve received from HLI on a previous (unpublished) draft of our work and follow-up research HLI has done since then.
HLI’s estimate of the cost-effectiveness of StrongMinds
HLI estimates the effect of StrongMinds compared to AMF, one of our top charities, by measuring the impact of both on life satisfaction scores.[3] Life satisfaction scores measure how people respond, on a scale from 0-10, to the question, “All things considered, how satisfied are you with your life as a whole these days?”[4]
HLI estimates that psychotherapy from StrongMinds creates 77 life satisfaction point-years (or WELLBYs) per $1,000 spent.[5] Summary of its calculations:
Main effect: It estimates that StrongMinds’ program increases mental health scores among recipients of psychotherapy by 1.69 standard deviation (SD)-years.[6] This is based on combining estimates from studies of programs similar to StrongMinds in low- and middle-income countries (Thurman et al. 2017, Bolton et al. 2007, Bolton et al. 2003), studies of StrongMinds, and a meta-regression of indirect evidence.[7]
Internal and external validity: In its meta-regression of indirect evidence, HLI includes adjustments for internal validity (including publication bias) and external validity (proxied by geographic overlap). These are reported relative to cash transfer studies.[8] It notes that social desirability bias and concerns about effects being smaller at larger scale are excluded.[9] These internal and external validity adjustments lead to a 90% discount for therapy relative to cash transfers.[10]
Spillovers: HLI estimates that non-recipients in the recipient’s household see 53% of the benefits of psychotherapy from StrongMinds and that each recipient lives in a household of 5.85 individuals.[11] This is based on three studies (Kemp et al. 2009, Mutamba et al. 2018a, and Swartz et al. 2008) of therapy programs where recipients were selected based on negative shocks to children (e.g., an automobile accident, nodding syndrome, psychiatric illness).[12] This leads to a 3.6x multiplier on the recipient’s benefit once other household members are included.[13] The total benefit to the household is 6.06 SD-years.[14]
Translating depression scores into life satisfaction: To convert improvements in depression scores, measured in SDs, into life satisfaction scores, HLI assumes (i) a 1 SD improvement in depression scores is equivalent to a 1 SD improvement in life satisfaction scores and (ii) a 1 SD improvement in life satisfaction equals 2.17 points.[15] This 2.17 estimate is based on data from low-income countries, which find an average SD in life satisfaction scores of 2.37, and data from high-income countries, which find an average SD of 1.86.[16]
Cost: HLI estimates that StrongMinds costs $170 per recipient.[17]
Cost-effectiveness: This implies 77 life satisfaction point-years per $1,000 spent.[18]
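As a check on the arithmetic, the short sketch below reproduces HLI’s headline figure from the inputs above. This is our illustrative reconstruction in Python, not HLI’s actual spreadsheet, and the variable names are ours:

```python
# Illustrative reconstruction of HLI's StrongMinds estimate (our sketch, not HLI's spreadsheet).
recipient_effect_sd_years = 1.69  # main effect on the recipient, in SD-years
household_size = 5.85             # individuals per recipient household
spillover_share = 0.53            # non-recipients receive 53% of the recipient's benefit
sd_to_points = 2.17               # life satisfaction points per SD
cost_per_recipient = 170          # USD

household_multiplier = 1 + (household_size - 1) * spillover_share  # ~3.57
total_sd_years = recipient_effect_sd_years * household_multiplier  # ~6.03 (HLI reports 6.06)
wellbys_per_recipient = total_sd_years * sd_to_points              # ~13.1
print(round(wellbys_per_recipient / cost_per_recipient * 1000))    # 77 WELLBYs per $1,000
```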
HLI estimates that AMF creates 81 life satisfaction point-years (or WELLBYs) per $1,000 spent. Note: This is under the deprivationist framework, assuming a “neutral point” of 0.5 life satisfaction points. HLI’s report presents estimates using a range of views about the badness of death and a range of neutral points; it does not share an explicit view about the value of saving a life. We benchmark against this approach because we think it is what we would use and it seems closest to our current moral weights. Summary of HLI’s calculations:
- Using a deprivationist framework, HLI estimates the value of averting a death using the following formula: ([average life satisfaction score out of 10 during remaining years of life] minus [score out of 10 on the life satisfaction scale that is equivalent to death]) times [years of life gained due to the death being averted].[19]
- Average life satisfaction is 4.95/10. The average age of death from malaria is 20 and counterfactual life expectancy is 70, giving 50 extra years of life. As a result, WELLBYs gained equal (4.95 − neutral point) × (70 − 20).[20]
- HLI cites a cost per death averted of $3,000 for AMF.[21]
- With a neutral point of 0.5, this would be approximately 74 WELLBYs (life satisfaction point-years) per $1,000 spent.[22] (Note: HLI presents a range of neutral points. We chose 0.5 since that’s the neutral point we’re using.)
- HLI also adds grief effects of 2.4 WELLBYs per $1,000 spent and income-increasing effects of 4 WELLBYs per $1,000 spent.[23]
- This yields a bottom line of approximately 80 WELLBYs per $1,000 spent for a neutral point of 0.5.[24]
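Putting HLI’s AMF inputs together (again our illustrative reconstruction, not HLI’s spreadsheet):

```python
# Illustrative reconstruction of HLI's AMF estimate under deprivationism.
average_life_satisfaction = 4.95  # on a 0-10 scale
neutral_point = 0.5               # the neutral point we benchmark against
years_gained = 70 - 20            # death averted at age 20, counterfactual life expectancy 70
cost_per_death_averted = 3000     # USD, per HLI's citation of GiveWell

wellbys_per_death = (average_life_satisfaction - neutral_point) * years_gained  # 222.5
wellbys_per_1000 = wellbys_per_death / cost_per_death_averted * 1000            # ~74.2
wellbys_per_1000 += 2.4 + 4.0  # grief and income-increasing effects per $1,000
print(round(wellbys_per_1000, 1))  # 80.6, i.e., approximately 80
```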
HLI estimates cash transfers from GiveDirectly create 8 life satisfaction point-years (or WELLBYs) per $1,000 spent.[25]
Our assessment
Overall, we estimate StrongMinds is approximately 25% as cost-effective as marginal funding to AMF and our other top charities. However, this is based on several subjective adjustments, and we think a wide range of cost-effectiveness is possible (approximately 5% to 80% as cost-effective as our marginal dollar).
Best guess on cost-effectiveness
We put together an initial analysis of StrongMinds’ cost-effectiveness under a subjective well-being approach, based on HLI’s analysis.
We estimate StrongMinds is roughly 25% as cost-effective as our marginal funding to AMF and other top charities. Compared to HLI, we estimate lower spillover effects and stricter downward adjustments for social desirability bias, publication bias, and lower effects at scale. This is partially counterbalanced by estimating lower costs and lower cost-effectiveness of AMF under a subjective well-being approach, compared to HLI.
A summary of HLI’s estimates vs. our view:
Our best guess is that StrongMinds leads to 17 life satisfaction point-years (or WELLBYs) per $1,000 spent. Summary of our calculations:
- Main effect: HLI’s estimate of the main effect of IPT-G on depression scores, in SDs, is roughly similar to our estimate, which was based on a shallower review of the literature.[26] We’re uncertain about these estimates and think it’s possible they could change if we prioritized a more in-depth review. (We describe this further below.)
- Internal and external validity adjustments: We include downward adjustments for three additional factors that are not fully incorporated into HLI’s estimates: social desirability bias, publication bias (beyond HLI’s own adjustment), and lower effects at scale. These are subjective guesses, but we believe they’re worth including to make StrongMinds comparable to top charities and other funding opportunities. (We describe each further below.)
- Duration of benefits: There is evidence of long-term effects for some lay-person-delivered psychotherapy programs but not for IPT-G, and we’re skeptical that a 4- to 8-week program like StrongMinds would have benefits that persist far beyond a year. We also expect some of the internal and external validity adjustments we apply would lead to a shorter duration of effects. We apply an 80% adjustment factor for this. (We describe this further below.)
- Spillovers: We roughly double the effects to account for spillovers to other household members. This is lower than HLI’s adjustment for spillovers, reflecting our assumptions that the evidence HLI uses to estimate spillovers may overestimate them and that household size is lower than HLI estimates. (We describe these further below.)
- Translating depression scores into life satisfaction: We apply a slight discount (90% adjustment) to account for improvements in depression scores translating less than 1:1 into improvements in life satisfaction scores, and a further discount (80% adjustment) to account for individuals participating in StrongMinds having depression at baseline and therefore a more concentrated distribution of life satisfaction scores. (We describe this further below.)
- Cost: StrongMinds cited a figure of $105 per person treated in 2022.[27] We use that more recent figure instead of the $170 per person used by HLI.
- Cost-effectiveness: This implies 17 life satisfaction point-years per $1,000 spent.
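For illustration only, the sketch below strings the adjustments described above together. This report does not state an explicit adjustment for lower effects at scale, so the 0.65 used here is a hypothetical placeholder included only to show how the pieces combine; the figure in our actual analysis may differ:

```python
# Illustrative only; NOT our actual cost-effectiveness analysis.
recipient_effect_sd_years = 1.69  # HLI's main effect, roughly similar to our estimate

duration_adj = 0.80             # shorter duration of benefits (described below)
social_desirability_adj = 0.80  # self-reporting bias (described below)
publication_bias_adj = 0.85     # additional publication bias adjustment (described below)
scale_adj = 0.65                # HYPOTHETICAL placeholder for lower effects at scale

household_multiplier = 2.0      # we roughly double effects for household spillovers
depression_to_ls_adj = 0.90     # depression improvements translate <1:1 to life satisfaction
baseline_sd_adj = 0.80          # recipients' life satisfaction distribution is more concentrated
sd_to_points = 2.17             # life satisfaction points per SD
cost_per_recipient = 105        # USD, StrongMinds' 2022 figure

adjusted_effect = (recipient_effect_sd_years * duration_adj * social_desirability_adj
                   * publication_bias_adj * scale_adj)
wellbys_per_recipient = (adjusted_effect * household_multiplier
                         * depression_to_ls_adj * baseline_sd_adj * sd_to_points)
print(round(wellbys_per_recipient / cost_per_recipient * 1000))  # ~18, near the headline 17
```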
We estimate marginal funding to AMF creates 70 life satisfaction point-years per $1,000 spent. We define marginal funding to AMF as funding that is roughly 10 times as cost-effective as unconditional cash transfers through GiveDirectly (our current bar for funding opportunities). This is similar to marginal funding to our other top charities.[28] Our assumptions are similar to HLI’s on life satisfaction point-years per death averted from AMF. When we input this into our current cost-effectiveness analysis, we find an effect of 70 life satisfaction point-years per $1,000 spent.
We also estimate that GiveDirectly creates 8 life satisfaction point-years per $1,000 spent, which is similar to HLI’s estimate. This is largely because we rely on HLI’s meta-analysis of the effect of cash transfers on life satisfaction scores.
Key uncertainties and judgment calls that we would like to explore further
The cost-effectiveness of StrongMinds relies on several judgment calls that we’re uncertain about. We would like to explore these further.
Spillover effects to other household members
It’s possible that improvements in well-being of StrongMinds participants lead to improvements in well-being of other individuals in their household. We had excluded these benefits in our initial analysis, and HLI’s work updated us toward believing these could be a substantial part of the benefits of therapy.
To estimate spillover effects of StrongMinds, HLI relies on three studies that measure spillovers from therapy given to caregivers or children with severe health issues:[29]
Mutamba et al. 2018a: This non-randomized study measured the effect of therapy for caregivers of children with nodding syndrome. Spillovers were assessed by comparing the effect of therapy on caregivers and on the children with nodding syndrome.
Kemp et al. 2009: This trial measured the effect of eye movement desensitization and reprocessing for post-traumatic stress disorder following a motor vehicle accident among children 6-12 years old. Spillovers were assessed by comparing mental health among the children vs. their parents.
Swartz et al. 2008: This trial measured the effect of interpersonal therapy for mothers of children with psychiatric illness. Spillovers were assessed by comparing effects on the mothers with effects on their children.
These three studies find a household member effect of 0.35 SDs (95% confidence interval, −0.04 to 0.74), compared to 0.66 SDs for the recipient (95% confidence interval, 0.35 to 0.97),[30] or a benefit that’s 53% as large as the recipient’s.
We think it’s possible this evidence overstates the spillovers we would expect from StrongMinds, though these concerns are speculative:
Mutamba et al. 2018a and Swartz et al. 2008 were oriented specifically toward caregivers having better relationships with children with severe health conditions.[31] As a result, we might expect the effects on those children to be larger than on children of StrongMinds recipients, since StrongMinds participants may not focus as intensively on relationships with household members.
Mutamba et al. 2018a and Swartz et al. 2008 look at therapy provided to caregivers of children with severe health issues and measure spillovers to those children. It seems plausible that this would produce a larger spillover effect than StrongMinds, since those children may rely on the caregiver more intensively for care, and therefore be more affected by changes in that caregiver’s depression scores, than household members of typical StrongMinds participants.
In Mutamba et al. 2008 and Swartz et al. 2008, it seems possible children may also experience higher rates of treatment for their psychiatric conditions as a result of caregivers receiving therapy, which would confer direct benefits to non-recipients (in this case, children with nodding syndrome or psychiatric illness).
In addition, a recent blog post points out that the results of Kemp et al. 2009 show that parents’ depression scores increased, rather than decreased, which should lower HLI’s estimates. HLI notes in a comment on this post that updating for this error lowers the spillover effect to 38%.[32]
We also did a shallow review of correlational estimates and found a range of 5% to 60% across studies. We haven’t reviewed these studies in depth but view them as illustrative of a range of possible effect sizes.
- Das et al. 2008 estimates that a one standard deviation change in the mental health of household members is associated with a 22% to 59% of a standard deviation change in own mental health, across a sample of low- and middle-income countries.[33]
- Powdthavee and Vignoles 2008 find that a one standard deviation increase in parents’ mental distress in the previous year lowers life satisfaction in the current year by 25% of a standard deviation for girls, using a sample from the UK. Effects are smaller for boys.[34]
- Mendolia et al. 2018 find a one standard deviation increase in a partner’s mental health score leads to a 5% of a standard deviation increase in own life satisfaction, using data from Australia.[35]
These correlations may overestimate the extent of spillovers if shocks common to the household drive the correlation, if households match assortatively on life satisfaction, or if life satisfaction is transmitted genetically within households. The authors control for some confounders (e.g., consumption, physical health indicators),[36] but it’s possible unobserved differences drive the correlation, and the controls do not address assortative matching or genetic transmission. On the other hand, measurement error in life satisfaction scores could bias the relationship downward.
Our best guess is that spillovers to other household members are 15% of the recipient’s benefit, but we don’t feel very confident and think new research could update us a lot.
We think additional research could lead us to higher or lower estimates of these spillover effects.
We would be interested in exploring ways to fund further research to understand the extent to which improvements in depression scores or life satisfaction measures of StrongMinds participants lead to improvements in these outcomes for others in the household.
Household size
The extent of spillovers also depends on the number of individuals in StrongMinds participants’ households.
HLI estimates household size using data from the Global Data Lab and UN Population Division.[37] Based on these data, they estimate a household size of 5.9[38] in Uganda. This appears to be driven by the Global Data Lab’s high estimates for rural household size (6.3 in rural areas in 2019).[39] A recent Uganda National Household Survey, on the other hand, estimates a household size of 4.8 in rural areas.[40]
We’re not sure what’s driving differences in estimates across these surveys, but our best guess is that household size is smaller than the 5.9 estimate HLI is using.
We would also be interested in understanding what may be driving differences in estimates and whether it’s possible to collect data directly on household size for women participating in StrongMinds program, since these data could potentially be collected cheaply and provide a meaningful update on the extent of spillovers.
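Because the implied multiplier depends on both the spillover share and household size, the small sketch below (our illustration, using the formula from HLI’s analysis) shows how much these two inputs matter:

```python
# Sensitivity of the household spillover multiplier to its two inputs.
# Formula from HLI's analysis: 1 + (household size - 1) * spillover share.
def household_multiplier(household_size: float, spillover_share: float) -> float:
    """Total household benefit as a multiple of the recipient's own benefit."""
    return 1 + (household_size - 1) * spillover_share

print(round(household_multiplier(5.85, 0.53), 2))  # 3.57: HLI's original estimate
print(round(household_multiplier(5.85, 0.38), 2))  # 2.84: after the Kemp et al. 2009 correction
print(round(household_multiplier(4.80, 0.15), 2))  # 1.57: smaller household, our 15% best guess
```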
Effect of depression on life satisfaction
HLI’s estimates of the effect of StrongMinds are in terms of SD-years of improvement in mental health measures like depression scores.[41]
To translate these measures into life satisfaction, HLI assumes that (i) 1 SD improvement in depression scores is equivalent to 1 SD improvement in life satisfaction scores in trials of the effect of psychotherapy programs similar to StrongMinds and (ii) the SD in life satisfaction among StrongMinds recipients is equal to the SD in life satisfaction among a pooled average of individuals in low-, middle-, and high-income countries.[42]
We apply a 90% adjustment to assumption (i) to account for the possibility that improvements in depression scores do not translate 1:1 to improvements in life satisfaction scores. This is based on:
- HLI’s review of five therapeutic interventions that report effects on both subjective well-being and depression measures, which finds a ratio of effects of 0.89.[43]
- Our prior that there may be conceptual reasons why depression scores and life satisfaction might not map 1:1 onto each other. The highest/worst values on depression scales correspond to the most severe cases of depression, and it seems likely that the highest score on a depression scale would correspond to a life satisfaction score of 0. It’s less clear that this applies at the other end of the scale: those with low scores on depression scales have an absence of depressive symptoms,[44] and while these individuals likely have higher life satisfaction, it’s not obvious that the absence of depressive symptoms corresponds to a life satisfaction of 10. If the absence of depression means people are completely satisfied with their lives, then it makes sense to scale in this way. But if high life satisfaction requires not just the absence of depressive symptoms but something more, this approach seems less plausible. We’re uncertain about this line of reasoning, though, and would be interested in direct evidence on the effect of IPT-G or similar programs on life satisfaction measures.
We apply an 80% adjustment to assumption (ii). This is because the SD in life satisfaction scores is likely lower among StrongMinds recipients, who are screened for depression at baseline[45] and are therefore probably more concentrated at the lower end of the life satisfaction distribution than the average individual.
For both of these adjustments, we’re unsure about any non-linearities in the relationship between improvements in depression and improvements in life satisfaction. For example, it’s possible that going from severely depressed to moderately depressed leads to a larger than 1:1 increase in life satisfaction measures.
Because adjustments here are highly subjective, we would be open to considering collecting more evidence on this question. A potential approach would be to include surveys on life satisfaction, in addition to depression measures and other mental health scores, in subsequent studies of IPT-G. It may also be possible to explore whether existing datasets allow estimating SD in life satisfaction separately for individuals classified as depressed at baseline vs. not.
Effects at scale and outside of trial contexts
HLI does not include discounts for StrongMinds having a smaller effect when implemented at a larger scale and outside of trial contexts.
Our general expectation is that programs implemented as part of randomized trials are higher quality than similar programs implemented at scale. We would guess that the dominant factor here is that it is difficult to maintain quality of services as an organization scales. This seems particularly relevant for a program that relies on trained practitioners, such as interpersonal group therapy. It’s plausible that as StrongMinds scales up its program in the real world, the quality of implementation will decrease relative to the academic trials.[46]
For example, HLI notes that StrongMinds uses a reduced number of sessions and slightly reduced training compared to Bolton et al. 2003, the study its program is based on.[47] We think this type of modification could reduce program effectiveness relative to what is found in trials. However, we have not done a side-by-side comparison of training, session duration, or other program characteristics of StrongMinds vs. the other programs in HLI’s full meta-analysis.
We can also see some evidence for lower effects in larger trials:
Thurman et al. 2017 finds no effect of IPT-G and suggests that one reason might be that the study was conducted following rapid program scale-up (unlike the Bolton et al. 2003 and 2007 trials).[48]
In the studies included in HLI’s meta-analysis, larger trials tend to find smaller effects.[49] This could be consistent with either smaller effects at scale or publication bias.
We would be eager to see studies that measure the effect of StrongMinds, as it is currently implemented, directly and think the ongoing trial from Baird and Ozler will provide a useful data point here.[50] We’re also interested in learning more about the findings from StrongMinds’ RCT and how the program studied in that RCT compares to its typical program.[51]
In addition, we would be interested in understanding how StrongMinds’ costs might change if it were to expand (see below).
Social desirability bias
One major concern we have with these studies is that participants might report a lower level of depression after the intervention because they believe that is what the experimenter wants to see (more detail in footnote).[52] This is a concern because depression outcomes in therapy studies are self-reported (i.e., participants answer questions about their own mental health before and after the intervention).
HLI responded to this criticism and noted that studies that try to assess experimenter-demand effects typically find small effects.[53] These studies test how much responses to questions about depression scores (or other mental health outcomes) change when the surveyor says they expect a particular response (either higher or lower score).
We’re not sure these tests would resolve this bias, so we still include a downward adjustment (80% adjustment factor).
Our guess is that individuals who have gone through IPT-G programs would still be inclined to report having lower depression scores and better mental health on a survey that is related to the IPT-G program they received. If the surveyor told them they expected the program to worsen their mental health or improve their mental health, it seems unlikely to overturn whatever belief they had about the program’s expected effect that was formed during their group therapy sessions. This is speculative, however, and we don’t feel confident in this adjustment.
We also guess this would not detect bias arising from individuals changing their responses in order to receive treatment subsequently (or allowing others to do so),[54] though we’re unsure how important this concern is or how typical this is of other therapy interventions included in HLI’s meta-analysis.
We would be interested in speaking more to subject matter experts about ways to detect self-reporting bias and understand possible magnitude.
Publication bias
HLI’s analysis includes a roughly 10% downward adjustment for publication bias in the therapy literature relative to cash transfers literature.[55] We have not explored this in depth but guess we would apply a steeper adjustment factor for publication bias in therapy relative to our top charities.
After publishing its cost-effectiveness analysis, HLI published a funnel plot showing a high level of publication bias, with well-powered studies finding smaller effects than less-well-powered studies.[56] This is qualitatively consistent with a recent meta-analysis of therapy finding a publication bias of 25%.[57]
We roughly estimate an additional 15% downward adjustment (85% adjustment) to account for this bias. We may look into this further if we prioritize more work on StrongMinds by speaking to researchers, HLI, and other experts and by explicitly estimating publication bias in this literature.
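As a rough consistency check, if HLI’s roughly 10% adjustment and our additional 15% adjustment are treated as multiplicative, the combined discount is

$$(1 - 0.10) \times (1 - 0.15) = 0.90 \times 0.85 = 0.765,$$

i.e., roughly 23%, in the same ballpark as the 25% publication bias found in the recent meta-analysis.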
Main effect of StrongMinds
We undertook a shallow review of the evidence for IPT-G prior to reviewing HLI’s analysis. Because we ended up with a similar best guess on effect size,[58] we did not prioritize further review. There are several key questions we have not investigated in depth:
How much weight should we put on different studies? In our shallow review, we relied on three RCTs that tested the impact of IPT-G on depression in low- and middle-income country contexts (Bolton et al. 2003, Bolton et al. 2007, Thurman et al. 2017) and a broader meta-analysis of psychotherapy programs across low-income and non-low-income countries. HLI uses Thurman et al. 2017, Bolton et al. 2007, and Bolton et al. 2003, studies of StrongMinds, and a meta-regression of indirect evidence.[59] We’re not sure how much weight to give these pieces of evidence when generating a best guess for the effect of StrongMinds in the countries where it would operate with additional funding.
How do Bolton et al. 2003, Bolton et al. 2007, Thurman et al. 2017, and the programs included in HLI’s meta-analysis differ from StrongMinds’ program, in terms of the intervention delivered and the target population, and how should that affect how much we generalize their results? A key part of assessing how much weight to put on different trials is the similarity of program type and target population. We have not done a thorough review of study populations (e.g., the extent to which different trials targeted women with depression at baseline, as StrongMinds does) and programs (e.g., number of sessions, level of training, group size, etc.).[60]
What interventions did control groups receive? It’s possible that counterfactual treatment varied across studies. If control groups received some type of effective treatment, this could bias effects downward.
Would we come to the same conclusions if we replicated HLI’s meta-analysis? We have not vetted its analysis of the studies it uses in its meta-analysis,[61] and it’s possible further work could uncover changes. We think it’s possible that further review of these questions could lead to changes in our best guess on the main effects of StrongMinds and similar programs on depression scores.
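For reference, the weighted average behind our shallow-review best guess (detailed in footnote 58 above), with 1.1 SDs the average across the three Sub-Saharan Africa IPT-G trials and 0.4 SDs our read of the broader meta-analyses, is

$$0.60 \times 1.1 + 0.40 \times 0.4 = 0.82 \approx 0.75 \times 1.1,$$

i.e., roughly a 25% discount relative to the IPT-G trials alone.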
Durability of benefits
HLI estimates durability of benefits by fitting a decay model of the effects of therapy programs over time, based on studies of the effect of therapy at different follow-up periods.[62] HLI estimates that effects persist up to five years, based on programs it deems similar to StrongMinds.[63] To the best of our knowledge, there are no long-term follow-up studies (beyond 15 months) of IPT-G in low-income countries specifically.[64]
We do think it’s plausible that lay-person-delivered therapy programs can have persistent long-term effects, based on recent trials by Bhat et al. 2022 and Baranov et al. 2020.
However, we’re somewhat skeptical of HLI’s estimate, given that it seems unlikely to us that a time-limited course of group therapy (4-8 weeks) would have such persistent effects. We also guess that some of the factors that cause StrongMinds’ program to be less effective than programs studied in trials (see above) could also limit how long the benefits of the program endure. As a result, we apply an 80% adjustment factor to HLI’s estimates.
We view this adjustment as highly speculative, though, and think it’s possible we could update our view with more work. We also expect the forthcoming large-scale RCT of StrongMinds in Uganda by Sarah Baird and Berk Ozler, which will measure follow-up at 2 years, could provide an update to these estimates.[65]
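To illustrate why the duration assumption matters so much, consider a toy decay model (our illustration only; HLI fits its own decay model to follow-up data, and the 0.8 SD initial effect below is a hypothetical input): under linear decay, total SD-years scale one-for-one with the assumed duration.

```python
# Toy decay model, for illustration only (HLI fits its own model to follow-up data).
def total_sd_years(initial_effect_sd: float, duration_years: float) -> float:
    """SD-years if the effect decays linearly from initial_effect_sd to zero."""
    return initial_effect_sd * duration_years / 2

print(total_sd_years(0.8, 5.0))  # 2.0 SD-years if effects persist for 5 years
print(total_sd_years(0.8, 2.0))  # 0.8 SD-years if effects fade within 2 years
```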
Costs
HLI’s most recent analysis includes a cost of $170 per person treated by StrongMinds, but StrongMinds cited a 2022 figure of $105 in a recent blog post and said it expects costs to decline to $85 per person treated by the end of 2024.[66]
We would be interested in learning more about StrongMinds’ costs and understanding what may be driving fluctuations over time and whether these are related to program impact (e.g., in-person or teletherapy, amount of training).
We are also interested in understanding whether to value volunteers’ time, since StrongMinds relies on volunteers in many of its delivery models.[67]
Additional considerations
- Potential unmodeled upsides: There may be additional benefits of StrongMinds’ programs that our model excludes. An example is advocacy for improved government mental health policies by StrongMinds recipients.[68]
- Implications for cost-effectiveness of other programs: HLI’s analysis has updated us toward putting more weight on household spillover effects of therapy and toward considering putting at least some weight on subjective well-being measures in assessing the benefits of therapy. It’s possible that incorporating both of these could cause us to increase our cost-effectiveness estimates for other morbidity-averting programs (e.g., cataract surgery, fistula surgery, clubfoot treatment), since they may also benefit from within-household spillovers or look better under a subjective well-being approach.
- Measures of grief: We have not prioritized an in-depth review of the effects of grief on life satisfaction, since this seems relatively unlikely to change the bottom line of our current analysis. We are taking HLI’s estimates at face value for now.
Plausibility
Stepping back, we also think HLI’s estimates of the cost-effectiveness of StrongMinds seem surprisingly large compared to other benchmarks, which gives us some additional reservations about this approach.
Because we’re uncertain about the right way to trade off improvements in subjective well-being from therapy vs. other interventions like cash transfers and averting deaths, we think it’s useful to compare against other perspectives. This includes:
- Examining the trade-offs between offering StrongMinds, averting a death, and providing unconditional cash transfers: HLI’s estimates imply that offering StrongMinds to 17 recipients is as valuable as averting a death from malaria, and that offering it to 19 recipients is as valuable as averting an under-5 death from malaria.[69] HLI’s estimates also imply that providing someone a $1,000 cash transfer would be less valuable to them than offering StrongMinds.[70] (We sketch the arithmetic after this list.) These feel like unintuitive trade-offs that beneficiaries and donors would be unlikely to want to make. We acknowledge that this type of reasoning has limitations: it’s not obvious we should defer to whatever trade-offs we’d expect individuals to make (even if we knew individuals’ preferences) or that individuals are aware of what would make them best off (e.g., individuals might not prefer bed nets at the rate our cost-effectiveness analysis implies).
- Comparing the benefits of IPT-G to an additional year of life: HLI’s estimates imply that receiving IPT-G is roughly 40% as valuable as an additional year of life per year of benefit, or 80% as valuable as an additional year of life in total.[71] This feels intuitively high.
- Comparing HLI’s estimates of grief from deaths averted to the impact of StrongMinds: HLI estimates that each death prevented by AMF averts a loss of 7 WELLBYs from grief.[72] It estimates that StrongMinds provides a gain of 13 WELLBYs per person treated. That is, treating one person with therapy is credited with nearly twice the benefit of sparing a family the grief of a death, which seems intuitively high. We have not dug deeply into evidence on grief effects on life satisfaction, but the current comparison seems implausible to us.
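The arithmetic behind these comparisons, reconstructed from HLI’s headline figures above: 77 WELLBYs per $1,000 at $170 per person treated implies roughly 13.1 WELLBYs per recipient, so

$$17 \times 13.1 \approx 222.5 = (4.95 - 0.5) \times 50 \quad \text{(WELLBYs from averting a malaria death)},$$

and, for the recipient’s own benefit relative to one year of life,

$$\frac{1.69 \times 2.17}{4.95 - 0.5} \approx \frac{3.67}{4.45} \approx 0.82.$$

The grief comparison follows directly: 13.1 WELLBYs per person treated vs. 7 WELLBYs of grief averted per death prevented.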
We’re uncertain about how much weight to put on these considerations, since these trade-offs are challenging to assess, but they seem unintuitive to us and influence our belief that StrongMinds is less cost-effective than HLI estimates.
Cost-effectiveness using our moral weights, rather than a subjective well-being approach
The above analysis focuses on estimating cost-effectiveness under a subjective well-being approach. Our bottom line is similar if we use our current moral weights.
HLI argues for using subjective well-being measures like life satisfaction scores to compare outcomes like averting a death, increasing consumption, and improving mental health through psychotherapy.
We think there are important advantages to this approach:
- Subjective well-being measures provide an independent approach to assessing the effect of charities. The subjective well-being approach relies on measures that we have not previously used (i.e., life satisfaction scores) and provides a distinct and coherent methodology for establishing the goodness of different interventions. How we value charities that improve different outcomes, such as increasing consumption or averting death, is one of the most uncertain parts of our process for selecting cost-effective charities, and we believe we will reach more robust decisions if we consider multiple independent lines of reasoning.[73]
- These measures capture a relevant outcome. We think most would agree that subjective well-being is an important component of overall well-being. Measuring the effects interventions have on life satisfaction may be a more accurate way to assess impacts on subjective well-being than using indirect proxies, such as increases in income.
- The subjective well-being approach is empirical. It seems desirable for a moral weights approach to be able to change based on new research or facts about a particular context. The subjective well-being approach can change based on, for example, new studies about the effect of a particular intervention on life satisfaction.
We’re unsure what approach we should adopt for morbidity-averting interventions. We’d like to think further about the pros and cons of the subjective well-being approach and also the extent to which disability-adjusted life years may fail to capture some of the benefits of therapy (or other morbidity-averting interventions).
In this case, however, using a subjective well-being approach vs. our current moral weights does not make a meaningful difference in cost-effectiveness. If we use disability-adjusted life years instead of subjective well-being (i.e., effect on life satisfaction scores), we estimate StrongMinds is roughly 20% as cost-effective as a grant to AMF that is 10 times as cost-effective as unconditional cash transfers.[74]
Our next steps
We may prioritize further work on StrongMinds in the future to try to narrow some of our major uncertainties on this program. This may include funding additional studies, conducting additional desk-based research, and speaking with experts.
If we do, we would first prioritize the following:
Reviewing the forthcoming large-scale RCT of StrongMinds in Uganda by Sarah Baird and Berk Ozler, as well as a recent (unpublished) RCT by StrongMinds
Speaking to StrongMinds, Happier Lives Institute, researchers, and other subject matter experts about the key questions we’ve raised
Beyond that, we would consider:
Exploring ways to fund research on spillover effects and on effects on life satisfaction directly, potentially as part of ongoing trials or new trials
Exploring data collection on household size among StrongMinds participants
Learning more about program costs from StrongMinds
Conducting additional desk research that may inform adjustments for social desirability bias, publication bias, effects on grief, and duration of benefits of StrongMinds
Considering whether to incorporate subjective well-being measures directly into our moral weights and how much weight we would put on a subjective well-being approach vs. our typical approach in our assessment of StrongMinds
Considering how much weight to put on plausibility checks, potentially including additional donor surveys to understand where they fall on questions about moral intuitions
Considering research to measure effect of StrongMinds or other therapy programs on life satisfaction directly
Exploring how this might change our assessment of other programs addressing morbidity (because, e.g., they also have spillover effects or they also look more cost-effective under a subjective well-being approach) and consider collecting data on the effect on life satisfaction of those programs as well
Notes
- ↩︎
“Over 8-10 sessions, counselors guide structured discussions to help participants identify the underlying triggers of their depression and examine how their current relationships and their depression are linked.” StrongMinds, “Our Model at Work”
“StrongMinds treats African women with depression through talk therapy groups led by community workers. Groups consist of 5-10 women with depression or anxiety, meeting for a 90-minute session 1-2x per week for 4-8 weeks. Groups can meet in person or by phone.” StrongMinds, “StrongMinds FAQs”
- ↩︎
- ↩︎
1. “In order to do as much good as possible, we need to compare how much good different things do in a single ‘currency’. At the Happier Lives Institute (HLI), we believe the best approach is to measure the effects of different interventions in terms of ‘units’ of subjective well-being (e.g. self-reports of happiness and life satisfaction).” Plant and McGuire, “Donating money, buying happiness: new meta-analyses comparing the cost-effectiveness of cash transfers and psychotherapy in terms of subjective well-being,” 2021
2. “We will say that 1 WELLBY (wellbeing-adjusted life year) is equivalent to a 1-point improvement on a 0-10 life satisfaction scale for 1 year.” Happier Lives Institute, “The elephant in the bednet: the importance of philosophy when choosing between extending and improving lives,” 2022, p. 14.
- ↩︎
“Example evaluative measures: … An overall life satisfaction question, as adopted in the World Values Survey (Bjørnskov, 2010): All things considered, how satisfied are you with your life as a whole these days? Using this card on which 1 means you are “completely dissatisfied” and 10 means you are “completely satisfied” where would you put your satisfaction with life as a whole?” OECD Guidelines on Measuring Subjective Well-Being: Annex A, p. 1
- ↩︎
“We found that GiveDirectly’s cash transfers produce 8 WELLBYs/$1,000 and StrongMinds’ psychotherapy produces 77 WELLBYs/$1,000, making the latter about 10 times more cost-effective than GiveDirectly.” Happier Lives Institute, “The elephant in the bednet: the importance of philosophy when choosing between extending and improving lives,” 2022, p. 16.
- ↩︎
Happier Lives Institute, “Happiness for the whole family: Accounting for household spillovers when comparing the cost-effectiveness of psychotherapy to cash transfers,” February 2022, p. 22, Table 2: summary of estimated spillover effects and change in comparison.
- ↩︎
- Happier Lives Institute, “Cost-effectiveness analysis: StrongMinds,” October 2021, p. 15, Table 2: Evidence of direct and indirect evidence of StrongMinds’ effectiveness.
- Studies are described in “Section 4. Effectiveness of StrongMinds’ core programme,” pp. 9-18.
- ↩︎
These are described in this spreadsheet and Table A.3 of this page.
- ↩︎
Happier Lives Institute, “Cost-effectiveness analysis: StrongMinds,” October 2021, section “6.5 Considerations and limitations,” pp. 26-27.
- ↩︎
See this cell.
- ↩︎
Happier Lives Institute, “Happiness for the whole family: Accounting for household spillovers when comparing the cost-effectiveness of psychotherapy to cash transfers,” February 2022, p. 22, Table 2: summary of estimated spillover effects and change in comparison.
- ↩︎
“A limitation to the external validity of this evidence is that all of the samples were selected based on negative shocks happening to the children in the sample. In Kemp et al. (2009), children received EMDR for PTSD symptoms following an automobile accident. In Mutamba et al. (2018), caregivers of children with nodding syndrome received group interpersonal psychotherapy. In Swartz et al. (2008), depressed mothers of children with psychiatric illness received interpersonal psychotherapy. We are not sure if this would lead to an over or underestimate of the treatment effects, but it is potentially a further deviation from the type of household we are trying to predict the effects of psychotherapy for. Whilst recipients of programmes like StrongMinds might have children who have experienced negative shocks, we expect this is not the case for all of them.” Happier Lives Institute, “Happiness for the whole family: Accounting for household spillovers when comparing the cost-effectiveness of psychotherapy to cash transfers,” February 2022, p. 12.
- ↩︎
1 + (5.85 − 1) × 0.53 = 3.5705
- ↩︎
6.06 ≈ 1.69 + (5.85 − 1) × 1.69 × 0.53
- ↩︎
- Life satisfaction point-years are also known as WELLBYs.
- “To convert from SD-years to WELLBYs we multiply the SD-years by the average SD of life satisfaction (2.17, see row 8, “Inputs” tab), which results in 0.6 x 2.17 = 1.3 WELLBYs.” Happier Lives Institute, “The elephant in the bednet: the importance of philosophy when choosing between extending and improving lives,” 2022, p. 25.
- See this cell.
- “Our previous results (McGuire et al., 2022b) are in standard deviation changes over time (SD-years) of subjective wellbeing gained. Since these effects are standardised by dividing the raw effect by its SD, we convert it into life satisfaction points by unstandardising it with the global SD (2.2, see row 8) for life satisfaction (Our World in Data). Crucially, we assume a one-to-one exchange rate between a 1 SD change in affective mental health and subjective wellbeing measures. We’re concerned this may not be justified, but our investigations so far have not supported a different exchange rate.” Happier Lives Institute, “The elephant in the bednet: the importance of philosophy when choosing between extending and improving lives,” 2022, footnote 24, p. 16.
- ↩︎
See these cells.
- ↩︎
Happier Lives Institute, “Happiness for the whole family: Accounting for household spillovers when comparing the cost-effectiveness of psychotherapy to cash transfers,” February 2022, p. 22, Table 2: summary of estimated spillover effects and change in comparison.
- ↩︎
- 6.06 × 2.17 / $170 × $1,000 ≈ 77
- This is 36 SD-years of improvement in depression scores per $1,000 spent. That matches the value here.
- ↩︎
“We start with the simplest account, deprivationism. On this view: badness of death = net wellbeing level x years of life lost
“We assume that the average age of the individual who dies from malaria is 20 years old, they would expect to live to 70, and so preventing their death leads to 50 extra years. We estimate their average expected life satisfaction to be 4.95/10. Hence, the WELLBYs gained by the person whose death is prevented is (4.95 – neutral point) * (70 – 20).” Happier Lives Institute, “The elephant in the bednet: the importance of philosophy when choosing between extending and improving lives,” 2022, p. 17.
- ↩︎
“We assume that the average age of the individual who dies from malaria is 20 years old, they would expect to live to 70, and so preventing their death leads to 50 extra years. We estimate their average expected life satisfaction to be 4.95/10. Hence, the WELLBYs gained by the person whose death is prevented is (4.95 – neutral point) * (70 – 20).” Happier Lives Institute, “The elephant in the bednet: the importance of philosophy when choosing between extending and improving lives,” 2022, p. 17.
- ↩︎
“According to GiveWell, it costs $3,000 for AMF to prevent a death (on average).” Happier Lives Institute, “The elephant in the bednet: the importance of philosophy when choosing between extending and improving lives,” 2022, p. 18.
- ↩︎
Equals (4.95 − 0.5) × 50 WELLBYs per death averted, divided by $3,000 per death averted, multiplied by $1,000: approximately 74.
- ↩︎
“We estimate the grief-averting effect of preventing a death is 7 WELLBYs for each death prevented (see Appendix A.2), so 2.4 WELLBYs/$1,000. We estimate the income-increasing effects to be 4 WELLBYs/$1,000 (see Appendix A.1).” Happier Lives Institute, “The elephant in the bednet: the importance of philosophy when choosing between extending and improving lives,” 2022, p. 18
- ↩︎
We’re ignoring HLI’s estimates of the cost-effectiveness of cash transfers because the most important point of comparison is to our top charities, and because we don’t come to a meaningfully different bottom line on the cost-effectiveness of GiveDirectly than HLI does. An overview of its calculations is here.
- ↩︎
Calculations are summarized in this spreadsheet.
- ↩︎
We identified three RCTs that tested the impact of IPT-G on depression in a low- and middle-income country context (Bolton et al. 2003, Bolton et al. 2007, Thurman et al. 2017). Two of these trials (Bolton et al. 2003 and Bolton et al. 2007) find IPT-G decreases symptoms of depression, while one (Thurman et al. 2017) does not find evidence of an effect. Averaging across trials, we estimate an effect on depression scores of 1.1 standard deviations. We describe our weighting in this document.
We also did a quick review of three meta-analyses of the effect of various forms of therapy on depression scores in low-, middle-, and high-income countries. We have low confidence that these papers provide a comprehensive look at the effect of therapy on depression, and we view them as an intuitive check on the findings from the three RCTs of IPT-G in Sub-Saharan Africa.
These meta-analyses find effects that range from 0.2 to 0.9 standard deviations, which is lower than what we found in the IPT-G trials in Sub-Saharan Africa reported above.
Cuijpers et al. 2016 is a meta-analysis of RCTs of the effect of interpersonal psychotherapy (IPT) on mental health. It finds IPT for depression had an effect of 0.6 standard deviations (95% CI 0.45-0.75) across 31 studies.
Morina et al. 2017 is a meta-analysis of RCTs of psychotherapy for adult post-traumatic stress disorder and depression in low- and middle-income countries. It finds an effect of 0.86 standard deviations (95% CI 0.536-1.18) for 11 studies measuring an effect on depression.
Cuijpers et al. 2010 is a meta-analysis of RCTs of psychotherapy for adult depression that examines whether effect size varies with study quality. It finds an effect of 0.22 standard deviations for high-quality studies, compared to 0.74 for low-quality studies.
Based on this, we apply a 25% downward adjustment to these trials. We put slightly higher weight on Cuijpers et al. 2010, which finds a 0.2 standard deviation effect, since it explicitly takes study quality into account, and our best guess is that the typical therapy program reduces depression scores by 0.4 standard deviations. We put a 40% weight on these meta-analyses and a 60% weight on the trials from Sub-Saharan Africa reported above (Bolton et al. 2003, Bolton et al. 2007, and Thurman et al. 2017), which implies a weighted average effect of 0.82 = 1.1 × 60% + 0.4 × 40%. This is 75% of the 1.1 standard deviation effect reported in the IPT-G trials from Sub-Saharan Africa, or a 25% discount.
- ↩︎
“In 2022 we expect the cost to treat one patient will be $105 USD.” Mayberry, “AMA: Sean Mayberry, Founder & CEO of StrongMinds,” November 2022
- ↩︎
For example, we estimate Malaria Consortium in Oyo, Nigeria is 11 times as cost-effective as GiveDirectly under our current moral weights and 69 life satisfaction point-years per $1,000 spent under a subjective well-being approach, and HKI in Guinea is 11 times as cost-effective as GiveDirectly under our current approach and 71 life satisfaction point-years per $1,000 spent under a subjective well-being approach. See this spreadsheet for calculations.
- ↩︎
Happier Lives Institute, “Happiness for the whole family: Accounting for household spillovers when comparing the cost-effectiveness of psychotherapy to cash transfers,” February 2022, Appendix A, Section A2. Psychotherapy studies, p. 32.
- ↩︎
- ↩︎
- “The therapist also related interpersonal therapy’s emphasis on addressing interpersonal stressors, which led Ms. A to say that she needed guidance in dealing with Ann’s problems. The therapist explained that IPT-MOMS would specifically help Ms. A find ways to interact with Ann that would be more helpful to both mother and daughter.” Swartz et al. 2008, p. 7.
- “IPT-MOMS has been described elsewhere (18). Briefly, it consists of an initial engagement session based on principles of motivational interviewing and ethnographic interviewing (28), which is designed to explore and resolve potential barriers to treatment seeking (17, 29), followed by eight sessions of brief interpersonal psychotherapy (30). IPT-MOMS differs from standard interpersonal psychotherapy (16) in that 1) it follows the brief interpersonal psychotherapy model that is both shorter than standard interpersonal psychotherapy and uses some “soft” behavioral strategies to rapidly activate depressed patients (30), 2) it incorporates a motivational interviewing- and ethnographic interviewing-based engagement session and continues to draw on these engagement strategies as needed during the treatment, and 3) it uses specific strategies to assist mothers in managing problematic interpersonal relationships with their dependent, psychiatrically ill offspring. [...] Subjects assigned to treatment as usual were informed of their diagnoses, given psychoeducational materials, and told to seek treatment.” Swartz et al. 2008, p. 3.
- A video shared in Mutamba et al. 2018b shows sample sessions with women discussing care of children with nodding syndrome.
- Mutamba et al. 2018b provides more description of the intervention in Mutamba et al. 2018a. “Quantitative results of the trial have been published and demonstrate the effectiveness of IPT-G in treating depression in both caregivers and their children [34].” Mutamba et al. 2018b, p. 10.
- ↩︎
- “In the forest plot above, HLI reports that Kemp et al. (2009) finds a non-significant 0.35 [-0.43,1.13] standard deviation improvement in mental health for parents of treated children. But table 2 reports that parents in the treatment groups’ score on the GHQ-12 increased relative to the wait-list group (higher scores on the GHQ-12 indicate more self-reported mental health problems).” Snowden, “Why I don’t agree with HLI’s estimate of household spillovers from therapy,” February 24, 2023
- From HLI’s comment: “This correction would reduce the spillover effect from 53% to 38% and reduce the cost-effectiveness comparison from 9.5 to 7.5x, a clear downwards correction.” Snowden, “Why I don’t agree with HLI’s estimate of household spillovers from therapy,” February 24, 2023
- ↩︎
Das et al. 2008, Table 2 for results.
- ↩︎
“In summary, Table 4 suggests that there is variation by gender in the estimated transmission coefficient from parental distress to child’s LS. Specifically, mothers’ distress levels do not appear to be an important determinant of boys’ LS. … Transmission correlations are quantitatively important as well as statistically significant. For example, the mean of FDhi(t-1) and MDhi(t-1) are 1.759 and 2.186, and their standard deviations are 2.914 and 3.208, respectively. An increase of one standard deviation from the means of FDhi(t-1) and MDhi(t-1) imply a change in the mental distress level to 4.674 for fathers and 5.394 for mothers. Taking conservative estimates of FDhi(t-1) and MDhi(t-1) for girls to be -.029 and -.022, the implied changes in the girl’s LS are approximately -.051 and -.048. Given that the mean of LS for girls is 5.738 and its standard deviation is 1.348, a ceteris paribus increase of one standard deviation in either parent’s mental distress level explains around a 25% drop in the standard deviation in the girl’s LS.” Powdthavee and Vignoles 2008, p. 18.
- ↩︎
“In Table 7, we begin with the analysis of the impact of the partner’s standardised SF36 mental health score (0-100, where higher values represent higher level of well-being). Increasing this score by one standard deviation increases individual’s life satisfaction by 0.07 points (on a 1-10 scale), which is equivalent to 5% of a standard deviation in life satisfaction. To put this in context, this is similar to the (reversed) effect of becoming unemployed or being victim of a property crime (see Table 10).” Mendolia et al. 2018, p. 12.
- ↩︎
- For Das et al. 2008, Table 2 includes controls for age, female, married, widowed, education indicators, household consumption, household size, physical health, elderly dependents, and young dependents. See pp. 39-40.
- For Powdthavee and Vignoles 2008: “We include a set of youth attributes, as well as both parents’ characteristics and some household characteristics (taken from the main BHPS dataset) as control variables in the child’s LS regressions. Youth attributes include child’s age and the number of close friends the child has. Age and the number of close friends are measured as continuous variables, and are time-varying across the observation waves. Parental characteristics include education, employment status, and health status of both parents if present in the household. Education is captured by two dummy variables, which represent (i) whether the parent achieved A levels or not and (ii) whether they achieved a degree. More disaggregated measures of parental education are not feasible with these data. Parental employment status is measured as a categorical variable identifying self-employment and full-time employment. Health status is also measured as a categorical variable, ranging from “1.very poor health” to “5.excellent health”. Household characteristics include household income in natural log form and the number of children in the household. Household income is calculated by taking the summation of all household members’ annual incomes and is converted into real income in 1995 prices by dividing it by the annual consumer prices index (CPI). The number of children is a continuous variable and time varying across the panel. We include these variables because they are known to be correlated with measures of LS, and they may also be correlated with the mental distress of the parents (for a review, see Oswald, 1997). Following prior studies on how to model psychological well-being (Clark, 2003; Gardner & Oswald, 2007), a similar set of controls were included in each parent’s mental distress equations, with the addition of each parent’s age. The spouse’s observed characteristics are not included in the parent’s own mental distress equation as the model already allows for the correlations between the residuals. We also include the gender of the child in later analyses of moderating gender effects. Details of mean scores and standard deviations in the final sample for each of the dependent and control variables are given in Appendix B. In order to avoid non-response bias, we create dummy variables representing missing values for all control variables in the final sample.” p. 12.
- For Mendolia et al. 2018: “Our main model (Specification 1) includes an extensive set of independent variables, to consider other factors that may influence life satisfaction, such as individual’s and partner’s self-assessed health, education, gender, employment and marital status, number and age of children, geographic remoteness, time binary variables, and life events that took place in the last 12 months (personal injury or illness, serious illness of a family member, victim of physical violence, death of a close relative or family member, victim of a property crime). We also estimate two additional specifications (Specifications 2 and 3) of each model, including other variables, such as partners’ long term conditions, and possible strategies to help the individual to deal with partners’ mental health, such as presence of social networks, and engagement in social activities. The complete list of variables included in the model is reported in Table 4.” p. 9.
- ↩︎
“We calculate average household sizes by averaging the latest available data from the United Nations Population Division (2019a), with average rural household sizes in these countries. Rural household size data comes from the Global Data Lab. We do this because StrongMinds and GiveDirectly operate mainly in rural or suburban areas.” Happier Lives Institute, “Happiness for the whole family: Accounting for household spillovers when comparing the cost-effectiveness of psychotherapy to cash transfers,” February 2022, p. 33.
- ↩︎
- The Global Data Lab household size estimate is approximately 6.3 for rural areas in Kenya and Uganda in 2019.
- The UN household size estimate for Uganda is approximately 4.9, based on 2019 DHS data (from the “dataset” spreadsheet, sheet “HH Size and Composition 2022,” downloaded here).
- ↩︎
Uganda Bureau of Statistics, Uganda National Household Survey 2019/2020, Figure 2.4, p. 36.
- ↩︎
- See, for example, Table 1B, “Effect on depression at t=0 in Cohen’s d.”
- “We assume that treatment improves the ‘subjective well-being’ factors to the same extent as the ‘functioning’ factors, and therefore we could unproblematically compare depression measures to ‘pure’ SWB measures using changes in standard deviations (Cohen’s d).” Happier Lives Institute, “Cost-effectiveness analysis: StrongMinds,” October 2021, p. 14.
- ↩︎
- “To convert from SD-years to WELLBYs we multiply the SD-years by the average SD of life satisfaction (2.17, see row 8, “Inputs” tab), which results in 0.6 x 2.17 = 1.3 WELLBYs.” Happier Lives Institute, “The elephant in the bednet: the importance of philosophy when choosing between extending and improving lives,” 2022, p. 25.
- See this cell.
- “Our previous results (McGuire et al., 2022b) are in standard deviation changes over time (SD-years) of subjective wellbeing gained. Since these effects are standardised by dividing the raw effect by its SD, we convert it into life satisfaction points by unstandardising it with the global SD (2.2, see row 8) for life satisfaction (Our World in Data). Crucially, we assume a one-to-one exchange rate between a 1 SD change in affective mental health and subjective wellbeing measures. We’re concerned this may not be justified, but our investigations so far have not supported a different exchange rate.” Happier Lives Institute, “The elephant in the bednet: the importance of philosophy when choosing between extending and improving lives,” 2022, footnote 24, p. 16.
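To make the quoted conversion concrete, here is a minimal sketch (the 0.6 SD-years and 2.17 figures are HLI’s, from the first quote above; the function name is ours for illustration):

```python
SD_LIFE_SATISFACTION = 2.17  # HLI's estimate of the average SD of life satisfaction

def sd_years_to_wellbys(sd_years: float) -> float:
    """Convert a standardised effect in SD-years into WELLBYs."""
    return sd_years * SD_LIFE_SATISFACTION

print(round(sd_years_to_wellbys(0.6), 1))  # 1.3, matching HLI's worked example
```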
- ↩︎
“We summarized the relative effectiveness of five different therapeutic interventions on SWB (not just LS specifically) and depression. The results are summarized in this spreadsheet. The average ratio of SWB to depression changes in the five meta-analyses is 0.89 SD; this barely changes if we remove the SWB measures that are specifically affect-based.” Happier Lives Institute, “Cost-effectiveness analysis: Group or task-shifted psychotherapy to treat depression,” 2021, p. 29.
- ↩︎
StrongMinds reports in its 2022 Q4 report that across all programs 100% of participants had depression (3% mild, 41% moderate, 41% moderate-severe, 15% severe). StrongMinds, Q4 2022 Report, p. 2.
- ↩︎
Some possibilities are as follows. As StrongMinds scales up its program:
- It has to train more facilitators. It seems possible that the first individuals to come forward to get trained as facilitators are the highest quality, and that as more individuals come forward as the program scales up in an area, the marginal facilitator is of decreasing quality.
- It may not have the resources to oversee the quality of implementation at scale to the same extent as academic researchers in a small trial.
- It may start to operate in new contexts in which it has less experience or understanding of locally relevant concepts/causes of depression. It may take time to tailor its program accordingly, and in the meantime the program may be less effective.
- ↩︎
“Bolton et al. (2003) and its six-month follow-up (Bass et al., 2006) were studies of an RCT deployed in Uganda (where StrongMinds primarily operates). StrongMinds based its core programme on the form, format, and facilitator training[footnote 5] of this RCT, which makes it highly relevant as a piece of evidence. StrongMinds initially used the same number of sessions (16) but later reduced its number of sessions to 12. They did this because the extra sessions did not appear to confer much additional benefit (StrongMinds, 2015, p.18), so it did not seem worth the cost to maintain it….Footnote 5: In personal communication StrongMinds says that their mental health facilitators receive slightly less training than those in the Bolton et al., (2003) RCT.” Happier Lives Institute, “Cost-effectiveness analysis: StrongMinds,” October 2021, p. 10.
- ↩︎
- “Finally, the study was conducted following rapid program scale-up. Mechanisms to ensure implementation fidelity were not overseen by the researchers, in contrast to the Uganda trial (Bolton et al., 2007). This study captured World Vision South Africa’s first experience delivering IPTG in these communities; time and experience may contribute to increased effectiveness. A variety of new local implementation contexts, many of them resource-constrained, and the pace and scope of scale-up likely contributed to variation in the level of program quality and fidelity to the original model”, Thurman et al. 2017, p. 229.
- “While 23% of adolescents in the intervention group did not attend any IPTG sessions, average attendance was 12 out of 16 possible sessions among participants. The intervention was not associated with changes in depression symptomology.” Thurman et al. 2017, Abstract, “Results.”
- While we might not want to additionally adjust the estimates in Thurman et al. 2017 much for this concern, we do want to adjust the estimates in the two Bolton et al. studies downwards. Because Thurman et al. 2017 provides little detail on the implementation problems that rapid scale-up caused, we have not adjusted the relative weights in the meta-analysis to favor that study (it is hard to know to what extent those same factors would apply to StrongMinds’ program).
- ↩︎
See “Effect vs. sample size” chart in this spreadsheet. These charts are based on the meta-analysis described in section 4.1.2 of this report and the spreadsheet linked below:
“We include evidence from psychotherapy that isn’t directly related to StrongMinds (i.e., not based on IPT or delivered to groups of women). We draw upon a wider evidence base to increase our confidence in the robustness of our results. We recently reviewed any form of face-to-face modes of psychotherapy delivered to groups or by non-specialists, deployed in LMICs (HLI, 2020b). At the time of writing, we have extracted data from 39 studies that appeared to be delivered by non-specialists and/or to groups from five meta-analytic sources and any additional studies we found in our search for the costs of psychotherapy.
“These studies are not exhaustive. We stopped collecting new studies due to time constraints (after 10 hours), and the perception that we had found most of the large and easily accessible studies from the extant literature. The studies we include and their features can be viewed in this spreadsheet.”
- ↩︎
“Actual enrollment: 1914 participants.” Ozler and Baird, “Using Group Interpersonal Psychotherapy to Improve the Well-Being of Adolescent Girls,” (ongoing), “Study Design.”
- ↩︎
“StrongMinds recently conducted a geographically-clustered RCT (n = 394 at 12 months) but we were only given the results and some supporting details of the RCT. The weight we currently assign to it assumes that it improves on StrongMinds’ impact evaluation and is more relevant than Bolton et al., (2003). We will update our evaluation once we have read the full study.” Happier Lives Institute, “Cost-effectiveness analysis: StrongMinds,” October 2021, p. 11.
- ↩︎
A couple of examples of why social desirability bias might exist in this setting:
- If a motivated and pleasant IPT facilitator comes to your village and is trying to help you improve your mental health, you may feel some pressure to report that the program has worked, to reward the effort that facilitator has put into helping you.
- In Bolton et al. 2003, the experimenters told participants at the start of the study that the control group would receive the treatment at a later date if it proved effective. Participants might then feel some pressure to report that the treatment worked so as not to deprive individuals in control villages of receiving the treatment too. “Prior to randomization, all potential participants were informed that if the intervention proved effective, it would later be offered to controls (currently being implemented by World Vision International)”, Bolton et al. 2003, p. 3118.
- Individuals may report worse findings if they think doing so would lead to them receiving the intervention. “On this occasion informed consent included advising each youth of the study group to which he or she had been allocated. Our NGO partners had previously agreed to provide/continue on a permanent basis whichever intervention proved effective. Individuals assigned to the wait-control group were told they would be first to receive whichever intervention (if any) proved effective.” Bolton et al. 2007, p. 522.
- ↩︎
“As far as we can tell, this is not a problem. Haushofer et al., (2020), a trial of both psychotherapy and cash transfers in a LMIC, perform a test of the ‘experimenter demand effect’, where they explicitly state to the participants whether they expect the research to have a positive or negative effect on the outcome in question. We take it this would generate the maximum effect, as participants would know (rather than have to guess) what the experimenter would like to hear. Haushofer et al., (2020), found no impact of explicitly stating that they expected the intervention to increase (or decrease) self-reports of depression. The results were non-significant and close to zero (n = 1,545). [...]
Other less relevant evidence of experimenter demand effects finds that it results in effects that are small or close to zero. Bandiera et al., (n = 5966; 2020) studied a trial that attempted to improve the human capital of women in Uganda. They found that experimenter demand effects were close to zero. In an online experiment Mummolo & Peterson, (2019) found that “Even financial incentives to respond in line with researcher expectations fail to consistently induce demand effects.” Finally, in de Quidt et al., (2018), while they find experimenter demand effects, they conclude by saying “Across eleven canonical experimental tasks we … find modest responses to demand manipulations that explicitly signal the researcher’s hypothesis… We argue that these treatments reasonably bound the magnitude of demand in typical experiments, so our … findings give cause for optimism.”” Happier Lives Institute, “Cost-effectiveness analysis: StrongMinds,” October 2021, p. 26.
- ↩︎
- Individuals may choose to report more favorable findings due to thinking that if they say it was helpful, others will receive the program.
- “Prior to randomization, all potential participants were informed that if the intervention proved effective, it would later be offered to controls (currently being implemented by World Vision International).” Bolton et al. 2003, p. 3118.
- Individuals may report worse findings if they think doing so would lead to them receiving the intervention.
- “On this occasion informed consent included advising each youth of the study group to which he or she had been allocated. Our NGO partners had previously agreed to provide/continue on a permanent basis whichever intervention proved effective. Individuals assigned to the wait-control group were told they would be first to receive whichever intervention (if any) proved effective.” Bolton et al. 2007, p. 522.
- ↩︎
Calculations are in this spreadsheet. HLI applies an overall discount factor of 89%. Removing the publication bias adjustment (setting the weight in cell I7 to 0) changes this adjustment factor to 96%.
- ↩︎
See funnel plots here.
- ↩︎
“Among comparisons to control conditions, adding unpublished studies (Hedges’ g = 0.20; CI95% −0.11~0.51; k = 6) to published studies (g = 0.52; 0.37~0.68; k = 20) reduced the psychotherapy effect size point estimate (g = 0.39; 0.08~0.70) by 25%.” Driessen et al. 2015, p. 1.
- ↩︎
- “We include evidence from psychotherapy that isn’t directly related to StrongMinds (i.e., not based on IPT or delivered to groups of women).” Happier Lives Institute, “Cost-effectiveness analysis: StrongMinds,” October 2021, p. 12.
- Happier Lives Institute, “Cost-effectiveness analysis: StrongMinds,” October 2021, p. 15, Table 2: Evidence of direct and indirect evidence of StrongMinds’ effectiveness.
- Studies are described in “Section 4. Effectiveness of StrongMinds’ core programme,” Pp. 9-18.
- ↩︎
For example:
- Bolton et al. 2003 and Bolton et al. 2007 also only include individuals who have been screened for depression in the study. By contrast, Thurman et al. 2017 does not directly screen participants for depression (but rather targets an “at risk” group: children who have been orphaned as a result of HIV/AIDS or are otherwise vulnerable).
- Bolton et al. 2007 treats individuals in camps for internally displaced people, 40% of whom had been abducted as children. Thurman et al. 2017 treats children who have been orphaned as a result of HIV/AIDS or are otherwise vulnerable.
- ↩︎
These are in this spreadsheet and described in Appendix B (p. 30) of this page.
- ↩︎
Happier Lives Institute, “Cost-effectiveness analysis: StrongMinds,” October 2021, Section 4.2 Trajectory of efficacy through time.
- ↩︎
Happier Lives Institute, “Cost-effectiveness analysis: StrongMinds,” October 2021, Figure 3, p. 18. “We assume the effects have entirely dissipated in five years (95% CI: 2, 10).” https://www.happierlivesinstitute.org/report/strongminds-cost-effectiveness-analysis/
- ↩︎
- The longest follow-up in the RCT literature on IPT-G in developing countries is 1 year and 3 months, in Thurman et al. 2017 (which demonstrates the persistence of a null effect measured 3 months after follow-up).
- The next longest follow-up is the 6-month follow-up in Bass et al. 2006, which demonstrates the persistence of the treatment effect on the treated (although there is no direct test of the persistence of the intent-to-treat effect).
- It is possible that the effects persist for longer than the first 1-2 years, but that longer duration just hasn’t been tested.
- However, I am skeptical that the effects persist beyond the first 1-2 years, for two reasons: (a) my prior is that a time-limited and fairly light-touch intervention (twelve 90-minute group sessions) is unlikely to have a persistent effect; (b) these RCTs were done sufficiently long ago that the authors have had time to conduct longer-term follow-ups. There are many possible reasons why they haven’t done so that are unrelated to the effect size. However, I believe that a long-term follow-up demonstrating a persistent effect of IPT-G would make for a good academic publication, so the absence of a published long-term follow-up updates me slightly toward the view that the effect does not persist.
- ↩︎
See “Outcome Measures” section, “Primary Outcome Measures” subsection here.
- ↩︎
“In 2022 we expect the cost to treat one patient will be $105 USD. By the end of 2024, we anticipate the cost per patient will have decreased to just $85. We will continue to reduce the cost of treating one woman even while our numbers increase. This is through effective scaling and continuing to evaluate where we can gain more cost savings. A donation to StrongMinds will be used as effectively and efficiently as possible. And when you think about what it costs for therapy in the United States, to spend just $105 and treat a woman for depression is a pretty incredible feat.” Sean Mayberry, Founder and CEO, StrongMinds, responses to questions on the Effective Altruism Forum, November 2022.
- ↩︎
“Group tele-therapy (38.27% of 2021 budget) is delivered over the phone by trained mental health facilitators and volunteers (peers) to groups of 5 (mostly women) for 8 weeks. We expect the share of the budget this programme receives to decline as the threat of COVID diminishes.” [...]
“In addition to the core programme, StrongMinds also implements face-to-face g-IPT directed to young women and has begun a volunteer-run model. The youth programme (14.07%) is delivered by trained mental health facilitators to groups of adolescent girls.” [...] “StrongMinds’ peer programme (5.70%) is described as “self-replicating, volunteer-led talk therapy groups of eighteen people led by individuals trained in IPT-G. For this programme component, mental health facilitators recruit [core programme] graduates eager to give back to their communities and train them to be volunteer peer mental health facilitators. They train by co-facilitating courses of the core programme for half a year” (StrongMinds.org, 2021). The peer groups are smaller than the core programme groups (6-8 instead of 12-14).” Happier Lives Institute, “Cost-effectiveness analysis: StrongMinds,” October 2021, Pp. 22-23.
- ↩︎
“Building on this success, StrongMinds Uganda is advocating for new country-level adolescent mental healthcare policies. Between April and August of 2022, we identified and trained learners and female teachers from five districts around the country to serve as mental health advocates. All had experienced StrongMinds therapy as group members or leaders.” StrongMinds, “Big Win For Mental Health in Uganda’s schools,” 2022.
- ↩︎
See calculations here.
Note that the comparisons to the value of deaths averted assume deprivationism and a neutral point of 0.5. In its report, HLI presents a range of views on the badness of death and does not take an explicit position on the value of saving a life. Assuming deprivationism and a neutral point of 0.5 yields among the highest values for averting a death. As a result, nearly all of the alternative assumptions HLI presents would imply lower values for averting a death and lead to what we think are even more unintuitive conclusions.
- ↩︎
See calculations here.
- ↩︎
HLI estimates 1.70 SD-years of impact over 2.13 years and 2.17 life satisfaction points per SD of impact, compared to 4.45 life satisfaction points from each additional year of life (under a deprivationist approach to valuing death and a neutral point of 0.5). See calculations here.
- ↩︎
“We do a shallow calculation for grief in the same way we did in Donaldson et al. (2020). The best estimate we found is from Oswald and Powdthavee (2008): a panel study in the UK which finds the effect on life satisfaction due to the death of a child in the last year as being −0.72 (adjusted for a 0-10 scale). According to Clark et al. (2018), the duration of grief is ~5 years. Based on data from the UNDP, we calculate that the average household size across the beneficiary countries (excluding the recipient of the nets) is 4.03 people (row 16). Hence, an overall effect of grief per death prevented is (0.72 x 5 x 0.5) x 4.03 = 7.26 WELLBYs. However, we think this is an upper bound because it doesn’t account for the counterfactual grief averted.” Happier Lives Institute, “The elephant in the bednet: the importance of philosophy when choosing between extending and improving lives,” 2022, p. 26.
- ↩︎
We describe this as “cluster thinking” in this post.
- ↩︎
See calculations here.
Joel’s response
[Michael’s response below provides a shorter, less-technical explanation.]
Summary
Alex’s post has two parts. First, what is the estimated impact of StrongMinds in terms of WELLBYs? Second, how cost-effective is StrongMinds compared to the Against Malaria Foundation (AMF)? I briefly present my conclusions to both in turn. More detail about each point is presented in Sections 1 and 2 of this comment.
The cost-effectiveness of StrongMinds
GiveWell estimates that StrongMinds generates 1.8 WELLBYs per treatment (17 WELLBYs per $1000, or 2.3x GiveDirectly[1]). Our most recent estimate[2] is 10.5 WELLBYs per treatment (62 WELLBYs per $1000, or 7.5x GiveDirectly). This represents an 83% discount (an 8.7 WELLBY gap)[3] to StrongMinds’ effectiveness[4]. These discounts, while sometimes informed by empirical evidence, are primarily subjective in nature. Below I present the discounts, and our response to them, in more detail.
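The headline figures above can be checked with some quick arithmetic. This sketch assumes the per-treatment costs discussed in Section 1.8 ($170 in HLI’s analysis, $105 in GiveWell’s); the variable names are ours:

```python
hli_effect, givewell_effect = 10.5, 1.8  # WELLBYs per treatment
hli_cost, givewell_cost = 170.0, 105.0   # assumed dollars per treatment (Section 1.8)

print(f"gap: {hli_effect - givewell_effect:.1f} WELLBYs")                   # 8.7
print(f"discount: {1 - givewell_effect / hli_effect:.0%}")                  # 83%
print(f"HLI: {1000 * hli_effect / hli_cost:.0f} WELLBYs per $1000")         # 62
print(f"GiveWell: {1000 * givewell_effect / givewell_cost:.0f} per $1000")  # 17
```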
Figure 1: Description of GiveWell’s discounts on StrongMinds’ effect, and their source
Notes: The graph shows the factors that make up the 8.7 WELLBY discount.
Table 1: Disagreements on StrongMinds per treatment effect (10.5 vs. 1.8 WELLBYs) and cost
Note: GiveWell estimates that StrongMinds has an effect of 1.8 WELLBYs per recipient household. HLI estimates that this figure is 10.5. This represents an 8.7 WELLBY gap.
How do we assess GiveWell’s discounts? We summarise our position below.
Figure 2: HLI’s views on GiveWell’s total discount of 83% to StrongMinds’ effects
- We think there’s sufficient evidence and reason to justify the magnitude of about 5% of GiveWell’s total discount.
- For ~45% of their total discount, we are sympathetic to including a discount, but we are unsure about the magnitude (generally, we think the discount would be lower). The adjustments that I think are the most plausible are:
  - A discount of up to 15% for conversion between depression and life-satisfaction SDs.
  - A discount of up to 20% for loss of effectiveness at scale.
  - A discount of up to 5% for response biases.
  - Reducing the household size to 4.8 people.
- We are unsympathetic to ~35% of their total discount, because our intuitions differ and there doesn’t appear to be sufficient existing evidence to settle the matter (i.e., household spillovers).
- We think that for 15% of their total discount, the existing evidence doesn’t seem to substantiate a discount (i.e., their discounts on StrongMinds’ durability).
However, as Michael mentions in his comment, a general source of uncertainty we have is about how and when to make use of subjective discounts. We will make more precise claims about the cost-effectiveness of StrongMinds when we finalise our revision and expansion.
The cost-effectiveness of AMF
The second part of Alex’s post asks how cost-effective StrongMinds is compared to the Against Malaria Foundation (AMF). AMF, which prevents malaria with insecticide-treated bednets, is, in contrast to StrongMinds, primarily a life-saving intervention. Hence, as @Jason rightly pointed out elsewhere in the comments, its cost-effectiveness strongly depends on philosophical choices about the badness of death and the neutral point (see Plant et al., 2022). GiveWell takes a particular set of views (deprivationism with a neutral point of 0.5) that are very favourable to life-saving interventions. But there are other plausible views that can change the results, and even make GiveWell’s estimate of StrongMinds seem more cost-effective than AMF. Whether you accept our original estimate of StrongMinds or GiveWell’s lower estimate, the comparison is still incredibly sensitive to these philosophical choices. I think GiveWell is full of incredible social scientists, and I admire many of them, but I’m not sure that should privilege their philosophical intuitions.
Further research and collaboration opportunities
We are truly grateful to GiveWell for engaging with our research on StrongMinds. I think we largely agree with GiveWell regarding promising steps for future research. We’d be keen to help make many of these come true, if possible. Particularly regarding: other interventions that may benefit from a SWB analysis, household spillovers, publication bias, the SWB effects of psychotherapy (i.e. not just depression), and surveys about views on the neutral point and the badness of death. I would be delighted if we could make progress on these issues, and doubly so if we could do so together.
1. Disagreements on the cost-effectiveness of StrongMinds
HLI estimates that psychotherapy produces 10.5 WELLBYs (or 62 per $1000, 7.5x GiveDirectly) for the household of the recipient, while GiveWell estimates that psychotherapy has about a sixth of the effect, 1.8 WELLBYs (17 per $1000 or 2.3x GiveDirectly[5]). In this section, I discuss the sources of our disagreement regarding StrongMinds in the order I presented in Table 1.
1.1 Household spillover differences
Household spillovers are our most important disagreement. When we discuss the household spillover effect or ratio we’re referring to the additional benefit each non-recipient member of the household gets, as a percentage of what the main recipient receives. We first analysed household spillovers in McGuire et al. (2022), which was recently discussed here. Notably, James Snowden pointed out a mistake we made in extracting some data, which reduces the spillover ratio from 53% to 38%.
GiveWell’s method relies on:
- Discounting the 38% figure for several general reasons: (A) specific concerns that the studies we use might overestimate the benefits because they focused on families with children who had high-burden medical conditions, and (B) a shallow review of correlational estimates of household spillovers, which found spillover ratios ranging from 5% to 60%.
- Finally, concluding that their best guess is that the spillover percentage is 15 or 20%[6], rather than 53% (what we used in December 2022) or 38% (what we would use now in light of Snowden’s analysis).

Since their resulting figure is a subjective estimate, we aren’t exactly sure why they give that figure, or how much they weigh each piece of evidence.
Table 2: HLI and GiveWell’s views on household spillovers of psychotherapy

| Variable | HLI | GiveWell | Explains how much difference in SM’s effect (%) |
| --- | --- | --- | --- |
| Household spillover ratio for psychotherapy | 38% | 15% | 3 WELLBYs (34% of total gap) |

Note: The household spillover for cash transfers we estimated is 86%.
I reassessed the evidence very recently—as part of the aforementioned discussion with James Snowden—and Alex’s comments don’t lead me to update my view further. In my recent analysis, I explained that I think I should weigh the studies we previously used less because they do seem less relevant to StrongMinds, but I’m unsure what to use instead. And I also hold a more favourable intuition about household spillovers for psychotherapy, because parental mental health seems important for children (e.g., Goodman, 2020).
But I think we can agree that collecting and analysing new evidence could be very important here. The data from Barker et al. (2022), a high-quality RCT of the effect of CBT on the general population in Ghana (n = ~7,000), contains information on both partners’ psychological distress when one of them received cognitive behavioural therapy, so this data can be used to estimate any spousal spillover effects from psychotherapy. I am in the early stages of analysing this data[7]. There also seems to be a lot of promising primary work that could be done to estimate household spillovers alongside the effects of psychotherapy.
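For readers following the arithmetic, here is a sketch of the simple multiplier model at stake (our reconstruction, not either organisation’s actual spreadsheet; the recipient effect uses the 1.70 SD-years and 2.17 SD figures cited in GiveWell’s footnotes, and the 5.9 household size revisited in Section 1.7). Holding everything else fixed overstates the gap shown in Table 2, because GiveWell’s other discounts interact with this one:

```python
recipient_effect = 1.70 * 2.17  # ~3.7 WELLBYs for the woman treated
household_size = 5.9            # HLI's figure, discussed further in Section 1.7

def household_effect(spillover_ratio: float) -> float:
    """Recipient effect plus spillovers to each other household member."""
    return recipient_effect * (1 + (household_size - 1) * spillover_ratio)

print(round(household_effect(0.38), 1))  # ~10.6, close to HLI's 10.5
print(round(household_effect(0.15), 1))  # ~6.4 under GiveWell's ratio alone
```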
1.2 Conversion between measures, data sources, and units
The conversion between depression and life-satisfaction (LS) scores ties with household spillovers in terms of importance for explaining our disagreements about the effectiveness of psychotherapy. We’ve previously assumed that a one standard deviation (SD) decrease in depression symptoms (or affective mental health; MHa) is equivalent to a one SD improvement in life satisfaction or happiness (i.e., a 1:1 conversion); see here for our previous discussion and rationale.
GiveWell has two concerns with this:
Depression and life-satisfaction measures might not be sufficiently empirically or conceptually related to justify a 1:1 conversion. Because of this, they apply an empirically based 10% discount.
They are concerned that recipients of psychotherapy have a smaller variance in subjective wellbeing (SWB) than general populations (e.g., cash transfers), which leads to inflated effect sizes. They apply a 20% subjective discount to account for this.
Hence, GiveWell applied a combined 30% discount (see Table 3 below).
Table 3: HLI and GiveWell’s views on converting between SDs of depression and life satisfaction

| HLI | GiveWell | Explains what difference in SM’s effect (%) |
| --- | --- | --- |
| 1 to 1 | 1 to 0.7 | 3 WELLBYs (34% of total) |
Overall, I agree that there are empirical reasons for including a discount in this domain, but I’m unsure of its magnitude. I think it will likely be smaller than GiveWell’s 30% discount.
1.2.1 Differences between the two measures
First, GiveWell mentions a previous estimate of ours suggesting that mental health (MH) treatments[8] impact depression 11% more than SWB. Our original calculation used a naive average, but on reflection it seems more appropriate to use a sample-size-weighted average (because of the large differences in sample sizes between studies), which results in depression measures exceeding SWB measures by 4% instead of 11%.
Results between depression and happiness measures are also very close in Bhat et al. (2022; n = 589), the only study I’ve found so far that looks at effects of psychotherapy on both types of measures. We can standardise the effects in two ways. Depending on the method, the SWB effects are 18% larger or 1% smaller than the MHa effects[9]. Thus, effects of psychotherapy on depression appear to be of similar size to effects on SWB. Given these results, I think the discount due to empirical differences could be smaller than 10%; I would guess 3%.
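To spell out the two standardisation routes described in footnote [9] (the 0.38 and 0.97 figures are from Bhat et al. 2022 as reported there; the rescaling is the stretch transformation, mapping a 0-27 PHQ-9 change onto a 1-10 scale):

```python
happiness_effect = 0.38   # change on a 1-10 happiness scale
depression_effect = 0.97  # change on the PHQ-9's 0-27 scale

# "Stretch" the PHQ-9 change onto the 9-point-wide 1-10 scale:
depression_rescaled = depression_effect * (10 - 1) / (27 - 0)  # ~0.32

print(f"{happiness_effect / depression_rescaled - 1:.0%}")  # ~18%: SWB effect larger
```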
Another part of this is that depression and life satisfaction are not the same concept. So if the scores are different, there is a further moral question about which deserves more weight. The HLI ‘house view’, as our name indicates, favours happiness (how good/bad we feel) as what matters. Further, we suspect that measures of depression are conceptually closer to happiness than measures of life satisfaction are. Hence, if push came to shove, and there is a difference, we’d care more about the depression scores, so no discount would be justified. From our conversation with Alex, we understand that the GiveWell ‘house view’ is to care more about life satisfaction than happiness. In this case, GiveWell would be correct, by their lights, to apply some reduction here.
1.2.2 Differences in variance
In addition to their 10% conversion discount, GiveWell adds another 20% discount because they think a sample of people with depression has a smaller variance in life satisfaction scores.[10] Setting aside the technical topic of why variation in variances matters, I investigated whether there are lower SDs in life satisfaction when you screen for baseline depression, using a few datasets. I found that, if anything, the SDs are larger by 4% (see Table 4 below). Although I see the rationale behind GiveWell’s speculation, the evidence I’ve looked at suggests a different conclusion.
Table 4: Life-satisfaction SD depending on clinical mental health cutoff

| LS SD for general pop | LS SD for dep pop | SWB SD change (gen → dep) | SWB measure |
| --- | --- | --- | --- |
| 1.23 | 1.30 | 106% | LS 1-10 |
| 1.65 | 1.88 | 114% | LS 0-10 |
| 2.43 | 2.38 | 98% | LS 1-10 |
| 1.02 | 1.04 | 102% | LS (z-score) |
| Average: 1.58 | 1.65 | 104% | |

Note: BHPS = The British Household Panel Survey, HILDA = The Household Income and Labour Dynamics Survey, NIDS = National Income Dynamics Study. LS = life satisfaction, dep = depression.
However, I’m separately concerned that SD changes in trials where recipients are selected based on depression (i.e., psychotherapy) are inflated compared to trials without such selection (i.e., cash transfers)[11].
Overall, I think I agree with GiveWell that there should be a discount here that HLI doesn’t implement, but I’m unsure of its magnitude, and I think that it’d be smaller than GiveWell’s. More data could likely be collected on these topics, particularly how much effect sizes in practice differ between life-satisfaction and depression, to reduce our uncertainty.
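A toy illustration of why the variance point matters (the numbers below are invented purely for illustration): Cohen’s d divides the raw effect by the sample SD, so if screening on the outcome shrinks the SD, the standardised effect inflates for the same raw change.

```python
raw_effect = 0.50    # hypothetical raw change on a depression scale
sd_general = 1.00    # SD in an unscreened sample (as in cash transfer trials)
sd_screened = 0.80   # SD in a depression-screened sample, assumed 20% smaller

print(raw_effect / sd_general)   # d = 0.5
print(raw_effect / sd_screened)  # d = 0.625, i.e. 25% larger for the same raw change
```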
1.3 Loss of effectiveness outside trials and at scale
GiveWell explains their concern, summarised in the table below:
“Our general expectation is that programs implemented as part of randomized trials are higher quality than similar programs implemented at scale. [...] For example, HLI notes that StrongMinds uses a reduced number of sessions and slightly reduced training, compared to Bolton (2003), which its program is based on. We think this type of modification could reduce program effectiveness relative to what is found in trials. [...] We can also see some evidence for lower effects in larger trials…”
Table 5: HLI and GiveWell’s views on an adjustment for StrongMinds losing effectiveness at scale

| Explains what difference in SM’s effect (%) |
| --- |
| 0.9 WELLBYs (10.1% of total gap) |
While GiveWell provides several compelling reasons why StrongMinds’ efficacy may decrease as it scales, I can’t find GiveWell’s justification for why these reasons amount to a 25% discount. It seems like a subjective judgement informed by some empirical factors and perhaps by previous experience studying this issue (e.g., cases like No Lean Season). Is there any quantitative evidence that suggests that when RCT interventions scale they drop 25% in effectiveness? While GiveWell also mentions that larger psychotherapy trials have smaller effects, I assume this is driven by publication bias (discussed in Section 1.6). I’m also less sure that scaling has no offsetting benefits. I would be surprised if, when RCTs are run, the intervention has all of its kinks ironed out. In fact, there are many cases of the RCT version of an intervention being the “minimum viable product” (Karlan et al., 2016). While I think a discount here is plausible, I’m very unsure of its magnitude.
In our updated meta-analysis we plan to do a deeper analysis of the effect of expertise and time spent in therapy, and to use this to better predict the effect of StrongMinds. We’re also awaiting the results from Baird et al., which should better reflect StrongMinds’ new strategy, since StrongMinds trained the facilitators but did not directly deliver the programme.
1.4 Disagreements on the durability of psychotherapy
GiveWell explains their concern, summarised in the table below: “We do think it’s plausible that lay-person-delivered therapy programs can have persistent long-term effects, based on recent trials by Bhat et al. 2022 and Baranov et al. 2020. However, we’re somewhat skeptical of HLI’s estimate, given that it seems unlikely to us that a time-limited course of group therapy (4-8 weeks) would have such persistent effects. We also guess that some of the factors that cause StrongMinds’ program to be less effective than programs studied in trials (see above) could also limit how long the benefits of the program endure. As a result, we apply an 80% adjustment factor to HLI’s estimates. We view this adjustment as highly speculative, though, and think it’s possible we could update our view with more work.”
Table 6: HLI and GiveWell’s views on a discount to account for a decrease in durability
Since this disagreement appears to be based mainly on reasoning, I’ll explain why my intuitions—and my interpretation of the data—differ from GiveWell’s here. I already assume that StrongMinds’ effect decays 4% more each year than psychotherapy in general does (see Table 3). Baranov et al. (2020) and Bhat et al. (2022) both find long-term effects that are greater than what our general model predicts. This means that we already assume a higher decay rate in general, and especially for StrongMinds, than the two best long-term studies of psychotherapy suggest. I show how these studies compare to our model in Figure 3 below.
Figure 3: Effects of our model over time, and the only long-term psychotherapy studies in LMICs
Edit: I updated the figure to add the StrongMinds model, which starts with a higher effect but has a faster decay.
Baranov et al. (2020, 16 intended sessions) and Bhat et al. (2022, 6-14 intended sessions, with a 70% completion rate) were both time-limited. StrongMinds historically used 12 sessions (it may be 8 now) of 90 minutes[12]. Therefore, our model is more conservative than the Baranov et al. result, and closer to Bhat et al., which has a similar range of sessions. Another reason in favour of the durability of StrongMinds’ effects, which I mentioned in McGuire et al. (2021), is that 78% of groups continued meeting on their own at least six months after the programme formally ended.
Bhat et al. (2022) is also notable in another regard: They asked ~200 experts to predict the impact of the intervention after 4.5 years. The median prediction underestimated the effectiveness by nearly 1/3rd, which makes me inclined to weigh expert priors less here[13].
Additionally, there seems to be some double-counting in GiveWell’s adjustments. The initial effect is adjusted by 0.75 for “lower effectiveness at scale and outside of trial contexts”, and the duration effect is adjusted by 0.80, also for “lower effectiveness at scale and outside of trial contexts”. Combined, this is a 0.60 adjustment instead of a single 0.80 adjustment. I feel like one concern should show up as one discount.
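The compounding point, spelled out (the 0.75 and 0.80 factors are GiveWell’s adjustments as described above):

```python
initial_adjustment = 0.75   # "lower effectiveness at scale and outside of trial contexts"
duration_adjustment = 0.80  # the same rationale, applied again to duration

print(round(initial_adjustment * duration_adjustment, 2))  # 0.6, steeper than either alone
```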
1.5 Disagreements on social desirability bias[14]
GiveWell explains their concern, which is summarised in the table below: “One major concern we have with these studies is that participants might report a lower level of depression after the intervention because they believe that is what the experimenter wants to see [...] HLI responded to this criticism [section 4.4] and noted that studies that try to assess experimenter-demand effects typically find small effects.[...] We’re not sure these tests would resolve this bias so we still include a downward adjustment (80% adjustment factor).”
Table 7: HLI and GiveWell’s views on a discount for social desirability bias
Participants might report bigger effects to be agreeable with the researchers (socially driven bias) or in the hopes of future rewards (cognitively driven bias; Bandiera et al., 2018), especially if they recognise the people delivering the survey to be the same people delivering the intervention[15].
But while I also worry about this issue, I am less concerned than GiveWell that response bias poses a unique threat to psychotherapy: if this bias exists, it seems likely to apply to all RCTs of interventions with self-reported outcomes (and without active controls). So I think the relevant question is why the propensity for response bias might differ between cash transfers and psychotherapy. Here are some possibilities:
- It seems potentially more obvious that psychotherapy should alleviate depression than that cash transfers should increase happiness. If so, questions about self-reported wellbeing may be more subject to bias in psychotherapy trials[16].
- We could expect that the later the follow-up, the less salient the intervention, and the less likely respondents are to be biased in this way (Park & Kumar, 2022). This is one possibility that could favour cash transfers, because they have relatively longer follow-ups than psychotherapy.
- However, it is obvious to cash transfer participants whether they are in the treatment condition (they receive cash) or the control condition (they get nothing). This seems less true in psychotherapy trials, where there are often active controls.
GiveWell responded to the previous evidence I cited (McGuire & Plant, 2021, Section 4.4)[17] by arguing that the tests run in the literature, which investigate the effect of the general propensity towards socially desirable responding or of surveyor expectations, are not relevant: “If the surveyor told them they expected the program to worsen their mental health or improve their mental health, it seems unlikely to overturn whatever belief they had about the program’s expected effect that was formed during their group therapy sessions.” But if participants’ views about an intervention are unlikely to be overturned by what the surveyor seems to want, even when what the surveyor wants and the participant’s experience differ, then that’s a reason to be less concerned about socially motivated response bias in general.
However, I am more concerned with socially desirable responses driven by cognitive factors. Bandiera et al. (2018, p. 25) is the only study I found to discuss the issue, but they do not seem to think this was an issue with their trial: “Cognitive drivers could be present if adolescent girls believe providing desirable responses will improve their chances to access other BRAC programs (e.g. credit). If so, we might expect such effects to be greater for participants from lower socioeconomic backgrounds or those in rural areas. However, this implication runs counter to the evidence in Table A5, where we documented relatively homogenous impacts across indices and time periods, between rich/poor and rural/urban households.”
I agree with GiveWell that more research would be very useful, and could potentially update my views considerably, particularly with respect to the possibility of cognitively driven response bias in RCTs deployed in low-income contexts.
1.6 Publication bias
GiveWell explains their concern, which we summarise in the table below: “HLI’s analysis includes a roughly 10% downward adjustment for publication bias in the therapy literature relative to cash transfers literature. We have not explored this in depth but guess we would apply a steeper adjustment factor for publication bias in therapy relative to our top charities. After publishing its cost-effectiveness analysis, HLI published a funnel plot showing a high level of publication bias, with well-powered studies finding smaller effects than less-well-powered studies. This is qualitatively consistent with a recent meta-analysis of therapy finding a publication bias of 25%.”
Table 8: HLI and GiveWell’s views on a publication bias discount
After some recent criticism, we have revisited this issue and are working on estimating the bias empirically. Publication bias seems like a real issue, and a 10-25% correction like the one GiveWell suggests seems plausible, but we’re unsure about the magnitude while our research is ongoing. In our update of our psychotherapy meta-analysis we plan to employ a more sophisticated quantitative approach to adjusting for publication bias.
1.7 Household size
GiveWell explains their concern, which we summarise in the table below: “HLI estimates household size using data from the Global Data Lab and UN Population Division. They estimate a household size of 5.9 in Uganda based on these data, which appears to be driven by high estimates for rural household size in the Global Data Lab data, which estimate a household size of 6.3 in rural areas in 2019. A recent Uganda National Household Survey, on the other hand, estimates household size of 4.8 in rural areas. We’re not sure what’s driving differences in estimates across these surveys, but our best guess is that household size is smaller than the 5.9 estimate HLI is using.”
Table 9: HLI and GiveWell’s views on household size of StrongMinds’ recipients
I think the figures GiveWell cites are reasonable. I favour using international datasets because I assume this means greater comparability between countries, but I don’t feel strongly about this. I agree it could be easy and useful to try to understand StrongMinds recipients’ household sizes more directly. We will revisit this in our StrongMinds update.
1.8 Cost per person treated by StrongMinds
The one element where we differ that makes StrongMinds look more favourable is cost. As GiveWell explains, “HLI’s most recent analysis includes a cost of $170 per person treated by StrongMinds, but StrongMinds cited a 2022 figure of $105 in a recent blog post.”
Table 10: HLI and GiveWell’s views on cost per person for StrongMinds’ treatment
According to their most recent quarterly report, a cost per person of $105 was the goal, but they claim $74 per person for 2022[18]. We agree this is a more accurate/current figure, and the cost might well be lower now. A concern is that the reduction in costs comes at the expense of treatment fidelity – an issue we will review in our updated analysis.
2. GiveWell’s cost-effectiveness estimate of AMF is dependent on philosophical views
GiveWell estimates that AMF produces 70 WELLBYs per $1000[19], which would be 4 times better than StrongMinds. GiveWell described the philosophical assumptions of their life saving analysis as: “...Under the deprivationist framework and assuming a “neutral point” of 0.5 life satisfaction points. [...] we think this is what we would use and it seems closest to our current moral weights, which use a combination of deprivationism and time-relative interest account.”
Hence, they conclude that AMF produces 70 WELLBYs per $1000, which makes StrongMinds 0.24 times as cost-effective as AMF. However, the position they take is nearly the most favourable one can take towards interventions that save lives[20]. But there are other plausible views about the neutral point and the badness of death (we discuss this in Plant et al., 2022). Indeed, assigning credences to higher neutral points[21] or alternative philosophical views of death’s badness will reduce the cost-effectiveness of AMF relative to StrongMinds (see Figure 4). In some cases, AMF is less cost-effective than GiveWell’s estimate of StrongMinds[22].
Figure 4: Cost-effectiveness of charities under different philosophical assumptions (with updated StrongMinds value, and GiveWell’s estimate for StrongMinds)
To be clear, HLI does not (yet) take a stance on these different philosophical views. While I present some of my views here, these do not represent HLI as a whole.
Personally, I’d use a neutral point closer to 2 out of 10[23]. Regarding the philosophy, I think my credences would be close to uniformly distributed across the Epicurean, TRIA, and deprivationist views. If I plug this view into our model introduced in Plant et al. (2022) then this would result in a cost-effectiveness for AMF of 29 WELLBYs per $1000 (rather than 81 WELLBYs per $1000)[24], which is about half as good as the 62 WELLBYs per $1000 for StrongMinds. If GiveWell held these views, then AMF would fall within GiveWell’s pessimistic and optimistic estimates of 3-57 WELLBYs per $1000 for StrongMinds’ cost-effectiveness. For AMF to fall above this range, you need to (A) put almost all your credence in deprivationism and (B) have a neutral point lower than 2[25].
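A sketch of the sensitivity being described, under deprivationism, where each life-year saved is worth (average life satisfaction minus the neutral point) WELLBYs. The 4.95 average is our reconstruction from figures elsewhere in this exchange (4.45 WELLBYs per year at a neutral point of 0.5), not HLI’s full model:

```python
AVG_LIFE_SATISFACTION = 4.95  # implied by 4.45 WELLBYs/year at a neutral point of 0.5

def wellbys_per_life_year(neutral_point: float) -> float:
    """Value of one extra life-year under a simple deprivationist account."""
    return AVG_LIFE_SATISFACTION - neutral_point

for neutral_point in (0.5, 1.3, 2.0, 2.5):
    print(neutral_point, round(wellbys_per_life_year(neutral_point), 2))
# Moving the neutral point from 0.5 to 2.0 cuts the per-year value by ~34%.
```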
Coincidentally, this is (barely) within our most recent confidence interval for comparing the cost-effectiveness of StrongMinds to GiveDirectly (95% CI: 2, 100).
This calculation is based on a correction for a mistake in our spillover ratio discussed here (a spillover ratio of 38% instead of 53%). Our previous estimate was 77 WELLBYs per $1000 (Plant et al., 2022; McGuire et al., 2022).
The discount on the effect per $1000 is smaller because GiveWell used a 38% smaller cost figure.
Note that the reduction in cost-effectiveness is only 73%, rather than 83%, because they also think that the costs are 38% smaller.
Coincidentally, this is (barely) within our most recent confidence interval for comparing the cost-effectiveness of StrongMinds to GiveDirectly (95% CI: 2, 100).
The text and the table give different values.
But if you are willing to accept that the results could be very off, see here for a document with tables with my very preliminary results.
These are positive psychology interventions (like mindfulness and forgiveness therapy) which might not completely generalise to psychotherapy in LMICs.
Psychotherapy improved happiness by 0.38 on a 1-10 score and reduced depression by 0.97 (on the PHQ-9’s 0-27 scale). If we convert the depression score to a 1-10 scale, using stretch transformation, then the effect is a reduction in depression of 0.32. Hence, the SWB changes are 18% larger than MHa changes. If we convert both results to Cohen’s d, we find a Cohen’s d of 0.167 for depression and a Cohen’s d of 0.165 for happiness. Hence changes in MHa are 1% greater than SWB.
“it seems likely that SD in life satisfaction score is lower among StrongMinds recipients, who are screened for depression at baseline46 and therefore may be more concentrated at the lower end of the life satisfaction score distribution than the average individual.”
Sample selection based on depression (i.e., selection based on the outcome used) could shrink the variance of depression scores in the sample, which would inflate standardised effect sizes of depression compared to trials without depression selection, because standardisation occurs by dividing the raw effect by its standard deviation (i.e., standardised mean differences, such as Cohen’s d). To explore this, I used the datasets mentioned in Table 4, all of which also included measures of depression or distress, and the data from Barker et al. (2022, n = 11,835). I found that the SD of depression for those with clinically significant depression was 18 to 21% smaller than it was for the general sample (both the mentally ill and healthy). This seems to indicate that SD changes from psychotherapy are inflated compared to cash transfers, due to smaller SDs of depression. However, I think this may be offset by another technical adjustment. Our estimate of the life-satisfaction SD we use to convert SD changes (in MHa or SWB) to WELLBYs might be larger, which means the effects of psychotherapy and cash transfers are underestimated by 14% compared to AMF. When we convert from SD-years to WELLBYs we’ve used a mix of LMIC and HIC sources to estimate the general SD of LS. But I realised that there’s a version of the World Happiness Report that published data including the SDs of LS for many LMICs. If we use this more direct data for sub-Saharan countries, it suggests a higher SD of LS than what I previously estimated (2.5 instead of 2.2, according to a crude estimate), a 14% increase.
In one of the Bhat et al. trials, each session was 30 to 45 minutes (it’s unclear what the session length was for the other trials).
Note, I was one of the predictors, and my guess was in line with the crowd (~0.05 SDs), and you can’t see others’ predictions beforehand on the Social Science Prediction Platform.
Note, this is more about ‘experimenter demand effects’ (i.e., being influenced by the experimenters in a certain direction, because that’s what they want to find) than ‘social desirability bias’ (i.e., answering that one is happier than one is because it looks better). The latter is controlled for in an RCT. We keep the wording used by GW here.
GiveWell puts it in the form of this scenario “If a motivated and pleasant IPT facilitator comes to your village and is trying to help you to improve your mental health, you may feel some pressure to report that the program has worked to reward the effort that facilitator has put into helping you.” But these situations are why most implementers in RCTs aren’t the surveyors. I’d be concerned if there were more instances of implementers acting as surveyors in psychotherapy than cash transfer studies.
On the other hand, who in poverty expects cash transfers to bring them misery? That seems about as rare (or rarer) as those who think psychotherapy will deepen their suffering. However, I think the point is about what participants think that implementers most desire.
Since then, I did some more digging. I found Dhar et al. (2018) and Islam et al. (2022) which use a questionnaire to test for propensity to answer questions in a socially desirable manner, but find similarly small results of socially motivated response bias. Park et al. (2022) takes an alternative approach where they randomise a subset of participants to self-survey, and argue that this does not change the results.
This is mostly consistent with 2022 expenses / people treated = 8,353,149 / 107,471 = $78.
81 WELLBYs per $1000 in our calculations, but they add some adjustments.
The most favourable position would be assuming deprivationism and a neutral point of zero.
People might hold that the neutral point is higher than 0.5 (on a 0-10 scale), which would reduce the cost-effectiveness of AMF. The IDinsight survey GiveWell uses covers respondents from Kenya and Ghana but has a small sample (n = 70) for its neutrality question. In our pilot report (n = 79; UK sample; Samuelsson et al., 2023) we find a neutral point of 1.3. See Samuelsson et al. (2023; Sections 1.3 and 6) for a review of the different findings in the literature and more detail on our findings. Recent unpublished work by Julian Jamison finds a neutral point of 2.5 on a sample size of ~1,800 drawn from the USA, Brazil and China. Note that, in all these cases, we recommend caution in concluding that any of these values is the neutral point. There is still more work to be done.
Under GiveWell’s analysis, there are still some combinations of philosophical factors where AMF produces 17 WELLBYs or less (i.e., is as or less good than SM in GiveWell’s analysis): (1) An Epicurean view, (2) Deprivationism with neutral points above 4, and (3) TRIA with high ages of connectivity and neutral points above 3 or 4 (depending on the combination). This does not include the possibility of distributing credences across different views.
I would put the most weight on the work by HLI and by Jamison and colleagues, mentioned above, which finds a neutral point of 1.3/10 and 2.5/10, respectively.
I average the results across each view.
We acknowledge that many people may hold these views. We also want to highlight that many people may hold other views. We encourage more work investigating the neutral point and investigating the extent to which these philosophical views are held.
Zooming out a little: is it your view that group therapy increases happiness by more than the death of your child decreases it? (GiveWell is saying that this is what your analysis implies.)
To be a little more precise:
I.e., is it your view that 4-8 weeks of group therapy (~12 hours) for 20 people is preferable to averting the death of a child?
To be clear on what the numbers are: we estimate that group psychotherapy has an effect of 10.5 WELLBYs on the recipient’s household, and that the death of a child in a LIC has a −7.3 WELLBY effect on the bereaved household. But the estimate for grief was very shallow. The report this estimate came from was not focused on making a cost-effectiveness estimate of saving a life (with AMF). Again, I know this sounds weasel-y, but we haven’t yet formed a view on the goodness of saving a life, so I can’t say how much group therapy HLI thinks is preferable to averting the death of a child.
That being said, I’ll explain why this comparison, as it stands, doesn’t immediately strike me as absurd. Grief has an odd counterfactual. We can only extend lives. People who’re saved will still die, and the people who love them will still grieve. The question is how much worse the total grief is for a very young child (the typical beneficiary of, e.g., AMF) than the grief for the adolescent, young adult, adult, or elder they’d become[1], all multiplied by mortality risk at those ages.
So is psychotherapy better than the counterfactual grief averted? Again, I’m not sure because the grief estimates are quite shallow, but the comparison seems less absurd to me when I hold the counterfactual in mind.
I assume people who are not very young children also have larger social networks, and that this could also play into the counterfactual (e.g., non-children may be grieved for by more people who forged deeper bonds). But I’m not sure how much to make of this point.
Thanks Joel.
My intuition, which is shared by many, is that the badness of a child’s death is not merely due to the grief of those around them. So presumably the question should not compare just the counterfactual grief of losing a very young child vs. an [older adult], but should also include the “lost wellbeing” from living a net-positive-wellbeing life in expectation?
I also just saw that Alex claims HLI “estimates that StrongMinds causes a gain of 13 WELLBYs”. Is this for 1 person going through StrongMinds (i.e. ~12 hours of group therapy), or something else? Where does the 13 WELLBYs come from?
I ask because if we are using HLI’s estimates of WELLBYs per death averted, and use your preferred estimate for the neutral point, then 13 / (4.95-2) is >4 years of life. Even if we put the neutral point at zero, this suggests 13 WELLBYs is worth >2.5 years of life.[1]
I think I’m misunderstanding something here, because GiveWell claims “HLI’s estimates imply that receiving IPT-G is roughly 40% as valuable as an additional year of life per year of benefit or 80% of the value of an additional year of life total.”
Can you help me disambiguate this? Apologies for the confusion.
13 / 4.95 ≈ 2.6
I didn’t mean to imply that the badness of a child’s death is just due to grief. As I said in my main comment, I place substantial credence (2/3rds) in the view that death’s badness is the wellbeing lost. Again, this is my view, not HLI’s.
The 13 WELLBY figure is the household effect of a single person being treated by StrongMinds. But that uses the uncorrected household spillover (53% spillover rate). With the correction (38% spillover) it’d be 10.5 WELLBYs (3.7 WELLBYs for recipient + 6.8 for household).
GiveWell arrives at the figure of 80% because they take a year of life as valued at 4.45 WELLBYs (4.95 − 0.5, using their preferred neutral point), and StrongMinds’ benefit to the direct recipient, according to HLI, is 3.77 WELLBYs → 3.77 / 4.45 ≈ 85%, which GiveWell reports as roughly 80%. I’m not sure where the 40% figure comes from.
That makes sense, thanks for clarifying!
If I understand correctly, the updated figures should then be:
For 1 person being treated by StrongMinds (excluding all household spillover effects) to be worth the WELLBYs gained for a year of life[1] with HLI’s methodology, the neutral point needs to be at least 4.95-3.77 = 1.18.
If we include spillover effects of StrongMinds (and use the updated / lower figures), then the benefit of 1 person going through StrongMinds is 10.7 WELLBYs.[2] Under HLI’s estimates, this is equivalent to more than two years of wellbeing benefits from the average life, even if we set the neutral point at zero. Using your personal neutral point of 2 would suggest the intervention for 1 person including spillovers is equivalent to >3.5 years of wellbeing benefits. Is this correct or am I missing something here?
1.18 as the neutral point seems pretty reasonable. Still, the idea that 12 hours of therapy for an individual is worth the wellbeing benefits of 1 year of an average life when only considering impacts on them, and anywhere between 2 and 3.5 years of life when including spillovers, does seem rather unintuitive to me, despite my view that we should probably do more work on subjective wellbeing measures on the margin. I’m not sure if this means:
WELLBYs as a measure can’t capture what I care about in a year of healthy life, so we should not use solely WELLBYs when measuring wellbeing;
HLI isn’t applying WELLBYs in a way that captures the benefits of a healthy life;
The existing way of estimating 1 year of life via WELLBYs is wrong in some other way (e.g. the 4.95 assumption is wrong, the 0-10 scale is wrong, the ~1.18 neutral point is wrong);
HLI have overestimated the benefits of StrongMinds;
I have a very poorly calibrated view of how much 12 hours of therapy or a year of life is worth, though this seems less likely.
Would be interested in your thoughts on this / let me know if I’ve misinterpreted anything!
More precisely, the average wellbeing benefits from 1 year of life for an adult in 6 African countries
3.77 × (1 + 0.38 × 4.85) ≈ 10.7
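For anyone checking the arithmetic in this exchange, here’s a minimal sketch using the figures quoted above. I’m reading the 4.85 in the footnote as the assumed number of other household members; that interpretation is mine, not stated explicitly in the thread.

```python
# Sketch of the WELLBY arithmetic discussed above (figures from this thread).

recipient = 3.77        # WELLBYs to the direct recipient (HLI, corrected figures)
spillover_rate = 0.38   # corrected household spillover rate
other_members = 4.85    # from the footnote; presumably other household members

household_total = recipient * (1 + spillover_rate * other_members)
print(f"household total: {household_total:.1f} WELLBYs")        # ~10.7

avg_ls = 4.95           # average life satisfaction, adults in 6 African countries
for neutral in (0, 2):
    years = household_total / (avg_ls - neutral)
    print(f"neutral point {neutral}: ~{years:.1f} years")       # ~2.2 and ~3.6

# Break-even neutral point at which the recipient-only benefit equals
# one year of life:
print(f"break-even neutral point: {avg_ls - recipient:.2f}")    # 1.18
```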
I appreciate your candid response. To clarify further: suppose you give a mother a choice between “your child dies now (age 5), but you get group therapy” and “your child dies in 60 years (age 65), but no group therapy”. Which do you think she will choose?
Also, if you don’t mind answering: do you have children? (I have a hypothesis that EA values are distorted by the lack of parents in the community; I don’t know how to test this hypothesis. I hope my question does not come off as rude.)
I don’t think that’s the right question for three reasons.
First, the hypothetical mother will almost certainly consider the well-being of her child (under a deprivationist framework) in making that decision—no one is suggesting that saving a life is less valuable than therapy under such an approach. Whatever the merits of an epicurean view that doesn’t weigh lost years of life, we wouldn’t have lasted long as a species if parents applied that logic to their own young children.
Second, the hypothetical mother would have to live with the guilt of knowing she could have saved her child but chose something for herself.
Finally, GiveWell-type recommendations often would fail the same sort of test. Many beneficiaries would choose receiving $8X (where X = bednet cost) over receiving a bednet, even where GiveWell thinks they would be better off with the latter.
Thanks for your response.
If the mother would rather have her child alive, then under what definition of happiness/utility do you conclude she would be happier with her child dead (but getting therapy)? I understand you’re trying to factor out the utility loss of the child; so am I. But just from the mother’s perspective alone: she prefers scenario X to scenario Y, and you’re saying it doesn’t count for some reason? I don’t get it.
I think you’re double-subtracting the utility of the child: you’re saying, let’s factor it out by not asking the child his preference, and ALSO let’s ADDITIONALLY factor it out by not letting the mother be sad about the child not getting his preference. But the latter is a fact about the mother’s happiness, not the child’s.
Let’s add memory loss to the scenario, so she doesn’t remember making the decision.
Yes, and GiveWell is very clear about this, and most donors bite the bullet (people make irrational decisions with regard to small risks of death, and also, bednets have positive externalities for the rest of the community). Do you bite the bullet that says “the mother doesn’t know enough about her own happiness; she’d be happier with therapy than with a living child”?
Finally, I do hope you’ll answer regarding whether you have children. Thanks again.
I’m not Joel (nor do I work for HLI, GiveWell, SM, or any similar organization). I do have a child, though. And I do have concerns with overemphasis on whether one is a parent, especially when one’s views are based (in at least significant part) on review of the relevant academic literature. Otherwise, does one need both to be a parent and to have experienced a severe depressive episode (particularly in a low-resource context where there is likely no safety net) in order to judge the tradeoffs between supporting AMF and supporting SM?
Personally—I am skeptical that the positive effect of therapy exceeds the negative effect of losing one’s young child on a parent’s own well-being. I just don’t think the thought experiment you proposed is a good way to cross-check the plausibility of such a view. The consideration of the welfare of one’s child (independent of one’s own welfare) in making decisions is just too deeply rooted for me to think we can effectively excise it in a thought experiment.
In any event—given that SM can deliver many courses of therapy with the resources AMF needs to save one child, the two figures don’t need to be close if one believes the only benefit from AMF is the prevention of parental grief. SM’s effect size would only need to be greater than 1/X of the WELLBYs lost from parental grief from one child death, where X is the number of courses SM can deliver with the resources AMF needs to prevent one child death. That is the bullet that epicurean donors have to bite to choose SM over AMF, as the rough numbers below illustrate.
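A minimal sketch of that arithmetic, pulling figures from elsewhere in this thread (~60 courses of therapy for the ~$7,800 AMF needs per death averted, and HLI’s shallow −7.3 WELLBY grief estimate):

```python
# Break-even check for the strong-epicurean comparison described above.

courses_per_grant = 60   # therapy courses fundable with the ~$7,800 AMF needs per death
grief_wellbys = 7.3      # household grief per child death (HLI's shallow estimate)

# Under strong epicureanism, AMF's only counted benefit is the grief averted,
# so SM wins whenever each course delivers more than grief/courses:
print(grief_wellbys / courses_per_grant)   # ~0.12 WELLBYs per course
```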
Sorry for confusing you for Joel!
It’s good to hear you say this.
Definitely true. But if a source (like a specific person or survey) gives me absurd numbers, it is a reason to dismiss it entirely. For example, if my thermometer tells me it’s 1000 degrees in my house, I’m going to throw it out. I’m not going to say “even if you merely believe it’s 90 degrees we should turn on the AC”. The exaggerated claim is disqualifying; it decreases the evidentiary value of the thermometer’s reading to zero.
When someone tells me that group therapy is more beneficial to the mother’s happiness than saving her child from death, I don’t need to listen to that person anymore. And if it’s a survey that tells me this, throw out the survey. If it’s some fancy academic methods and RCTs, the interesting question is where they went wrong, and someone should definitely investigate that, but at no point should people take it seriously.
By all means, let’s investigate how the thermometer possibly gave a reading of 1000 degrees. But until we diagnose the issue, it is NOT a good idea to use “1000 degrees in the house” in any decision-making process. Anyone who uses “it’s 1000 degrees in this room” as a placeholder value for making EA decisions is, in my view, someone who should never be trusted with any levers of power, as they cannot spot obvious errors that are staring them in the face.
We both think the ratio of parental grief WELLBYs to therapy WELLBYs is likely off, although that doesn’t tell us which number is wrong. Given that your argument is that an implausible ratio should tip HLI off that there’s a problem, the analysis below takes the view more favorable to HLI—that the parental grief number (for which much less work has been done) is at least the major cause of the ratio being off.
As I see it, the number of WELLBYs preserved by averting an episode of parental grief is very unlikely to be material to any decision under HLI’s cost-effectiveness model. Under philosophical assumptions where it is a major contributor to the cost-effectiveness estimate, that estimate is almost always going to be low enough that life-saving interventions won’t be considered cost-effective on the whole. Under philosophical assumptions where life-saving programs may be cost-effective, the bulk of the effectiveness will come directly from the effect on the saved life itself. Thus, it would not be unreasonable for HLI—which faces significant resource constraints—to have deprioritized attempts to improve the accuracy of its estimate for WELLBYs preserved by averting an episode of parental grief.
Given that, I can see three ways of dealing with parental grief in the cost-effectiveness model for AMF. Ignoring it seems rather problematic. I would argue that reporting the value one’s relatively shallow research provided (with a disclaimer that one has low certainty in the value) is often more epistemically virtuous than adjusting to some value one thinks is more likely to be correct for intuitive reasons, bereft of actual evidence to support that number. I guess the third way is to just not publish anything until one can turn in more precise models . . . but that norm would make it much more difficult to bring new and innovative ideas to the table.

I don’t think the thermometer analogy really holds here. Assuming HLI got a significantly wrong value for WELLBYs preserved by averting an episode of parental grief, there are a number of plausible explanations, the bulk of which would not justify not “listen[ing] to [them] anymore.” The relevant literature on grief could be poor quality or underdeveloped; HLI could have missed important data or modeled inadequately due to the resources it could afford to spend on the question; it could have made a technical error; its methodology could be ill-suited for studying parental grief; its methodology could be globally unsound; and doubtless other reasons. In other words, I wouldn’t pay attention to the specific thermometer that said it was much hotter than it was . . . but in most cases I would only update weakly against using other thermometers by the same manufacturer (charity evaluator), or distrusting thermometer technology in general (the WELLBY analysis).
Moreover, I suspect there have been, and will continue to be, malfunctioning thermometers at most of the major charity evaluators and major grantmakers. The grief figure is a non-critical value relating to an intervention that HLI isn’t recommending. For the most part, if an evaluator or grantmaker isn’t recommending or funding an organization, it isn’t going to release its cost-effectiveness model for that organization at all. Even where funding is recommended, there often isn’t the level of reasoning transparency that HLI provides. If we are going to derecognize people who have used malfunctioning thermometer values in any cost-effectiveness analysis, there may not be many people left to perform them.
I’ve criticized HLI on several occasions before, and I’m likely to find reasons to criticize it again at some point. But I think we want to encourage its willingness to release less-refined models for public scrutiny (as long as the limitations are appropriately acknowledged) and its commitment to reasoning transparency more generally. I am skeptical of any argument that would significantly incentivize organizations to keep their analyses close to the chest.
I disagree with you on several points.
The most important thing to note here is that, if you dig through the various long reports, the tradeoff is:
With $7800 you can save the life of a child, or
If you grant HLI’s assumptions regarding costs (and I’m a bit skeptical even there), you can provide multi-week group therapy to 60 people for that same cost (I think 12 sessions of 90 minutes).
Which is better? Well, right off the bat, if you think mothers would value their children at 60x what they value the therapy sessions, you’ve already lost.
Of course, the child’s life also matters, not just the mother’s happiness. But HLI has a range of “assumptions” regarding how good a life is, and in many of these assumptions the life of the child is indeed fairly value-less compared to benefits in the welfare of the mother (because life is suffering and death is OK, basically).
All this is obfuscated under various levels of analysis. Moreover, in HLI’s median assumption, not only is the therapy more effective, it is 5x more effective. They are saying: the number of group therapies that equal the averted death of a child is not 60, but rather, 12.
To me that’s broken-thermometer level.
I know the EA community is full of broken thermometers, and it’s actually one of the reasons I do not like the community. One of my main criticisms of EA is, indeed, “you’re taking absurd numbers (generated by authors motivated to push their own charities/goals) at face value”. This also happens with animal welfare: there’s this long report and 10-part forum series evaluating animals’ welfare ranges, and it concludes that 1 human has the welfare range of (checks notes) 14 bees. Then others take that at face value and act as if a couple of beehives or shrimp farms are as important as a human city.
This is not the first time I’ve had this argument made to me when I criticize an EA charity. It seems almost like the default fallback. I think EA has the opposite problem, however: nobody ever dares to say the emperor has no clothes, and everyone goes around pretending 1 human is worth 14 bees and a group therapy session increases welfare by more than the death of your child decreases it.
I think it is possible to buy that humans’ maximal pains and pleasures are only 14 times as intense as bees’, and still think 14 bees = 1 human is silly. You just have to reject hedonism about well-being. I have strong feelings about saving humans over animals, but I have no intuition whatsoever that if my parents’ dog burns her paw it hurts less than when I burn my hand. The whole idea that animals have less intense sensations than us seems to me less like a commonsense claim, and more like something people committed to both hedonism and antispeciesism made up to reconcile their intuitive repugnance at results like 10 pigs (or whatever) = 1 human. (Bees are kind of a special case because lots of people are confident they aren’t conscious at all.)
Where’s the evidence that, e.g., everyone “act[s] as if a couple of beehives or shrimp farms are as important as a human city”? So someone wrote a speculative report about bee welfare ranges . . . if “everyone” accepted that “1 human is worth 14 bees”—or even anything close to that—the funding and staffing pictures in EA would look very, very different. How many EAs are working in bee welfare, and how much is being spent in that area?
As I understand the data, EA resources in GH&D are pretty overwhelmingly in life-saving interventions like AMF, suggesting that the bulk of EA does not agree with HLI at present. I’m not as well versed in farmed animal welfare, but I’m pretty sure no one in that field is fundraising for interventions costing anywhere remotely near hundreds of dollars to save a bee and claiming they are effective.
In the end, reasoning transparency by charity evaluators helps the donor make a better-informed moral choice. Carefully reading analyses from various sources helps me (and other donors) make choices that are consistent with our own values. EA is well ahead of most charitable movements in explicitly acknowledging that trade-offs exist and at least attempting to reason about them. One can (and should) decline to donate where the charity’s treatment of tradeoffs isn’t convincing. As I’ve stated elsewhere on this post, I’m sticking with GiveWell-style interventions at least for now.
Oh, I should definitely clarify: I find effective altruism the philosophy, as well as most effective altruists and their actions, to be very good and admirable. My gripe is with what I view as the “EA community”—primarily places like this forum, organizations such as the CEA, and participants in EA Global. The more central something is to EA-the-community, the less I like its ideas.
In my view, what happens is that there are a lot of EA-ish people donating to GiveWell charities, and that’s amazing. And then the EA movement comes and goes “but actually, you should really give the money to [something ineffective that’s also sometimes in the personal interest of the person speaking]” and some people get duped. So forums like this one serve to take money that would go to malaria nets, and try as hard as they can to redirect it to less effective charities.
So, to your questions: how many people are working towards bee welfare? Not many. But on this forum, it’s a common topic of discussion (often with things like nematodes instead of bees). I haven’t been to EA Global, but I know where I’d place my bets for what receives attention there. Though honestly, both HLI and the animal welfare stuff are probably small potatoes compared to AI risk and meta-EA, two areas in which these dynamics play an even bigger role (and in which there are even more broken thermometers and conflicts of interest).
Do you think there’s a number you would accept for how many people treated with psychotherapy would be “worth” the death of one child?
Yes. There is a large range of such numbers. I am not sure of the right tradeoff. I would intuitively expect a billion therapy sessions to be an overestimate (i.e. clearly more valuable than the life of a child), but I didn’t do any calculations. A thousand seems like an underestimate, but again who knows (I didn’t do any calculations). HLI is claiming (checks notes) ~12.
To flip the question: Do you think there’s a number you would reject for how many people treated with psychotherapy would be worth the death of one child, even if some seemingly-fancy analysis based on survey data backed it up? Do you ever look at the results of an analysis and go “this must be wrong,” or is that just something the community refuses to do on principle?
Thank you for this detailed and transparent response!
I applaud HLI for creating a chart (and now an R Shiny app) to show how philosophical views can affect the tradeoff between predominately life-saving and predominately life-enhancing interventions. However, one challenge with that approach is that almost any change to your CEA model will be outcome-changing for donors in some areas of that chart.[1]
For example, the 53% → 38% correction alone switched the recommendation for donors with a deprivationist framework who think the neutral point is over ~0.65 but under ~1.58. Given that GiveWell’s moral weights were significantly derived from donor preferences, and (0.5, deprivationism) is fairly implied by those donor weights, I think that correction shifted the recommendation from SM to AMF for a significant number of donors even though it was only material to one of three philosophical approaches and about 1 point of neutral-point assumptions.
GiveWell reduced the WELLBY estimate from about 62 (based on the 38% figure) to about 17, a difference of about 45. If I’m simplifying your position correctly, for about half of those WELLBYs you disagree with GiveWell that an adjustment is appropriate. For about half of them, you believe a discount is likely appropriate, but think it is likely less than GiveWell modelled.
If we used GiveWell’s numbers for that half but HLI’s numbers otherwise, that split suggests that we’d end up with about 39.5 WELLBYs. So one way to turn your response into a donor-actionable statement would be to say that there is a zone of uncertainty between 39.5 and 62 WELLBYs. One might also guess that the heartland of that zone is between about 45 and 56.5 WELLBYs, reasoning that it is less likely that your discounts will be less than 25% or more than 75% of GiveWell’s.
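As a quick sketch of that arithmetic:

```python
# Zone-of-uncertainty arithmetic from the two paragraphs above.

givewell = 17   # GiveWell's adjusted WELLBY estimate for SM
hli = 62        # HLI's estimate (using the corrected 38% spillover)
half = (hli - givewell) / 2   # the half of the gap where some discount seems apt

print(hli - half)                              # 39.5: bottom of the zone
print(hli - 0.75 * half, hli - 0.25 * half)    # ~45.1 to ~56.4: the 'heartland'
```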
The bottom end of that zone of uncertainty (39.5) would pull the neutral point at which a deprivationist approach would conclude AMF = SM up to about 2.9. I suspect few people employing a deprivationist approach have the neutral point that high. AMF is also superior to SM on a decent number of TRIA-based approaches at 39.5 WELLBYs.
So it seems there are two reasonable approaches to donor advice under these kinds of circumstances:
One approach would encourage donors within a specified zone of uncertainty to hold their donations until HLI sufficiently updates its CEA for SM to identify a more appropriate WELLBY figure; or
The other approach would encourage donors to make their decision based on HLI’s best estimate of what the WELLBY figure will be on the next update of the CEA. Even if the first approach is generally correct, some donors will need to use this second approach for various reasons (e.g., tax reasons).
I don’t think reaffirming advice on the current model in the interim without any adjustments is warranted, unless you believe the adjustments will be minor enough such that a reasonable donor would likely not find them of substantive importance no matter where they are on the philosophical chart.[2]
In the GiveWell model, the top recommendation is to give to a regranting fund, and there isn’t any explicit ranking of the four top charities. So the recommendation is actually to defer the choice of specific charity to someone who has the most up-to-date information when the monies are actually donated to the effective charity. Moreover, all four top charities are effective in very similar ways. Thus, GiveWell’s bottom-line messaging to donors is much less sensitive to changes in the CEA for any given charity.
I am not sure how to define “minor.” I think whether the change flips the recommendation to the donor is certainly relevant, but wouldn’t go so far as to say that any change that flips the recommendation for a given donor’s philosophical assumptions would be automatically non-minor. On the other hand, I think a large enough change can be non-minor even if it doesn’t flip the recommendation on paper. Some donors apply discounts and bonuses not reflected in HLI’s model. For instance, one could reasonably apply a discount to SM when compared to better-studied interventions, on the basis that CEAs usually decrease as they become more complete. Or one could reasonably apply a bonus to SM because funding a smaller organization is more likely to have a positive effect on its future cost-effectiveness. Thus, just because the change is not outcome-determinative on HLI’s base model doesn’t mean it isn’t so on the donor’s application of the model. The time-to-update and amount of funds involved are also relevant. All that being said, my gut thinks that the starting point for determining minor vs. non-minor is somewhere in the neighborhood of 10%.
Jason,
You raise a fair point. One we’ve been discussing internally. Given the recent and expected adjustments to StrongMinds, it seems reasonable to update and clarify our position on AMF to say something like, “Under more views, AMF is better than or on par with StrongMinds. Note that currently, under our model, when AMF is better than StrongMinds, it isn’t wildly better.” Of course, while predicting how future research will pan out is tricky, we’d aim to be more specific.
Is this (other than 53% being corrected to 38%) from the post accurate?
If so, a substantial discount seems reasonable to me. It’s plausible these studies also say almost nothing about the spillover, because of how unrepresentative they seem. Presumably much of the content of the therapy will be about the child, so we shouldn’t be surprised if it has much more impact on the child than general therapy for depression.
It’s not clear any specific number away from 0 could be justified.
I find nothing objectionable in that characterization. And if we only had these three studies to guide us, then I’d concede that a discount of some size seems warranted. But we also have (A) our priors and (B) some new evidence from Barker et al. Both point me away from very small spillovers, but again, I’m still very unsure. I think I’ll have clearer views once I’m done analyzing the Barker et al. results and have had someone, ideally Nathanial Barker, check my work.
[Edit: Michael edited to add: “It’s not clear any specific number away from 0 could be justified.”] Well not-zero certainly seems more justifiable than zero. Zero spillovers implies that emotional empathy doesn’t exist, which is an odd claim.
To clarify what I edited in, I mean that, without better evidence/argument, the effect could be arbitrarily small but still nonzero. What reason do we have to believe it’s at least 1%, say, other than very subjective priors?
I agree that analysis of new evidence should help.
I’d point to the literature on time-lagged correlations between household members’ emotional states that I quickly summarised in the last installment of the household spillover discussion. I think it implies a household spillover of 20%. But I don’t know whether this type of data should over- or underestimate the spillover ratio relative to what we’d find in RCTs. I know I’m being really slippery about this, but the Barker et al. analysis so far makes me think it’s larger than that.
Regarding the question of what philosophical view should be used, I wonder if it would also matter if someone were something like prioritarian rather than a total utilitarian. StrongMinds looks to focus on people who suffer more than typical members of these countries’ populations, whilst the lives saved by AMF would presumably cover more of the whole distribution of wellbeing. So a prioritarian may favour StrongMinds more, assuming the people helped are not substantially better off economically or in other ways. (Though, it could perhaps also be argued that the people who would die without AMF’s intervention are extremely badly off pre-intervention.)
As my colleagues have mentioned in their responses (Michael’s general response, Joel’s technical response), the WELLBYs per $1000 that GiveWell put forward for AMF are dependent on philosophical choices about the badness of death and the neutral point. There are a range of plausible possible choices and these can affect the results. HLI does not hold a view.
We’ve whipped up an R Shiny app so that you, the reader, can play around with these choices and see how your views affect the comparison between StrongMinds and AMF.
Please note that this is a work in progress and was done very quickly. Also, I’m using the free plan for hosting the app so it might be a bit slow/limited in monthly bandwidth.
This is really helpful! One suggestion for future improvement would be to allow the user to specify a mix among the philosophical views (or at least to be able to select predefined mixes of those views).
Thank you for the feedback! We are keen on that feature too, distributing credences between the views is the next step I am working on.
Thanks to GiveWell for sharing this!
It’s worth emphasizing that this analysis estimates StrongMinds at about 2.3x as effective as GiveDirectly-type programs, which is itself a pretty high bar, and plausibly up to ~8x as effective (or as low as ~0.5x). If we take GD as the bar for a program being one of the most effective in the Global Health space, this conclusion suggests that StrongMinds is very likely to be a strong program (no pun intended), even if it isn’t the single best use of marginal funding. I know that’s obvious from reading the full post, but I think it bears some emphasis that we’re talking about donor choice among a variety of programs that we have reason to believe are rather effective.
FWIW I don’t think GiveDirectly should be “the bar” for being considered one of the most effective organizations in the global health and development space.
I think both 5x and 10x differences are big and meaningful in this domain, and I think there are likely billions of dollars in funding gaps between GiveWell’s bar (~10x) and GiveDirectly. I think donors motivated by EA principles would be making a mistake, and leaving a lot of value on the table by donating to GiveDirectly or StrongMinds over GiveWell’s recommendations (I say this as someone who’s donated to both StrongMinds and GiveDirectly in the past, and hugely respects the work they both do).
Recognize this might be a difference in what we mean by “one of” the most effective, but wanted to comment because this sentiment feeds into a general worry I have that a desire for pluralism and positivity within GH&D (both good and important things!) is eroding intensity about prioritization (more important IMO).
Fair points. I’m not planning to move my giving to GiveWell All Grants to either SM or GD, and don’t mean to suggest anyone else does so either. Nor do I want to suggest we should promote all organizations over an arbitrary bar without giving potential donors any idea about how we would rank within the class of organizations that clear that bar despite meaningful differences.
I mainly wrote the comment because I think the temperature in other threads about SM has occasionally gotten a few degrees warmer than I think optimally conducive to what we’re trying to do here. So it was an attempt at a small preventive ice cube.
I think you’re right that we probably mean different things by “one of.” 5-10X differences are big and meaningful, but I don’t think that insight is inconsistent with the idea that a point estimate something around “above GiveDirectly” is around the point at which an organization should be on our radar as potentially worth recommending given the right circumstances.
One potential definition for the top class would be whether a person could reasonably conclude on the evidence that it was the most effective based on moral weights or assumptions that seem plausible. Here, it’s totally plausible to me that a donor’s own moral weights might value reducing suffering from depression relatively more than GiveWell’s analysis implies, and saving lives relatively less. GiveWell’s model here makes some untestable philosophical assumptions that seem relatively favorable to AMF: “deprivationist framework and assuming a ‘neutral point’ of 0.5 life satisfaction points.” As HLI’s analysis suggests at Section 3.4 of this study, the effectiveness of AMF under a WELLBY/subjective well-being model is significantly dependent on these assumptions.
For a donor with significantly different assumptions and/or moral weights, adjusting for those could put SM over AMF even accepting the rest of GiveWell’s analysis. More moderate philosophical differences could put one in a place where more optimistic empirical assumptions + an expectation that SM will continue reducing cost-per-participant and/or effectively refine its approach as it scales up could lead to the same conclusion.
Another potential definition for the top class would be whether one would feel more-than-comfortable recommending it to a potential donor for whom there are specific reasons to choose an approach similar to the organization’s. I think GiveWell’s analysis suggests the answer is yes for reasons similar to the above. If you’ve got a potential donor who just isn’t that enthusiastic about saving lives (perhaps due to emphasizing a more epicurean moral weighting) but is motivated to give to reducing human suffering, SM is a valuable organization to have in one’s talking points (and may well be a better pitch than any of the GiveWell top charities under those circumstances).
Thanks Jason, makes sense.
I think I’m more skeptical than you that reasonable alternative assumptions make StrongMinds look more cost effective than AMF. But I agree that StrongMinds seems like it could be a good fit for some donors.
Interested if you could elaborate here. I’m not sure which intuitions you consider ‘reasonable’ and why. As Joel’s figure 4 above indicates, for either HLI’s or GiveWell’s estimates of StrongMinds, talk therapy can be more cost-effective than bednets, and vice versa, but which is more effective depends on the philosophical assumptions—so that ought to be the debate we’re having, but aren’t. Perhaps we have a much more capacious concept of which assumptions are reasonable and you’d want to rule some of those out? If so, which ones?
I’m not sure if this is what you meant, but if we’re talking about a raw difference in philosophical/ethical intuitions, I am very reluctant to say that some of those are unreasonable—in case any reader isn’t aware of this, philosophy seminars often end up with the discussants realising they just have different intuitions. To say, at this point, your intuitions are reasonable and the other person’s are unreasonable is the rhetorical equivalent of banging your fist on the table—you’re not making a new argument, you’re just hoping the other person will agree with you anyway!
Sure, happy to elaborate.
Here’s figure 4 for reference:
I think each part of this chart has some assumptions I don’t think are defensible.
1. I don’t think a neutral point higher than 2 is defensible.
You cite three studies in this report.[1] My read on what to conclude about the neutral point from those is:
i) IDinsight 2019 (n=70; representative of GW recipients): you highlight the average answer of 0.56, but this excludes the 1/3 of people who say it’s not possible to have a life worse than death.[2] I think including those as 0 more accurately reflects their preferences, so 2/3 × 0.56 = 0.37/10
ii) Peasgood et al., unpublished (n=75; UK): you say 2/10, and I can’t find the study, so I’m taking that at face value.
iii) Jamison and Shukla, unpublished (n=600; US, Brazil, China): you highlight the average answer of 25/100. In private communication with the author, I got the impression that 1.8/10 was probably more appropriate because the scale used in this study isn’t comparable to typical life satisfaction scales.[3]
So what to make of this? I think you could reasonably put weight on the largest study (1.8/10). Or you could put weight on the most representative study (0.37). I lean towards the latter, because I intuitively find it quite likely that less well-off people will report lower neutral points (I don’t feel certain about this, and I’m hoping Jamison & Shukla will have enough sample to test it). But either way, I don’t see any way of combining these studies to get an answer higher than 2, as the rough weighting below illustrates.
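For instance, here’s a crude sample-size-weighted average of the three adjusted figures above; even this, which leans heavily on the largest (and second-highest) estimate, stays under 2:

```python
# Crude sample-size-weighted average of the three neutral-point estimates above.
studies = [
    (70, 0.37),   # IDinsight 2019, adjusted to include "no life worse than death"
    (75, 2.0),    # Peasgood et al., unpublished
    (600, 1.8),   # Jamison & Shukla, rescaled from 25/100
]
weighted = sum(n * x for n, x in studies) / sum(n for n, _ in studies)
print(f"{weighted:.2f}")   # ~1.69
```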
In addition, a neutral point of 5 implies the average person in over 40 countries would be better off dead. A neutral point of 2.5 implies the average person in Afghanistan would be better off dead. I find these both jarring implications.
HLI’s belief that a neutral point of 5 is within a reasonable range seems to come from Diener et al. 2018.[4] But that article’s not explicit about what it means by “neutral point”. As far as I can tell from a quick skim, it seems to be defining “neutral” as halfway between 0 and 10.
2. I don’t think 38% is a defensible estimate for spillovers, which puts me closer to GiveWell’s estimate of StrongMinds than HLI’s estimate of StrongMinds.
I wrote this critique of your estimate that household spillovers were 53%. That critique had three parts. The third part was an error, which you corrected, bringing the answer down to 38%. But I think the first two are actually more important: you’re deriving a general household spillover effect from studies specifically designed to help household members, which would lead to an overestimate.
I thought you agreed with that from your response here, so I’m confused as to why you’re still defending 38%. Flagging that I’m not saying the studies themselves are weak (though it’s true that they’re not very highly powered). I’m saying they’re estimating a different thing from what you’re trying to estimate, and there are good reasons to think the thing they’re trying to estimate is higher. So I think your estimate should be lower.
3. I don’t think strong epicureanism is a defensible position
Strong epicureanism (the red line) is the view that death isn’t bad for the person who dies. I think it’s logically possible to hold this position as a thought experiment in a philosophy seminar, but I’ve never met anyone who actually believes it and I’d be deeply troubled if decisionmakers took action on the basis of it. You seem to agree to some extent,[5] but by elevating it to this chart, and putting it alongside the claim that “Against Malaria Foundation is less cost-effective than StrongMinds under almost all assumptions” I think you’re implying this is a reasonable position to take action on, and I don’t think it is.
So I think my version of this chart looks quite different: the x-axis is between 0.4 and 2, the StrongMinds estimate’s quite a bit closer to GiveWell than HLI, and there’s no “epicureanism” line.
What does HLI actually believe?
More broadly, I’m quite confused about how strongly HLI is recommending StrongMinds. In this post, you say (emphasis mine)
And
But you’ve said elsewhere:
That strikes me as inconsistent. You’ve defined a range of assumptions you believe are reasonable, then claimed that StrongMinds > AMF on almost all of those assumptions. And then said you don’t take a stance on these assumptions. But you have to actually defend the range of assumptions you’ve defined as reasonable. And in my view, they’re not.
“Empirical work on how individuals interpret the scale could be helpful but is extremely limited. A small (n = 75) survey in the UK found that respondents would choose death over life at a life satisfaction level of about 2/10 (Peasgood et al., unpublished, as referenced in Krekel & Frijters, 2021). A survey of people living in poverty in Ghana and Kenya estimated the neutral point as 0.56 (IDinsight, 2019, p. 92; n = 70). There are also preliminary results from a sample of 600 in the USA, Brazil, and China that finds a neutral point of 25/100 (Jamison & Shukla, private communication). At the Happier Lives Institute, we are currently working on our own survey to explore this topic further and hope to share our results soon.” Elephant in the bednet
“Approximately one third of respondents stated that it’s not possible to have a life that’s worse than death. These respondents cited deontological frameworks such as the inherent and immeasurable value of life regardless of other factors. The remaining respondents (close to two thirds) indicate that there are points on the ladder where life is worse than death. For these respondents, this point is substantially lower than their current life satisfaction scores – the average point identified was 0.56 on a ladder from 0 to 10, compared to their current average life satisfaction score of 2.21” IDInsight 2019, pg 94
I’m not sharing the full reasoning because it was private correspondence and I haven’t asked the authors if they’d be comfortable with me sharing.
“Other wellbeing researchers, such as Diener et al. (2018), appear to treat the midway point on the scale as the neutral point (i.e., 5 on a 0-10 scale).” Elephant in the bednet
“Although what we might call strong Epicureanism, the view that death is not bad at all, has few takers, there may be more sympathy for weak Epicureanism, where death can be bad, but relatively more weight is given to living well than living long” Elephant in the bednet
On 3. Epicureanism being a defensible position
Epicureanism is discussed in almost every philosophy course on the badness of death. It’s taken seriously rather than treated as an absurd position or a non-starter, and whilst not that many philosophers end up as Epicureans, I’ve met some who are very sympathetic. I find critics dismiss the view too quickly, and I’ve not seen anything that’s convinced me the view has no merit. I don’t think we should have zero credence in it, and it seems reasonable to point out that it is one of the options. Again, I’m inclined to let donors make up their own minds.
On what HLI actually believes
HLI is currently trying not to have a view on these issues, but point out to donors how having different views would change the priorities so they can form their own view. We may have to develop a ‘house view’ but none of the options for doing this seem particularly appealing (they include: we use my view, we use a staff aggregate, we poll donors, we poll the public, some combo of the previous options).
You bring up this quote
I regret this sentence, which is insufficiently nuanced and I wouldn’t use it again (you and I have discussed this privately). That said, I think we’re quite well-caveated elsewhere. You quote this bullet point:
But you didn’t quote the bullet point directly before it (emphasis added):
The backstory to the “we confidently recommend StrongMinds” bit is that, when we did the analysis, StrongMinds looked better under almost all assumptions and, even where AMF was better, it was only slightly better (1.3x). We thought donors would want an overall recommendation, and hence StrongMinds seemed like the safe choice (given some intuitions about donors’ intuitions and moral uncertainty). You’re right that we’ll have to rethink what our overall recommendations are, and how to frame them, once the dust has settled on this debate.
Finally, whilst you say
This feels uneasily like a double standard. As I’ve pointed out before, neither GiveWell nor Open Philanthropy really defends their views in general (asserting a view isn’t the same as defending it). In this report, GiveWell doesn’t defend its assumptions, point out what other assumptions one might (reasonably) take, or say how this would change the result. Part of what we have tried to highlight in our work is that these issues have been mostly ignored and can really matter.
Our aim was more to cover the range of views we think some reasonable people would believe, not to restrict it to what we think they should believe. We motivated our choices in the original report and will restate that briefly here. For the badness of death, we give the three standard views in the literature. At one end, deprivationism gives ‘full value’ to saving lives. At the other, Epicureanism gives no weight to saving lives. TRIA offers something in between. For the neutral point, we used a range that included what we saw as the minimum and maximum possible values. Including a range of values is not equivalent to saying they are all equally probable. We encourage donors and decision-makers to use values they think are most plausible (for example, by using this interactive chart).
In suggesting James quote these together, it sounds like you’re saying something like “this is a clear caveat to the strength of recommendation behind StrongMinds, HLI doesn’t recommend StrongMinds as strongly as the individual bullet implies, it’s misleading for you to not include this”.
But in other places HLI’s communication around this takes on a framing of something closer to “The cost effectiveness of AMF, (but not StrongMinds) varies greatly under these assumptions. But the vast majority of this large range falls below the cost effectiveness of StrongMinds”. (extracted quotes in footnote)[1]
As a result of this framing, despite the caveat that HLI “[does] not advocate for any particular view”, I think it’s reasonable to interpret this as being strongly supportive of StrongMinds, which can be true even if HLI does not have a formed view on the exact philosophical view to take.[2]
If you did mean the former (that the bullet about philosophical assumptions is primarily included as a caveat to the strength of recommendation behind StrongMinds), then there is probably some tension here between (emphasis added):
-”the relative value of life-extending and life-improving interventions depends very heavily on the philosophical assumptions you make...there is no simple answer”, and
-”We conclude StrongMinds > AMF under almost all assumptions”
Additionally I think some weak evidence to suggest that HLI is not as well-caveated as it could be is that many people (mistakenly) viewed HLI as an advocacy organisation for mental health interventions. I do think this is a reasonable outside interpretation based on HLI’s communications, even though this is not HLI’s stated intent. For example, I don’t think it would be unreasonable for an outsider to read your current pinned thread and come away with conclusions like:
“StrongMinds is the best place to donate”,
“StrongMinds is better than AMF”,
“Mental health is a very good place to donate if you want to do the most good”,
“Happiness is what ultimately matters for wellbeing and what should be measured”.
If these are not what you want people to take away, then I think pointing to this bullet point caveat doesn’t really meaningfully address this concern—the response kind of feels something like “you should have read the fine print”. While I don’t think it’s necessary for HLI to take a stance on specific philosophical views, I do think it becomes an issue if people are (mis)interpreting HLI’s stance based on its published statements.
(commenting in personal capacity etc)
As you’ve acknowledged, comments like “We’re now in a position to confidently recommend StrongMinds as the most effective way we know of to help other people with your money.” perhaps add to the confusion.
Do you think the neutral point and basic philosophical perspective (e.g., deprivationism vs. epicureanism) are empirical questions, or are they matters on which the donor has to exercise their own moral and philosophical judgment (after considering what the somewhat limited survey data have to say on the topic)?
I would graph the neutral point from 0 to 3. I think very few donors would set the neutral point above 3, and I’d start with the presumption that the most balanced way to present the chart is probably to center it fairly near the best guess from the survey data. On the other hand, if you have most of the surveys reporting “about 2,” then it’s hard to characterize 3 as an outlier view—presumably, a good fraction of the respondents picked a value near, at, or even over 3.
Although I don’t think HLI puts it this way, it doesn’t strike me as implausible to view human suffering as a more severe problem than lost human happiness. As I noted in a different comment, I think of that chart as a starting point from which a donor can apply various discounts and bonuses on a number of potentially relevant factors. But another way to account for this would be to give partial weight to strong epicureanism as a means of discounting the value of lost human happiness vis-a-vis suffering.
Given that your critique was published after HLI’s 2022 charity recommendation, I think it’s fair to ask HLI whether it would reaffirm those characterizations today. I would agree that the appropriate conclusion, on HLI’s current state of analysis, is that the recommendation is either SM or GiveWell’s top charities depending on the donor’s philosophical assumptions. I don’t think it’s inappropriate to make a recommendation based on the charity evaluator’s own philosophical judgment, but unless HLI has changed its stance it has taken no position. I don’t think it is appropriate to merely assume equal credence for each of the philosophical views and neutral points under consideration.
One could also defensibly make a summary recommendation based on stated assumptions about donor values or on recipient values. But the best information I’ve seen on those points—the donor and beneficiary surveys as reflected in GiveWell’s moral weights—seemingly points to a predominately deprivationist approach with a pretty low neutral point (otherwise the extremely high value on saving the lives of young children wouldn’t compute).
Thanks Jason, mostly agree with paras 4-5, and think para 2 is a good point as well.
I think the basic philosophical perspective is a moral/philosophical judgement. But the neutral point combines that moral judgement with empirical models of what peoples’ lives are actually like, and empirical beliefs about how people respond to surveys.
I wonder if, insofar as we do have different perspectives on this (and I don’t think we’re particularly far apart, particularly on the object level question), the crux is around how much weight to put in individual donor judgement? Or even how much individual donors have those judgements?
My experience of even EA-minded (or at least GiveWell) donors is that ~none of them have a position on these kinds of questions, and they actively want to defer. My (less confident but based on quite a few conversations) model of EA-minded StrongMinds donors is they want to give to mental health and see an EA-approved charity so give there, rather than because of a quantitative belief on foundational questions like the neutral point. As an aside, I believe that was how StrongMinds first got on EA’s radar—as a recommendation for Founders Pledge donors who specifically wanted to give to mental health in an evidence-based way.
It does seem plausible to me that donors who follow HLI recommendations (who I expect are particularly philosophically minded) would be more willing to change their decisions based on these kinds of questions than donors I’ve talked to.
I’d be interested if someone wanted to stick up for a neutral point of 3 as something they actually believe and a crux for where they give, rather than something someone could believe, or is plausible. I could be wrong, but I’m starting out skeptical that belief would survive contact with “But that implies the world would be better if everyone in Afghanistan died” and “a representative survey of people whose deaths you’d be preventing think their lives are more valuable than that”
What do you think?
From HLI’s perspective, it makes sense to describe how the moral/philosophical views one assumes affect the relative effectiveness of charities. They are, after all, a charity recommender, and donors are their “clients” in a sense. GiveWell doesn’t really do this, which makes sense—GiveWell’s moral weights are so weighted toward saving lives that it doesn’t really make sense for them to investigate charities with other modes of action. I think it’s fine to provide a bottom-line recommendation on whatever moral/philosophical view a recommender feels is best-supported, but it’s hardly obligatory.
We recognize donor preferences in that we don’t create a grand theory of effectiveness and push everyone to donate to longtermist organizations, or animal-welfare organizations, or global health organizations depending on the grand theory’s output. Donors choose among these for their own idiosyncratic reasons, but moral/philosophical views are certainly among the critical criteria for many donors. I don’t see why that shouldn’t be the case for interventions within a cause area that produce different kinds of outputs as well.
Here, I doubt most global-health donors—either those who take advice from GiveWell or from HLI—have finely-tuned views on deprivationism, neutral points, and so on. However, I think many donors do have preferences that indirectly track on some of those issues. For instance, you describe a class of donors who “want to give to mental health.” While there could be various reasons for that, it’s plausible to me that these donors place more of an emphasis on improving experience for those who are alive (e.g., they give partial credence to epicureanism) and/or on alleviating suffering. If they did assess and chart their views on neutral point and philosophical view, I would expect them to end up more often at points where SM is ranked relatively higher than the average global-health donor would. But that is just conjecture on my part.
One interesting aspect of thinking from the donor perspective is the possibility that survey results could be significantly affected by religious beliefs. If many respondents chose a 0 neutral point because their religious tradition led them to that conclusion, and you are quite convinced that the religious tradition is just wrong in general, do you adjust for that? Does not adjusting allow the religious tradition to indirectly influence where you spend your charitable dollar?
To me, the most important thing a charity evaluator/recommender does is clearly communicate what the donation accomplishes (on average) if given to various organizations they identify—X lives saved (and smaller benefits), or Y number of people’s well-being improved by Z amount. That’s the part the donor can’t do themselves (without investing a ton of time and resources).
I don’t think the neutral point is as high as 3. But I think it’s fine for HLI to offer recommendations for people who do.
Hi James, thanks for elaborating, that’s really useful! We’ll reply to your points in separate comments.
Your statement: “1. I don’t think a neutral point higher than 2 is defensible”
Reply: I don’t think we have enough evidence or theory to be confident about where to put the neutral point.
Your response about where to put the neutral point involves taking answers to survey questions where people are asked something like “where on a 0-10 scale would you choose not to keep living?” and assuming we should take those answers at face value for where to locate the neutral point. However, this conclusion strikes me as too fast; I don’t think we have enough theory or evidence on this issue. Are we definitely asking the right questions? Do we understand people’s responses? Should we agree with them even if we understand them?
I’m not sure if I told you about this, but we’re working on a pilot survey for this and other wellbeing measuring issues. The relevant sections for neutrality are 1.3 and 6. I’ll try to put the main bits here, to make life easier (link to initial part of report on EA forum; link to full report):
As we say in footnote 5, this is straightforwardly entailed by the standard formulation of utilitarianism.
Unfortunately, we find that, on a life satisfaction scale, participants put the zero point at 5/10, and the neutral point at 1.3/10. We’re not really sure what to make of this. Here’s what we say in section 6.2 in full.
We plan to think about this more and test our hypotheses in the full version of the survey. If you have ideas for what we should test, now would be a great time to share them!
Given the methodological challenges in measuring the neutral point, I would have some hesitation to credit any conclusions that diverged too much from what revealed preferences imply. A high neutral point implies that many people in developing countries believe their lives are not worth living. So I’d look for evidence of behavior (either in respondents or in the population more generally) that corroborated whether people acted in a way that was consistent with the candidate neutral point.
For instance, although money, family, and other considerations doubtless affect it, studying individuals who are faced with serious and permanent (or terminal) medical conditions might be helpful. At what expected life satisfaction score do they decline treatment? If the neutral point is relatively close to the median point in a country, one would expect to see a lot of people decide to not obtain curative treatment if the results would leave them 1-2 points less satisfied than their baseline.
You might be able to approximate that by asking hypothetical questions about specific situations that you believe respondents would assess as reducing life satisfaction by a specified amount (disability, imprisonment, social stigma, etc.), and then ask whether the respondent believes they would find life still worth living if that happened. I don’t think that approach works to establish a neutral point, but I think having something more concrete would be an important cross-check on what may otherwise come across as an academic, conjectural exercise to many respondents.
This isn’t necessarily the case. Even if people described their lives as having negative wellbeing, I don’t think this would imply they thought their lives were not worth continuing.
People can have negative wellbeing and still want to live for the sake of others or causes greater than themselves.
Life satisfaction appears to be increasing over time in low-income countries. I think this progress is such that many people who may have negative wellbeing at present will not have negative wellbeing for their whole lives.
Edit: To expand a little: for these reasons, as well as the very reasonable drive to survive (regardless of wellbeing), I find it difficult to interpret revealed preferences, and it’s unclear they’re a bastion of clarity in this confusing debate.
Anecdotally, I’ve clearly had periods of negative wellbeing before (sometimes starkly), but never wanted to die during those periods. If I knew that such periods were permanent, I’d probably think it was good for me to not-exist, but I’d still hesitate to say I’d prefer to not-exist, because I don’t just care about my wellbeing. As Tyrion said, “Death is so final, and life is so full of possibilities.”
I think this should highlight that the difficulties here aren’t localized to just this part of the topic.
Thanks for these points! The idea that people care about more than their wellbeing may be critical here. I’m thinking of a simplified model with the following assumptions: a mean lifetime wellbeing of 5, SD 2, normal distribution, wellbeing is constant through the lifespan, with a neutral point of 4 (which is shared by everyone).
Under these assumptions, AMF gets no “credit” (except for grief avoidance) for saving the life of a hypothetical person with wellbeing of 4. I’m really hesitant to say that saving that person’s life doesn’t morally “count” as a good because they are at the neutral point. On the one hand, the model tells me that saving this person’s life doesn’t improve total wellbeing. On the other hand, suppose I (figuratively) asked the person whose life was saved, and he said that he preferred his existence to non-existence and appreciated AMF saving his life.
At that point, I think the WELLBY-based model might not be incorporating some important data—the person telling us that he prefers his existence to non-existence would strongly suggest that saving his life had moral value that should indeed “count” as a moral good in the AMF column. His answers may not be fully consistent, but it’s not obvious to me why I should fully credit his self-reported wellbeing but give zero credence to his view on the desirability of his continued existence. I guess he could be wrong to prefer his continued existence, but he is uniquely qualified to answer that question and so I think I should be really hesitant to completely discount what he says. And a full 30% of the population would have wellbeing of 4 or less under the assumptions.
Even more concerning, AMF gets significantly “penalized” for saving the life of a hypothetical person with wellbeing of 3 who also prefers existence to non-existence. And almost 16% of the population would score at least that low.
Of course, the real world is messier than a quick model. But if you have a population where the neutral point is close enough to the population average, but almost everyone prefers continued existence, it seems that you are going to have a meaningful number of cases where AMF gets very little / no / negative moral “credit” for saving the lives of people who want (or would want) their lives saved. That seems like a weakness, not a feature, of the WELLBY-based model to me.
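For concreteness, here’s a minimal sketch of the toy model above in Python; the 50-year horizon for years of life gained is my own illustrative assumption, and the rest follows the stated parameters (mean 5, SD 2, neutral point 4):

```python
from statistics import NormalDist

# Toy model from the comment above: lifetime wellbeing ~ Normal(5, 2),
# constant over the lifespan, with a universal neutral point of 4.
wellbeing = NormalDist(mu=5, sigma=2)
NEUTRAL_POINT = 4

# Share of the population at or below a given wellbeing level.
print(f"Share at wellbeing <= 4: {wellbeing.cdf(4):.1%}")  # ~30.9% ("a full 30%")
print(f"Share at wellbeing <= 3: {wellbeing.cdf(3):.1%}")  # ~15.9% ("almost 16%")

def wellbys_from_saving(level: float, years_gained: float) -> float:
    """WELLBYs credited for saving a life, ignoring grief avoidance."""
    return (level - NEUTRAL_POINT) * years_gained

for level in (5, 4, 3):
    print(f"Wellbeing {level}: {wellbys_from_saving(level, 50):+.0f} WELLBYs")
# Wellbeing 4 earns zero credit and wellbeing 3 earns negative credit,
# even when the person saved would prefer continued existence.
```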
I could have been clearer: the 38% is a placeholder while I do the Barker et al. 2022 analysis. You did update me about the previous studies’ relevance. My arguments are less about supporting the 38% figure—which I expect to update with more data—and more about explaining why I have a higher prior for household spillovers from psychotherapy than you and Alex seem to. But really, the hope is that we can soon be discussing more and better evidence.
Not going into the wider discussion, I specifically disagree with this idea: there’s a trade-off here between estimated impact and things like risk, paternalism, and scalability. If I’m risk-averse enough, or give some partial weight to being less paternalistic, I might prefer donating to GiveDirectly—which I in fact do, despite having chosen to donate to AMF in the past.
(In practice, I expect I’ll try to do a sort of Softmax based on my subjective estimates of a few different charities and give different amounts to all of them.)
We are really pleased to see that GiveWell has engaged with the subjective wellbeing approach and has assessed our work at the Happier Lives Institute. There are a lot of complicated topics to cover, so we’ve split our response into two. I’m going to try to give a shorter, non-technical reply for those that want to know, in broad terms, what HLI’s response is. My colleague Joel will dive into all the details and provide more substance. It’s not quite a ‘good cop, bad cop’ routine, so much as a ‘simple cop, more-than-you-wanted-to-know cop’ routine. You have been warned…
Here’s my reply, in a large nutshell
We’re very grateful to GiveWell for writing this and sending it to us a week in advance.
We were pleasantly surprised to see GiveWell evaluating charities using happiness data, and in terms of “Well-being Life-Years” aka WELLBYs. We are also encouraged that StrongMinds comes out as more cost-effective than cash transfers on their analysis.
GiveWell’s analysis should be seen as a game of two halves. The first half is GiveWell reevaluating our cost-effectiveness of StrongMinds. The second half is comparing StrongMinds against the Against Malaria Foundation, a GiveWell top-charity.
On the first half: GiveWell concludes the effect of StrongMinds is 83% smaller, but this figure is the result of various researcher-made subjective discounts. We find that only 5% of the 83% discount is clearly supported by the evidence. This raises questions about the role and limits of subjective assessments.
On the second half: GiveWell claims AMF, one of their top charities, is 4x more cost-effective than StrongMinds, but glosses over how comparing life-improving against life-saving interventions is very complex and heavily depends on your philosophical assumptions. GiveWell puts forward its analysis using only its own ‘house view’, a view which is one of the most favourable to saving lives. On different, reasonable assumptions the life-improving option is better. We think these issues merited greater attention than GiveWell’s report provided—we hope GiveWell returns to them another time.
Here’s my reply, in more depth
1. I’m extremely grateful to Alex Cohen and GiveWell for writing this report, and generously sending it to us a week in advance so we could prepare a reply.
Readers may or may not know that I floated the ideas of (1) in general, using subjective wellbeing, or happiness, scores as a measure of impact and (2) more specifically, mental health interventions being unduly overlooked, about 5 years ago now (e.g. here and here). I’ve also directly raised these issues in meetings with GiveWell staff several times over that period and urged them to engage with (1) and (2) on the grounds they could substantially change our views on what the top giving opportunities are. This is GiveWell’s first substantial public response, and it’s incredibly useful to be able to have the debate, see where we disagree, and try to move things forward. I’ve often been asked “but what do GiveWell think?” and not known what to say. But now I can point to this! So, thank you.
2. We were pleasantly surprised to see GiveWell evaluating charities using happiness data, and in terms of “Well-being Life-Years” aka WELLBYs. We are also encouraged that StrongMinds comes out as more cost-effective than cash transfers on their analysis.
We are delighted to see GiveWell using the subjective wellbeing approach. We’ve long advocated for it: we think we should ‘take happiness seriously’, use self-report surveys, and measure impact in wellbeing life-years (‘WELLBYs’; see this write-up or this talk for more detail). We see it much as Churchill saw democracy—it’s the worst option, apart from all the others. Ultimately, it’s the wellbeing approach we’re really excited about; despite what some have thought, we are not axiomatically committed to improving mental health specifically. If there are better ways to increase happiness (e.g. improving wealth or physical health, stopping wars, etc.), we would support those instead.
That said, we are surprised by the use of wellbeing data. In discussions over the years, GiveWell staff have been very sceptical about the subjective wellbeing approach. Alex doesn’t express that scepticism here and instead comments positively on the method. So we’re not sure why, or to what extent, the organisation’s thinking has changed.
We also think it’s worth flagging that, even on GiveWell’s (more sceptical) evaluation of StrongMinds, it is still at least 2x better than cash transfers. Opinions will differ on whether StrongMinds should, simply because of that, count as a ‘top recommendation’, and we don’t want to get stuck into those debates. We do think it shows that mental health interventions merit more attention (especially for people who are most concerned with improving the quality of lives). We’re unsure how GiveWell thinks StrongMinds compares to deworming interventions: this isn’t mentioned in the report, even though GiveWell have previously argued that deworming is many times better than cash transfers.
3. GiveWell’s analysis should be seen as a game of two halves. The first half is GiveWell reevaluating our cost-effectiveness of StrongMinds. The second half is comparing StrongMinds against GiveWell’s top (life-saving) charities, such as the Against Malaria Foundation.
Almost all of GiveWell’s report is focused on the first half. Let me comment on these halves in turn.
4. On the first half: GiveWell concludes the effect of StrongMinds is 83% smaller, but this figure is the result of various researcher-made subjective discounts. We find that only 5% of the 83% discount is clearly supported by the evidence. This raises questions about the role and limits of subjective assessments.
How does GiveWell reach a different conclusion from HLI about the cost-effectiveness of StrongMinds? As mentioned, I’ll deal in broad strokes here, whereas Joel gets into the details. What GiveWell does is look at the various parts of our CEA, reassess them, then apply a subjective discount based on the researcher’s judgement. For the most part, GiveWell concludes a reduction is appropriate, but they do recommend one increase related to the costs (we used a figure of $170 per treatment, whereas GiveWell uses $105; this seems reasonable to us and is based on StrongMinds’ data). At the end of this process, the good-done-per-treatment-provided figure for StrongMinds has gone down by 83% to 1.08 WELLBYs, compared to 10.5 WELLBYs, a pretty hefty haircut.
Should we be convinced by these adjustments? GiveWell makes 7 discounts but for only 1 of these do we agree there is clear evidence indicating (1) that there should be a discount and (2) how big the discount should be. For instance, GiveWell discounts the effect of StrongMinds by 25% on the grounds that programmes are less effective when applied at scale. The basic idea seems fine, but it is not clear where the 25% figure comes from, or whether it’s justified. In another case—and readers need not worry about the technicalities here—GiveWell applies a 20% discount because they reason that those with depression will have a smaller variance in life satisfaction scores; however, when we do a quick check of the evidence, we find those with depression have a larger variation in life satisfaction scores, so no discount is warranted. The rest of the analysis is similar. Ultimately, we conclude that of the 83% reduction, only 5% is clearly supported by the evidence. We are unsympathetic to 35% because of differing intuitions, and 15% we think is not warranted by the evidence. And for the remaining 45%, we are sympathetic to there being a discount, but no evidence is provided to demonstrate that the size of the adjustment is justified.
All this raises the question: to what extent should researchers make subjective adjustments to CEAs and other empirical analyses? We detect something of a difference between how we and GiveWell think about this. At HLI, we seem more uncomfortable with deviating from the data than GiveWell does. We don’t know what the right balance is. Possibly we’re too stringent. But this is the sort of case that worries us about researcher-based discounts: although each of Alex’s adjustments is small taken individually, together they end up reducing the numbers by a factor of about 10, which seems large, and the analysis is driven (more?) by intuition than by empirical evidence.
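As a toy illustration of how individually modest discounts compound multiplicatively (only the 25% scale-up and 20% variance figures come from the discussion above; the other values are placeholders I chose for illustration, not GiveWell’s actual numbers):

```python
from math import prod

# Seven hypothetical multiplicative discounts. Only the 25% (scale-up) and
# 20% (variance in life satisfaction) figures appear above; the rest are
# illustrative placeholders.
discounts = [0.25, 0.20, 0.25, 0.20, 0.25, 0.20, 0.25]

retained = prod(1 - d for d in discounts)
print(f"Effect retained: {retained:.0%}")        # ~16%
print(f"Overall reduction: {1 - retained:.0%}")  # ~84%
# Seven cuts of 20-25% each shrink the estimate by more than 80% overall,
# even though no single discount looks drastic on its own.
```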
Overall, GiveWell’s analysis provides a minor, immediate update to our CEA and additional motivation to look into various areas when we update our analysis this year.
5. On the second half: GiveWell claims AMF, one of their top charities, is 4x more cost-effective than StrongMinds, but glosses over how comparing life-improving against life-saving interventions is very complex and heavily depends on your philosophical assumptions. GiveWell puts forward its analysis using only its own ‘house view’, one of the most favourable to saving lives. On different, reasonable assumptions the life-improving option is better. We think these issues merited greater attention than GiveWell’s report provided—we hope GiveWell returns to them another time.
How do GiveWell compare the cost-effectiveness of StrongMinds against their top charities? The top charity they mention in the post is the Against Malaria Foundation. Hence, GiveWell needs to also put WELLBY numbers on AMF. How do they do that? Importantly, AMF is a life-saving intervention, whereas StrongMinds is a life-improving intervention, so this is more of an apples-to-oranges comparison. As we’ve recently argued, there isn’t “one best way” of doing this: the ‘output’ you get depends really heavily on the philosophical assumptions, or ‘inputs’, you make. Here’s part of the summary of our previous report:
We show how much cost-effectiveness changes by shifting from one extreme of (reasonable) opinion to the other. At one end, AMF is 1.3x better than StrongMinds. At the other, StrongMinds is 12x better than AMF. We do not advocate for any particular view. Our aim is simply to show that these philosophical choices are decision-relevant and merit further discussion.
What GiveWell does is use the framework and the figures we set out in our previous report, then plug in their preferred assumptions on the two key issues (the ‘account of the badness of death’ and the ‘neutral point’). This leads them to reach the conclusion that, on their reduced numbers for StrongMinds, AMF is 4x more cost-effective than StrongMinds. What GiveWell doesn’t point out is that their preferred assumptions are amongst the most favourable to the life-saving side of the comparison, and there are other positions you could reasonably hold that would lead you to the conclusion that the life-improving intervention, StrongMinds, is more cost-effective. Regardless of whether you accept our original estimate of StrongMinds, or GiveWell’s new, lower estimate, your conclusion about which of StrongMinds or AMF is more cost-effective is still dependent on these philosophical choices, i.e. going from one extreme to the other still flips the results. Again, I’ll leave it to Joel to get into the specifics.
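To make that sensitivity concrete in the meantime, here’s a toy sketch of just one of the two knobs, the neutral point, under a deprivationist account of the badness of death; every number below is illustrative rather than HLI’s or GiveWell’s actual figure:

```python
# Under deprivationism, the WELLBYs gained by averting a death scale with
# (average wellbeing - neutral point) * remaining years of life.
YEARS_GAINED = 50     # hypothetical remaining life expectancy
AVG_WELLBEING = 4.5   # hypothetical average life satisfaction on a 0-10 scale

for neutral_point in (0.5, 2.0, 3.0, 4.0):
    wellbys = (AVG_WELLBEING - neutral_point) * YEARS_GAINED
    print(f"Neutral point {neutral_point}: {wellbys:.0f} WELLBYs per death averted")
# Moving the neutral point from 0.5 to 4.0 cuts the value of a death averted
# by a factor of 8 here, so a fixed WELLBY figure for a life-improving
# intervention can flip from worse to better without any new empirical data.
```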
In some sense, the disagreement in the second half of the analysis is similar to how it was in the first: it’s not the result of indisputable facts, but based on moral judgments and subjective intuitions.
For one part of the life-saving-vs-life-improving comparison, the location of the ‘neutral point’ is currently an understudied, open question. We think that further empirical research can help, and we are undertaking some now—see this other recent report. For the other part, which view of the badness of death we should take (should we prioritise the youngest, prioritise infants over adults, or prioritise living well over living long?), this is a well-worn moral philosophy question, not amenable to data, but decision-makers could certainly think about it more to better form their views. In general, because these issues can make such a difference, we think we should pay close attention to them, which is why we consider GiveWell’s treatment to have been too brief.
Overall, we are really pleased that GiveWell has engaged with this work and produced this report. While we disagree with some aspects of the analysis and agree with others, there is plenty to be done to improve our collective understanding here, and we plan to incorporate insights from this discussion into our subsequent analyses of StrongMinds and similar programmes. As we continue to search for the best ways to improve worldwide wellbeing, we would be very happy to collaborate with GiveWell, or anyone else, to find out what these are.
Joel from HLI here,
Alex kindly shared a draft of this report and discussed feedback from Michael and me more than a year ago. He also recently shared this version before publication. We’re very pleased to see that this is finally published!
We will be responding in more (maybe too much) detail tomorrow. I’m excited to see more critical discussion of this topic.
Edit: the response (Joel’s, Michael’s, Sam’s) has arrived.
It seems hard to believe that the life satisfaction of bad lives can only span 0.5 points on a 0-10 life satisfaction scale, assuming a neutral point of 0.5. Or, if that is the case, then a marginal increase in measured life satisfaction should have greater returns to actual welfare (whatever that is) near the low end of the scale than elsewhere.
Thanks for the update!
Personally, I am more concerned about GiveWell neglecting effects on animals, i.e. the meat-eater problem. These may well imply GiveWell’s top charities are harmful, even in the nearterm:
From here, the negative utility of farmed chickens is 2.64 times the positive utility of humans.
From here, the effects of GiveWell’s top charities on wild arthropods are 1.50k (i.e., roughly 1,500) times their effects on humans (based on deforestation rates, and Rethink Priorities’ median moral weight for silkworms). I do not know whether arthropods have good or bad lives.
However, the conclusion for me is that the overall effect accounting for humans and animals is pretty unclear, since it is really hard to know whether wild arthropods have good or bad lives. One may argue we should ignore the impacts on wild animals due to their uncertainty, but I do not think that is fair, because it is a case of complex cluelessness (not one of simple cluelessness, where very uncertain effects can be ignored based on evidential symmetry). I agree that, mathematically, E(“overall effect”) > 0 if:
“Overall effect” = “nearterm effect on humans” + “nearterm effect on animals” + “longterm effect”.
E(“nearterm effect on humans”) > 0.
E(“nearterm effect on animals” + “longterm effect”) = k E(“nearterm effect on humans”).
k = 0.
However, under complex cluelessness, setting k to 0 is unfair. One could just as well set it to −1, in which case E(“overall effect”) = 0. Since I am not confident |k| << 1, I am not confident either about the sign of E(“overall effect”).
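To make the algebra explicit: substituting the third condition into the first gives E(“overall effect”) = (1 + k) E(“nearterm effect on humans”). So, given E(“nearterm effect on humans”) > 0, the overall expectation is positive exactly when k > −1, zero at k = −1, and negative below that, which is why uncertainty about whether k could reach −1 translates directly into uncertainty about the sign.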
I would also say the impact on farmed animals has relatively low uncertainty. Rethink Priorities’ median moral weight for chickens is 0.332 (see here), and the Welfare Footprint Project has done great research on measuring the pain farmed chickens experience.
A comment here is not a good place for this: it’s barely related to the content of the post. A new top level post, your shortform, or a comment on relevant post would be a much better fit.
(It’s also very similar to a comment you wrote a week ago in another unrelated thread.)
Just thought I’d say I’m actually interested by Vasco’s comment. I don’t see why it’s not related—the post is meant to be assessing overall cost-effectiveness (according to the title), so effects on animals are potentially relevant (edit: OK, the title refers to HLI’s analysis and the comment is about GiveWell’s, but it applies to both, so I’d accept it). If the point were only written about elsewhere, then it could easily be missed by readers interested in this topic. That said, I think a fuller write-up of how the meat-eater problem may affect views on which charities are most cost-effective would also be helpful.
It’s a completely different conversation in my book. The post, per the title, is an assessment of HLI’s model of SM’s effectiveness. I don’t really see Vasco’s comment as being about GW’s assessment of HLI’s model, HLI’s model itself, or SM’s effectiveness with any particularity. It’s more about the broad idea that GH&D effects for almost any GH&D program may be swamped by animal-welfare and longtermist effects.
I do actually think there is a related point to be made that is appropriate to the post: (1) it is good that we have a new published analysis that SM is very likely an effective charity; because (2) even under GW’s version of the analysis, some donors may feel SM is an attractive choice in the global health & development space because they are concerned about the meat-eater problem [link to Vasco’s analysis here] and/or environmental concerns that potentially affect life-saving and economic-development modes of action.
The reasons I’d find that kind of comment helpful—but didn’t find the comment by @Vasco, as written, well-suited for this post—include:
(1) the perspective above is an attempt at a practical application of GW’s findings that is much more hooked into the main subject of the post (which is about SM and HLI’s CEA thereof), and
(2) By noting the meat-eater problem but linking to a discussion in one’s own post, rather than attempting to explain/discuss it in a post trying to nail down the GH&D effects of SM, the risk of derailing the discussion on someone else’s post is significantly reduced.
Ya, the analyses explicitly include spillover effects on some individuals who aren’t directly affected by the interventions (i.e. household family members), but ignore potentially important predictable nearterm indirect effects (those on nonhuman animals) and all of the far-future effects. And they don’t explain why.
However, ignoring effects on nonhuman animals and the far future is typical for analyses of global health and poverty interventions. And this is discussed in other places where cause prioritization is the main topic. I’d guess, based on comments elsewhere on the EA Forum and other EA-related spaces, nonhuman animal effects are ignored because the authors don’t agree with giving nonhuman animals so much moral weight relative to humans, or are doing worldview diversification and they aren’t confident in such high moral weights. I don’t think we’d want a comment like Vasco’s on many global health and poverty intervention posts, because we don’t want to have the same discussion scattered and repeated this way, especially when there are better places to have it. Instead, Vasco’s own posts, posts about moral weight and posts about cause prioritization would be better places.
When people bring up effects on wild fish, I often point out that they’re thinking about it the wrong way (getting the supply responses wrong) and ignoring population effects. But I’m pretty sure this is something they would care about if informed, and there aren’t that many posts about wild fish. I also suspect we should be more worried about animal product reduction backfiring in the near term because of wild animal effects, but I think this is more controversial and animal product reduction is covered much more on the EA Forum than fishing in particular, so passing comments on posts about diet change and substitutes doesn’t seem like a good way to have this discussion.
I guess there’s a question of whether a comment like Vasco’s would be welcome every now and then on global health and poverty posts, but it could be a slippery slope.
I agree with your assessment that Vasco’s comment is not really on topic.
I also feel like there is a lack of substantive discussion and just overall engagement on the forum (this post and comment section being an exception).
I’m not exactly sure why this is (maybe there just aren’t enough EAs), but it seems related to users being worried that their comments might not add value, combined with the lack of anonymity and in-group dynamics. In general I find Hacker News and subreddits like r/neoliberal significantly more thought-provoking and engaging, even though I think the commenters there are often engaging more hedonistically and less to add value. On the margin the EA Forum should be more serious and have stricter norms than those communities, but I’m worried that forum users optimizing individual posts and comments for usefulness is lowering the overall usefulness of the forum.
Hi Jeff,
Thanks for taking the time to comment that!
Just to make a point on this comment related to how the forum works: it looks like people don’t like it on net, but there may be a substantial minority interested in animal welfare considerations who find it helpful (I count myself here), and therefore it would be valuable for those people. But currently it’s automatically hidden as if it were spam-like and not worth reading for anyone. This seems suboptimal, and perhaps a stricter bar for hiding comments should be set. Comments with low scores are sent to the bottom of the page anyway, so it’s unlikely to be that bothersome.
It may also be valuable for people to be able to see the numbers of upvotes and downvotes separately, so they can see if there’s a minority of readers who appreciate their comments vs getting pure downvotes, which give different messages in terms of feedback.
Given how the forum currently works, it seems people should be cautious about downvoting comments that a substantial minority of others may find useful and wouldn’t want hidden (and should use disagree-voting to indicate a difference of judgement).