Alex’s post has two parts. First, what is the estimated impact of StrongMinds in terms of WELLBYs? Second, how cost-effective is StrongMinds compared to the Against Malaria Foundation (AMF)? I briefly present my conclusions to both in turn. More detail about each point is presented in Sections 1 and 2 of this comment.
The cost-effectiveness of StrongMinds
GiveWell estimates that StrongMinds generates 1.8 WELLBYs per treatment (17 WELLBYs per $1000, or 2.3x GiveDirectly[1]). Our most recent estimate[2] is 10.5 WELLBYs per treatment (62 WELLBYs per $1000, or 7.5x GiveDirectly). This represents an 83% discount (a gap of 8.7 WELLBYs)[3] to StrongMinds’ effectiveness[4]. These discounts, while sometimes informed by empirical evidence, are primarily subjective in nature. Below I present the discounts, and our response to them, in more detail.
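As an arithmetic check on these headline figures (the cost figures, $170 per treatment in our analysis and $105 in GiveWell’s, are discussed in Section 1.8):

```python
# Check of the headline WELLBY figures. Costs per treatment are taken
# from Section 1.8: $170 (HLI's analysis) and $105 (GiveWell's figure).
hli_effect, givewell_effect = 10.5, 1.8  # WELLBYs per treatment
hli_cost, givewell_cost = 170, 105       # USD per treatment

discount = 1 - givewell_effect / hli_effect  # ~0.83, i.e. an 83% discount
gap = hli_effect - givewell_effect           # 8.7 WELLBYs

hli_per_1000 = hli_effect / hli_cost * 1000              # ~62 WELLBYs per $1000
givewell_per_1000 = givewell_effect / givewell_cost * 1000  # ~17 WELLBYs per $1000

print(round(discount * 100), round(gap, 1), round(hli_per_1000), round(givewell_per_1000))
```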
Figure 1: Description of GiveWell’s discounts on StrongMinds’ effect, and their source
Notes: The graph shows the factors that make up the 8.7 WELLBY discount.
Table 1: Disagreements on StrongMinds’ per-treatment effect (10.5 vs. 1.8 WELLBYs) and cost
Note: GiveWell estimates StrongMinds has an effect of 1.8 WELLBYs per household of recipient. HLI estimates that this figure is 10.5. This represents an 8.7 WELLBY gap.
How do we assess GiveWell’s discounts? We summarise our position below.
Figure 2: HLI’s views on GiveWell’s total discount of 83% to StrongMind’s effects
We think there’s sufficient evidence and reason to justify the existence and magnitude of ~5% of GiveWell’s total discount.
For ~45% of their total discount, we are sympathetic to including a discount, but we are unsure about the magnitude (generally, we think the discount would be lower). The adjustments that I think are the most plausible are:
A discount of up to 15% for conversion between depression and life-satisfaction SD.
A discount of up to 20% for loss of effectiveness at scale.
A discount of up to 5% for response biases.
Reducing the household size down to 4.8 people.
We are unsympathetic to ~35% of their total discount, because our intuitions differ, but there doesn’t appear to be sufficient existing evidence to settle the matter (i.e., household spillovers).
We think that for 15% of their total discount, the evidence that exists doesn’t seem to substantiate a discount (i.e., their discounts on StrongMind’s durability).
However, as Michael mentions in his comment, a general source of uncertainty we have is about how and when to make use of subjective discounts. We will make more precise claims about the cost-effectiveness of StrongMinds when we finalise our revision and expansion.
The cost-effectiveness of AMF
The second part of Alex’s post asks how cost-effective StrongMinds is compared to the Against Malaria Foundation (AMF). AMF, which prevents malaria with insecticide-treated bednets, is, in contrast to StrongMinds, a primarily life-saving intervention. Hence, as @Jason rightly pointed out elsewhere in the comments, its cost-effectiveness strongly depends on philosophical choices about the badness of death and the neutral point (see Plant et al., 2022). GiveWell takes a particular set of views (deprivationism with a neutral point of 0.5) that are very favourable to life-saving interventions. But there are other plausible views that can change the results, and even make GiveWell’s estimate of StrongMinds seem more cost-effective than AMF. Whether you accept our original estimate of StrongMinds or GiveWell’s lower estimate, the comparison is still incredibly sensitive to these philosophical choices. I think GiveWell is full of incredible social scientists, and I admire many of them, but I’m not sure that should privilege their philosophical intuitions.
Further research and collaboration opportunities
We are truly grateful to GiveWell for engaging with our research on StrongMinds. I think we largely agree with GiveWell regarding promising steps for future research. We’d be keen to help make many of these come true, if possible. Particularly regarding: other interventions that may benefit from a SWB analysis, household spillovers, publication bias, the SWB effects of psychotherapy (i.e. not just depression), and surveys about views on the neutral point and the badness of death. I would be delighted if we could make progress on these issues, and doubly so if we could do so together.
1. Disagreements on the cost-effectiveness of StrongMinds
HLI estimates that psychotherapy produces 10.5 WELLBYs (or 62 per $1000, 7.5x GiveDirectly) for the household of the recipient, while GiveWell estimates that psychotherapy has about a sixth of the effect, 1.8 WELLBYs (17 per $1000 or 2.3x GiveDirectly[5]). In this section, I discuss the sources of our disagreement regarding StrongMinds in the order I presented in Table 1.
1.1 Household spillover differences
Household spillovers are our most important disagreement. When we discuss the household spillover effect or ratio we’re referring to the additional benefit each non-recipient member of the household gets, as a percentage of what the main recipient receives. We first analysed household spillovers in McGuire et al. (2022), which was recently discussed here. Notably, James Snowden pointed out a mistake we made in extracting some data, which reduces the spillover ratio from 53% to 38%.
GiveWell’s method relies on:
Discounting the 38% figure, citing several general reasons: (A) specific concerns that the studies we use might overestimate the benefits because they focused on families with children who had high-burden medical conditions; and (B) a shallow review of correlational estimates of household spillovers, which found spillover ratios ranging from 5% to 60%.
And finally, concluding that their best guess is that the spillover percentage is 15% or 20%[6], rather than 53% (what we used in December 2022) or 38% (what we would use now in light of Snowden’s analysis). Since their resulting figure is a subjective estimate, we aren’t exactly sure how they arrived at it, or how much they weigh each piece of evidence.
Table 2: HLI and GiveWell’s views on household spillovers of psychotherapy

| Variable | HLI | GiveWell | Explains how much of the difference in SM’s effect |
|---|---|---|---|
| Household spillover ratio for psychotherapy | 38% | 15% | 3 WELLBYs (34% of total gap) |

Note: The household spillover for cash transfers we estimated is 86%.
I reassessed the evidence very recently—as part of the aforementioned discussion with James Snowden—and Alex’s comments don’t lead me to update my view further. In my recent analysis, I explained that I think I should weigh the studies we previously used less because they do seem less relevant to StrongMinds, but I’m unsure what to use instead. And I also hold a more favourable intuition about household spillovers for psychotherapy, because parental mental health seems important for children (e.g., Goodman, 2020).
But I think we can agree that collecting and analysing new evidence could be very important here. The data from Barker et al. (2022), a high-quality RCT of the effect of CBT on the general population in Ghana (n = ~7,000), contains information on both partners’ psychological distress when one of them received cognitive behavioural therapy, so it can be used to estimate any spousal spillover effects from psychotherapy. I am in the early stages of analysing this data[7]. There also seems to be a lot of promising primary work that could be done to estimate household spillovers alongside the effects of psychotherapy.
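To make the mechanics concrete, here is a sketch of how the spillover ratio feeds into a household-level total. The direct (recipient) effect of 3.7 WELLBYs used here is purely illustrative, not a figure from either analysis:

```python
# Household total = direct effect + (household size - 1) * spillover * direct effect.
# The 3.7 WELLBY direct effect is a hypothetical placeholder for illustration.
def household_total(direct, household_size, spillover):
    return direct * (1 + (household_size - 1) * spillover)

direct = 3.7  # hypothetical per-recipient effect (WELLBYs)
hh = 5.9      # HLI's household-size estimate (see Section 1.7)

print(round(household_total(direct, hh, 0.38), 1))  # with HLI's 38% ratio
print(round(household_total(direct, hh, 0.15), 1))  # with GiveWell's 15% ratio
```

The gap between the two totals shows why this single parameter explains such a large share of the overall disagreement.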
1.2 Conversion between measures, data sources, and units
The conversion between depression and life-satisfaction (LS) scores ties with household spillovers in terms of importance for explaining our disagreements about the effectiveness of psychotherapy. We’ve previously assumed that a one standard deviation (SD) decrease in depression symptoms (or affective mental health; MHa) is equivalent to a one SD improvement in life-satisfaction or happiness (i.e., a 1:1 conversion), see here for our previous discussion and rationale.
GiveWell has two concerns with this:
Depression and life-satisfaction measures might not be sufficiently empirically or conceptually related to justify a 1:1 conversion. Because of this, they apply an empirically based 10% discount.
They are concerned that recipients of psychotherapy have a smaller variance in subjective wellbeing (SWB) than general populations (e.g., cash transfers), which leads to inflated effect sizes. They apply a 20% subjective discount to account for this.
Hence, GiveWell applied a 30% discount in total (see Table 3 below).
Table 3: HLI and GiveWell’s views on converting between SDs of depression and life satisfaction

| Variable | HLI | GiveWell | Explains how much of the difference in SM’s effect |
|---|---|---|---|
| Conversion from depression to LS | 1 to 1 | 1 to 0.7 | 3 WELLBYs (34% of total gap) |
Overall, I agree that there are empirical reasons for including a discount in this domain, but I’m unsure of its magnitude. I think it will likely be smaller than GiveWell’s 30% discount.
1.2.1 Differences between the two measures
First, GiveWell mentions a previous estimate of ours suggesting that mental health (MH) treatments[8] impact depression 11% more than SWB. Our original calculation used a naive average, but on reflection, it seems more appropriate to use a sample-size-weighted average (because of the large differences in sample sizes between studies), which implies that depression measures overestimate the effect on SWB by 4%, rather than 11%.
Results on depression and happiness measures are also very close in Bhat et al. (2022; n = 589), the only study I’ve found so far that looks at the effects of psychotherapy on both types of measure. We can standardise the effects in two ways. Depending on the method, the SWB effects are 18% larger or 1% smaller than the MHa effects[9]. Thus, the effects of psychotherapy on depression appear to be of similar size to the effects on SWB. Given these results, I think the discount due to empirical differences could be smaller than 10%; I would guess 3%.
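The first of the two standardisations described in footnote [9], the stretch transformation, can be reproduced from the raw Bhat et al. (2022) figures (happiness +0.38 on a 1-10 scale; depression -0.97 on the 0-27 PHQ-9):

```python
# Stretch transformation: rescale the PHQ-9 effect (0-27 scale) onto a
# 1-10 scale so it is directly comparable with the happiness effect.
happiness_effect = 0.38   # points on a 1-10 happiness scale
depression_effect = 0.97  # points reduced on the 0-27 PHQ-9

stretched = depression_effect * (10 - 1) / (27 - 0)  # ~0.32 on a 1-10 scale
ratio = happiness_effect / stretched                 # ~1.18, i.e. SWB ~18% larger

print(round(stretched, 2), round((ratio - 1) * 100))
```

The second route in the footnote, comparing Cohen’s d values (0.167 vs. 0.165), additionally requires the study’s standard deviations, so it is not reproduced here.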
Another part of this is that depression and life satisfaction are not the same concept. So if the scores are different, there is a further moral question about which deserves more weight. The HLI ‘house view’, as our name indicates, favours happiness (how good/bad we feel) as what matters. Further, we suspect that measures of depression are conceptually closer to happiness than measures of life satisfaction are. Hence, if push came to shove, and there is a difference, we’d care more about the depression scores, so no discount would be justified. From our conversation with Alex, we understand that the GiveWell ‘house view’ is to care more about life satisfaction than happiness. In this case, GiveWell would be correct, by their lights, to apply some reduction here.
1.2.2 Differences in variance
In addition to their conversion discount, GiveWell adds another 20% discount because they think a sample of people with depression has a smaller variance in life-satisfaction scores.[10] Setting aside the technical question of why differences in variance matter, I investigated, using a few datasets, whether the SDs of life satisfaction are lower when you screen for baseline depression. I found that, if anything, the SDs are larger by 4% (see Table 4 below). Although I see the rationale behind GiveWell’s speculation, the evidence I’ve looked at suggests a different conclusion.
Table 4: Life-satisfaction SD depending on clinical mental health cutoff

| Dataset | LS SD for general pop. | LS SD for dep. pop. | SWB SD change (gen → dep) | SWB measure |
|---|---|---|---|---|
| BHPS (UK, n = 7,310) | 1.23 | 1.30 | 106% | LS 1-10 |
| HILDA (AUS, n = 4,984) | 1.65 | 1.88 | 114% | LS 0-10 |
| NIDS (SA, n = 18,039) | 2.43 | 2.38 | 98% | LS 1-10 |
| Haushofer et al. 2016 (KE, n = 1,336) | 1.02 | 1.04 | 102% | LS (z-score) |
| Average | 1.58 | 1.65 | 104% | |
Note: BHPS = The British Household Panel Survey, HILDA = The Household Income and Labour Dynamics Survey, NIDS = National Income Dynamics Study. LS = life satisfaction, dep = depression.
However, I’m separately concerned that SD changes in trials where recipients are selected based on depression (i.e., psychotherapy) are inflated compared to trials without such selection (i.e., cash transfers)[11].
Overall, I think I agree with GiveWell that there should be a discount here that HLI doesn’t implement, but I’m unsure of its magnitude, and I think that it’d be smaller than GiveWell’s. More data could likely be collected on these topics, particularly how much effect sizes in practice differ between life-satisfaction and depression, to reduce our uncertainty.
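One piece of the conversion arithmetic mentioned in the footnotes (converting SD-years to WELLBYs via the SD of life satisfaction, and the effect of moving from an SD of 2.2 to 2.5) can be sketched as follows; the 1.0 SD-year effect is illustrative:

```python
# Converting an effect in SD-years to WELLBYs: multiply by the SD of
# life satisfaction in the relevant population. The 1.0 SD-year effect
# is a hypothetical placeholder.
effect_sd_years = 1.0
old_sd, new_sd = 2.2, 2.5  # previous mixed estimate vs. World Happiness Report data

old_wellbys = effect_sd_years * old_sd
new_wellbys = effect_sd_years * new_sd
increase = new_sd / old_sd - 1  # ~0.14, the 14% underestimate mentioned in footnote [11]

print(round(increase * 100))
```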
1.3 Loss of effectiveness outside trials and at scale
GiveWell explains their concern, summarised in the table below:
“Our general expectation is that programs implemented as part of randomized trials are higher quality than similar programs implemented at scale. [...] For example, HLI notes that StrongMinds uses a reduced number of sessions and slightly reduced training, compared to Bolton (2003), which its program is based on. We think this type of modification could reduce program effectiveness relative to what is found in trials. [...] We can also see some evidence for lower effects in larger trials…”
Table 5: HLI and GiveWell’s views on an adjustment for StrongMinds losing effectiveness at scale

| Variable | HLI | GiveWell | Explains how much of the difference in SM’s effect |
|---|---|---|---|
| Loss of effect at scale discount | 0% | 25% | 0.9 WELLBYs (10.1% of total gap) |
While GiveWell provides several compelling reasons why StrongMinds’ efficacy will decrease as it scales, I can’t find the justification GiveWell gives for why these reasons result in a 25% discount. It seems like a subjective judgement informed by some empirical factors and perhaps by previous experience studying this issue (e.g., cases like No Lean Season). Is there any quantitative evidence suggesting that when RCT interventions scale they drop 25% in effectiveness? While GiveWell also mentions that larger psychotherapy trials have smaller effects, I assume this is driven by publication bias (discussed in Section 1.6). I’m also less sure that scaling has no offsetting benefits. I would be surprised if, when RCTs are run, the intervention had all of its kinks ironed out. In fact, there are many cases of the RCT version of an intervention being the “minimum viable product” (Karlan et al., 2016). While I think a discount here is plausible, I’m very unsure of its magnitude.
In our updated meta-analysis, we plan to do a deeper analysis of the effects of expertise and time spent in therapy, and to use this to better predict the effect of StrongMinds. We’re awaiting the results from Baird et al., which should better reflect StrongMinds’ new strategy, as StrongMinds trained the facilitators but did not directly deliver the programme.
1.4 Disagreements on the durability of psychotherapy
GiveWell explains their concern, which is summarised in the table below: “We do think it’s plausible that lay-person-delivered therapy programs can have persistent long-term effects, based on recent trials by Bhat et al. 2022 and Baranov et al. 2020. However, we’re somewhat skeptical of HLI’s estimate, given that it seems unlikely to us that a time-limited course of group therapy (4-8 weeks) would have such persistent effects. We also guess that some of the factors that cause StrongMinds’ program to be less effective than programs studied in trials (see above) could also limit how long the benefits of the program endure. As a result, we apply an 80% adjustment factor to HLI’s estimates. We view this adjustment as highly speculative, though, and think it’s possible we could update our view with more work.”
Table 6: HLI and GiveWell’s views on a discount to account for a decrease in durability

| Variable | HLI | GiveWell | Explains how much of the difference in SM’s effect |
|---|---|---|---|
| Decrease in durability | 0% | 20% | 0.9 WELLBYs (10.1% of total gap) |
Since this disagreement appears mainly based on reasoning, I’ll explain why my intuitions, and my interpretation of the data, differ from GiveWell’s here. We already assume that StrongMinds’ effect decays 4% more each year than psychotherapy in general (see Table 3). Baranov et al. (2020) and Bhat et al. (2022) both find long-term effects that are greater than what our general model predicts. This means we already assume a higher decay rate in general, and especially for StrongMinds, than the two best long-term studies of psychotherapy suggest. I show how these studies compare to our model in Figure 3 below.
Figure 3: Effects of our model over time, and the only long-term psychotherapy studies in LMICs
Edit: I updated the figure to add the StrongMinds model, which starts with a higher effect but has a faster decay.
Baranov et al. (2020, 16 intended sessions) and Bhat et al. (2022, 6-14 intended sessions, with a 70% completion rate) were both time-limited. StrongMinds historically used 12 sessions (it may be 8 now) of 90 minutes[12]. Therefore, our model is more conservative than the Baranov et al. result, and closer to Bhat et al., which has a similar range of sessions. Another point in favour of the durability of StrongMinds’ effects, which I mentioned in McGuire et al. (2021), is that 78% of groups continued meeting on their own at least six months after the programme formally ended.
Bhat et al. (2022) is also notable in another regard: they asked ~200 experts to predict the impact of the intervention after 4.5 years. The median prediction underestimated the effectiveness by nearly a third, which makes me inclined to weigh expert priors less here[13].
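To illustrate how a few extra percentage points of annual decay compound over time, here is a sketch with hypothetical numbers (a 1.0 SD initial effect and decay rates of 10% vs. 14% per year; these are not our model’s actual parameters):

```python
# Illustrative sketch of geometric decay of a treatment effect over time.
# The initial effect (1.0 SD) and decay rates (10% vs. 14%/year) are
# hypothetical, chosen only to show how the 4-point difference compounds.
def total_effect(initial, annual_decay, years=10):
    """Sum of the per-year effect under constant geometric decay."""
    return sum(initial * (1 - annual_decay) ** t for t in range(years))

general = total_effect(1.0, 0.10)      # slower decay (general psychotherapy)
strongminds = total_effect(1.0, 0.14)  # 4 points faster, as we assume for StrongMinds

print(round(general, 2), round(strongminds, 2), round(strongminds / general, 2))
```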
Additionally, there seems to be some double-counting in GiveWell’s adjustments. The initial effect is adjusted by 0.75 for “lower effectiveness at scale and outside of trial contexts”, and the duration is adjusted by 0.80, also for “lower effectiveness at scale and outside of trial contexts”. Combined, this is a 0.60 adjustment instead of a single 0.80 adjustment. I feel like one concern should show up as one discount.
1.5 Response bias

GiveWell explains their concern, which is summarised in the table below: “One major concern we have with these studies is that participants might report a lower level of depression after the intervention because they believe that is what the experimenter wants to see [...] HLI responded to this criticism [section 4.4] and noted that studies that try to assess experimenter-demand effects typically find small effects. [...] We’re not sure these tests would resolve this bias so we still include a downward adjustment (80% adjustment factor).”
Table 7: HLI and GiveWell’s views on a discount for social desirability bias

| Variable | HLI | GiveWell | Explains how much of the difference in SM’s effect |
|---|---|---|---|
| Social desirability bias discount | 0% | 20% | 0.5 WELLBYs (5.1% of total gap) |
Participants might report bigger effects to be agreeable with the researchers (socially driven bias) or in the hopes of future rewards (cognitively driven bias; Bandiera et al., 2018), especially if they recognise the people delivering the survey to be the same people delivering the intervention[15].
But while I also worry about this issue, I am less concerned than GiveWell that response bias poses a unique threat to psychotherapy. If this bias exists, it seems likely to apply to all RCTs of interventions with self-reported outcomes (and without active controls). So I think the relevant question is why the propensity for response bias might differ between cash transfers and psychotherapy. Here are some possibilities:
It seems potentially more obvious that psychotherapy should alleviate depression than cash transfers should increase happiness. If so, questions about self-reported wellbeing may be more subject to bias in psychotherapy trials[16].
We could expect that the later the follow-up, the less salient the intervention is, and the less likely respondents are to be biased in this way (Park & Kumar, 2022). This could favour cash transfers, because they have relatively longer follow-ups than psychotherapy.
However, it is obvious to cash transfer participants whether they are in the treatment (they receive cash) or control conditions (they get nothing). This seems less true in psychotherapy trials where there are often active controls.
GiveWell responded to the previous evidence I cited (McGuire & Plant, 2021, Section 4.4)[17] by arguing that the tests run in the literature, which investigate the effect of the general propensity towards socially desirable responding or of the surveyor’s stated expectations, are not relevant because: “If the surveyor told them they expected the program to worsen their mental health or improve their mental health, it seems unlikely to overturn whatever belief they had about the program’s expected effect that was formed during their group therapy sessions.” But if participants’ views about an intervention are unlikely to be overturned by what the surveyor seems to want (when what the surveyor wants and the participant’s experience differ), then that’s a reason to be less concerned about socially motivated response bias in general.
However, I am more concerned with socially desirable responses driven by cognitive factors. Bandiera et al. (2018, p. 25) is the only study I found to discuss the issue, but they do not seem to think this was an issue with their trial: “Cognitive drivers could be present if adolescent girls believe providing desirable responses will improve their chances to access other BRAC programs (e.g. credit). If so, we might expect such effects to be greater for participants from lower socioeconomic backgrounds or those in rural areas. However, this implication runs counter to the evidence in Table A5, where we documented relatively homogenous impacts across indices and time periods, between rich/poor and rural/urban households.”
I agree with GiveWell that more research would be very useful, and could potentially update my views considerably, particularly with respect to the possibility of cognitively driven response bias in RCTs deployed in low-income contexts.
1.6 Publication bias
GiveWell explains their concern, which we summarise in the table below: “HLI’s analysis includes a roughly 10% downward adjustment for publication bias in the therapy literature relative to cash transfers literature. We have not explored this in depth but guess we would apply a steeper adjustment factor for publication bias in therapy relative to our top charities. After publishing its cost-effectiveness analysis, HLI published a funnel plot showing a high level of publication bias, with well-powered studies finding smaller effects than less-well-powered studies. This is qualitatively consistent with a recent meta-analysis of therapy finding a publication bias of 25%.”
Table 8: HLI and GiveWell’s views on a publication bias discount

| Variable | HLI | GiveWell | Explains how much of the difference in SM’s effect |
|---|---|---|---|
| Publication bias discount | 11% | 15% | 0.39 WELLBYs (4.5% of total gap) |
After some recent criticism, we have revisited this issue and are working on estimating the bias empirically. Publication bias seems like a real issue, and a 10-25% correction like the one GiveWell suggests seems plausible, but we’re unsure about the magnitude as our research is ongoing. In our update of our psychotherapy meta-analysis, we plan to employ a more sophisticated quantitative approach to adjusting for publication bias.
1.7 Household size
GiveWell explains their concern, which we summarise in the table below: “HLI estimates household size using data from the Global Data Lab and UN Population Division. They estimate a household size of 5.9 in Uganda based on these data, which appears to be driven by high estimates for rural household size in the Global Data Lab data, which estimate a household size of 6.3 in rural areas in 2019. A recent Uganda National Household Survey, on the other hand, estimates a household size of 4.8 in rural areas. We’re not sure what’s driving differences in estimates across these surveys, but our best guess is that household size is smaller than the 5.9 estimate HLI is using.”
Table 9: HLI and GiveWell’s views on household size of StrongMinds’ recipients

| Variable | HLI | GiveWell | Explains how much of the difference in SM’s effect |
|---|---|---|---|
| Household size for StrongMinds | 5.9 | 4.8 | 0.39 WELLBYs (4.5% of total gap) |
I think the figures GiveWell cites are reasonable. I favour using international datasets because I assume this means greater comparability between countries, but I don’t feel strongly about this. I agree it could be easy and useful to try to understand StrongMinds recipients’ household sizes more directly. We will revisit this in our StrongMinds update.
1.8 Cost per person treated by StrongMinds
The one element where we differ that makes StrongMinds look more favourable is cost. As GiveWell explains: “HLI’s most recent analysis includes a cost of $170 per person treated by StrongMinds, but StrongMinds cited a 2022 figure of $105 in a recent blog post”.
Table 10: HLI and GiveWell’s views on cost per person for StrongMinds’ treatment

| Variable | HLI | GiveWell | % of total gap explained |
|---|---|---|---|
| Cost per person of StrongMinds | $170 | $105 | -75% |
According to their most recent quarterly report, a cost per person of $105 was the goal, but they report $74 per person for 2022[18]. We agree this is a more accurate and current figure, and the cost might well be lower now. A concern is that the reduction in costs comes at the expense of treatment fidelity, an issue we will review in our updated analysis.
2. GiveWell’s cost-effectiveness estimate of AMF is dependent on philosophical views
GiveWell estimates that AMF produces 70 WELLBYs per $1000[19], which would be 4 times better than StrongMinds. GiveWell described the philosophical assumptions of their life-saving analysis as: “...Under the deprivationist framework and assuming a “neutral point” of 0.5 life satisfaction points. [...] we think this is what we would use and it seems closest to our current moral weights, which use a combination of deprivationism and time-relative interest account.”
Hence, they conclude that AMF produces 70 WELLBYs per $1000, which makes StrongMinds 0.24 times as cost-effective as AMF. However, the position they take is nearly the most favourable one can take towards interventions that save lives[20]. There are other plausible views about the neutral point and the badness of death (we discuss these in Plant et al., 2022). Indeed, assigning credence to higher neutral points[21] or to alternative philosophical views of death’s badness will reduce the cost-effectiveness of AMF relative to StrongMinds (see Figure 4). In some cases, AMF is less cost-effective than GiveWell’s estimate of StrongMinds[22].
Figure 4: Cost-effectiveness of charities under different philosophical assumptions (with updated StrongMinds value, and GiveWell’s estimate for StrongMinds)
To be clear, HLI does not (yet) take a stance on these different philosophical views. While I present some of my views here, these do not represent HLI as a whole.
Personally, I’d use a neutral point closer to 2 out of 10[23]. Regarding the philosophy, I think my credences would be close to uniformly distributed across the Epicurean, TRIA, and deprivationist views. If I plug this view into our model introduced in Plant et al. (2022) then this would result in a cost-effectiveness for AMF of 29 WELLBYs per $1000 (rather than 81 WELLBYs per $1000)[24], which is about half as good as the 62 WELLBYs per $1000 for StrongMinds. If GiveWell held these views, then AMF would fall within GiveWell’s pessimistic and optimistic estimates of 3-57 WELLBYs per $1000 for StrongMinds’ cost-effectiveness. For AMF to fall above this range, you need to (A) put almost all your credence in deprivationism and (B) have a neutral point lower than 2[25].
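The sensitivity described above can be illustrated with a stylised deprivationist calculation; this is not GiveWell’s or HLI’s actual model, and the life-satisfaction level (4.5/10) and remaining life-years (60) are hypothetical:

```python
# Stylised deprivationist calculation: the WELLBY value of averting a death
# is (average life satisfaction - neutral point) * remaining life-years.
# The 4.5/10 satisfaction level and 60 remaining years are hypothetical.
def wellbys_per_death_averted(avg_ls, neutral_point, years_remaining):
    return (avg_ls - neutral_point) * years_remaining

low_np = wellbys_per_death_averted(4.5, 0.5, 60)   # neutral point of 0.5 (GiveWell-style)
high_np = wellbys_per_death_averted(4.5, 2.0, 60)  # neutral point of 2

print(low_np, high_np, round(high_np / low_np, 3))
```

Raising the neutral point from 0.5 to 2 cuts the value of averting a death by over a third in this sketch, which is why the choice of neutral point moves the AMF comparison so much.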
Coincidentally, this is (barely) within our most recent confidence interval for comparing the cost-effectiveness of StrongMinds to GiveDirectly (95% CI: 2, 100).
This calculation is based on a correction for a mistake in our spillover ratio discussed here (a spillover ratio of 38% instead of 53%). Our previous estimate was 77 WELLBYs per $1000 (Plant et al., 2022; McGuire et al., 2022).
Coincidentally, this is (barely) within our most recent confidence interval for comparing the cost-effectiveness of StrongMinds to GiveDirectly (95% CI: 2, 100).
These are positive psychology interventions (like mindfulness and forgiveness therapy) which might not completely generalise to psychotherapy in LMICs.
Psychotherapy improved happiness by 0.38 on a 1-10 score and reduced depression by 0.97 (on the PHQ-9’s 0-27 scale). If we convert the depression score to a 1-10 scale, using stretch transformation, then the effect is a reduction in depression of 0.32. Hence, the SWB changes are 18% larger than MHa changes. If we convert both results to Cohen’s d, we find a Cohen’s d of 0.167 for depression and a Cohen’s d of 0.165 for happiness. Hence changes in MHa are 1% greater than SWB.
“it seems likely that SD in life satisfaction score is lower among StrongMinds recipients, who are screened for depression at baseline and therefore may be more concentrated at the lower end of the life satisfaction score distribution than the average individual.”
Sample selection based on depression (i.e., selection based on the outcome used) could shrink the variance of depression scores in the sample, which would inflate standardised effect sizes of depression compared to trials without depression selection, because standardisation occurs by dividing the raw effect by its standard deviation (i.e., standardised mean differences, such as Cohen’s d). To explore this, I used the datasets mentioned in Table 4, all of which also included measures of depression or distress, and the data from Barker et al. (2022, n = 11,835). I found that the SD of depression for those with clinically significant depression was 18 to 21% smaller than it was for the general sample (both the mentally ill and healthy). This seems to indicate that psychotherapy trials provide inflated SD changes in depression compared to cash transfers, due to smaller SDs of depression. However, I think this may be offset by another technical adjustment. Our estimate of the life-satisfaction SD we use to convert SD changes (in MHa or SWB) to WELLBYs might be larger, which means the effects of psychotherapy and cash transfers are underestimated by 14% compared to AMF. When we convert from SD-years to WELLBYs, we’ve used a mix of LMIC and HIC sources to estimate the general SD of LS. But I realised that there’s a version of the World Happiness Report that published data including the SDs of LS for many LMICs. If we use this more direct data for Sub-Saharan countries, then it suggests a higher SD of LS than what I previously estimated (2.5 instead of 2.2, according to a crude estimate), a 14% increase.
Note, I was one of the predictors, and my guess was in line with the crowd (~0.05 SDs), and you can’t see others’ predictions beforehand on the Social Science Prediction Platform.
Note, this is more about ‘experimenter demand effects’ (i.e., being influenced by the experimenters in a certain direction, because that’s what they want to find) than ‘social desirability bias’ (i.e., answering that one is happier than one is because it looks better). The latter is controlled for in an RCT. We keep the wording used by GW here.
GiveWell puts it in the form of this scenario “If a motivated and pleasant IPT facilitator comes to your village and is trying to help you to improve your mental health, you may feel some pressure to report that the program has worked to reward the effort that facilitator has put into helping you.” But these situations are why most implementers in RCTs aren’t the surveyors. I’d be concerned if there were more instances of implementers acting as surveyors in psychotherapy than cash transfer studies.
On the other hand, who in poverty expects cash transfers to bring them misery? That seems about as rare (or rarer) as those who think psychotherapy will deepen their suffering. However, I think the point is about what participants think that implementers most desire.
Since then, I did some more digging. I found Dhar et al. (2018) and Islam et al. (2022), which use a questionnaire to test for the propensity to answer questions in a socially desirable manner and find similarly small socially motivated response biases. Park et al. (2022) take an alternative approach: they randomise a subset of participants to self-survey, and argue that this does not change the results.
People might hold that the neutral point is higher than 0.5 (on a 0–10 scale), which would reduce the cost-effectiveness of AMF. The IDinsight survey GiveWell uses covers people from Kenya and Ghana but has a small sample (n = 70) for its neutrality question. In our pilot report (n = 79; UK sample; Samuelsson et al., 2023), we find a neutral point of 1.3. See Samuelsson et al. (2023; Sections 1.3 and 6) for a review of the different findings in the literature and more detail on our findings. Recent unpublished work by Julian Jamison finds a neutral point of 2.5 on a sample of ~1,800 drawn from the USA, Brazil, and China. Note that, in all these cases, we recommend caution in concluding that any of these values is the neutral point. There is still more work to be done.
Under GiveWell’s analysis, there are still some combinations of philosophical factors where AMF produces 17 WELLBYs or less (i.e., is as or less good than SM in GiveWell’s analysis): (1) An Epicurean view, (2) Deprivationism with neutral points above 4, and (3) TRIA with high ages of connectivity and neutral points above 3 or 4 (depending on the combination). This does not include the possibility of distributing credences across different views.
I would put the most weight on the work by HLI and by Jamison and colleagues, mentioned above, which finds neutral points of 1.3/10 and 2.5/10, respectively.
We acknowledge that many people may hold these views. We also want to highlight that many people may hold other views. We encourage more work investigating the neutral point and investigating the extent to which these philosophical views are held.
Zooming out a little: is it your view that group therapy increases happiness by more than the death of your child decreases it? (GiveWell is saying that this is what your analysis implies.)
HLI’s estimates imply, for example, that a donor would pick offering StrongMinds’ intervention to 20 individuals over averting the death of a child, and that receiving StrongMinds’ program is 80% as good for the recipient as an additional year of healthy life.
I.e., is it your view that 4-8 weeks of group therapy (~12 hours) for 20 people is preferable to averting the death of a child?
To be clear on what the numbers are: we estimate that group psychotherapy has an effect of 10.5 WELLBYs on the recipient’s household, and that the death of a child in a LIC has a −7.3 WELLBY effect on the bereaved household. But the estimate for grief was very shallow. The report this estimate came from was not focused on making a cost-effectiveness estimate of saving a life (with AMF). Again, I know this sounds weasel-y, but we haven’t yet formed a view on the goodness of saving a life, so I can’t say how much group therapy HLI thinks is preferable to averting the death of a child.
That being said, I’ll explain why this comparison, as it stands, doesn’t immediately strike me as absurd. Grief has an odd counterfactual. We can only extend lives. People who’re saved will still die, and the people who love them will still grieve. The question is how much worse the total grief is for a very young child (the typical beneficiary of, e.g., AMF) than the grief for the adolescent, young adult, adult, or elder they’d become[1], all multiplied by mortality risk at those ages.
So is psychotherapy better than the counterfactual grief averted? Again, I’m not sure because the grief estimates are quite shallow, but the comparison seems less absurd to me when I hold the counterfactual in mind.
I assume people who are not very young children also have larger social networks, and that this could also play into the counterfactual (e.g., non-children may be grieved for by more people who forged deeper bonds). But I’m not sure how much to make of this point.
this comparison, as it stands, doesn’t immediately strike me as absurd. Grief has an odd counterfactual. We can only extend lives. People who’re saved will still die and the people who love them will still grieve. The question is how much worse the total grief is for a very young child (the typical beneficiary of e.g., AMF) than the grief for the adolescent, or a young adult, or an adult, or elder they’d become
My intuition, which is shared by many, is that the badness of a child’s death is not merely due to the grief of those around them. So presumably the question should not be comparing just the counterfactual grief of losing a very young child VS an [older adult], but also “lost wellbeing” from living a net-positive-wellbeing life in expectation?
I also just saw that Alex claims HLI “estimates that StrongMinds causes a gain of 13 WELLBYs”. Is this for 1 person going through StrongMinds (i.e. ~12 hours of group therapy), or something else? Where does the 13 WELLBYs come from?
I ask because if we are using HLI’s estimates of WELLBYs per death averted, and use your preferred estimate for the neutral point, then 13 / (4.95-2) is >4 years of life. Even if we put the neutral point at zero, this suggests 13 WELLBYs is worth >2.5 years of life.[1]
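The arithmetic in the question above can be checked directly. A hedged sketch, where 13 WELLBYs and the 4.95 average life satisfaction are figures quoted in this thread, and the two neutral points (2 and 0) are the ones under discussion:

```python
# Years-of-life equivalents for the 13-WELLBY figure discussed above.
# All inputs are figures quoted in this thread, not my own estimates.
AVG_LIFE_SATISFACTION = 4.95  # average life satisfaction, 0-10 scale

def years_of_life_equivalent(wellbys, neutral_point):
    """Divide WELLBYs by the annual wellbeing gain of an average life."""
    return wellbys / (AVG_LIFE_SATISFACTION - neutral_point)

print(round(years_of_life_equivalent(13.0, 2.0), 1))  # 4.4 -> ">4 years"
print(round(years_of_life_equivalent(13.0, 0.0), 1))  # 2.6 -> ">2.5 years"
```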
I think I’m misunderstanding something here, because GiveWell claims “HLI’s estimates imply that receiving IPT-G is roughly 40% as valuable as an additional year of life per year of benefit or 80% of the value of an additional year of life total.”
Can you help me disambiguate this? Apologies for the confusion.
My intuition, which is shared by many, is that the badness of a child’s death is not merely due to the grief of those around them. Thus the question should not be comparing just the counterfactual grief of losing a very young child VS an [older adult], but also “lost wellbeing” from living a net-positive-wellbeing life in expectation.
I didn’t mean to imply that the badness of a child’s death is just due to grief. As I said in my main comment, I place substantial credence (2/3rds) in the view that death’s badness is the wellbeing lost. Again, this is my view, not HLI’s.
The 13 WELLBY figure is the household effect of a single person being treated by StrongMinds. But that uses the uncorrected household spillover (53% spillover rate). With the correction (38% spillover) it’d be 10.5 WELLBYs (3.7 WELLBYs for recipient + 6.8 for household).
GiveWell arrives at the figure of 80% by valuing a year of life at 4.45 WELLBYs (4.95 − 0.5, using their preferred neutral point); StrongMinds’ benefit to the direct recipient, according to HLI, is 3.77 WELLBYs, so 3.77 / 4.45 ≈ 0.85, or roughly 80%. I’m not sure where the 40% figure comes from.
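For concreteness, the calculation just described (note that 4.95 − 0.5 = 4.45, so that is the WELLBY value of a life year used here; the other inputs are the ones quoted in this thread, and the rounding step is my assumption):

```python
# Reproducing the "~80%" figure described above. Inputs are quoted in
# this thread; exactly how GiveWell rounded is my assumption.
avg_life_satisfaction = 4.95   # 0-10 scale
neutral_point = 0.5            # GiveWell's preferred neutral point
wellbys_per_life_year = avg_life_satisfaction - neutral_point  # 4.45
recipient_benefit = 3.77       # HLI's WELLBY estimate, direct recipient only

ratio = recipient_benefit / wellbys_per_life_year
print(round(ratio, 2))  # 0.85, which GiveWell reports as roughly 80%
```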
If I understand correctly, the updated figures should then be:
For 1 person being treated by StrongMinds (excluding all household spillover effects) to be worth the WELLBYs gained for a year of life[1] with HLI’s methodology, the neutral point needs to be at least 4.95-3.77 = 1.18.
If we include spillover effects of StrongMinds (and use the updated / lower figures), then the benefit of 1 person going through StrongMinds is 10.7 WELLBYs.[2] Under HLI’s estimates, this is equivalent to more than two years of wellbeing benefits from the average life, even if we set the neutral point at zero. Using your personal neutral point of 2 would suggest the intervention for 1 person including spillovers is equivalent to >3.5 years of wellbeing benefits. Is this correct or am I missing something here?
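Spelling out the spillover-inclusive arithmetic in the comment above (10.7 WELLBYs and 4.95 average life satisfaction are the figures used in this thread; the two neutral points are the ones under discussion):

```python
# Years-of-life equivalents for the 10.7-WELLBY spillover-inclusive
# figure used above. Inputs are figures quoted in this thread.
avg_life_satisfaction = 4.95
wellbys_with_spillovers = 10.7

years_neutral_0 = wellbys_with_spillovers / (avg_life_satisfaction - 0.0)
years_neutral_2 = wellbys_with_spillovers / (avg_life_satisfaction - 2.0)
print(round(years_neutral_0, 2))  # 2.16 -> "more than two years"
print(round(years_neutral_2, 2))  # 3.63 -> ">3.5 years"
```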
1.18 as the neutral point seems pretty reasonable, though the idea that 12 hours of therapy for an individual is worth the wellbeing benefits of one year of an average life when considering only the impacts on them, and anywhere between 2 and 3.5 years of life when including spillovers, does seem rather unintuitive to me, despite my view that we should probably do more work on subjective wellbeing measures on the margin. I’m not sure if this means:
WELLBYs as a measure can’t capture what I care about in a year of healthy life, so we should not use WELLBYs alone when measuring wellbeing;
HLI isn’t applying WELLBYs in a way that captures the benefits of a healthy life;
The existing way of estimating 1 year of life via WELLBYs is wrong in some other way (e.g. the 4.95 assumption is wrong, the 0-10 scale is wrong, the ~1.18 neutral point is wrong);
HLI have overestimated the benefits of StrongMinds;
I have a very poorly calibrated view of how much 12 hours of therapy / a year of life is worth, though this seems less likely.
Would be interested in your thoughts on this / let me know if I’ve misinterpreted anything!
I appreciate your candid response. To clarify further: suppose you give a mother a choice between “your child dies now (age 5), but you get group therapy” and “your child dies in 60 years (age 65), but no group therapy”. Which do you think she will choose?
Also, if you don’t mind answering: do you have children? (I have a hypothesis that EA values are distorted by the lack of parents in the community; I don’t know how to test this hypothesis. I hope my question does not come off as rude.)
I don’t think that’s the right question for three reasons.
First, the hypothetical mother will almost certainly consider the wellbeing of her child (under a deprivationist framework) in making that decision—no one is suggesting that saving a life is less valuable than therapy under such an approach. Whatever the merits of an Epicurean view that doesn’t weigh lost years of life, we wouldn’t have lasted long as a species if parents applied that logic to their own young children.
Second, the hypothetical mother would have to live with the guilt of knowing she could have saved her child but chose something for herself.
Finally, GiveWell-type recommendations often would fail the same sort of test. Many beneficiaries would choose receiving $8X (where X = bednet cost) over receiving a bednet, even where GiveWell thinks they would be better off with the latter.
If the mother would rather have her child alive, then under what definition of happiness/utility do you conclude she would be happier with her child dead (but getting therapy)? I understand you’re trying to factor out the utility loss of the child; so am I. But just from the mother’s perspective alone: she prefers scenario X to scenario Y, and you’re saying it doesn’t count for some reason? I don’t get it.
I think you’re double-subtracting the utility of the child: you’re saying, let’s factor it out by not asking the child his preference, and ALSO let’s ADDITIONALLY factor it out by not letting the mother be sad about the child not getting his preference. But the latter is a fact about the mother’s happiness, not the child’s.
Second, the hypothetical mother would have to live with the guilt of knowing she could have saved her child but chose something for herself.
Let’s add memory loss to the scenario, so she doesn’t remember making the decision.
Finally, GiveWell-type recommendations often would fail the same sort of test. Many beneficiaries would choose receiving $8X (where X = bednet cost) over receiving a bednet, even where GiveWell thinks they would be better off with the latter.
Yes, and GiveWell is very clear about this, and most donors bite the bullet (people make irrational decisions with regard to small risks of death, and also, bednets have positive externalities for the rest of the community). Do you bite the bullet that says “the mother doesn’t know enough about her own happiness; she’d be happier with therapy than with a living child”?
Finally, I do hope you’ll answer regarding whether you have children. Thanks again.
I’m not Joel (nor do I work for HLI, GiveWell, SM, or any similar organization). I do have a child, though. And I do have concerns with overemphasis on whether one is a parent, especially when one’s views are based (in at least significant part) on review of the relevant academic literature. Otherwise, does one need both to be a parent and to have experienced a severe depressive episode (particularly in a low-resource context where there is likely no safety net) in order to judge the tradeoffs between supporting AMF and supporting SM?
Personally—I am skeptical that the positive effect of therapy exceeds the negative effect of losing one’s young child on a parent’s own well-being. I just don’t think the thought experiment you proposed is a good way to cross-check the plausibility of such a view. The consideration of the welfare of one’s child (independent of one’s own welfare) in making decisions is just too deeply rooted for me to think we can effectively excise it in a thought experiment.
In any event—given that SM can deliver many courses of therapy with the resources AMF needs to save one child, the two figures don’t need to be close if one believes the only benefit from AMF is the prevention of parental grief. SM’s effect size would only need to be greater than 1/X of the WELLBYs lost to parental grief from one child death, where X is the number of courses SM can deliver with the resources AMF needs to prevent one child death. That is the bullet that Epicurean donors have to bite to choose SM over AMF.
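To illustrate the 1/X threshold with numbers mentioned elsewhere in this thread (X ≈ 60 courses per life saved, and ≈ 7.3 WELLBYs of household grief; both are rough, contested figures used here purely for illustration):

```python
# Illustrative 1/X threshold for an Epicurean donor, per the comment
# above. Both inputs are rough figures quoted elsewhere in this thread.
grief_wellbys = 7.3     # shallow estimate of household grief per child death
courses_per_life = 60   # therapy courses deliverable per life-saving cost

threshold = grief_wellbys / courses_per_life
print(round(threshold, 2))  # ~0.12 WELLBYs per course for SM to beat AMF
```

On these illustrative numbers, SM would only need to deliver about 0.12 WELLBYs per course of therapy, far below any estimate discussed in this thread.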
Personally—I am skeptical that the positive effect of therapy exceeds the negative effect of losing one’s young child on a parent’s own well-being.
It’s good to hear you say this.
In any event—given that SM can deliver many courses of therapy with the resources AMF needs to save one child, the two figures don’t need to be close
Definitely true. But if a source (like a specific person or survey) gives me absurd numbers, it is a reason to dismiss it entirely. For example, if my thermometer tells me it’s 1000 degrees in my house, I’m going to throw it out. I’m not going to say “even if you merely believe it’s 90 degrees we should turn on the AC”. The exaggerated claim is disqualifying; it decreases the evidentiary value of the thermometer’s reading to zero.
When someone tells me that group therapy is more beneficial to the mother’s happiness than saving her child from death, I don’t need to listen to that person anymore. And if it’s a survey that tells me this, throw out the survey. If it’s some fancy academic methods and RCTs, the interesting question is where they went wrong, and someone should definitely investigate that, but at no point should people take it seriously.
By all means, let’s investigate how the thermometer possibly gave a reading of 1000 degrees. But until we diagnose the issue, it is NOT a good idea to use “1000 degrees in the house” in any decision-making process. Anyone who uses “it’s 1000 degrees in this room” as a placeholder value for making EA decisions is, in my view, someone who should never be trusted with any levers of power, as they cannot spot obvious errors that are staring them in the face.
We both think the ratio of parental grief WELLBYs to therapy WELLBYs is likely off, although that doesn’t tell us which number is wrong. Given that your argument is that an implausible ratio should tip HLI off that there’s a problem, the analysis below takes the view more favorable to HLI—that the parental grief number (for which much less work has been done) is at least the major cause of the ratio being off.
As I see it, the number of WELLBYs preserved by averting an episode of parental grief is very unlikely to be material to any decision under HLI’s cost-effectiveness model. Under philosophical assumptions where it is a major contributor to the cost-effectiveness estimate, that estimate is almost always going to be low enough that life-saving interventions won’t be considered cost-effective on the whole. Under philosophical assumptions where life-saving programs may be cost-effective, the bulk of the effectiveness will come directly from the effect on the saved life itself. Thus, it would not be unreasonable for HLI—which faces significant resource constraints—to have deprioritized attempts to improve the accuracy of its estimate for WELLBYs preserved by averting an episode of parental grief.
Given that, I can see three ways of dealing with parental grief in the cost-effectiveness model for AMF. Ignoring it seems rather problematic. And I would argue that reporting the value one’s relatively shallow research provided (with a disclaimer that one has low certainty in the value) is often more epistemically virtuous than adjusting to some value one thinks is more likely to be correct for intuitive reasons, bereft of actual evidence to support that number. I guess the other way is to just not publish anything until one can produce more precise models . . . but that norm would make it much more difficult to bring new and innovative ideas to the table.
I don’t think the thermometer analogy really holds here. Assuming HLI got a significantly wrong value for WELLBYs preserved by averting an episode of parental grief, there are a number of plausible explanations, the bulk of which would not justify not “listen[ing] to [them] anymore.” The relevant literature on grief could be poor quality or underdeveloped; HLI could have missed important data or modeled inadequately due to the resources it could afford to spend on the question; it could have made a technical error; its methodology could be ill-suited for studying parental grief; its methodology could be globally unsound; and doubtless other reasons. In other words, I wouldn’t pay attention to the specific thermometer that said it was much hotter than it was . . . but in most cases I would only update weakly against using other thermometers by the same manufacturer (charity evaluator), or distrusting thermometer technology in general (the WELLBY analysis).
Moreover, I suspect there have been, and will continue to be, malfunctioning thermometers at most of the major charity evaluators and major grantmakers. The grief figure is a non-critical value relating to an intervention that HLI isn’t recommending. For the most part, if an evaluator or grantmaker isn’t recommending or funding an organization, it isn’t going to release its cost-effectiveness model for that organization at all. Even where funding is recommended, there often isn’t the level of reasoning transparency that HLI provides. If we are going to derecognize people who have used malfunctioning thermometer values in any cost-effectiveness analysis, there may not be many people left to perform them.
I’ve criticized HLI on several occasions before, and I’m likely to find reasons to criticize it again at some point. But I think we want to encourage its willingness to release less-refined models for public scrutiny (as long as the limitations are appropriately acknowledged) and its commitment to reasoning transparency more generally. I am skeptical of any argument that would significantly incentivize organizations to keep their analyses close to the chest.
The most important thing to note here is that, if you dig through the various long reports, the tradeoff is:
With $7800 you can save the life of a child, or
If you grant HLI’s assumptions regarding costs (and I’m a bit skeptical even there), you can give a multi-week group therapy to 60 people for that same cost (I think 12 sessions of 90 min).
Which is better? Well, right off the bat, if you think mothers would value their children at 60x what they value the therapy sessions, you’ve already lost.
Of course, the child’s life also matters, not just the mother’s happiness. But HLI has a range of “assumptions” regarding how good a life is, and in many of these assumptions the life of the child is indeed fairly value-less compared to benefits in the welfare of the mother (because life is suffering and death is OK, basically).
All this is obfuscated under various levels of analysis. Moreover, in HLI’s median assumption, not only is the therapy more effective, it is 5x more effective. They are saying: the number of group therapies that equal the averted death of a child is not 60, but rather, 12.
To me that’s broken-thermometer level.
I know the EA community is full of broken thermometers, and it’s actually one of the reasons I do not like the community. One of my main criticisms of EA is, indeed, “you’re taking absurd numbers (generated by authors motivated to push their own charities/goals) at face value”. This also happens with animal welfare: there’s this long report and 10-part forum series evaluating animals’ welfare ranges, and it concludes that 1 human has the welfare range of (checks notes) 14 bees. Then others take that at face value and act as if a couple of beehives or shrimp farms are as important as a human city.
I am skeptical of any argument that would significantly incentivize organizations to keep their analyses close to the chest.
This is not the first time I’ve had this argument made to me when I criticize an EA charity. It seems almost like the default fallback. I think EA has the opposite problem, however: nobody ever dares to say the emperor has no clothes, and everyone goes around pretending 1 human is worth 14 bees and a group therapy session increases welfare by more than the death of your child decreases it.
I think it is possible to buy that humans’ maximal pains and pleasures are only 14 times as intense as bees’, and still think 14 bees = 1 human is silly. You just have to reject hedonism about well-being. I have strong feelings about saving humans over animals, but I have no intuition whatsoever that if my parents’ dog burns her paw it hurts less than when I burn my hand. The whole idea that animals have less intense sensations than us seems to me less like a commonsense claim, and more like something people committed to both hedonism and antispeciesism made up to reconcile their intuitive repugnance at results like 10 pigs (or whatever) = 1 human. (Bees are kind of a special case, because lots of people are confident they aren’t conscious at all.)
Where’s the evidence that, e.g., everyone “act[s] as if a couple of beehives or shrimp farms are as important as a human city”? So someone wrote a speculative report about bee welfare ranges . . . if “everyone” accepted that “1 human is worth 14 bees”—or even anything close to that—the funding and staffing pictures in EA would look very, very different. How many EAs are working in bee welfare, and how much is being spent in that area?
As I understand the data, EA resources in GH&D are pretty overwhelmingly in life-saving interventions like AMF, suggesting that the bulk of EA does not agree with HLI at present. I’m not as well versed in farmed animal welfare, but I’m pretty sure no one in that field is fundraising for interventions costing anywhere remotely near hundreds of dollars to save a bee and claiming they are effective.
In the end, reasoning transparency by charity evaluators helps donors make informed moral choices. Carefully reading analyses from various sources helps me (and other donors) make choices that are consistent with our own values. EA is well ahead of most charitable movements in explicitly acknowledging that trade-offs exist and at least attempting to reason about them. One can (and should) decline to donate where a charity’s treatment of tradeoffs isn’t convincing. As I’ve stated elsewhere on this post, I’m sticking with GiveWell-style interventions, at least for now.
Oh, I should definitely clarify: I find effective altruism the philosophy, as well as most effective altruists and their actions, to be very good and admirable. My gripe is with what I view as the “EA community”—primarily places like this forum, organizations such as the CEA, and participants in EA Global. The more central something is to EA-the-community, the less I like the ideas.
In my view, what happens is that there are a lot of EA-ish people donating to GiveWell charities, and that’s amazing. And then the EA movement comes and goes “but actually, you should really give the money to [something ineffective that’s also sometimes in the personal interest of the person speaking]” and some people get duped. So forums like this one serve to take money that would go to malaria nets, and try as hard as they can to redirect it to less effective charities.
So, to your questions: how many people are working towards bee welfare? Not many. But on this forum, it’s a common topic of discussion (often with things like nematodes instead of bees). I haven’t been to EA global, but I know where I’d place my bets for what receives attention there. Though honestly, both HLI and the animal welfare stuff is probably small potatoes compared to AI risk and meta-EA, two areas in which these dynamics play an even bigger role (and in which there are even more broken thermometers and conflicts of interest).
Yes. There is a large range of such numbers. I am not sure of the right tradeoff. I would intuitively expect a billion therapy sessions to be an overestimate (i.e. clearly more valuable than the life of a child), but I didn’t do any calculations. A thousand seems like an underestimate, but again who knows (I didn’t do any calculations). HLI is claiming (checks notes) ~12.
To flip the question: Do you think there’s a number you would reject for how many people treated with psychotherapy would be worth the death of one child, even if some seemingly-fancy analysis based on survey data backed it up? Do you ever look at the results of an analysis and go “this must be wrong,” or is that just something the community refuses to do on principle?
Thank you for this detailed and transparent response!
I applaud HLI for creating a chart (and now an R Shiny app) to show how philosophical views can affect the tradeoff between predominantly life-saving and predominantly life-enhancing interventions. However, one challenge with that approach is that almost any change to your CEA model will be outcome-changing for donors in some areas of that chart. [1]
For example, the 53% → 38% correction alone switched the recommendation for donors with a deprivationist framework who think the neutral point is over ~0.65 but under ~1.58. Given that GiveWell’s moral weights were significantly derived from donor preferences, and (0.5, deprivationism) is fairly implied by those donor weights, I think that correction shifted the recommendation from SM to AMF for a significant number of donors, even though it was only material to one of three philosophical approaches and about one point of neutral-point assumptions.
GiveWell reduced the WELLBY estimate from about 62 (based on the 38% figure) to about 17, a difference of about 45. If I’m simplifying your position correctly, for about half of those WELLBYs you disagree with GiveWell that an adjustment is appropriate. For about half of them, you believe a discount is likely appropriate, but think it is likely less than GiveWell modelled.
If we used GiveWell’s numbers for that half but HLI’s numbers otherwise, that split suggests that we’d end up with about 39.5 WELLBYs. So one way to turn your response into a donor-actionable statement would be to say that there is a zone of uncertainty between 39.5 and 62 WELLBYs. One might also guess that the heartland of that zone is between about 45 and 56.5 WELLBYs, reasoning that it is less likely that your discounts will be less than 25% or more than 75% of GiveWell’s.
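The zone-of-uncertainty arithmetic above, written out (the even 50/50 split of the disputed WELLBYs is the simplification stated in this comment, not a figure from either organization):

```python
# Reconstructing the zone-of-uncertainty figures described above.
hli_estimate = 62.0       # WELLBYs per $1000 (HLI)
givewell_estimate = 17.0  # WELLBYs per $1000 (GiveWell)
gap = hli_estimate - givewell_estimate   # 45 disputed WELLBYs
uncertain_half = gap / 2                 # ~22.5: discount plausible, size unclear

zone_low = hli_estimate - uncertain_half               # 39.5: accept GiveWell's discount there
heartland_low = hli_estimate - 0.75 * uncertain_half   # ~45: 75% of their discount
heartland_high = hli_estimate - 0.25 * uncertain_half  # ~56.5: 25% of their discount
print(zone_low, round(heartland_low, 1), round(heartland_high, 1))
```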
The bottom end of that zone of uncertainty (39.5) would pull the neutral point at which a deprivationist approach would conclude AMF = SM up to about 2.9. I suspect few people employing a deprivationist approach have the neutral point that high. AMF is also superior to SM on a decent number of TRIA-based approaches at 39.5 WELLBYs.
So it seems there are two reasonable approaches to donor advice under these kinds of circumstances:
One approach would encourage donors within a specified zone of uncertainty to hold their donations until HLI sufficiently updates its CEA for SM to identify a more appropriate WELLBY figure; or
The other approach would encourage donors to make their decision based on HLI’s best estimate of what the WELLBY figure will be on the next update of the CEA. Even if the first approach is correct, there will be some donors who need to use this one for various reasons (e.g., tax reasons).
I don’t think reaffirming advice on the current model in the interim without any adjustments is warranted, unless you believe the adjustments will be minor enough such that a reasonable donor would likely not find them of substantive importance no matter where they are on the philosophical chart.[2]
In the GiveWell model, the top recommendation is to give to a regranting fund, and there isn’t any explicit ranking of the four top charities. So the recommendation is actually to defer the choice of specific charity to someone who has the most up-to-date information when the monies are actually donated to the effective charity. Moreover, all four top charities are effective in very similar ways. Thus, GiveWell’s bottom-line messaging to donors is much less sensitive to changes in the CEA for any given charity.
I am not sure how to define “minor.” I think whether the change flips the recommendation to the donor is certainly relevant, but wouldn’t go so far as to say that any change that flips the recommendation for a given donor’s philosophical assumptions would be automatically non-minor. On the other hand, I think a large enough change can be non-minor even if it doesn’t flip the recommendation on paper. Some donors apply discounts and bonuses not reflected in HLI’s model. For instance, one could reasonably apply a discount to SM when compared to better-studied interventions, on the basis that CEAs usually decrease as they become more complete. Or one could reasonably apply a bonus to SM because funding a smaller organization is more likely to have a positive effect on its future cost-effectiveness. Thus, just because the change is not outcome-determinative on HLI’s base model doesn’t mean it isn’t so on the donor’s application of the model. The time-to-update and amount of funds involved are also relevant. All that being said, my gut thinks that the starting point for determining minor vs. non-minor is somewhere in the neighborhood of 10%.
You raise a fair point. One we’ve been discussing internally. Given the recent and expected adjustments to StrongMinds, it seems reasonable to update and clarify our position on AMF to say something like, “Under more views, AMF is better than or on par with StrongMinds. Note that currently, under our model, when AMF is better than StrongMinds, it isn’t wildly better.” Of course, while predicting how future research will pan out is tricky, we’d aim to be more specific.
Is this (other than 53% being corrected to 38%) from the post accurate?
Spillovers: HLI estimates that non-recipients of the program in the recipient’s household see 53% of the benefits of psychotherapy from StrongMinds and that each recipient lives in a household with 5.85 individuals.[11] This is based on three studies (Kemp et al. 2009, Mutamba et al. 2018a, and Swartz et al. 2008) of therapy programs where recipients were selected based on negative shocks to children (e.g., automobile accident, children with nodding syndrome, children with psychiatric illness).[12]
If so, a substantial discount seems reasonable to me. It’s plausible these studies also say almost nothing about the spillover, because of how unrepresentative they seem. Presumably much of the content of the therapy will be about the child, so we shouldn’t be surprised if it has much more impact on the child than general therapy for depression.
It’s not clear any specific number away from 0 could be justified.
I find nothing objectionable in that characterization. And if we only had these three studies to guide us, then I’d concede that a discount of some size seems warranted. But we also have A. our priors and B. some new evidence from Barker et al. Both of these point me away from very small spillovers, but again, I’m still very unsure. I think I’ll have clearer views once I’m done analyzing the Barker et al. results and have had someone, ideally Nathanial Barker, check my work.
[Edit: Michael edited to add: “It’s not clear any specific number away from 0 could be justified.”] Well not-zero certainly seems more justifiable than zero. Zero spillovers implies that emotional empathy doesn’t exist, which is an odd claim.
To clarify what I edited in, I mean that, without better evidence/argument, the effect could be arbitrarily small but still nonzero. What reason do we have to believe it’s at least 1%, say, other than very subjective priors?
I agree that analysis of new evidence should help.
I’d point to the literature on time-lagged correlations between household members’ emotional states that I quickly summarised in the last installment of the household spillover discussion. I think it implies a household spillover of 20%. But I don’t know whether this type of data should over- or underestimate the spillover ratio relative to what we’d find in RCTs. I know I’m being really slippery about this, but the Barker et al. analysis so far makes me think it’s larger than that.
Regarding the question of what philosophical view should be used, I wonder if it would also matter if someone were something like prioritarian rather than a total utilitarian. StrongMinds looks to focus on people who suffer more than typical members of these countries’ populations, whilst the lives saved by AMF would presumably cover more of the whole distribution of wellbeing. So a prioritarian may favour StrongMinds more, assuming the people helped are not substantially better off economically or in other ways. (Though, it could perhaps also be argued that the people who would die without AMF’s intervention are extremely badly off pre-intervention.)
Joel’s response
[Michael’s response below provides a shorter, less-technical explanation.]
Summary
Alex’s post has two parts. First, what is the estimated impact of StrongMinds in terms of WELLBYs? Second, how cost-effective is StrongMinds compared to the Against Malaria Foundation (AMF)? I briefly present my conclusions to both in turn. More detail about each point is presented in Sections 1 and 2 of this comment.
The cost-effectiveness of StrongMinds
GiveWell estimates that StrongMinds generates 1.8 WELLBYs per treatment (17 WELLBYs per $1000, or 2.3x GiveDirectly[1]). Our most recent estimate[2] is 10.5 WELLBYs per treatment (62 WELLBYs per $1000, or 7.5x GiveDirectly). This represents an 83% discount (an 8.7 WELLBY gap)[3] to StrongMinds’ effectiveness[4]. These discounts, while sometimes informed by empirical evidence, are primarily subjective in nature. Below I present the discounts, and our response to them, in more detail.
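For concreteness, the arithmetic linking per-treatment effects to per-$1000 figures can be sketched as follows. The ~$170 cost per treatment is HLI’s figure quoted later in this comment; the ~$105 GiveWell-side cost is my assumption, implied by their 38% lower cost figure (footnote 3):

```python
# Illustrative cost-effectiveness arithmetic (the $170 cost is HLI's
# figure; the $105 GiveWell-side cost is an assumed, implied figure).

def wellbys_per_1000(wellbys_per_treatment, cost_per_treatment):
    """WELLBYs generated per $1000 donated."""
    return wellbys_per_treatment / cost_per_treatment * 1000

# HLI: 10.5 WELLBYs per treatment at ~$170 per person treated
hli = wellbys_per_1000(10.5, 170)   # ~62 WELLBYs per $1000
# GiveWell: 1.8 WELLBYs per treatment at an assumed ~$105 per person
gw = wellbys_per_1000(1.8, 105)     # ~17 WELLBYs per $1000

print(round(hli), round(gw))
print(f"effect gap: {10.5 - 1.8:.1f} WELLBYs ({1 - 1.8/10.5:.0%} discount)")
```

This also shows why the discount on the effect per treatment (83%) is larger than the discount on the effect per $1000: the lower cost partially offsets the lower effect.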
Figure 1: Description of GiveWell’s discounts on StrongMinds’ effect, and their source
Notes: The graph shows the factors that make up the 8.7 WELLBY discount.
Table 1: Disagreements on StrongMinds’ per-treatment effect (10.5 vs. 1.8 WELLBYs) and cost
Note: GiveWell estimates StrongMinds has an effect of 1.8 WELLBYs per recipient’s household; HLI estimates that this figure is 10.5. This represents an 8.7 WELLBY gap.
How do we assess GiveWell’s discounts? We summarise our position below.
Figure 2: HLI’s views on GiveWell’s total discount of 83% to StrongMinds’ effects
We think there’s sufficient evidence and reason to justify the size and magnitude of 5% of GiveWell’s total discount.
For ~45% of their total discount, we are sympathetic to including a discount, but we are unsure about the magnitude (generally, we think the discount would be lower). The adjustments that I think are the most plausible are:
A discount of up to 15% for conversion between depression and life-satisfaction SD.
A discount of up to 20% for loss of effectiveness at scale.
A discount of up to 5% for response biases.
Reducing the household size down to 4.8 people.
We are unsympathetic to ~35% of their total discount, because our intuitions differ, but there doesn’t appear to be sufficient existing evidence to settle the matter (i.e., household spillovers).
We think that for 15% of their total discount, the evidence that exists doesn’t seem to substantiate a discount (i.e., their discounts on StrongMinds’ durability).
However, as Michael mentions in his comment, a general source of uncertainty we have is about how and when to make use of subjective discounts. We will make more precise claims about the cost-effectiveness of StrongMinds when we finalise our revision and expansion.
The cost-effectiveness of AMF
The second part of Alex’s post asks how cost-effective StrongMinds is compared to the Against Malaria Foundation (AMF). AMF, which prevents malaria with insecticide-treated bednets, is, in contrast to StrongMinds, primarily a life-saving intervention. Hence, as @Jason rightly pointed out elsewhere in the comments, its cost-effectiveness strongly depends on philosophical choices about the badness of death and the neutral point (see Plant et al., 2022). GiveWell takes a particular set of views (deprivationism with a neutral point of 0.5) that are very favourable to life-saving interventions. But there are other plausible views that can change the results, and even make GiveWell’s estimate of StrongMinds seem more cost-effective than AMF. Whether you accept our original estimate of StrongMinds or GiveWell’s lower estimate, the comparison is still incredibly sensitive to these philosophical choices. I think GiveWell is full of incredible social scientists, and I admire many of them, but I’m not sure that should privilege their philosophical intuitions.
Further research and collaboration opportunities
We are truly grateful to GiveWell for engaging with our research on StrongMinds. I think we largely agree with GiveWell regarding promising steps for future research. We’d be keen to help make many of these come true, if possible. Particularly regarding: other interventions that may benefit from a SWB analysis, household spillovers, publication bias, the SWB effects of psychotherapy (i.e. not just depression), and surveys about views on the neutral point and the badness of death. I would be delighted if we could make progress on these issues, and doubly so if we could do so together.
1. Disagreements on the cost-effectiveness of StrongMinds
HLI estimates that psychotherapy produces 10.5 WELLBYs (or 62 per $1000, 7.5x GiveDirectly) for the household of the recipient, while GiveWell estimates that psychotherapy has about a sixth of the effect, 1.8 WELLBYs (17 per $1000 or 2.3x GiveDirectly[5]). In this section, I discuss the sources of our disagreement regarding StrongMinds in the order I presented in Table 1.
1.1 Household spillover differences
Household spillovers are our most important disagreement. When we discuss the household spillover effect or ratio we’re referring to the additional benefit each non-recipient member of the household gets, as a percentage of what the main recipient receives. We first analysed household spillovers in McGuire et al. (2022), which was recently discussed here. Notably, James Snowden pointed out a mistake we made in extracting some data, which reduces the spillover ratio from 53% to 38%.
GiveWell’s method relies on:
Discounting the 38% figure, citing several general reasons: (A) specific concerns that the studies we use might overestimate the benefits because they focused on families with children who had high-burden medical conditions; and (B) a shallow review of correlational estimates of household spillovers, which found spillover ratios ranging from 5% to 60%.
And finally concluding that their best guess is that the spillover percentage is 15 or 20%[6], rather than 53% (what we used in December 2022) or 38% (what we would use now in light of Snowden’s analysis). Since their resulting figure is a subjective estimate, we aren’t exactly sure why they give that figure, or how much they weigh each piece of evidence.
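To make the disagreement concrete, here is a minimal sketch of the household-spillover arithmetic as I understand both models: each non-recipient household member receives the spillover fraction of the recipient’s effect. The recipient effect of 3.7 WELLBYs is a hypothetical figure chosen only for illustration:

```python
def household_effect(recipient_effect, spillover, household_size):
    """Total WELLBYs for the household: the recipient's effect plus
    each other member receiving `spillover` x that effect."""
    return recipient_effect * (1 + spillover * (household_size - 1))

# Hypothetical recipient effect, chosen for illustration only
r = 3.7
hli_style = household_effect(r, 0.38, 5.85)  # 38% spillover, 5.85 people
gw_style = household_effect(r, 0.15, 4.8)    # 15% spillover, 4.8 people
print(f"HLI-style total: {hli_style:.1f} WELLBYs, "
      f"GiveWell-style total: {gw_style:.1f} WELLBYs")
```

Even holding the recipient effect fixed, moving from 38% spillovers across 5.85 people to 15% across 4.8 people roughly halves the household total, which is why this is our largest single disagreement.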
Table 2: HLI and GiveWell’s views on household spillovers of psychotherapy
| Variable | HLI | GiveWell | Explains how much difference in SM’s effect (%) |
| --- | --- | --- | --- |
| Household spillover ratio for psychotherapy | 38% | 15% | 3 WELLBYs (34% of total gap) |
Note: The household spillover for cash transfers we estimated is 86%.
I reassessed the evidence very recently—as part of the aforementioned discussion with James Snowden—and Alex’s comments don’t lead me to update my view further. In my recent analysis, I explained that I think I should weigh the studies we previously used less because they do seem less relevant to StrongMinds, but I’m unsure what to use instead. And I also hold a more favourable intuition about household spillovers for psychotherapy, because parental mental health seems important for children (e.g., Goodman, 2020).
But I think we can agree that collecting and analysing new evidence could be very important here. The data from Barker et al. (2022), a high quality RCT of the effect of CBT on the general population in Ghana (n = ~7,000) contains information on both partners’ psychological distress when one of them received cognitive behavioural therapy, so this data can be used to estimate any spousal spillover effects from psychotherapy. I am in the early stage of analysing this data[7]. There also seems to be a lot of promising primary work that could be done to estimate household spillovers alongside the effects of psychotherapy.
1.2 Conversion between measures, data sources, and units
The conversion between depression and life-satisfaction (LS) scores ties with household spillovers in terms of importance for explaining our disagreements about the effectiveness of psychotherapy. We’ve previously assumed that a one standard deviation (SD) decrease in depression symptoms (or affective mental health; MHa) is equivalent to a one SD improvement in life-satisfaction or happiness (i.e., a 1:1 conversion), see here for our previous discussion and rationale.
GiveWell has two concerns with this:
Depression and life-satisfaction measures might not be sufficiently empirically or conceptually related to justify a 1:1 conversion. Because of this, they apply an empirically based 10% discount.
They are concerned that recipients of psychotherapy have a smaller variance in subjective wellbeing (SWB) than general populations (e.g., cash transfers), which leads to inflated effect sizes. They apply a 20% subjective discount to account for this.
Hence, GiveWell applied a 30% discount (see Table 3 below).
Table 3: HLI and GiveWell’s views on converting between SDs of depression and life satisfaction
| HLI | GiveWell | Explains what difference in SM’s effect (%) |
| --- | --- | --- |
| 1 to 1 | 1 to 0.7 | 3 WELLBYs (34% of total) |
Overall, I agree that there are empirical reasons for including a discount in this domain, but I’m unsure of its magnitude. I think it will likely be smaller than GiveWell’s 30% discount.
1.2.1 Differences between the two measures
First, GiveWell mentions a previous estimate of ours suggesting that mental health (MH) treatments[8] impact depression 11% more than SWB. Our original calculation used a naive average, but on reflection, it seems more appropriate to use a sample-size-weighted average (because of the large differences in sample sizes between studies), which suggests depression measures overestimate SWB measures by 4%, instead of 11%.
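A toy illustration of why the averaging choice matters. The ratios and sample sizes below are hypothetical, chosen only to mirror the 11%-vs-4% pattern, not our actual study set:

```python
def weighted_average(estimates, ns):
    """Sample-size-weighted average of per-study estimates."""
    total_n = sum(ns)
    return sum(e * n for e, n in zip(estimates, ns)) / total_n

# Hypothetical per-study ratios of depression effect to SWB effect
ratios = [1.25, 1.05, 1.02]
ns = [60, 900, 1200]

naive = sum(ratios) / len(ratios)        # treats all studies equally
weighted = weighted_average(ratios, ns)  # large studies count more
print(f"naive: {naive:.2f}, weighted: {weighted:.2f}")
```

When one small study reports an outlying ratio, the naive average lets it dominate; weighting by sample size pulls the estimate toward the better-powered studies.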
Results on depression and happiness measures are also very close in Bhat et al. (2022; n = 589), the only study I’ve found so far that looks at the effects of psychotherapy on both types of measures. We can standardise the effects in two ways. Depending on the method, the SWB effects are 18% larger or 1% smaller than the MHa effects[9]. Thus, the effects of psychotherapy on depression appear to be of similar size to the effects on SWB. Given these results, I think the discount due to empirical differences could be smaller than 10%; I would guess 3%.
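The stretch transformation mentioned here (and in footnote 9) is just a rescaling by the ratio of the two scales’ ranges; a minimal sketch using the Bhat et al. figures quoted in the footnote:

```python
def stretch_transform(effect, from_lo, from_hi, to_lo, to_hi):
    """Rescale a raw effect from one bounded scale to another by the
    ratio of the scales' ranges (the 'stretch' transformation)."""
    return effect * (to_hi - to_lo) / (from_hi - from_lo)

# A 0.97-point reduction on the PHQ-9 (0-27) mapped onto a 1-10 scale
dep_on_1_10 = stretch_transform(0.97, 0, 27, 1, 10)
print(f"{dep_on_1_10:.2f}")  # ~0.32, vs. the 0.38 happiness effect
```

This is the calculation behind the “SWB changes are 18% larger” figure: 0.38 / 0.32 ≈ 1.18.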
Another part of this is that depression and life satisfaction are not the same concept. So if the scores are different, there is a further moral question about which deserves more weight. The HLI ‘house view’, as our name indicates, favours happiness (how good/bad we feel) as what matters. Further, we suspect that measures of depression are conceptually closer to happiness than measures of life satisfaction are. Hence, if push came to shove, and there is a difference, we’d care more about the depression scores, so no discount would be justified. From our conversation with Alex, we understand that the GiveWell ‘house view’ is to care more about life satisfaction than happiness. In this case, GiveWell would be correct, by their lights, to apply some reduction here.
1.2.2 Differences in variance
In addition to their 10% conversion discount, GiveWell adds another 20% discount because they think a sample of people with depression has a smaller variance in life-satisfaction scores.[10] Setting aside the technical topic of why variation in variances matters, I investigated, using a few datasets, whether life-satisfaction SDs are lower when you screen for baseline depression. I found that, if anything, the SDs are larger by 4% (see Table 4 below). Although I see the rationale behind GiveWell’s speculation, the evidence I’ve looked at suggests a different conclusion.
Table 4: Life-satisfaction SD depending on clinical mental health cutoff
| | LS SD for general pop | LS SD for dep pop | SWB SD change (gen → dep) | SWB measure |
| --- | --- | --- | --- | --- |
| | 1.23 | 1.30 | 106% | LS 1-10 |
| | 1.65 | 1.88 | 114% | LS 0-10 |
| | 2.43 | 2.38 | 98% | LS 1-10 |
| | 1.02 | 1.04 | 102% | LS (z-score) |
| Average change | 1.58 | 1.65 | 104% | |
Note: BHPS = The British Household Panel Survey, HILDA = The Household Income and Labour Dynamics Survey, NIDS = National Income Dynamics Study. LS = life satisfaction, dep = depression.
However, I’m separately concerned that SD changes in trials where recipients are selected based on depression (i.e., psychotherapy) are inflated compared to trials without such selection (i.e., cash transfers)[11].
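The mechanism behind this worry can be shown with a small simulation. This is a pure truncation model with made-up numbers, so it illustrates the concern rather than the data; note that the datasets in Table 4 did not behave this way for life satisfaction:

```python
import random
import statistics

random.seed(0)

# Hypothetical baseline depression scores for a general population
# (all numbers here are illustrative, not from any real dataset)
pop = [random.gauss(10, 5) for _ in range(100_000)]
# A trial sample screened for clinically significant depression
screened = [x for x in pop if x >= 12]

sd_pop = statistics.stdev(pop)
sd_scr = statistics.stdev(screened)

# The same raw improvement yields a larger standardised effect
# (Cohen's d = raw effect / SD) when the sample's SD is smaller.
raw_effect = 2.0
print(f"SD general: {sd_pop:.2f}, SD screened: {sd_scr:.2f}")
print(f"d general: {raw_effect / sd_pop:.2f}, "
      f"d screened: {raw_effect / sd_scr:.2f}")
```

Under simple truncation, screening on the outcome shrinks the SD, so identical raw improvements produce larger standardised effect sizes in screened (psychotherapy-style) trials than in unscreened (cash-transfer-style) trials.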
Overall, I think I agree with GiveWell that there should be a discount here that HLI doesn’t implement, but I’m unsure of its magnitude, and I think that it’d be smaller than GiveWell’s. More data could likely be collected on these topics, particularly how much effect sizes in practice differ between life-satisfaction and depression, to reduce our uncertainty.
1.3 Loss of effectiveness outside trials and at scale
GiveWell explains their concern, summarised in the table below:
“Our general expectation is that programs implemented as part of randomized trials are higher quality than similar programs implemented at scale. [...] For example, HLI notes that StrongMinds uses a reduced number of sessions and slightly reduced training, compared to Bolton (2003), which its program is based on. We think this type of modification could reduce program effectiveness relative to what is found in trials. [...] We can also see some evidence for lower effects in larger trials…”
Table 5: HLI and GiveWell’s views on an adjustment for StrongMinds losing effectiveness at scale
Explains what difference in SM’s effect (%): 0.9 WELLBYs (10.1% of total gap)
While GiveWell provides several compelling reasons why StrongMinds’ efficacy will decrease as it scales, I can’t find GiveWell’s reasoning for why these considerations result in a 25% discount. It seems like a subjective judgement informed by some empirical factors and perhaps by previous experience studying this issue (e.g., cases like No Lean Season). Is there any quantitative evidence that suggests that when RCT interventions scale, they drop 25% in effectiveness? While GiveWell also mentions that larger psychotherapy trials have smaller effects, I assume this is driven by publication bias (discussed in Section 1.6). I’m also less sure that scaling has no offsetting benefits. I would be surprised if, when RCTs are run, the intervention has all of its kinks ironed out. In fact, there are many cases of the RCT version of an intervention being the “minimum viable product” (Karlan et al., 2016). While I think a discount here is plausible, I’m very unsure of its magnitude.
In our updated meta-analysis, we plan on doing a deeper analysis of the effect of expertise and time spent in therapy, and to use this to better predict the effect of StrongMinds. We’re awaiting the results from Baird et al., which should better reflect StrongMinds’ new strategy, as StrongMinds trained, but did not directly deliver, the programme.
1.4 Disagreements on the durability of psychotherapy
GiveWell explains their concern summarised in the table below, “We do think it’s plausible that lay-person-delivered therapy programs can have persistent long-term effects, based on recent trials by Bhat et al. 2022 and Baranov et al. 2020. However, we’re somewhat skeptical of HLI’s estimate, given that it seems unlikely to us that a time-limited course of group therapy (4-8 weeks) would have such persistent effects. We also guess that some of the factors that cause StrongMinds’ program to be less effective than programs studied in trials (see above) could also limit how long the benefits of the program endure. As a result, we apply an 80% adjustment factor to HLI’s estimates. We view this adjustment as highly speculative, though, and think it’s possible we could update our view with more work.”
Table 6: HLI and GiveWell’s views on a discount to account for a decrease in durability
Explains what difference in SM’s effect (%)
Since this disagreement appears mainly based on reasoning, I’ll explain why my intuitions—and my interpretation of the data—differ from GiveWell’s here. I already assume that StrongMinds’ effect decays 4% more each year than psychotherapy in general does (see Table 3). Baranov et al. (2020) and Bhat et al. (2022) both find long-term effects that are greater than what our general model predicts. This means that we already assume a higher decay rate in general, and especially for StrongMinds, than the two best long-term studies of psychotherapy suggest. I show how these studies compare to our model in Figure 3 below.
Figure 3: Effects of our model over time, and the only long-term psychotherapy studies in LMICs
Edit: I updated the figure to add the StrongMinds model, which starts with a higher effect but has a faster decay.
Baranov et al. (2020, 16 intended sessions) and Bhat et al. (2022, 6-14 intended sessions, with a 70% completion rate) were both time-limited. StrongMinds historically used 12 sessions (it may be 8 now) of 90 minutes[12]. Therefore, our model is more conservative than the Baranov et al. result, and closer to the Bhat et al. result, which has a similar range of sessions. Another reason in favour of the durability of StrongMinds’ effects, which I mentioned in McGuire et al. (2021), is that 78% of groups continued meeting on their own at least six months after the programme formally ended.
Bhat et al. (2022) is also notable in another regard: they asked ~200 experts to predict the impact of the intervention after 4.5 years. The median prediction underestimated the effectiveness by nearly a third, which makes me inclined to weigh expert priors less here[13].
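As a rough illustration of how a slightly higher annual decay rate compounds over time. The initial effect, decay rates, and horizon below are hypothetical stand-ins, not our model’s actual parameters:

```python
def total_wellbys(initial_effect, annual_decay, horizon_years=10):
    """Sum of a geometrically decaying annual effect (a simplified
    stand-in for the decay models discussed; numbers hypothetical)."""
    return sum(initial_effect * (1 - annual_decay) ** t
               for t in range(horizon_years))

general = total_wellbys(1.0, 0.30)  # generic psychotherapy decay
sm = total_wellbys(1.0, 0.34)       # StrongMinds: assumed faster decay
print(f"general: {general:.2f} SD-years, StrongMinds: {sm:.2f} SD-years")
```

Even a few percentage points of extra annual decay shave roughly 10% off the cumulative benefit over a decade, which is why small disagreements about decay rates matter for the totals.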
Additionally, there seems to be some double-counting in GiveWell’s adjustments. The initial effect is adjusted by 0.75 for “lower effectiveness at scale and outside of trial contexts”, and the duration is adjusted by 0.80, also for “lower effectiveness at scale and outside of trial contexts”. Combined, this is a 0.60 adjustment rather than a single 0.80 adjustment. I feel like one concern should show up as one discount.
1.5 Disagreements on social desirability bias[14]
GiveWell explains their concern, which is summarised in the table below: “One major concern we have with these studies is that participants might report a lower level of depression after the intervention because they believe that is what the experimenter wants to see [...] HLI responded to this criticism [section 4.4] and noted that studies that try to assess experimenter-demand effects typically find small effects.[...] We’re not sure these tests would resolve this bias so we still include a downward adjustment (80% adjustment factor).”
Table 7: HLI and GiveWell’s views on a discount for social desirability bias
Explains what diff in SM’s effect (%)
Participants might report bigger effects to be agreeable with the researchers (socially driven bias) or in the hopes of future rewards (cognitively driven bias; Bandiera et al., 2018), especially if they recognise the people delivering the survey to be the same people delivering the intervention[15].
But while I also worry about this issue, I am less concerned than GiveWell that response bias poses a unique threat to psychotherapy. If this bias exists, it seems likely to apply to all RCTs of interventions with self-reported outcomes (and without active controls). So I think the relevant question is: why might the propensity for response bias differ between cash transfers and psychotherapy? Here are some possibilities:
It seems potentially more obvious that psychotherapy should alleviate depression than cash transfers should increase happiness. If so, questions about self-reported wellbeing may be more subject to bias in psychotherapy trials[16].
We could expect that the later the follow-up, the less salient the intervention is, the less likely respondents are to be biased in this way (Park & Kumar, 2022). This is one possibility that could favour cash transfers because they have relatively longer follow-ups than psychotherapy.
However, it is obvious to cash transfer participants whether they are in the treatment (they receive cash) or control conditions (they get nothing). This seems less true in psychotherapy trials where there are often active controls.
GiveWell responded to the previous evidence I cited (McGuire & Plant, 2021, Section 4.4)[17] by arguing that the tests run in the literature (which investigate the effect of a general propensity towards socially desirable responding, or of the surveyor’s stated expectations) are not relevant because: “If the surveyor told them they expected the program to worsen their mental health or improve their mental health, it seems unlikely to overturn whatever belief they had about the program’s expected effect that was formed during their group therapy sessions.” But if participants’ views about an intervention are unlikely to be overturned by what the surveyor seems to want, when the surveyor’s wants and the participant’s experience differ, then that’s a reason to be less concerned about socially motivated response bias in general.
However, I am more concerned with socially desirable responses driven by cognitive factors. Bandiera et al. (2018, p. 25) is the only study I found to discuss the issue, but they do not seem to think this was an issue with their trial: “Cognitive drivers could be present if adolescent girls believe providing desirable responses will improve their chances to access other BRAC programs (e.g. credit). If so, we might expect such effects to be greater for participants from lower socioeconomic backgrounds or those in rural areas. However, this implication runs counter to the evidence in Table A5, where we documented relatively homogenous impacts across indices and time periods, between rich/poor and rural/urban households.”
I agree with GiveWell that more research would be very useful, and could potentially update my views considerably, particularly with respect to the possibility of cognitively driven response bias in RCTs deployed in low-income contexts.
1.6 Publication bias
GiveWell explains their concern, which we summarise in the table below: “HLI’s analysis includes a roughly 10% downward adjustment for publication bias in the therapy literature relative to cash transfers literature. We have not explored this in depth but guess we would apply a steeper adjustment factor for publication bias in therapy relative to our top charities. After publishing its cost-effectiveness analysis, HLI published a funnel plot showing a high level of publication bias, with well-powered studies finding smaller effects than less-well-powered studies. This is qualitatively consistent with a recent meta-analysis of therapy finding a publication bias of 25%.”
Table 8: HLI and GiveWell’s views on a publication bias discount
Explains what diff in SM’s effect (%)
After some recent criticism, we have revisited this issue and are working on estimating the bias empirically. Publication bias seems like a real issue, and a 10-25% correction like the one GiveWell suggests seems plausible, but we’re unsure about the magnitude as our research is ongoing. In the update of our psychotherapy meta-analysis, we plan to employ a more sophisticated quantitative approach to adjusting for publication bias.
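For intuition, here is a toy simulation of the small-study pattern GiveWell points to, with an Egger-style check. The data are synthetic and the selection rule is made up; this is not our planned adjustment method:

```python
import random

random.seed(1)

# Synthetic meta-analytic data: true effect 0.3, but noisier (high-SE)
# studies only get 'published' when their estimate comes out large.
studies = []
while len(studies) < 40:
    se = random.uniform(0.05, 0.4)
    est = random.gauss(0.3, se)
    if est > se:  # crude significance-style publication filter
        studies.append((se, est))

# Egger-style check: regress effect size on standard error.
# With no small-study bias the slope should be near zero.
n = len(studies)
mean_se = sum(s for s, _ in studies) / n
mean_est = sum(e for _, e in studies) / n
slope = (sum((s - mean_se) * (e - mean_est) for s, e in studies)
         / sum((s - mean_se) ** 2 for s, _ in studies))
print(f"Egger-style slope: {slope:.2f}")
```

Under this selection rule, less-well-powered studies systematically report larger effects, which produces the positive SE-effect slope that funnel-plot methods look for.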
1.7 Household size
GiveWell explains their concern, which we summarise in the table below: “HLI estimates household size using data from the Global Data Lab and UN Population Division. They estimate a household size of 5.9 in Uganda based on these data, which appears to be driven by high estimates for rural household size in the Global Data Lab data, which estimate a household size of 6.3 in rural areas in 2019. A recent Uganda National Household Survey, on the other hand, estimates household size of 4.8 in rural areas. We’re not sure what’s driving differences in estimates across these surveys, but our best guess is that household size is smaller than the 5.9 estimate HLI is using.”
Table 9: HLI and GiveWell’s views on household size of StrongMinds’ recipients
Explains what diff in SM’s effect (%)
I think the figures GiveWell cites are reasonable. I favour using international datasets because I assume it means greater comparability between countries, but I don’t feel strongly about this. I agree it could be easy and useful to try to understand StrongMinds recipients’ household sizes more directly. We will revisit this in our StrongMinds update.
1.8 Cost per person of StrongMinds treated
The one element where we differ that makes StrongMinds look more favourable is cost. As GiveWell explains: “HLI’s most recent analysis includes a cost of $170 per person treated by StrongMinds, but StrongMinds cited a 2022 figure of $105 in a recent blog post.”
Table 10: HLI and GiveWell’s views on cost per person for StrongMinds’ treatment
According to their most recent quarterly report, a cost per person of $105 was the goal, but they claim $74 per person for 2022[18]. We agree this is a more accurate/current figure, and the cost might well be lower now. A concern is that the reduction in costs comes at the expense of treatment fidelity – an issue we will review in our updated analysis.
2. GiveWell’s cost-effectiveness estimate of AMF is dependent on philosophical views
GiveWell estimates that AMF produces 70 WELLBYs per $1000[19], which would be 4 times better than StrongMinds. GiveWell described the philosophical assumptions of their life saving analysis as: “...Under the deprivationist framework and assuming a “neutral point” of 0.5 life satisfaction points. [...] we think this is what we would use and it seems closest to our current moral weights, which use a combination of deprivationism and time-relative interest account.”
Hence, they conclude that AMF produces 70 WELLBYs per $1000, which makes StrongMinds 0.24 times as cost-effective as AMF. However, the position they take is nearly the most favourable one can take towards interventions that save lives[20]. But there are other plausible views about the neutral point and the badness of death (we discuss this in Plant et al., 2022). Indeed, assigning credences to higher neutral points[21] or alternative philosophical views of death’s badness will reduce the cost-effectiveness of AMF relative to StrongMinds (see Figure 4). In some cases, AMF is less cost-effective than GiveWell’s estimate of StrongMinds[22].
Figure 4: Cost-effectiveness of charities under different philosophical assumptions (with updated StrongMinds value, and GiveWell’s estimate for StrongMinds)
To be clear, HLI does not (yet) take a stance on these different philosophical views. While I present some of my views here, these do not represent HLI as a whole.
Personally, I’d use a neutral point closer to 2 out of 10[23]. Regarding the philosophy, I think my credences would be close to uniformly distributed across the Epicurean, TRIA, and deprivationist views. If I plug this view into our model introduced in Plant et al. (2022) then this would result in a cost-effectiveness for AMF of 29 WELLBYs per $1000 (rather than 81 WELLBYs per $1000)[24], which is about half as good as the 62 WELLBYs per $1000 for StrongMinds. If GiveWell held these views, then AMF would fall within GiveWell’s pessimistic and optimistic estimates of 3-57 WELLBYs per $1000 for StrongMinds’ cost-effectiveness. For AMF to fall above this range, you need to (A) put almost all your credence in deprivationism and (B) have a neutral point lower than 2[25].
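To show how these philosophical parameters enter the calculation, here is a highly simplified sketch. The life-satisfaction level, years remaining, and single TRIA “connectedness” multiplier are all hypothetical stand-ins, not the figures or the full model from Plant et al. (2022):

```python
def wellbys_from_saving_life(life_sat, neutral_point, years_remaining,
                             credences, connectedness=0.5):
    """Illustrative WELLBY value of averting a death under a mix of
    views (all parameters hypothetical):
    - deprivationism: full wellbeing above neutral for remaining years
    - TRIA: the same, discounted by psychological connectedness
    - Epicureanism: death is not bad for the one who dies
    """
    dep = (life_sat - neutral_point) * years_remaining
    views = {"deprivationism": dep,
             "tria": dep * connectedness,
             "epicurean": 0.0}
    return sum(credences[v] * views[v] for v in views)

uniform = {"deprivationism": 1/3, "tria": 1/3, "epicurean": 1/3}
pure_dep = {"deprivationism": 1.0, "tria": 0.0, "epicurean": 0.0}

# GiveWell-style view (deprivationism, neutral point 0.5) vs. a
# uniform credence mix with a neutral point of 2, for a hypothetical
# 4.5/10 life satisfaction and 50 remaining years:
print(wellbys_from_saving_life(4.5, 0.5, 50, pure_dep))
print(wellbys_from_saving_life(4.5, 2.0, 50, uniform))
```

The point is not the specific numbers but the sensitivity: raising the neutral point and spreading credence across views can cut the WELLBY value of saving a life by a factor of two or more, which is what moves AMF relative to StrongMinds.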
Coincidentally, this is (barely) within our most recent confidence interval for comparing the cost-effectiveness of StrongMinds to GiveDirectly (95% CI: 2, 100).
This calculation is based on a correction for a mistake in our spillover ratio discussed here (a spillover ratio of 38% instead of 53%). Our previous estimate was 77 WELLBYs per $1000 (Plant et al., 2022; McGuire et al., 2022).
The discount on the effect per $1000 is smaller because GiveWell used a 38% smaller cost figure.
Note that the reduction in cost-effectiveness is only 27% because they also think that the costs are 62% smaller.
Coincidentally, this is (barely) within our most recent confidence interval for comparing the cost-effectiveness of StrongMinds to GiveDirectly (95% CI: 2, 100).
The text and the table give different values.
But if you want to accept that the results could be very off, see here for a document with tables with my very preliminary results.
These are positive psychology interventions (like mindfulness and forgiveness therapy) which might not completely generalise to psychotherapy in LMICs.
Psychotherapy improved happiness by 0.38 on a 1-10 score and reduced depression by 0.97 (on the PHQ-9’s 0-27 scale). If we convert the depression score to a 1-10 scale, using stretch transformation, then the effect is a reduction in depression of 0.32. Hence, the SWB changes are 18% larger than MHa changes. If we convert both results to Cohen’s d, we find a Cohen’s d of 0.167 for depression and a Cohen’s d of 0.165 for happiness. Hence changes in MHa are 1% greater than SWB.
“it seems likely that SD in life satisfaction score is lower among StrongMinds recipients, who are screened for depression at baseline46 and therefore may be more concentrated at the lower end of the life satisfaction score distribution than the average individual.”
Sample selection based on depression (i.e., selection based on the outcome used) could shrink the variance of depression scores in the sample, which would inflate standardised effect sizes of depression compared to trials without depression selection, because standardisation occurs by dividing the raw effect by its standard deviation (i.e., standardised mean differences, such as Cohen’s d). To explore this, I used the datasets mentioned in Table 4, all of which also included measures of depression or distress, and the data from Barker et al. (2022, n = 11,835). I found that the SD of depression for those with clinically significant depression was 18 to 21% larger than it was for the general sample (both the mentally ill and healthy). This seems to indicate that SD changes from psychotherapy provide inflated SD changes in depression compared to cash transfers, due to smaller SDs of depression. However, I think this may be offset by another technical adjustment. Our estimate of the life-satisfaction SD we use to convert SD changes (in MHa or SWB) to WELLBYs might be larger, which means the effects of psychotherapy and cash transfers are underestimated by 14% compared to AMF. When we convert from SD-years to WELLBYs, we’ve used a mix of LMIC and HIC sources to estimate the general SD of LS. But I realised that there’s a version of the World Happiness Report that published data including the SDs of LS for many countries in LMICs. If we use this more direct data for sub-Saharan countries, then it suggests a higher SD of LS than what I previously estimated (2.5 instead of 2.2, according to a crude estimate), a 14% increase.
In one of the Bhat et al. trials, each session was 30 to 45 minutes (it’s unclear what the session length was for the other trials).
Note, I was one of the predictors, and my guess was in line with the crowd (~0.05 SDs), and you can’t see others’ predictions beforehand on the Social Science Prediction Platform.
Note, this is more about ‘experimenter demand effects’ (i.e., being influenced by the experimenters in a certain direction, because that’s what they want to find) than ‘social desirability bias’ (i.e., reporting that one is happier than one is because it looks better). The latter is controlled for in an RCT. We keep the wording used by GW here.
GiveWell puts it in the form of this scenario “If a motivated and pleasant IPT facilitator comes to your village and is trying to help you to improve your mental health, you may feel some pressure to report that the program has worked to reward the effort that facilitator has put into helping you.” But these situations are why most implementers in RCTs aren’t the surveyors. I’d be concerned if there were more instances of implementers acting as surveyors in psychotherapy than cash transfer studies.
On the other hand, who in poverty expects cash transfers to bring them misery? That seems about as rare (or rarer) as those who think psychotherapy will deepen their suffering. However, I think the point is about what participants think that implementers most desire.
Since then, I did some more digging. I found Dhar et al. (2018) and Islam et al. (2022), which use a questionnaire to test for the propensity to answer questions in a socially desirable manner, and find similarly small amounts of socially motivated response bias. Park et al. (2022) take an alternative approach, randomising a subset of participants to self-survey, and argue that this does not change the results.
This is mostly consistent with 2022 expenses / people treated = $8,353,149 / 107,471 ≈ $78.
81 WELLBYs per $1000 in our calculations, but they add some adjustments.
The most favourable position would be assuming deprivationism and a neutral point of zero.
People might hold that the neutral point is higher than 0.5 (on a 0-10 scale), which would reduce the cost-effectiveness of AMF. The IDinsight survey GiveWell relies on covers people from Kenya and Ghana, but has a small sample (n = 70) for its neutrality question. In our pilot report (n = 79; UK sample; Samuelsson et al., 2023), we find a neutral point of 1.3. See Samuelsson et al. (2023; Sections 1.3 and 6) for a review of the different findings in the literature and more detail on our findings. Recent unpublished work by Julian Jamison finds a neutral point of 2.5 on a sample of ~1,800 drawn from the USA, Brazil and China. Note that, in all these cases, we recommend caution in concluding that any of these values is the neutral point. There is still more work to be done.
Under GiveWell’s analysis, there are still some combinations of philosophical factors where AMF produces 17 WELLBYs or less (i.e., is as or less good than SM in GiveWell’s analysis): (1) An Epicurean view, (2) Deprivationism with neutral points above 4, and (3) TRIA with high ages of connectivity and neutral points above 3 or 4 (depending on the combination). This does not include the possibility of distributing credences across different views.
I would put the most weight on the work by HLI and by Jamison and colleagues, mentioned above, which find neutral points of 1.3/10 and 2.5/10, respectively.
I average the results across each view.
We acknowledge that many people may hold these views. We also want to highlight that many people may hold other views. We encourage more work investigating the neutral point and investigating the extent to which these philosophical views are held.
Zooming out a little: is it your view that group therapy increases happiness by more than the death of your child decreases it? (GiveWell is saying that this is what your analysis implies.)
To be a little more precise:
I.e., is it your view that 4-8 weeks of group therapy (~12 hours) for 20 people is preferable to averting the death of a child?
To be clear on what the numbers are: we estimate that group psychotherapy has an effect of 10.5 WELLBYs on the recipient’s household, and that the death of a child in a LIC has a −7.3 WELLBY effect on the bereaved household. But the grief estimate was very shallow. The report it came from was not focused on making a cost-effectiveness estimate of saving a life (with AMF). Again, I know this sounds weasel-y, but we haven’t yet formed a view on the goodness of saving a life, so I can’t say how much group therapy HLI thinks is preferable to averting the death of a child.
That being said, I’ll explain why this comparison, as it stands, doesn’t immediately strike me as absurd. Grief has an odd counterfactual. We can only extend lives. People who are saved will still die, and the people who love them will still grieve. The question is how much worse the total grief is for a very young child (the typical beneficiary of, e.g., AMF) than the grief for the adolescent, young adult, adult, or elder they’d become[1], all multiplied by the mortality risk at those ages.
So is psychotherapy better than the counterfactual grief averted? Again, I’m not sure because the grief estimates are quite shallow, but the comparison seems less absurd to me when I hold the counterfactual in mind.
I assume people who are not very young children also have larger social networks, and that this could also play into the counterfactual (e.g., non-children may be grieved for by more people, who forged deeper bonds). But I’m not sure how much to make of this point.
Thanks Joel.
My intuition, which is shared by many, is that the badness of a child’s death is not merely due to the grief of those around them. So presumably the question should not compare just the counterfactual grief of losing a very young child vs. an [older adult], but should also include the “lost wellbeing” from living a net-positive-wellbeing life in expectation?
I also just saw that Alex claims HLI “estimates that StrongMinds causes a gain of 13 WELLBYs”. Is this for 1 person going through StrongMinds (i.e. ~12 hours of group therapy), or something else? Where does the 13 WELLBYs come from?
I ask because if we are using HLI’s estimates of WELLBYs per death averted, and use your preferred estimate for the neutral point, then 13 / (4.95-2) is >4 years of life. Even if we put the neutral point at zero, this suggests 13 WELLBYs is worth >2.5 years of life.[1]
I think I’m misunderstanding something here, because GiveWell claims “HLI’s estimates imply that receiving IPT-G is roughly 40% as valuable as an additional year of life per year of benefit or 80% of the value of an additional year of life total.”
Can you help me disambiguate this? Apologies for the confusion.
13 / 4.95
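A quick check of the arithmetic in the comment above (all inputs are the thread's figures: the 13-WELLBY estimate and the 4.95 average life satisfaction):

```python
wellbys = 13.0   # Alex's cited HLI estimate for StrongMinds
avg_ls = 4.95    # average life satisfaction on a 0-10 scale

# Years-of-life equivalents under two neutral points:
print(round(wellbys / (avg_ls - 2), 2))  # neutral point 2 -> ~4.41 (>4 years)
print(round(wellbys / (avg_ls - 0), 2))  # neutral point 0 -> ~2.63 (>2.5 years)
```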
I didn’t mean to imply that the badness of a child’s death is just due to grief. As I said in my main comment, I place substantial credence (2/3rds) in the view that death’s badness is the wellbeing lost. Again, this is my view, not HLI’s.
The 13 WELLBY figure is the household effect of a single person being treated by StrongMinds. But that uses the uncorrected household spillover (53% spillover rate). With the correction (38% spillover) it’d be 10.5 WELLBYs (3.7 WELLBYs for recipient + 6.8 for household).
GiveWell arrives at the figure of 80% because they take a year of life as valued at 4.45 WELLBYs (4.95 − 0.5, using their preferred neutral point), and the StrongMinds benefit to the direct recipient, according to HLI, is 3.77 WELLBYs: 3.77 / 4.45 ≈ 80%. I’m not sure where the 40% figure comes from.
That makes sense, thanks for clarifying!
If I understand correctly, the updated figures should then be:
For 1 person being treated by StrongMinds (excluding all household spillover effects) to be worth the WELLBYs gained for a year of life[1] with HLI’s methodology, the neutral point needs to be at least 4.95-3.77 = 1.18.
If we include spillover effects of StrongMinds (and use the updated / lower figures), then the benefit of 1 person going through StrongMinds is 10.7 WELLBYs.[2] Under HLI’s estimates, this is equivalent to more than two years of wellbeing benefits from the average life, even if we set the neutral point at zero. Using your personal neutral point of 2 would suggest the intervention for 1 person including spillovers is equivalent to >3.5 years of wellbeing benefits. Is this correct or am I missing something here?
1.18 as the neutral point seems pretty reasonable, though the idea that 12 hours of therapy for an individual is worth the wellbeing benefits of 1 year of an average life when only considering impacts to them, and anywhere between 2~3.5 years of life when including spillovers does seem rather unintuitive to me, despite my view that we should probably do more work on subjective wellbeing measures on the margin. I’m not sure if this means:
WELLBYs as a measure can’t capture what I care about in a year of healthy life, so we should not use solely WELLBYs when measuring wellbeing;
HLI isn’t applying WELLBYs in a way that captures the benefits of a healthy life;
The existing way of estimating 1 year of life via WELLBYs is wrong in some other way (e.g. the 4.95 assumption is wrong, the 0-10 scale is wrong, the ~1.18 neutral point is wrong);
HLI have overestimated the benefits of StrongMinds;
I have a very poorly calibrated view of how much 12 hours of therapy or a year of life is worth, though this seems less likely.
Would be interested in your thoughts on this / let me know if I’ve misinterpreted anything!
More precisely, the average wellbeing benefits from 1 year of life from an adult in 6 African countries
3.77*(1+0.38*4.85)
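Checking the back-of-envelope numbers in the comment above (all inputs are the thread's figures: the 3.77-WELLBY recipient effect, the 38% spillover rate, 4.85 other household members, and 4.95 average life satisfaction):

```python
recipient_effect = 3.77   # WELLBYs to the direct recipient (HLI)
spillover_rate = 0.38     # corrected household spillover rate
other_household = 4.85    # other household members (footnote [2])
avg_ls = 4.95             # average life satisfaction, 0-10 scale

total = recipient_effect * (1 + spillover_rate * other_household)
print(round(total, 1))                      # ~10.7 WELLBYs with spillovers

# Neutral point at which the direct effect equals one year of life:
print(round(avg_ls - recipient_effect, 2))  # 1.18

# Years-of-life equivalents of the with-spillovers total:
print(round(total / (avg_ls - 0), 1))       # ~2.2 years (neutral point 0)
print(round(total / (avg_ls - 2), 1))       # ~3.6 years (neutral point 2)
```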
I appreciate your candid response. To clarify further: suppose you give a mother a choice between “your child dies now (age 5), but you get group therapy” and “your child dies in 60 years (age 65), but no group therapy”. Which do you think she will choose?
Also, if you don’t mind answering: do you have children? (I have a hypothesis that EA values are distorted by the lack of parents in the community; I don’t know how to test this hypothesis. I hope my question does not come off as rude.)
I don’t think that’s the right question for three reasons.
First, the hypothetical mother will almost certainly consider the well-being of her child (under a deprivationist framework) in making that decision—no one is suggesting that saving a life is less valuable than therapy under such an approach. Whatever the merits of an epicurean view that doesn’t weigh lost years of life, we wouldn’t have lasted long as a species if parents applied that logic to their own young children.
Second, the hypothetical mother would have to live with the guilt of knowing she could have saved her child but chose something for herself.
Finally, GiveWell-type recommendations often would fail the same sort of test. Many beneficiaries would choose receiving $8X (where X = bednet cost) over receiving a bednet, even where GiveWell thinks they would be better off with the latter.
Thanks for your response.
If the mother would rather have her child alive, then under what definition of happiness/utility do you conclude she would be happier with her child dead (but getting therapy)? I understand you’re trying to factor out the utility loss of the child; so am I. But just from the mother’s perspective alone: she prefers scenario X to scenario Y, and you’re saying it doesn’t count for some reason? I don’t get it.
I think you’re double-subtracting the utility of the child: you’re saying, let’s factor it out by not asking the child his preference, and ALSO let’s ADDITIONALLY factor it out by not letting the mother be sad about the child not getting his preference. But the latter is a fact about the mother’s happiness, not the child’s.
Let’s add memory loss to the scenario, so she doesn’t remember making the decision.
Yes, and GiveWell is very clear about this, and most donors bite the bullet (people make irrational decisions with regard to small risks of death, and bednets also have positive externalities for the rest of the community). Do you bite the bullet that says “the mother doesn’t know enough about her own happiness; she’d be happier with therapy than with a living child”?
Finally, I do hope you’ll answer regarding whether you have children. Thanks again.
I’m not Joel (nor do I work for HLI, GiveWell, SM, or any similar organization). I do have a child, though. And I do have concerns with overemphasis on whether one is a parent, especially when one’s views are based (in at least significant part) on review of the relevant academic literature. Otherwise, does one need both to be a parent and to have experienced a severe depressive episode (particularly in a low-resource context where there is likely no safety net) in order to judge the tradeoffs between supporting AMF and supporting SM?
Personally—I am skeptical that the positive effect of therapy exceeds the negative effect of losing one’s young child on a parent’s own well-being. I just don’t think the thought experiment you proposed is a good way to cross-check the plausibility of such a view. The consideration of the welfare of one’s child (independent of one’s own welfare) in making decisions is just too deeply rooted for me to think we can effectively excise it in a thought experiment.
In any event—given that SM can deliver many courses of therapy with the resources AMF needs to save one child, the two figures don’t need to be close if one believes the only benefit from AMF is the prevention of parental grief. SM’s effect size would only need to be greater than 1/X of the WELLBYs lost to parental grief from one child death, where X is the number of courses SM can deliver with the resources AMF needs to prevent one child death. That is the bullet that epicurean donors have to bite to choose SM over AMF.
Sorry for confusing you for Joel!
It’s good to hear you say this.
Definitely true. But if a source (like a specific person or survey) gives me absurd numbers, it is a reason to dismiss it entirely. For example, if my thermometer tells me it’s 1000 degrees in my house, I’m going to throw it out. I’m not going to say “even if you merely believe it’s 90 degrees we should turn on the AC”. The exaggerated claim is disqualifying; it decreases the evidentiary value of the thermometer’s reading to zero.
When someone tells me that group therapy is more beneficial to the mother’s happiness than saving her child from death, I don’t need to listen to that person anymore. And if it’s a survey that tells me this, throw out the survey. If it’s some fancy academic methods and RCTs, the interesting question is where they went wrong, and someone should definitely investigate that, but at no point should people take it seriously.
By all means, let’s investigate how the thermometer possibly gave a reading of 1000 degrees. But until we diagnose the issue, it is NOT a good idea to use “1000 degrees in the house” in any decision-making process. Anyone who uses “it’s 1000 degrees in this room” as a placeholder value for making EA decisions is, in my view, someone who should never be trusted with any levers of power, as they cannot spot obvious errors that are staring them in the face.
We both think the ratio of parental grief WELLBYs to therapy WELLBYs is likely off, although that doesn’t tell us which number is wrong. Given that your argument is that an implausible ratio should tip HLI off that there’s a problem, the analysis below takes the view more favorable to HLI—that the parental grief number (for which much less work has been done) is at least the major cause of the ratio being off.
As I see it, the number of WELLBYs preserved by averting an episode of parental grief is very unlikely to be material to any decision under HLI’s cost-effectiveness model. Under philosophical assumptions where it is a major contributor to the cost-effectiveness estimate, that estimate is almost always going to be low enough that life-saving interventions won’t be considered cost-effective on the whole. Under philosophical assumptions where life-saving programs may be cost-effective, the bulk of the effectiveness will come directly from the effect on the saved life itself. Thus, it would not be unreasonable for HLI—which faces significant resource constraints—to have deprioritized attempts to improve the accuracy of its estimate for WELLBYs preserved by averting an episode of parental grief.
Given that, I can see three ways of dealing with parental grief in the cost-effectiveness model for AMF. Ignoring it seems rather problematic. And I would argue that reporting the value one’s relatively shallow research provided (with a disclaimer that one has low certainty in the value) is often more epistemically virtuous than adjusting to some value one thinks is more likely to be correct for intuitive reasons, bereft of actual evidence to support that number. I guess the other way is to just not publish anything until one can turn in more precise models . . . but that norm would make it much more difficult to bring new and innovative ideas to the table.
I don’t think the thermometer analogy really holds here. Assuming HLI got a significantly wrong value for WELLBYs preserved by averting an episode of parental grief, there are a number of plausible explanations, the bulk of which would not justify not “listen[ing] to [them] anymore.” The relevant literature on grief could be poor quality or underdeveloped; HLI could have missed important data or modeled inadequately due to the resources it could afford to spend on the question; it could have made a technical error; its methodology could be ill-suited for studying parental grief; its methodology could be globally unsound; and doubtless other reasons. In other words, I wouldn’t pay attention to the specific thermometer that said it was much hotter than it was . . . but in most cases I would only update weakly against using other thermometers by the same manufacturer (charity evaluator), or distrusting thermometer technology in general (the WELLBY analysis).
Moreover, I suspect there have been, and will continue to be, malfunctioning thermometers at most of the major charity evaluators and major grantmakers. The grief figure is a non-critical value relating to an intervention that HLI isn’t recommending. For the most part, if an evaluator or grantmaker isn’t recommending or funding an organization, it isn’t going to release its cost-effectiveness model for that organization at all. Even where funding is recommended, there often isn’t the level of reasoning transparency that HLI provides. If we are going to derecognize people who have used malfunctioning thermometer values in any cost-effectiveness analysis, there may not be many people left to perform them.
I’ve criticized HLI on several occasions before, and I’m likely to find reasons to criticize it again at some point. But I think we want to encourage its willingness to release less-refined models for public scrutiny (as long as the limitations are appropriately acknowledged) and its commitment to reasoning transparency more generally. I am skeptical of any argument that would significantly incentivize organizations to keep their analyses close to the chest.
I disagree with you on several points.
The most important thing to note here is that, if you dig through the various long reports, the tradeoff is:
With $7800 you can save the life of a child, or
If you grant HLI’s assumptions regarding costs (and I’m a bit skeptical even there), you can give a multi-week group therapy to 60 people for that same cost (I think 12 sessions of 90 min).
Which is better? Well, right off the bat, if you think mothers would value their children at 60x what they value the therapy sessions, you’ve already lost.
Of course, the child’s life also matters, not just the mother’s happiness. But HLI has a range of “assumptions” regarding how good a life is, and in many of these assumptions the life of the child is indeed fairly value-less compared to benefits in the welfare of the mother (because life is suffering and death is OK, basically).
All this is obfuscated under various levels of analysis. Moreover, in HLI’s median assumption, not only is the therapy more effective, it is 5x more effective. They are saying: the number of group therapies that equal the averted death of a child is not 60, but rather, 12.
To me that’s broken-thermometer level.
I know the EA community is full of broken thermometers, and it’s actually one of the reasons I do not like the community. One of my main criticisms of EA is, indeed, “you’re taking absurd numbers (generated by authors motivated to push their own charities/goals) at face value”. This also happens with animal welfare: there’s this long report and 10-part forum series evaluating animals’ welfare ranges, and it concludes that 1 human has the welfare range of (checks notes) 14 bees. Then others take that at face value and act as if a couple of beehives or shrimp farms are as important as a human city.
This is not the first time I’ve had this argument made to me when I criticize an EA charity. It seems almost like the default fallback. I think EA has the opposite problem, however: nobody ever dares to say the emperor has no clothes, and everyone goes around pretending 1 human is worth 14 bees and a group therapy session increases welfare by more than the death of your child decreases it.
I think it is possible to buy that humans’ maximum pains and pleasures are only 14 times as intense as bees’, and still think 14 bees = 1 human is silly. You just have to reject hedonism about well-being. I have strong feelings about saving humans over animals, but I have no intuition whatsoever that if my parents’ dog burns her paw it hurts less than when I burn my hand. The whole idea that animals have less intense sensations than us seems to me less like a commonsense claim, and more like something people committed to both hedonism and antispeciesism made up to reconcile their intuitive repugnance at results like 10 pigs (or whatever) = 1 human. (Bees are kind of a special case, because lots of people are confident they aren’t conscious at all.)
Where’s the evidence that, e.g., everyone “act[s] as if a couple of beehives or shrimp farms are as important as a human city”? So someone wrote a speculative report about bee welfare ranges . . . if “everyone” accepted that “1 human is worth 14 bees”—or even anything close to that—the funding and staffing pictures in EA would look very, very different. How many EAs are working in bee welfare, and how much is being spent in that area?
As I understand the data, EA resources in GH&D are pretty overwhelmingly in life-saving interventions like AMF, suggesting that the bulk of EA does not agree with HLI at present. I’m not as well versed in farmed animal welfare, but I’m pretty sure no one in that field is fundraising for interventions costing anywhere remotely near hundreds of dollars to save a bee and claiming they are effective.
In the end, reasoning transparency by charity evaluators helps donors make informed moral choices. Carefully reading analyses from various sources helps me (and other donors) make choices that are consistent with our own values. EA is well ahead of most charitable movements in explicitly acknowledging that trade-offs exist and at least attempting to reason about them. One can (and should) decline to donate where a charity’s treatment of tradeoffs isn’t convincing. As I’ve stated elsewhere on this post, I’m sticking with GiveWell-style interventions at least for now.
Oh, I should definitely clarify: I find effective altruism the philosophy, as well as most effective altruists and their actions, to be very good and admirable. My gripe is with what I view as the “EA community”—primarily places like this forum, organizations such as the CEA, and participants in EA Global. The more central something is to EA-the-community, the less I like its ideas.
In my view, what happens is that there are a lot of EA-ish people donating to GiveWell charities, and that’s amazing. And then the EA movement comes and goes “but actually, you should really give the money to [something ineffective that’s also sometimes in the personal interest of the person speaking]” and some people get duped. So forums like this one serve to take money that would go to malaria nets, and try as hard as they can to redirect it to less effective charities.
So, to your questions: how many people are working towards bee welfare? Not many. But on this forum, it’s a common topic of discussion (often with things like nematodes instead of bees). I haven’t been to EA global, but I know where I’d place my bets for what receives attention there. Though honestly, both HLI and the animal welfare stuff is probably small potatoes compared to AI risk and meta-EA, two areas in which these dynamics play an even bigger role (and in which there are even more broken thermometers and conflicts of interest).
Do you think there’s a number you would accept for how many people treated with psychotherapy would be “worth” the death of one child?
Yes. There is a large range of such numbers. I am not sure of the right tradeoff. I would intuitively expect a billion therapy sessions to be an overestimate (i.e. clearly more valuable than the life of a child), but I didn’t do any calculations. A thousand seems like an underestimate, but again who knows (I didn’t do any calculations). HLI is claiming (checks notes) ~12.
To flip the question: Do you think there’s a number you would reject for how many people treated with psychotherapy would be worth the death of one child, even if some seemingly-fancy analysis based on survey data backed it up? Do you ever look at the results of an analysis and go “this must be wrong,” or is that just something the community refuses to do on principle?
Thank you for this detailed and transparent response!
I applaud HLI for creating a chart (and now an R Shiny App) to show how philosophical views can affect the tradeoff between predominately life-saving and predominately life-enhancing interventions. However, one challenge with that approach is that almost any changes to your CEA model will be outcome-changing for donors in some areas of that chart. [1]
For example, the 53% → 38% correction alone switched the recommendation for donors with a deprivationist framework who think the neutral point is over ~0.65 but under ~1.58. Given that GiveWell’s moral weights were significantly derived from donor preferences, and (0.5, deprivationism) is fairly implied by those donor weights, I think that correction shifted the recommendation from SM to AMF for a significant number of donors, even though it was only material to one of three philosophical approaches and about one point of neutral-point assumptions.
GiveWell reduced the WELLBY estimate from about 62 (based on the 38% figure) to about 17, a difference of about 45. If I’m simplifying your position correctly, for about half of those WELLBYs you disagree with GiveWell that an adjustment is appropriate. For about half of them, you believe a discount is likely appropriate, but think it is likely less than GiveWell modelled.
If we used GiveWell’s numbers for that half but HLI’s numbers otherwise, that split suggests that we’d end up with about 39.5 WELLBYs. So one way to turn your response into a donor-actionable statement would be to say that there is a zone of uncertainty between 39.5 and 62 WELLBYs. One might also guess that the heartland of that zone is between about 45 and 56.5 WELLBYs, reasoning that it is less likely that your discounts will be less than 25% or more than 75% of GiveWell’s.
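The zone-of-uncertainty arithmetic above can be checked directly (the 62 and 17 WELLBY figures and the half/half split come from the thread; the 25%–75% "heartland" bounds are the comment's own guess):

```python
hli, givewell = 62.0, 17.0    # WELLBYs per $1000 under each estimate
gap = hli - givewell          # 45: GiveWell's total discount
contested = gap / 2           # the ~half where HLI grants a discount is likely

# Bottom of the zone: apply GiveWell's full discount to the contested half only.
print(hli - contested)                   # 39.5

# "Heartland": HLI's eventual discount lands between 25% and 75% of GiveWell's.
print(round(hli - 0.75 * contested, 1))  # ~45.1, i.e. about 45
print(round(hli - 0.25 * contested, 1))  # ~56.4, i.e. about 56.5
```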
The bottom end of that zone of uncertainty (39.5) would pull the neutral point at which a deprivationist approach would conclude AMF = SM up to about 2.9. I suspect few people employing a deprivationist approach have the neutral point that high. AMF is also superior to SM on a decent number of TRIA-based approaches at 39.5 WELLBYs.
So it seems there are two reasonable approaches to donor advice under these kinds of circumstances:
One approach would encourage donors within a specified zone of uncertainty to hold their donations until HLI sufficiently updates its CEA for SM to identify a more appropriate WELLBY figure; or
The other approach would encourage donors to make their decision based on HLI’s best estimate of what the WELLBY figure will be in the next update of the CEA. Even if the first approach is correct, some donors will need to use this second approach for various reasons (e.g., tax reasons).
I don’t think reaffirming advice on the current model in the interim without any adjustments is warranted, unless you believe the adjustments will be minor enough such that a reasonable donor would likely not find them of substantive importance no matter where they are on the philosophical chart.[2]
In the GiveWell model, the top recommendation is to give to a regranting fund, and there isn’t any explicit ranking of the four top charities. So the recommendation is actually to defer the choice of specific charity to someone who has the most up-to-date information when the monies are actually donated to the effective charity. Moreover, all four top charities are effective in very similar ways. Thus, GiveWell’s bottom-line messaging to donors is much less sensitive to changes in the CEA for any given charity.
I am not sure how to define “minor.” I think whether the change flips the recommendation to the donor is certainly relevant, but wouldn’t go so far as to say that any change that flips the recommendation for a given donor’s philosophical assumptions would be automatically non-minor. On the other hand, I think a large enough change can be non-minor even if it doesn’t flip the recommendation on paper. Some donors apply discounts and bonuses not reflected in HLI’s model. For instance, one could reasonably apply a discount to SM when compared to better-studied interventions, on the basis that CEAs usually decrease as they become more complete. Or one could reasonably apply a bonus to SM because funding a smaller organization is more likely to have a positive effect on its future cost-effectiveness. Thus, just because the change is not outcome-determinative on HLI’s base model doesn’t mean it isn’t so on the donor’s application of the model. The time-to-update and amount of funds involved are also relevant. All that being said, my gut thinks that the starting point for determining minor vs. non-minor is somewhere in the neighborhood of 10%.
Jason,
You raise a fair point. One we’ve been discussing internally. Given the recent and expected adjustments to StrongMinds, it seems reasonable to update and clarify our position on AMF to say something like, “Under more views, AMF is better than or on par with StrongMinds. Note that currently, under our model, when AMF is better than StrongMinds, it isn’t wildly better.” Of course, while predicting how future research will pan out is tricky, we’d aim to be more specific.
(EDITED)
Is this (other than 53% being corrected to 38%) from the post accurate?
If so, a substantial discount seems reasonable to me. It’s plausible these studies also say almost nothing about the spillover, because of how unrepresentative they seem. Presumably much of the content of the therapy will be about the child, so we shouldn’t be surprised if it has much more impact on the child than general therapy for depression.
It’s not clear any specific number away from 0 could be justified.
I find nothing objectionable in that characterization. And if we only had these three studies to guide us, then I’d concede that a discount of some size seems warranted. But we also have (A) our priors and (B) some new evidence from Barker et al. Both point me away from very small spillovers, but again, I’m still very unsure. I think I’ll have clearer views once I’m done analyzing the Barker et al. results and have had someone, ideally Nathanial Barker, check my work.
[Edit: Michael edited to add: “It’s not clear any specific number away from 0 could be justified.”] Well not-zero certainly seems more justifiable than zero. Zero spillovers implies that emotional empathy doesn’t exist, which is an odd claim.
To clarify what I edited in, I mean that, without better evidence/argument, the effect could be arbitrarily small but still nonzero. What reason do we have to believe it’s at least 1%, say, other than very subjective priors?
I agree that analysis of new evidence should help.
I’d point to the literature on time-lagged correlations between household members’ emotional states, which I quickly summarised in the last installment of the household spillover discussion. I think it implies a household spillover of 20%. But I don’t know whether this type of data should over- or underestimate the spillover ratio relative to what we’d find in RCTs. I know I’m being really slippery about this, but the Barker et al. analysis so far makes me think it’s larger than that.
Regarding the question of what philosophical view should be used, I wonder if it would also matter if someone were something like prioritarian rather than a total utilitarian. StrongMinds looks to focus on people who suffer more than typical members of these countries’ populations, whilst the lives saved by AMF would presumably cover more of the whole distribution of wellbeing. So a prioritarian may favour StrongMinds more, assuming the people helped are not substantially better off economically or in other ways. (Though, it could perhaps also be argued that the people who would die without AMF’s intervention are extremely badly off pre-intervention.)