This is Alex Cohen, GiveWell senior researcher, responding from GiveWell’s EA Forum account.
Joel, Samuel and Michael — Thank you for the deep engagement on our deworming cost-effectiveness analysis.
We really appreciate you prodding us to think more about how to deal with any decay in benefits in our model, since it has the potential to meaningfully impact our funding recommendations.
We agree with HLI that there is some evidence for benefits of deworming declining over time and that this is an issue we haven’t given enough weight to in our analysis.
We’re extremely grateful to HLI for bringing this to our attention and think it will allow us to make better decisions on recommending funding to deworming going forward.
We would like to encourage more of this type of engagement with our research. We’re planning to announce prizes for criticism of our work in the future. When we do, we plan to give a retroactive prize to HLI.
We’re planning to do additional work to incorporate this feedback into an updated deworming cost-effectiveness estimate. In the meantime, we wanted to share our initial thoughts. At a high level:
We agree with HLI that there is some evidence for benefits of deworming declining over time and that this is an issue we haven’t given enough weight to in our analysis. We don’t totally agree with HLI on how to incorporate decay in our cost-effectiveness model and think HLI is making a mistake that leads it to overstate the decline in cost-effectiveness from incorporating decay. However, we still guess incorporating decay more in our model could meaningfully change our estimated cost-effectiveness of deworming. We plan to conduct additional research and publish updated estimates soon.
Once we do this work, our best guess is that we will reduce our estimate of the cost-effectiveness of deworming by 10%-30%. Had we made this change in 2019 when KLPS-4 was released, we would have recommended $2-$8m less in grants to deworming (out of $55m total) since 2019.
We also agree that we should do more to improve the transparency of our cost-effectiveness estimates. We plan to make key assumptions and judgment calls underlying our deworming cost-effectiveness estimate clearer on our website.
Should we adjust the effect of deworming down to account for decay in benefits?
We agree with HLI that there is some evidence in the Kenya Life Panel Survey (KLPS) data for benefits declining over time. We haven’t explicitly made an adjustment in the cost-effectiveness analysis for the possibility that the effects decline over time.
Where I think we disagree is on how to incorporate that potential decay in benefits in our model. While incorporating decay into our model will most likely reduce cost-effectiveness overall, my guess is that HLI’s approach overstates the decline in cost-effectiveness from incorporating decay for three reasons. We plan to explore these further in our updated cost-effectiveness estimate for deworming, but I’ll summarize them quickly here.
First, if we were to model decay based on the KLPS data, we would likely use a higher starting point. We think HLI may have made an error in interpreting the data here.
To estimate decay, HLI’s model begins with GiveWell’s current estimate of the benefits of deworming — which is based on the average of results from the 10-, 15-, and 20-year follow-ups (KLPS 2, KLPS 3, and KLPS 4, respectively) — then assumes effects decline from that value. However, if we believe that there are declining effects from KLPS 2 to KLPS 3 to KLPS 4, this should imply above average effects in initial years (as found in KLPS 2 and KLPS 3) and then below average effects in later years (as found in KLPS 4).
Specifically, in its decay model, HLI uses a starting value of 0.006 units of ln(consumption). This is our current estimate of “Benefit of one year’s income (discounted back because of delay between deworming and working for income),” which is based on averaging across results from KLPS 2, KLPS 3, and KLPS 4. If we were to incorporate this decay model, we would most likely choose a value above 0.006 for initial years that declines below 0.006 in later years. For example, if we used the effect size from KLPS 2 and applied the same adjustments we currently do, this value would be 0.012 in the first year of impacts. We expect this to substantially increase cost-effectiveness compared to HLI’s model.
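To make the starting-point issue concrete, here is a minimal sketch of a discounted benefit stream under three parameterizations. The 12% annual decay rate, 4% discount rate, and 40-year horizon are illustrative assumptions, not parameters from either HLI’s or GiveWell’s model; only the 0.006 and 0.012 starting values come from the discussion above.

```python
# Illustrative sketch (not GiveWell's actual model) of why the starting
# value matters when modeling decay. The 12% decay rate, 4% discount
# rate, and 40-year horizon are assumptions for illustration only.

DISCOUNT_RATE = 0.04  # assumed annual discount rate
DECAY_RATE = 0.12     # assumed annual decay in the treatment effect
YEARS = 40            # assumed duration of benefits

def total_discounted_benefit(start_value: float, decay_rate: float) -> float:
    """Sum of discounted annual benefits that decay exponentially."""
    return sum(
        start_value * (1 - decay_rate) ** t / (1 + DISCOUNT_RATE) ** t
        for t in range(YEARS)
    )

constant   = total_discounted_benefit(0.006, 0.0)         # current model: flat pooled effect
hli_decay  = total_discounted_benefit(0.006, DECAY_RATE)  # decay anchored at the pooled average
high_start = total_discounted_benefit(0.012, DECAY_RATE)  # decay anchored at a KLPS 2-like level

print(f"constant effect:          {constant:.4f}")
print(f"decay from pooled value:  {hli_decay:.4f}")
print(f"decay from higher start:  {high_start:.4f}")
```

Under these placeholder assumptions, anchoring the decay curve at the KLPS 2-like value roughly doubles the total benefit relative to anchoring it at the pooled average, though it still falls short of the constant-effect model.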
Second, we guess that we would not put full weight on the decay model, since it seems like there’s a decent chance the observed decline is due to chance or lack of robustness to different specifications of the effects over time.
The three rounds of KLPS data provide three estimates for the effect of deworming at 10-, 15- and 20-year follow-up, which seem to show a decline in effect (based on percentage increase in income and consumption) over time. We could interpret this as either a decline in effect over time, as recommended by HLI, or as three noisy estimates of a constant effect over time, which is the interpretation in our current cost-effectiveness analysis. We’re unsure how much credence to put on each of these interpretations going forward, but it’s unlikely that we will decide to put all of our credence on “decline in effect over time.”
When we look at evidence like this, we typically favor pooled results when there is no a priori reason to believe effects differ over time, across geography, etc. (e.g., a meta-analysis of RCTs for a malaria prevention program) because this increases the precision and robustness of the effect measurement. In cases where there’s more reason to believe the effects vary across time or geography, we’re more likely to focus on “sub-group” results, rather than pooled effects. We acknowledge this is often a subjective assessment.
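For readers unfamiliar with pooling, a minimal sketch of the standard fixed-effect inverse-variance method is below. The estimates and standard errors are placeholders, not the actual KLPS round-by-round results.

```python
# Fixed-effect inverse-variance pooling: the standard way to combine
# several noisy estimates of a presumed-common effect. The numbers
# below are hypothetical, not the actual KLPS results.

estimates  = [0.15, 0.10, 0.05]  # hypothetical per-round effect estimates
std_errors = [0.06, 0.05, 0.04]  # hypothetical standard errors

weights = [1 / se ** 2 for se in std_errors]  # more precise rounds get more weight
pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(f"pooled estimate: {pooled:.3f} (SE {pooled_se:.3f})")
```

Pooling buys precision, but only under the assumption that the rounds estimate the same underlying effect; if the effect truly declines over time, the pooled value averages over that decline.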
In the deworming case, there are some reasons to put weight on the decay story. First, the point estimates we have from KLPS 2, KLPS 3, and KLPS 4, in terms of impact on ln income and consumption, tend toward a decline over time. Second, there are plausible stories for why effects would decline. For example, it’s possible individuals in the control group are catching up to individuals who were dewormed due to broader trends in the economy. This is speculative, however, and we haven’t looked into drivers of changes over time.
However, we also think there are reasons to put weight on the “noisy effects” story, which is why our current cost-effectiveness analysis uses a pooled estimate as our best guess of effects over time. First, the evidence for decline comes from three imprecise estimates of income and two imprecise estimates of consumption, all with overlapping confidence intervals. And comparing effect sizes across rounds and measures is not straightforward. For example, the small-sample KLPS 3 consumption results, if taken literally, implied at least a doubling of deworming’s cost-effectiveness relative to GiveWell’s historical model (in part due to idiosyncrasies of effects being measured at the per-capita level in a household, rather than merely for a single individual who was dewormed), which is why we funded a larger consumption survey in KLPS 4 and expected to see a much smaller effect in a larger sample.[1] Factors like this give us reason to believe that some of the observed decline in particular measures is due to chance or measurement error. In this case, we would expect the average pooled effect, factoring in multiple types of measures, to be the best predictor of what we’ll find in future KLPS rounds.
Second, it seems plausible that effects would be constant over time (or could even compound over time). For example, adults who were dewormed as children and see greater cognitive or educational gains may be less likely to enter sectors like agriculture, which may have flatter earnings trajectories, or more likely to move to cities, where opportunities for wage growth are higher. However, these stories are also speculative.
As a result, even if we did incorporate the decay model, we would put less than 100% weight on it. We’d like to look further into studies of interventions where the mechanism (improving child development) is more plausibly similar to deworming to see if this provides additional information, as well as any evidence of mechanisms for deworming specifically that would point toward decline in effects. We’d also like to further explore how to interpret the round-by-round estimates, since many factors change between rounds (such as rates of labor force participation and methods of earnings measurement) and we would like to better understand how to predict future changes in control group standards of living when taking all of this into account.
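As a rough illustration of what putting less than 100% weight on the decay model could look like in a cost-effectiveness analysis, here is a hypothetical mixture calculation. The 50% credence, decay rate, discount rate, and horizon are all arbitrary placeholders, not values GiveWell plans to use.

```python
# Hypothetical mixture: compute the discounted benefit stream under the
# constant-effect and decay interpretations, then blend by credence.
# The 50% weight, 12% decay, 4% discount, and 40-year horizon are all
# illustrative assumptions.

def discounted_stream(start: float, decay: float,
                      years: int = 40, discount: float = 0.04) -> float:
    return sum(start * (1 - decay) ** t / (1 + discount) ** t
               for t in range(years))

constant_model = discounted_stream(start=0.006, decay=0.0)
decay_model = discounted_stream(start=0.012, decay=0.12)  # higher start, then decay

w_decay = 0.5  # hypothetical credence in the decay interpretation
blended = w_decay * decay_model + (1 - w_decay) * constant_model

print(f"blended benefit: {blended:.4f} "
      f"({blended / constant_model:.0%} of the constant-effect value)")
```

Under these placeholder numbers the blend lands roughly 18% below the constant-effect model, which happens to fall in the 10%-30% range guessed below, but that is an artifact of the arbitrary inputs rather than a prediction.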
Third, we would likely update our “replicability adjustment” for deworming, based on these results.
This is noted in HLI’s post. HLI objects, though: “The more puzzling concept is the idea that, if you realise you should change one bit of your analysis, you would be justified to arbitrarily alter another, unrelated and non-specified, part of it to ensure you retain a ‘plausible result’.”
Our approach to the replicability adjustment is to do an informal Bayesian update. We have a prior that is more skeptical of the impact of deworming than seen in the data from KLPS. We incorporate that prior into our estimate of the effect size through the replicability adjustment. The larger the effect size estimated by the KLPS data, the greater the gap between that result and our prior, and the larger the adjustment needed to incorporate our prior.[2] As a result of this adjustment, we input a smaller effect of ln income in our cost-effectiveness analysis than we would if we took the data from KLPS at face value.
In our current cost-effectiveness analysis, we adjust the pooled effect of 0.109 in ln income downward by 87%. This reflects our prior belief that we should expect a much lower effect of deworming on later-life income. If we thought the pooled effect was lower than 0.109, we would likely apply a less strict adjustment. We plan to specify how we update this adjustment in our follow-up work.
Our current best guess is that incorporating decay into our cost-effectiveness estimates would reduce the cost-effectiveness of deworming charities by 10%-30%. This adjustment would have led us to recommend $2-$8 million less (out of $55 million total) in grants to deworming since late 2019, when the most recent deworming study results were released.
We plan to do some additional research to refine our estimates and share an updated cost-effectiveness analysis soon.
Where we’d like to improve on reasoning transparency
We also agree with HLI that we have room for improvement on explaining our cost-effectiveness models. The decision about how to model whether benefits decline is an example of that—the reasoning I outlined above isn’t on our website. We only wrote, “the KLPS 4 results are smaller in magnitude (on a percentage increase basis) and higher variance than earlier survey rounds.”
We plan to update our website to make it clearer what key judgment calls are driving our cost-effectiveness estimates, why we’ve chosen specific parameters or made key assumptions, and how we’ve prioritized research questions that could potentially change our bottom line.
Encouraging more feedback on our research
We’re extremely grateful to HLI for taking the time to dig into our work and provide feedback. We think this type of engagement improves the quality of our research and our grant recommendations, which helps us allocate resources more cost-effectively, and we’d like to encourage more of it.
In the near future, we plan to announce prizes to individuals and organizations who identify issues in our cost-effectiveness analyses that are likely to lead to meaningful changes in our decisions.
As part of that contest, we also plan to retroactively recommend a prize to HLI (details TBD). We believe HLI’s feedback is likely to change some of our funding recommendations, at least marginally, and perhaps more importantly improve our decision-making across multiple interventions.
See the “internal forecasts” we published, which roughly predict a 25% chance of large consumption results (similar to KLPS-3) that would have doubled our estimate of deworming’s cost-effectiveness. (Put differently, we predicted a 75% chance of not updating to the KLPS-3-like magnitude of results after seeing KLPS-4’s larger survey.)
To see how the math on this works, see this tool, which we used in generating our replicability adjustments. If you input a prior with mean = 0.5 and standard deviation = 0.25, and see evidence with mean = 10 and standard deviation = 1, the posterior effect estimate is ~1.1, for a “replicability adjustment” relative to the evidence of 1.1/10 = ~11%. However, if the evidence shows a smaller effect closer to the prior (mean = 5, sd = 1), the estimated posterior effect is ~0.8, with a replicability adjustment of 0.8/5 = ~16%. So, the overall estimated posterior effect falls when the evidence shows a lower effect estimate (from ~1.1 to ~0.8), but the skeptical Bayesian “replicability adjustment” is slightly less extreme in the second case (an 84% discount instead of an 89% discount). This is what we mean when we say that the replicability adjustment must be updated in conjunction with the estimated effect size, and this is what we have done historically.
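For concreteness, the footnote’s arithmetic matches a standard normal-normal conjugate update, sketched below. (We assume the linked tool implements this textbook formula; its internals may differ.)

```python
# Normal-normal conjugate update: the posterior mean is a
# precision-weighted average of the prior mean and the evidence.

def posterior_mean(prior_mean: float, prior_sd: float,
                   evidence_mean: float, evidence_sd: float) -> float:
    w_prior = 1 / prior_sd ** 2
    w_evidence = 1 / evidence_sd ** 2
    return (w_prior * prior_mean + w_evidence * evidence_mean) / (w_prior + w_evidence)

# Evidence far from the prior: a large discount.
post_far = posterior_mean(0.5, 0.25, 10, 1)
print(post_far, post_far / 10)   # ~1.06 -> "replicability adjustment" of ~11%

# Evidence closer to the prior: a smaller posterior, but a milder discount.
post_near = posterior_mean(0.5, 0.25, 5, 1)
print(post_near, post_near / 5)  # ~0.76 -> adjustment of ~15% (the footnote rounds to ~16%)
```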
We’d like to express our sincere thanks to GiveWell for providing such a detailed and generous response. We are delighted that our work may lead to substantive changes, and echoing GiveWell, we encourage others to critique HLI’s work with the same level of rigour.
In response to the substantive points raised by Alex:
Using a different starting value: Our post does not present a strong argument for how exactly to include the decay. Instead, we aimed to do the closest ‘apples-to-apples’ comparison possible using the same values that GiveWell uses in their original analysis. Our main point was that including decay makes a difference, and we are encouraged to see that GiveWell will consider incorporating this into their analysis.
We don’t have a strong view on the best way to incorporate decay in the CEA. However, we intend to develop and defend our views about how the benefits change over time as we finalise our analysis of deworming in terms of subjective wellbeing.
How to weigh the decay model: We agree with Alex’s proposal to put some weight on the effects being constant. Again, we haven’t formed a strong view on how to do this yet and recognise the challenges that GiveWell faces in doing so. We look forward to seeing more of GiveWell’s thinking on this.
Improving reasoning transparency: We strongly support the plans quoted below and look forward to reading future publications that clearly lay out the importance of key judgements and assumptions.
“We plan to update our website to make it clearer what key judgment calls are driving our cost-effectiveness estimates, why we’ve chosen specific parameters or made key assumptions, and how we’ve prioritized research questions that could potentially change our bottom line.”
In retrospect, I think my reply didn’t do enough to acknowledge that (a) using a different starting value seems reasonable and (b) this would lead to a much smaller change in cost-effectiveness for deworming. While very belated, I’m updating the post to note this for posterity.