I think one consideration worth surfacing is that GiveWell explicitly notes that

There are many limitations to cost-effectiveness estimates, and we do not assess charities only—or primarily—based on their estimated cost-effectiveness.
...
Because of the many limitations of cost-effectiveness estimates, we give estimated cost-effectiveness only limited weight in recommending charities. Confidence in an organization’s track record or the strength of the evidence for an intervention generally carries heavier weight when differences in estimated cost-effectiveness are not large.
(That page goes into more detail as to why.)
The philosophical underpinning behind this (if you want to call it that) is in Holden Karnofsky’s Sequence thinking vs cluster thinking essay—in short, GW is more cluster-style, while the Happier Lives Institute strikes me as more sequence-style (correct me if I’m wrong). Holden:
Our approach to making such comparisons strikes some as highly counterintuitive, and noticeably different from that of other “prioritization” projects such as Copenhagen Consensus. Rather than focusing on a single metric that all “good accomplished” can be converted into (an approach that has obvious advantages when one’s goal is to maximize), we tend to rate options based on a variety of criteria using something somewhat closer to (while distinct from) a “1=poor, 5=excellent” scale, and prioritize options that score well on multiple criteria. (For example, see our most recent top charities comparison.)
We often take approaches that effectively limit the weight carried by any one criterion, even though, in theory, strong enough performance on an important enough dimension ought to be able to offset any amount of weakness on other dimensions.
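A minimal sketch of that bounded, multi-criteria style (the criteria, charities, and scores are all invented for illustration, not GiveWell's actual rubric): because each score is capped, no single criterion can offset unlimited weakness elsewhere.

```python
# Toy contrast of the aggregation style described above; the criteria,
# charities, and scores are invented, not GiveWell's actual rubric.
# Because each score is capped at 5, no single criterion can offset
# unlimited weakness elsewhere.

options = {
    #            (evidence, cost_effectiveness, room_for_funding) on 1-5
    "charity_A": (5, 3, 4),   # solid across the board
    "charity_B": (2, 5, 2),   # spikes on a single criterion
}

for name, scores in options.items():
    bounded_score = sum(scores) / len(scores)  # bounded average
    print(f"{name}: {bounded_score:.2f}")
# charity_A: 4.00, charity_B: 3.00 -- broad strength wins, which is the
# "score well on multiple criteria" behavior the quote describes.
```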
And then further down:
I think the cost-effectiveness analysis we’ve done of top charities has probably added more value in terms of “causing us to reflect on our views, clarify our views and debate our views, thereby highlighting new key questions” than in terms of “marking some top charities as more cost-effective than others.”
Elsewhere, Holden has also written that

While some people feel that GiveWell puts too much emphasis on the measurable and quantifiable, there are others who go further than we do in quantification, and justify their giving (or other) decisions based on fully explicit expected-value formulas. The latter group tends to critique us – or at least disagree with us – based on our preference for strong evidence over high apparent “expected value,” and based on the heavy role of non-formalized intuition in our decisionmaking. This post is directed at the latter group.
We believe that people in this group are often making a fundamental mistake, one that we have long had intuitive objections to but have recently developed a more formal (though still fairly rough) critique of. The mistake (we believe) is estimating the “expected value” of a donation (or other action) based solely on a fully explicit, quantified formula, many of whose inputs are guesses or very rough estimates. We believe that any estimate along these lines needs to be adjusted using a “Bayesian prior”; that this adjustment can rarely be made (reasonably) using an explicit, formal calculation; and that most attempts to do the latter, even when they seem to be making very conservative downward adjustments to the expected value of an opportunity, are not making nearly large enough downward adjustments to be consistent with the proper Bayesian approach.
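The adjustment Holden describes can be made concrete with the standard normal-normal Bayesian update. This is only an illustrative sketch with invented numbers, not GiveWell's actual procedure:

```python
# Illustrative sketch of the Bayesian adjustment described above, using the
# standard normal-normal conjugate update. All numbers are invented; this is
# not GiveWell's actual procedure.

def bayesian_adjusted_ev(estimate, estimate_sd, prior_mean, prior_sd):
    """Posterior mean: a precision-weighted average of prior and estimate."""
    w_prior = 1 / prior_sd ** 2     # precision of the prior
    w_est = 1 / estimate_sd ** 2    # precision of the explicit estimate
    return (w_prior * prior_mean + w_est * estimate) / (w_prior + w_est)

# A rough formula says this opportunity is 100x as good as a typical one,
# but the estimate is extremely noisy (sd = 50); the prior says most
# opportunities cluster near 1x (sd = 2).
print(bayesian_adjusted_ev(estimate=100, estimate_sd=50, prior_mean=1, prior_sd=2))
# ~1.16: the noisier the estimate, the harder it is shrunk toward the prior,
# which is why "conservative" downward tweaks to a formula's inputs are
# often still not conservative enough.
```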
If I’m right that HLI leans more towards sequence than cluster-style thinking, then you can interpret this passage as Holden directly addressing HLI (in the future).
This comment is getting too long already, so I’ll just add one more Holden quote, from Some considerations against more investment in cost-effectiveness estimates:

When we started GiveWell, we were very interested in cost-effectiveness estimates: calculations aiming to determine, for example, the “cost per life saved” or “cost per DALY saved” of a charity or program. Over time, we’ve found ourselves putting less weight on these calculations, because we’ve been finding that these estimates tend to be extremely rough (and in some cases badly flawed).
...we are arguing that focusing on directly estimating cost-effectiveness is not the best way to maximize cost-effectiveness. We believe there are alternative ways of maximizing cost-effectiveness – in particular, making limited use of cost-effectiveness estimates while focusing on finding high-quality evidence (an approach we have argued for previously and will likely flesh out further in a future post).
In a nutshell, we argue that the best currently available cost-effectiveness estimates – despite having extremely strong teams and funding behind them – have the problematic combination of being extremely simplified (ignoring important but difficult-to-quantify factors), extremely sensitive (small changes in assumptions can lead to huge changes in the figures), and not reality-checked (large flaws can persist unchecked – and unnoticed – for years). We believe it is conceptually difficult to improve on all three of these at once: improving on the first two is likely to require substantially greater complexity, which in turn will worsen the ability of outsiders to understand and reality-check estimates. Given the level of resources that have been invested in creating the problematic estimates we see now, we’re not sure that really reliable estimates can be created using reasonable resources – or, perhaps, at all.
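The “extremely sensitive” point is easy to demonstrate with a toy cost-per-DALY model; every parameter below is a hypothetical stand-in:

```python
# Toy demonstration of the sensitivity problem: a simplified cost-per-DALY
# model where nudging single inputs within plausible-looking ranges swings
# the bottom line severalfold. Every number here is invented for the sketch.

def cost_per_daly(cost_per_treatment, effect_size, persistence_years, coverage):
    dalys_per_treatment = effect_size * persistence_years * coverage
    return cost_per_treatment / dalys_per_treatment

base = dict(cost_per_treatment=1.0, effect_size=0.02,
            persistence_years=10, coverage=0.8)
print(f"baseline: ${cost_per_daly(**base):.2f} per DALY")  # $6.25

# Vary one input at a time over a hypothetical plausible range.
ranges = {"effect_size": (0.005, 0.04),
          "persistence_years": (2, 20),
          "coverage": (0.5, 1.0)}
for name, (lo, hi) in ranges.items():
    worst = cost_per_daly(**{**base, name: lo})
    best = cost_per_daly(**{**base, name: hi})
    print(f"{name}: ${worst:.2f} to ${best:.2f} per DALY")
# effect_size alone moves the figure from $25.00 to $3.12 -- an 8x swing
# from a single assumption.
```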
Note that these are very old posts, last updated in 2016; it’s entirely possible that GiveWell has changed their stance on them, but that isn’t my impression (correct me if I’m wrong).
Side note on deworming in particular: in David Roodman’s writeups on the GW blog, which you linked to in the main post, GW’s “total discount” is actually something like 99%, because it’s a product of multiple adjustments, of which replicability is just one. However, I couldn’t find any direct reference to this 99% discount in the actual deworming CEAs, so I don’t know whether it’s actually applied.
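For intuition on how a roughly 99% total discount can emerge from adjustments that each look moderate on their own, here is a trivial worked product; the factor names and values are hypothetical stand-ins, not GiveWell's actual deworming figures:

```python
# A trivial worked product showing how a ~99% total discount can arise from
# adjustments that each look moderate on their own. Factor names and values
# are hypothetical stand-ins, not GiveWell's actual deworming figures.
import math

adjustments = {
    "replicability": 0.13,      # fraction of the measured effect retained
    "other_adjustments": 0.08,  # external validity, wastage, etc., combined
}
retained = math.prod(adjustments.values())
print(f"retained effect: {retained:.2%}")     # 1.04%
print(f"total discount: {1 - retained:.2%}")  # 98.96%, i.e. "something like 99%"
```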
I think the main post’s recommendations are great—having spent hours poring over GW’s CEAs as inspiration for my own local charity assessment work, the remark above that “these [CEAs] are hard to follow unless you already know what’s going on” keenly resonated with me—but given GW’s stance above on CEAs and the fact that they only have a single person updating their model, I’m not sure it’ll be all that highly prioritized?
I am sympathetic to all of these comments, but I think deworming is a uniquely bad intervention on which to apply strong priors/reasoning over evidence. It has been a highly controversial subject with strong arguments on both sides, and even the most positive studies of deworming are silent on the mechanisms for large long-term effects from a brief intervention. Even consulting experts is no guarantee of good judgment on a topic that is so polarizing in the expert community. I’m really curious how they arrived at this strong prior.
Ah, I was waiting for someone to bring these up!

On cluster vs sequence thinking, I guess I don’t really understand what the important distinction here is supposed to be. Sometimes you need to put various assumptions together to reach a conclusion—cost-effectiveness analysis is a salient example. However, for each specific premise, you could think about different pieces of information that would change your view on it. Aren’t these just the sequence and cluster bits, respectively? Okay, so you need to do both. Hence, if someone were to say ‘that’s wrong—you’re using sequence thinking’, I think the correct response is to look at them blankly and say ‘um, okay… so what exactly are you saying the problem is?’
On cost-effectiveness, I’m going to assume that this is what GiveWell (and others) should be optimising for. And if they aren’t optimising for cost-effectiveness then, well, what are they trying to do? I can’t see any statement of what they are aiming for instead.
Also, I don’t understand why trying to maximise cost-effectiveness would fail to maximise it. Of course, you shouldn’t be naive about cost-effectiveness, just like you probably shouldn’t be naive in general.
I appreciate that putting numbers on things can sometimes feel like false precision. But that’s a reason to use confidence intervals. (Also, as the saying goes, “if it’s worth doing, it’s worth doing with made up numbers”). Clearly, GiveWell do need to do cost-effectiveness assessments, even if just informally and in their heads, to decide what their recommendations are. But the part that’s just as crucial as sharing the numbers is explaining the reasons and evidence for your decision so people can check them and see if they agree. The point of this post is to highlight an important part of the analysis that was missing.
Thanks for collecting those quotes here. Because of some of what you quoted, I was confused for a while as to how much weight they actually put on their cost-effectiveness estimates. Elie’s appearance on Spencer Greenberg’s Clearer Thinking Podcast should be the most recent view on the issue.
In my experience, GiveWell is one of the few institutions that’s trying to make decisions based on cost-effectiveness analyses and trying to do that in a consistent and principled way. GiveWell’s cost-effectiveness estimates are not the only input into our decisions to fund programs, there are some other factors, but they’re certainly 80% plus of the case. I think we’re relatively unique in that way.

(Time at the start of the quote: 29:14).
I think the quote is reasonably clear in its argument: maximizing cost-effectiveness through explicit EV calculation is not robust to uncertainty in our estimates. More formally, if our distribution of estimates is misspecified, then incorporating strength of evidence as a factor beyond the explicit EV calculation helps limit how much weight we place on any (potentially misspecified) estimate. This is Knightian uncertainty, and optimal decisions under Knightian uncertainty place more weight on factors with less risk of misspecification (i.e., stronger evidence).
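A minimal sketch of that decision rule, with invented numbers: rank options by their worst case across a set of plausible models (maximin) rather than by naive EV, and the strongly evidenced option wins even though its naive EV is lower.

```python
# Sketch of the point above, with invented numbers: under Knightian
# uncertainty we don't trust any single distribution of estimates, so we
# rank options by their worst case across a set of plausible models
# (maximin) instead of by naive expected value.

options = {
    # one EV estimate per plausible model of the intervention
    "strong_evidence": [8, 9, 10, 11],    # models agree: narrow set
    "weak_evidence":   [0.5, 2, 15, 40],  # models disagree: wide set
}

for name, evs in options.items():
    naive_ev = sum(evs) / len(evs)
    worst_case = min(evs)
    print(f"{name}: naive EV = {naive_ev:.1f}, worst case = {worst_case:.1f}")

# Naive EV prefers weak_evidence (14.4 vs 9.5); maximin prefers
# strong_evidence (8.0 vs 0.5), limiting exposure to misspecification.
```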
You say that a “cluster bit” where you think about where evidence is coming from can account for this. I don’t think that’s true. Ultimately, your uncertainty will be irrelevant in determining the final Fermi estimate. Saying that you can “think about” sources of uncertainty doesn’t matter if that thinking doesn’t cash out into a decision criterion!
For example, if you estimate an important quantity as q = 1 with a confidence band of (-99, 101), that will give you the same cost-effectiveness estimate as if q had the confidence band of (0, 2). Even though the latter case is much more robust, you don’t have any way to minimize the effect of uncertainty in the former case. You do have the ability to place confidence bands around your cost-effectiveness estimate, but in every instance I’ve seen, confidence bands are pure lip service and the point estimate is the sole decision criterion. I do not see a confidence band in your estimate (sorry if I missed it) so that doesn’t seem like the most robust defense?
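To make this concrete, here is a crude Monte Carlo version of the q example; the model structure (benefit per dollar = q * 5 / 100, uniform draws over the band) is a made-up stand-in:

```python
# The q example made concrete: q = 1 with band (-99, 101) vs q = 1 with
# band (0, 2). The point estimates match, but propagating the bands by
# Monte Carlo shows how different the two cases are. The model structure
# (benefit per dollar = q * 5 / 100, uniform draws) is a made-up stand-in.
import random

def simulate(q_low, q_high, n=100_000):
    draws = sorted(random.uniform(q_low, q_high) * 5 / 100 for _ in range(n))
    return draws[n // 2], draws[int(0.05 * n)], draws[int(0.95 * n)]

for band in [(-99, 101), (0, 2)]:
    median, p5, p95 = simulate(*band)
    print(f"q in {band}: median {median:.3f}, 90% interval ({p5:.2f}, {p95:.2f})")

# Both medians are ~0.05, but the first interval spans roughly (-4.5, 4.6)
# while the second spans roughly (0.005, 0.095). A decision rule that only
# sees the point estimate cannot tell these apart.
```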