I was going to link to the 2011 GiveWell blog post by Holden Karnofsky arguing against taking EV estimates literally, but I see Alex Berger has already mentioned it above. I’d call out these passages in particular to save folks the effort of clicking through:
While some people feel that GiveWell puts too much emphasis on the measurable and quantifiable, there are others who go further than we do in quantification, and justify their giving (or other) decisions based on fully explicit expected-value formulas. The latter group tends to critique us – or at least disagree with us – based on our preference for strong evidence over high apparent “expected value,” and based on the heavy role of non-formalized intuition in our decisionmaking. This post is directed at the latter group.
We believe that people in this group are often making a fundamental mistake, one that we have long had intuitive objections to but have recently developed a more formal (though still fairly rough) critique of. The mistake (we believe) is estimating the “expected value” of a donation (or other action) based solely on a fully explicit, quantified formula, many of whose inputs are guesses or very rough estimates. We believe that any estimate along these lines needs to be adjusted using a “Bayesian prior”; that this adjustment can rarely be made (reasonably) using an explicit, formal calculation; and that most attempts to do the latter, even when they seem to be making very conservative downward adjustments to the expected value of an opportunity, are not making nearly large enough downward adjustments to be consistent with the proper Bayesian approach.
This view of ours illustrates why – while we seek to ground our recommendations in relevant facts, calculations and quantifications to the extent possible – every recommendation we make incorporates many different forms of evidence and involves a strong dose of intuition. And we generally prefer to give where we have strong evidence that donations can do a lot of good rather than where we have weak evidence that donations can do far more good – a preference that I believe is inconsistent with the approach of giving based on explicit expected-value formulas (at least those that (a) have significant room for error (b) do not incorporate Bayesian adjustments, which are very rare in these analyses and very difficult to do both formally and reasonably).
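To make the "Bayesian prior" adjustment in the quoted passages concrete, here is a minimal sketch. It assumes, purely for illustration, a normal prior over how much good a donation does and a normally distributed error on the explicit estimate; the function name and all numbers are hypothetical, and the quoted post itself stresses that such adjustments can rarely be done well with a formal calculation.

```python
def shrink_estimate(prior_mean, prior_sd, estimate, estimate_sd):
    """Posterior mean and SD for a normal prior combined with a normal estimate.

    Illustrative assumption: both the prior and the estimate error are normal,
    so the posterior mean is a precision-weighted average of the two.
    """
    prior_precision = 1.0 / prior_sd ** 2
    estimate_precision = 1.0 / estimate_sd ** 2
    posterior_variance = 1.0 / (prior_precision + estimate_precision)
    posterior_mean = posterior_variance * (
        prior_precision * prior_mean + estimate_precision * estimate
    )
    return posterior_mean, posterior_variance ** 0.5


# Hypothetical numbers: the prior says a typical opportunity does about 1 unit of good.
# "Weak evidence of far more good": estimate of 1000 units, with error SD of 1000.
weak_but_huge = shrink_estimate(1.0, 1.0, 1000.0, 1000.0)
# "Strong evidence of a lot of good": estimate of 10 units, with error SD of 1.
strong_but_modest = shrink_estimate(1.0, 1.0, 10.0, 1.0)

print(weak_but_huge[0], strong_but_modest[0])
```

Under these made-up inputs, the very rough 1000-unit estimate is pulled almost all the way back to the prior, while the well-evidenced 10-unit estimate keeps a posterior mean of 5.5, which is one way to see the preference for strong evidence described in the quote.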
Holden later presented the underlying thinking more systematically in the 2014 GiveWell post “Sequence Thinking vs. Cluster Thinking”:
Sequence thinking involves making a decision based on a single model of the world: breaking down the decision into a set of key questions, taking one’s best guess on each question, and accepting the conclusion that is implied by the set of best guesses (an excellent example of this sort of thinking is Robin Hanson’s discussion of cryonics). It has the form: “A, and B, and C … and N; therefore X.” Sequence thinking has the advantage of making one’s assumptions and beliefs highly transparent, and as such it is often associated with finding ways to make counterintuitive comparisons.
Cluster thinking – generally the more common kind of thinking – involves approaching a decision from multiple perspectives (which might also be called “mental models”), observing which decision would be implied by each perspective, and weighing the perspectives in order to arrive at a final decision. Cluster thinking has the form: “Perspective 1 implies X; perspective 2 implies not-X; perspective 3 implies X; … therefore, weighing these different perspectives and taking into account how much uncertainty I have about each, X.” Each perspective might represent a relatively crude or limited pattern-match (e.g., “This plan seems similar to other plans that have had bad results”), or a highly complex model; the different perspectives are combined by weighing their conclusions against each other, rather than by constructing a single unified model that tries to account for all available information.
A key difference with “sequence thinking” is the handling of certainty/robustness (by which I mean the opposite of Knightian uncertainty) associated with each perspective. Perspectives associated with high uncertainty are in some sense “sandboxed” in cluster thinking: they are stopped from carrying strong weight in the final decision, even when such perspectives involve extreme claims (e.g., a low-certainty argument that “animal welfare is 100,000x as promising a cause as global poverty” receives no more weight than if it were an argument that “animal welfare is 10x as promising a cause as global poverty”).
Finally, cluster thinking is often (though not necessarily) associated with what I call “regression to normality”: the stranger and more unusual the action-relevant implications of a perspective, the higher the bar for taking it seriously (“extraordinary claims require extraordinary evidence”).
… I don’t believe that either style of thinking fully matches my best model of the “theoretically ideal” way to combine beliefs (more below); each can be seen as a more intellectually tractable approximation to this ideal.
I believe that each style of thinking has advantages relative to the other. I see sequence thinking as being highly useful for idea generation, brainstorming, reflection, and discussion, due to the way in which it makes assumptions explicit, allows extreme factors to carry extreme weight and generate surprising conclusions, and resists “regression to normality.” However, I see cluster thinking as superior in its tendency to reach good conclusions about which action (from a given set of options) should be taken.
… Sequence thinking presumes a particular framework for thinking about the consequences of one’s actions. It may incorporate many considerations, but all are translated into a single language, a single mental model, and in some sense a single “formula.” I believe this is at odds with how successful prediction systems operate, whether in finance, software, or domains such as political forecasting; such systems generally combine the predictions of multiple models in ways that purposefully avoid letting any one model (especially a low-certainty one) carry too much weight when it contradicts the others. On this point, I find Nate Silver’s discussion of his own system and the relationship to the work of Philip Tetlock (and the related concept of foxes vs. hedgehogs) germane.
While the post is over a decade old, it still seems foundational to how GiveWell think about their cost-effectiveness analyses (CEAs):
Cost-effectiveness is the single most important input in our evaluation of a program’s impact. However, there are many limitations to cost-effectiveness estimates, and we do not assess programs solely based on their estimated cost-effectiveness.
I think cluster thinking-based intervention ranking accounts for the optimiser’s curse better than the sequence thinking-plus-Bayesian-correction approach you explored above, for these reasons, especially the observation that successful prediction systems across most domains use cluster not sequence thinking.
especially the observation that successful prediction systems across most domains use cluster not sequence thinking.
I find this “observation” confusing / misleading, given that Holden defines cluster thinking as aggregating decisions from multiple perspectives. This is very different from aggregating the predictions of multiple models. The evidence of “success” he cites only applies to the latter (where “success” is with respect to Brier scores and such), not the former.
And this is practically relevant: If you aggregate multiple models but then maximize EV under the aggregated model, you don’t get the “sandboxing” property Holden claims cluster thinking satisfies. The fanatical/Pascalian model will still dominate the EV calculation.
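A toy sketch of this point, under loudly stated assumptions: the three models, their credences, and their value estimates are invented, and the credence-weighted vote is only one hypothetical way to formalize "aggregating decisions", not Holden's definition of cluster thinking.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    credence: float  # weight placed on this model; credences sum to 1
    value_a: float   # estimated value of option A under this model
    value_b: float   # estimated value of option B under this model


def build_models(pascalian_value_b):
    """Two mundane models favouring A, plus one low-credence model favouring B."""
    return [
        Model("base rates", 0.600, 10.0, 8.0),
        Model("track record", 0.399, 12.0, 9.0),
        Model("pascalian", 0.001, 0.0, pascalian_value_b),
    ]


def choose_by_mixture_ev(models):
    """Aggregate the models' predictions, then maximize EV under the mixture."""
    ev_a = sum(m.credence * m.value_a for m in models)
    ev_b = sum(m.credence * m.value_b for m in models)
    return "A" if ev_a >= ev_b else "B"


def choose_by_weighted_vote(models):
    """Aggregate the models' decisions: each model votes with its credence."""
    weight_a = sum(m.credence for m in models if m.value_a >= m.value_b)
    weight_b = sum(m.credence for m in models if m.value_a < m.value_b)
    return "A" if weight_a >= weight_b else "B"


for claim in (10.0, 100_000.0):
    models = build_models(claim)
    print(claim, choose_by_mixture_ev(models), choose_by_weighted_vote(models))
```

With the extreme claim set to 100,000, the mixture-EV rule picks B even though the Pascalian model has credence 0.001, whereas the vote picks A. Shrinking the claim to 10 flips the mixture-EV choice to A but leaves the vote unchanged, which is the sense in which only the decision-level aggregation is "sandboxed" in this toy setup.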
(ETA: As an aside on sequence thinking / cluster thinking generally, I wish these discussions made it very clear whether we’re taking ST/CT as (1) different normative standards for good epistemology / decision-making per se, vs. as (2) different procedures for satisfying a given epistemological / decision-theoretic standard. Cf. “criterion of rightness vs. decision procedure” in ethics. This would be helpful for clarifying what’s meant by claims like “cluster thinking is how ‘successful’ prediction systems operate”. I’ve been assuming (2), here, FWIW.)
Thanks for the intriguing pushback; part of why I kept bringing this up over the years was to surface this kind of counterargument. Upvoted. Flagging for myself later to look into the evidence base behind
The evidence of “success” he cites only applies to the latter (where “success” is with respect to Brier scores and such), not the former.
because I’d always assumed it was “obviously” the former (wrongly, it seems), since the latter seemed non-robust in the sense Dan Luu looked into (cf. “you really have to understand things”, which multi-model aggregations do not do).
I’ve also assumed (2) FWIW.