Vestergaard has a reply on their website FWIW, can’t vouch for it/just passing along: https://vestergaard.com/blogs/vestergaard-position-bloomberg-article-malaria-bed-nets-papua-new-guinea/
Exciting news! I worked closely with Zach at Open Phil before he left to be interim CEO of EV US, and was sad to lose him, but I was happy for EV at the time, and I’m excited now for what Zach will be able to do at the helm of CEA.
Great to hear about finding such a good fit, thanks for sharing!
Hi Dustin :)
FWIW I also don’t particularly understand the normative appeal of democratizing funding within the EA community. It seems to me like the common normative basis for democracy would tend to argue for democratizing control of resources in a much broader way, rather than within the self-selected EA community. I think epistemic/efficiency arguments for empowering more decision-makers within EA are generally more persuasive, but wouldn’t necessarily look like “democracy” per se and might look more like more regranting, forecasting tournaments, etc.
Just wanted to say that I thought this post was very interesting and I was grateful to read it.
Just wanted to comment to say I thought this was very well done, nice work! I agree with Charles that replication work like this seems valuable and under-supplied.
I enjoyed the book and recommend it to others!
In case of interest to EA Forum folks, I wrote a long tweet thread with more substance on what I learned from it and remaining questions I have here: https://twitter.com/albrgr/status/1559570635390562305
Thanks MHR. I agree that one shouldn’t need to insist on statistical significance, but if GiveWell thinks that the actual expected effect is ~12% of the MK result, then I think if you’re updating on a similarly-to-MK-powered trial, you’re almost to the point of updating on a coinflip because of how underpowered you are to detect the expected effect.
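To put rough numbers on the coinflip point (my own back-of-envelope, not GiveWell’s analysis, assuming a two-sided test at the 5% level and the standard MDE ≈ 2.8*SE rule of thumb for 80% power):

```python
from scipy.stats import norm

# Back-of-envelope: a trial powered at 80% for the original MK effect has
# MDE ~= 2.8 * SE, so the MK-sized effect sits ~2.8 standard errors from zero.
alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)      # ~1.96 for a two-sided 5% test
mk_effect_in_se = 2.8                 # effect the study was powered for, in SE units

# If the effect GiveWell actually expects is only ~12% of the MK result:
true_effect_in_se = 0.12 * mk_effect_in_se

# Power = chance the estimate lands beyond the critical value in either tail
power = (1 - norm.cdf(z_crit - true_effect_in_se)) + norm.cdf(-z_crit - true_effect_in_se)
print(f"chance of a 'significant' result: {power:.1%}")   # ~6%, barely above the 5% false-positive rate
```

So a “significant” result from such a trial would be nearly as likely to be noise as a real detection of the expected effect, which is the sense in which I mean it’s close to updating on a coinflip.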
I agree it would be useful to do this in a more formal Bayesian framework which accurately characterizes the GW priors. It wouldn’t surprise me if one of the conclusions was that I’m misinterpreting GiveWell’s current views, or that it’s hard to articulate a formal prior that gets you from the MK results to GiveWell’s current views.
Thanks, appreciate it! FWIW I sympathize with “I have an intuition that low VSLs are a problem and we shouldn’t respect them” for some definition of low, but I think it’s just a question of what the relevant “low” is.
Thanks Karthik. I think we might be talking past each other a bit, but replying in order on your first four replies:
My key issue with higher etas isn’t philosophical disagreement; it’s their use as guidance for practical decision-making. If I had taken your post at face value and used eta=1.5 to value UK GDP relative to other ways we could spend money, I think I would have predictably destroyed a lot of value for the global poor by failing to account for the full set of spillovers (because I think doing so is somewhere between very difficult and impossible). Even within low-income countries there are pervasive tax, pecuniary, and other externalities from high-income spending/consumption on lower-income co-nationals that are closer to linear than logarithmic in $s. None of this is to deny the possibility or likelihood that for a totally abstract, pure notion of consumption with no externalities at all (truly final personal consumption), a log or steeper eta would be appropriate; it’s to say that that is a predictably bad approximation of our world, and accordingly a bad decision rule given the actual data we have. I think the main reply here has to be a defense of the feasibility of explicitly accounting for all relevant spillovers, and having made multiple (admittedly weak!) stabs in that direction, I’m personally pessimistic, but I’d certainly love to see others’ attempts.
In the blog post I linked in my #2 above we explicitly consider the set point implied by the IDInsight survey data, and we think it’s consistent with what we’re doing. We’re open to the argument for using a higher fixed constant on being alive, but instead of making you focus more on redistribution of income, the first order consequence of that decision would be to focus more on saving poor people’s lives (which is in fact what we predominantly do). It’s also worth noting that as your weight there gets high, it gets increasingly out of line with people’s revealed preferences and the VSL literature (and it’s not obvious to me why you’d take those revealed preferences less seriously than the revealed preferences around eta).
“I think almost everyone would agree that a 10% income increase is worth much more to a poor person than a rich person”—I don’t think that’s right as a descriptive claim, but even if it were, the point I’m making in #1 above still holds: if your income measure is imperfect as a measure of purely private consumption without any externalities (and I think they all are), then any small positive externalities that are ~linear in $ will dominate the effective utility calculation as eta gets to or above 1. I think there are many such externalities (taxes, philanthropy, aid, R&D, trade…) such that very high etas will lead to predictably bad policy advice.
You can add a constant normalizing function and it doesn’t change my original point—maybe it’s worth checking the Weitzman paper I linked to get an intuition? There’s genuinely more “at stake” in higher incomes when you have a lower eta vs a higher eta, and so if you’re trying to make the correct utilitarian decision under true uncertainty, you don’t want to take an unweighted mean of eta and then run with it; you want to run your scenarios over different etas and weight by the stakes to get the best aggregate outcome. (I think how you specify the units might matter for the conclusion here though, a la the two envelope problem; I’m not sure.)
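To illustrate the “weight by the stakes” point with a toy example (my own made-up numbers, and the units caveat above applies):

```python
import numpy as np

# Toy illustration: marginal utility under isoelastic utility is proportional to c**(-eta).
low, high = 400.0, 40_000.0      # hypothetical low vs. high incomes
etas = np.array([0.5, 1.5])      # suppose we're genuinely 50/50 between these two etas
probs = np.array([0.5, 0.5])

def high_income_dollar_value(eta):
    """Value of a marginal $ at `high`, measured in marginal-$-at-`low` units."""
    return (low / high) ** eta

value_at_mean_eta = high_income_dollar_value(probs @ etas)   # eta=1 -> 0.01 (1% of a low-income $)
expected_value = probs @ high_income_dollar_value(etas)      # ~0.05 (~5% of a low-income $)

# The single eta that would reproduce the expectation-weighted answer is below 1
effective_eta = np.log(expected_value) / np.log(low / high)  # ~0.65
print(value_at_mean_eta, expected_value, effective_eta)
```

Averaging over the implied stakes rather than over eta makes high-income dollars matter ~5x more than the mean eta would suggest, i.e. it behaves like an eta below 1, which is the Weitzman point.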
Hey Karthik, starting a separate thread for a different issue. I opened your main spreadsheet for the first time, and I’m not positive, but I think the 90% reduction claim is due to a spreadsheet error? The utility gain in B5 that flows through to your bottom-line takeaway is hardcoded as being in log terms, but if eta changes then the utility gain per $ at the global average should change (and by the way, I think it would really matter whether you were denominating in units of the global average, global median, or global poverty level). In this copy I made a change to reimplement isoelastic utility in B7 and B8. In this version, when eta=1.00001, OP ROI is 169, and when eta=1.5, OP ROI is 130, for a difference of ~25% rather than 90%. I didn’t really follow what was happening in the rest of the sheet, so it’s possible this is wrong or misguided or implemented incorrectly.
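I can’t reproduce the whole sheet here, but here is a minimal sketch of the isoelastic piece I swapped in (the reference incomes below are rough hypothetical round numbers, just to show why both eta and the choice of denomination matter):

```python
import numpy as np

# Minimal sketch of the utility function swapped in for the hard-coded log term
# (not the full spreadsheet model; reference incomes below are rough hypotheticals).
def isoelastic_u(c, eta):
    """Isoelastic utility; equals log(c) in the eta -> 1 limit."""
    if eta == 1.0:
        return np.log(c)
    return (c ** (1.0 - eta) - 1.0) / (1.0 - eta)

def utility_gain_per_dollar(income, eta):
    """Utility gain from one extra $ at a given reference income."""
    return isoelastic_u(income + 1.0, eta) - isoelastic_u(income, eta)

for label, income in [("global average (~$12k, hypothetical)", 12_000.0),
                      ("global median (~$3k, hypothetical)", 3_000.0),
                      ("poverty line (~$800, hypothetical)", 800.0)]:
    g_log = utility_gain_per_dollar(income, 1.00001)   # ~the log case
    g_15 = utility_gain_per_dollar(income, 1.5)
    print(f"{label}: eta~1 gain {g_log:.2e}, eta=1.5 gain {g_15:.2e}, ratio {g_log / g_15:.0f}x")
```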
Hey Karthik,
Thanks for the thoughtful post, I really appreciate it!
Open Phil has thought some about arguments for higher eta but as far as I can find never written them up, so I’ll go through some of the relevant arguments in my mind:
I think the #1 issue is that as eta gets large, the modeled utility at stake at high income levels approaches zero, which makes it fragile/vulnerable to errors, and those errors are easily decisive because our models do a bad job capturing empirically relevant spillovers that are close to linear rather than logarithmic or worse in $s.
For instance, take the UK, with GDP per capita of ~$40K. Until recently they gave 0.7% of GNI to foreign aid. Let’s assume their foreign aid is on average roughly as good as GiveDirectly, which is giving income to people living on ~$400/year. Eta=1.5 implies a marginal $ at $400 is worth 1,000x a marginal $ at $40,000. So if we reduced UK GDP by 1% across the board, the proportional hit to the 0.7% going to foreign aid, valued at that 1,000x rate, is ~7x more important than the loss of the 1% of GDP we assumed was just consumed by people with average incomes of $40,000. In other words, if we had been willing to trade away UK GDP for incomes of people at $400/year at the 1,000x rate implied by eta=1.5, we would have destroyed ~7x the value for low-income people (relative to what we gained for them) before even getting to the costs for people in the UK, by ignoring this practically relevant spillover.
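Here’s that arithmetic as a quick back-of-envelope, with the same stylized assumptions:

```python
# Back-of-envelope check of the UK example above (same stylized assumptions, nothing more).
uk_income = 40_000.0        # UK GDP per capita, ~$
recipient_income = 400.0    # income of a GiveDirectly-style recipient, ~$/year
eta = 1.5
aid_share = 0.007           # 0.7% of GNI going to foreign aid
gdp_cut = 0.01              # thought experiment: UK GDP falls 1% across the board

# With isoelastic utility, marginal utility is proportional to c**(-eta)
dollar_weight = (uk_income / recipient_income) ** eta        # 100**1.5 = 1,000x

# Welfare losses in UK-consumption-equivalent units (per unit of UK GDP)
aid_loss = gdp_cut * aid_share * dollar_weight               # 0.01 * 0.007 * 1000 = 0.07
domestic_loss = gdp_cut * 1.0                                # the 1% assumed consumed at ~$40k

print(dollar_weight, aid_loss / domestic_loss)               # 1000.0, ~7x
```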
You might be inclined to try to correct/control for this, but I think that’s rare in practice and difficult in principle: I don’t think foreign aid is the only place with this kind of international spillover (think R&D, trade, immigration). I think we live in an interconnected world, and the assumptions behind high etas that abstract away from that seem dangerously wrong to me.
Depending on what you hold fixed, higher etas can also sharpen the challenge of how to weigh tradeoffs between lifesaving and income-increasing interventions, which we discuss here. Basically, if you hold a high-income VSLY fixed at something like 4x GDP per capita and let the intercept move, higher etas imply that absolute welfare at lower income levels is much lower, which on a ~standard utilitarian framework would imply that social willingness to pay to save lower-income lives should be much lower than for higher-income lives. I think that’s a pretty unattractive implication.
FWIW it’s not as important, but I looked into it a while ago and I thought the equal sacrifice approach in Evans and Groom didn’t make sense, though I haven’t discussed this with others and may be wrong. (It assumes taxpayers are sacrificing an equal amount of utility everywhere on the income spectrum, and estimates eta from that, but it seems to me that that’s wrong—a marginal $ for a high-income person in the US is taxed at ~35% federally, compared to ~10% for someone who might be making 10x less money—but on logarithmic utility the high-income person’s taxes should be vastly higher.) If you instead look at work like Hendren’s Efficient Welfare Weights, you get a ratio of welfare weights at the top of the income distribution relative to the bottom that is <2. (This makes sense as a description of the tradeoffs the tax code is making because, while our tax codes are progressive, a tax code that was actually efficiently codifying eta=1.4 would place ~0 weight on high incomes and would be at the ~peak of the Laffer curve, which AFAIK is not an accurate characterization of US or UK tax structures.)
Other lines of evidence in Groom make IMO better arguments for higher eta, though overall I’m not sure how much weight to put on revealed preference vs other factors here. One source I’ve seen cited elsewhere that seems maybe better to me is Drupp et al. 2017, which surveys a couple hundred economists about the right eta and gets a median of 1 and mean of 1.35. But per argument #1 above, you’d get a very different answer if you aggregated over implied welfare levels (which I think would make you effectively want to end up with an eta <1), rather than taking the mean of eta and then extrapolating welfare levels. (I think this is related to this insight from Weitzman.)
In practice, we actually originally chose an eta=1 for simplicity (you can do math more easily and don’t need to know whole distributions as much) and because it roughly accords with the life satisfaction data (though that is contested). I personally think that the #1 point above dominates and if we were to revisit this, it would make more sense to revisit down than up, but I still see eta=1 as a reasonable compromise and don’t see more work on this as currently one of our top priorities.
On your 36% adjustment within the log framework: I don’t think our estimates for this are accurate to anything like 36%; I’d be happy if they turn out to be within a factor of 2-3x. So I find it easy to believe you could be right here. But I think your changes come from a period when inequality increased substantially, to a historically unusual level, and I would be surprised if it made sense to predict a continuation of that increasing trend indefinitely over the relevant horizon for Tom’s model (many decades to centuries).
More broadly, I agree that the gains from redistribution can be substantial and I think our work reflects that (e.g., our Global Aid Policy program).
I don’t have a particularly good estimate of total time, but my impression is that most doctors recommend people plan to take a couple weeks off from office work, which would maybe be 2-3x your 52 hr estimate?
Hi Nicole,
I think this is a cool choice and a good post—thanks for both! I agree with your bottom line that kidney donation can be a good choice for EAs and just wanted to flag a few additional resources and considerations:
I think these other EA forum posts about the costs and benefits of donation are worth checking out. In my mind the most important update relative to when I donated is that the best long-run studies now suggest a roughly 1 percentage point increase in later-life risk of kidney failure because of donating. I think that translates less than 1:1 to mortality for a variety of reasons (ability to get a transplant, maybe xenotransplantation or other things will be easy in 20-50 years) but I think that factor probably swamps the near-term (roughly 1⁄3,000) risk of death in surgery when thinking about the EV calculation.
I think I took ~3 weeks off work to recover from donation (it was also around the holidays for me), and I think for folks who work in altruistic jobs that may dominate the cost calculation. 52 hours seems like a very low estimate of the expected time cost to me all in though.
I think people sometimes assume that the original donor gets full counterfactual “credit” for all the steps in a chain. My read of this evidence is that even though average chain length is ~4, the marginal social value of an altruistic donor starting a chain is “only” ~.8-1.7 transplants (depending on blood type) because the relevant counterfactual can be other chains being longer.
I think things like this post are themselves a pretty important channel for impact. I think the impact of my personal donation was dominated by the small influence I had on getting Dylan Matthews to donate, which then had a big knock-on impact because his writing led a number of other people to donate.
Overall, I think these kinds of persuasion considerations can play a weirdly big role in how you evaluate kidney donation, and I don’t have a clear bottom line on which way they cut.
Hi MHR,
I really appreciate substantive posts like this, thanks!
This response is just speaking for myself, doing rough math on the weekend that I haven’t run by anyone else. Someone (e.g., from @GiveWell) should correct me if I’m wrong, but I think you’re vastly understating the difficulty and cost of running an informative replication given the situation on deworming. (My math below seems intuitively too pessimistic, so I welcome corrections!)
If you look at slide 58 here, you see that the minimum detectable effect (MDE) with 80% power can be approximated as 2.8*the standard error (which is itself effectively inversely proportional to the square root of the sample size).
I didn’t check the original sources, but this GiveWell doc on their deworming replicability adjustment implies that the standard error for log(income/consumption) in the most recent replications is ~.066 (on a “main effect” of .109). The original RCT involved 75 schools, and according to figure A1 here the followup KLPS 4 involved surveying 4,135 participants in the original trial. GiveWell’s most recent cost-effectiveness analysis for Deworm the World makes 2 key adjustments to the main effect from the RCT:
A replicability adjustment of .13 (row 11)
A geography-specific adjustment for worm burden which averages about .12 (row 40) (this is because worm burdens are now much lower than they were at the time of MK)
Together, these adjustments imply that GiveWell projects the per-capita benefit to the people dewormed to be just .13*.12=1.56% of the .109 impact on log income in the late followups to the original Miguel and Kremer RCT. So if we wanted to detect the effect GiveWell expects to see in mass deworming, we’d have an MDE of ~.0017 on log income, which with 80% power and the formula above (MDE=2.8*standard error) implies we’d need the standard error to be .0017/2.8=~.00061 log points. So a well-powered study to get the effect GiveWell expects would need a standard error roughly 108 times smaller than the standard error (.066) GiveWell calculates on the actual followup RCTs.
But because standard errors are inversely proportional to the square root of sample size, if you used the same study design, getting a 108x smaller standard error would require a 108*108=11,664 times larger sample. I think that might imply a sample size of ~all the elementary schools in India (11,664*75=874K), which would presumably include many schools that do not in fact actually have significant worm burdens.
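In case anyone wants to check or tweak the inputs, here is the same rough arithmetic in one place (numbers from the GiveWell materials linked above; same caveats as the rest of this comment):

```python
# Rough weekend math in one place -- inputs are from the GiveWell docs linked above.
mk_effect = 0.109            # "main effect" on log income in the late MK follow-ups
se_followup = 0.066          # standard error on that effect in the recent follow-ups
replicability_adj = 0.13     # GiveWell replicability adjustment (row 11)
worm_burden_adj = 0.12       # average geography-specific worm burden adjustment (row 40)
original_clusters = 75       # schools in the original RCT

expected_effect = mk_effect * replicability_adj * worm_burden_adj  # ~0.0017 on log income
required_se = expected_effect / 2.8                                # MDE ~= 2.8*SE at 80% power
se_shrink = se_followup / required_se                              # ~109x smaller SE needed

# SE scales with 1/sqrt(n), so the sample has to grow by se_shrink**2
sample_multiplier = se_shrink ** 2                                 # ~12,000x
implied_clusters = sample_multiplier * original_clusters           # ~886k schools (the ~874K above, up to rounding)

print(expected_effect, required_se, se_shrink, sample_multiplier, implied_clusters)
```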
If the original MK study and one followup cost $1M (which I think is the right order of magnitude but may be too high or too low), this implies that a followup powered to find the effect GiveWell expects would cost many billions of dollars. And of course it would take well over a decade to get the long term followup results here. (That said, it wouldn’t surprise me if I’m getting the math wrong here—someone please flag if so!)
I’m sure there are better study designs than the one I’m implicitly modeling here that could generate more power, or places where worm burdens are still high enough to make this somewhat more economical, but I’m skeptical they can overcome the fundamental difficulty of detecting small effects in cluster RCTs.
I think a totally reasonable reaction to this is to be more skeptical of small cheap interventions, because they’re so hard to study and it’s so easy to end up driven by your priors.
I also hadn’t seen these slides, thanks for posting! (And thanks to Michael for the post, I thought it was interesting/thought-provoking.)
Thanks for the thorough engagement, Michael. We appreciate thoughtful critical engagement with our work and are always happy to see more of it. (And thanks for flagging this to us in advance so we could think about it—we appreciate that too!)
One place where I particularly appreciate the push is on better defining and articulating what we mean by “worldviews” and how we approach worldview diversification. By worldview we definitely do not mean “a set of philosophical assumptions”—as Holden writes in the blog post where he introduced the concept, we define worldviews as:
a set of highly debatable (and perhaps impossible to evaluate) beliefs that favor a certain kind of giving. One worldview might imply that evidence-backed charities serving the global poor are far more worthwhile than either of the types of giving discussed above; another might imply that farm animal welfare is; another might imply that global catastrophic risk reduction is. A given worldview represents a combination of views, sometimes very difficult to disentangle, such that uncertainty between worldviews is constituted by a mix of empirical uncertainty (uncertainty about facts), normative uncertainty (uncertainty about morality), and methodological uncertainty (e.g. uncertainty about how to handle uncertainty, as laid out in the third bullet point above).
We think it is a mistake to collapse worldviews in the sense that we use them to popular debates in philosophy, and we definitely don’t aim to be exhaustive across worldviews that have many philosophical adherents. We see proliferation of worldviews as costly for the standard intellectual reason that they inhibit optimization, as well as carrying substantial practical costs, so we think the bar for putting money behind an additional worldview is significantly higher than you seem to think. But we haven’t done a good job articulating and exploring what we do mean and how that interacts with the case for worldview diversification (which itself remains undertheorized). We appreciate the push on this and are planning to do more thinking and writing on it in the future.
In terms of disagreements, I think maybe the biggest one is a meta one about the value of philosophy per se. We are less worried about internal consistency than we think it is appropriate for philosophers to be, and accordingly less interested in costly exercises that would make us more consistent without carrying obviously large practical benefits. When we encounter critiques, our main questions are, “how would we spend our funding differently if this critique were correct? How costly are the deviations that we’re making according to this critique?” As an example of a case where we spent a lot of time thinking about the philosophy and ended up thinking it didn’t really have high utility stakes and so just deprioritized it for now, see the last footnote on this post (where we find that the utility stakes of a ~3x increase in valuations on lives in some countries would be surprisingly small, not because they would not change what we would fund but because the costs of mistakes are not that big on the view that has higher valuations). You mentioned being confused by what’s going on in that sheet, which is totally fair—feel free to email Peter for a more detailed explanation/walkthrough as the footnote indicates.
In this particular writeup, you haven’t focused as much on the upshot of what we should fund that we don’t (or what we do fund that we shouldn’t), but elsewhere in your writing I take your implication to be that we should do more on mental health. Based on my understanding of your critiques, I think that takeaway is wrong, and in fact taking on board your critiques here would lead us to do more of what most of OP Global Health and Wellbeing already does—save kids’ lives and work to fight the worst abuses of factory farming, potentially with a marginal reduction in our more limited work focused on increasing incomes. Three particular disagreements that I think drive this:
Set point. I think setting a neutral point on a life satisfaction scale of 5⁄10 is somewhere between unreasonable and unconscionable, and OP institutionally is comfortable with the implication that saving human lives is almost always good. Given that we think the correct neutral point is low, taking your other points on board would imply that we should place even more weight on life-saving interventions. We think that is plausible, but for now we’ll note that we’re already really far in this direction compared to other actors. That doesn’t mean we shouldn’t go further, but we do think it should prompt some humility on our part re: even more extreme divergence with consensus, which is one reason we’re going slowly.
Hedonism. We think that most plausible arguments for hedonism end up being arguments for the dominance of farm animal welfare. We seem to put a lot of weight on those arguments relative to you, and farm animal welfare is OP GHW’s biggest area of giving after GiveWell recommendations. If we updated toward more weight on hedonism, we think the correct implication would be even more work on FAW, rather than work on human mental health. A little more abstractly, we don’t think that different measures of subjective wellbeing (hedonic and evaluative) neatly track different theories of welfare. That doesn’t mean they’re useless—we can still learn a lot when noisy measures all point in the same direction—but we don’t think it makes sense to entrench a certain survey-based measure like life satisfaction scores as the ultimate goal.
Population ethics. While we’re ambivalent about how much to bet on the total view, we disagree with your claim that doing so would reduce our willingness to pay for saving lives given offsetting fertility effects. As I wrote here, Roodman’s report is only counting the first generation. If he is right that preventing two under-5 deaths leads to ~one fewer birth, that’s still one more kid net making it to adulthood and being able to have kids of their own. Given fertility rates in the places where we fund work to save lives, I think that would more than offset the Roodman adjustment in just a few decades, and potentially cumulatively lead to much higher weight on the value of saving kids’ lives today (though one would also have to be attentive to potential costs of bigger populations).
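To make the compounding intuition concrete, here is a deliberately crude sketch (my own illustrative numbers; it ignores declining fertility, mortality among descendants, discounting, and everything else a real model like Roodman’s would handle):

```python
# Deliberately crude sketch of the compounding point (illustrative numbers only;
# ignores declining fertility, descendant mortality, discounting, etc.).
lives_saved = 2            # prevent two under-5 deaths...
fertility_offset = 1       # ...which (per Roodman) leads to ~one fewer birth
net_first_generation = lives_saved - fertility_offset   # +1 extra person reaching adulthood

tfr = 5.0                  # hypothetical total fertility rate for the relevant regions
kids_per_person = tfr / 2  # two parents per child

extra = float(net_first_generation)
cumulative = extra
for generation in range(1, 4):                       # ~25-year generations, so roughly a century
    extra *= kids_per_person
    cumulative += extra
    print(f"generation {generation}: cumulative extra people ~{cumulative:.1f}")
# Within a generation or two (a few decades), the cumulative count already exceeds the
# unadjusted two lives saved, i.e. it more than offsets the first-generation adjustment.
```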
Related to the point about placing less weight on the value of philosophy per se, we’re reluctant to get pulled into long written back and forths about this kind of thing, so I’m not planning to say more on this thread by default, but happy to continue these discussions in the future. And thanks again for taking the time to engage here.
GiveWell could answer more confidently but FWIW my take is:
- December 2022 is totally fine relative to today.
- I currently expect this increase in marginal cost-effectiveness to persist in future years, but with a lot of uncertainty/low confidence.
I wrote a long twitter thread with some replies here FWIW: https://twitter.com/albrgr/status/1532726108130377729
FWIW I think I’m an example of Type 1 (literally, in Lorenzo’s data) and I also agree that abstractly more of Type 2 would be helpful (but I think there are various tradeoffs and difficulties that make it not straightforwardly clear what to do about it).