Vestergaard has a reply on their website FWIW, can’t vouch for it/just passing along: https://vestergaard.com/blogs/vestergaard-position-bloomberg-article-malaria-bed-nets-papua-new-guinea/
Alexander_Berger
Open Philanthropy: Our Progress in 2023 and Plans for 2024
Exciting news! I worked closely with Zach at Open Phil before he left to be interim CEO of EV US, and was sad to lose him, but I was happy for EV at the time, and I’m excited now for what Zach will be able to do at the helm of CEA.
Suggestions for Individual Donors from Open Philanthropy Staff – 2023
Our planned allocation to GiveWell’s recommendations for the next few years
Great to hear about finding such a good fit, thanks for sharing!
Hi Dustin :)
FWIW I also don’t particularly understand the normative appeal of democratizing funding within the EA community. It seems to me like the common normative basis for democracy would tend to argue for democratizing control of resources in a much broader way, rather than within the self-selected EA community. I think epistemic/efficiency arguments for empowering more decision-makers within EA are generally more persuasive, but wouldn’t necessarily look like “democracy” per se and might look more like more regranting, forecasting tournaments, etc.
Announcing the awardees for Open Philanthropy’s $150M Regranting Challenge
Just wanted to say that I thought this post was very interesting and I was grateful to read it.
Just wanted to comment to say I thought this was very well done, nice work! I agree with Charles that replication work like this seems valuable and under-supplied.
I enjoyed the book and recommend it to others!
In case of interest to EA forum folks, I wrote a long tweet thread with more substance on what I learned from it and remaining questions I have here: https://twitter.com/albrgr/status/1559570635390562305
Thanks MHR. I agree that one shouldn’t need to insist on statistical significance, but if GiveWell thinks that the actual expected effect is ~12% of the MK result, then I think if you’re updating on a similarly-to-MK-powered trial, you’re almost to the point of updating on a coinflip because of how underpowered you are to detect the expected effect.
I agree it would be useful to do this in a more formal bayesian framework which accurately characterizes the GW priors. It wouldn’t surprise me if one of the conclusions was that I’m misinterpreting GiveWell’s current views, or that it’s hard to articulate a formal prior that gets you from the MK results to GiveWell’s current views.
Thanks, appreciate it! I sympathize with this for some definition of low FWIW: “I have an intuition that low VSLs are a problem and we shouldn’t respect them” but I think it’s just a question of what the relevant “low” is.
Thanks Karthik. I think we might be talking past each other a bit, but replying in order on your first four replies:
My key issue with higher etas isn’t philosophical disagreement, it’s as guidance for practical decision-making. If I had taken your post at face value and used eta=1.5 to value UK GDP relative to other ways we could spend money, I think I would have predictably destroyed a lot of value for the global poor by failing to account for the full set of spillovers (because I think doing so is somewhere between very difficult and impossible). Even within low-income countries there are still pervasive tax, pecuniary, and other externalities from high-income spending/consumption on lower-income co-nationals that are closer to linear than logarithmic in $s. None of this is to deny the possibility or likelihood that in a totally abstract pure notion of consumption where it didn’t have any externalities at all and it was truly final personal consumption, it would be appropriate to have a log or steeper eta. It’s to say that that is a predictably bad approximation of our world and accordingly a bad decision rule given the actual data that we have. I think the main reply here has to be a defense of the feasibility of explicitly accounting for all relevant spillovers, and having made multiple (admittedly weak!) stabs in that direction, I’m personally pessimistic, but I’d certainly love to see others’ attempts.
In the blog post I linked in my #2 above we explicitly consider the set point implied by the IDInsight survey data, and we think it’s consistent with what we’re doing. We’re open to the argument for using a higher fixed constant on being alive, but instead of making you focus more on redistribution of income, the first order consequence of that decision would be to focus more on saving poor people’s lives (which is in fact what we predominantly do). It’s also worth noting that as your weight there gets high, it gets increasingly out of line with people’s revealed preferences and the VSL literature (and it’s not obvious to me why you’d take those revealed preferences less seriously than the revealed preferences around eta).
“I think almost everyone would agree that 10% income increase is worth much more to a poor person than a rich person”—I don’t think that’s right as a descriptive claim but again even if it were the point I’m making in #1 above still holds—if your income measure is imperfect as a measure of purely private consumption without any externalities, and I think they all are, then any small positive externalities that are ~linear in $ will dominate the effective utility calculation as eta gets to or above 1. I think there are many such externalities—taxes, philanthropy, aid, R&D, trade… - such that very high etas will lead to predictably bad policy advice.
You can add a constant normalizing function and it doesn’t change my original point; maybe it’s worth checking the Weitzman paper I linked to get an intuition? There’s genuinely more “at stake” in higher incomes when you have a lower eta vs a higher eta, and so if you’re trying to make the correct utilitarian decision under true uncertainty, you don’t want to take an unweighted mean of eta and then run with it, you want to run your scenarios over different etas and weight by the stakes to get the best aggregate outcome. (I think how you specify the units might matter for the conclusion here though, a la the two envelope problem; I’m not sure.)
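To make the stakes-weighting point concrete, here is a rough sketch in Python. The three eta values, the equal credences, and the 100x income ratio are illustrative assumptions only; they are not taken from the Weitzman paper or from any Open Phil model. It compares taking the mean of eta directly with averaging the implied marginal utilities, under two different choices of units.

```python
import math

# Hypothetical inputs: three candidate etas held with equal credence (mean eta = 1.0)
ETAS = [0.5, 1.0, 1.5]
INCOME_RATIO = 100.0  # high income / low income, e.g. $40,000 vs $400


def implied_eta(normalize_at: str) -> float:
    """Eta implied by averaging isoelastic marginal utilities across ETAS.

    Isoelastic marginal utility is u'(c) = c**(-eta).
    "low":  fix u'(low income) = 1 in every scenario, average u'(high income).
    "high": fix u'(high income) = 1 in every scenario, average u'(low income).
    """
    if normalize_at == "low":
        mean_mu_high = sum(INCOME_RATIO ** (-e) for e in ETAS) / len(ETAS)
        return -math.log(mean_mu_high) / math.log(INCOME_RATIO)
    if normalize_at == "high":
        mean_mu_low = sum(INCOME_RATIO ** e for e in ETAS) / len(ETAS)
        return math.log(mean_mu_low) / math.log(INCOME_RATIO)
    raise ValueError("normalize_at must be 'low' or 'high'")


mean_eta = sum(ETAS) / len(ETAS)
eta_low_units = implied_eta("low")
eta_high_units = implied_eta("high")
print(mean_eta, eta_low_units, eta_high_units)
```

Under these assumptions, fixing units at the low income yields an effective eta below the simple mean of 1, while fixing units at the high income yields one above it. That is the kind of unit sensitivity the two envelope caveat is pointing at.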
Hey Karthik, starting separate thread for a different issue. I opened your main spreadsheet for the first time, and I’m not positive but I think the 90% reduction claim is due to a spreadsheet error? The utility gain in B5 that flows through to your bottom line takeaway is hardcoded as being in log terms, but if eta changes then the utility gain to $s at the global average should change (and by the way I think it would really matter if you were denominating in units of global average, global median, or global poverty level). In this copy I made a change to reimplement isoelastic utility in B7 and B8. In this version, when eta=1.00001, OP ROI is 169, and when eta=1.5, OP ROI is 130, for a difference of ~25% rather than 90%. I didn’t really follow what was happening in the rest of the sheet so it’s possible this is wrong or misguided or implemented incorrectly.
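For readers who want to see what "reimplementing isoelastic utility" means in general terms, here is a minimal Python sketch of the standard isoelastic (CRRA) form. It does not reproduce the spreadsheet cells or the ROI figures above; the function names, the 1% income increase, and the choice to denominate income relative to a reference level of 1.0 are all assumptions made for illustration.

```python
import math


def isoelastic_utility(c: float, eta: float) -> float:
    """Standard isoelastic (CRRA) utility; reduces to log utility at eta = 1."""
    if math.isclose(eta, 1.0):
        return math.log(c)
    return (c ** (1.0 - eta) - 1.0) / (1.0 - eta)


def utility_gain(c: float, pct_increase: float, eta: float) -> float:
    """Utility gain from raising income c by pct_increase (0.01 means 1%)."""
    return isoelastic_utility(c * (1.0 + pct_increase), eta) - isoelastic_utility(c, eta)


# Assumption: income is measured relative to a reference level, so c = 1.0 is the
# reference (global mean, median, or poverty line; the choice changes the results).
log_gain = utility_gain(1.0, 0.01, eta=1.0)
near_log_gain = utility_gain(1.0, 0.01, eta=1.00001)
steeper_gain = utility_gain(1.0, 0.01, eta=1.5)

# The same 1% increase at 10x the reference income, still with eta = 1.5
gain_at_10x = utility_gain(10.0, 0.01, eta=1.5)

print(log_gain, near_log_gain, steeper_gain, gain_at_10x)
```

With log utility a 1% increase is worth the same at any income level, but with eta=1.5 its value depends on where the income sits relative to the reference, so a utility gain hardcoded in log terms would not update when eta changes.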
Hey Karthik,
Thanks for the thoughtful post, I really appreciate it!
Open Phil has thought some about arguments for higher eta but as far as I can find never written them up, so I’ll go through some of the relevant arguments in my mind:
I think the #1 issue is that as eta gets large, the modeled utility at stake at high income levels approaches zero, which makes it fragile/vulnerable to errors, and those errors are easily decisive because our models do a bad job capturing empirically relevant spillovers that are close to linear rather than logarithmic or worse in $s.
For instance, take the UK, with GDP per capita of ~$40K. Until recently they gave 0.7% of GNI to foreign aid. Let’s assume their foreign aid is on average roughly as good as GiveDirectly, which is giving income to people living on ~$400/year. With eta=1.5, which implies a marginal $ at $400 is worth 1,000x a marginal $ at $40,000, if we reduced UK GDP by 1%, the loss of the 0.7% going to foreign aid is 7x more important than the loss of the 1% of GDP we assumed was just consumed by people with average incomes of $40,000. So if we had been willing to trade UK GDP for incomes of people at $400/year at the 1,000x rate implied by eta=1.5, we would have destroyed 7x the value for low income people before even getting to the costs for people in the UK by ignoring this practically relevant spillover.
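The arithmetic in the UK example can be checked with a short Python sketch. The income levels, eta, and aid share come from the paragraph above; the variable names and the simplification that everything other than aid is consumed at the average UK income are my own framing of it.

```python
# Figures from the example above
UK_INCOME = 40_000.0        # approximate UK GDP per capita, in $
RECIPIENT_INCOME = 400.0    # income of aid recipients, in $/year
ETA = 1.5
AID_SHARE = 0.007           # 0.7% of national income going to foreign aid

# With isoelastic utility, marginal utility is c**(-eta), so a marginal $ at the
# low income is worth (high / low)**eta times a marginal $ at the high income.
marginal_value_ratio = (UK_INCOME / RECIPIENT_INCOME) ** ETA

# Consider one unit of lost UK GDP (the size of the cut cancels out of the ratio).
gdp_lost = 1.0
aid_loss_value = gdp_lost * AID_SHARE * marginal_value_ratio
consumption_loss_value = gdp_lost * (1.0 - AID_SHARE) * 1.0

relative_importance = aid_loss_value / consumption_loss_value
print(marginal_value_ratio, relative_importance)
```

The ratio comes out at roughly 1,000x, and the lost aid is worth roughly 7x the lost domestic consumption, matching the figures in the text.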
You might be inclined to try to correct/control for this, but I think that’s rare in practice and difficult in principle: I don’t think foreign aid is the only place with this kind of international spillover (think R&D, trade, immigration). I think we live in an interconnected world, and the assumptions behind high etas that abstract away from that seem dangerously wrong to me.
Depending on what you hold fixed, higher etas can also sharpen the challenge of how to weigh tradeoffs between lifesaving and income-increasing interventions, which we discuss here. Basically, if you hold a high-income VSLY fixed at something like 4x GDPpc and let the intercept move, higher etas imply that absolute welfare at lower income levels is much lower, which on a ~standard utilitarian framework would imply that social willingness to pay to save lower-income lives should be much lower than for higher-income lives. I think that’s a pretty unattractive implication.
FWIW it’s not as important but I looked into it once a while ago and I thought the equal sacrifice approach in Evans and Groom didn’t make sense, though I haven’t discussed this with others and may be wrong. (It assumes taxpayers are sacrificing an equal amount of utility everywhere on the income spectrum, and estimates eta from that, but it seems to me that that’s wrong: a marginal $ for a high income person in the US is taxed at ~35% federally, compared to ~10% for someone who might be making 10x less money, but on logarithmic utility the high-income person’s taxes should be vastly higher.) If you instead look at work like Hendren’s Efficient Welfare Weights, you get a ratio on welfare weights at the top of the income distribution relative to the bottom that is <2. (This makes sense as a description of the tradeoffs the tax code is making because, while our tax codes are progressive, a tax code that was actually efficiently codifying eta=1.4 would place ~0 weight on high incomes and would be at the ~peak of the Laffer curve, which AFAIK is not an accurate characterization of US or UK tax structures.)
Other lines of evidence in Groom make IMO better arguments for higher eta, though overall I’m not sure how much weight to put on revealed preference vs other factors here. One source I’ve seen cited elsewhere that seems maybe better to me is Drupp et al. 2017, which surveys a couple hundred economists about the right eta and gets a median of 1 and mean of 1.35. But per the argument #1 above, you’d get a very different answer if you aggregated over implied welfare levels (which I think would make you effectively want to end up with an eta <1), rather than taking the mean of eta and then extrapolating welfare levels. (I think this is related to this insight from Weitzman.)
In practice, we actually originally chose an eta=1 for simplicity (you can do math more easily and don’t need to know whole distributions as much) and because it roughly accords with the life satisfaction data (though that is contested). I personally think that the #1 point above dominates and if we were to revisit this, it would make more sense to revisit down than up, but I still see eta=1 as a reasonable compromise and don’t see more work on this as currently one of our top priorities.
On your 36% adjustment within the log framework: I don’t think our estimates for this are accurate to anything like 36%; I’d be happy if they turn out to be within a factor of 2-3x. So I find it easy to believe you could be right here. But I think your changes come from a period when inequality increased substantially, to a historically unusual level, and I would be surprised if it made sense to predict a continuation of that increasing trend indefinitely over the relevant horizon for Tom’s model (many decades to centuries).
More broadly, I agree that the gains from redistribution can be substantial and I think our work reflects that (e.g., our Global Aid Policy program).
I don’t have a particularly good estimate on total time, but my impression is that most doctors recommend people plan to take a couple weeks off from office work, which would maybe 2-3x your 52 hr estimate?
Hi Nicole,
I think this is a cool choice and a good post—thanks for both! I agree with your bottom line that kidney donation can be a good choice for EAs and just wanted to flag a few additional resources and considerations:
I think these other EA forum posts about the costs and benefits of donation are worth checking out. In my mind the most important update relative to when I donated is that the best long-run studies now suggest a roughly 1 percentage point increase in later-life risk of kidney failure because of donating. I think that translates less than 1:1 to mortality for a variety of reasons (ability to get a transplant, maybe xenotransplantation or other things will be easy in 20-50 years) but I think that factor probably swamps the near-term (roughly 1⁄3,000) risk of death in surgery when thinking about the EV calculation.
I think I took ~3 weeks off work to recover from donation (it was also around the holidays for me), and I think for folks who work in altruistic jobs that may dominate the cost calculation. 52 hours seems like a very low estimate of the expected time cost to me all in though.
I think people sometimes assume that the original donor gets full counterfactual “credit” for all the steps in a chain. My read of this evidence is that even though average chain length is ~4, the marginal social value of an altruistic donor starting a chain is “only” ~.8-1.7 transplants (depending on blood type) because the relevant counterfactual can be other chains being longer.
I think things like this post are themselves a pretty important channel for impact. I think the impact of my personal donation was dominated by the small influence I had on getting Dylan Matthews to donate, which then had a big knock-on impact because his writing led a number of other people to donate.
Overall, I think these kinds of persuasion considerations can play a weirdly big role in how you evaluate kidney donation, and I don’t have a clear bottom line on which way they cut.
Hi MHR,
I really appreciate substantive posts like this, thanks!
This response is just speaking for myself, doing rough math on the weekend that I haven’t run by anyone else. Someone (e.g., from @GiveWell) should correct me if I’m wrong, but I think you’re vastly understating the difficulty and cost of running an informative replication given the situation on deworming. (My math below seems intuitively too pessimistic, so I welcome corrections!)
If you look at slide 58 here, you see that the minimum detectable effect (MDE) size with 80% power can be approximated as 2.8*the standard error (which is itself effectively inversely proportional to the square root of the sample size).
I didn’t check the original sources, but this GiveWell doc on their deworming replicability adjustment implies that the standard error for log(income/consumption) in the most recent replications is ~.066 (on a “main effect” of .109). The original RCT involved 75 schools, and according to figure A1 here the followup KLPS 4 involved surveying 4,135 participants in the original trial. GiveWell’s most recent cost-effectiveness analysis for Deworm the World makes 2 key adjustments to the main effect from the RCT:
A replicability adjustment of .13 (row 11)
A geography-specific adjustment for worm burden which averages about .12 (row 40) (this is because worm burdens are now much lower than they were at the time of MK)
Together, these adjustments imply that GiveWell projects the per-capita benefit to the people dewormed to be just .13*.12=1.56% of the .109 impact on log income in the late followups to the original Miguel and Kremer RCT. So if we wanted to detect the effect GiveWell expects to see in mass deworming, we’d have an MDE of ~.0017 on log income, which with 80% power and the formula above (MDE=2.8*standard error) implies we’d need the standard error to be .0017/2.8=~.00061 log points. So a well-powered study to get the effect GiveWell expects would need a standard error roughly 108 times smaller than the standard error (.066) GiveWell calculates on the actual followup RCTs.
But because standard errors are inversely proportional to the square root of sample size, if you used the same study design, getting a 108x smaller standard error would require a 108*108=11,664 times larger sample. I think that might imply a sample size of ~all the elementary schools in India (11,664*75=874K), which would presumably include many schools that do not in fact actually have significant worm burdens.
If the original MK study and one followup cost $1M (which I think is the right order of magnitude but may be too high or too low), this implies that a followup powered to find the effect GiveWell expects would cost many billions of dollars. And of course it would take well over a decade to get the long term followup results here. (That said, it wouldn’t surprise me if I’m getting the math wrong here—someone please flag if so!)
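The chain of calculations above can be reproduced with a short Python sketch. All inputs are the figures quoted in this comment; the variable names are mine, and the assumption that cost scales linearly with sample size is a simplification that the comment only implies.

```python
# Inputs quoted in the comment above
MAIN_EFFECT = 0.109          # effect on log income in the MK followups
OBSERVED_SE = 0.066          # standard error in the most recent followups
REPLICABILITY_ADJ = 0.13     # GiveWell replicability adjustment
WORM_BURDEN_ADJ = 0.12       # average geography-specific worm burden adjustment
MDE_MULTIPLIER = 2.8         # MDE is roughly 2.8 * SE at 80% power
ORIGINAL_SCHOOLS = 75
ASSUMED_COST_USD = 1_000_000  # rough guess for the original study plus one followup

# Effect size GiveWell expects, which is the effect a new study must detect
expected_effect = MAIN_EFFECT * REPLICABILITY_ADJ * WORM_BURDEN_ADJ

# Standard error needed for that effect to equal the minimum detectable effect
required_se = expected_effect / MDE_MULTIPLIER

# SE shrinks with the square root of sample size, so sample scales with the square
se_ratio = OBSERVED_SE / required_se
sample_multiplier = se_ratio ** 2

schools_needed = sample_multiplier * ORIGINAL_SCHOOLS
# Assumption: cost scales linearly with sample size
estimated_cost_usd = sample_multiplier * ASSUMED_COST_USD

print(round(se_ratio, 1), round(sample_multiplier), round(schools_needed))
print(f"${estimated_cost_usd / 1e9:.1f}B")
```

Without the intermediate rounding used in the text, the sample multiplier and school count come out slightly higher than 11,664 and 874K, but the order of magnitude and the conclusion are the same.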
I’m sure there are better study designs than the one I’m implicitly modeling here that could generate more power, or places where worm burdens are still high enough to make this somewhat more economical, but I’m skeptical they can overcome the fundamental difficulty of detecting small effects in cluster RCTs.
I think a totally reasonable reaction to this is to be more skeptical of small cheap interventions, because they’re so hard to study and it’s so easy to end up driven by your priors.
FWIW I think I’m an example of Type 1 (literally, in Lorenzo’s data) and I also agree that abstractly more of Type 2 would be helpful (but I think there are various tradeoffs and difficulties that make it not straightforwardly clear what to do about it).