I admire influential orgs that publicly change their mind due to external feedback, and GiveWell is as usual exemplary of this (see also their grant “lookbacks”). From their recently published Progress on Issues We Identified During Top Charities Red Teaming, here’s how external feedback changed their bottom-line grantmaking:

In 2023, we conducted “red teaming” to critically examine our four top charities. We found several issues: 4 mistakes and 10 areas requiring more work. We thought these could significantly affect our 2024 grants: $5m-$40m in grants we wouldn’t have made otherwise and $5m-$40m less in grants we would have made otherwise (out of ~$325m total).
This report looks back at how addressing these issues changed our actual grantmaking decisions in 2024. Our rough estimate is that red teaming led to ~$37m in grants we wouldn’t have made otherwise and prevented ~$20m in grants we would have made otherwise, out of ~$340m total grants. The biggest driver was incorporating multiple sources for disease burden data rather than relying on single sources.1 There were also several cases where updates did not change grant decisions but led to meaningful changes in our research.
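To make that “multiple sources” point concrete, here’s a toy sketch of what checking burden data against more than one source can look like: pull estimates for the same disease and geography from a couple of sources, flag big disagreements for manual review, and use a blend rather than a single number. This is my own illustration, not GiveWell’s actual method; the sources, figures, and threshold are all made up.

```python
# Toy illustration of checking disease burden data against multiple sources.
# This is my own sketch, NOT GiveWell's actual methodology; the source names,
# figures, and the 30% disagreement threshold are made up for the example.

def triangulate_burden(estimates: dict[str, float], max_ratio: float = 1.3) -> float:
    """Blend per-source burden estimates (e.g. deaths per 100k) into one figure,
    flagging the geography for manual review if the sources disagree too much."""
    values = list(estimates.values())
    hi, lo = max(values), min(values)
    if lo > 0 and hi / lo > max_ratio:
        print(f"Sources disagree by more than {max_ratio - 1:.0%}: {estimates} -- review before use")
    # Simple average as the blended estimate; a real analysis would weight
    # sources by how much it trusts each one in this particular geography.
    return sum(values) / len(values)

# Hypothetical malaria mortality estimates (deaths per 100k) for one region:
blended = triangulate_burden({"IHME GBD": 85.0, "WHO World Malaria Report": 120.0})
print(f"Blended estimate: {blended:.1f} deaths per 100k")
```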
Some self-assessed progress that caught my eye — incomplete list, full one here; these “led to important errors or… worsened the credibility of our research” (0 = no progress made, 10 = completely resolved):
Failure to engage with outside experts (8/10): We spent 240 days at conferences/site visits in 2024 (vs. 60 in 2023). We think this type of external engagement helped us avoid ~$4m in grants and identify new grant opportunities like Uduma water utility ($480,000). We’ve established ongoing relationships with field experts. (more)
Failure to check burden data against multiple sources (8/10): By using multiple data sources for disease burden, we made ~$34m in grants we likely wouldn’t have otherwise and declined ~$14m in grants we probably would have made. We’ve implemented comprehensive guidelines for triangulating data sources. (more)
Failure to account for individuals receiving interventions from other sources (7/10): We were underestimating how many people would get nets without our campaigns, reducing cost-effectiveness by 20-25%. We’ve updated our models but have made limited progress on exploring routine distribution systems (continuous distribution through existing health channels) as an alternative or complement to our mass campaigns. (more)
Failure to estimate interactions between programs (7/10): We adjusted our vitamin A model to account for overlap with azithromycin distribution (reducing effectiveness by ~15%) and accounted for malaria vaccine coverage when estimating the impact of nets. We’ve developed a framework to systematically address this (a toy sketch of how these overlap adjustments work arithmetically follows the aside below). (more)
(As an aside, I’ve noticed plenty of claims of GW top charity-beating cost-effectiveness figures, both on the forum and elsewhere, and I basically never give them the credence I’d give to GW’s own estimates, due to the kind of (usually downward) adjustments mentioned above, like receiving interventions from other sources or between-program interactions, and the sheer thoroughness of GW’s reasoning behind those adjustments; seriously, click on any of those “(more)”s.)
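Both of those adjustments (people getting the intervention from other sources, and overlap between programs) are, at bottom, multiplicative discounts on a headline cost-effectiveness figure. Here’s a toy version of the arithmetic, my own sketch with invented numbers in the ranges quoted above, not GiveWell’s actual CEA:

```python
# Toy arithmetic for two of the adjustments above: counterfactual coverage
# (people who would have received nets from another source anyway) and
# between-program overlap (e.g. vitamin A benefits partly duplicated by
# azithromycin distribution). My own illustration, NOT GiveWell's actual CEA;
# the numbers are invented but sit in the ranges quoted above (20-25%, ~15%).

headline_ce = 10.0  # cost-effectiveness in "multiples of cash transfers" (illustrative unit)

# If ~22% of recipients would have been covered by another source anyway,
# only the remaining 78% of the modeled benefit is attributable to this grant.
counterfactual_coverage = 0.22
ce_after_coverage = headline_ce * (1 - counterfactual_coverage)

# If ~15% of the modeled benefit overlaps with another funded program,
# discount again so the same outcome isn't counted twice.
program_overlap = 0.15
adjusted_ce = ce_after_coverage * (1 - program_overlap)

print(f"Headline: {headline_ce:.1f}x cash; after both adjustments: {adjusted_ce:.1f}x cash")
# -> Headline: 10.0x cash; after both adjustments: 6.6x cash
```

Stacking a few individually modest discounts like this can easily take a headline figure down by a third, which is a big part of why I don’t take unadjusted outside estimates at face value.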
Some other issues they’d “been aware of at the time of red teaming and had deprioritized but that we thought were worth looking into following red teaming” — again incomplete list, full one here:
Insufficient attention to inconsistency across cost-effectiveness analyses (CEAs) (8/10): We made our estimates of long-term income effects of preventive health programs more consistent (now 20-30% of benefits across top charities vs. previously 10-40%) and fixed implausible assumptions on indirect deaths (deaths prevented, e.g., by malaria prevention, that aren’t attributed to malaria in cause-of-death data). We’ve implemented regular consistency checks (a toy illustration of such a check follows this list). (more)
Insufficient attention to some fundamental drivers of intervention efficacy (7/10): We updated our assumptions about net durability and chemical decay on nets (changing cost-effectiveness by −5% and 11% across geographies) and consulted experts about vaccine efficacy concerns, but we haven’t systematically addressed monitoring intervention efficacy drivers across programs. (more)
Insufficient sideways checks on coverage, costs, and program impact (7/10): We funded $900,000 for external surveys of Evidence Action’s water programs, incorporated additional DHS data in our models, and added other verification methods. We’ve made this a standard part of our process but think there are other areas where we’d benefit from additional verification of program metrics. (more)
Insufficient follow-up on potentially concerning monitoring and costing data (7/10): We’ve encouraged Helen Keller to improve its monitoring (now requiring independent checks of 10% of households), verified AMF’s data systems have improved, and published our first program lookbacks. However, we still think there are important gaps. (more)
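As a purely hypothetical illustration of what the “regular consistency checks” in the first item above might look like, here’s a sketch that flags any top charity CEA whose modeled share of benefits from long-term income effects falls outside an agreed band; the charity labels and shares are invented, and this isn’t GiveWell’s actual tooling:

```python
# Hypothetical consistency check across cost-effectiveness analyses (CEAs):
# flag any top charity whose modeled share of benefits from long-term income
# effects falls outside an agreed band. My own sketch, not GiveWell's tooling;
# the charity labels and shares below are invented for illustration.

AGREED_BAND = (0.20, 0.30)  # income effects should be 20-30% of total modeled benefits

income_effect_share = {
    "Charity A (malaria nets)": 0.24,
    "Charity B (vitamin A supplementation)": 0.28,
    "Charity C (vaccination incentives)": 0.12,  # would have passed under the old 10-40% spread
}

lo, hi = AGREED_BAND
for charity, share in income_effect_share.items():
    status = "OK" if lo <= share <= hi else f"outside the {lo:.0%}-{hi:.0%} band, needs review"
    print(f"{charity}: income effects = {share:.0%} of benefits -> {status}")
```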
I always had the impression GW engaged outside experts a fair bit, so I was pleasantly surprised to learn they thought they weren’t doing enough of it and then actually followed through so seriously. This is an A+ example of organisational commitment to, and follow-through on, self-improvement, so I’d like to quote this section in full:
In 2024, we spent ~240 days at conferences or site visits, compared to ~60 in 2023. We spoke to experts more regularly as part of grant investigations, and tried a few new approaches to getting external feedback. While it’s tough to establish impact, we think this led to four smaller grants we might not have made otherwise (totalling ~$1 million) and led us to deprioritize a ~$10 million grant we might’ve made otherwise.
More detail on what we said we’d do to address this issue and what we found (text in italics is drawn from our original report):
More regularly attend conferences with experts in areas in which we fund programs (malaria, vaccination, etc.).
In 2024, our research team attended 16 conferences, or ~140 days, compared to ~40 days at conferences in 2023.35
We think these conferences helped us build relationships with experts and identify new grant opportunities. Two examples:
A conversation with another funder at a conference led us to re-evaluate our assumptions on HPV coverage and ultimately deprioritize a roughly $10 million grant we may have made otherwise.36
We learned about Uduma, a for-profit rural water utility, at a conference and made a $480,000 grant to them in November 2024.37
We also made more site visits. In 2023, we spent approximately 20 days on site visits. In 2024, the number was approximately 100 days.38
Reach out to experts more regularly as part of grant investigations and intervention research. We’ve always consulted with program implementers, researchers, and others through the course of our work, but we think we should allocate more relative time to conversations over desk research in most cases.
Our research team has allocated more time to expert conversations. A few examples:
Our 2024 grants for VAS (vitamin A supplementation) to Helen Keller International relied significantly on conversations with program experts. Excluding conversations with the grantee, we had 15 external conversations.
We’ve set up longer-term contracts with individuals who provide us regular feedback. For example, our water and livelihoods team has engaged Daniele Lantagne and Paul Gunstensen for input on grant opportunities and external review of our research.
We spoke with other implementers about programs we’re considering. For example, we discussed our 2024 grant supporting PATH’s technical assistance for the rollout of malaria vaccines with external stakeholders in the space.39
This led to learning about some new grant opportunities. For example:
We are currently considering a $4 million grant that we learned about through an expert conversation.40
Experiment with new approaches for getting feedback on our work.
In addition to the above, we tried a few other approaches we hadn’t (or hadn’t extensively) used before. Three examples:
Following our red teaming of GiveWell’s top charities, we decided to review our iron grantmaking to understand which top research questions we should address as we consider making additional grants in the near future. We had three experts review our work in parallel with internal red teaming, so we could get input and ask questions along the way.41 We did not do this during our top charities red teaming, in the report of which we wrote: “we had limited back-and-forth with external experts during the red teaming process, and we think more engagement with individuals outside of GiveWell could improve the process.”
We made a grant to Busara to collect qualitative information on our grants to Helen Keller International’s vitamin A supplementation program in Nigeria.42
We funded the Center for Global Development to understand why highly cost-effective GiveWell programs aren’t funded by other groups focused on saving lives. This evaluation was designed to get external scrutiny from an organization with expertise in global health and development, and from other funders and decision-makers in low- and middle-income countries.
Some quick reactions:
I like that GW thinks they should allocate more time to expert conversations vs desk research in most cases
I like that GW is improving its own red-teaming process by having experts review its work in parallel
I too am keen to see what CGD find out re: why GW top-recommended programs aren’t funded by other groups you’d expect to do so
the Zipline exploratory grant is very cool; I raved about it previously
I wouldn’t have expected that the biggest driver in terms of grants made/not made would be failure to sense-check raw data in burden calculations; while they’ve done a lot to redress this, there’s still a lot more on the horizon, poised to affect grantmaking in areas like maternal mortality (previously underrated; deserves a second look)
funnily enough, they self-scored 5/10 on “insufficient focus on simplicity in cost-effectiveness models”; as someone who spent all my corporate career pained by working with big messy spreadsheets, and who’s also checked out GW’s CEAs over the years, I think they’re being a bit harsh on themselves here...

Ben Kuhn has a great essay about how:
all my favorite people are great at a skill I’ve labeled in my head as “staring into the abyss.”1
Staring into the abyss means thinking reasonably about things that are uncomfortable to contemplate, like arguments against your religious beliefs, or in favor of breaking up with your partner. It’s common to procrastinate on thinking hard about these things because it might require you to acknowledge that you were very wrong about something in the past, and perhaps wasted a bunch of time based on that (e.g. dating the wrong person or praying to the wrong god). However, in most cases you have to either admit this eventually or, if you never admit it, lock yourself into a sub-optimal future life trajectory, so it’s best to be impatient and stare directly into the uncomfortable topic until you’ve figured out what to do. …
I noticed that it wasn’t just Drew (cofounder and CEO of Wave) who is great at this, but many of the people whose work I respect the most, or who have had the most impact on how I think. Conversely, I also noticed that for many of the people I know who have struggled to make good high-level life decisions, they were at least partly blocked by having an abyss that they needed to stare into, but flinched away from.
So I’ve come to believe that becoming more willing to stare into the abyss is one of the most important things you can do to become a better thinker and make better decisions about how to spend your life.
I agree, and I think there’s an organisational analogue as well, which GiveWell exemplifies above.