Triple counting impact in EA
The problem
In the 1990s, the World Health Organization (WHO) had an important function. They had to calculate, estimate, and publish the number of deaths caused by different diseases. These numbers influenced several things, from government spending on treatment programs to the public perception of progress being made on different issues. However, even though the people doing the calculations were well-meaning and generally competent, there was a big problem. There was no oversight and the process lacked consistency, meaning that each WHO group used different methods, calculations, and assumptions. This resulted in estimates that double- and triple-counted individual deaths.
This mis-estimation was potentially fatal, since funding and intellectual resources were being directed towards certain diseases at the expense of other, more important areas. A concerned staff member at the WHO noticed the problem when he found that, in a lower income country, the deaths attributed to the four biggest killers (malaria, diarrhea, TB, and measles) added up to more than 100% of the total number of deaths in that country, and that was without counting any other causes of death. When he brought up the concern with coworkers and management, it was largely dismissed. It would have looked bad, both for the individual groups and for the WHO as a whole, to admit or address such a large mistake. Even after the staff member triple-checked his work and strengthened it through deeper research, his concern went unheard. The unspoken rule was: don’t embarrass the higher-ups.
The end result was the founding of a completely new project outside of the WHO, the Global Burden of Disease Study, which measured impact correctly, did not double or triple count deaths, and is used to this day by groups like GiveWell, the Gates Foundation, and many others.
This is a true story, paraphrased from Epic Measures, and it highlights one of my biggest concerns about the EA movement. Calculating counterfactual impact is very hard and, much like with the WHO numbers, not only does each EA organization use a different system, but each also has an incentive to publish high impact results. With impact, the calculations are even harder to do correctly than with deaths, as it is often plausible that five different people or organizations were each required for an action to happen. Sadly, if each of the five takes 100% credit, the EA movement as a whole ends up taking 500% credit for a given action.
This can also happen with donations. It would be very easy for someone to find out about EA from Charity Science, read blog posts from both GWWC and TLYCS, sign up for both pledges, and then donate directly to GiveWell (who would count this impact again). This person would be quadruple counted in EA, with each organization citing their donations as impact to justify its own operations. The problem is that, at the end of the day, if the person donated $1000, then TLYCS, GWWC, GiveWell, and Charity Science may each have spent $500 on programs to get this person into the movement or donating. Each organization would proudly report a 2:1 ratio and give itself a pat on the back, when really the EA movement as a whole just spent $2000 for $1000 worth of donations.
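To make the arithmetic concrete, here is a minimal sketch in Python. It uses only the hypothetical figures from the example (one donor giving $1000, four organizations spending $500 each); the variable names are my own and nothing here reflects any organization's actual reporting method.

```python
# Hypothetical figures from the example: one donor, four organizations.
donation = 1000  # dollars given by the single donor
spend_per_org = {
    "Charity Science": 500,
    "GWWC": 500,
    "TLYCS": 500,
    "GiveWell": 500,
}

# Each organization counts the whole donation as its own impact.
reported_ratio = {org: donation / spend for org, spend in spend_per_org.items()}

# What the movement as a whole actually spent and actually moved.
total_spend = sum(spend_per_org.values())
claimed_total = donation * len(spend_per_org)  # sum of the individual claims
movement_ratio = donation / total_spend

for org, ratio in reported_ratio.items():
    print(f"{org}: reports {ratio:.0f}:1")
print(f"Claimed in total: ${claimed_total}, actually donated: ${donation}")
print(f"Movement-wide: ${donation} moved for ${total_spend} spent "
      f"(ratio {movement_ratio})")
```

Every individual report looks healthy, yet the claims sum to four times the money that was really donated, and the movement-wide ratio falls below 1.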
The previous example used donations because it’s easy and clear cut to make the case that this is the wrong move without getting into more difficult issues, but it generalizes to talent as well. For example, recently, Fortify Health was founded. Clearly the founders deserve 100% of the impact: without them, the project certainly would not have happened. But wait a second: both of them think that without Charity Science’s support, the project would definitely not have happened. So, technically, Charity Science could also take 100% credit (from our perspective, if we had not helped Fortify Health, it would not have happened, so the project is 100% counterfactually caused by Charity Science). But wait a second, what about the donors who funded the project early on (because of Charity Science’s recommendation)? Surely they deserve some credit for impact as well! What about the fact that without the EA movement, it would have been much less likely for Charity Science and Fortify Health to connect?
With multiple organizations and individuals involved, it is very easy to attribute a lot more impact than actually happened. A project’s evaluations could easily create the perception of four times the impact it really had. This is even more likely if it’s unclear where people are taking their credit for impact from (e.g. I might publish a report on Charity Science’s overall impact with “supporting new charities” listed, but not specify the exact help I gave or how many others were involved). And this is before considering deliberate rounding or naive overestimation of the value of that project.
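The way the credit stacks up can be sketched with a toy model. This is an illustration under a simplifying assumption of my own: that every party in the Fortify Health example is strictly necessary, so the project has full value with all of them and none if any one is missing. (The text above only says the EA movement made the connection much more likely, so treating it as strictly necessary overstates its role.) The function and party labels are hypothetical.

```python
# Toy model (assumption): the project happens only if every party takes part.
REQUIRED = {"founders", "Charity Science", "early donors", "EA movement"}

def project_value(contributors):
    """Return 1.0 (the whole project) if all required parties are present, else 0.0."""
    return 1.0 if REQUIRED <= set(contributors) else 0.0

everyone = sorted(REQUIRED)
actual = project_value(everyone)

# Naive counterfactual credit: value with everyone minus value without this party.
counterfactual_credit = {
    party: actual - project_value([p for p in everyone if p != party])
    for party in everyone
}

total_claimed = sum(counterfactual_credit.values())

for party, credit in counterfactual_credit.items():
    print(f"{party}: claims {credit:.0%} of the project")
print(f"Total claimed: {total_claimed:.0%} of one project")
```

Each party's individual counterfactual claim is defensible on its own, but summing the claims yields 400% of a single project, which is the "four times the impact" perception described above.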
Sadly, all these issues occur even when everyone is trying to be as honest and careful as they can be. To jump back to the financial example, you can imagine Charity Science, GWWC and TLYCS not knowing exactly how much the person who donates $1000 is actually donating, leading to different and often over-optimistic estimates across the organizations.
The solutions
Sadly, I cannot think of a silver bullet solution. Thankfully, though, I think there are some things that can really help.
Transparent sharing of data regarding impact and the methodology to calculate impact
Mistakes like this are much more likely to happen the less clear and transparent the causal chain of impact is. Many organizations have internal counterfactual calculations, but it’s hard for donors or other organizations to make sense of the final numbers without knowing how they were estimated. Obviously, not all data is going to be shareable (e.g. the names of the people donating). However, the process for calculating impact can be shared and compared, which, in turn, can allow for an open discussion of these issues (e.g. how to disaggregate the impact of two organizations taking similar actions). It also gives the community a chance to sanity check each other’s numbers. If Charity Science were massively over-estimating something, it would be hard for external observers to point out the flaw without a high level of transparency.
Efforts towards a consistent evaluation process between organizations
The more similar the processes used by different organizations, the easier it would be to take the final numbers seriously. Something like this could be coordinated on the EA Forum and could clear up a lot of confusion regarding impact evaluation. (For example, if I hire someone to work at Charity Science, does that count as a career change?) I think that current organizations have very different intuitions and processes, and thus very different final numbers. I also think that, to increase consistency, donors should insist upon seeing the data before donating to an organization.
Independent unbiased external impact analysis
The solution to the WHO problem was not just more interdepartmental coordination and transparency. It was, in fact, independent external analysis. Although I think this is “the solution”, it’s easily the hardest to execute well. Something like this would a) produce results that are very sensitive to the evaluators’ values (e.g. if they valued one cause a lot more than another, it would be hard to generalize), b) be very time consuming (I expect it would take many hours to get a strong understanding of all the aspects of an organization; likely months to years of full time work), and c) require a fairly unprecedented level of transparency in the charity world.
Things like this can happen. I think GiveWell’s external reviewing of poverty charities is a good example of something pretty close to the ideal, and I think something similar would allow for much stronger evaluation and accountability when considering and comparing the impacts of different organizations.