Twin challenges: constructing a metric that (1) we (EA) will not hate, and (2) will not confuse the public
Comparability across causes and across outcomes is very difficult
What credit do I get for things like those below, and how can we compare these in a way that is satisfying to us, and understandable to the larger public, including billionaires?
I donate $1 billion …
FIRST-WORLD: To elderly hospices in the US and the UK
GIVEWELL: To GiveWell charities,
GIVEWELL*: one of which is later found to have been funding a program whose impact is thrown in doubt
GIVEWELL-ESQUE: To a non-GiveWell charity working to prevent malaria. They use a method similar to AMF's, but GiveWell didn't have the resources to evaluate them, and they were 'too similar' to be worth a separate evaluation
OXFAM: To Oxfam
ANIMALS: to successfully promote legislation to end prawn eyestalk ablation in Chile
LONGTERMIST: to fund AI safety research, generating research papers deemed very interesting
GOOD-FAILED-BET: To fund research into a possible cure for Alzheimer's disease, which looked promising but turned out unsuccessful.
It would be very hard to resolve, even amongst ourselves, how to rank GIVEWELL vs ANIMALS vs LONGTERMIST.
So, should we limit it to ‘GH&D only’? But that would drag attention away from animals and LT causes that many/most EAs value above all else.
Perhaps a good first pass would simply be to sum “donations to all plausibly-high-impact charities”… maybe “all of the above except for FIRST-WORLD”? But then, we would probably want to discount OXFAM relative to the others… but by how much? And how can we claim to measure the GIVEWELL, ANIMALS, and LONGTERMIST benefits in the same units? Unless the value we assign to prawns' sentient life/suffering happens to add up to just the sweet spot where the expected good accomplished equals that of a top GiveWell charity, one will vastly outweigh the other.
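To make the weighting problem concrete, here is a minimal sketch in Python of what that "sum of plausibly-high-impact donations" metric would look like. Every discount weight below is a made-up placeholder, not an estimate; choosing those weights is exactly the unresolved question above.

```python
# Hypothetical sketch: credit = sum of (amount donated) x (discount weight).
# Every weight below is a placeholder, not an estimate; choosing them IS the
# unresolved problem described above.
DISCOUNT = {
    "GIVEWELL": 1.0,        # benchmark: a top GiveWell charity
    "GIVEWELL-ESQUE": 0.8,  # similar method, never evaluated
    "OXFAM": 0.3,           # discounted relative to the others... but by how much?
    "ANIMALS": 1.0,         # only right if prawn welfare lands at the 'sweet spot'
    "LONGTERMIST": 1.0,     # same problem
    "FIRST-WORLD": 0.0,     # excluded in this first pass
}

def impact_credit(donations: dict) -> float:
    """donations: category label -> dollars given; returns 'GiveWell-equivalent dollars'."""
    return sum(amount * DISCOUNT.get(category, 0.0)
               for category, amount in donations.items())

print(impact_credit({"GIVEWELL": 1e9}))  # 1e9
print(impact_credit({"OXFAM": 1e9}))     # 3e8, under the arbitrary 0.3 weight
```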
The evidence synthesis base is thin
Even within global public health/development, we have basically a single source of public evaluations that we trust (GiveWell), or at most a handful (public Open Philanthropy reports? Founders Pledge). These give rigorous assessments and accountings of a handful of interventions and charities, backed by strong academic evidence. I think we can be fairly confident that these interventions (like bednets and micronutrients) are in fact very likely to have strong positive effects.
But what do we do with charities such as Oxfam, Doctors Without Borders, etc., where, I'd guess, most of the GH&D giving goes? As far as I know there has been no credible comparative effectiveness rating of these, because it's very difficult: they do many things, and their theories of change involve some things that are harder to measure. GiveWell does not say that 'AMF is 10.6 times more impactful than Oxfam'; they just don't report on this.
For reasons including 'the ability to make an impact list', I've been advocating that we do more to come up with credible, reasoning-transparent metrics of the effectiveness of charities:
‘going down the list’ where the evidence is thinner
stating our assumptions clearly, including our uncertainty, and giving measures of our calibration (see the sketch after this list)
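As a minimal sketch of what a 'reasoning-transparent' estimate could look like: each assumption is an explicit probability distribution, so the output is a distribution rather than a point estimate. None of the numbers below are real; they only illustrate the format.

```python
# Hypothetical sketch: assumptions as explicit distributions, output as a
# distribution. All parameters are placeholders, not real estimates.
import random

def cost_per_outcome() -> float:
    cost_per_unit = random.lognormvariate(1.6, 0.3)  # e.g. dollars per net, uncertain
    outcomes_per_unit = random.betavariate(2, 400)   # e.g. outcomes per net, very uncertain
    return cost_per_unit / outcomes_per_unit

samples = sorted(cost_per_outcome() for _ in range(10_000))
print("median cost per outcome:", round(samples[5_000]))
print("90% interval:", (round(samples[500]), round(samples[9_500])))
```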
I was hopeful ImpactMatters would go in this direction, but I didn’t see it. SoGive might still, if the funding is there. There are also some good initiatives coming out of QURI which would need alternate funding, and I think HLI is working in this area also.
I also mention this in my response to your other comment, but in case others didn't notice it: my current best guess for how we can reasonably compare across cause areas is to use something like WALYs (welfare-adjusted life-years). For animals my guess is we'll adjust WALYs by some measure of brain complexity.
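A minimal sketch of what that adjustment might look like, with placeholder sentience weights standing in for 'some measure of brain complexity'; the actual values are precisely what's in dispute.

```python
# Hypothetical sketch: cross-species comparison via adjusted WALYs.
# The weights below are placeholders, not real estimates.
SENTIENCE_WEIGHT = {
    "human": 1.0,
    "chicken": 0.1,    # placeholder
    "prawn": 0.001,    # placeholder
}

def human_equivalent_walys(species: str, walys: float) -> float:
    """Convert a species' WALYs into human-equivalent units."""
    return walys * SENTIENCE_WEIGHT[species]

print(human_equivalent_walys("prawn", 1_000_000))  # 1000.0 under these weights
```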
In general the rankings will be super sensitive to assumptions. Through really high quality research we might be able to reduce disagreements a little, but no matter what there will still be lots of disagreements about assumptions.
I mentioned in the post that the default ranking might eventually become some blend of rankings from many EA orgs. Nathan has a good suggestion below about using surveys to do this blending. A key point is that you can factor out just the differences in assumptions between two rankings and survey people about which assumptions they find most credible.
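As a sketch of the blending mechanics (org names, charities, and credibility weights below are all invented for illustration): convert each org's ranking into rank positions, then average them using survey-derived credibility weights.

```python
# Hypothetical sketch of blending rankings from several orgs.
rankings = {
    "OrgA": ["AMF", "Oxfam", "HospiceCo"],  # rank 1 = best
    "OrgB": ["Oxfam", "AMF", "HospiceCo"],
}
weights = {"OrgA": 0.6, "OrgB": 0.4}  # survey-derived credibility, summing to 1

def blended_ranking(rankings: dict, weights: dict) -> list:
    """Average each charity's rank position, weighted by org credibility."""
    charities = {c for ranked in rankings.values() for c in ranked}
    score = {
        c: sum(weights[org] * (ranked.index(c) + 1)
               for org, ranked in rankings.items())
        for c in charities
    }
    return sorted(score.items(), key=lambda kv: kv[1])  # lower = better

print(blended_ranking(rankings, weights))
# [('AMF', 1.4), ('Oxfam', 1.6), ('HospiceCo', 3.0)]
```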
I think you highlight something really important at the end of your post about the benefit of making these assumptions explicit.