Relative Impact of the First 10 EA Forum Prize Winners
We don’t normally estimate the value of small to medium-sized projects.
But we could!
If we could do this reliably and scalably, this might lead us to choose better projects.
Here is a small & very speculative attempt.
My estimates are very uncertain (ranging several orders of magnitude), but they still seem useful for comparing projects. Nonetheless, the reader is advised to not take them too seriously.
The EA Forum (and local groups) have been seeing a decent number of projects, but few are evaluated for impact. This makes it difficult to choose between projects beforehand, beyond using personal intuition (however good it might be), a connection to a broader research agenda, or other rough heuristics. Ideally, we would have something more objective, and more scalable.
As part of QURI’s efforts to evaluate and estimate the impact of things in general, and projects QURI itself might carry out in particular, I tried to evaluate the impact of 10 projects I expected to be fairly valuable.
I chose to evaluate the first 10 posts which won the EA Forum Prize, back in 2017 and 2018. For each of the 10 posts, the estimate has a structure like the one below. Note that not all estimates have every element:
Title of the post
Background information: What are some salient facts about the post?
Theory of change: If this isn’t clear, how is this post aiming to have an impact?
Reasoning about my estimate: How do I arrive at my estimate of impact given what I know about the world?
Guesstimate model: Verbal reasoning can be particularly messy, so I also provide a guesstimate model
Ballpark: A verbal estimate
Estimate: A numerical estimate of impact
If a writeup refers to a project distinct from the writeup, I generally try to estimate the impact of both the project and the writeup.
Where possible, I estimated their impact in an ad-hoc scale, Quality Adjusted Research Papers (QARPs for short), whose levels correspond to the following:
|~0.1 mQARPs||A thoughtful comment||A thoughtful comment about the details of setting up a charity|
|~1 mQARPs||A good blog post, a particularly good comment||What considerations influence whether I have more influence over short or long timelines?|
|~10 mQARPs||An excellent blog post||Humans Who Are Not Concentrating Are Not General Intelligences|
|~100 mQARPs||A fairly valuable paper||Categorizing Variants of Goodhart’s Law.|
|~1 QARPs||A particularly valuable paper||The Vulnerable World Hypothesis|
|~10-100 QARPs||A research agenda||The Global Priorities Institute’s Research Agenda.|
|~100-1000+ QARPs||A foundational popular book on a valuable topic||Superintelligence, Thinking Fast and Slow|
|~1000+ QARPs||A foundational research work||Shannon’s “A Mathematical Theory of Communication.”|
Ideally, this would both have relative meaning (i.e., I claim that an average thoughtful comment is worth less than an average good post), and absolute meaning (i.e., after thinking about it, a factor of 10x between an average thoughtful comment and an average good post seems roughly right). In practice, the second part is a work in progress. In an ideal world, this estimate would be cause-independent, but cause comparability is not a solved problem, and in practice the scale is more aimed towards long-term focused projects.
To elaborate on cause independence, upon reflection we may find out that a fairly valuable paper on AI Alignment might be 20 times as valuable as a fairly valuable paper on Food Security, and give both of their impacts in a common unit. But we are uncertain about their actual relative impacts, and they will not only depend on uncertainty, but also on moral preferences and values (e.g., weight given to animals, weight given to people who currently don’t exist, etc.). To get around this, I just estimated how valuable a project is within a field, leaving the work of categorizing and comparing fields as a separate endeavor: I don’t adjust impact for different causes, as long as it’s an established Effective Altruist cause.
Some projects don’t easily lend themselves to being rated in QARPs; in that case I’ve also used “dollars moved”. Impact is adjusted for Shapley values, which avoids double or triple-counting impact. In every example here, this will be equivalent to calculating counterfactual value, and dividing by the number of necessary stakeholders. This requires a judgment call about what counts as a “necessary stakeholder”. Intervals are meant to be 80% confidence intervals, but in general all estimates are highly speculative and shouldn’t be taken too seriously.
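To illustrate why dividing by the number of necessary stakeholders matches the Shapley value, here is a minimal sketch. The three players and the 300 units of impact are hypothetical, chosen only for illustration; the function computes exact Shapley values by averaging marginal contributions over all orderings, for a game in which the impact only materializes if everyone participates.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


# Hypothetical game: 300 units of impact arise only if all three parties take part.
PLAYERS = ("funder", "grantee", "new_hire")
TOTAL_IMPACT = 300.0


def all_necessary(coalition):
    """Characteristic function: full impact for the full coalition, else nothing."""
    return TOTAL_IMPACT if coalition == frozenset(PLAYERS) else 0.0


result = shapley_values(PLAYERS, all_necessary)
for player, share in result.items():
    print(f"{player}: {share:.1f}")
```

Because every player is necessary, each one's counterfactual value is the full 300, yet the Shapley shares split it three ways and sum to the total rather than to three times the total.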
Total project impact:
$100k in money donated
Gave a substantial monetary and status boost to ALLFED
Reasoning about my estimate: The impact of the grants is the cumulative one-year output of various researchers at ALLFED/AI Impacts/GCRI/WASR, plus status effects. Because these are relatively young organizations, I’m going to estimate something like a salary of $20k to $40k, and an impact of 10 to 50 mQARPs/month per new hire. Note that in order to not double count impact, the impact has to be divided between the funding providers and the grantee (and possibly with the new hires as well). Counterfactual fungibility for AI Impacts or for ALLFED does not really seem to be a concern given that the funding which this grant could have replaced was probably also directed adequately.
Guesstimate model here
Ballpark: Between 6 fairly valuable papers (600 mQARPs) and 3 really good ones (3 QARPs).
Estimate: 500 mQARPs to 4 QARPs
Counterfactual impact of Adam Gleave winning the donor lottery (as opposed to other winners):
If one looks at the list of lottery donors: https://app.effectivealtruism.org/lotteries/31553453298138, the participants I can recognize also strike me as intelligent and savvy.
Reasoning: If I think about how much I would pay to have the winner win again, my first answer is “not much”. But I then learned that Adam Gleave went on to become a part of the Long-Term Future Fund, which might be indicative of a particularly good fit. Another consideration is that the impact ought to be shared between the people who participated in the donor lotteries, and the organizations which made such a donor lottery happen. As an aside, one might have thought that he could have donated the money to EA Funds, but EA Funds wasn’t as active or as legible in 2017. Lastly, note that for the lower end of the 80% confidence interval to be above zero, Adam will have to have been in the top 20% of donor lottery participants, which intuitively seems likely but about which I’m not completely sure. I don’t really have a good factorization of these considerations, so I’m going more with an educated guess.
Ballpark: Roughly as valuable as a fairly valuable paper.
Estimate: −50 mQARPs to 500 mQARPs
Impact of the writeup alone:
Made donor lottery reports into “a thing”, and future donor lottery winners more informed and willing to share their conclusions.
It seems like the post gave a status boost to ALLFED
Reasoning: With regards to impact achieved so far vs impact to be achieved in the future, most of the impact is still in the future. However, the proxies of impact, particularly the health of ALLFED as an organization, look good. So I feel reasonably ok making an educated guess.
Ballpark: I imagine that the mean is a little less valuable than a fairly valuable paper, though the positive tail could be much higher if one values reputational effects to ALLFED highly and thinks they were significant. It’s hard to see how the writeup could have had a negative effect.
Estimate: 50 mQARPs to 250 mQARPs.
Impact of the hiring round itself:
Background considerations: From Effective Altruism Foundation: Plans for 2020: “We planned to hire a Research Analyst for grantmaking and an Operations Analyst and made two job offers. One of them was not accepted; the other one did not work out during the first few months of employment. In hindsight, it might have been better to hire even more slowly and ensure we understood the roles we were hiring for better. Doing so would have allowed us to make a more convincing case for the positions and hire from a larger pool of candidates”
Reasoning: Impact was then probably negative, given that hiring rounds seem to be time-intensive. I don’t know how long applications took, but I’m going to say 1 to 10 FTE (full time equivalents) for 2 weeks (counting the time that applicants spent). This gives 0.5 to 5 FTE-months. Say that one FTE can produce 5 to 20 mQARPs per month; this gives you an estimate of 15 to 75 mQARPs lost. When discussing a draft, commenters pointed out that I didn’t count opportunity costs for other projects. However, in this case the opportunity costs fell not only on the people who did the project, but also on the applicants, which feels meaningfully different.
Guesstimate model here.
Ballpark: Loss of between half an excellent EA forum blog post and one fairly valuable paper.
Estimate: −70 to −5 mQARPs
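The multiplication of two uncertain ranges in the reasoning above can be sketched as a small Monte Carlo simulation. This is not the linked Guesstimate model: it assumes, for illustration only, that each range is the 80% interval of an independent lognormal distribution, and the helper names are made up.

```python
import math
import random

Z90 = 1.2815515655446004  # 90th percentile of the standard normal distribution


def lognormal_from_80ci(low, high, rng):
    """Draw from a lognormal whose 10th and 90th percentiles are low and high."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z90)
    return rng.lognormvariate(mu, sigma)


def percentile(sorted_xs, q):
    """Nearest-rank percentile of an already sorted list, with q in [0, 1]."""
    idx = min(len(sorted_xs) - 1, max(0, int(q * len(sorted_xs))))
    return sorted_xs[idx]


def hiring_round_loss(n_samples=100_000, seed=0):
    """Return the 10th and 90th percentiles of mQARPs lost in the hiring round."""
    rng = random.Random(seed)
    losses = []
    for _ in range(n_samples):
        fte_months = lognormal_from_80ci(0.5, 5, rng)        # range from the text
        mqarps_per_month = lognormal_from_80ci(5, 20, rng)   # range from the text
        losses.append(fte_months * mqarps_per_month)
    losses.sort()
    return percentile(losses, 0.10), percentile(losses, 0.90)


low, high = hiring_round_loss()
print(f"80% interval for mQARPs lost: {low:.0f} to {high:.0f}")
```

Under these assumed distribution shapes, the interval is narrower than what one gets by multiplying the endpoints directly (2.5 to 100), because both factors rarely land at their extremes at the same time.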
When reviewing this section, some commenters pointed out that, for them, calculating the opportunity cost didn’t make as much sense. I disagree with that. Further, I’m also not attempting to calculate the expected value ex ante; in this case this feels inelegant because the expected value will depend a whole bunch on the information, accuracy and calibration of the person doing the expected value calculation, and I don’t want to estimate how accurate or calibrated the piece’s author was at the time (though he is pretty good now).
Impact of the writeup (as opposed to impact of the hiring process):
Background considerations: The post mentions that “This post might be useful for other organizations and (future) applicants in the community.” This seems to have been the case: for example, the Fish Welfare Initiative (Hiring Process and Takeaways from Fish Welfare Initiative) mentioned: “We relied heavily on the following resources [among which was the writeup under consideration] and highly recommend looking them over.” The fact that the writeup was highly upvoted also suggests that people found it valuable.
Reasoning: The way to estimate impact here would be something like: “Counterfactual impact of the best hire(s) in the organizations it influenced, as opposed to the impact of the hires who would otherwise have been chosen”. Let’s say it influenced 1 to 5 hiring rounds, and advice in the post allowed advice-takers to hire 0 to 3 people per organization who were 1 to 10% more effective, and who stayed with the organization for 0.5 to 3 years. If one hire can produce 5 to 20 mQARPs worth of impact per year, that corresponds to 0 to 100 mQARPs per year. But in order to not triple count impact, that has to be shared between the advice-giver, the advice-taker, the hire, possibly funders, etc., so it amounts to ~1 to 30 mQARPs. Note that by saying “one hire can produce 5 to 20 mQARPs worth of impact per month”, I totally elide the (difficult) problem of comparing impact in different areas.
Guesstimate model: here
Ballpark: 0 to three excellent EA forum posts.
Estimate: 0 to 30 mQARPs
Impact of the post and the research:
Models of impact:
Lower donations to Cool Earth.
Cautionary tales / providing a good example of research
The project is a good piece of research, and might create common knowledge that Sanjay has the ability to produce good research, which might be valuable in itself
The project seems to have alienated Cool Earth slightly.
The post might have spurred more research into EA/Climate Change
Reasoning: Commenters in the post indicated that the prevailing view at the time was that climate change was much less exciting than global poverty, and that nobody was really donating to Cool Earth. That nobody commented saying that they had changed their mind about donations is also weak evidence that nobody did.
Ballpark: −0.5 excellent EA forum posts to +2 excellent EA forum posts
Estimate: −5 to 20 mQARPs
Expected impact of the project:
Reasoning: I’m going to pull some numbers out of thin air, and say that the project had a 1% to 10% probability of funding a startup worth $0.5 to $10 million, with a very long tail after that. I could try to guess the counterfactual amount of animal suffering averted, but I’m not really familiar with the details. Counterfactual replaceability of the potential company seems low, and in this regard the writeup still seems prophetic: Beyond Meat and Impossible Foods seem to be doing well in the beef substitute space, whereas the authors considered chicken nuggets and fish sticks, rather than beef replacements, as they had a higher volume of suffering per kilogram. Chicken nuggets and fish sticks still seem neglected today.
Estimate: 1% to 10% probability of funding a startup worth $0.5 million to $10 million.
Impact of the project:
Reasoning: Given that the authors didn’t actually end up starting a company, actual impact seems close to 0. However, it might have been slightly positive because of e.g. motivational effects, or because the knowledge accumulated during the project was used later (e.g., in this post by the same author, which also won an EA forum prize).
Ballpark: 0 to 2 excellent EA Forum posts.
Estimate: 0 to 20 mQARPs
Impact of the writeup:
Reasoning: Small, given that nobody seems to have started a company or nonprofit because of the ideas outlined in the post (yet!)
Ballpark: 0 to 2 good EA forum posts.
Estimate: 0 to 20 mQARPs
Theory of change: By evaluating charities in the AI alignment space, this post enables donors to better direct their donations. As such, I think that the Shapley value should naïvely be something like 1/3rd of the counterfactual impact of donations. I.e., 1/3rd of the difference between how donors donated as a result of reading the post, and how they otherwise would have donated. 1/3rd of this counterfactual impact would go to Larks, 1/3rd to the donor, and 1/3rd to the charity.
Reasoning: I have reason to believe that the amount of money influenced was large, and Larks’s writeup was further the only one available. My confidence interval is then $100,000 to $1M in moved funding, of whose impact Larks gets a 1/3rd share. (Note: In his rot13 note at the end, Larks decided not to donate or to recommend donations to MIRI, which, in hindsight, given the failure of their undisclosed research, seems to have been the right move.) At this point, there are two competing effects: On the one hand, there might be a small number of AI safety organizations competing for the same funding. On the other hand, having a clear writeup may have made it possible for donors to donate at all. In the first case, the impact of Larks’s review is the difference in efficiency between the different charities inside the AI safety space. In the second case, the impact of Larks’s review is higher because donors might not otherwise have donated money that year. I’m somewhat arbitrarily estimating these considerations to range from halving the expected impact to leaving it mostly as it is. Larks’s preferred AI safety organization was able to convert $3M into 12 papers/projects (I’m counting coauthored papers as 1/2), each worth say 50 to 250 mQARPs, plus Shah’s AI Alignment Newsletter, which seems particularly valuable but also not really bottlenecked on funding at the time.
Guesstimate: You can see my first guesstimate model here, but I think that this was too low because it only took into account the value of papers, so here is a second model, which takes into account that the value which does not come from papers is 2 to 20x the magnitude of the value which does come from papers. I’m still uncertain, so my final estimate is even more uncertain than the Guesstimate model.
Ballpark: Between four excellent EA forum posts and 8 fairly valuable papers personally attributable to Larks (per Shapley calculations)
Estimates: $100,000 to $1M in moved funding, estimated to be 40 to 800 mQARPs
Note: Previously, I had given ALLFED hires something like 10 to 50 mQARPs/month. But if I calculate the mQARPs per hire per month of an AI safety organization, I get something like 2 to 15 mQARPs per person per month. It could be the case that this isn’t wrong, given that an organization favored by Larks produced 12 papers interesting enough for Larks to mention, on a $3M budget (and I model papers as 1/2 to 1/20th of the total value of an organization). It could also be the case that when ALLFED was just starting to scale up, it really was ~5x as valuable as already established AI safety organizations. In any case, do notice that my confidence intervals generally span more than a 5x range. Nonetheless, I’d still flag that producing good AI safety papers seems fairly expensive.
Models of impact:
People read about mental health as a cause area, and decide to work on it
People read about mental health as a cause area, and decide to donate to it
Laying out his reasoning about why mental health is important makes it easier to create and fundraise for the Happier Lives Institute
This didn’t happen in this case, but it has happened before and could have happened: Persuasive enough counter-arguments proposed by commenters could have changed Plant’s bottom line, and e.g., convinced the author to branch away from mental health and towards another field of research.
Reasoning: I don’t really have great models here. Most of the value of the post seems to come from making it easier for Michael Plant to set up HLI (the Happier Lives Institute), but most of the value of HLI seems to still lie in the future. Note that GiveWell seems to briefly have looked into StrongMinds, but then seems to have abandoned the idea, which is slightly weird.
Estimate: 0 to “totally unknown”. If forced to guess, I’d go with 0 to 100 mQARPs, eliding the (difficult) problem of comparing impact in different areas.
Models of impact:
By matching Facebook’s funds, the project directed money towards more effective charities.
By matching Facebook’s funds, the project elicited money which wouldn’t otherwise have been donated
The EA Giving Tuesday team also built expertise for future years
The EA Giving Tuesday team grows and becomes more capable by attempting new projects
Reasoning: The original post has already done much of the work for me: $469k worth of donations were matched, as opposed to $48k in 2017 (!!), which gives reason to think that most of it wouldn’t have happened in the absence of this project. Specifically, I’m going to estimate 75% to 100%. Responses from a follow-up survey suggested “that $85k (12%) of donations may have been counterfactually caused by our initiative, though this estimate is highly uncertain”, so I’m going to estimate that 20 to 80% of those who claimed to donate more actually did. Counterfactual replaceability is also a concern (if the current team hadn’t organized this, another EA team might have), but not a great one, since the team who would have taken their place is freed to do another project instead, and the current team is probably more capable than someone else doing it from scratch.
Estimate: $390K to $530K counterfactual donations, of which the Shapley value assigns $130K to $230K to the EA Giving Tuesday team.
Impact of the post alone:
Theory of change: By having more insight into how it functions, and into its own demographics and opinions, the EA community can improve itself.
Reasoning: I’m not really sure what concrete improvements the EA community has implemented as a result of this post, but the fact that I can’t think of any does not mean they don’t exist.
Ballpark: 0 to two excellent EA forum posts.
Estimate: 0 to 20 mQARPs.
Impact of the EAGx:
Reasoning: I’m pulling this completely out of thin air, but: 20 to 80% of the 200 attendees are made 2 to 10% more altruistic for the next 1 to 3 years. In particular, suppose that their average donation (either in terms of money or in terms of time and effort) is equivalent to $100 to $10000 /year.
Estimate: $200 to $350K in moved donations, with a mean influence of $30K, before adjusting for Shapley values, and $100 to $350K after adjusting for Shapley values. Note how this is $100 to $350,000, not $100K to $350K. This confidence interval is absurdly wide, which is what happens when you pull too many numbers out of thin air.
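One way to see why the interval gets so wide, as a rough sketch: assume (this is an illustrative assumption, not necessarily what the model does) that the uncertain factors are independent and lognormal, with each stated range $[a_i, b_i]$ read as an 80% interval. The log of a product of lognormals is a sum of normals, so the variances add, and the width of the resulting 80% interval $[a, b]$ satisfies:

```latex
\ln\frac{b}{a} \;=\; \sqrt{\sum_i \left(\ln\frac{b_i}{a_i}\right)^{2}}
```

For the four ranges above (20 to 80%, 2 to 10%, 1 to 3 years, $100 to $10,000), the terms are $\ln 4$, $\ln 5$, $\ln 3$ and $\ln 100$:

```latex
\ln\frac{b}{a} \approx \sqrt{1.92 + 2.59 + 1.21 + 21.21} \approx 5.19,
\qquad \frac{b}{a} \approx e^{5.19} \approx 180
```

Even under these tidy assumptions the product spans more than two orders of magnitude, and the donation range alone accounts for most of that width.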
Impact of the writeup:
Theory of change: By having a writeup, future event organizers can better organize events.
Background considerations: The writeup gives advice about 9 different areas: Venue, Speaker Outreach, Funding, Web Presence, Audio Visual Services, Food, Marketing and Design, Presentations and Day-of Execution.
Reasoning: Suppose that 0 to 5 events are made 1 to 20% better because their organizers read the writeup under consideration. In particular, suppose that those events have 50 to 500 attendees, and that 1 to 10% of attendees are made 1 to 10% more altruistic for the next 1 to 3 years. In particular, suppose that their average donation (either in terms of money or in terms of time or effort) is equivalent to $100 to $10000 /year.
Estimate: $0 to $1000 worth of influenced donations, with a mean influence of $100. The Shapley value would assign an influence worth of $0 to $500.
Theory of change: Better analysis leads to better grants, and better strategies by animal welfare advocates. This in turn leads to less animal suffering.
Reasoning: I’m rather uncertain. The impact of this post depends almost solely on whether other people, and in particular Open Philanthropy and other animal advocates, heeded its conclusions. There is reason to believe that its advice has in fact been heeded, but it is also likely that it would have been heeded without this article. For example, some of the points mentioned in the article had been brought up before. That said, even if its ideas were already floating around, the post is extremely thorough, and might have solidified and given weight to the ideas therein covered.
Estimate: 0 mQARPs to 100 mQARPs, which elides the (difficult) problem of comparing impact in different areas.
|2017 Donor Lottery Grant||Between 6 fairly valuable papers and 3 really good ones||500 mQARPs to 4 QARPs|
|Adam Gleave winning the 2017 donor lottery (as opposed to other participants)||Roughly as valuable as a fairly valuable paper||-50 mQARPs to 500 mQARPs|
|2017 Donor Lottery Report (Writeup)||A little less valuable than a fairly valuable paper||50 mQARPs to 250 mQARPs|
|EAF’s Hiring Round||Loss of between half an excellent EA forum blog post and one fairly valuable paper||-70 to -5 mQARPs|
|Takeaways from EAF’s Hiring Round (Writeup)||Between two good EA forum posts to a fairly valuable paper||0 to 30 mQARPs|
|Why we have over-rated Cool Earth||-1/2 excellent EA forum posts to +1.5 excellent EA forum posts||-5 to 20 mQARPs|
|Alternative Meat Startup Team (Project)||0 to 1 excellent EA Forum posts.||1 to 50 mQARPs|
|Lessons Learned from a Prospective Alternative Meat Startup Team (Writeup)||0 to 5 good EA forum posts||0 to 20 mQARPs|
|2018 AI Alignment Literature Review and Charity Comparison||Between two excellent EA forum posts and 6 fairly valuable papers||40 to 800 mQARPs|
|Cause profile: mental health||Very uncertain||0 to 100 mQARPs|
|EA Giving Tuesday Donation Matching Initiative 2018||$130K to $230K in Shapley-adjusted funding towards EA charities|
|EA Survey 2018 Series: Cause Selection||0 to an excellent EA forum post.||0 to 20 mQARPs|
|EAGx Boston 2018 (Event)||$100 to $350K in Shapley-adjusted funding towards EA charities|
|EAGx Boston 2018 Postmortem (Writeup)||$0 to $500 in Shapley-adjusted donations towards EA charities|
|Will companies meet their animal welfare commitments?||0 to a fairly valuable paper||0 to 100 mQARPs|
Comments and thoughts
An initial challenge in this domain relates to how to attain calibration. The way I would normally calibrate intuitions on a domain is by making a number of predictions at various levels of gut feeling, and then seeing empirically how frequently predictions made at different levels of gut feeling come out right. For example, I’ve previously found that my gut feeling of “I would be very surprised if this was false” generally corresponds to 95% (so 1 in 20 times, I am in fact wrong). But in this case, when considering or creating a new domain, I can’t actually check my predictions directly against reality, but instead have to check them against other people’s intuitions.
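The calibration check described above, for domains where predictions can be compared against reality, can be sketched as follows. The track record here is entirely hypothetical and exists only to show the bookkeeping: group predictions by stated confidence and compare with how often they came true.

```python
from collections import defaultdict


def calibration_table(predictions):
    """Map each stated probability to (observed hit rate, number of predictions)."""
    buckets = defaultdict(list)
    for stated, came_true in predictions:
        buckets[stated].append(came_true)
    return {
        stated: (sum(outcomes) / len(outcomes), len(outcomes))
        for stated, outcomes in sorted(buckets.items())
    }


# Hypothetical track record: 20 predictions at a "95%" gut feeling (19 correct)
# and 10 predictions at a "70%" gut feeling (6 correct).
record = (
    [(0.95, True)] * 19 + [(0.95, False)]
    + [(0.70, True)] * 6 + [(0.70, False)] * 4
)

table = calibration_table(record)
for stated, (observed, n) in table.items():
    print(f"stated {stated:.0%}: observed {observed:.0%} over {n} predictions")
```

In this made-up record the 95% bucket is well calibrated while the 70% bucket is overconfident. The difficulty raised above is that impact estimates offer no ground-truth outcome to put in the second element of each pair.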
Comparison is still possible
Despite my wide levels of uncertainty, comparison is still possible. Even though I’m uncertain about the impact of both “Will companies meet their animal welfare commitments?” and “Lessons Learned from a Prospective Alternative Meat Startup Team”, I’d prefer to have the first over the second.
Similarly, while EAGx Boston 2018 and the EA Giving Tuesday Donation Matching Initiative might have taken similar amounts of time to organize, by comparably capable people, I prefer the second. This is in large part because EAGx events are scalable, whereas Giving Tuesdays are not.
I was also surprised by the high cost of producing papers when estimating the value of Larks’ review (though perhaps I shouldn’t have been). It could be the case that this was a problem with my estimates, or that papers truly are terribly inefficient.
Ozzie Gooen has in the past suggested that one could build a consensus around these kinds of estimates, and scale them further. In addition, one could also use these kinds of estimates to choose one’s own projects, or to recommend projects to others, and see how that fares. Note how in principle, these kinds of estimates don’t have to be perfect or perfectly calibrated, they just have to be better than the implicit estimates which would otherwise have been made.
In any case, there are also details to figure out or justify. For example, I’ve been using Shapley values, which I think are a more complicated, but often more appropriate, alternative to counterfactual values. Normally, this just means that I divide the total estimated impact by the estimated number of stakeholders, but sometimes, like in the case of a hiring round, I have the intuition that one might want to penalize the hiring organization for the opportunity cost borne by applicants, even though that’s not what Shapley values recommend. Further, it’s also sometimes not clear how many necessary stakeholders there are, or how important each stakeholder is, which makes the Shapley value ambiguous, or subject to a judgment call.
I’ve also been using a cause-impartial value function. That is, I judge a post in the animal welfare space using the same units as for a post in the long-termist space. But maybe it’s a better idea to have a different scale for each cause area, and then have a conversion factor which depends on the reader’s specific values. If I continue working on this idea, I will probably go in that direction.
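The alternative described above, per-cause scales plus a reader-specific conversion factor, might look roughly like this. The intervals are two estimates from this post; the cause labels and the conversion weights are hypothetical placeholders that a reader would replace with their own values.

```python
def to_common_units(estimates, conversion):
    """Convert per-cause (low, high) intervals into one reader-specific unit."""
    return {
        (cause, project): (low * conversion[cause], high * conversion[cause])
        for (cause, project), (low, high) in estimates.items()
    }


# Within-cause estimates in mQARPs, taken from the estimates in this post.
estimates = {
    ("animal_welfare", "Will companies meet their animal welfare commitments?"): (0, 100),
    ("ai_alignment", "2018 AI Alignment Literature Review and Charity Comparison"): (40, 800),
}

# Hypothetical reader: values one animal-welfare mQARP at half an AI-alignment mQARP.
reader_conversion = {"animal_welfare": 0.5, "ai_alignment": 1.0}

converted = to_common_units(estimates, reader_conversion)
for (cause, project), (low, high) in converted.items():
    print(f"{project} [{cause}]: {low:g} to {high:g} reader-weighted mQARPs")
```

Keeping the conversion table separate from the estimates means that two readers with different values can share the same within-cause estimates and still disagree about the cross-cause ranking.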
Lastly, besides total impact, we also care about efficiency. For small and medium projects, I think that the most important kind of efficiency might be time efficiency. For example, when choosing between a project worth 100 mQARPs and one which is worth 10 mQARPs, one would also have to look at how long each takes, because maybe one can do 50 projects each worth 10 mQARPs in the time it takes to do a very elaborate 100 mQARPs project.
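The comparison in the example above can be made explicit with a small sketch. The impact values come from the example; the durations are hypothetical, picked so that 50 small projects fit in the time of one elaborate project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    value_mqarps: float  # estimated impact per project
    weeks: float         # hypothetical time cost per project

    def mqarps_per_week(self) -> float:
        return self.value_mqarps / self.weeks


elaborate = Project("one elaborate project", value_mqarps=100, weeks=50)
small = Project("one small project", value_mqarps=10, weeks=1)

budget_weeks = 50  # hypothetical time budget

# Total impact if the whole budget is spent repeating one kind of project.
totals = {
    p.name: int(budget_weeks // p.weeks) * p.value_mqarps
    for p in (elaborate, small)
}

for p in (elaborate, small):
    print(f"{p.name}: {p.mqarps_per_week():g} mQARPs/week, "
          f"{totals[p.name]:g} mQARPs over {budget_weeks} weeks")
```

With these assumed durations the small projects come out ahead per unit of time, even though each is worth a tenth of the elaborate project.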
Thanks to David Manheim, Ozzie Gooen and Peter Hurford for thoughts, comments and suggestions.