Relative Impact of the First 10 EA Forum Prize Winners
Summary
We don’t normally estimate the value of small to medium-sized projects.
But we could!
If we could do this reliably and scalably, this might lead us to choose better projects.
Here is a small and very speculative attempt.
My estimates are very uncertain (ranging several orders of magnitude), but they still seem useful for comparing projects. Nonetheless, the reader is advised to not take them too seriously.
Introduction
The EA Forum—and local groups—have been seeing a decent number of projects, but few are evaluated for impact. This makes it difficult to choose between projects beforehand, beyond using personal intuition (however good it might be), a connection to a broader research agenda, or other rough heuristics. Ideally, we would have something more objective, and more scalable.
As part of QURI’s efforts to evaluate and estimate the impact of things in general, and projects QURI itself might carry out in particular, I tried to evaluate the impact of 10 projects I expected to be fairly valuable.
Methodology
I chose the first 10 posts which won the EA Forum Prize, back in 2017 and 2018, to evaluate. For each of the 10 posts, each estimate has a structure like the one below. Note that not all estimates will have each element:
Title of the post
Background information: What are some salient facts about the post?
Theory of change: If this isn’t clear, how is this post aiming to have an impact?
Reasoning about my estimate: How do I arrive at my estimate of impact given what I know about the world?
Guesstimate model: Verbal reasoning can be particularly messy, so I also provide a guesstimate model
Ballpark: A verbal estimate
Estimate: A numerical estimate of impact
If a writeup refers to a project distinct from the writeup, I generally try to estimate the impact of both the project and the writeup.
Where possible, I estimated their impact on an ad-hoc scale, Quality Adjusted Research Papers (QARPs for short), whose levels correspond to the following:
| Value | Description | Example |
|---|---|---|
| ~0.1 mQARPs | A thoughtful comment | A thoughtful comment about the details of setting up a charity |
| ~1 mQARPs | A good blog post, a particularly good comment | What considerations influence whether I have more influence over short or long timelines? |
| ~10 mQARPs | An excellent blog post | Humans Who Are Not Concentrating Are Not General Intelligences |
| ~100 mQARPs | A fairly valuable paper | Categorizing Variants of Goodhart’s Law |
| ~1 QARP | A particularly valuable paper | The Vulnerable World Hypothesis |
| ~10-100 QARPs | A research agenda | The Global Priorities Institute’s Research Agenda |
| ~100-1000+ QARPs | A foundational popular book on a valuable topic | Superintelligence; Thinking, Fast and Slow |
| ~1000+ QARPs | A foundational research work | Shannon’s “A Mathematical Theory of Communication” |
Ideally, this would both have relative meaning (i.e., I claim that an average thoughtful comment is worth less than an average good post), and absolute meaning (i.e., after thinking about it, a factor of 10x between an average thoughtful comment and an average good post seems roughly right). In practice, the second part is a work in progress. In an ideal world, this estimate would be cause-independent, but cause comparability is not a solved problem, and in practice the scale is more aimed towards long-term focused projects.
To elaborate on cause independence: upon reflection, we may find that a fairly valuable paper on AI Alignment is, say, 20 times as valuable as a fairly valuable paper on Food Security, and give both of their impacts in a common unit. But we are uncertain about their actual relative impacts, which depend not only on empirical uncertainty, but also on moral preferences and values (e.g., the weight given to animals, or to people who don’t yet exist). To get around this, I just estimated how valuable a project is within its field, leaving the work of categorizing and comparing fields as a separate endeavor: I don’t adjust impact across causes, as long as the cause is an established Effective Altruist one.
Some projects don’t easily lend themselves to being rated in QARPs; in those cases I’ve also used “dollars moved”. Impact is adjusted for Shapley values, which avoids double- or triple-counting impact. In every example here, this is equivalent to calculating counterfactual value and dividing by the number of necessary stakeholders. This requires a judgment call about what counts as a “necessary stakeholder”. Intervals are meant to be 80% confidence intervals, but in general all estimates are highly speculative and shouldn’t be taken too seriously.
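To make the adjustment concrete, here is a minimal sketch in Python (the numbers are hypothetical illustrations, not tied to any estimate below):

```python
# Sketch of the Shapley adjustment used in this post: take a project's
# counterfactual value and split it evenly among the necessary stakeholders.
# All numbers here are hypothetical illustrations.

def shapley_share(counterfactual_value, n_stakeholders):
    """Equal split of counterfactual value among necessary stakeholders."""
    return counterfactual_value / n_stakeholders

# E.g., a writeup worth 90 mQARPs counterfactually, shared among three
# necessary stakeholders (advice-giver, advice-taker, hire):
print(shapley_share(90, 3))  # → 30.0
```

The judgment call is all in choosing `n_stakeholders`; the division itself is trivial.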
Estimates
2017 Donor Lottery Report
Total project impact:
Background considerations:
$100k in money donated
Gave a substantial monetary and status boost to ALLFED
Reasoning about my estimate: The impact of the grants is the cumulative 1 year output by various researchers at ALLFED/AI Impacts/GCRI/WASR, plus status effects. Because these are relatively young organizations, I’m going to estimate something like a salary of $20k to $40k, and an impact of 10 to 50 mQARPs/month per new hire. Note that in order to not double count impact, the impact has to be divided between the funding providers and the grantee (and possibly with the new hires as well). Counterfactual fungibility for AI Impacts or for ALLFED does not really seem to be a concern given that the funding which this grant could have replaced was probably also directed adequately.
Guesstimate model here
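For readers without the Guesstimate model at hand, here is a rough Python analogue (my own reconstruction, using uniform draws over the stated intervals rather than Guesstimate’s distributions, so the numbers will differ somewhat):

```python
import random

random.seed(0)

def sample_grant_impact_qarps():
    grant_usd = 100_000                           # money donated via the lottery
    salary_usd = random.uniform(20_000, 40_000)   # cost per hire-year
    hire_years = grant_usd / salary_usd
    mqarps_per_month = random.uniform(10, 50)     # output per new hire
    return hire_years * 12 * mqarps_per_month / 1000  # convert mQARPs to QARPs

samples = sorted(sample_grant_impact_qarps() for _ in range(10_000))
print(round(samples[len(samples) // 2], 1))       # median, in QARPs
```

Note that this is the impact before dividing it between the funding providers and the grantee, per the double-counting note above.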
Ballpark: Between 6 fairly valuable papers (600 mQARPs) and 3 really good ones (3 QARPs).
Estimate: 500 mQARPs to 4 QARPs
Counterfactual impact of Adam Gleave winning the donor lottery (as opposed to other winners):
Background considerations:
If one looks at the list of lottery donors: https://app.effectivealtruism.org/lotteries/31553453298138, the participants I can recognize also strike me as intelligent and savvy.
Reasoning: If I think about how much I would pay to have the winner win again, my first answer is “not much”. But I then learned that Adam Gleave then went on to become a part of the Long-Term Future Fund, which might be indicative of a particularly good fit. Another consideration is that the impact ought to be shared between the people who participated in the donor lotteries, and the organizations which made such a donor lottery happen. As an aside, one might have thought that he could have donated the money to EA Funds, but EA Funds wasn’t as active or as legible in 2017. Lastly, note that for the lower end of the 80% confidence interval to be above zero, Adam will have to have been in the top 20% of donor lottery participants, which intuitively seems likely but about which I’m not completely sure. I don’t really have a good factorization of these considerations, so I’m going more with an educated guess.
Ballpark: Roughly as valuable as a fairly valuable paper.
Estimate: −50 mQARPs to 500 mQARPs
Impact of the writeup alone:
Background considerations:
Made donor lottery reports into “a thing”, and future donor lottery winners more informed and willing to share their conclusions.
It seems like the post gave a status boost to ALLFED
Reasoning: With regards to impact achieved so far vs impact to be achieved in the future, most of the impact is still in the future. However, the proxies of impact, particularly the health of ALLFED as an organization, look good. So I feel reasonably ok making an educated guess.
Ballpark: I imagine that the mean is a little less valuable than a fairly valuable paper, though the positive tail could be much higher if one values reputational effects to ALLFED highly and thinks they were significant. It’s hard to see how the writeup could have had a negative effect.
Estimate: 50 mQARPs to 250 mQARPs.
Takeaways from EAF’s Hiring Round.
Impact of the hiring round itself:
Background considerations: From Effective Altruism Foundation: Plans for 2020: “We planned to hire a Research Analyst for grantmaking and an Operations Analyst and made two job offers. One of them was not accepted; the other one did not work out during the first few months of employment. In hindsight, it might have been better to hire even more slowly and ensure we understood the roles we were hiring for better. Doing so would have allowed us to make a more convincing case for the positions and hire from a larger pool of candidates”
Reasoning: Impact was then probably negative, given that hiring rounds seem to be time-intensive. I don’t know how long applications took, but I’m going to say 1 to 10 FTE (full-time equivalents) for 2 weeks (counting the time that applicants spent). This gives 0.5 to 5 FTE-months. If one FTE can produce 5 to 20 mQARPs per month, this gives an estimate of 15 to 75 mQARPs lost. When discussing a draft, commenters pointed out that I didn’t count opportunity costs for other projects; however, in this case the opportunity costs fell not only on the people who did the project, but also on the applicants, which feels meaningfully different.
Guesstimate model here.
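A rough Python analogue of that calculation (uniform draws over the stated intervals; the Guesstimate model may use different distributions):

```python
import random

random.seed(0)

def sample_cost_mqarps():
    fte_months = random.uniform(0.5, 5)           # 1 to 10 FTE for ~2 weeks
    mqarps_per_fte_month = random.uniform(5, 20)  # foregone research output
    return -fte_months * mqarps_per_fte_month     # negative: opportunity cost

samples = sorted(sample_cost_mqarps() for _ in range(10_000))
lo, hi = samples[1_000], samples[9_000]           # rough 80% interval
print(round(lo), round(hi))
```

The interval comes out somewhat wider than the verbal 15 to 75 mQARPs, which is what multiplying two wide intervals does.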
Ballpark: Loss of between one fairly valuable paper and half an excellent EA forum blog post.
Estimate: −70 to −5 mQARPs
When reviewing this section, some commenters pointed out that, for them, calculating the opportunity cost didn’t make as much sense. I disagree with that. Further, I’m also not attempting to calculate the expected value ex ante; in this case this feels inelegant because the expected value will depend a whole bunch on the information, accuracy and calibration of the person doing the expected value calculation, and I don’t want to estimate how accurate or calibrated the piece’s author was at the time (though he is pretty good now).
Impact of the writeup (as opposed to impact of the hiring process):
Background considerations: The post mentions that “This post might be useful for other organizations and (future) applicants in the community.” This seems to have been the case; e.g., the Fish Welfare Initiative (Hiring Process and Takeaways from Fish Welfare Initiative) mentioned: “We relied heavily on the following resources [among which was the writeup under consideration] and highly recommend looking them over.” The fact that the writeup was highly upvoted also suggests that people found it valuable.
Reasoning: The way to estimate impact here would be something like: “Counterfactual impact of the best hire(s) in the organizations it influenced, as opposed to the impact of the hires who would otherwise have been chosen”. Let’s say it influenced 1 to 5 hiring rounds, and advice in the post allowed advice-takers to hire 0 to 3 people per organization who were 1 to 10% more effective, and who stayed with the organization for 0.5 to 3 years. If one hire can produce 5 to 20 mQARPs worth of impact per year, that corresponds to 0 to 100 mQARPs per year. But in order to not triple count impact, that has to be shared between the advice-giver, the advice-taker, and the hire, possibly funders, etc., so it amounts to ~1 to 30 mQARPs. Note that by saying “one hire can produce 5 to 20 mQARPs worth of impact per year”, I totally elide the (difficult) problem of comparing impact in different areas.
Guesstimate model: here
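The chain of multiplications above, sketched in Python (uniform draws over the stated intervals; the real Guesstimate model may differ):

```python
import random

random.seed(0)

def sample_writeup_impact_mqarps():
    rounds_influenced = random.uniform(1, 5)
    hires_improved = random.uniform(0, 3)            # per influenced round
    effectiveness_gain = random.uniform(0.01, 0.10)  # 1 to 10% better hires
    years_retained = random.uniform(0.5, 3)
    mqarps_per_hire_year = random.uniform(5, 20)
    total = (rounds_influenced * hires_improved * effectiveness_gain
             * years_retained * mqarps_per_hire_year)
    return total / 3   # shared among advice-giver, advice-taker, and hire

samples = sorted(sample_writeup_impact_mqarps() for _ in range(10_000))
print(round(samples[len(samples) // 2], 1))          # median, in mQARPs
```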
Ballpark: Zero to three excellent EA forum posts.
Estimate: 0 to 30 mQARPs
Why we have over-rated Cool Earth
Impact of the post and the research:
Models of impact:
Lower donations to Cool Earth.
Cautionary tales / providing a good example of research
The project is a good piece of research, and creating common knowledge that Sanjay has the ability to produce good research might be valuable in itself
The project seems to have alienated Cool Earth slightly.
The post might have spurred more research into EA/Climate Change
Reasoning: Commenters in the post indicated that the prevailing view at the time was that climate change was much less exciting than global poverty, and that nobody was really donating to Cool Earth. That nobody commented saying that they had changed their mind about donations is also weak evidence that nobody did.
Ballpark: −0.5 excellent EA forum posts to +2 excellent EA forum posts
Estimate: −5 to 20 mQARPs
Lessons Learned from a Prospective Alternative Meat Startup Team
Expected impact of the project:
Reasoning: I’m going to pull some numbers out of thin air, and say that the project had a 1% to 10% probability of founding a startup worth $0.5 to $10 million, with a very long tail after that. I could try to guess the counterfactual amount of animal suffering averted, but I’m not really familiar with the details. Counterfactual replaceability of the potential company seems low, and in this regard the writeup still seems prophetic: Beyond Meat and Impossible Foods seem to be doing well in the beef substitute space, whereas the authors considered chicken nuggets and fish sticks, rather than beef replacements, as they had a higher volume of suffering per kilogram. Chicken nuggets and fish sticks still seem neglected today.
Estimate: 1% to 10% probability of founding a startup worth $0.5 million to $10 million.
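As a quick ex-ante expected-value bound under those thin-air numbers (ignoring the long tail past $10 million):

```python
# Ex-ante expected-value bounds for the startup project, using the
# thin-air numbers above and ignoring the long tail past $10M.
p_low, p_high = 0.01, 0.10       # probability the startup happens
v_low, v_high = 0.5e6, 10e6      # value of the startup if it does, in USD

print(round(p_low * v_low), round(p_high * v_high))  # → 5000 1000000
```

So the naive expected value spans roughly $5k to $1M, which is why the long tail matters so much here.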
Impact of the project:
Reasoning: Given that the authors didn’t actually end up starting a company, actual impact seems close to 0. However, it might have been slightly positive because of e.g. motivational effects, or because the knowledge accumulated during the projects was used later (e.g., in this post by the same author, which also won an EA forum prize).
Ballpark: 0 to 2 excellent EA Forum posts.
Estimate: 0 to 20 mQARPs
Impact of the writeup:
Reasoning: Small, given that nobody seems to have started a company or nonprofit because of the ideas outlined in the post (yet!)
Ballpark: 0 to 2 good EA forum posts.
Estimate: 0 to 20 mQARPs
2018 AI Alignment Literature Review and Charity Comparison
Theory of change: By evaluating charities in the AI alignment space, this post enables donors to better direct their donations. As such, I think that the Shapley value should naïvely be something like 1/3rd of the counterfactual impact of donations. I.e., 1/3rd of the difference between how donors donated as a result of reading the post, and how they otherwise would have donated. 1/3rd of this counterfactual impact would go to Larks, 1/3rd to the donor, and 1/3rd to the charity.
Reasoning: I have reason to believe that the amount of money influenced was large, and Larks’s writeup was further the only one available. My confidence interval is then $100,000 to $1M in moved funding, of whose impact Larks gets a 1/3rd share. (Note: In his rot13 note at the end, Larks decided not to donate or to recommend donations to MIRI, which, in hindsight, given the failure of their undisclosed research, seems to have been the right move.) At this point, there are two competing effects: On the one hand, there might be a small number of AI safety organizations competing for the same funding. On the other hand, having a clear writeup may have made it possible for donors to donate at all. In the first case, the impact of Larks’s review is the difference in efficiency between the different charities inside the AI safety space. In the second case, the impact of Larks’s review is higher because donors might not otherwise have donated money that year. I’m somewhat arbitrarily estimating these considerations to range from halving the expected impact to leaving it mostly as it is. Larks’s preferred AI safety organization was able to convert $3m into 12 papers/projects (I’m counting coauthored papers as 1/2), each worth say 50 to 250 mQARPs, plus Shah’s AI Alignment Newsletter, which seems particularly valuable but also not really bottlenecked on funding at the time.
Guesstimate: You can see my first guesstimate model here, but I think that this was too low because it only took into account the value of papers, so here is a second model, which takes into account that the value which does not come from papers is 2 to 20x the magnitude of the value which does come from papers. I’m still uncertain, so my final estimate is even more uncertain than the Guesstimate model.
Ballpark: Between four excellent EA forum posts and 8 fairly valuable papers personally attributable to Larks (per Shapley calculations)
Estimates: $100,000 to $1M in moved funding, estimated to be 40 to 800 mQARPs
Note: Previously, I had given ALLFED hires something like 10 to 50 mQARPs/month. But if I calculate the mQARPs per hire per month of an AI safety organization, I get something like 2 to 15 mQARPs per person per month. It could be the case that this isn’t wrong, given that an organization favored by Larks produced 12 papers interesting enough for Larks to mention, on a $3M budget (and I model papers as 1/2 to 1/20th of the total value of an organization). It could also be the case that when ALLFED was just starting to scale up, it really was ~5x as valuable as already established AI safety organizations. In any case, do notice that my confidence intervals generally span more than a 5x range. Nonetheless, I’d still flag that producing good AI safety papers seems fairly expensive.
Cause profile: mental health
Models of impact:
People read about mental health as a cause area, and decide to work on it
People read about mental health as a cause area, and decide to donate to it
Laying out his reasoning about why mental health is important makes it easier to create and fundraise for the Happier Lives Institute
This didn’t happen in this case, but it has happened before and could have happened here: persuasive enough counter-arguments from commenters could have changed Plant’s bottom line, e.g., convincing the author to branch away from mental health and towards another field of research.
Reasoning: I don’t really have great models here. Most of the value of the post seems to come from making it easier for Michael Plant to set up HLI (the Happier Lives Institute), but most of the value of HLI seems to still lie in the future. Note that GiveWell seems to briefly have looked into StrongMinds, but then seems to have abandoned the idea, which is slightly weird.
Estimate: 0 to “totally unknown”. If forced to guess, I’d go with 0 to 100 mQARPs, eliding the (difficult) problem of comparing impact in different areas.
EA Giving Tuesday Donation Matching Initiative 2018 Retrospective
Models of impact:
By matching Facebook’s funds, the project directed money towards more effective charities.
By matching Facebook’s funds, the project elicited money which wouldn’t otherwise have been donated
The EA Giving Tuesday team also built expertise for future years
The EA Giving Tuesday team grows and becomes more capable by attempting new projects
Reasoning: The original post has done much of the work for me here: $469k worth of donations were matched, as opposed to $48k in 2017 (!!); this gives reason to think that most of it wouldn’t have happened in the absence of this project. Specifically, I’m going to estimate 75% to 100%. Responses from a follow-up survey suggested “that $85k (12%) of donations may have been counterfactually caused by our initiative, though this estimate is highly uncertain”, so I’m going to estimate that 20 to 80% of those who claimed to donate more actually did. Counterfactual replaceability is also a concern (if the current team hadn’t organized this, another EA team might have), but not a great one, since the team that would have taken their place is freed to do another project instead, and the current team is probably more capable than someone else doing it from scratch.
Estimate: $390K to $530K counterfactual donations, of which the Shapley value assigns $130K to $230K to the EA Giving Tuesday team.
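A back-of-the-envelope reconstruction of that estimate (my own reading of the reasoning above; uniform draws, and a hypothetical three-way stakeholder split):

```python
import random

random.seed(0)

def sample_counterfactual_usd():
    matched = 469_000
    frac_due_to_project = random.uniform(0.75, 1.0)   # share of matched donations
    extra = 85_000 * random.uniform(0.2, 0.8)         # survey-claimed extra giving
    return matched * frac_due_to_project + extra

samples = sorted(sample_counterfactual_usd() for _ in range(10_000))
median = samples[len(samples) // 2]
team_share = median / 3   # hypothetical split: team, donors, Facebook
print(round(median), round(team_share))
```

The three-way split is my assumption for illustration; the post’s own Shapley calculation may carve up stakeholders differently.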
EA Survey 2018 Series: Cause Selection
Impact of the post alone:
Theory of change: By having more insight into how it functions, and into its own demographics and opinions, the EA community can improve itself.
Reasoning: I’m not really sure what concrete improvements the EA community has implemented as a result of this post, but the fact that I can’t think of any does not mean they don’t exist.
Ballpark: 0 to two excellent EA forum posts.
Estimate: 0 to 20 mQARPs.
EAGx Boston 2018 Postmortem
Impact of the EAGx:
Reasoning: I’m pulling this completely out of thin air, but: 20 to 80% of the 200 attendees are made 2 to 10% more altruistic for the next 1 to 3 years. In particular, suppose that their average donation (either in terms of money or in terms of time and effort) is equivalent to $100 to $10000 /year.
Estimate: $200 to $350K in moved donations, with a mean influence of $30K, before adjusting for Shapley values, and $100 to $350K after adjusting for Shapley values. Note how this is $100 to $350,000, not $100K to $350K. This confidence interval is absurdly wide, which is what happens when you pull too many numbers out of thin air.
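A Monte Carlo sketch of that calculation (numbers out of thin air, per the reasoning above; I draw the yearly donation log-uniformly since it spans two orders of magnitude, which is an assumption on my part):

```python
import math
import random

random.seed(0)

def sample_moved_donations_usd():
    attendees = 200
    frac_affected = random.uniform(0.2, 0.8)     # attendees influenced
    boost = random.uniform(0.02, 0.10)           # increase in altruism
    years = random.uniform(1, 3)                 # duration of the effect
    # yearly donation, drawn log-uniformly over $100 to $10,000 (my assumption)
    donation = math.exp(random.uniform(math.log(100), math.log(10_000)))
    return attendees * frac_affected * boost * years * donation

samples = sorted(sample_moved_donations_usd() for _ in range(10_000))
lo, med, hi = samples[1_000], samples[5_000], samples[9_000]
print(round(lo), round(med), round(hi))   # rough 80% interval and median
```

The resulting interval is absurdly wide, as the post notes: that is what multiplying five wide intervals together does.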
Impact of the writeup:
Theory of change: By having a writeup, future event organizers can better organize events.
Background considerations: The writeup gives advice about 9 different areas: Venue, Speaker Outreach, Funding, Web Presence, Audio Visual Services, Food, Marketing and Design, Presentations, and Day-of Execution
Reasoning: Suppose that 0 to 5 events are made 1 to 20% better because they read the writeup under consideration. In particular, suppose that those events have 50 to 500 attendees, and that 1 to 10% of attendees are made 1 to 10% more altruistic for the next 1 to 3 years. In particular, suppose that their average donation (either in terms of money or in terms of time or effort) is equivalent to $100 to $10000 /year.
Estimate: $0 to $1000 worth of influenced donations, with a mean influence of $100. The Shapley value would assign an influence worth of $0 to $500.
Will companies meet their animal welfare commitments?
Theory of change: Better analysis leads to better grants, and better strategies by animal welfare advocates. This in turn leads to less animal suffering.
Reasoning: I’m rather uncertain. The impact of this post depends almost solely on whether other people, and in particular OpenPhilanthropy and other animal advocates, heeded its conclusions. There is reason to believe that its advice has in fact been heeded, but it is also likely that it would also have been heeded without this article. For example, some of the points mentioned in the article had been brought up before. That said, even if its ideas were already floating around, the post is extremely thorough, and might have solidified and given weight to the ideas therein covered.
Estimate: 0 mQARPs to 100 mQARPs, which elides the (difficult) problem of comparing impact in different areas.
Table
| Project | Ballpark | Estimate |
|---|---|---|
| 2017 Donor Lottery Grant | Between 6 fairly valuable papers and 3 really good ones | 500 mQARPs to 4 QARPs |
| Adam Gleave winning the 2017 donor lottery (as opposed to other participants) | Roughly as valuable as a fairly valuable paper | −50 mQARPs to 500 mQARPs |
| 2017 Donor Lottery Report (Writeup) | A little less valuable than a fairly valuable paper | 50 mQARPs to 250 mQARPs |
| EAF’s Hiring Round | Loss of between one fairly valuable paper and half an excellent EA forum blog post | −70 to −5 mQARPs |
| Takeaways from EAF’s Hiring Round (Writeup) | Between two good EA forum posts and a fairly valuable paper | 0 to 30 mQARPs |
| Why we have over-rated Cool Earth | −0.5 excellent EA forum posts to +1.5 excellent EA forum posts | −5 to 20 mQARPs |
| Alternative Meat Startup Team (Project) | 0 to 1 excellent EA Forum posts | 1 to 50 mQARPs |
| Lessons Learned from a Prospective Alternative Meat Startup Team (Writeup) | 0 to 5 good EA forum posts | 0 to 20 mQARPs |
| 2018 AI Alignment Literature Review and Charity Comparison | Between two excellent EA forum posts and 6 fairly valuable papers | 40 to 800 mQARPs |
| Cause profile: mental health | Very uncertain | 0 to 100 mQARPs |
| EA Giving Tuesday Donation Matching Initiative 2018 | → | $130K to $230K in Shapley-adjusted funding towards EA charities |
| EA Survey 2018 Series: Cause Selection | 0 to an excellent EA forum post | 0 to 20 mQARPs |
| EAGx Boston 2018 (Event) | → | $100 to $350K in Shapley-adjusted funding towards EA charities |
| EAGx Boston 2018 Postmortem (Writeup) | → | $0 to $500 in Shapley-adjusted donations towards EA charities |
| Will companies meet their animal welfare commitments? | 0 to a fairly valuable paper | 0 to 100 mQARPs |
Comments and thoughts
Calibration
An initial challenge in this domain relates to how to attain calibration. The way I would normally calibrate intuitions on a domain is by making a number of predictions at various levels of gut feeling, and then seeing empirically how frequently predictions made at different levels of gut feeling come out right. For example, I’ve previously found that my gut feeling of “I would be very surprised if this was false” generally corresponds to 95% (so 1 in 20 times, I am in fact wrong). But in this case, when considering or creating a new domain, I can’t actually check my predictions directly against reality, but instead have to check them against other people’s intuitions.
Comparison is still possible
Despite my wide levels of uncertainty, comparison is still possible. Even though I’m uncertain about the impact of both “Will companies meet their animal welfare commitments?” and “Lessons Learned from a Prospective Alternative Meat Startup Team”, I’d prefer to have the first over the second.
Similarly, while EAGx Boston 2018 and the EA Giving Tuesday Donation Matching Initiative might have taken similar amounts of time to organize, by comparably capable people, I prefer the second. This is in large part because EAGx events are scalable, whereas Giving Tuesdays are not.
I was also surprised by the high cost of producing papers when estimating the value of Larks’ review (though perhaps I shouldn’t have been). It could be the case that this was a problem with my estimates, or that papers truly are terribly inefficient.
Future ideas
Ozzie Gooen has in the past suggested that one could build a consensus around these kinds of estimates, and scale them further. In addition, one could also use these kinds of estimates to choose one’s own projects, or to recommend projects to others, and see how that fares. Note how in principle, these kinds of estimates don’t have to be perfect or perfectly calibrated; they just have to be better than the implicit estimates which would otherwise have been made.
In any case, there are also details to figure out or justify. For example, I’ve been using Shapley values, which I think are a more complicated, but often a more appropriate alternative to counterfactual values. Normally, this just means that I divide the total estimated impact by the estimated number of stakeholders, but sometimes, like in the case of a hiring round, I have the intuition that one might want to penalize the hiring organization for the lost opportunity cost of applicants, even though that’s not what Shapley values recommend. Further, it’s also sometimes not clear how many necessary stakeholders there are, or how important each stakeholder is, which makes the Shapley value ambiguous, or subject to a judgment call.
I’ve also been using a cause-impartial value function. That is, I judge a post in the animal welfare space using the same units as for a post in the long-termist space. But maybe it’s a better idea to have a different scale for each cause area, and then have a conversion factor which depends on the reader’s specific values. If I continue working on this idea, I will probably go in that direction.
Lastly, besides total impact, we also care about efficiency. For small and medium projects, I think that the most important kind of efficiency might be time efficiency. For example, when choosing between a project worth 100 mQARPs and one which is worth 10 mQARPs, one would also have to look at how long each takes, because maybe one can do 50 projects each worth 10 mQARPs in the time it takes to do a very elaborate 100 mQARPs project.
Thanks to David Manheim, Ozzie Gooen and Peter Hurford for thoughts, comments and suggestions.
So here are the mistakes pointed out in the comments:
EAF’s hiring round had a high value of information, which I didn’t incorporate, per Misha’s comment
“Why we have over-rated Cool Earth” was more impactful than I thought, per Khorton’s comment
I likely underestimated the possible negative impact of the 2017 donor lottery report, which was quite positive on ALLFED, per MichaelA’s comment.
I think this (a ~30% mistake rate) is quite brutal, and still only a lower bound (because there might be other mistakes which commenters didn’t point out). I’m pointing this out here because I want to reference this error rate in a forthcoming post.
There are a lot of things l like about this post. From small (e.g. the summary on top of it; and the table at the end) to large (e.g. it’s a good thing to do given a desire to understand how to quantify/estimate impact better).
Here are some things I am perplexed about or disagree with:
EAF hiring round estimate misses the enormous realized value of information. As far as I can see, EAF decided to move to London (partly) because of that.
> We moved to London (Primrose Hill) to better attract and retain staff and collaborate with other researchers in London and Oxford.
> Budget 2020: $994,000 (7.4 expected full-time equivalent employees). Our per-staff expenses have increased compared with 2019 because we do not have access to free office space anymore, and the cost of living in London is significantly higher than in Berlin.
The donor lottery evaluation seems to miss that $100K would have been donated otherwise.
Further, I would suggest another decomposition.
Impact = impact of running the donor lottery as a tool (as opposed to donating without ~aggregation) + the counterfactual impact of particular grants (as opposed to ~expected grants) + misc. side-effects (like a grantmaker joining LTFF).
I can understand why you added the first two terms. But it seems to me that
we can get a principled estimate about the first one based on arguments for donor lotteries (e.g. epistemic advantage coming from spending more time per dollar donated; and freed time of donors);
One can get more empirical and have a quick survey here.
estimating the second term is trickier because you need to make a guess about the impact of an average epistemically advantaged donation (as opposed to an average donation of $100K, which I think is missing from your estimate)
Both of these are doable because we saw how other donor lottery winners gave their money and how wealthy/invested donors give their money.
A good proxy for an impact of average donation might come from (a) EA survey donation data, (b) a quick survey of lottery participants. The latter seems superior because participating in an early donor lottery suggests a higher engagement with EA ideas &c.
After thinking a bit longer the choice of decomposition depends on what you want to understand better. It seems like your choice is better if you want to empirically understand whether the donor lottery is valuable.
Another weird thing is to see the 2017 Donor Lottery Grant having 5 to 10x higher impact than the 2018 AI Alignment Literature Review and Charity Comparison.
I think it might come down to you not subtracting the counterfactual impact of donating 100K w/o lottery from donors’ lottery impact estimate.
The basic source of impact of the donor lottery and the charity review comes from an epistemic advantage (someone dedicating more time to think about/evaluate donations; people being better informed about the charities they are likely to donate to). Given how well received the literature review is, it seems quite likely to be helpful to individual donors, and given that it (according to your guess) influenced $100K..1M, it should be about as impactful as an abstract donor lottery, or more so.
And it’s hard to see this particular donor lottery as overwhelmingly more impactful than an average one.
I see now, that is weird. Note that if I calculate the total impact of the $100K to $1M I think Larks moved, the impact of that would be 100mQ to 2Q (change the Shapley value fraction in the Guesstimate to 1), which is closer to the 500mQ to 4Q I estimated for the 2017 Donor Lottery. And the difference can be attributed to a) investing in organizations which are starting up, b) the high cost of producing AI safety papers, coupled with cause neutrality, and c) further error.
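For readers who don't want to open the Guesstimate model: the interval arithmetic behind figures like "100mQ to 2Q" can be sketched with a lognormal Monte Carlo. The numbers below (dollars moved, a Q-per-dollar conversion, a Shapley multiplier of 1) are illustrative placeholders, not the model's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def lognormal_90ci(low, high, size=100_000):
    """Lognormal samples whose 90% interval is (low, high),
    mimicking Guesstimate's `low to high` input (low, high > 0)."""
    mu = (np.log(low) + np.log(high)) / 2
    sigma = (np.log(high) - np.log(low)) / (2 * 1.645)
    return rng.lognormal(mu, sigma, size)

# Hypothetical inputs: money moved and a made-up Q-per-dollar rate.
dollars_moved = lognormal_90ci(100_000, 1_000_000)
q_per_dollar = lognormal_90ci(1e-6, 2e-6)
shapley_multiplier = 1.0  # total impact rather than attributed impact

impact = dollars_moved * q_per_dollar * shapley_multiplier
print(np.percentile(impact, [5, 50, 95]))
```

Setting `shapley_multiplier` to something like 0.3 to 0.5 instead of 1 would reproduce the attribution step being discussed.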
Good point re: value of information
Re: “The donor lottery evaluation seems to miss that $100K would have been donated otherwise”: I don’t think it does. In the “total project impact” section, I clarify that “Note that in order to not double count impact, the impact has to be divided between the funding providers and the grantee (and possibly with the new hires as well).”
Thank you, Nuno!
Am I understanding correctly that the Shapley value multiplier (0.3 to 0.5) is responsible for preventing double counting?
If so why don’t you apply it to Positive status effects? The effect was also partially enabled by the funding providers (maybe less so).
Huh! I am surprised that your Shapley value calculation is not explicit, but it seems reasonable.
Let’s limit ourselves to two players (= funding providers who are only capable of shallow evaluations and grantmakers who are capable of in-depth evaluation but don’t have their own funds). You get Shapley mult. = (V(lottery, funding in-depth evaluated projects) − V(default, funding shallowly evaluated projects)) / (2 · V(lottery)). Your estimate of “0.3 to 0.5” implies that shallowly evaluated giving is as impactful as “0 to 0.4” of in-depth evaluated giving.
This 2.5x..∞ multiplier is reasonable, but it doesn’t feel quite right to put 10% on above ∞ :)
This makes me further confused about the gap between the donor lottery and the alignment review.
You are understanding correctly that the Shapley value multiplier is responsible for preventing double-counting, but you’re making a mistake when you say that it “implies that shallowly evaluated giving is as impactful as “0 to 0.4″ of in-depth evaluated giving”; the latter doesn’t follow.
In the two player game, you have Value({}), Value({1}), Value({2}), Value({1,2}), and the Shapley value for player 1 (the funders) is ([Value({1})- Value({})] + [Value({1,2})- Value({2})] )/2, and the value of player 2 (the donor lottery winner) is ([Value({2})- Value({})] + [Value({1,2})- Value({1})] )/2
In this case, I’m taking [Value({2}) − Value({})] to be ~0 for simplicity, so the value of player 2 is [Value({1,2}) − Value({1})]/2. Note that this is just the counterfactual value divided by a fraction.
If there were more players, it would be a little bit more complicated, but you’d end up with something similar to [Value({1,2,3}) − Value({1,3})]/3. Note again this is just the counterfactual value divided by a fraction.
But now, I don’t know how many players there are, so I just consider [Value({The World}) − Value({The World without player 2})]/(some estimates of how many players there are).
And the Shapley value multiplier would be 1/(some estimates of how many players there are).
At no point am I assuming that “shallowly evaluated giving is as impactful as 0 to 0.4 of in-depth evaluated giving”; the thing that I’m doing is just allocating value so that the sum of the value of each player is equal to the total value.
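The two-player allocation described above can be checked numerically. Here is a minimal sketch of an exact Shapley calculation; the characteristic-function values are hypothetical placeholders chosen so that Value({2}) ≈ 0, as in the comment:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values for a characteristic function `value`
    defined on frozensets of players."""
    n = len(players)
    shapley = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Weight = |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        shapley[p] = total
    return shapley

# Hypothetical two-player game: 1 = funders, 2 = lottery winner.
v = {frozenset(): 0.0, frozenset({1}): 0.4,
     frozenset({2}): 0.0, frozenset({1, 2}): 1.0}
print(shapley_values([1, 2], lambda s: v[frozenset(s)]))
```

With these made-up numbers, player 1 (the funders) gets 0.7 and player 2 (the winner) gets 0.3, i.e. the winner’s share is the counterfactual value [Value({1,2}) − Value({1})] divided by two.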
Thank you for engaging!
First, “note that this [misha: Shapley value of evaluator] is just the counterfactual value divided by a fraction [misha: by two].” Right, this is exactly the same in my comment. I further divide by total impact to calculate the Shapley multiplier.
Do you think we disagree?
Why doesn’t my conclusion follow?
Second, you conclude “And the Shapley value multiplier would be 1/(some estimates of how many players there are)”, while your estimate is “0.3 to 0.5”. There have been something like 30 participants over the two lotteries that year, so you should have ended up with something an order of magnitude smaller, like “3% to 10%”.
Am I missing something?
Third, for the model with more than two players, it’s unclear to me who the players are. If these are funders + N evaluators, you will indeed end up with (1/N) · (1 − V(funders)/V(lottery)) because
Shapley multipliers should add up to 1, and
Shapley value of the funders is easy to calculate (any coalition without them lacks any impact).
Please note that V(funders) is V(default, …) from the comment above.
(Note that this model ignores that the beneficiary might win the lottery and no donations will be made.)
In the end,
I think that it is necessary to estimate X in “shallowly evaluated giving is as impactful as X times in-depth evaluated giving”, because if X ≈ 1, the impact of the evaluator is close to nil.
I might not understand how you model impact here; please be more specific about the modeling setup and assumptions.
I don’t think that you should split evaluators. Well, basically because you want to disentangle the impact of evaluation and funding provision and not to calculate Adam’s personal impact.
Like, take it to the extreme: it would be pretty absurd to say that an overwhelmingly successful donor lottery (e.g. one seeding a new ACE Top Charity in a yet-unknown but highly tractable area of animal welfare, or discovering an AI alignment prodigy) had less impact than an average comment because too many people (100K of them) contributed a dollar each to participate in it.
Yes, we agree
No, we don’t agree. I think that Adam did better than other potential donor lottery winners, so his counterfactual value is higher, and thus his Shapley value is also higher. If all the other donors had been clones of Adam, I agree that you’d just divide by n. Thus, the “In every example here, this will be equivalent to calculating counterfactual value, and dividing by the number of necessary stakeholders” is in fact wrong, and I was implicitly doing both of the following in one step: a. calculating Shapley values with “evaluators” as one agent, and b. thinking of Adam’s impact as a high proportion of the SV of the evaluator round.
The rest of our disagreements hinge on 2., and I agree that judging the evaluator step alone would make more sense.
On Sanjay’s Cool Earth post, I have seen it frequently referenced. Founders Pledge came out with some climate change recommendations shortly after and I think people have been largely donating to those now instead.
I’ll flag the narrow and lowish estimates about the Cool Earth as something I was most likely wrong about, then, thanks.
It seems plausible that people who gave to ALLFED, volunteered for ALLFED, worked for ALLFED, etc. due in part to Gleave’s report would otherwise have done better things with their resources.
The report may also have led to EAs/global catastrophic risk researchers/longtermists talking about ALLFED more often and more positively, which could perhaps have a negative effect on perceptions of those communities, e.g. because:
Papers associated with them often present explicit quantitative models and estimates about very uncertain things (which some people are just averse to in general)
ALLFED and those models sometimes make claims that can seem intuitively fairly unlikely
E.g., “AGI safety, alternative foods, and interventions for losing electricity/industry (and probably other interventions) likely save lives in the present generation more cost-effectively than GiveWell top charities.”
That’s a comment from Denkenberger rather than ALLFED as an institution, but ALLFED-related papers make similar claims
Those models do seem to have some noticeable issues
Though I’d personally say that this is to be expected with any models, and a great thing about models is that they often make it easier to identify and correct specific issues, and I personally still basically agree with the qualitative conclusions drawn from the models
A big part of ALLFED’s focus is making a catastrophe less bad if it does happen, which could seem callous to some people
I think it’s unlikely that the donor lottery report would have those downsides to a substantial extent.
And I’m personally quite positive about ALLFED, David Denkenberger, and their work, and ALLFED is one of the three places I’ve donated the most to (along with GCRI and the Long-Term Future Fund).
I’m just disagreeing with the claim “It’s hard to see how the writeup could have had a negative effect.” I basically think most longtermism-related things could plausibly have negative effects, since they operate on variables that we think might be important for the long-term future and we’re really bad at predicting precisely how the effects play out. (But this doesn’t mean we just try to “do nothing”, of course! Something with some downside risk can still be very positive in expectation.)
I’m not sure how often my 80% confidence interval would include negative effects, nor whether it’d include them in the ALLFED case. So maybe this is just a nit-pick about your specific phrasing, and we’d agree on the substance of your model/estimate.
Yeah, I see what you’re saying. Do you think that it is hard for the writeup to have a negative total effect?
When I made my comment, I think I kind-of had in mind “negative total effect”, rather than “at least one negative effect, whether or not it’s offset”. But I don’t think I’d explicitly thought about the distinction (which is a bit silly), and my comment doesn’t make it totally clear what I meant, so it’s a good question.
I think my 80% confidence interval probably wouldn’t include an overall negative impact of the writeup. But I think my 95% confidence interval would.
Reasons why my 80% confidence interval probably wouldn’t include an overall negative impact of the writeup, despite what I said in my previous comment:
I think we should have some degree of confidence that, if there’s more public discussion by people with fairly good epistemics and good epistemic and discussion norms, that’ll tend to update people towards more accurate beliefs.
(Not every time, but more often than it does the opposite.)
As such, I think we should start off skeptical of claims like “An EA Forum post that influenced people’s beliefs and behaviours substantially influenced those things in a bad way, even though in theory someone else could’ve pointed that out convincingly and thus prevented that influence.”
And then there’s also the fact that Gleave later got a role on the LTFF, suggesting he’s probably good at reasoning about these things.
And there’s also my object-level positive impressions of ALLFED.
I have nothing to disagree about here :)
Overall thoughts
Thanks, I found this post interesting.
I don’t know what I think about the reasonableness of these specific evaluations, about how useful this sort of evaluation approach is, or about whether I’d like to see more of this sort of thing in future and exactly what form it should take. (To be clear, I literally just mean “I don’t know”, rather than meaning “I think this all sucks, but I’m being polite.”) But I think it’s plausible that this or something like it would be very valuable and should be scaled up substantially, so I think exploring the idea at least a bit is definitely worthwhile in expectation.
I’d be interested to hear roughly how long this whole process took you (or how long it took minus writing the actual post, or something)? This seems relevant to how worthwhile and scalable this sort of thing is.
(Of course, the process may become much faster as the people doing it become more experienced, better tools or templates for it are built, etc. But it may also become slower if one aims for more rigour / less pulling things out of thin air. In any case, I think how long this early attempt took should give at least a rough idea.)
I also had a bunch of reactions that aren’t especially important since they’re focused on specific points about each evaluation, rather than on the basic methods and how this sort of analysis can be useful. I’ll split them into separate comments.
Recently Nuño asked me to do similar (but shallower) forecasting for ~150 project ideas. It took me about 5 hours. I think I could have done the evaluation faster, but I left ~paragraph-long comments on maybe ⅓ to ½ of the projects and sentence-long comments on most others; I haven’t done any advanced modeling or guesstimating.
Maybe an afternoon for the initial version, and then two weeks of occasional tweaks. Say 10h to 30h in total? I imagine that if one wanted to scale this, one could get it to 30 mins to an hour for each estimate.
I think that that seems promisingly fast to me, given that this was an early attempt and could probably be sped up (holding quality/rigour constant) by experience, tools, templates, etc. So that updates me a bit further towards enthusiasm about this general idea.
I’d also note that the larger goals are to scale in non-human ways. If we have a bunch of examples, we could:
1) Open this up to a prediction-market style setup, with a mix of volunteers and possibly inexpensive hires.
2) As we get samples, some people could use data analysis to make simple algorithms to estimate the value of many more documents.
3) We could later use ML and similar to scale this further.
So even if each item were rather time-costly right now, this might be an important step for later. If we can’t even do this, with a lot of work, that would be a significant blocker.
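As a toy illustration of step 2, one could fit a simple model on hand-coded features of past evaluated writeups and use it to score new ones. Every feature and number below is made up purely for the sketch:

```python
import numpy as np

# Hypothetical training data: hand-coded features of past writeups
# (karma, a novelty score, log10 of dollars influenced) and the
# human-estimated impact in mQARPs. All numbers are invented.
features = np.array([
    [120, 3, 5.0],
    [40, 2, 4.0],
    [300, 4, 6.0],
    [15, 1, 3.0],
    [80, 3, 4.5],
])
mqarps = np.array([200, 50, 500, 10, 120])

# Fit a simple linear model with an intercept via least squares.
X = np.column_stack([features, np.ones(len(features))])
coef, *_ = np.linalg.lstsq(X, mqarps, rcond=None)

# Score a new, unevaluated writeup (also made-up features).
new_post = np.array([100, 3, 5.0, 1.0])
print(float(new_post @ coef))
```

A real version would need far more data points and better-validated features, but even a crude model like this could triage which writeups deserve a full human evaluation.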
https://www.lesswrong.com/posts/kMmNdHpQPcnJgnAQF/prediction-augmented-evaluation-systems
I was somewhat confused by the scale using Categorizing Variants of Goodhart’s Law as an example of a 100mQ paper, given that the LW post version of that paper won the 2018 AI Alignment Prize ($5k), which makes a pretty strong case for it being “a particularly valuable paper” (1Q, the next category up). I also think this scale significantly overvalues research agendas and popular books relative to papers. I don’t think these aspects of the rubric wound up impacting the specific estimates made here, though.
I’m not sure on the exact valuation research agendas should get, but I would argue that well thought-through research agendas can be hugely beneficial in that they can reorient many researchers in high-impact directions, leading them to write papers on topics that are vastly more important than they might have otherwise chosen.
I would argue an ‘ingenious’ paper written on an unimportant topic isn’t anywhere near as good as a ‘pretty good’ paper written on a hugely important topic.
Yes, the scale is under construction, and you’re not the first person to mention that the specific research agenda mentioned is overvalued.
Some specific things I was confused about
The estimated mQARPs per employee per month seems to differ substantially between sections. Is this based on something like dividing the posts/papers the org produced by the org’s total budget or number of FTE employees? (Your comment on ALLFED vs AI safety papers seems to indicate this? Note that I didn’t look closely at the Guesstimate model.)
“I have reason to believe that the amount of money influenced was large, and Larks’s writeup was further the only one available. My confidence interval is then $100,000 to $1M in moved funding”. That seems surprisingly high, but I have no specific knowledge on this. Could you share your reason to believe that? (But no problem if the reasoning is based on private info or is just hard to communicate concisely and explicitly.)
“You can see my first guesstimate model here, but I think that this was too low because it only took into account the value of papers, so here is a second model, which takes into account that the value which does not come from papers is 2 to 20x the magnitude of the value which does come from papers. I’m still uncertain, so my final estimate is even more uncertain than the Guesstimate model.”
Could you give a sense of where you see the rest of that value is coming from? (I’m not saying I disagree with you.)
Was that accounted for in the other evaluations too? My impression from your written description (without looking at the models) was that e.g. for ALLFED you estimated their impact as entirely coming from posts and papers?
Is there a reason you estimated the impact of the Giving Tuesday and EAGx things in terms of dollars moved, without converting that into mQARPS?
Part of why this confuses me is that:
I’d guess that dollars moved is actually not the primary value of EAGx (though you can still convert the value into dollars-moved-equivalents if you want)
Meanwhile, I’d guess that dollars moved is the primary value of the animal welfare commitments post and maybe some other posts (though you can still convert the value into mQARPs if you want)
“Similarly, while EAGx Boston 2018 and the EA Giving Tuesday Donation Matching Initiative might have taken similar amounts of time to organize, by comparably capable people, I prefer the second. This is in large part because EAGx events are scalable, whereas Giving Tuesdays are not.” Did you mean to say Giving Tuesday is scalable, whereas EAGx events are not?
In the case of ALLFED, this is based on my picturing one employee going about their month, and asking myself how surprising it would be if they couldn’t produce 10 mQARPs of value per month, or how surprising it would be if they could produce 50 mQARPs per month. In the case of the AI safety organizations, this is based on estimating the value of each of the papers that Larks thinks are valuable enough to mention, and then estimating what fraction of the total value of an organization those are.
Private info
a) Building up researchers into more capable researchers, knowledge acquired that isn’t published, information value of trying out dead ends, acquiring prestige, etc. b) I actually didn’t estimate ALLFED’s impact, I estimated the impact of the marginal hires, per 1.
Personal taste; it’s possible that was the inferior choice. I found it easier to picture the dollars moved than the improvement in productivity. In hindsight, maybe improving retention would be another main benefit which I didn’t consider.
I got that as a comment. The intuition here is that it would be really, really hard to find a project which moves as much money as Giving Tuesday and which you could do every day, every week, or every month. But if there are more than 52 local EA groups, an EAGx could be organized every week. If you think that EA is only doing projects at maximum efficiency (which it isn’t), and knowing only that Giving Tuesdays are done once a year and EAGx are done more often, I’d expect one EAGx to be less valuable than one Giving Tuesday.
Or, in other words, I’d expect there to be some tradeoff between quality and scalability.
Thanks for the clarifications :)
So did that estimate of the impact of marginal hires also account for how much those hires would contribute to “Building up researchers [themselves or others] into more capable researchers, knowledge acquired that isn’t published, information value of trying out dead ends, acquiring prestige”?
Oof, no it didn’t, good point.
How would something like this approach be used for decision-making?
You write:
And:
But this post estimates the impact of already completed projects/writeups. So precisely this sort of method couldn’t directly be used to choose what projects to do. Instead, I see at least two broad ways something like this method could be used as an input when choosing what projects to do:
When choosing what projects to do, look at estimates like these, and either explicitly reason about or form intuitions about what these estimates suggest about the impact of different projects one is considering
One way to do this would be to create classifications for different types of projects, and then look up what has been the estimated impact per dollar of past projects in the same or similar classifications to each of the projects one is now choosing between
I think there’d be many other specific ways to do this as well
When choosing what projects to do, explicitly make estimates like these for those specific future projects
If that’s the idea, then the sort of approach taken in this post could be seen as:
just a proof of concept, or
a way to calibrate one’s intuitions/forecasts (with the hope being that there’ll be transfer between calibration when estimating the impact of past projects and calibration when forecasting the impact of future projects), or
a way of getting reference classes / base rates / outside views
In that case, this second approach would sort-of incorporate the first approach suggested above as one part of it
Is one of these what you had in mind? Or both? Or something else?
Yeah, I think that the distinction between evaluation and forecasting is non-central. For example, these estimates can also be viewed as forecasts of what I would estimate if I spent 100x as much time on this, or as forecasts of what a really good system would output.
More to the point, if a project isn’t completed I could just estimate the distribution of expected quality, and the expected impact given each degree of quality (or, do a simplified version of that).
That said, I was thinking more about 2., though having a classification/lookup scheme would also be a way to produce explicit estimates.
Agreed, but that’s still different from forecasting the impact of a project that hasn’t happened yet, and the difference intuitively seems like it might be meaningful for our purposes. I.e., it’s not immediately obvious that methods and intuitions that work well for the sort of estimation/forecasting done in this post would also work well for forecasting the impact of a project that hasn’t happened yet.
One could likewise say that it’s not obvious that methods and intuitions that work well for forecasting how I’ll do in job applications would also work well for forecasting GDP growth in developing countries. So I guess my point was more fundamentally about the potential significance of the domain being different, rather than whether the thing can be seen as a type of forecasting or not.
So it sounds like you’re thinking that the sort of thing done in this post would be “a way to calibrate one’s intuitions/forecasts (with the hope being that there’ll be transfer between calibration when estimating the impact of past projects and calibration when forecasting the impact of future projects)”?
That does seem totally plausible to me; it just adds a step to the argument.
(I guess I’m also more generally interested in the question of how well forecasting accuracy and calibration transfers across domains—though at the same time I haven’t made the effort to look into it at all...)
Yes, I expect the intuitions for estimation to generalize to and help a great deal with the forecasting step, though I agree that this might not be intuitively obvious. I understand that estimation and forecasting seem like different categories, but I don’t expect that to be a significant hurdle in practice.
Specific reactions to the evaluation of Takeaways from EAF’s Hiring Round
I think that’s a substantial part of the impact, but that there may be other substantial parts too, such as:
Time saved by employees who have to design application processes (since it’s usually easier to do things when one has a good writeup as guidance)
Causing orgs to hire sooner, since they’re more confident they can do it well and without a huge time investment
Something along the lines of “health of the organisation”; if that post reduces the chance of making a hire which isn’t a good fit, it reduces the chance of frictions and someone ending up being fired or quitting, which I imagine are negative for the culture of the organisation
Something along the lines of “health of the community”; I imagine a better hiring round will mean better applicant experiences, which could reduce rates of value drift or burnout or the like
But these are just quick thoughts, and I haven’t run application rounds myself, and I think that those things overlap somewhat such that there’s a risk of double-counting.
FWIW, I think the upper bound of my 80% confidence interval would be above 10% more effective and 3 years staying at the org, and definitely above 1% more effective and 0.5 years staying there.
I’m also not sure how to interpret your upper bound itself having a range? (Caveat that I haven’t looked at your Guesstimate model.)
I also think that one other effect perhaps worth modelling is that better hiring rounds might mean hires stay at the org for longer (since better and more fitting hires are chosen). This could either be modelled as more output by those employees, or as less cost/output-reduction by employees involved in later hiring rounds, or maybe both.
There are also cases in which an org just doesn’t hire anyone at all at a given time if they don’t find a good enough fit, and presumably better hiring rounds somewhat reduces the odds of that.
FWIW, intuitively, that seems like a pretty low upper bound for the value of improving other orgs’ hiring rounds. I guess this is just for the reasons noted above. (And obviously it’d be better if I actually myself provided specific alternative numbers and models—I’m just taking the quicker option, sorry!)
Many of the other projects were stated to have an impact by increasing the funding certain organisations received, thereby helping them hire more people, thereby resulting in more useful output. So by that logic, shouldn’t those projects also be penalised for the lost opportunity cost of applicants involved in the hiring rounds run by the orgs which received extra funding due to the project?
Or am I misunderstanding the reasoning or the modelling approach? (That’s very possible; I didn’t actually look at any of your Guesstimate models.)
Yes, those seem like at least somewhat important pathways to impact that I’ve neglected, particularly the first two points. I imagine that could easily lead to a 2x to 3x error (but probably not to a 10x error).
To answer this specifically:
Yeah, I disagree with this. I’d expect most interventions to have a small effect, and in particular I expect it to just be hard to change people’s actions by writing words. In particular, I’d be much higher if I was thinking about the difference between a completely terrible hiring round and an excellent one, but I don’t know that people start off all that terrible or that this particular post brings people up all that much.
That seems reasonable. I think my intuitions would still differ from yours, but I don’t have that much reason to expect my intuitions are well-calibrated here, nor have I thought about this carefully and explicitly.
Upper bound being a range is a mistake, fixed now.