What processes do you have for monitoring the outcome/impact of grants, especially grants to individuals?

As part of CEA's due diligence process, all grantees must submit progress reports documenting how they've spent their money. If a grantee applies for renewal, we'll perform a detailed evaluation of their past work. Additionally, we informally look back at past grants, focusing on grants that were controversial at the time, or that seem to have been particularly good or bad.
I'd like us to be more systematic in our grant evaluation, and this is something we're discussing. One problem is that many of the grants we make are quite small, so it just isn't cost-effective for us to evaluate all of them in detail. Because of this, any more detailed evaluation we perform would have to cover a subset of grants.
I see two main benefits of evaluation: 1) improving future grant decisions; 2) holding the fund accountable. Point 1) would suggest choosing grants we expect to be particularly informative: for example, those where fund managers disagreed internally, or those we were particularly excited about and would like to replicate. Point 2) would suggest focusing on grants that were controversial amongst donors, or where there were potential conflicts of interest.
It's important to note that other things help with these points, too. For 1), improving our grant-making process, we are working on sharing best practices between the different EA Funds. For 2), we are seeking to increase transparency about our internal processes, such as in this doc (which we will soon add as an FAQ entry). Since evaluation is time-consuming, in the short term we are likely to evaluate only a small percentage of our grants, though we may scale this up as fund capacity grows.
Do the LTFF fund managers make forecasts about potential outcomes of grants?
And/or do you write down in advance what sort of proxies you'd want to see from this grant after x amount of time? (E.g., what you'd want to see to feel that this had been a big success and that similar grant applications should be viewed (even) more positively in future, or that it would be worth renewing the grant if the grantee applied again.)
One reason that first question came to mind was that I previously read a 2016 Open Phil post that states:
Both the Open Philanthropy Project and GiveWell recently began to make probabilistic forecasts about our grants. For the Open Philanthropy Project, see e.g. our forecasts about recent grants to Philip Tetlock and CIWF. For GiveWell, see e.g. forecasts about recent grants to Evidence Action and IPA. We also make and track some additional grant-related forecasts privately. The idea here is to be able to measure our accuracy later, as those predictions come true or are falsified, and perhaps to improve our accuracy from past experience. So far, we are simply encouraging predictions without putting much effort into ensuring their later measurability.
We're going to experiment with some forecasting sessions led by an experienced "forecast facilitator": someone who helps elicit forecasts from people about the work they're doing, in a way that tries to be as informative and helpful as possible. This might improve the forecasts mentioned in the previous bullet point.
(I don't know whether, how, and how much Open Phil and GiveWell still do things like this.)
We haven't historically done this. As someone who has tried pretty hard to incorporate forecasting into my work at LessWrong, my sense is that it actually takes a lot of time until you can get a group of 5 relatively disagreeable people to agree on an operationalization that makes sense to everyone, and so this isn't really super feasible to do for lots of grants. I've made forecasts for LessWrong, and usually creating a set of forecasts that actually feels useful in assessing our performance takes me at least 5-10 hours.
It's possible that other people are much better at this than I am, but this makes me kind of hesitant to use classical forecasting methods, at least, as part of LTFF evaluation.
It seems plausible to me that a useful version of forecasting grant outcomes would be too time-consuming to be worthwhile. (I don't really have a strong stance on the matter currently.) And your experience with useful forecasting for LessWrong work being very time-consuming definitely seems like relevant data.
But this part of your answer confused me:
my sense is that it actually takes a lot of time until you can get a group of 5 relatively disagreeable people to agree on an operationalization that makes sense to everyone, and so this isn't really super feasible to do for lots of grants
Naively, I'd have thought that, if that was a major obstacle, you could just have a bunch of separate operationalisations, and people can forecast on whichever ones they want to forecast on. If, later, some or all operationalisations do indeed seem to have been too flawed for it to be useful to compare reality to them, assess calibration, etc., you could just not do those things for those operationalisations/that grant.
(Note that I'm not necessarily imagining these forecasts being made public in advance or afterwards. They could be engaged with internally to the extent that makes sense, sometimes ignoring them if that seems appropriate in a given case.)
Is there a reason I'm missing for why this doesn't work?
Or was the point about the difficulty of agreeing on an operationalisation really meant just as evidence that useful operationalisations are hard to generate, as opposed to the disagreement itself being the obstacle?
I think the most lightweight-but-still-useful forecasting operationalization I'd be excited about is something like:
12/24/120 months from now, will I still be very excited about this grant?
12/24/120 months from now, will I be extremely excited about this grant?
This gets at whether people think it's a good idea ex post, and also (if people are well-calibrated) can quantify whether people are insufficiently or excessively risk/ambiguity-averse, in the classic sense of the term.
This seems helpful to assess fund managers' calibration and improve their own thinking and decision-making. It's less likely to be useful for communicating their views transparently to one another, or to the community, and it's susceptible to post-hoc rationalization. I'd prefer an oracle external to the fund, like "12 months from now, will X have ≥7/10 excitement about this grant on a 1-10 scale?", where X is a person trusted by the fund managers who will likely know about the project anyway, such that the cost to resolve the forecast is small.
I plan to encourage the funds to experiment with something like this going forward.

I agree that your proposed operationalization is better for the stated goals, assuming similar levels of overhead.
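(As a purely illustrative aside: the sketch below shows one minimal way such forecasts could be recorded and scored once they resolve, assuming excitement questions like those above and an external oracle who reports a 1-10 rating resolved at the ≥7 threshold. The grant names, probabilities, and ratings are invented, and nothing here is implied to be an actual LTFF process.)

```python
# Hypothetical sketch (not an actual LTFF process): recording and scoring
# "will I still be excited about this grant in N months?" forecasts.
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Forecast:
    grant: str                       # grant identifier (invented examples below)
    horizon_months: int              # e.g. 12, 24, or 120
    p_excited: float                 # forecast probability of still being excited
    resolved: Optional[bool] = None  # filled in once the horizon passes


def resolve_with_oracle(forecast: Forecast, oracle_rating: int) -> Forecast:
    """Resolve using an external oracle's 1-10 excitement rating,
    counting >= 7 as 'still excited' (the threshold suggested above)."""
    forecast.resolved = oracle_rating >= 7
    return forecast


def brier_score(forecasts: list[Forecast]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes:
    0 is perfect, and always saying 50% would score 0.25."""
    resolved = [f for f in forecasts if f.resolved is not None]
    return mean((f.p_excited - float(f.resolved)) ** 2 for f in resolved)


if __name__ == "__main__":
    # Invented data, purely for illustration.
    forecasts = [
        resolve_with_oracle(Forecast("grant-A", 12, 0.8), oracle_rating=8),
        resolve_with_oracle(Forecast("grant-B", 12, 0.6), oracle_rating=4),
        resolve_with_oracle(Forecast("grant-C", 24, 0.3), oracle_rating=7),
    ]
    print(f"Brier score: {brier_score(forecasts):.3f}")
```

With enough resolved forecasts, the same records could also be bucketed by stated probability to check calibration directly, rather than relying on a single summary score.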
Just to make sure I'm understanding, are you also indicating that the LTFF doesn't write down in advance what sort of proxies you'd want to see from this grant after x amount of time? And that you think the same challenges with doing useful forecasting for your LessWrong work would also apply to that?
These two things (forecasts and proxies) definitely seem related, and both would involve challenges in operationalising things. But they also seem meaningfully different.
I'd also think that, in evaluating a grant, I might find it useful to partly think in terms of "What would I like to see from this grantee x months/years from now? What sorts of outputs or outcomes would make me update more in favour of renewing this grant (if that's requested) and making similar grants in future?"
We've definitely informally written things like "this is what would convince me that this grant was a good idea", but we don't have a more formalized process for writing down specific, objective operationalizations that we all forecast on.
I'm personally pretty excited about trying to make some quick forecasts for a significant fraction (say, half) of the grants that we actually make, but this is something that's on my list to discuss at some point with the LTFF. I mostly agree with the issues that Habryka mentions, though.
Do the LTFF fund managers make forecasts about potential outcomes of grants?
To add to Habryka's response: we do give each grant a quantitative score (on a scale from −5 to +5, where 0 is zero impact). This obviously isn't as helpful as a detailed probabilistic forecast, but I think it does give a lot of the value. For example, one question I'd like to answer from retrospective evaluation is whether we should be more consensus-driven or fund anything that at least one manager is excited about. We could address this by scrutinizing past grants that had a high variance in scores between managers.
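(Again as an illustrative aside: here is a minimal sketch of what "scrutinizing past grants that had a high variance in scores between managers" could look like, assuming each manager's −5 to +5 score per grant is recorded somewhere. The data is invented and this is only a sketch, not an actual LTFF tool.)

```python
# Hypothetical sketch: flag past grants where fund managers' -5..+5 scores
# disagreed the most, as candidates for closer retrospective evaluation.
from statistics import mean, pstdev

# Invented example data: grant id -> each manager's score on the -5..+5 scale.
scores_by_grant = {
    "grant-A": [4, 3, 4, 2],
    "grant-B": [5, -2, 1, 3],
    "grant-C": [0, 1, 0, -1],
}


def disagreement_report(scores_by_grant, top_n=2):
    """Return the top_n grants ranked by standard deviation of manager scores."""
    rows = [
        (grant, mean(scores), pstdev(scores))
        for grant, scores in scores_by_grant.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)[:top_n]


for grant, avg, spread in disagreement_report(scores_by_grant):
    print(f"{grant}: mean score {avg:+.1f}, std dev {spread:.2f}")
```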
I think it might make sense to start doing forecasting for some of our larger grants (where we're willing to invest more time), and in cases where the key uncertainties are easy to operationalize.