There is a part of me that finds the outcome (a 30 to 40% success rate) intuitively disappointing. However, it may suggest that the LTFF was taking the right amount of risk under a hits-based-giving approach.
FWIW, my immediate reaction had been exactly the opposite: “wow, the fact that this skews so positive means the LTFF isn’t risk-seeking enough”. But I don’t know if I’d stand by that assessment after thinking about it for another hour.
Yes, for me, updating upwards on total success based on a lower percentage success rate seems intuitively fairly weird. I'm not saying it's wrong, just that I have to stop and think about it/use my system 2.
In particular, you need a prior distribution under which more valuable opportunities have a lower success rate, and then a bag of opportunities such that the worse they do, the more excited you get.
Now, I think this happens if you have a bag with "golden tickets", "sure things", and "duds". Then not doing well would make you more excited if the "sure things" were much less valuable than the weighted average of the "duds" and "golden tickets" (a toy version of this update is sketched below).
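To make that mechanism concrete, here is a minimal sketch of the "bag" update. All success probabilities and values are made up for illustration; none of them come from the post:

```python
# Toy "bag" update: a low observed success rate can make the portfolio
# look *better* ex ante. All numbers below are illustrative.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypotheses about the bag: weights of (sure things, duds, golden tickets).
bags = {
    "mostly sure things":    (0.8, 0.1, 0.1),
    "duds + golden tickets": (0.1, 0.6, 0.3),
}
p_success = (0.9, 0.05, 0.1)  # success probability by type
value = (1.0, 0.0, 100.0)     # value delivered on success, by type

def success_rate(bag):
    return sum(w * p for w, p in zip(bag, p_success))

def expected_value(bag):
    return sum(w * p * v for w, p, v in zip(bag, p_success, value))

k, n = 4, 20  # observe a disappointing 20% success rate
posterior = {name: 0.5 * binom_pmf(k, n, success_rate(bag))
             for name, bag in bags.items()}
z = sum(posterior.values())
for name, bag in bags.items():
    print(f"{name}: posterior {posterior[name] / z:.3f}, "
          f"EV per grant {expected_value(bag):.2f}")
# The low success rate moves nearly all posterior mass onto the
# "duds + golden tickets" bag, whose expected value per grant is higher
# (~3.1 vs ~1.7), so doing badly makes the bag look better.
```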
But to get that, I think you'd have to have "golden tickets" be a binary thing, and in practice they often aren't. Take something like GovAI: its theory of impact seems robust enough that I would expect a long tail of impact (or impact proxies), rather than binary, lottery-ticket-shaped success/failure. Say I expect their impact distribution to be a power law: in that case, I would not get more excited if I saw them fail again and again. Conversely, if I do see them getting some successes, I would update upwards on the mean and the standard deviation of the power law distribution from which their impact is drawn.
Thanks, that makes sense.
I agree with everything you say about the GovAI example (and more broadly your last paragraph).
I do think my system 1 seems to work a bit differently, since I can imagine some situations in which I would find it intuitive to update upwards on total success based on a lower "success rate", though it would depend on the definition of the success rate. I can also tell some system-2 stories, but I don't think they are conclusive.
E.g., I worry that a large fraction of outcomes with "impact at least x" might reflect a selection process that is too biased toward things that look typical, or like sufficiently safe bets, thereby effectively sampling from a truncated range of a heavy-tailed distribution. The grants selected this way might then have an expected value of n times the median of the full distribution, with n>1 depending on what share of outliers you systematically miss and how good your selection power within the truncated distribution is. If the distribution is very heavy-tailed, this can easily be less than the mean of the full distribution, i.e., it might fail to even beat the benchmark of grant decisions by lottery.
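As a toy illustration of this worry (the parameters are made up, and a lognormal is just one convenient heavy-tailed stand-in):

```python
# Toy simulation of selecting from a truncated heavy-tailed distribution.
# sigma controls how heavy the tail is; all parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
impacts = rng.lognormal(mean=0.0, sigma=2.5, size=1_000_000)

lottery_mean = impacts.mean()  # benchmark: fund applications at random

# The selection process never sees the top 5% of outliers...
truncated = impacts[impacts < np.quantile(impacts, 0.95)]
# ...but within that truncated range it reliably picks the better half.
skilled_truncated_mean = truncated[truncated > np.median(truncated)].mean()

print(f"lottery mean:          {lottery_mean:.1f}")           # ~23
print(f"skilled but truncated: {skilled_truncated_mean:.1f}") # ~9
# With a sufficiently heavy tail, systematically missing the outliers
# loses more than good within-range selection gains: the lottery wins.
```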
(Tbc, in fact I think it’s implausible that LTFF or EAIF decisions are worse than decisions by lottery, at least if we imagine a lottery across all applications including desk rejects.)
Similarly, suppose I have a prior impact distribution that makes me expect that (made-up number) 80% of the total ex-post impact will come from 20% of all grants. Suppose further that I then do an ex-post evaluation that makes me think that, actually, the top 20% of grants only account for 50% of the total value. There are then different updates I can make (how much weight each deserves depends on other context and the exact parameters; see the sketch after this list for the first one):
1. The ex-post impact distribution is less heavy-tailed than I thought.
2. The grant selection process is systematically missing outliers.
3. The outcome was simply bad luck (which in a sense wouldn't be that surprising, since the empirical average is such an unstable estimate of the true mean of a highly heavy-tailed distribution). This could suggest that it would be valuable to find ways to increase the sample size, e.g., by spending less time on evaluating marginal grants and instead spending time on increasing the number of good applications.
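Here is the back-of-the-envelope sketch for the first of these, under the simplifying assumption that ex-post impact is exactly Pareto-distributed (a modelling choice of mine, not a claim from the post). For a Pareto distribution with tail index alpha > 1, the top fraction p of grants accounts for a share s = p^(1 - 1/alpha) of total impact, so each observed share pins down an implied alpha:

```python
# Implied Pareto tail index from "top p of grants holds share s of impact".
# Uses s = p**(1 - 1/alpha), valid for a Pareto distribution with alpha > 1.
from math import log

def implied_alpha(p: float, s: float) -> float:
    return 1 / (1 - log(s) / log(p))

print(f"prior,    80% from top 20%: alpha = {implied_alpha(0.2, 0.8):.2f}")  # ~1.16
print(f"observed, 50% from top 20%: alpha = {implied_alpha(0.2, 0.5):.2f}")  # ~1.76
# A higher tail index means a thinner tail, so taking the observation at
# face value is an update toward a noticeably less heavy-tailed distribution.
```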
However, I think that in this case my sys 1 probably "misfired", because the fraction of grants that performed better or worse than expected doesn't seem to have a straightforward implication within the kinds of models mentioned in this comment or yours.
Makes sense. In particular, noticing that the grants are all particularly legible might lead you to update in the direction of a truncated distribution like the one you consider. So far, the LTFF seems to have moved a bit in the direction of more legibility, but not that much.
Conversely, if I do see them getting some successes, I would update upwards on the mean and the standard deviation of the power law distribution from which their impact is drawn.

It makes sense to update upwards on the mean, but why would you update on the standard deviation from an n of 1? (I might be missing something obvious.)
Well, because a success can be produced by a process that has a high mean, but also by a process that has a lower mean and a higher standard deviation. For example, if you learn that someone has beaten Magnus Carlsen, it could be someone in the top 10, like Caruana, or it could be someone like Ivanchuk, who has a reputation as an "unreliable genius" and is currently number 56, but who, when he has good days, has extremely good days.
Suppose you give equal initial probability to all three normals (the green, black, and red curves in the figure). Then you sample an event, and its value is 1. You then update against the green distribution, and in favor of the red and black distributions. The black distribution has a higher mean, but the red one has a higher standard deviation.
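A minimal numerical sketch of that update, with means and standard deviations that are my own stand-ins for the green, black, and red curves rather than the figure's actual values:

```python
# Bayesian update over three normal hypotheses after observing one event at x=1.
from scipy.stats import norm

hypotheses = {
    "green (mean 0.0, sd 0.3)": norm(loc=0.0, scale=0.3),
    "black (mean 0.8, sd 0.3)": norm(loc=0.8, scale=0.3),
    "red   (mean 0.0, sd 1.0)": norm(loc=0.0, scale=1.0),
}
prior = 1 / len(hypotheses)  # uniform prior over the three hypotheses

x = 1.0  # the single observed event
likelihood = {name: dist.pdf(x) for name, dist in hypotheses.items()}
evidence = sum(prior * l for l in likelihood.values())
for name, l in likelihood.items():
    print(f"{name}: posterior {prior * l / evidence:.2f}")
# Mass moves away from "green" (the event sits ~3.3 sds out) and onto both
# "black" (higher mean) and "red" (higher standard deviation).
```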
Thanks, I understand what you mean now!
To really make this update, I'd want some more bins than the ones Nuno provides. That is, there could be an "extremely more successful than expected" bin, and all that would matter is whether you manage to get any grant in that bin.
(For example, I think Roam got a grant in 2018-2019, and they might fall in that bin, though I haven’t thought a lot about it.)
I would also want more bins than the ones I provide; not considering the total value is probably one of the parts of this post I like least.
Yeah I agree that info on how much absolute impact each grant seems to have had would be more relevant for making such updates. (Though of course absolute impact is very hard to estimate.)
Strictly speaking, the info in the OP is consistent with "99% of all impact came from one grant", and that grant could even be one of the "Not as successful as hoped for" ones. (Though taking into account all context/info, I would guess that the highest-impact grants are in the "More successful than expected" bucket.) And if that were the case, one shouldn't make any updates of the kind motivated by "this looks less heavy-tailed than I expected".