I agree with everything you say about the GovAI example (and more broadly your last paragraph).
I do think my system 1 works a bit differently: I can imagine some situations in which I would find it intuitive to update upwards on total success based on a lower ‘success rate’, though this would depend on how the success rate is defined. I can also tell some system-2 stories, but I don’t think they are conclusive.
E.g., I worry that a large fraction of outcomes with “impact at least x” might reflect a selection process that is too biased toward things that look typical or like sufficiently safe bets, thereby effectively sampling from a truncated range of a heavy-tailed distribution. The grants selected this way might then have an expected value of n times the median of the full distribution, with n > 1 depending on what share of outliers you systematically miss and on how good your selection power within the truncated distribution is. If the distribution is very heavy-tailed, this can easily be less than the mean of the full distribution; i.e., the process might fail to even beat the benchmark of making grant decisions by lottery.
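A quick simulation can illustrate the effect. This is just a sketch under assumptions I'm making up for illustration: a lognormal impact distribution, and a "safe bets" selection process modeled crudely as discarding the top 5% of outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed impact distribution (lognormal; sigma is an
# illustrative choice -- the heavier the tail, the starker the effect).
sigma = 2.5
impacts = rng.lognormal(mean=0.0, sigma=sigma, size=1_000_000)

median = np.median(impacts)
full_mean = impacts.mean()

# Crude model of a selection process biased toward safe bets: it only
# ever samples from below the 95th percentile, i.e. from a truncated
# range that systematically misses the outliers.
cutoff = np.quantile(impacts, 0.95)
truncated_mean = impacts[impacts <= cutoff].mean()

print(f"median of full distribution:  {median:.2f}")
print(f"mean of full distribution:    {full_mean:.2f}")
print(f"mean of truncated range:      {truncated_mean:.2f}")
print(f"truncated mean / median (n):  {truncated_mean / median:.1f}")
```

With these parameters the truncated mean lands well above the median (n > 1) but well below the full mean, which is the "fails to beat a lottery across all applications" scenario.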
(Tbc, in fact I think it’s implausible that LTFF or EAIF decisions are worse than decisions by lottery, at least if we imagine a lottery across all applications including desk rejects.)
Similarly, suppose I have a prior impact distribution that makes me expect that (made-up number) 80% of the total ex-post impact will come from 20% of all grants. Suppose I then do an ex-post evaluation that makes me think that, actually, the top 20% of grants only account for 50% of the total value. There are then different updates I can make (how much weight to put on each depends on other context and the exact parameters):
The ex-post impact distribution is less heavy-tailed than I thought.
The grant selection process is systematically missing outliers.
The outcome was simply bad luck (which in a sense wouldn’t be that surprising since the empirical average is such an unstable estimate of the true mean of a highly heavy-tailed distribution). This could suggest that it would be valuable to find ways to increase the sample size, e.g., by spending less time on evaluating marginal grants and instead spending time on increasing the number of good applications.
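A small simulation can illustrate the third point, i.e. how unstable these empirical shares are. This is a sketch with made-up parameters: a lognormal impact distribution with sigma chosen so the true top-20% share is roughly 80%, and cohorts of 50 grants.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical heavy-tailed impact distribution; sigma is an assumption
# chosen so that, in the full distribution, the top 20% of grants
# account for roughly 80% of total impact.
sigma = 1.7
n_grants = 50        # size of one simulated grant cohort
n_batches = 10_000   # number of simulated cohorts

samples = rng.lognormal(0.0, sigma, size=(n_batches, n_grants))
samples.sort(axis=1)
top_k = int(0.2 * n_grants)

# Observed share of total impact from the top 20% of each cohort.
top_share = samples[:, -top_k:].sum(axis=1) / samples.sum(axis=1)

print(f"mean top-20% share across cohorts: {top_share.mean():.2f}")
print(f"5th percentile of observed share:  {np.quantile(top_share, 0.05):.2f}")
print(f"95th percentile of observed share: {np.quantile(top_share, 0.95):.2f}")
```

The spread between the 5th and 95th percentiles is wide, so observing a 50% share in one cohort is weaker evidence against a heavy-tailed prior than it might intuitively seem; larger cohorts would narrow it.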
However, I think that in this case my system 1 probably “misfired”, because the fraction of grants that performed better or worse than expected doesn’t seem to have a straightforward implication within the kinds of models mentioned in this comment or yours.
Makes sense. In particular, noticing that the grants are all particularly legible might lead you to update in the direction of a truncated distribution like the one you consider. So far, the LTFF seems to have maybe moved a bit in the direction of more legibility, but not that much.
Thanks, that makes sense.