Yes, for me, updating upwards on total success based on a lower percentage success rate seems intuitively fairly weird. I’m not saying it’s wrong, it’s that I have to stop and think about it/use my system 2.
In particular, you have to have a prior distribution such that more valuable opportunities have a lower success rate. But then you have to have a bag of opportunities such that the worse they do, the more you get excited.
Now, I think this happens if you have a bag with “golden tickets”, “sure things”, and “duds”. Then not doing well would make you more excited if “sure things” were much less valuable than the weighted average of “duds” and “golden tickets”.
But to get that, I think you’d have to have “golden tickets” be a binary thing. In practice, though, take something like GovAI. Its theory of impact seems robust enough that I would expect to see a long tail of impact or impact proxies, rather than binary success/failure or lottery-ticket-shaped impact. Say I expect their impact distribution to be a power law: in that case, I would not get more excited if I saw them fail again and again. Conversely, if I did see them getting some successes, I would update upwards on the mean and the standard deviation of the power law distribution from which their impact is drawn.
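As a sketch of that last update (all numbers made up, and a deliberately simple model): if impact is drawn from a Pareto distribution with known minimum 1 and unknown tail index alpha, a Gamma prior on alpha is conjugate, and observing a big success pulls the posterior toward a smaller alpha, i.e., a heavier tail, which raises both the implied mean and the implied spread.

```python
# Sketch (made-up numbers): Pareto(x_min = 1) likelihood with unknown tail
# index alpha; a Gamma(a, b) prior on alpha is conjugate, with posterior
# Gamma(a + n, b + sum(log x_i)).
import math

a, b = 3.0, 1.0  # prior: E[alpha] = a / b = 3 (a fairly thin tail)

def posterior_mean_alpha(observations, a=a, b=b):
    """Posterior mean of the Pareto tail index after seeing `observations`."""
    n = len(observations)
    return (a + n) / (b + sum(math.log(x) for x in observations))

alpha_small = posterior_mean_alpha([1.5])  # a modest outcome
alpha_big = posterior_mean_alpha([30.0])   # one big success

print("posterior E[alpha] after modest outcome:", round(alpha_small, 2))
print("posterior E[alpha] after big success:   ", round(alpha_big, 2))
# A lower alpha means a heavier tail: for alpha <= 2 the Pareto variance is
# infinite, and for alpha <= 1 even the mean is -- so a big success shifts
# belief toward both a higher mean and a higher spread.
```

The point is only qualitative: a single large observation moves the whole posterior toward heavier tails, not just toward a higher mean.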
I agree with everything you say about the GovAI example (and more broadly your last paragraph).
I do think my system 1 seems to work a bit differently, since I can imagine some situations in which I would find it intuitive to update upwards on total success based on a lower ‘success rate’, though it would depend on the definition of the success rate. I can also tell some system-2 stories, but I don’t think they are conclusive.
E.g., I worry that a large fraction of outcomes with “impact at least x” might reflect a selection process that is too biased toward things that look typical, or like sufficiently safe bets, thereby effectively sampling from a truncated range of a heavy-tailed distribution. The grants selected this way might then have an expected value of n times the median of the full distribution, with n > 1 depending on what share of outliers you systematically miss and how good your selection power within the truncated distribution is. If the distribution is very heavy-tailed, this can easily be less than the mean of the full distribution, i.e., it might fail to even beat the benchmark of making grant decisions by lottery.
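To illustrate the truncation worry numerically (all parameters made up; the lognormal is just a stand-in for a heavy-tailed impact distribution, and the 95th-percentile cap is a stand-in for systematically missing outliers):

```python
# Sketch: if grant impacts follow a heavy-tailed lognormal and the selection
# process effectively can't "see" the top outcomes, the selected pool's mean
# can be several times the full median yet still well below the full mean.
import random

random.seed(0)

# Hypothetical impact distribution: lognormal with a heavy right tail.
impacts = [random.lognormvariate(0, 2.5) for _ in range(100_000)]

full_mean = sum(impacts) / len(impacts)
full_median = sorted(impacts)[len(impacts) // 2]

# Selection truncated at the 95th percentile: outliers are systematically missed.
cap = sorted(impacts)[int(0.95 * len(impacts))]
truncated = [x for x in impacts if x <= cap]
trunc_mean = sum(truncated) / len(truncated)

print(f"full mean:      {full_mean:.2f}")
print(f"full median:    {full_median:.2f}")
print(f"truncated mean: {trunc_mean:.2f}")
```

With a tail this heavy, the top 5% of draws carries most of the expected value, so the truncated mean handily beats the median (n > 1) while falling far short of the full mean, which is the lottery benchmark.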
(Tbc, in fact I think it’s implausible that LTFF or EAIF decisions are worse than decisions by lottery, at least if we imagine a lottery across all applications including desk rejects.)
Similarly, suppose I have a prior impact distribution that makes me expect that (made-up number) 80% of the total ex-post impact will come from 20% of all grants. Suppose I then do an ex-post evaluation that makes me think that, actually, the top 20% of grants account for only 50% of the total value. There are then different updates I can make (and how strongly I should make each of them depends on other context and the exact parameters):
1. The ex-post impact distribution is less heavy-tailed than I thought.
2. The grant selection process is systematically missing outliers.
3. The outcome was simply bad luck (which in a sense wouldn’t be that surprising, since the empirical average is such an unstable estimate of the true mean of a highly heavy-tailed distribution). This could suggest that it would be valuable to find ways to increase the sample size, e.g., by spending less time on evaluating marginal grants and instead spending that time on increasing the number of good applications.
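The “bad luck” update can be illustrated with a quick simulation (made-up tail index; `pareto_sample` is a hypothetical helper using inverse-CDF sampling):

```python
# Sketch: for a Pareto distribution with tail index alpha just above 1, the
# sample mean of even a few hundred draws swings wildly around the true mean,
# so one disappointing cohort of grants is weak evidence about the mean.
import random

random.seed(1)

alpha = 1.2                      # heavy tail: finite mean, infinite variance
true_mean = alpha / (alpha - 1)  # Pareto(x_min = 1) mean = alpha / (alpha - 1)

def pareto_sample(n):
    """Draw n values from Pareto(x_min = 1) via inverse-CDF sampling."""
    return [(1 - random.random()) ** (-1 / alpha) for _ in range(n)]

# Ten "cohorts" of 200 grants each: compare their empirical means.
cohort_means = [sum(pareto_sample(200)) / 200 for _ in range(10)]
print("true mean:", round(true_mean, 2))
print("cohort means:", [round(m, 2) for m in cohort_means])
```

The cohort means scatter across a wide range around the true mean, which is why a larger sample (more good applications) is more informative than scrutinizing any one cohort harder.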
However, I think that in this case my system 1 probably “misfired”, because the fraction of grants that performed better or worse than expected doesn’t seem to have a straightforward implication within the kinds of models mentioned in this comment or in yours.
Makes sense. In particular, noticing that grants are all particularly legible might lead you to update in the direction of a truncated distribution like the one you consider. So far, the LTFF seems to have moved a bit in the direction of more legibility, but not that much.
Conversely, if I do see them getting some successes, I would update upwards on the mean and the standard deviation of the power law distribution from which their impact is drawn.
It makes sense to update upwards on the mean, but why would you update on the standard deviation from n of 1? (I might be missing something obvious)
Well, because a success can be produced by a process that has a high mean, but also by a process that has a lower mean and a higher standard deviation. For example, if you learn that someone has beaten Magnus Carlsen, it could be someone in the top 10, like Caruana, or it could be someone like Ivanchuk, who has a reputation as an “unreliable genius” and is currently number 56, but who, when he has good days, has extremely good days.
Suppose you assign equal initial probability to three normal distributions: green, red, and black. Then you sample an event, and its value is 1. You update against the green distribution, and in favor of the red and black distributions. The black distribution has a higher mean, but the red one has a higher standard deviation.
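Here is a minimal version of that update, with made-up parameters for the three distributions (green: low mean and narrow; black: high mean; red: lower mean but wide):

```python
# Sketch: equal priors over three hypothesized normal processes; observe a
# single value of 1 and compute the posterior over the processes by Bayes'
# rule. Probability flows away from green and toward both black and red.
import math

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

priors = {"green": 1 / 3, "black": 1 / 3, "red": 1 / 3}
params = {
    "green": (0.0, 0.25),  # low mean, narrow
    "black": (1.2, 0.5),   # high mean
    "red":   (0.5, 0.6),   # lower mean, higher standard deviation
}

x = 1.0  # the observed success
likelihoods = {k: normal_pdf(x, *params[k]) for k in priors}
evidence = sum(priors[k] * likelihoods[k] for k in priors)
posteriors = {k: priors[k] * likelihoods[k] / evidence for k in priors}

for k, p in posteriors.items():
    print(f"{k}: {p:.3f}")
```

With these parameters, the green hypothesis ends up with nearly zero posterior probability, while both black (higher mean) and red (higher spread) end up above their 1/3 priors, which is the update described above.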
Thanks, I understand what you mean now!