I haven’t had time yet to think about your specific claims, but I’m glad to see attention being paid to this issue. Thank you for contributing what I suspect is an important discussion!
You might be interested in the following paper, which essentially shows that, under an additional assumption, the Optimizer’s Curse not only makes us overestimate the value of the apparent top option but can in fact make us predictably choose the wrong option.
Denrell, J. and Liu, C., 2012. Top performers are not the most impressive when extreme performance indicates unreliability. Proceedings of the National Academy of Sciences, 109(24), pp.9331-9336.
The crucial assumption, roughly, is that the reliability of our assessments varies sufficiently between options. Intuitively, I’m concerned that this might apply when EAs consider interventions across different cause areas: e.g., our uncertainty about the value of AI safety research is much larger than our uncertainty about the short-term benefits of unconditional cash transfers.
(See also the part on the Optimizer’s Curse and endnote [6] on Denrell and Liu (2012) in this post by me, though I suspect it won’t teach you anything new.)
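To give a rough, concrete sense of the mechanism (this is just a toy Monte Carlo I put together, not the model from Denrell and Liu, and all the numbers, group sizes, and noise levels below are arbitrary choices for illustration): suppose every option has the same prior distribution over true value, but some options are measured much more noisily than others. Naively picking the option with the highest raw estimate then almost always selects one of the noisily measured options, and that choice is on average worse than the best-looking reliably measured option.

```python
# Toy illustration (not the Denrell & Liu model): all options share the same
# prior over true value, but measurement noise differs across options.
# Picking the highest raw estimate systematically selects noisy options.
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000
n_reliable, n_noisy = 10, 10          # arbitrary group sizes
sd_reliable, sd_noisy = 0.5, 3.0      # arbitrary measurement-noise SDs

noise_sd = np.array([sd_reliable] * n_reliable + [sd_noisy] * n_noisy)

picked_noisy = 0
value_naive = 0.0
value_best_reliable = 0.0

for _ in range(n_trials):
    true_vals = rng.normal(0.0, 1.0, n_reliable + n_noisy)   # same prior for all
    estimates = true_vals + rng.normal(0.0, noise_sd)        # noisy observations

    naive_choice = int(np.argmax(estimates))                 # highest raw estimate
    best_reliable = int(np.argmax(estimates[:n_reliable]))   # best-looking reliable option

    picked_noisy += naive_choice >= n_reliable
    value_naive += true_vals[naive_choice]
    value_best_reliable += true_vals[best_reliable]

print(f"naive rule picked a noisily measured option {picked_noisy / n_trials:.0%} of the time")
print(f"mean true value of naive choice:         {value_naive / n_trials:.2f}")
print(f"mean true value of best reliable option: {value_best_reliable / n_trials:.2f}")
```

In this toy setup the naive rule ends up with a systematically worse option than one it passed over, which is the sense in which the error is predictable rather than just noisy.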
Can you expand on how you would directly estimate the reliability of charity evaluations? I feel like there are a lot of realistic situations where this would be extremely difficult to do well.
I mean: do the adjustment for the optimizer’s curse, or whatever else is in that paper.
I think talk of doing things “well” or “reliably” should be tabooed in this discussion, because no one has a coherent idea of what the threshold for ‘well enough’ or ‘reliable enough’ is in this context. “Better” or “more reliable” makes sense.
Kind of an odd assumption that dependence on luck varies from player to player.
Intuitively, it strikes me as appropriate for some realistic situations. For example, you might try to estimate the performance of people based on quite different kinds or magnitudes of inputs; e.g., one applicant might have a long relevant track record, while for another you might just have a brief work test. Or you might compare the impact of interventions that are backed by very different kinds of evidence, say an RCT vs. a speculative, qualitative argument.
Maybe there is something I’m missing here about why the assumption is odd, or perhaps even why the examples I gave don’t have the property required in the paper? (The latter would certainly be plausible as I read the paper a while ago, and even back then not very closely.)
If we are talking about charity evaluations, then reliability can be estimated directly, so this is no longer a predictable error.
Hmm. This made me wonder whether the paper’s results depend on the decision-maker being uncertain about which options have been estimated reliably vs. unreliably. It seems possible that the effect could disappear if the reliability of my estimates varies but I know that the variance of my value estimate for option 1 is v_1, the one for option 2 is v_2, etc. (even if the v_i vary a lot). (I don’t have time to check the paper or get clear on this, I’m afraid.)
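To sketch what I mean (again just a toy model I made up, with an arbitrary common prior and arbitrary noise variances, not anything taken from the paper): if the per-option noise variances v_i are known, then replacing each raw estimate with its standard normal-normal posterior mean before choosing seems to remove the systematic preference for noisily measured options.

```python
# Toy sketch: when each option's noise variance v_i and a common prior are known,
# shrinking each estimate toward the prior mean before choosing (normal-normal
# posterior mean) removes the naive rule's bias toward noisy options.
import numpy as np

rng = np.random.default_rng(1)
n_trials = 100_000
prior_mean, prior_var = 0.0, 1.0                 # arbitrary common prior
noise_var = np.array([0.25] * 10 + [9.0] * 10)   # known per-option variances v_i

# shrinkage factor: posterior mean = prior_mean + gain_i * (estimate_i - prior_mean)
gain = prior_var / (prior_var + noise_var)

value_naive = 0.0
value_adjusted = 0.0
for _ in range(n_trials):
    true_vals = rng.normal(prior_mean, np.sqrt(prior_var), len(noise_var))
    estimates = true_vals + rng.normal(0.0, np.sqrt(noise_var))

    value_naive += true_vals[np.argmax(estimates)]            # pick highest raw estimate
    shrunk = prior_mean + gain * (estimates - prior_mean)     # Bayesian adjustment
    value_adjusted += true_vals[np.argmax(shrunk)]            # pick highest shrunk estimate

print(f"mean true value, highest raw estimate:    {value_naive / n_trials:.2f}")
print(f"mean true value, highest shrunk estimate: {value_adjusted / n_trials:.2f}")
```

Of course, whether something like this is feasible for real charity evaluations depends on being able to estimate the prior and the v_i in the first place, which I take to be the difficulty raised above.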
Is this what you were trying to say here?
Thanks Max! That paper looks interesting—I’ll have to give it a closer read at some point.
I agree with you that how the reliability of assessments varies between options is crucial.