I'm going to try to clarify further why I think the Bayesian solution in the original paper on the Optimizer's Curse is inadequate.

The Optimizer's Curse is defined by Proposition 1: informally, the expectation of the estimated value of your chosen intervention overestimates the expectation of its true value when you select the intervention with the maximum estimate.
The proposed solution is to instead maximize the posterior expected value of the variable being estimated (conditional on your estimates, the data, etc.), with a prior distribution for this variable, and this is purported to be justified by Proposition 2.
However, Proposition 2 holds no matter which priors and models you use; there are no restrictions at all in its statement (or proof). It doesn't actually tell you that your posterior distributions will tend to better predict values you will later measure in the real world (e.g. by checking if they fall in your 95% credence intervals), because there need not be any connection between your models or priors and the real world. It only tells you that your maximum posterior EV equals your corresponding prior's EV (taking both conditional on the data, or neither, although the posterior EV is already conditional on the data).
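To make that concrete, here's a minimal sketch (my own illustration, not from the paper) of the internal consistency Proposition 2 does give you, assuming a uniform prior on each option's success probability (so the posterior mean after k observations with h successes is (h+1)/(k+2)): when the true values really are drawn from your prior, the posterior EV of the option you pick by maximum posterior EV is unbiased for that option's true value on average.

```python
import numpy as np

# Sketch only: the true value of each option is drawn from the same uniform
# prior the agent uses, so the prior/model is "right". Selecting the option
# with the maximum posterior EV then gives an estimate that is unbiased for
# the selected option's true value on average -- which is all Proposition 2
# guarantees.
rng = np.random.default_rng(0)
n_options, k, trials = 10, 5, 100_000

mu = rng.uniform(0.0, 1.0, (trials, n_options))   # true values, drawn from the prior
successes = rng.binomial(k, mu)                   # k observations per option
post_ev = (successes + 1) / (k + 2)               # Beta(1 + successes, 1 + failures) posterior mean
chosen = post_ev.argmax(axis=1)
rows = np.arange(trials)

# ~0: no systematic overestimate when the world actually matches the prior
print((post_ev[rows, chosen] - mu[rows, chosen]).mean())
```

The examples below are about what happens when that assumption fails.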
Something I would still call an "optimizer's curse" can remain even with this solution when we are concerned with the values of future measurements rather than just the expected values of our posterior distributions based on our subjective priors. I'll give four examples: the first is just to illustrate, and the other three are real-world examples:
1. Suppose you have n different fair coins, but you aren't 100% sure they're all fair, so you have a prior distribution over the future frequency of heads (it could be symmetric in heads and tails, so the expected value would be 1/2 for each), and you use the same prior for each coin. You want to choose the coin which has the maximum future frequency of landing heads, based on information about the results of finitely many new coin flips from each coin. If you select the one with the maximum posterior EV, and repeat this trial many times (flip each coin multiple times, select the one with the max posterior EV, and then repeat), you will tend to find the posterior EV of your chosen coin to be greater than 1/2, but since the coins are actually fair, your estimate will be too high more than half of the time on average. I would still call this an "optimizer's curse", even though it followed the recommendations of the original paper. Of course, in this scenario, it doesn't matter which coin is chosen.
Now, suppose all the coins are as before except for one which is actually biased towards heads, and you have a prior for it which will give a lower posterior EV conditional on k heads and no tails than the other coins would (e.g. you've flipped it many times before with particular results to achieve this; or maybe you already know its bias with certainty). You will record the results of k coin flips for each coin. With enough coins, and depending on the actual probabilities involved, you could be less likely to select the biased coin (on average, over repeated trials) based on maximum posterior EV than by choosing a coin randomly; you'll do worse than chance.
(Math to demonstrate the possibility of the posteriors working this way for k heads out of k: you could have a uniform prior on the true future long-run average frequency of heads for the unbiased coins, i.e. p(μ_i) = 1 for μ_i in the interval [0,1]; then p(μ_i | k heads) = (k+1)μ_i^k, and E[μ_i | k heads] = (k+1)/(k+2), which goes to 1 as k goes to infinity. You could have a prior which gives certainty to your biased coin having any true average frequency < 1, so any of the unbiased coins which lands heads k out of k times will beat it for k large enough.)
If you flip each coin k times, there's a number of coins, n, so that the true probability (not your modelled probability) of at least one of the n−1 other coins getting k heads is strictly greater than 1 − 1/n, i.e. 1 − (1 − 1/2^k)^(n−1) > 1 − 1/n (for k=2, you need n > 8, and for k=10, you need n > 9360, so n grows pretty fast as a function of k). This means, with probability strictly greater than 1 − 1/n, you won't select the biased coin, so with probability strictly less than 1/n, you will select the biased coin. So, you actually do worse than random choice, because of how many different coins you have and how likely one of them is to get very lucky. You would have even been better off on average ignoring all of the new k×n coin flips and sticking to your priors, if you already suspected the biased coin was better (if you had a prior with mean > 1/2).
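Here is a rough simulation of this scenario, with hypothetical numbers I've picked just for illustration: n = 10 coins and k = 2 flips each, where nine coins are actually fair but get the uniform prior above (so two heads give them posterior EV (k+1)/(k+2) = 0.75), and the tenth is known with certainty to have heads probability 0.6, making it the genuinely best coin but capping its posterior EV at 0.6.

```python
import numpy as np

# Rough sketch of the biased-coin scenario with made-up numbers:
# n = 10 coins, k = 2 flips each; nine fair coins with a uniform prior,
# one coin known with certainty to have p = 0.6 (the actual best coin).
rng = np.random.default_rng(0)
n, k, trials = 10, 2, 200_000
true_p = np.array([0.5] * (n - 1) + [0.6])

heads = rng.binomial(k, true_p, (trials, n))
post_ev = (heads + 1) / (k + 2)      # uniform-prior posterior means for the fair coins
post_ev[:, -1] = 0.6                 # known-bias coin: posterior EV is always 0.6
chosen = post_ev.argmax(axis=1)
rows = np.arange(trials)

print("P(select the best coin):", (chosen == n - 1).mean())   # ~0.075 < 1/n = 0.1
print("mean overestimate of the selected coin:",
      (post_ev[rows, chosen] - true_p[chosen]).mean())        # ~0.23 > 0
```

Maximizing posterior EV picks the genuinely better coin less often than picking a coin at random (about 7.5% vs 10%), and the selected coin's posterior EV overestimates its true heads probability on average.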
2. A common practice in machine learning is to select the model with the greatest accuracy on a validation set among multiple candidates. Suppose that the validation and test sets are a random split of a common dataset for each problem. You will find, under repeated trials (not necessarily identical; they could be over different datasets/problems, with different models), that the model with the greatest validation accuracy tends to have a validation accuracy greater than its accuracy on the test set. If you build enough models each trial, you might find the models you select are actually overfitting to the validation set (memorizing it), sometimes to the point that the models with highest validation accuracy will tend to have worse test accuracy than models with validation accuracy in a lower interval. This depends on the particular dataset and machine learning models being used. Part of this problem is just that we aren't accounting for the possibility of overfitting in our model of the accuracies, but fixing this on its own wouldn't solve the extra bias introduced by having more models to choose from.
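A stripped-down sketch of the selection effect alone (no actual training, and all the numbers here are made up for illustration): the labels are pure noise and every "model" predicts at random, so no model is genuinely better than chance, yet the model chosen for its validation accuracy looks clearly better than chance on validation and falls back to chance on test.

```python
import numpy as np

# Toy sketch: noise labels, random "models". Selecting on validation accuracy
# still produces a model whose validation accuracy looks well above chance,
# while its test accuracy stays near 50%.
rng = np.random.default_rng(0)
n_val, n_test, n_models = 200, 200, 1000

y_val = rng.integers(0, 2, n_val)
y_test = rng.integers(0, 2, n_test)
preds_val = rng.integers(0, 2, (n_models, n_val))
preds_test = rng.integers(0, 2, (n_models, n_test))

val_acc = (preds_val == y_val).mean(axis=1)
test_acc = (preds_test == y_test).mean(axis=1)
best = val_acc.argmax()

print("selected model, validation accuracy:", val_acc[best])   # well above 0.5
print("selected model, test accuracy:", test_acc[best])        # ~0.5
```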
3. Due to the related satisficer's curse, when doing multiple hypothesis tests, you should adjust your p-values upward or your p-value cutoffs (false positive rate, significance level threshold) downward in specific ways to better predict replicability. There are corrections for the cutoff that account for the number of tests being performed; a simple one is that if you want a false positive rate of α across all m tests, you could instead use a per-test cutoff of 1 − (1 − α)^(1/m) (a small sketch follows example 4 below).
4. The satisficer's curse also guarantees that empirical study publication based on p-value cutoffs will cause published studies to replicate less often than their p-values alone would suggest. I think this is basically the same problem as 3.
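For example 3, here is a small sketch of the cutoff adjustment mentioned above (this is the Šidák correction for independent tests; the specific numbers are just for illustration): with the corrected per-test cutoff, the chance of at least one false positive across the m tests stays near α, whereas the uncorrected cutoff lets it balloon.

```python
import numpy as np

# Sketch of the per-test cutoff 1 - (1 - alpha)**(1/m) (Šidák correction),
# checked by simulating p-values with every null hypothesis true.
rng = np.random.default_rng(0)
alpha, m, trials = 0.05, 20, 100_000

cutoff = 1 - (1 - alpha) ** (1 / m)
print("per-test cutoff:", cutoff)                       # ~0.00256, well below 0.05

p = rng.uniform(0.0, 1.0, (trials, m))                  # p-values under the null
print("family-wise false positive rate, uncorrected:", (p < alpha).any(axis=1).mean())   # ~0.64
print("family-wise false positive rate, corrected:", (p < cutoff).any(axis=1).mean())    # ~0.05
```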
Now, if you treat your priors as posteriors that are conditional on a sample of random observations and arguments you've been exposed to or thought of yourself, you'd similarly find a bias towards interventions with "lucky" observations and arguments. For the intervention you do select compared to an intervention chosen at random, you're more likely to have been convinced by poor arguments that support it and less likely to have seen good arguments against it, regardless of the intervention's actual merits, and this bias increases the more interventions you consider. The solution supported by Proposition 2 doesn't correct for the number of interventions under consideration.
It doesn't actually tell you that your posterior distributions will tend to better predict values you will later measure in the real world (e.g. by checking if they fall in your 95% credence intervals), because there need not be any connection between your models or priors and the real world.
This is an issue of the models and priors. If your models and priors are not right… then you should update over your priors and use better models. Of course they can still be wrong… but that's true of all beliefs, all reasoning, etc.
you will tend to find the posterior EV of your chosen coin to be greater than 1/2, but since the coins are actually fair, your estimate will be too high more than half of the time on average.
If you assume from the outside (unbeknownst to the agent) that they are all fair, then you're not showing a problem with the agent's reasoning, you're just using relevant information which they lack.
you could have a uniform prior on the true future long-run average frequency of heads for the unbiased coins
My prior would not be uniform, it would be 0.5! What else could "unbiased coins" mean? This solves the problem, because then a coin with few head flips and zero tail flips will always have posterior of p > 0.5.
If you build enough models each trial, you might find the models you select are actually overfitting to the validation set (memorizing it), sometimes to the point that the models with highest validation accuracy will tend to have worse test accuracy than models with validation accuracy in a lower interval.
In this case we have a prior expectation that simpler models are more likely to be effective.
Do we have a prior expectation that one kind of charity is better? Well if so, just factor that in, business as usual. I don't see the problem exactly.
3. Due to the related satisficer's curse, when doing multiple hypothesis tests, you should adjust your p-values upward or your p-value cutoffs (false positive rate, significance level threshold) downward in specific ways to better predict replicability.
4. The satisficer's curse also guarantees that empirical study publication based on p-value cutoffs will cause published studies to replicate less often than their p-values alone would suggest.
Bayesian EV estimation doesn't do hypothesis testing with p-value cutoffs. This is the same problem popping up in a different framework, yes it will require a different solution in that context, but they are separate.
Now, if you treat your priors as posteriors that are conditional on a sample of random observations and arguments you've been exposed to or thought of yourself, you'd similarly find a bias towards interventions with "lucky" observations and arguments. For the intervention you do select compared to an intervention chosen at random, you're more likely to have been convinced by poor arguments that support it and less likely to have seen good arguments against it, regardless of the intervention's actual merits, and this bias increases the more interventions you consider. The solution supported by Proposition 2 doesn't correct for the number of interventions under consideration.
The proposed solution applies here too, just do (simplistic, informal) posterior EV correction for your (simplistic, informal) estimates.
Of course that's not going to be very reliable. But that's the whole point of using such simplistic, informal thinking. All kinds of rigor get sacrificed when charities are dismissed for sloppy reasons. If you think your informally-excluded charities might actually turn out to be optimal then you shouldn't be informally excluding them in the first place.
tl;dr: even using priors, with more options and hazier probabilities, you tend to increase the number of options which are too sensitive to supporting information (or just optimistically biased due to your priors), and these options look disproportionately good. This is still an optimizer's curse in practice.
This is an issue of the models and priors. If your models and priors are not right… then you should update over your priors and use better models. Of course they can still be wrong… but that's true of all beliefs, all reasoning, etc.
If you assume from the outside (unbeknownst to the agent) that they are all fair, then you're not showing a problem with the agent's reasoning, you're just using relevant information which they lack.
In practice, your models and priors will almost always be wrong, because you lack information; there's some truth of the matter of which you aren't aware. It's unrealistic to expect us to have good guesses for the priors in all cases, especially with little information or precedent as in hazy probabilities, a major point of the OP.
You'd hope that more information would tend to allow you to make better predictions and bring you closer to the truth, but when optimizing, even with correctly specified likelihoods and after updating over priors as you said should be done, the predictions for the selected coin can be more biased in expectation with more information (results of coin flips). On the other hand, the predictions for any fixed coin will not be any more biased in expectation over the new information, and if the prior's EV hadn't matched the true mean, the predictions would tend to be less biased.
More information (flips) per option (coin) would reduce the bias of the selection on average, but, as I showed, more options (coins) would increase it, too, because you get more chances to be unusually lucky.
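A quick sketch of both effects, with every coin actually fair (true heads probability 1/2) but modelled with the uniform prior from before: the average overestimate for the selected coin, E[posterior EV of the chosen coin] − 1/2, shrinks as the number of flips per coin (k) grows and grows with the number of coins (n). The particular values of n and k below are just illustrative.

```python
import numpy as np

# Sketch: all coins fair (true p = 0.5), uniform prior on each, selection by
# maximum posterior EV. The chosen coin's average overestimate falls with more
# flips per coin and rises with more coins.
rng = np.random.default_rng(0)
trials = 50_000

def selection_bias(n, k):
    heads = rng.binomial(k, 0.5, (trials, n))
    post_ev = (heads + 1) / (k + 2)            # uniform-prior posterior means
    return post_ev.max(axis=1).mean() - 0.5    # chosen coin's posterior EV minus the truth

for n, k in [(2, 5), (20, 5), (2, 50), (20, 50)]:
    print(f"n={n:2d} coins, k={k:2d} flips: average overestimate = {selection_bias(n, k):.3f}")
```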
My prior would not be uniform, it would be 0.5! What else could "unbiased coins" mean?
The intent here again is that you don't know the coins are fair.
Bayesian EV estimation doesn't do hypothesis testing with p-value cutoffs. This is the same problem popping up in a different framework, yes it will require a different solution in that context, but they are separate.
Fair enough.
The proposed solution applies here too, just do (simplistic, informal) posterior EV correction for your (simplistic, informal) estimates.
How would you do this in practice? Specifically, how would you get an idea of the magnitude for the correction you should make?
Maybe you could test your own (or your group's) prediction calibration and bias, but it's not clear how exactly you should incorporate this information, and it's likely these tests won't be very representative when you're considering the kinds of problems with hazy probabilities mentioned in the OP.