It doesn’t actually tell you that your posterior distributions will tend to better predict values you will later measure in the real world (e.g. by checking if they fall in your 95% credence intervals), because there need not be any connection between your models or priors and the real world.

This is an issue of the models and priors. If your models and priors are not right… then you should update over your priors and use better models. Of course they can still be wrong… but that’s true of all beliefs, all reasoning, etc.

you will tend to find the posterior EV of your chosen coin to be greater than 1/2, but since the coins are actually fair, your estimate will be too high more than half of the time on average.

If you assume from the outside (unbeknownst to the agent) that they are all fair, then you’re not showing a problem with the agent’s reasoning, you’re just using relevant information which they lack.

you could have a uniform prior on the true future long-run average frequency of heads for the unbiased coins

My prior would not be uniform, it would be 0.5! What else could “unbiased coins” mean? This solves the problem, because then a coin with a few heads and zero tails will always have a posterior of p > 0.5.

If you build enough models each trial, you might find the models you select are actually overfitting to the validation set (memorizing it), sometimes to the point that the models with highest validation accuracy will tend to have worse test accuracy than models with validation accuracy in a lower interval.

In this case we have a prior expectation that simpler models are more likely to be effective.

Do we have a prior expectation that one kind of charity is better? Well if so, just factor that in, business as usual. I don’t see the problem exactly.

3. Due to the related satisficer’s curse, when doing multiple hypothesis tests, you should adjust your p-values upward or your p-value cutoffs (false positive rate, significance level threshold) downward in specific ways to better predict replicability.

4. The satisficer’s curse also guarantees that empirical study publication based on p-value cutoffs will cause published studies to replicate less often than their p-values alone would suggest.

Bayesian EV estimation doesn’t do hypothesis testing with p-value cutoffs. This is the same problem popping up in a different framework, yes it will require a different solution in that context, but they are separate.

Now, if you treat your priors as posteriors that are conditional on a sample of random observations and arguments you’ve been exposed to or thought of yourself, you’d similarly find a bias towards interventions with “lucky” observations and arguments. For the intervention you do select compared to an intervention chosen at random, you’re more likely to have been convinced by poor arguments that support it and less likely to have seen good arguments against it, regardless of the intervention’s actual merits, and this bias increases the more interventions you consider. The solution supported by Proposition 2 doesn’t correct for the number of interventions under consideration.

The proposed solution applies here too, just do (simplistic, informal) posterior EV correction for your (simplistic, informal) estimates.

Of course that’s not going to be very reliable. But that’s the whole point of using such simplistic, informal thinking. All kinds of rigor get sacrificed when charities are dismissed for sloppy reasons. If you think your informally-excluded charities might actually turn out to be optimal then you shouldn’t be informally excluding them in the first place.

tl;dr: even using priors, with more options and hazier probabilities, you tend to increase the number of options which are too sensitive to supporting information (or just optimistically biased due to your priors), and these options look disproportionately good. This is still an optimizer’s curse in practice.

This is an issue of the models and priors. If your models and priors are not right… then you should update over your priors and use better models. Of course they can still be wrong… but that’s true of all beliefs, all reasoning, etc.

If you assume from the outside (unbeknownst to the agent) that they are all fair, then you’re not showing a problem with the agent’s reasoning, you’re just using relevant information which they lack.

In practice, your models and priors will almost always be wrong, because you lack information; there’s some truth of the matter of which you aren’t aware. It’s unrealistic to expect us to have good guesses for the priors in all cases, especially with little information or precedent as in hazy probabilities, a major point of the OP.

You’d hope that more information would tend to allow you to make better predictions and bring you closer to the truth, but when optimizing, even with correctly specified likelihoods and after updating over priors as you said should be done, the predictions for the selected coin can be more biased in expectation with more information (results of coin flips). On the other hand, the predictions for any fixed coin will not be any more biased in expectation over the new information, and if the prior’s EV hadn’t matched the true mean, the predictions would tend to be less biased.

More information (flips) per option (coin) would reduce the bias of the selection on average, but, as I showed, more options (coins) would increase it, too, because you get more chances to be unusually lucky.
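
To make this concrete, here is a minimal sketch that computes the selection bias exactly instead of simulating it. It assumes a uniform Beta(1, 1) prior on each coin's long-run heads frequency, that every coin is in fact fair, and that you select the coin with the highest posterior mean (equivalently, the most heads). The function name is hypothetical and only for illustration.

```python
from math import comb


def expected_selected_posterior_mean(n_coins: int, n_flips: int) -> float:
    """Exact expected posterior mean of the selected coin when all coins are fair.

    Assumptions (illustrative, not from the OP's text):
    - uniform Beta(1, 1) prior, so after h heads in n_flips flips the
      posterior mean is (h + 1) / (n_flips + 2);
    - the selected coin is the one with the most heads.
    """
    # Probability of seeing exactly h heads from one fair coin.
    pmf = [comb(n_flips, h) / 2 ** n_flips for h in range(n_flips + 1)]

    expected = 0.0
    cdf_prev = 0.0  # P(heads <= h - 1) for a single coin
    for h, p in enumerate(pmf):
        cdf = cdf_prev + p
        # P(the maximum head count across n_coins independent coins equals h)
        p_max_is_h = cdf ** n_coins - cdf_prev ** n_coins
        expected += p_max_is_h * (h + 1) / (n_flips + 2)
        cdf_prev = cdf
    return expected


if __name__ == "__main__":
    # The true long-run frequency is 0.5, so anything above 0.5 is bias.
    for n_coins, n_flips in [(1, 10), (2, 10), (10, 10), (10, 100)]:
        value = expected_selected_posterior_mean(n_coins, n_flips)
        print(f"coins={n_coins:3d} flips={n_flips:4d} "
              f"expected posterior mean of selected coin={value:.4f}")
```

Under these assumptions, a single coin gives exactly 0.5 (no selection, no bias), the value rises as `n_coins` grows with `n_flips` held fixed, and it tends back toward 0.5 as `n_flips` grows with `n_coins` held fixed.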

My prior would not be uniform, it would be 0.5! What else could “unbiased coins” mean?

The intent here again is that you don’t know the coins are fair.

Bayesian EV estimation doesn’t do hypothesis testing with p-value cutoffs. This is the same problem popping up in a different framework, yes it will require a different solution in that context, but they are separate.

Fair enough.

The proposed solution applies here too, just do (simplistic, informal) posterior EV correction for your (simplistic, informal) estimates.

How would you do this in practice? Specifically, how would you get an idea of the magnitude for the correction you should make?

Maybe you could test your own (or your group’s) prediction calibration and bias, but it’s not clear how exactly you should incorporate this information, and it’s likely these tests won’t be very representative when you’re considering the kinds of problems with hazy probabilities mentioned in the OP.
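
For what it's worth, the simplest version of such a calibration test is just checking how often realized values land inside your stated intervals. The sketch below assumes you have logged past predictions as (low, high, actual) triples for nominally 95% intervals; the function name and data layout are hypothetical. It only measures coverage; it does not say how large a correction to apply to new estimates.

```python
from typing import Iterable, Tuple


def interval_coverage(predictions: Iterable[Tuple[float, float, float]]) -> float:
    """Fraction of realized values that fell inside their stated intervals.

    Each item is (low, high, actual). The layout is an assumption made
    for illustration; for well-calibrated 95% intervals the result
    should be close to 0.95 over many predictions.
    """
    records = list(predictions)
    if not records:
        raise ValueError("need at least one prediction to measure coverage")
    hits = sum(1 for low, high, actual in records if low <= actual <= high)
    return hits / len(records)


if __name__ == "__main__":
    # Made-up records purely to show the call; not real data.
    past = [(0.0, 1.0, 0.5), (0.0, 1.0, 2.0), (2.0, 3.0, 2.5), (0.0, 1.0, -1.0)]
    print(interval_coverage(past))
```

Coverage well below the nominal level on past predictions would suggest overconfidence, though, as noted, past questions may not be representative of the hazier ones.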
