Interesting! Thank you for writing this, this is something I was also wondering about while reading for the Warwick EA fellowship. My intuition is also that in the case of a “many-membered set of probability functions”, I’d define a prior over those and then compute an expected value as if nothing happened. I acknowledge that there is substantial (or even overwhelming) uncertainty sometimes and I can understand the impulse of wanting a separate conceptual handle for that. But it’s still “decision making under uncertainty” and should therefore be subsumable under Bayesianism.
I feel, like ben.smith, that I might be completely missing something. But I also wonder if this confusion might just be an echo of the age-old Bayesianism vs Frequentism debate, where people have different intuitions about whether priors over probability distributions are a-ok.
There is an argument from intuition by Schoenfield (2012), which carries some force, that we can’t use a probability function:
(1) It is permissible to be insensitive to mild evidential sweetening.
(2) If we are insensitive to mild evidential sweetening, our attitudes cannot be represented by a probability function.
(3) It is permissible to have attitudes that are not representable by a probability function. (1, 2)
...
You are a confused detective trying to figure out whether Smith or Jones committed the crime. You have an enormous body of evidence to evaluate. Here is some of it: You know that 68 out of the 103 eyewitnesses claim that Smith did it but Jones’ footprints were found at the crime scene. Smith has an alibi, and Jones doesn’t. But Jones has a clear record while Smith has committed crimes in the past. The gun that killed the victim belonged to Smith. But the lie detector, which is accurate 71% of the time, suggests that Jones did it. After you have gotten all of this evidence, you have no idea who committed the crime. You are no more confident that Jones committed the crime than that Smith committed the crime, nor are you more confident that Smith committed the crime than that Jones committed the crime.
...
Now imagine that, after considering all of this evidence, you learn a new fact: it turns out that there were actually 69 eyewitnesses (rather than 68) testifying that Smith did it. Does this make it the case that you should now be more confident in S than J? That, if you had to choose right now who to send to jail, it should be Smith? I think not.
...
In our case, you are insensitive to evidential sweetening with respect to S since you are no more confident in S than ~S (i.e. J), and no more confident in ~S (i.e. J) than S. The extra eyewitness supports S more than it supports ~S, and yet despite learning about the extra eyewitness, you are no more confident in S than you are in ~S (i.e. J).
Intuitively, this sounds right. And if you went into this problem trying to solve the crime on intuition alone, you might really have no idea. Reading the passage, it sounds mind-boggling.
On the other hand, if you applied some reasoning and study, you might be able to come up with some probability estimates. You could estimate the conditional probability P(Smith did it|an eyewitness says Smith did it), including a probability distribution over that probability itself, if you like. You can work out how to combine evidence from multiple witnesses, i.e., P(Smith did it|eyewitness 1 says Smith did it) & P(Smith did it|eyewitness 2 says Smith did it), and so on up to 68 and 69. You can estimate the degree of independence of the eyewitnesses, and from that work out how to properly combine their evidence.
And it might turn out that you don’t update as a result of the extra eyewitness, under some circumstances. Perhaps you know the eyewitnesses aren’t independent; they’re all card-carrying members of the “We hate Smith” club. In that case it simply turns out that the extra eyewitness is irrelevant to the problem; it doesn’t qualify as evidence, so it doesn’t mean you’re insensitive to “mild evidential sweetening”.
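The reasoning above can be sketched in odds form. All numbers here are assumptions for illustration, not figures from the example: an even prior and a small per-witness likelihood ratio of 1.05, with each independent witness multiplying the odds for “Smith did it” by that ratio.

```python
# Minimal sketch of odds-form Bayesian updating with made-up numbers:
# under conditional independence, each eyewitness multiplies the odds
# for "Smith did it" by the same likelihood ratio.
def posterior_smith(n_witnesses, lr_per_witness, n_effective=None):
    """Posterior P(Smith did it), starting from even (1:1) prior odds.

    n_effective models correlated witnesses: if they all belong to the
    "We hate Smith" club, they collectively count as one witness.
    """
    n = n_witnesses if n_effective is None else n_effective
    odds = 1.0 * lr_per_witness ** n
    return odds / (1.0 + odds)

p68 = posterior_smith(68, 1.05)   # 68 independent witnesses
p69 = posterior_smith(69, 1.05)   # one more witness nudges the posterior up
p_club = posterior_smith(69, 1.05, n_effective=1)  # 69 correlated witnesses count as 1
```

Under these assumptions the 69th independent witness moves the posterior only slightly, and in the fully correlated case it moves it not at all, which matches the intuition that the “sweetening” can be evidentially negligible without abandoning probability functions.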
I think a lot of the problem here is that these authors are discussing what one could do when one sits down for the first time and tries to grapple with a problem. In those cases there are so many undefined features of the problem that it really does seem impossible and you really are clueless.
But that’s not the same as saying that, with sufficient time, you can’t put probability distributions to everything that’s relevant and try to work out the joint probability.
----
Schoenfield, M. Chilling out on epistemic rationality. Philos Stud 158, 197–219 (2012).
While browsing types of uncertainties, I stumbled upon the idea of state space uncertainty and conscious unawareness, which sounds similar to your explanation of cluelessness and which might be another helpful angle for people with a more Bayesian perspective.
https://link.springer.com/article/10.1007/s10670-013-9518-4
There are, in the real world, unforeseen contingencies: eventualities that even the educated decision maker will fail to foresee. For instance, the recent tsunami and subsequent nuclear meltdown in Japan are events that most agents would have omitted from their decision models. If a decision maker is aware of the possibility that they may not be aware of all relevant contingencies—a state that Walker and Dietz (2011) call ‘conscious unawareness’ —then they face state space uncertainty.
There are things you can do to correct for this sort of thing: for instance, go one level more meta and estimate the probability of unforeseen contingencies in general, or within the class of problems that your specific problem fits into.
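One crude way of going a level more meta can be sketched as follows, under assumed counts: use the frequency with which important contingencies were missed in past problems of the same class, with a Laplace-style correction so that zero observed surprises doesn’t yield probability zero.

```python
# Hypothetical sketch: estimate the chance that our model of the current
# problem omits an important contingency, from how often that happened
# in past similar problems (the counts below are illustrative assumptions).
def p_unforeseen(k_missed, n_past):
    # Laplace's rule of succession: (k + 1) / (n + 2) avoids assigning
    # probability zero just because we haven't been surprised yet.
    return (k_missed + 1) / (n_past + 2)

# Suppose 3 of 10 comparable past problems turned out to omit something big:
estimate = p_unforeseen(k_missed=3, n_past=10)
```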
We couldn’t have predicted the Fukushima disaster, but perhaps we can predict related things with some degree of certainty: the average cost and death toll of earthquakes worldwide, for instance. In fact, this is a fairly well explored space, since insurers have to understand the risk of earthquakes.
The ongoing pandemic is a harder example: the rarer the black swan, the more difficult it is to predict. But even then, prior to the 2020 pandemic, the WHO had estimated the amortized costs of pandemics at around 1% of global GDP annually (averaged over years when there are and aren’t pandemics), which seems like a reasonable approximation.
I don’t know how much of a realistic solution that would be in practice.
I think the example Ben cites in his reply is very illustrative.
You might feel that you can’t justify your one specific choice of prior over another prior, so that particular choice is arbitrary, and then what you should do could depend on this arbitrary choice, whereas an equally reasonable prior would recommend a different decision. Someone else could have exactly the same information as you, but due to a different psychology, or just different patterns of neurons firing, come up with a different prior that ends up recommending a different decision. Choosing one prior over another without reason seems like a whim or a bias, and potentially especially prone to systematic error.
It seems bad if we’re basing how to do the most good on whims and biases.
If you’re lucky enough to have only finitely many equally reasonable priors, then I think it does make sense to just use a uniform meta-prior over them, i.e. just take their average. This doesn’t seem to work with infinitely many priors, since you could use different parametrizations to represent the same continuous family of distributions, with a different uniform distribution and therefore average for each parametrization. You’d have to justify your choice of parametrization!
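The parametrization problem can be made concrete with an assumed family of priors (Beta(a, 2) densities for a coin’s bias, a in [1, 3]; these choices are mine, purely for illustration): a uniform meta-prior over a and a uniform meta-prior over s = a² describe the same family but average to different distributions.

```python
from math import gamma

# Sketch of the parametrization problem with an assumed family of priors:
# Beta(a, 2) densities for a coin's bias, with the parameter a in [1, 3].
def beta_pdf(x, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

def riemann(f, lo, hi, n=10_000):
    # Midpoint rule; good enough for an illustration.
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

x = 0.5
# "Uniform meta-prior" taken over the parameter a on [1, 3]:
avg_in_a = riemann(lambda a: beta_pdf(x, a, 2), 1, 3) / (3 - 1)
# Same family reparametrized by s = a**2, uniform over s on [1, 9]:
avg_in_s = riemann(lambda s: beta_pdf(x, s ** 0.5, 2), 1, 9) / (9 - 1)
# The two averaged densities disagree at x, so "just take the average"
# depends on which parametrization you happened to write down.
```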
As another example, imagine you have a coin that someone (who is trustworthy) has told you is biased towards heads, but they haven’t given you any hint how much, and you want to come up with a probability distribution for the fraction of heads over 1,000,000 flips. So, you want a distribution over the interval [0, 1]. Which distribution would you use? Say you give me a probability density function f. Why not (1−p)f(x)+p for some p∈(0,1)? Why not f(x^p) / ∫₀¹ f(x^p) dx for some p>0? If f is a weighted average of multiple distributions, why not apply one of these transformations to one of the component distributions and choose the resulting weighted average instead? Why the particular weights you’ve chosen and not slightly different ones?
Which distribution would you use? Why the particular weights you’ve chosen and not slightly different ones?
I think you just have to make your distribution uninformative enough that reasonable differences in the weights don’t change your overall conclusion. If they do, then I would concede that you really are clueless about your specific question. Otherwise, you can probably find a response.
come up with a probability distribution for the fraction of heads over 1,000,000 flips.
Rather than thinking directly of an appropriate distribution for the 1,000,000 flips, I’d think of a distribution to model p itself. Then you can run simulations based on the distribution of p to calculate the distribution of the fraction of heads over the 1,000,000 flips. Since the coin is biased towards heads, p∈(0.5,1.0], and we need to select a distribution for p over that range.
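That procedure can be sketched in a few lines. The uniform prior over (0.5, 1.0] is just one assumed choice of distribution for p:

```python
import numpy as np

# Sketch: Monte Carlo over a prior for the bias p, then push each draw
# through 1,000,000 binomial flips. The uniform prior is one assumed choice.
rng = np.random.default_rng(0)
n_sims, n_flips = 10_000, 1_000_000
p = rng.uniform(0.5, 1.0, size=n_sims)   # draw a bias from the prior
heads = rng.binomial(n_flips, p)         # simulate the flips for each draw
fractions = heads / n_flips              # simulated distribution of the heads fraction
```

With a million flips the observed fraction hugs p itself very tightly, so the simulated distribution of fractions essentially reproduces whatever prior over p went in.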
There is no one correct probability distribution for p because any probability is just an expression of our belief, so you may use whatever probability distribution genuinely reflects your prior belief. A uniform distribution is a reasonable start. Perhaps you really are clueless about p, in which case, yes, there’s a certain amount of subjectivity about your choice. But prior beliefs are always inherently subjective, because they simply describe your belief about the state of the world as you know it now. The fact you might have to select a distribution, or set of distributions with some weighted average, is merely an expression of your uncertainty. This in itself, I think, doesn’t stop you from trying to estimate the result.
I think this expresses within Bayesian terms the philosophical idea that we can only make moral choices based on the information available at the time; one can’t be held morally responsible for mistakes made on the basis of information one didn’t have.
Perhaps you disagree with me that a uniform distribution is the best choice. You reason thus: “we have some idea about the properties of coins in general. It’s difficult to make a coin that is 100% biased towards heads. So that seems unlikely”. So we could pick a distribution that better reflects your prior belief. Perhaps a suitable choice might be Beta(2,2) with a truncation at 0.5, which will give the greatest likelihood of p just above 0.5, and a declining likelihood down to 1.0.
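For completeness, the truncated Beta(2,2) prior can be sampled by simple rejection; this is one sketch of how, not the only way:

```python
import numpy as np

# Sketch: rejection sampling from Beta(2, 2) truncated to (0.5, 1.0].
rng = np.random.default_rng(1)
draws = rng.beta(2, 2, size=50_000)
p = draws[draws > 0.5]   # keep only draws above the truncation point
# The kept density is proportional to p * (1 - p) on (0.5, 1.0]: highest
# just above 0.5 and declining to zero at 1.0, as described above.
```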
Maybe you and I just can’t agree after all, and there really is no consistent and reasonable prior choice we can make, nor any compromise. And let’s say we both run simulations using our own priors and find entirely different results and we can’t agree on any suitable weighting between them. In that case, yes, I can see you have cluelessness. I don’t think it follows that, if we went through the same process for estimating the longtermist moral worth of malaria bednet distribution, we must have intractable complex cluelessness about it. I can admit that perhaps, right now, in our current belief state, we are genuinely clueless, but it seems that there is some work that can be done that might eliminate the cluelessness.