I agree with the general point of “different situations will require different approaches”.
From that common ground, I am interested in seeing whether we can tease out when it is appropriate to use one method against the other.
*disclaimer: low confidence from here onwards
I do not find the first example about value 0 vs value 500 entirely persuasive, though I see where you are coming from, and I think I can see when it might work.
The arithmetic mean of probabilities is entirely justified when aggregating predictions from models that start from disjoint and exhaustive conditions (this was first pointed out to me by Ben Snodin, and Owen CB makes the same point in a comment above).
This suggests that if your experts are using radically different assumptions (and are not hedging their bets based on each other's arguments), then the arithmetic mean of probabilities seems more appealing. I think this is implicitly what is happening in Linch’s and your first and third examples: we are in a sense assuming that only one expert is correct in the assumptions that led them to their estimate, but we do not know which one.
My intuition is that once you have experts who are giving all-things-considered estimates, the geometric mean takes over again. I realize that this is a poor argument, but I am making a concrete claim about when it is correct to use the arithmetic vs. the geometric mean of probabilities.
In slogan form: the average of probabilities works for aggregating hedgehogs, the geometric mean works for aggregating foxes.
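To make the contrast concrete, here is a minimal sketch of the two pooling rules for a binary event. It assumes that "geometric mean" means the normalized geometric mean of probabilities, which for a binary event is the same as the geometric mean of odds. The function names and the example forecasts are hypothetical, chosen only for illustration.

```python
import math

def arithmetic_pool(probs):
    """Arithmetic mean of probabilities (a linear mixture of the experts)."""
    return sum(probs) / len(probs)

def geometric_pool(probs):
    """Normalized geometric mean of probabilities.

    For a binary event this equals the geometric mean of odds,
    converted back to a probability.
    """
    n = len(probs)
    yes = math.prod(p ** (1 / n) for p in probs)
    no = math.prod((1 - p) ** (1 / n) for p in probs)
    return yes / (yes + no)

# Hypothetical forecasts from two experts who disagree strongly.
forecasts = [0.01, 0.5]
print(f"arithmetic pool: {arithmetic_pool(forecasts):.4f}")
print(f"geometric pool:  {geometric_pool(forecasts):.4f}")
```

With these two forecasts the arithmetic pool stays much closer to the higher estimate than the geometric pool does, which is why the choice matters most when the experts are far apart.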
On the second example, about institution failure, the argument goes that the expected value of the aggregate probability ought to correspond to the mean of the expected values.
I do not think this is entirely correct—I think you lose information when taking the expected values before aggregating, and thus we should not in general expect this. This is an argument similar to (Lindley, 1983), where the author dismisses marginalization as a desirable property on similar grounds. For a concrete example, see this comment, where I worked out how the expected value of the aggregate of two log normals relates to the aggregate of the expected value.
What I think we should require is that the aggregate of the exponential distributions implied by the annual probabilities matches the exponential distribution implied by the aggregated annual probabilities.
Interestingly, if you take the geometric mean aggregate of two exponential densities $f_A, f_B$ with associated annual probabilities $p_A, p_B$, then you end up with $f = \frac{\sqrt{f_A f_B}}{\int \sqrt{f_A f_B}} = \frac{p_A + p_B}{2} e^{-\frac{p_A + p_B}{2} x}$.
That is, the geometric mean aggregation of the implied exponentials leads to an exponential whose rate is the arithmetic mean of the individual rates.
EDIT: This is wrong, since the annualized probability does not match the rate parameter of an exponential. It still does not work after we correct it by substituting $\lambda = -\ln(1-p)$.
I consider this a strong argument against the geometric mean.
Note that the arithmetic mean fails to meet this property too: the mixture distribution $\frac{f_A + f_B}{2}$ is not even an exponential! The harmonic mean does not satisfy this property either.
What is the class of aggregation methods implied by imposing this condition? I do not know.
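As a rough numerical illustration of why the condition fails for the geometric mean, the sketch below compares the two routes after converting annual probabilities to rates with $\lambda = -\ln(1-p)$. It relies on the closed form that the normalized geometric mean of two exponential densities is an exponential whose rate is the average of the two rates. The function names and the example probabilities are hypothetical.

```python
import math

def rate_from_annual_probability(p):
    """Rate of the exponential whose probability of failing within one year is p."""
    return -math.log(1 - p)

def annual_probability_from_rate(lam):
    """Probability that an exponential with rate lam fails within one year."""
    return 1 - math.exp(-lam)

def pooled_exponential_rate(lam_a, lam_b):
    """Rate of the normalized geometric mean of two exponential densities.

    sqrt(f_A * f_B) is proportional to exp(-(lam_a + lam_b) / 2 * x),
    so after normalizing it is an exponential with the averaged rate.
    """
    return (lam_a + lam_b) / 2

def geometric_pool(p_a, p_b):
    """Normalized geometric mean of two probabilities of a binary event."""
    yes = math.sqrt(p_a * p_b)
    no = math.sqrt((1 - p_a) * (1 - p_b))
    return yes / (yes + no)

# Hypothetical annual failure probabilities from two experts.
p_a, p_b = 0.1, 0.5

# Route 1: pool the implied exponentials, then read off the annual probability.
via_densities = annual_probability_from_rate(
    pooled_exponential_rate(
        rate_from_annual_probability(p_a),
        rate_from_annual_probability(p_b),
    )
)

# Route 2: pool the annual probabilities directly.
via_probabilities = geometric_pool(p_a, p_b)

print(f"pool densities first:     {via_densities:.4f}")
print(f"pool probabilities first: {via_probabilities:.4f}")
```

The two routes give different annual probabilities for these inputs, so under these assumptions geometric pooling does not commute with passing to the implied exponential.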
I do not have much to say about the Jack, Queen, King example. I agree with the general point that yes, there are some implicit assumptions that make the geometric mean work well in practice.
Definitely the JQK example does not feel like “business as usual”. There is an unusual dependence between the beliefs of the experts. For example, had we pooled expert C as well, the example would no longer work.
I’d like to see whether we can derive some more intuitive examples that follow this pattern. There might be—but right now I am drawing a blank.
In sum, I think there is an important point here that needs to be acknowledged: the theoretical and empirical evidence I provided is not enough to pinpoint the conditions where the geometric mean is the better aggregate (as opposed to the arithmetic mean).
I think the intuition behind using mixture probabilities is correct when the experts are reasoning from mutually exclusive assumptions. I feel a lot less confident when aggregating experts giving all-considered views. In that case my current best guess is the geometric mean, but now I feel a lot less confident.
I think that first taking the expected value then aggregating loses you information. When taking a linear mixture this works by happy coincidence, but we should not expect this to generalize to situations where the correct pooling method is different.
I’d be interested in understanding better what is the class of pooling methods that “respects the exponential distribution” in the sense I defined above: the exponential associated with a pooled annual rate should match the pooled exponentials implied by the individual annual rates.
And I’d be keen on more work identifying real-life examples where the geometric mean approach breaks, and more work suggesting theoretical conditions where it does (not). Right now we only have external Bayesianity motivating it, which, while compelling, is clearly not enough.
I agree with a lot of this. In particular, that the best approach for practical rationality involves calculating things out according to each of the probabilities and then aggregating from there (or something like that), rather than aggregating first. That was part of what I was trying to show with the institution example. And it was part of what I was getting at by suggesting that the problem is ill-posed — there are a number of different assumptions we are all making about what these probabilities are going to be used for and whether we can assume the experts are themselves careful reasoners etc. and this discussion has found various places where the best form of aggregation depends crucially on these kinds of matters. I’ve certainly learned quite a bit from the discussion.
I think if you wanted to take things further, then teasing out how different combinations of assumptions lead to different aggregation methods would be a good next step.
Thank you! I learned too from the examples.
One question:
In particular, that the best approach for practical rationality involves calculating things out according to each of the probabilities and then aggregating from there (or something like that), rather than aggregating first.
I am confused about this part. I think I said exactly the opposite? You need to aggregate first, then calculate whatever you are interested in. Otherwise you lose information (because, e.g., taking the expected value of the individual predictions loses information that was contained in the individual predictions, for example about the standard deviation of the distribution, which depending on the aggregation method might affect the combined expected value).
What am I not seeing?
I think we are roughly in agreement on this, it is just hard to talk about. I think that compression of the set of expert estimates down to a single measure of central tendency (e.g. the arithmetic mean) loses information about the distribution that is needed to give the right answer in each of a variety of situations. So in this sense, we shouldn’t aggregate first.
The ideal system would neither aggregate first into a single number, nor use each estimate independently and then aggregate from there (I suggested doing so as a contrast to aggregation first, but agree that it is not ideal). Instead, the ideal system would use the whole distribution of estimates (perhaps transformed based on some underlying model about where expert judgments come from, such as assuming that numbers between the point estimates are also plausible) and then do some kind of EV calculation based on that. But this is so general an approach as to not offer much guidance, without further development.
The ideal system would [not] aggregate first into a single number [...] Instead, the ideal system would use the whole distribution of estimates
I have been thinking a bit more about this.
And I have concluded that the ideal aggregation procedure should compress all the information into a single prediction—our best guess for the actual distribution of the event.
Concretely, I think that in an idealized framework we should be treating the expert predictions $p_1, \dots, p_N$ as Bayesian evidence for the actual distribution of the event of interest $E$. That is, the idealized aggregation $\hat{p}$ should just match the conditional probability of the event given the predictions: $\hat{p} = P(E \mid p_1, \dots, p_N) \propto P(E)\, P(p_1, \dots, p_N \mid E)$.
Of course, for this procedure to be practical you need to know the generative model for the individual predictions, $P(p_1, \dots, p_N \mid E)$. This is for the most part not realistic: the generative model needs to take into account details of how each forecaster is generating the prediction and the redundancy of information between the predictions. So in practice we will need to approximate the aggregate measure using some sort of heuristic.
But, crucially, the approximation does not depend on the downstream task we intend to use the aggregate prediction for.
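To illustrate what such a generative model could look like, here is a toy sketch under strong assumptions that are mine and not part of the argument above: every expert starts from the same prior $P(E)$, each updates on private evidence that is conditionally independent of the others' evidence given $E$ (and given not-$E$), and each reports their posterior honestly. Under those assumptions each expert's likelihood ratio is their odds divided by the prior odds, and the likelihood ratios multiply. The function names are hypothetical.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def probability(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

def bayes_pool(prior, forecasts):
    """Posterior P(E | p_1, ..., p_N) under the toy model.

    ASSUMPTION: shared prior, conditionally independent private evidence,
    honest reporting. Each expert contributes the likelihood ratio
    odds(p_i) / odds(prior).
    """
    prior_odds = odds(prior)
    posterior_odds = prior_odds
    for p in forecasts:
        posterior_odds *= odds(p) / prior_odds
    return probability(posterior_odds)

# Two experts independently move from a 50% prior to 80%:
# the pooled estimate ends up more extreme than either of them.
print(f"{bayes_pool(0.5, [0.8, 0.8]):.4f}")

# Experts who simply report the prior carry no evidence.
print(f"{bayes_pool(0.1, [0.1, 0.1, 0.1]):.4f}")
```

Nothing in this calculation refers to what the pooled number will be used for, only to assumptions about how the forecasts were generated. With correlated evidence the products above would overcount, which is one reason a heuristic is needed in practice.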
This is hard for me to wrap my head around, since I too feel the intuitive pull of wanting to retain information about, e.g., the spread of the individual probabilities. I would feel more nervous making decisions when the forecasters wildly disagree with each other, as opposed to when the forecasters are of one voice.
What is this intuition then telling us? What do we need the information about the spread for then?
My answer is that we need to understand the resilience of the aggregated prediction to new information. This already plays a role in the aggregated prediction, since it helps us weigh the relative importance we should give to our prior beliefs $P(E)$ vs. the evidence from the experts $P(p_1, \dots, p_N \mid E)$: a wider spread or a smaller number of forecaster predictions will lead to weaker evidence, and therefore a higher relative weighting of our priors.
Similarly, the spread of the forecasts gives us information about how much we would gain from additional predictions.
I think this neatly resolves the tension between aggregating vs not, and clarifies when it is important to retain information about the distribution of forecasts: when value of information is relevant. Which, admittedly, is quite often! But when we cannot acquire new information, or we can rule out value of information as decision-relevant, then we should aggregate first into a single number, and make decisions based on our best guess, regardless of the task.
My answer is that we need to understand the resilience of the aggregated prediction to new information.
This seems roughly right to me. And in particular, I think this highlights the issue with the example of institutional failure. The problem with aggregating predictions to a single guess p of annual failure, and then using p to forecast, is that it assumes that the probability of failure in each year is independent from our perspective. But in fact, each year of no failure provides evidence that the risk of failure is low. And if the forecasters’ estimates initially had a wide spread, then we’re very sensitive to new information, and so we should update more on each passing year. This would lead to a high probability of failure in the first few years, but still a moderately high expected lifetime.
I think this is a good account of the institutional failure example, thank you!
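A small sketch of the updating described above, under assumptions that are mine: exactly one forecaster's annual failure probability is the true one, each is equally likely a priori, and lifetime is counted in whole years up to and including the year of failure. The function names and the example probabilities are hypothetical.

```python
def yearly_failure_probabilities(annual_probs, years):
    """Predictive failure probability for each year, given survival so far.

    Starts from equal weights over the forecasters' annual probabilities
    and applies Bayes' rule after every year without a failure.
    """
    weights = [1 / len(annual_probs)] * len(annual_probs)
    hazards = []
    for _ in range(years):
        hazards.append(sum(w * p for w, p in zip(weights, annual_probs)))
        # Surviving a year favors the forecasters with low failure probability.
        survived = [w * (1 - p) for w, p in zip(weights, annual_probs)]
        total = sum(survived)
        weights = [s / total for s in survived]
    return hazards

def expected_lifetime_mixture(annual_probs):
    """Expected lifetime when each forecaster is equally likely to be right."""
    return sum(1 / p for p in annual_probs) / len(annual_probs)

def expected_lifetime_single(p):
    """Expected lifetime if a single annual probability p applies every year."""
    return 1 / p

# Hypothetical forecasts with a wide spread.
annual_probs = [0.5, 0.01]
hazards = yearly_failure_probabilities(annual_probs, years=5)
pooled_p = sum(annual_probs) / len(annual_probs)

print([round(h, 4) for h in hazards])
print(f"mixture lifetime:  {expected_lifetime_mixture(annual_probs):.1f} years")
print(f"single-p lifetime: {expected_lifetime_single(pooled_p):.1f} years")
```

In this toy case the predictive failure probability starts at the arithmetic mean and falls with every surviving year, and the expected lifetime under the mixture is far longer than what a single fixed annual probability would suggest.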
I don’t think I get your argument for why the approximation should not depend on the downstream task. Could you elaborate?
I am also a bit confused about the relationship between spread and resiliency: a larger spread of forecasts does not seem to necessarily imply weaker evidence: It seems like for a relatively rare event about which some forecasters could acquire insider information, a large spread might give you stronger evidence.
Imagine E is about the future enactment of a quite unusual government policy, and one of your forecasters is a high ranking government official. Then, if all of your forecasters are relatively well calibrated and have sufficient incentive to report their true beliefs, a 90% forecast for E by the government official and a 1% forecast by everyone else should likely shift your beliefs a lot more towards E than a 10% forecast by everyone.
I don’t think I get your argument for why the approximation should not depend on the downstream task. Could you elaborate?
Your best approximation of the summary distribution $\hat{p} = P(E \mid p_1, \dots, p_N)$ is already “as good as it can get”. You think we should be cautious and treat this probability as if it could be higher for precautionary reasons? Then I argue that you should treat it as higher, regardless of how you arrived at the estimate.
In the end this circles back to basic Bayesian / Utility theory—in the idealized framework your credences about an event should be represented as a single probability. Departing from this idealization requires further justification.
a larger spread of forecasts does not seem to necessarily imply weaker evidence
You are right that “weaker evidence” is not exactly correct—this is more about the expected variance introduced by hypothetical additional predictions. I’ve realized I am confused about what is the best way to think about this in formal terms, so I wonder if my intuition was right after all.
Thank you for your thoughts!