(don’t feel extremely confident about the below but seemed worth sharing)
I think it’s really great to flag this! But as I mentioned to you elsewhere I’m not sure we’re certain enough to make a blanket recommendation to the EA community.
I think we have some evidence that geometric mean of odds is better, but not that much evidence. Although I haven’t looked into the evidence that Simon_M shared from Metaculus.
I guess I can potentially see us changing our minds in a year’s time and deciding that arithmetic mean of probabilities is better after all, or that some other method is better than both of these.
Then maybe people will have made a costly change to a new method (learning what odds are, what a geometric mean is, learning how to apply it in practice, maybe understanding the argument for using the new method) that turns out not to have been worth it.
I guess I can potentially see us changing our minds in a year’s time and deciding that arithmetic mean of probabilities is better after all, or that some other method is better than both of these.
This seems very unlikely, I’ll bet your $20 against my $80 that this doesn’t happen.
Thanks both (and Owen too), I now feel more confident that geometric mean of odds is better!
(Edit: at 1:4 odds I don’t feel great about a blanket recommendation, but I guess the odds at which you’re indifferent to taking the bet are more heavily stacked against us changing our mind. And Owen’s <1% is obviously way lower)
Like Nuno I think this is very unlikely. Probably <1% that we’d straightforwardly prefer arithmetic mean of probabilities. Much higher chance that in some circumstances we’d prefer something else (e.g. unweighted geometric mean of probabilities gets very distorted by having one ignorant person put in a probability which is extremely close to zero, so in some circumstances you’d want to be able to avoid that).
I don’t think the amount of evidence here would be conclusive if we otherwise thought arithmetic means of probabilities were best. But also my prior before seeing this evidence significantly favoured taking geometric mean of odds—this comes from some conversations over a few years getting a feel for “what are sensible ways to treat probabilities” and feeling like for many purposes in this vicinity things behave better in log-odds space. However I didn’t have a proper grounding for that, so this post provides both theoretical support and empirical support, which in combination with the prior make it feel like a fairly strong case.
That said, I think it’s worth pointing out the case where arithmetic mean of probabilities is exactly right to use: if you think that exactly one of the estimates is correct but you don’t know which (rather than the usual situation of thinking they all provide evidence about what the correct answer is).
I often favour arithmetic means of the probabilities, and my best guess as to what is going on is that there are (at least) two important kinds of use-case for these probabilities, which lead to different answers.
Sorting this out does indeed seem very useful for the community, and I fear that the current piece gets it wrong by suggesting one approach at all times, when we actually often want the other one.
Looking back, it seems the cases where I favoured arithmetic means of probabilities are those where I’m imagining using the probability in an EV calculation to determine what to do. I’m worried that optimising Brier and Log scoring rules is not what you want to do in such cases, so this analysis leads us astray. My paradigm example for geometric mean looking incorrect is similar to Linch’s one below.
Suppose one option has value 10 and the other has value 500 with probability p (or else it has value zero). Now suppose you combine expert estimates of p and get 10% and 0.1%. In this case the averaging of probabilities says p=5.05% and the EV of the second option is 25.25, so you should choose it, while the geometric average of odds says p=1%, so the EV is 5, so you shouldn’t choose it. I think the arithmetic mean does better here.
Now suppose the second expert instead estimated 0.0000001%. The arithmetic mean considers this no big deal, while the geometric mean now thinks it is terrible, enough to make it not worth taking even if the prize if successful were now 1,000 times greater. This seems crazy to me. If the prize were 500,000 and one of two experts said 10% chance, you should choose that option no matter how low the other expert goes. In the extreme case of one saying zero exactly, the geometric mean downgrades the EV of the option to zero, no matter the stakes, which seems even more clearly wrong.
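To make the arithmetic in this example easy to re-run, here is a minimal Python sketch. The function names and the `SAFE_VALUE` constant are mine, chosen for illustration; the numbers are the ones from the example (a safe option worth 10, and a risky prize paid with probability p).

```python
import math

def arithmetic_mean(probs):
    """Arithmetic mean of the probabilities."""
    return sum(probs) / len(probs)

def geo_mean_odds(probs):
    """Geometric mean of the odds, converted back to a probability.

    Assumes every probability is strictly below 1.
    """
    if any(p == 0.0 for p in probs):
        return 0.0  # a single exact zero drags the pooled odds to zero
    mean_log_odds = sum(math.log(p / (1.0 - p)) for p in probs) / len(probs)
    pooled_odds = math.exp(mean_log_odds)
    return pooled_odds / (1.0 + pooled_odds)

SAFE_VALUE = 10.0  # value of the sure option in the example

def compare(estimates, prize):
    """Print the pooled probability, the EV of the risky option, and the choice."""
    methods = (("arithmetic mean", arithmetic_mean),
               ("geometric mean of odds", geo_mean_odds))
    for name, pool in methods:
        p = pool(estimates)
        ev = p * prize
        choice = "risky" if ev > SAFE_VALUE else "safe"
        print(f"{name}: p = {p:.3g}, EV = {ev:.3g}, choose the {choice} option")

compare([0.10, 0.001], prize=500)      # experts say 10% and 0.1%
compare([0.10, 1e-9], prize=500_000)   # second expert says 0.0000001%, prize 1,000x larger
```

Changing the second estimate or the prize shows how sensitive each pooling rule is to one very low forecast.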
Now here is a case that goes the other way. Two experts give probabilities 10% and 0.1% for the annual chance of an institution failing. We are making a decision whose value is linear in the lifespan of the institution. Arithmetic mean says p=5.05%, so an expected lifespan of 19.8 years. Geometric mean says p=1%, so an expected lifespan of 100 years, which I think is better. But what I think is even better is to calculate the expected lifespans for each expert estimate and average them. This gives (10 + 1,000) / 2 = 505 years (which would correspond to an implicit probability of 0.198%, the harmonic mean).
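The lifespan calculation can be checked with a few lines of Python. This is only a sketch: it assumes, as the example does, a constant annual failure probability p, so that the expected lifespan is 1/p. The variable names are mine.

```python
def expected_lifespan(p_annual):
    """Expected lifespan in years under a constant annual failure probability."""
    return 1.0 / p_annual

estimates = [0.10, 0.001]  # the two experts' annual failure probabilities

# Average the lifespans implied by each expert, rather than averaging the probabilities.
lifespans = [expected_lifespan(p) for p in estimates]
mean_lifespan = sum(lifespans) / len(lifespans)

# The annual probability that would produce that average lifespan...
implied_p = 1.0 / mean_lifespan

# ...coincides with the harmonic mean of the two estimates.
harmonic_mean = len(estimates) / sum(1.0 / p for p in estimates)

print(f"mean lifespan: {mean_lifespan:.0f} years")
print(f"implied annual probability: {implied_p:.3%}")
print(f"harmonic mean of estimates: {harmonic_mean:.3%}")
```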
Note that both of these can be relevant at the same time. e.g. suppose two surveyors estimated the chance your AirB&B will collapse each night and came back with 50% and 0.00000000001%. In that case, the geometric mean approach says it is fine, but really you shouldn’t stay there tonight. However simultaneously, expected number of nights it will last without collapsing is very high.
How I often model these cases internally is to assume a mixture model with the real probability randomly being one of the estimated probabilities (with equal weights unless stated otherwise). That gets what I think of as the intuitively right behaviours in the cases above.
Now this is only a sketch and people might disagree with my examples, but I hope it shows that “just use the geometric mean of odds ratios” is not generally good advice, and points the way towards understanding when to use other methods.
Thinking about this more, I’ve come up with an example which shows a way in which the general question is ill-posed — i.e. that no solution that takes a list of estimates and produces an aggregate can be generally correct, but instead requires additional assumptions.
Three cards (a Jack, Queen, and King) are shuffled and dealt to A, B, and C. Each person can see their card, and the one with the highest card will win. You want to know the chance C will win. Your experts are A and B. They both write down their answers on slips of paper and privately give them to you. A says 50%, so you know A doesn’t have the King. B also says 50%, which also lets you know B doesn’t have the King. You thus know the correct answer is a 100% chance that C has King. In this situation, expert estimates of (50%, 50%) lead to an aggregate estimate of 100%, while anything where an expert estimates 0% leads to an aggregate estimate of 0%. This violates all central estimate aggregation methods.
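The card example is small enough to verify by brute force. The sketch below enumerates all six deals, works out what each expert would report from their own card alone, and then conditions on the pair of reports. The helper names (`report`, `posterior`) are mine.

```python
from fractions import Fraction
from itertools import permutations

# Each deal is a tuple (card of A, card of B, card of C).
deals = list(permutations("JQK"))

def c_wins(deal):
    """C wins exactly when C holds the King."""
    return deal[2] == "K"

def report(own_card, seat):
    """Probability that C wins, as seen by the expert in `seat` (0 = A, 1 = B)
    who knows only their own card."""
    consistent = [d for d in deals if d[seat] == own_card]
    return Fraction(sum(c_wins(d) for d in consistent), len(consistent))

def posterior(report_a, report_b):
    """Probability that C wins given both experts' written reports."""
    consistent = [d for d in deals
                  if report(d[0], 0) == report_a and report(d[1], 1) == report_b]
    return Fraction(sum(c_wins(d) for d in consistent), len(consistent))

half = Fraction(1, 2)
print(posterior(half, half))         # both experts say 50%
print(posterior(Fraction(0), half))  # A says 0%, B says 50%
```

The point of the enumeration is that the aggregate depends on what the reports reveal about the experts' private information, not only on the reported numbers.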
The point is that it shows there are additional assumptions (whether the information from the experts is independent, etc.) that are needed for the problem to be well posed, and that without these, no form of mean could be generally correct.
I agree with the general point of “different situations will require different approaches”.
From that common ground, I am interested in seeing whether we can tease out when it is appropriate to use one method against the other.
*disclaimer: low confidence from here onwards
I do not find the first example about value 0 vs value 500 entirely persuasive, though I see where you are coming from, and I think I can see when it might work.
The arithmetic mean of probabilities is entirely justified when aggregating predictions from models that start from disjoint and exhaustive conditions (this was first pointed out to me by Ben Snodin, and Owen CB makes the same point in a comment above).
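One way to spell this out, as a sketch: write $M_1, \dots, M_N$ (my labels, not from the thread) for the disjoint and exhaustive conditions, and suppose model $i$ reports $p_i = P(E \mid M_i)$. The law of total probability then gives a weighted arithmetic mean:

```latex
P(E) \;=\; \sum_{i=1}^{N} P(M_i)\, P(E \mid M_i) \;=\; \sum_{i=1}^{N} w_i\, p_i ,
\qquad w_i = P(M_i), \quad \sum_{i=1}^{N} w_i = 1 .
```

With equal weights $w_i = 1/N$ this is exactly the arithmetic mean of the probabilities.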
This suggests that if your experts are using radically different assumptions (and are not hedging their bets based on each other’s arguments) then the average probability seems more appealing. I think this is implicitly what is happening in Linch’s and your first and third examples: we are in a sense assuming that only one expert is correct in the assumptions that led them to their estimate, but you do not know which one.
My intuition is that once you have experts who are giving all-considered estimates, the geometric mean takes over again. I realize that this is a poor argument, but I am making a concrete claim about when it is correct to use arithmetic vs geo mean of probabilities.
In slogan form: the average of probabilities works for aggregating hedgehogs, the geometric mean works for aggregating foxes.
On the second example about institution failure, the argument goes that the expected value of the aggregate probability ought to correspond to the mean of the expected values.
I do not think this is entirely correct—I think you lose information when taking the expected values before aggregating, and thus we should not in general expect this. This is an argument similar to (Lindley, 1983), where the author dismisses marginalization as a desirable property on similar grounds. For a concrete example, see this comment, where I worked out how the expected value of the aggregate of two log normals relates to the aggregate of the expected value.
What I think we should require is that the aggregate of the exponential distributions implied by the annual probabilities matches the exponential distribution implied by the aggregated annual probabilities.
Interestingly, if you take the geometric mean aggregate of two exponential densities $f_A, f_B$ with associated annual probabilities $p_A, p_B$ then you end up with $f = \frac{\sqrt{f_A f_B}}{\int \sqrt{f_A f_B}} = e^{-\frac{p_A+p_B}{2}x}\,\frac{p_A+p_B}{2}$.
That is, the geometric mean aggregation of the implied exponentials leads to an exponential whose annual rate is the arithmetic mean of the individual rates.
EDIT: This is wrong, since the annualized probability does not match the rate parameter in an exponential. It still does not work after we correct it by substituting $\lambda = -\ln(1-p)$.
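For reference, here is a sketch of the same calculation carried out with rate parameters instead of annual probabilities. The notation $\lambda_A, \lambda_B, \bar\lambda$ is mine, with $f_i(x) = \lambda_i e^{-\lambda_i x}$ and $\lambda_i = -\ln(1-p_i)$:

```latex
\sqrt{f_A(x)\, f_B(x)} = \sqrt{\lambda_A \lambda_B}\; e^{-\frac{\lambda_A+\lambda_B}{2}x},
\qquad
\int_0^\infty \sqrt{f_A f_B}\, dx = \frac{2\sqrt{\lambda_A \lambda_B}}{\lambda_A+\lambda_B},

f(x) = \frac{\sqrt{f_A f_B}}{\int_0^\infty \sqrt{f_A f_B}\, dx}
     = \bar\lambda\, e^{-\bar\lambda x},
\qquad \bar\lambda = \frac{\lambda_A+\lambda_B}{2},

1 - e^{-\bar\lambda} = 1 - \sqrt{(1-p_A)(1-p_B)} .
```

So the annual probability implied by the pooled exponential corresponds to taking the geometric mean of the survival probabilities $1-p_A$ and $1-p_B$, which is not in general the geometric mean of the odds.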
I consider this a strong argument against the geometric mean.
Note that the arithmetic mean fails to meet this property too: the mixture distribution $\frac{f_A+f_B}{2}$ is not even an exponential! The harmonic mean does not satisfy this property either.
What is the class of aggregation methods implied by imposing this condition? I do not know.
I do not have much to say about the Jack, Queen, King example. I agree with the general point that yes, there are some implicit assumptions that make the geometric mean work well in practice.
Definitely the JQK example does not feel like “business as usual”. There is an unusual dependence between the beliefs of the experts. For example, had we pooled expert C as well, the example would no longer work.
I’d like to see whether we can derive some more intuitive examples that follow this pattern. There might be—but right now I am drawing a blank.
In sum, I think there is an important point here that needs to be acknowledged: the theoretical and empirical evidence I provided is not enough to pinpoint the conditions where the geometric mean is the better aggregate (as opposed to the arithmetic mean).
I think the intuition behind using mixture probabilities is correct when the experts are reasoning from mutually exclusive assumptions. I feel a lot less confident when aggregating experts giving all-considered views. In that case my current best guess is the geometric mean, but now I feel a lot less confident.
I think that first taking the expected value then aggregating loses you information. When taking a linear mixture this works by happy coincidence, but we should not expect this to generalize to situations where the correct pooling method is different.
I’d be interested in understanding better what class of pooling methods “respects the exponential distribution” in the sense I defined above: the exponential associated with the pooled annual rate matches the pooled exponentials implied by the individual annual rates.
And I’d be keen on more work identifying real life examples where the geometric mean approach breaks, and more work suggesting theoretical conditions where it does (not). Right now we only have external Bayesianity motivating it, which, while compelling, is clearly not enough.
I agree with a lot of this. In particular, that the best approach for practical rationality involves calculating things out according to each of the probabilities and then aggregating from there (or something like that), rather than aggregating first. That was part of what I was trying to show with the institution example. And it was part of what I was getting at by suggesting that the problem is ill-posed — there are a number of different assumptions we are all making about what these probabilities are going to be used for and whether we can assume the experts are themselves careful reasoners etc. and this discussion has found various places where the best form of aggregation depends crucially on these kinds of matters. I’ve certainly learned quite a bit from the discussion.
I think if you wanted to take things further, then teasing out how different combinations of assumptions lead to different aggregation methods would be a good next step.
In particular, that the best approach for practical rationality involves calculating things out according to each of the probabilities and then aggregating from there (or something like that), rather than aggregating first.
I am confused about this part. I think I said exactly the opposite? You need to aggregate first, then calculate whatever you are interested in. Otherwise you lose information (because eg taking the expected value of the individual predictions loses information that was contained in the individual predictions, about for example the standard deviation of the distribution, which depending on the aggregation method might affect the combined expected value).
I think we are roughly in agreement on this, it is just hard to talk about. I think that compression of the set of expert estimates down to a single measure of central tendency (e.g. the arithmetic mean) loses information about the distribution that is needed to give the right answer in each of a variety of situations. So in this sense, we shouldn’t aggregate first.
The ideal system would neither aggregate first into a single number, nor use each estimate independently and then aggregate from there (I suggested doing so as a contrast to aggregation first, but agree that it is not ideal). Instead, the ideal system would use the whole distribution of estimates (perhaps transformed based on some underlying model about where expert judgments come from, such as assuming that numbers between the point estimates are also plausible) and then do some kind of EV calculation based on that. But this is so general an approach as to not offer much guidance, without further development.
The ideal system would [not] aggregate first into a single number [...] Instead, the ideal system would use the whole distribution of estimates
I have been thinking a bit more about this.
And I have concluded that the ideal aggregation procedure should compress all the information into a single prediction—our best guess for the actual distribution of the event.
Concretely, I think that in an idealized framework we should be treating the expert predictions $p_1, \dots, p_N$ as Bayesian evidence for the actual distribution of the event of interest $E$. That is, the idealized aggregation $\hat{p}$ should just match the conditional probability of the event given the predictions: $\hat{p} = P(E \mid p_1, \dots, p_N) \propto P(E)\,P(p_1, \dots, p_N \mid E)$.
Of course, for this procedure to be practical you need to know the generative model for the individual predictions $P(p_1, \dots, p_N \mid E)$. This is for the most part not realistic: the generative model needs to take into account details of how each forecaster is generating the prediction and the redundancy of information between the predictions. So in practice we will need to approximate the aggregate measure using some sort of heuristic.
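To illustrate how much the generative model matters, here is a toy Python sketch under one deliberately strong assumption that is mine, not something claimed in this thread: every forecaster starts from the same prior and updates on a private signal that is conditionally independent of the others given E and given not-E. Under that assumption each forecast reveals a likelihood ratio, and the likelihood ratios multiply.

```python
def odds(p):
    """Convert a probability (strictly between 0 and 1) to odds."""
    return p / (1.0 - p)

def pool_independent(prior, forecasts):
    """Posterior P(E | all forecasts) under the toy assumption of a shared prior
    and conditionally independent private signals (hypothetical model)."""
    prior_odds = odds(prior)
    posterior_odds = prior_odds
    for p in forecasts:
        # Each forecaster's odds equal prior odds times their likelihood ratio.
        posterior_odds *= odds(p) / prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# Two forecasters who each moved from a 50% prior to 70%:
print(pool_independent(0.5, [0.7, 0.7]))
```

In this toy model the aggregate is more extreme than either forecast, whereas if the two forecasters had seen exactly the same evidence the right aggregate would simply be their common forecast. No mean of the reported numbers can distinguish these two cases.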
But, crucially, the approximation does not depend on the downstream task we intend to use the aggregate prediction for.
This is something hard for me to wrap my head around, since I too feel the intuitive pull of wanting to retain information about eg the spread of the individual probabilities. I would feel more nervous making decisions when the forecasters wildly disagree with each other, as opposed to when the forecasters are of one voice.
What is this intuition then telling us? What do we need the information about the spread for then?
My answer is that we need to understand the resilience of the aggregated prediction to new information. This already plays a role in the aggregated prediction, since it helps us weight the relative importance we should give to our prior beliefs $P(E)$ vs the evidence from the experts $P(p_1, \dots, p_N \mid E)$ - a wider spread or a smaller number of forecaster predictions will lead to weaker evidence, and therefore a higher relative weighting of our priors.
Similarly, the spread of distributions gives us information about how much we would gain from additional predictions.
I think this neatly resolves the tension between aggregating vs not, and clarifies when it is important to retain information about the distribution of forecasts: when value of information is relevant. Which, admittedly, is quite often! But when we cannot acquire new information, or we can rule out value of information as decision-relevant, then we should aggregate first into a single number, and make decisions based on our best guess, regardless of the task.
My answer is that we need to understand the resilience of the aggregated prediction to new information.
This seems roughly right to me. And in particular, I think this highlights the issue with the example of institutional failure. The problem with aggregating predictions to a single guess p of annual failure, and then using p to forecast, is that it assumes that the probability of failure in each year is independent from our perspective. But in fact, each year of no failure provides evidence that the risk of failure is low. And if the forecasters’ estimates initially had a wide spread, then we’re very sensitive to new information, and so we should update more on each passing year. This would lead to a high probability of failure in the first few years, but still a moderately high expected lifetime.
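A small Python sketch of this effect, using the equal-weight mixture model suggested earlier in the thread and the 10% and 0.1% estimates from the institution example. The function names are mine. Each surviving year shifts weight towards the lower estimate, so the forecast failure probability for the coming year falls over time.

```python
def hazard_by_year(estimates, years):
    """Forecast probability of failure in each year, conditional on survival so far,
    under an equal-weight mixture over the experts' annual probabilities."""
    weights = [1.0 / len(estimates)] * len(estimates)
    hazards = []
    for _ in range(years):
        hazards.append(sum(w * p for w, p in zip(weights, estimates)))
        # Bayesian update of the mixture weights on surviving one more year.
        survived = [w * (1.0 - p) for w, p in zip(weights, estimates)]
        total = sum(survived)
        weights = [s / total for s in survived]
    return hazards

def expected_lifetime(estimates):
    """Expected lifetime under the same mixture: the mean of the 1/p values."""
    return sum(1.0 / p for p in estimates) / len(estimates)

estimates = [0.10, 0.001]
for year, h in enumerate(hazard_by_year(estimates, 50), start=1):
    if year in (1, 5, 10, 25, 50):
        print(f"year {year:2d}: failure probability {h:.4f}")
print(f"expected lifetime: {expected_lifetime(estimates):.0f} years")
```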
I don’t think I get your argument for why the approximation should not depend on the downstream task. Could you elaborate?
I am also a bit confused about the relationship between spread and resiliency: a larger spread of forecasts does not seem to necessarily imply weaker evidence: It seems like for a relatively rare event about which some forecasters could acquire insider information, a large spread might give you stronger evidence.
Imagine E is about the future enactment of a quite unusual government policy, and one of your forecasters is a high ranking government official. Then, if all of your forecasters are relatively well calibrated and have sufficient incentive to report their true beliefs, a 90% forecast for E by the government official and a 1% forecast by everyone else should likely shift your beliefs a lot more towards E than a 10% forecast by everyone.
I don’t think I get your argument for why the approximation should not depend on the downstream task. Could you elaborate?
Your best approximation of the summary distribution $\hat{p} = P(E \mid p_1, \dots, p_N)$ is already “as good as it can get”. You think we should be cautious and treat this probability as if it could be higher for precautionary reasons? Then I argue that you should treat it as higher, regardless of how you arrived at the estimate.
In the end this circles back to basic Bayesian / Utility theory—in the idealized framework your credences about an event should be represented as a single probability. Departing from this idealization requires further justification.
a larger spread of forecasts does not seem to necessarily imply weaker evidence
You are right that “weaker evidence” is not exactly correct—this is more about the expected variance introduced by hypothetical additional predictions. I’ve realized I am confused about what is the best way to think about this in formal terms, so I wonder if my intuition was right after all.
UPDATE: Eric Neyman recently wrote about an extra assumption that I believe cleanly cuts into why this example fails.
The assumption is called the weak substitutes condition. Essentially, it means that there are diminishing marginal returns to each forecast.
The Jack, Queen and King example does not satisfy the weak substitutes condition, and forecast aggregation methods do not work well in it.
But I think that when the condition is met we can often get good results with forecast aggregation. Furthermore I think it is a very reasonable condition to ask for, and one often met in practice.
I wrote more about Neyman’s result here, though I focus more on the implications for extremizing the mean of logodds.
This seems to connect to the concept of f-means: If the utility for an option is proportional to $f(p)$, then the expected utility of your mixture model is equal to the expected utility using the f-mean of the experts’ probabilities $p_1$ and $p_2$, defined as $f^{-1}\left(\frac{f(p_1)+f(p_2)}{2}\right)$, as the $f$ in the utility calculation cancels out the $f^{-1}$. If I recall correctly, all aggregation functions that fulfill some technical conditions on a generalized mean can be written as an f-mean.
In the first example, $f$ is just linear, such that the f-mean is the arithmetic mean. In the second example, $f$ is equal to the expected lifespan of $\frac{1}{1-(1-p)} = \frac{1}{p}$, which yields the harmonic mean. As such, the geometric mean would correspond to the mixture model if and only if utility was logarithmic in $p$, as the geometric mean is the f-mean corresponding to the logarithm.
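A short Python sketch of the f-mean, applied to the 10% and 0.1% estimates used earlier in the thread. The helper name `f_mean` is mine. The last line adds the log-odds transform, whose f-mean is the geometric mean of odds.

```python
import math

def f_mean(probs, f, f_inv):
    """Generalized f-mean: apply f, take the arithmetic mean, map back with f_inv."""
    return f_inv(sum(f(p) for p in probs) / len(probs))

probs = [0.10, 0.001]

arithmetic = f_mean(probs, lambda p: p, lambda y: y)              # f linear
harmonic = f_mean(probs, lambda p: 1.0 / p, lambda y: 1.0 / y)    # f = 1/p (expected lifespan)
geometric = f_mean(probs, math.log, math.exp)                     # f = log p
geo_odds = f_mean(probs,
                  lambda p: math.log(p / (1.0 - p)),              # f = log-odds
                  lambda y: 1.0 / (1.0 + math.exp(-y)))

for name, value in (("arithmetic", arithmetic), ("harmonic", harmonic),
                    ("geometric", geometric), ("geometric mean of odds", geo_odds)):
    print(f"{name}: {value:.5f}")
```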
For a binary event with “true” probability $q$, the expected log-score for a forecast of $p$ is $q\log(p) + (1-q)\log(1-p) = \log\left(p^q (1-p)^{1-q}\right)$, which equals $\log\left(\sqrt{\frac{p}{1-p}}\right) = 0.5\log\left(\frac{p}{1-p}\right)$ for $q=0.5$. So the geometric mean of odds would yield the correct utility for the log-score according to the mixture model, if all the events we forecast were essentially coin tosses (which seems like a less satisfying synthesis than I hoped for).
Further questions that might be interesting to analyze from this point of view:
Is there some kind of approximate connection between the Brier score and the geometric mean of odds that could explain the empirical performance of the geometric mean on the Brier score? (There might very well not be anything, as the mixture model might not be the best way to think about aggregation).
What optimization target (under the mixture model) does extremization correspond to? Edit: As extremization is applied after the aggregation, it cannot be interpreted in terms of mixture models (if all forecasters give the same prediction, any f-mean has to have that value, but extremization yields a more extreme prediction.)
Note: After writing this, I noticed that UnexpectedValue’s comment on the top-level post essentially points to the same concept. I decided to still post this, as it seems more accessible than their technical paper while (probably) capturing the key insight.
Edit: Replaced “optimize” by “yield the correct utility for” in the third paragraph.
I want to push back a bit against the use of 0.00000000001% in this example. In particular, I was sort of assuming that experts are kind of calibrated, and if two human experts have that sort of disagreement:
Either this is the kind of scenario in which we’re discussing how a fair coin will land, and one of the experts has seen the coin
Or something is very, very wrong
In particular, with some light selection of experts (e.g, decent Metaculus forecasters), I think you’d almost never see this kind of scenario unless someone was trolling you. In particular, if the 0.0..001% person was willing to bet a correspondingly high amount at those odds, I would probably weigh it very highly. And in this case I think the geometric mean would in fact be appropriate.
Though I guess that it wouldn’t be if you’re querying random experts who can randomly be catastrophically wrong, and the arithmetic mean would be more robust.
I see what you mean, though you will find that scientific experts often end up endorsing probabilities like these. They model the situation, run the calculation and end up with 10^-12 and then say the probability is 10^-12. You are right that if you knew the experts were Bayesian and calibrated and aware of all the ways the model or calculation could be flawed, and had a good dose of humility, then you could read more into such small claimed probabilities — i.e. that they must have a mass of evidence they have not yet shared. But we are very rarely in a situation like that. Averaging a selection of Metaculus forecasters may be close, but is quite a special case when you think more broadly about the question of how to aggregate expert predictions.
They model the situation, run the calculation and end up with 10^-12 and then say the probability is 10^-12.
Consider that if you’re aggregating expert predictions, you might be generating probabilities too soon. Instead you could for instance interview the subject-matter experts, make the transcript available to expert forecasters, and then aggregate the probabilities of the latter. This might produce more accurate probabilities.
While it’s pretty easy to agree that the probability of a stupid mistake/typo is greater than 0.00000000001%, it is sometimes hard to follow in practice. I think Yudkowsky communicates it well on a more visceral level in his Infinite Certainty essay. I got to another level of appreciation of this point after doing a calibration exercise for mental arithmetic: all errors were unpredictable “oops” moments like misreading plus for minus or selecting the wrong answer after making correct calculations.
Note that both of these can be relevant at the same time. e.g. suppose two surveyors estimated the chance your AirB&B will collapse each night and came back with 50% and 0.00000000001%. In that case, the geometric mean approach says it is fine, but really you shouldn’t stay there tonight. However simultaneously, expected number of nights it will last without collapsing is very high.
This example weakens the case for the arithmetic mean.
First let me establish: both of the surveyors’ estimates are virtually impossible for anything listed on AirBnB. They must be fabricated, hallucinated, trolled, drunken, parasitically-motivated, wildly uncalibrated, or 2 simultaneous typos.
Even buildings that are considered structurally unsound often end up standing for years anyway, and 50% just isn’t plausible except for some extraordinary circumstances. 50% over the next 24-hour period is reasonable if the building looks like this.
And as for 0.00000000001%, this is permitted by physics but that’s the strongest endorsement I can give. This implies that after 100 million years, or 36,500,000,000 days, there would still have only been about a 0.36% chance of a collapse. It’s a reasonable guess if the interior of the building is entirely filled with a very stable material, and the outside is encased in bedrock 100m below the surface, in a geologically-quiet area.
You advise the reader:
but really you shouldn’t stay there tonight. However simultaneously, expected number of nights it will last without collapsing is very high.
This either seems contradictory or needs elaboration. You show the correct intuition by suggesting the real probability is much lower, and in all likelihood, the building will do the mundane thing buildings usually do: stand uncollapsed for years to come. I wouldn’t move in to start a family there, but I’m not worried if some kids camp in there for a few nights either.
So imagine giving it the arithmetic mean answer of ~25%. That is almost impossible for anything listed on AirBnb. Now I am poor at doing calculations, but I think the geometric mean is 0.000022361%. If true, then after 1,000 years it would give a chance of collapse of about 7.84%. This is plausible for some kinds of real-world buildings. Personally I would expect a higher percent as most buildings aren’t designed to last that long, or would be deliberately demolished (and therefore “collapse”) before then. But hey, it’s a plausible forecast for many actual buildings.
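Since these compounding calculations are easy to get wrong by a decimal place, here is a small Python sketch for re-checking them. It assumes the nightly collapse probability is constant and independent across nights, and 365 nights per year. The helper names are mine.

```python
import math

def chance_within(p_per_night, nights):
    """Probability of at least one collapse in `nights` nights: 1 - (1 - p)^n.

    Uses log1p/expm1 so that very small probabilities are not lost to rounding.
    """
    return -math.expm1(nights * math.log1p(-p_per_night))

def from_percent(x):
    """Convert a percentage to a probability."""
    return x / 100.0

p_low = from_percent(0.00000000001)  # the optimistic surveyor
p_high = from_percent(50)            # the pessimistic surveyor

geo_mean_probs = math.sqrt(p_low * p_high)
arith_mean = (p_low + p_high) / 2.0

print(f"optimistic estimate over 100 million years: {chance_within(p_low, 100_000_000 * 365):.4%}")
print(f"geometric mean of probabilities per night:  {geo_mean_probs:.3e}")
print(f"geometric mean over 1,000 years:            {chance_within(geo_mean_probs, 1_000 * 365):.2%}")
print(f"arithmetic mean over one week:              {chance_within(arith_mean, 7):.2%}")
```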
One factor in all this is that geometric mean aggregation makes more sense when there are proper-scoring incentives to be accurate, e.g. log-scoring is used. That is, being wrong at 99.999999% confidence should totally ruin your whole track record and you would lose whatever forecaster-prestige you could’ve had. That’s a social system where you can take extreme predictions more seriously. But in untracked setups where people can just give one-off numbers that aren’t scored, and there is no particular real incentive to give an accurate forecast, then it’s more plausible the arithmetic mean of probabilities ends up being superior in some cases. But even then, there are notable cases where it will be wildly off, such as the surveyor example you gave.
You raise valid points, e.g. how geomean could give terrible results under some conditions. Like if someone says “Yeah I think the probability is 1/Tree(3) man.” and the whole thing is ruined. That is a valuable point and reasonable, and there may be some domains or prestige game setups where geomean would be broken by some yahoo giving a wild estimate. However I don’t condone a meta-approach where you say “My aggregation method says 25%, which I’m even acknowledging can’t be right, but you should act as if it could be”. Might as well act as if it’s nonsense and just assume the base rate for AirBnB collapses.
Now if one of the surveyors made money or prestige by telling people they should worry about buildings collapsing, they may prefer the arithmetic mean in this case. I can’t vouch for the surveyors. But as a forecaster, I would do some checks against history, and conclude the number is a drastic overestimate. Far more likely that the 50%-giving surveyor is either trolling, confused, or they are selling me travel insurance or something. And in the end, I would defer to empirical results, for example in SimonM’s great comment, and question series.
If I was to summarise your post in another way, it would be this:
The biggest problem with pooling is that a point estimate isn’t the end goal. In most applications you care about some transform of the estimate. In general, you’re better off keeping all of the information (ie your new prior) rather than just a point estimate of said prior.
I disagree with you that the most natural prior is “mixture distribution over experts”. (Although I wonder how much that actually ends up mattering in the real world).
I also think something “interesting” is being said here about the performance of estimates in the real world. If I had to say what the strong empirical performance of mean log-odds means, I would say that it means that “mixture distribution over experts” is not a great prior. But then, someone with my priors would say that...
That said, I think it’s worth pointing out the case where arithmetic mean of probabilities is exactly right to use: if you think that exactly one of the estimates is correct but you don’t know which (rather than the usual situation of thinking they all provide evidence about what the correct answer is).
To extend this and steelman the case for arithmetic mean of probabilities (or something in that general direction) a little, in some cases this seems a more intuitive formulation of risk (which is usually how these things are talked about in EA contexts), especially if we propagate further to expected values or value of information concerns.
Eg, suppose that we ask 3 sources we trust equally about risk from X vector of an EA org shutting down in 10 years. One person says 10%, 1 person says 0.1%, 1 person says 0.001%.
Arithmetic mean of probabilities gets you ~3.4%, geometric mean of odds gets you ~0.1%. 0.1% seems so comfortably below the background rate of organizations dying that in many cases further investigation would not be worth the value of information. Yet naively this seems to be too cavalier if one out of three sources thinks there’s a 10% chance of failure from X vector alone!
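To make the two figures above concrete, here is a minimal sketch that pools the three estimates both ways. The function names are my own, and equal weighting of the three sources is assumed, as in the example.

```python
import math

def arithmetic_mean(probs):
    """Plain average of the probabilities."""
    return sum(probs) / len(probs)

def geo_mean_odds(probs):
    """Geometric mean of odds, computed as the average in log-odds space."""
    log_odds = [math.log(p / (1 - p)) for p in probs]
    pooled_odds = math.exp(sum(log_odds) / len(log_odds))
    return pooled_odds / (1 + pooled_odds)

# The three equally trusted sources from the example: 10%, 0.1%, 0.001%.
estimates = [0.10, 0.001, 0.00001]

am = arithmetic_mean(estimates)
gm = geo_mean_odds(estimates)
print(f"arithmetic mean of probabilities: {am:.3%}")
print(f"geometric mean of odds:           {gm:.3%}")
```

The arithmetic mean is dominated by the single 10% estimate, while the geometric mean of odds lands near the middle estimate.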
Also as a mild terminological note, I’m not sure I know what you mean by “correct answer” when we’re referring to probabilities in the real world. Outside of formal mathematical examples and maybe some quantum physics stuff, probabilities are usually statements about our own confusions in our maps of the world, not physically instantiated in the underlying reality.
Geometric mean is just a really useful tool for estimations in general. It also makes a lot of sense for aggregating results other than probabilities, eg for different Fermi estimates of real quantities.
(don’t feel extremely confident about the below but seemed worth sharing)
I think it’s really great to flag this! But as I mentioned to you elsewhere I’m not sure we’re certain enough to make a blanket recommendation to the EA community.
I think we have some evidence that geometric mean of odds is better, but not that much evidence. Although I haven’t looked into the evidence that Simon_M shared from Metaculus.
I guess I can potentially see us changing our minds in a year’s time and deciding that arithmetic mean of probabilities is better after all, or that some other method is better than both of these.
Then maybe people will have made a costly change to a new method (learning what odds are, what a geometric mean is, learning how to apply it in practice, maybe understanding the argument for using the new method) that turns out not to have been worth it.
This seems very unlikely, I’ll bet your $20 against my $80 that this doesn’t happen.
(I have not read the post)
I endorse these implicit odds, based on both theory and some intuitions from thinking about this in practice.
Thanks both (and Owen too), I now feel more confident that geometric mean of odds is better!
(Edit: at 1:4 odds I don’t feel great about a blanket recommendation, but I guess the odds at which you’re indifferent to taking the bet are more heavily stacked against us changing our mind. And Owen’s <1% is obviously way lower)
Like Nuno I think this is very unlikely. Probably <1% that we’d straightforwardly prefer arithmetic mean of probabilities. Much higher chance that in some circumstances we’d prefer something else (e.g. unweighted geometric mean of probabilities gets very distorted by having one ignorant person put in a probability which is extremely close to zero, so in some circumstances you’d want to be able to avoid that).
I don’t think the amount of evidence here would be conclusive if we otherwise thought arithmetic means of probabilities were best. But also my prior before seeing this evidence significantly favoured taking geometric mean of odds—this comes from some conversations over a few years getting a feel for “what are sensible ways to treat probabilities” and feeling like for many purposes in this vicinity things behave better in log-odds space. However I didn’t have a proper grounding for that, so this post provides both theoretical support and empirical support, which in combination with the prior make it feel like a fairly strong case.
That said, I think it’s worth pointing out the case where arithmetic mean of probabilities is exactly right to use: if you think that exactly one of the estimates is correct but you don’t know which (rather than the usual situation of thinking they all provide evidence about what the correct answer is).
I often favour arithmetic means of the probabilities, and my best guess as to what is going on is that there are (at least) two important kinds of use-case for these probabilities, which lead to different answers.
Sorting this out does indeed seem very useful for the community, and I fear that the current piece gets it wrong by suggesting one approach at all times, when we actually often want the other one.
Looking back, it seems the cases where I favoured arithmetic means of probabilities are those where I’m imagining using the probability in an EV calculation to determine what to do. I’m worried that optimising Brier and Log scoring rules is not what you want to do in such cases, so this analysis leads us astray. My paradigm example for geometric mean looking incorrect is similar to Linch’s one below.
Suppose one option has value 10 and the other has value 500 with probability p (or else it has value zero). Now suppose you combine expert estimates of p and get 10% and 0.1%. In this case the averaging of probabilities says p=5.05% and the EV of the second option is 25.25, so you should choose it, while the geometric average of odds says p=1%, so the EV is 5, so you shouldn’t choose it. I think the arithmetic mean does better here.
Now suppose the second expert instead estimated 0.0000001%. The arithmetic mean considers this no big deal, while the geometric mean now thinks it is terrible, enough to make it not worth taking even if the prize if successful were now 1,000 times greater. This seems crazy to me. If the prize were 500,000 and one of two experts said 10% chance, you should choose that option no matter how low the other expert goes. In the extreme case of one saying zero exactly, the geometric mean downgrades the EV of the option to zero, no matter the stakes, which seems even more clearly wrong.
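The expected-value comparison above can be reproduced with a short sketch. The helper names are hypothetical; the payoffs (a safe option worth 10, a gamble worth 500 or 500,000) and the expert estimates are taken from the example.

```python
import math

def geo_mean_odds(probs):
    """Geometric mean of odds, converted back to a probability."""
    mean_log_odds = sum(math.log(p / (1 - p)) for p in probs) / len(probs)
    odds = math.exp(mean_log_odds)
    return odds / (1 + odds)

def arithmetic_mean(probs):
    return sum(probs) / len(probs)

SAFE_VALUE = 10  # value of the first option

def gamble_ev(pooled_p, prize):
    """Expected value of the second option under a pooled probability."""
    return pooled_p * prize

# Case 1: experts say 10% and 0.1%, prize 500.
case1 = [0.10, 0.001]
ev1_arith = gamble_ev(arithmetic_mean(case1), 500)
ev1_geo = gamble_ev(geo_mean_odds(case1), 500)

# Case 2: second expert says 0.0000001% (1e-9), prize 1,000 times larger.
case2 = [0.10, 1e-9]
ev2_arith = gamble_ev(arithmetic_mean(case2), 500_000)
ev2_geo = gamble_ev(geo_mean_odds(case2), 500_000)

for label, ev in [("case 1 arithmetic", ev1_arith), ("case 1 geometric", ev1_geo),
                  ("case 2 arithmetic", ev2_arith), ("case 2 geometric", ev2_geo)]:
    verdict = "take the gamble" if ev > SAFE_VALUE else "take the safe option"
    print(f"{label}: EV = {ev:.2f} -> {verdict}")
```

Note that the unrounded geometric mean of odds in case 1 is slightly above 1%, so the EV comes out a little above 5, but the decision is the same as in the text.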
Now here is a case that goes the other way. Two experts give probabilities 10% and 0.1% for the annual chance of an institution failing. We are making a decision whose value is linear in the lifespan of the institution. Arithmetic mean says p=5.05%, so an expected lifespan of 19.8 years. Geometric mean says p=1%, so an expected lifespan of 100 years, which I think is better. But what I think is even better is to calculate the expected lifespans for each expert estimate and average them. This gives (10 + 1,000) / 2 = 505 years (which would correspond to an implicit probability of 0.198%, the harmonic mean).
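A small sketch of the lifespan arithmetic, assuming (as the example does) a constant annual failure probability p, so that the expected lifespan is 1/p. Variable names are illustrative only.

```python
import math

annual_probs = [0.10, 0.001]

def expected_lifespan(p):
    """Expected years until failure for a constant annual failure probability p."""
    return 1 / p

arith_p = sum(annual_probs) / len(annual_probs)

mean_log_odds = sum(math.log(p / (1 - p)) for p in annual_probs) / len(annual_probs)
geo_odds = math.exp(mean_log_odds)
geo_p = geo_odds / (1 + geo_odds)

# Average the lifespans implied by each expert, then back out the implied p.
mean_lifespan = sum(expected_lifespan(p) for p in annual_probs) / len(annual_probs)
implied_p = 1 / mean_lifespan  # this is the harmonic mean of the probabilities

print(f"arithmetic mean p = {arith_p:.4%}, lifespan = {expected_lifespan(arith_p):.1f} years")
print(f"geo mean of odds p = {geo_p:.4%}, lifespan = {expected_lifespan(geo_p):.1f} years")
print(f"mean of lifespans = {mean_lifespan:.0f} years, implied p = {implied_p:.4%}")
```

Without rounding the geometric mean of odds to 1%, its implied lifespan is a little under 100 years, which does not change the ordering of the three answers.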
Note that both of these can be relevant at the same time. For example, suppose two surveyors estimated the chance your Airbnb will collapse each night and came back with 50% and 0.00000000001%. In that case, the geometric mean approach says it is fine, but really you shouldn’t stay there tonight. At the same time, the expected number of nights it will last without collapsing is very high.
How I often model these cases internally is to assume a mixture model with the real probability randomly being one of the estimated probabilities (with equal weights unless stated otherwise). That gets what I think of as the intuitively right behaviours in the cases above.
Now this is only a sketch and people might disagree with my examples, but I hope it shows that “just use the geometric mean of odds ratios” is not generally good advice, and points the way towards understanding when to use other methods.
Thinking about this more, I’ve come up with an example which shows a way in which the general question is ill-posed — i.e. that no solution that takes a list of estimates and produces an aggregate can be generally correct, but instead requires additional assumptions.
Three cards (a Jack, Queen, and King) are shuffled and dealt to A, B, and C. Each person can see their card, and the one with the highest card will win. You want to know the chance C will win. Your experts are A and B. They both write down their answers on slips of paper and privately give them to you. A says 50%, so you know A doesn’t have the King. B also says 50%, which also lets you know B doesn’t have the King. You thus know the correct answer is a 100% chance that C has the King. In this situation, expert estimates of (50%, 50%) lead to an aggregate estimate of 100%, while anything where an expert estimates 0% leads to an aggregate estimate of 0%. This violates all central estimate aggregation methods.
The point is that it shows there are additional assumptions, such as whether the information from the experts is independent, that are needed for the problem to be well posed, and that without these, no form of mean could be generally correct.
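The card example can be checked by brute force. The sketch below enumerates all six deals, computes what each expert would report from their own card alone, and then tabulates how often C actually holds the King for each pair of reports. All names here are my own.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations

CARDS = ("J", "Q", "K")
DEALS = list(permutations(CARDS))  # each deal is (A's card, B's card, C's card)

def forecast_c_wins(position, own_card):
    """An expert's probability that C holds the King, given only their own card."""
    consistent = [d for d in DEALS if d[position] == own_card]
    return Fraction(sum(d[2] == "K" for d in consistent), len(consistent))

# Group the deals by the pair of forecasts (A's report, B's report).
results = defaultdict(list)
for deal in DEALS:
    reports = (forecast_c_wins(0, deal[0]), forecast_c_wins(1, deal[1]))
    results[reports].append(deal[2] == "K")

truth = {reports: Fraction(sum(wins), len(wins)) for reports, wins in results.items()}

for reports, p in sorted(truth.items()):
    print(f"A says {reports[0]}, B says {reports[1]} -> P(C wins) = {p}")
```

Any averaging rule applied to (1/2, 1/2) returns 1/2, whereas the enumeration shows the right answer is 1, because the two reports carry complementary rather than overlapping information.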
Thank you for your thoughts!
I agree with the general point of “different situations will require different approaches”.
From that common ground, I am interested in seeing whether we can tease out when it is appropriate to use one method against the other.
*disclaimer: low confidence from here onwards
I do not find the first example about value 0 vs value 500 entirely persuasive, though I see where you are coming from, and I think I can see when it might work.
The arithmetic mean of probabilities is entirely justified when aggregating predictions from models that start from disjoint and exhaustive conditions (this was first pointed out to me by Ben Snodin, and Owen CB makes the same point in a comment above).
This suggests that if your experts are using radically different assumptions (and are not hedging their bets based on each other’s arguments) then the average probability seems more appealing. I think this is implicitly what is happening in Linch’s and your first and third examples: we are in a sense assuming that only one expert is correct in the assumptions that led them to their estimate, but you do not know which one.
My intuition is that once you have experts who are giving all-considered estimates, the geometric mean takes over again. I realize that this is a poor argument, but I am making a concrete claim about when it is correct to use the arithmetic vs the geometric mean of probabilities.
In slogan form: the average of probabilities works for aggregating hedgehogs, the geometric mean works for aggregating foxes.
On the second example about institution failure, the argument goes that the expected value of the aggregate probability ought to correspond to the mean of the expected values.
I do not think this is entirely correct—I think you lose information when taking the expected values before aggregating, and thus we should not in general expect this. This is an argument similar to (Lindley, 1983), where the author dismisses marginalization as a desirable property on similar grounds. For a concrete example, see this comment, where I worked out how the expected value of the aggregate of two log normals relates to the aggregate of the expected value.
What I think we should require is that the aggregate of the exponential distributions implied by the annual probabilities matches the exponential distribution implied by the aggregated annual probabilities.
Interestingly, if you take the geometric mean aggregate of two exponential densities $f_A, f_B$ with associated annual probabilities $p_A, p_B$, then you end up with $f = \frac{\sqrt{f_A f_B}}{\int \sqrt{f_A f_B}} = \frac{p_A + p_B}{2} e^{-\frac{p_A + p_B}{2} x}$.
That is, the geometric mean aggregation of the implied exponentials led to an exponential whose annual rate probability is the arithmetic mean of the individual rates.
EDIT: This is wrong, since the annualized probability does not match the rate parameter in an exponential. It still does not work after we correct it by substituting λ=−ln(1−p)
I consider this a strong argument against the geometric mean.
Note that the arithmetic mean fails to meet this property too: the mixture distribution $\frac{f_A + f_B}{2}$ is not even an exponential! The harmonic mean does not satisfy this property either.
What is the class of aggregation methods implied by imposing this condition? I do not know.
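As a rough numerical check of the property discussed above, the sketch below pools two exponential densities by a normalised geometric mean and compares the implied annual probability with the geometric mean of odds. It uses the corrected mapping λ = −ln(1−p) from the EDIT. The example probabilities (10% and 0.1%), the integration grid, and all function names are my own assumptions.

```python
import math

def rate_from_annual_prob(p):
    """Rate of the exponential whose probability of failing within one year is p."""
    return -math.log(1 - p)

def annual_prob_from_rate(lam):
    return 1 - math.exp(-lam)

def exp_density(lam, x):
    return lam * math.exp(-lam * x)

def pooled_density(x, lam_a, lam_b, upper=2000.0, steps=200_000):
    """Normalised geometric mean of two exponential densities (midpoint rule)."""
    g = lambda t: math.sqrt(exp_density(lam_a, t) * exp_density(lam_b, t))
    h = upper / steps
    norm = sum(g((i + 0.5) * h) for i in range(steps)) * h
    return g(x) / norm

p_a, p_b = 0.10, 0.001
lam_a, lam_b = rate_from_annual_prob(p_a), rate_from_annual_prob(p_b)
mean_rate = (lam_a + lam_b) / 2

# The pooled density should coincide with an exponential at the mean rate.
numeric = pooled_density(5.0, lam_a, lam_b)
closed_form = exp_density(mean_rate, 5.0)

# Annual probability implied by pooling the densities...
p_from_densities = annual_prob_from_rate(mean_rate)

# ...versus pooling the annual probabilities directly with geometric mean of odds.
odds = math.sqrt((p_a / (1 - p_a)) * (p_b / (1 - p_b)))
p_geo_odds = odds / (1 + odds)

print(f"pooled density at x=5: {numeric:.6f} (closed form {closed_form:.6f})")
print(f"annual p from pooled densities: {p_from_densities:.4%}")
print(f"annual p from geo mean of odds: {p_geo_odds:.4%}")
```

With these inputs the two routes give clearly different annual probabilities, which is consistent with the EDIT above: the geometric mean of odds does not satisfy the proposed condition even after the rate correction.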
I do not have much to say about the Jack, Queen, King example. I agree with the general point that yes, there are some implicit assumptions that make the geometric mean work well in practice.
Definitely the JQK example does not feel like “business as usual”. There is an unusual dependence between the beliefs of the experts. For example, had we pooled expert C as well, the example would no longer work.
I’d like to see whether we can derive some more intuitive examples that follow this pattern. There might be—but right now I am drawing a blank.
In sum, I think there is an important point here that needs to be acknowledged: the theoretical and empirical evidence I provided is not enough to pinpoint the conditions where the geometric mean is the better aggregate (as opposed to the arithmetic mean).
I think the intuition behind using mixture probabilities is correct when the experts are reasoning from mutually exclusive assumptions. I feel a lot less confident when aggregating experts giving all-considered views. In that case my current best guess is the geometric mean, but now I feel a lot less confident.
I think that first taking the expected value then aggregating loses you information. When taking a linear mixture this works by happy coincidence, but we should not expect this to generalize to situations where the correct pooling method is different.
I’d be interested in understanding better what is the class of pooling methods that “respects the exponential distribution” in the sense I defined above: the exponential associated with the pooled annual rate should match the pooled exponentials implied by the individual annual rates.
And I’d be keen on more work identifying real-life examples where the geometric mean approach breaks, and more work suggesting theoretical conditions under which it does (or does not). Right now we only have external Bayesianity motivating it, which, while compelling, is clearly not enough.
I agree with a lot of this. In particular, that the best approach for practical rationality involves calculating things out according to each of the probabilities and then aggregating from there (or something like that), rather than aggregating first. That was part of what I was trying to show with the institution example. And it was part of what I was getting at by suggesting that the problem is ill-posed — there are a number of different assumptions we are all making about what these probabilities are going to be used for and whether we can assume the experts are themselves careful reasoners etc. and this discussion has found various places where the best form of aggregation depends crucially on these kinds of matters. I’ve certainly learned quite a bit from the discussion.
I think if you wanted to take things further, then teasing out how different combinations of assumptions lead to different aggregation methods would be a good next step.
Thank you! I learned too from the examples.
One question:
I am confused about this part. I think I said exactly the opposite? You need to aggregate first, then calculate whatever you are interested in. Otherwise you lose information (because eg taking the expected value of the individual predictions loses information that was contained in the individual predictions, about for example the standard deviation of the distribution, which depending on the aggregation method might affect the combined expected value).
What am I not seeing?
I think we are roughly in agreement on this, it is just hard to talk about. I think that compression of the set of expert estimates down to a single measure of central tendency (e.g. the arithmetic mean) loses information about the distribution that is needed to give the right answer in each of a variety of situations. So in this sense, we shouldn’t aggregate first.
The ideal system would neither aggregate first into a single number, nor use each estimate independently and then aggregate from there (I suggested doing so as a contrast to aggregation first, but agree that it is not ideal). Instead, the ideal system would use the whole distribution of estimates (perhaps transformed based on some underlying model about where expert judgments come from, such as assuming that numbers between the point estimates are also plausible) and then do some kind of EV calculation based on that. But this is so general an approach as to not offer much guidance, without further development.
I have been thinking a bit more about this.
And I have concluded that the ideal aggregation procedure should compress all the information into a single prediction—our best guess for the actual distribution of the event.
Concretely, I think that in an idealized framework we should be treating the expert predictions $p_1, \dots, p_N$ as Bayesian evidence for the actual distribution of the event of interest $E$. That is, the idealized aggregation $\hat{p}$ should just match the conditional probability of the event given the predictions: $\hat{p} = P(E \mid p_1, \dots, p_N) \propto P(E)\, P(p_1, \dots, p_N \mid E)$.
Of course, for this procedure to be practical you need to know the generative model for the individual predictions $P(p_1, \dots, p_N \mid E)$. This is for the most part not realistic: the generative model needs to take into account details of how each forecaster is generating the prediction and the redundancy of information between the predictions. So in practice we will need to approximate the aggregate measure using some sort of heuristic.
But, crucially, the approximation does not depend on the downstream task we intend to use the aggregate prediction for.
This is something hard for me to wrap my head around, since I too feel the intuitive pull of wanting to retain information about, e.g., the spread of the individual probabilities. I would feel more nervous making decisions when the forecasters wildly disagree with each other, as opposed to when the forecasters are of one voice.
What is this intuition then telling us? What do we need the information about the spread for then?
My answer is that we need to understand the resilience of the aggregated prediction to new information. This already plays a role in the aggregated prediction, since it helps us weight the relative importance we should give to our prior beliefs $P(E)$ vs the evidence from the experts $P(p_1, \dots, p_N \mid E)$: a wider spread or a smaller number of forecaster predictions will lead to weaker evidence, and therefore a higher relative weighting of our priors.
Similarly, the spread of distributions gives us information about how much we would gain from additional predictions.
I think this neatly resolves the tension between aggregating vs not, and clarifies when it is important to retain information about the distribution of forecasts: when value of information is relevant. Which, admittedly, is quite often! But when we cannot acquire new information, or we can rule out value of information as decision-relevant, then we should aggregate first into a single number, and make decisions based on our best guess, regardless of the task.
This seems roughly right to me. And in particular, I think this highlights the issue with the example of institutional failure. The problem with aggregating predictions to a single guess p of annual failure, and then using p to forecast, is that it assumes that the probability of failure in each year is independent from our perspective. But in fact, each year of no failure provides evidence that the risk of failure is low. And if the forecasters’ estimates initially had a wide spread, then we’re very sensitive to new information, and so we should update more on each passing year. This would lead to a high probability of failure in the first few years, but still a moderately high expected lifetime.
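One way to illustrate this account is with the equal-weight mixture model mentioned earlier in the thread: treat the true annual failure probability as either 10% or 0.1%, and update the weights each year the institution survives. This is only a sketch under that assumption, and the function and variable names are hypothetical.

```python
def mixture_hazards(probs, weights, years):
    """Predictive failure probability for each year, conditional on survival so far."""
    weights = list(weights)
    hazards = []
    for _ in range(years):
        # Predictive chance of failure this year under the current weights.
        hazards.append(sum(w * p for w, p in zip(weights, probs)))
        # Bayes update on having survived the year.
        survived = [w * (1 - p) for w, p in zip(weights, probs)]
        total = sum(survived)
        weights = [s / total for s in survived]
    return hazards

probs = [0.10, 0.001]
prior = [0.5, 0.5]

hazards = mixture_hazards(probs, prior, years=50)
expected_lifetime = sum(w / p for w, p in zip(prior, probs))

print(f"year 1 hazard:  {hazards[0]:.3%}")
print(f"year 10 hazard: {hazards[9]:.3%}")
print(f"year 50 hazard: {hazards[49]:.3%}")
print(f"expected lifetime: {expected_lifetime:.0f} years")
```

The predictive hazard starts at the arithmetic mean of the two estimates and decays toward the lower one as survival accumulates, which is how a high early risk can coexist with a long expected lifetime.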
I think this is a good account of the institutional failure example, thank you!
I don’t think I get your argument for why the approximation should not depend on the downstream task. Could you elaborate?
I am also a bit confused about the relationship between spread and resiliency: a larger spread of forecasts does not seem to necessarily imply weaker evidence: It seems like for a relatively rare event about which some forecasters could acquire insider information, a large spread might give you stronger evidence.
Imagine E is about the future enactment of a quite unusual government policy, and one of your forecasters is a high ranking government official. Then, if all of your forecasters are relatively well calibrated and have sufficient incentive to report their true beliefs, a 90% forecast for E by the government official and a 1% forecast by everyone else should likely shift your beliefs a lot more towards E than a 10% forecast by everyone.
Your best approximation of the summary distribution $\hat{p} = P(E \mid p_1, \dots, p_N)$ is already “as good as it can get”. You think we should be cautious and treat this probability as if it could be higher for precautionary reasons? Then I argue that you should treat it as higher, regardless of how you arrived at the estimate.
In the end this circles back to basic Bayesian / Utility theory—in the idealized framework your credences about an event should be represented as a single probability. Departing from this idealization requires further justification.
You are right that “weaker evidence” is not exactly correct—this is more about the expected variance introduced by hypothetical additional predictions. I’ve realized I am confused about what is the best way to think about this in formal terms, so I wonder if my intuition was right after all.
UPDATE: Eric Neyman recently wrote about an extra assumption that I believe cleanly cuts into why this example fails.
The assumption is called the weak substitutes condition. Essentially, it means that there are diminishing marginal returns to each forecast.
The Jack, Queen and King example does not satisfy the weak substitutes condition, and forecast aggregation methods do not work well in it.
But I think that when the condition is met we can often get good results with forecast aggregation. Furthermore, I think it is a very reasonable condition to ask for, and one that is often met in practice.
I wrote more about Neyman’s result here, though I focus more on the implications for extremizing the mean of logodds.
This seems to connect to the concept of f-means: if the utility for an option is proportional to $f(p)$, then the expected utility of your mixture model is equal to the expected utility using the f-mean of the experts’ probabilities $p_1$ and $p_2$, defined as $f^{-1}\left(\frac{f(p_1) + f(p_2)}{2}\right)$, as the $f$ in the utility calculation cancels out the $f^{-1}$. If I recall correctly, all aggregation functions that fulfill some technical conditions on a generalized mean can be written as an f-mean.
In the first example, $f$ is just linear, such that the f-mean is the arithmetic mean. In the second example, $f$ is equal to the expected lifespan of $\frac{1}{1-(1-p)} = \frac{1}{p}$, which yields the harmonic mean. As such, the geometric mean would correspond to the mixture model if and only if utility was logarithmic in $p$, as the geometric mean is the f-mean corresponding to the logarithm.
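A minimal sketch of the f-mean idea, using the two-expert numbers from the earlier examples (10% and 0.1%). The `f_mean` helper is my own naming. The logit case is included to show that the geometric mean of odds is also an f-mean, with f equal to the log-odds.

```python
import math

def f_mean(values, f, f_inv):
    """Generalised f-mean: apply f, average, then map back with the inverse of f."""
    return f_inv(sum(f(v) for v in values) / len(values))

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1 / (1 + math.exp(-x))

probs = [0.10, 0.001]

arithmetic = f_mean(probs, lambda p: p, lambda y: y)        # f linear
harmonic = f_mean(probs, lambda p: 1 / p, lambda y: 1 / y)  # f = expected lifespan 1/p
geometric = f_mean(probs, math.log, math.exp)               # f = log p
geo_odds = f_mean(probs, logit, expit)                      # f = log-odds

print(f"arithmetic mean:        {arithmetic:.4%}")
print(f"harmonic mean:          {harmonic:.4%}")
print(f"geometric mean of p:    {geometric:.4%}")
print(f"geometric mean of odds: {geo_odds:.4%}")
```

For small probabilities the geometric mean of probabilities and the geometric mean of odds are close, since odds and probabilities nearly coincide there.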
For a binary event with “true” probability q, the expected log-score for a forecast of p is qlog(p)∗(1−q)log(1−p)=log(pq(1−p)1−q), which equals log(√p1−p)=0.5log(p1−p) for q=0.5. So the geometric mean of odds would
yield the correct utility for the log-score according to the mixture model, if all the events we forecast were essentially coin tosses (which seems like a less satisfying synthesis than I hoped for).
Further questions that might be interesting to analyze from this point of view:
Is there some kind of approximate connection between the Brier score and the geometric mean of odds that could explain the empirical performance of the geometric mean on the Brier score? (There might very well not be anything, as the mixture model might not be the best way to think about aggregation).
What optimization target (under the mixture model) does extremization correspond to? Edit: As extremization is applied after the aggregation, it cannot be interpreted in terms of mixture models (if all forecasters give the same prediction, any f-mean has to have that value, but extremization yields a more extreme prediction.)
Note: After writing this, I noticed that UnexpectedValue’s comment on the top-level post essentially points to the same concept. I decided to still post this, as it seems more accessible than their technical paper while (probably) capturing the key insight.
Edit: Replaced “optimize” by “yield the correct utility for” in the third paragraph.
Thanks — I hadn’t heard of f-means before and it is a useful concept, and relevant here.
I want to push back a bit against the use of 0.00000000001% in this example. In particular, I was sort of assuming that experts are kind of calibrated, and if two human experts have that sort of disagreement:
Either this is the kind of scenario in which we’re discussing how a fair coin will land, and one of the experts has seen the coin
Or something is very, very wrong
With some light selection of experts (e.g., decent Metaculus forecasters), I think you’d almost never see this kind of scenario unless someone was trolling you. And if the 0.0..001% person was willing to bet a correspondingly high amount at those odds, I would probably weigh it very highly. In this case I think the geometric mean would in fact be appropriate.
Though I guess that it wouldn’t be if you’re querying random experts who can randomly be catastrophically wrong, and the arithmetic mean would be more robust.
I see what you mean, though you will find that scientific experts often end up endorsing probabilities like these. They model the situation, run the calculation and end up with 10^-12 and then say the probability is 10^-12. You are right that if you knew the experts were Bayesian and calibrated and aware of all the ways the model or calculation could be flawed, and had a good dose of humility, then you could read more into such small claimed probabilities — i.e. that they must have a mass of evidence they have not yet shared. But we are very rarely in a situation like that. Averaging a selection of Metaculus forecasters may be close, but is quite a special case when you think more broadly about the question of how to aggregate expert predictions.
Consider that if you’re aggregating expert predictions, you might be generating probabilities too soon. Instead you could for instance interview the subject-matter experts, make the transcript available to expert forecasters, and then aggregate the probabilities of the latter. This might produce more accurate probabilities.
I endorse Nuño’s comment re: 0.00000000001%.
While it’s pretty easy to agree that the probability of a stupid mistake or typo is greater than 0.00000000001%, it is sometimes hard to follow in practice. I think Yudkowsky communicates it well on a more visceral level in his Infinite Certainty essay. I got to another level of appreciation of this point after doing a calibration exercise for mental arithmetic: all errors were unpredictable “oops” moments like misreading plus for minus or selecting the wrong answer after making correct calculations.
This example weakens the case for the arithmetic mean.
First let me establish: both of the surveyors’ estimates are virtually impossible for anything listed on AirBnB. They must be fabricated, hallucinated, trolled, drunken, parasitically-motivated, wildly uncalibrated, or 2 simultaneous typos.
Even buildings that are considered structurally unsound often end up standing for years anyway, and 50% just isn’t plausible except for some extraordinary circumstances. 50% over the next 24-hour period is reasonable if the building looks like this.
And as for 0.00000000001%, this is permitted by physics but that’s the strongest endorsement I can give. This implies that after 100 million years, or 36,500,000,000 days, there would still have only been a 30.58% chance of a collapse. It’s a reasonable guess if the interior of the building is entirely filled with a very stable material, and the outside is encased in bedrock 100m below the surface, in a geologically-quiet area.
You advise the reader:
This seems either contradictory, or needs elaboration. You show the correct intuition by suggesting the real probability is much lower, and in all likelihood, the building will do the mundane thing buildings usually do: stand uncollapsed for years to come. I wouldn’t move in to start a family there, but I’m not worried if some kids camp in there for a few nights either.
So imagine giving it the arithmetic mean answer of ~25%. That is almost impossible for anything listed on AirBnb. Now I am poor at doing calculations, but I think the geometric mean is 0.00022361%. If true, then after 1,000 years it would give a chance of collapse of 55.79%. This is plausible for some kinds of real-world buildings. Personally I would expect a higher percent as most buildings aren’t designed to last that long, or would be deliberately demolished (and therefore “collapse”) before then. But hey, it’s a plausible forecast for many actual buildings.
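Since the comment flags its own arithmetic as uncertain, here is a hedged sketch of the long-run collapse calculations. It assumes an independent, constant nightly collapse probability. As far as I can tell, the 30.58%, 0.00022361% and 55.79% figures are reproduced when the low estimate is read as a raw probability of 1e-11 and the pooled value is the plain geometric mean of the two probabilities; reading 0.00000000001% literally as a percentage (1e-13) gives smaller numbers. All names below are my own.

```python
import math

def prob_collapse_within(days, nightly_p):
    """1 - (1 - p)^days, computed in a numerically stable way for tiny p."""
    return -math.expm1(days * math.log1p(-nightly_p))

P_HIGH = 0.5
P_LOW_AS_RAW = 1e-11      # assumption: "0.00000000001" read as a raw probability
P_LOW_AS_PERCENT = 1e-13  # "0.00000000001%" read literally as a percentage

DAYS_100M_YEARS = 36_500_000_000
DAYS_1000_YEARS = 365_000

# Chance of at least one collapse over 100 million years at the low estimate.
long_run_raw = prob_collapse_within(DAYS_100M_YEARS, P_LOW_AS_RAW)
long_run_pct = prob_collapse_within(DAYS_100M_YEARS, P_LOW_AS_PERCENT)

# Pooled nightly probabilities under the raw-probability reading.
geo_probs_raw = math.sqrt(P_HIGH * P_LOW_AS_RAW)
odds = math.sqrt((P_HIGH / (1 - P_HIGH)) * (P_LOW_AS_RAW / (1 - P_LOW_AS_RAW)))
geo_odds_raw = odds / (1 + odds)

millennium_raw = prob_collapse_within(DAYS_1000_YEARS, geo_probs_raw)

print(f"100M years at 1e-11 per night: {long_run_raw:.2%}")
print(f"100M years at 1e-13 per night: {long_run_pct:.2%}")
print(f"geometric mean of probabilities: {geo_probs_raw:.8%}")
print(f"geometric mean of odds:          {geo_odds_raw:.8%}")
print(f"1,000 years at the geometric mean of probabilities: {millennium_raw:.2%}")
```

Under either reading the pooled nightly probability stays many orders of magnitude below the 25% arithmetic mean, so the qualitative point of the comment does not depend on which reading is used.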
One factor in all this is that geometric mean aggregation makes more sense when there are proper-scoring incentives to be accurate, e.g. log-scoring is used. That is, being wrong at 99.999999% confidence should totally ruin your whole track record and you would lose whatever forecaster-prestige you could’ve had. That’s a social system where you can take extreme predictions more seriously. But in untracked setups where people can just give one-off numbers that aren’t scored, and there is no particular incentive to give an accurate forecast, it’s more plausible that the arithmetic mean of probabilities ends up being superior in some cases. But even then, there are notable cases where it will be wildly off, such as the surveyor example you gave.
You raise valid points, e.g. how geomean could give terrible results under some conditions. Like if someone says “Yeah I think the probability is 1/Tree(3) man.” and the whole thing is ruined. That is a valuable and reasonable point, and there may be some domains or prestige game setups where geomean would be broken by some yahoo giving a wild estimate. However I don’t condone a meta-approach where you say “My aggregation method says 25%, which I’m even acknowledging can’t be right, but you should act as if it could be”. Might as well act as if it’s nonsense and just assume the base rate for AirBnB collapses.