I wrote a post arguing for the opposite thesis, and was pointed here. A few comments about your arguments that I didn’t address in my post:
Regarding the empirical evidence supporting averaging log odds, note that averaging log odds will always give more extreme pooled probabilities than averaging probabilities does, and in the contexts in which this empirical evidence was collected, the experts were systematically underconfident, so that extremizing the results could make them better calibrated. This easily explains why average log odds outperformed average probabilities, and I don’t expect optimally-extremized average log odds to outperform optimally-extremized average probabilities (or similarly, I don’t expect unextremized average log odds to outperform average probabilities extremized just enough to give results as extreme as average log odds on average).
External Bayesianity seems like an actively undesirable property for probability pooling methods that treat experts symmetrically. When new evidence comes in, this should change how credible each expert is if different experts assigned different probabilities to that evidence. Thus the experts should not all be treated symmetrically both before and after new evidence comes in. If you do this, you’re throwing away the information that the evidence gives you about expert credibility, and if you throw away some of the evidence you receive, you should not expect your Bayesian updates to properly account for all the evidence you received. If you design some way of defining probabilities so that you somehow end up correctly updating on new evidence despite throwing away some of that evidence (as log odds averaging remarkably does), then, once you do adjust to account for the evidence that you were previously throwing away, you will no longer be correctly updating on new evidence (i.e. if you weight the experts differently depending on credibility, and update credibility in response to new evidence, then weighted averaging of log odds is no longer externally Bayesian, and weighted averaging of probabilities is if you do it right).
I talked about the argument that averaging probabilities ignores extreme predictions in my post, but the way you stated it, you added the extra twist that the expert giving more extreme predictions is known to be more knowledgeable than the expert giving less extreme predictions. If you know one expert is more knowledgeable, then of course you should not treat them symmetrically. As an argument for averaging log odds rather than averaging probabilities, this seems like cheating, by adding an extra assumption which supports extreme probabilities but isn’t used by either pooling method, giving an advantage to pooling methods that produce extreme probabilities.
Thank you for your thoughtful reply. I think you raise interesting points, which lower my confidence in my conclusions.
Here are some comments:
[...] averaging log odds will always give more extreme pooled probabilities than averaging probabilities does
As discussed in your post, averaging the probs effectively erases the information from extreme individual probabilities, so I think you will agree that averaging log odds is not merely a more extreme version of averaging probs.
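A quick illustration, with made-up forecasts and my own helper functions:

```python
import numpy as np

def mean_probs(ps):
    """Arithmetic mean of probabilities."""
    return float(np.mean(ps))

def mean_log_odds(ps):
    """Geometric mean of odds (equivalently, mean of log odds), as a probability."""
    ps = np.asarray(ps, dtype=float)
    pooled_odds = np.exp(np.mean(np.log(ps / (1 - ps))))
    return pooled_odds / (1 + pooled_odds)

forecasts = [0.001, 0.5]            # one extreme forecast, one uninformative one
print(mean_probs(forecasts))        # ~0.25: the extreme forecast barely registers
print(mean_log_odds(forecasts))     # ~0.03: the extreme forecast dominates the pool
```

With the arithmetic mean, the pooled probability can never drop below the largest individual forecast divided by the number of experts, no matter how extreme the other forecasts are; the mean of log odds has no such floor.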
I nonetheless think this is a very important issue—the difficulty of separating the extremizing effect of log odds from its actual effect.
I don’t expect optimally-extremized average log odds to outperform optimally-extremized average probabilities
This is an empirical question that we can settle empirically. Using Simon_M's script I computed the Brier and log scores on binary Metaculus questions for the extremized mean of probabilities and the extremized mean of log odds, with extremizing factors between 1 and 3 in steps of 0.05.
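Roughly, the computation looks like this. To be clear, this is my own sketch rather than Simon_M's actual script, the toy data are made up, and extremizing by scaling the pooled log odds by a factor d is just one common convention (the Brier score is computed analogously):

```python
import numpy as np

def pool_probs(ps):
    p = np.mean(ps)
    return np.log(p / (1 - p))              # mean of probs, expressed in log-odds space

def pool_log_odds(ps):
    return np.mean(np.log(ps / (1 - ps)))   # mean of log odds

def extremize(log_odds, d):
    # Extremize by scaling the pooled log odds by a factor d >= 1
    return 1 / (1 + np.exp(-d * log_odds))

def log_loss(p, outcome):
    return -(outcome * np.log(p) + (1 - outcome) * np.log(1 - p))

# Hypothetical toy data: (individual forecasts, resolution) for each question
questions = [(np.array([0.7, 0.9]), 1), (np.array([0.2, 0.05]), 0)]

for d in np.arange(1.0, 3.05, 0.05):
    for pool in (pool_probs, pool_log_odds):
        scores = [log_loss(extremize(pool(f), d), r) for f, r in questions]
        print(pool.__name__, round(float(d), 2), np.mean(scores))
```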
In this setting, the top performing method in terms of log loss is the “optimally” extremized average of log odds, surpassing the “optimally” extremized mean of probs.
Note that the Brier scores are identical, which is consistent with the average log odds outperforming the average probs only when extreme forecasts are involved.
Also notice that the optimal extremizing factor for the average of log odds is lower than for the average of probabilities—this relates to your observation that the average log odds are already relatively extremized compared to the mean of probs.
There are reasons to question the validity of this experiment—we are effectively overfitting the extremizing factor to whatever gives the best results. And of course this is just one experiment. But I find it suggestive.
External Bayesianity seems like an actively undesirable property for probability pooling methods that treat experts symmetrically. When new evidence comes in, this should change how credible each expert is if different experts assigned different probabilities to that evidence.
I am not sure I follow your argument here.
I do agree that when new evidence comes in about the experts we should change how we weight them. But when we are pooling the probabilities we aren’t receiving any extra evidence about the experts (?).
I talked about the argument that averaging probabilities ignores extreme predictions in my post, but the way you stated it, you added the extra twist that the expert giving more extreme predictions is known to be more knowledgeable than the expert giving less extreme predictions. If you know one expert is more knowledgeable, then of course you should not treat them symmetrically.
I agree that the way I presented it I framed the extreme expert as more knowledgeable. I did this for illustrative purposes. But I believe the setting works just as well when we take both experts to be equally knowledgeable / calibrated. Throwing away the information from the extreme prediction seems bad.
Probabilities must add to 1.
I like invariance arguments—I think they can be quite illuminating. In fact I am quite puzzled by the fact that neither the average of probabilities nor the average of log odds seem to satisfy the basic invariance property of respecting annualized probabilities.
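To spell out what I mean by respecting annualized probabilities, here is one way to check it, with made-up forecasts and assuming independent identical years: pooling the one-year probabilities and then extending to a ten-year horizon should agree with pooling the ten-year probabilities directly, and neither method passes.

```python
import numpy as np

def mean_probs(ps):
    return np.mean(ps)

def mean_log_odds(ps):
    pooled_odds = np.exp(np.mean(np.log(ps / (1 - ps))))
    return pooled_odds / (1 + pooled_odds)

def over_horizon(p_annual, years):
    # Probability of the event happening at least once in `years` years,
    # assuming independent identical years
    return 1 - (1 - p_annual) ** years

annual = np.array([0.1, 0.5])       # hypothetical annual forecasts from two experts
ten_year = over_horizon(annual, 10)

for pool in (mean_probs, mean_log_odds):
    print(pool.__name__,
          over_horizon(pool(annual), 10),   # pool first, then extend the horizon
          pool(ten_year))                   # extend the horizon first, then pool
# the two numbers disagree for both pooling methods
```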
The A,B,C example you came up with is certainly a strike against average log odds and in favor of average probs. (EDIT: I no longer endorse this conclusion, see my rebuttal here) It reminds me of Toby Ord’s example with the Jack, Queen and King. I think dependency structures between events make the average log odds fail.
My personal takeaway here is that when you are aggregating probabilities derived from mutually exclusive conditions, then the average probability is the right way to go. But otherwise stick with log-odds.
[...] I maintain that, if you want a quick and dirty heuristic, averaging probabilities is a better quick and dirty heuristic than anything as senseless as averaging log odds.
I notice this is very surprising to me, because averaging log odds is anything but senseless.
This is a far lower confidence argument than the other points I raise here, but I think there is an aesthetic argument for averaging log odds—log odds make Bayes rule additive, and I expect means to work well when the underlying objects are additive (more about this from Owen CB here).
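Concretely, in log-odds form a Bayesian update is just an addition:

$$\log\frac{P(H\mid E)}{P(\neg H\mid E)} = \log\frac{P(H)}{P(\neg H)} + \log\frac{P(E\mid H)}{P(E\mid \neg H)}$$

so conditionally independent pieces of evidence stack additively on the log-odds scale.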
There is also the argument that average log odds are what you get when you try to optimize the minimum log loss in a certain situation—see Eric Neyman’s comment here.
Again, these arguments appeal mostly to aesthetic considerations. But I think it is unfair to call them senseless—they arise naturally in some circumstances.
if the worst odds you’d be willing to bet on are bounds on how seriously you take the hypothesis that someone else knows something that should make you update a particular amount, and you want to get an actual probability, then you should average over probabilities you perhaps should end up at, weighted by how likely it is that you should end up at them. This is an arithmetic mean of probabilities, not a geometric mean of odds.
Being honest, I do not fully follow the reasoning here.
My gut feeling is this argument relies on an adversarial setting where you might get exploited. And this probably means that you should come up with a probability range for the additional evidence your opponent might have.
So if you think their evidence is uniformly distributed between −1 and 1 bits, you should combine that with your evidence by adding that evidence to your logarithmic odds. This gives you a probability distribution over the possible values. Then use that spread to decide which bet odds are worth the risk of exploitation.
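A minimal sketch of what I have in mind (the uniform distribution of the opponent's evidence and the 10% prior are of course just made-up assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

p_prior = 0.10                                    # my current probability for X
log_odds_prior = np.log2(p_prior / (1 - p_prior))

# Opponent's private evidence, assumed uniform between -1 and +1 bits
evidence_bits = rng.uniform(-1, 1, size=100_000)
posterior_probs = 1 / (1 + 2.0 ** (-(log_odds_prior + evidence_bits)))

# Spread of probabilities I might end up at once their evidence is revealed;
# use it to decide which bet odds are worth the risk of exploitation
print(np.quantile(posterior_probs, [0.05, 0.5, 0.95]))
```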
I do not understand how this is about pooling different expert probabilities. But I might be misunderstanding your point.
Thank you again for writing the post and your comments. I think this is an important and fascinating issue, and I’m glad to see more discussion around it!
I think I can make sense of this. If you believe there’s some underlying exponential distribution on when some event will occur, but you don’t know the annual probability, then an exponential distribution is not a good model for your beliefs about when the event will occur, because a weighted average of exponential distributions with different annual probabilities is not an exponential distribution. This is because if time has gone by without the event occurring, this is evidence in favor of hypotheses with a low annual probability, so an average of exponential distributions should have its annual probability decrease over time.
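A quick numerical check of that last point, with made-up annual probabilities and equal prior weight on two hypotheses:

```python
import numpy as np

annual_ps = np.array([0.01, 0.2])   # hypothetical annual probabilities
weights = np.array([0.5, 0.5])      # equal prior weight on each hypothesis

for year in range(0, 50, 10):
    survival = (1 - annual_ps) ** year       # P(no event yet) under each hypothesis
    post = weights * survival
    post /= post.sum()                       # posterior weights after `year` event-free years
    print(year, np.dot(post, annual_ps))     # mixture's probability for the coming year
# the effective annual probability falls toward 0.01 as time passes without the event
```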
An exponential distribution seems like the sort of probability distribution that I expect to be appropriate when the mechanism determining when the event occurs is well-understood, so different experts shouldn’t disagree on what the annual probability is. If the true annual rate is unknown, then good experts should account for their uncertainty and not report an exponential distribution. Or, in the case where the experts are explicit models and you believe one of the models is roughly correct, then the experts would report exponential distributions, but the average of these distributions is not an exponential distribution, for good reason.
I do agree that when new evidence comes in about the experts we should change how we weight them. But when we are pooling the probabilities we aren’t receiving any extra evidence about the experts (?).
Right, the evidence about the experts comes from the new evidence that’s being updated on, not the pooling procedure. Suppose we’re pooling expert judgments, and we initially consider them all equally credible, so we use a symmetric pooling method. Then some evidence comes in. Our experts update on the evidence, and we also update on how credible each expert is, and pool their updated judgments together using an asymmetric pooling method, weighting experts by how well they anticipated the evidence we’ve seen so far. This is clearest in the case where each expert is using some model, and we believe one of their models is correct but don’t know which one (the case where you already agreed arithmetic averages of probabilities are appropriate). If we were weighting them all equally, and then we get some evidence that expert 1 thought was twice as likely as expert 2 did, then we should now think that expert 1 is twice as likely to be the one with the correct model as expert 2 is, and take a weighted arithmetic mean of their new probabilities in which we weight expert 1 twice as heavily as expert 2. When you do this, your pooled probabilities handle Bayesian updates correctly.
My point was that, even outside of this particular situation, we should still be taking expert credibility into account in some way, and expert credibility should depend on how well the expert anticipated observed evidence. If two experts assign odds ratios $r_0$ and $s_0$ to some event before observing new evidence, and we pool these into the odds ratio $r_0^{1/2} s_0^{1/2}$, and then we receive some evidence causing the experts to update to $r_1$ and $s_1$, respectively, but expert $r$ anticipated that evidence better than expert $s$ did, then I’d think this should mean we weight expert $r$ more heavily, and pool their new odds ratios into $r_1^{2/3} s_1^{1/3}$, or something like that. But we won’t handle Bayesian updates correctly if we do! The external Bayesianity property of the mean log odds pooling method means that to handle Bayesian updates correctly, we must update to the odds ratio $r_1^{1/2} s_1^{1/2}$, as if we learned nothing about the relative credibility of the two experts.
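To make the model-uncertainty case concrete, here is a minimal numerical sketch (all numbers made up) of the bookkeeping I have in mind: pooling first and then updating gives the same answer as letting the experts update, reweighting them by how well they anticipated the evidence, and then pooling, when the pool is a weighted arithmetic mean of probabilities.

```python
import numpy as np

# Two experts, each a full model of hypothesis H and evidence E; we think
# exactly one model is correct, initially with equal credibility.
w = np.array([0.5, 0.5])
p_H = np.array([0.8, 0.3])              # each expert's P(H)
p_E_given_H = np.array([0.9, 0.5])
p_E_given_notH = np.array([0.2, 0.4])
p_E = p_E_given_H * p_H + p_E_given_notH * (1 - p_H)   # each expert's P(E)

# Route 1: pool first with a weighted arithmetic mean, then update the
# pooled distribution on E as a single Bayesian agent.
route1 = np.dot(w, p_E_given_H * p_H) / np.dot(w, p_E)

# Route 2: let each expert update on E, reweight the experts by how well
# they anticipated E, then pool the updated probabilities.
p_H_given_E = p_E_given_H * p_H / p_E
w_new = w * p_E / np.dot(w, p_E)
route2 = np.dot(w_new, p_H_given_E)

print(route1, route2)   # identical: credibility-weighted probability
                        # averaging handles the Bayesian update correctly
```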
I agree that the way I presented it I framed the extreme expert as more knowledgeable. I did this for illustrative purposes. But I believe the setting works just as well when we take both experts to be equally knowledgeable / calibrated.
I suppose one reason not to see this as unfairly biased towards mean log odds is if you generally expect experts who give more extreme probabilities to actually be more knowledgeable in practice. I gave an example in my post illustrating why this isn’t always true, but a couple commenters on my post gave models for why it’s true under some assumptions, and I suppose it’s probably true in the data you’ve been using that’s been empirically supporting mean log odds.
Throwing away the information from the extreme prediction seems bad.
I can see where you’re coming from, but have an intuition that the geometric mean still trusts the information from outlying extreme predictions too much, which made a possible compromise solution occur to me, which to be clear, I’m not seriously endorsing.
I notice this is very surprising to me, because averaging log odds is anything but senseless.
I called it that because of its poor theoretical properties (I’m still not convinced they arise naturally in any circumstances), but in retrospect I don’t really endorse this given the apparently good empirical performance of mean log odds.
log odds make Bayes rule additive, and I expect means to work well when the underlying objects are additive
My take on this is that multiplying odds ratios is indeed a natural operation that you should expect to be an appropriate thing to do in many circumstances, but that taking the nth root of an odds ratio is not a natural operation, and neither is taking geometric means of odds ratios, which combines both of those operations. On the other hand, while adding probabilities is not a natural operation, taking weighted averages of probabilities is.
My gut feeling is this argument relies on an adversarial setting where you might get exploited. And this probably means that you should come up with a probability range for the additional evidence your opponent might have.
So if you think their evidence is uniformly distributed between −1 and 1 bits, you should combine that with your evidence by adding that evidence to your logarithmic odds. This gives you a probability distribution over the possible values. Then use that spread to decide which bet odds are worth the risk of exploitation.
Right, but I was talking about doing that backwards. If you’ve already worked out at which odds it’s worth accepting bets in each direction, you can recover the probability that you must currently be assigning to the event in question. Arithmetic means of the bounds on probabilities implied by the bets you’d accept are a rough approximation to this: If you would bet on X at odds implying any probability less than 2%, and you’d bet against X at odds implying any probability greater than 50%, then this is consistent with you currently assigning probability 26% to X, with a 50% chance that an adversary has evidence against X (in which case X has a 2% chance of being true), and a 50% chance that an adversary has evidence for X (in which case X has a 50% chance of being true).
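Spelled out, that is $0.5 \times 0.02 + 0.5 \times 0.50 = 0.26$.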
I do not understand how this is about pooling different expert probabilities. But I might be misunderstanding your point.
It isn’t. My post was about pooling multiple probabilities of the same event. One source of multiple probabilities of the same event is the beliefs of different experts, which your post focused on exclusively. But a different possible source of multiple probabilities of the same event is the bounds in each direction on the probability of some event implied by the betting behavior of a single expert.
I have thought more about this. I now believe that this invariance property is not reasonable—aggregating outcomes is (surprisingly) not a natural operation in Bayesian reasoning. So I do not think this is a strike against log-odds pooling.