Thanks for writing this up; I agree with your conclusions.
There’s a neat one-to-one correspondence between proper scoring rules and probabilistic opinion pooling methods satisfying certain axioms, and this correspondence maps Brier’s quadratic scoring rule to arithmetic pooling (averaging probabilities) and the log scoring rule to logarithmic pooling (geometric mean of odds). I’ll illustrate the correspondence with an example.
Let’s say you have two experts: one says 10% and one says 50%. You see these predictions and need to come up with your own prediction, and you’ll be scored using the Brier loss: (1 - x)^2, where x is the probability you assign to whichever outcome ends up happening (you want to minimize this). Suppose you know nothing about pooling; one really basic thing you can do is to pick an expert to trust at random: report 10% with probability 1⁄2 and 50% with probability 1⁄2. Your expected Brier loss in the case of YES is (0.81 + 0.25)/2 = 0.53, and your expected loss in the case of NO is (0.01 + 0.25)/2 = 0.13.
But you can do better. Suppose you say 35%: then your loss is 0.4225 in the case of YES and 0.1225 in the case of NO, which is better in both cases! So you might ask: what is the strategy that gives me the largest possible guaranteed improvement over choosing a random expert? The answer is linear pooling (averaging the experts). Here that means reporting 30%, which gets you 0.49 in the case of YES and 0.09 in the case of NO (an improvement of 0.04 in each case).
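To make the Brier example concrete, here is a minimal Python sketch that brute-forces the question over a grid of candidate reports. The function names and the 0.001 grid step are my own choices for illustration, not anything from the paper.

```python
# Experts' YES probabilities from the example above.
EXPERTS = [0.10, 0.50]


def brier_loss(p_yes, outcome_yes):
    """Brier loss (1 - x)^2, where x is the probability given to what happened."""
    x = p_yes if outcome_yes else 1.0 - p_yes
    return (1.0 - x) ** 2


def random_expert_loss(outcome_yes):
    """Expected loss when reporting a uniformly random expert's forecast."""
    return sum(brier_loss(p, outcome_yes) for p in EXPERTS) / len(EXPERTS)


def guaranteed_improvement(report):
    """Worst-case (over YES/NO) improvement over the random-expert baseline."""
    return min(
        random_expert_loss(outcome) - brier_loss(report, outcome)
        for outcome in (True, False)
    )


# Grid search over reports 0.001, 0.002, ..., 0.999 (step size is arbitrary).
candidates = [i / 1000 for i in range(1, 1000)]
best_report = max(candidates, key=guaranteed_improvement)
linear_pool = sum(EXPERTS) / len(EXPERTS)

print(best_report, linear_pool, guaranteed_improvement(best_report))
```

On this grid the best report coincides with the plain average of the two forecasts, with a guaranteed improvement of about 0.04.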
Now suppose you were instead being scored with a log loss, so your loss is -ln(x), where x is the probability you assign to whichever outcome ends up happening. If you again pick an expert at random, your expected log loss in the case of YES is (-ln(0.1) - ln(0.5))/2 ≈ 1.498, and in the case of NO is (-ln(0.9) - ln(0.5))/2 ≈ 0.399.
Again you can ask: what is the strategy that gives you the largest possible guaranteed improvement over this “choose a random expert” strategy? This time, the answer is logarithmic pooling (taking the geometric mean of the odds). The experts’ odds are 1:9 and 1:1, whose geometric mean is 1:3, i.e. 25%. This has a loss of 1.386 in the case of YES and 0.288 in the case of NO, an improvement of about 0.112 in each case.
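The same check can be run for the log loss. This is a rough sketch under the same assumptions as the example (two experts at 10% and 50%, uniform random choice as the baseline); the helper names and the grid are illustrative only.

```python
import math

# Experts' YES probabilities from the example above.
EXPERTS = [0.10, 0.50]


def log_loss(p_yes, outcome_yes):
    """Log loss -ln(x), where x is the probability given to what happened."""
    x = p_yes if outcome_yes else 1.0 - p_yes
    return -math.log(x)


def log_pool(probs):
    """Geometric mean of odds, computed as the average of the log-odds."""
    mean_log_odds = sum(math.log(p / (1.0 - p)) for p in probs) / len(probs)
    return 1.0 / (1.0 + math.exp(-mean_log_odds))


def random_expert_loss(outcome_yes):
    """Expected loss when reporting a uniformly random expert's forecast."""
    return sum(log_loss(p, outcome_yes) for p in EXPERTS) / len(EXPERTS)


def improvement(report, outcome_yes):
    """Improvement over the random-expert baseline for one outcome."""
    return random_expert_loss(outcome_yes) - log_loss(report, outcome_yes)


pooled = log_pool(EXPERTS)

# Grid search for the report with the best worst-case improvement.
candidates = [i / 1000 for i in range(1, 1000)]
best_report = max(
    candidates,
    key=lambda r: min(improvement(r, True), improvement(r, False)),
)

print(pooled, best_report, improvement(pooled, True), improvement(pooled, False))
```

The grid search lands on the logarithmic pool, and the improvement is the same whichever outcome occurs.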
(This works just as well with weights: say you trust one expert more than the other. You could choose an expert at random in proportion to these weights; the strategy that guarantees the largest improvement over this is to take the correspondingly weighted pool: the weighted average of the experts’ probabilities under the Brier loss, or the weighted geometric mean of their odds under the log loss.)
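Here is a small sketch of the weighted case. The 0.75/0.25 weights are made up for illustration, and the code only checks that each weighted pool improves on the weighted random-expert baseline by the same amount under YES and under NO (which is what you would expect at the max-min point); it does not prove optimality.

```python
import math


def weighted_linear_pool(probs, weights):
    """Weighted average of probabilities (weights assumed to sum to 1)."""
    return sum(w * p for p, w in zip(probs, weights))


def weighted_log_pool(probs, weights):
    """Weighted geometric mean of odds, via a weighted average of log-odds."""
    log_odds = sum(w * math.log(p / (1.0 - p)) for p, w in zip(probs, weights))
    return 1.0 / (1.0 + math.exp(-log_odds))


def brier_loss(x):
    """Brier loss as a function of the probability x given to what happened."""
    return (1.0 - x) ** 2


def log_loss(x):
    """Log loss as a function of the probability x given to what happened."""
    return -math.log(x)


def improvements(loss, report, probs, weights):
    """Improvement over the weighted random-expert baseline, as (YES, NO)."""
    base_yes = sum(w * loss(p) for p, w in zip(probs, weights))
    base_no = sum(w * loss(1.0 - p) for p, w in zip(probs, weights))
    return base_yes - loss(report), base_no - loss(1.0 - report)


probs = [0.10, 0.50]
weights = [0.75, 0.25]  # hypothetical trust weights, not from the original post

brier_gain = improvements(
    brier_loss, weighted_linear_pool(probs, weights), probs, weights
)
log_gain = improvements(
    log_loss, weighted_log_pool(probs, weights), probs, weights
)

print(brier_gain, log_gain)
```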
This generalizes to other scoring rules as well. I co-wrote a paper about this, which you can find here, or here’s a talk if you prefer.
What’s the moral here? I wouldn’t say that it’s “use arithmetic pooling if you’re being scored with the Brier score and logarithmic pooling if you’re being scored with the log score”; as Simon’s data somewhat convincingly demonstrated (and as I think I would have predicted), logarithmic pooling works better regardless of the scoring rule.
Instead I would say: the same judgments that would influence your decision about which scoring rule to use should also influence your decision about which pooling method to use. The log scoring rule is useful for distinguishing between extreme probabilities; it treats 0.01% as substantially different from 1%. Logarithmic pooling does the same thing: the pool of 1% and 50% is about 9%, and the pool of 0.01% and 50% is about 1%. By contrast, if you don’t care about the difference between 0.01% and 1% (“they both round to zero”), perhaps you should use the quadratic scoring rule; and if you’re already not taking distinctions between low and extremely low probabilities seriously, you might as well use linear pooling.
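A quick way to see the difference is to pool a very low forecast with 50% under both methods. This is only a sketch for two equally weighted experts, with function names of my own choosing.

```python
import math


def linear_pool(p, q):
    """Arithmetic mean of two probabilities."""
    return (p + q) / 2.0


def log_pool(p, q):
    """Geometric mean of the two forecasts' odds, converted back to a probability."""
    odds = math.sqrt((p / (1.0 - p)) * (q / (1.0 - q)))
    return odds / (1.0 + odds)


for low in (0.01, 0.0001):
    print(
        f"{low:.2%} and 50%: "
        f"linear {linear_pool(low, 0.5):.3%}, log {log_pool(low, 0.5):.3%}"
    )
```

Linear pooling gives roughly 25.5% and 25.0%, so it barely notices whether the low expert said 1% or 0.01%. Logarithmic pooling gives roughly 9% and 1%, so the distinction carries through to the pooled forecast.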