# Jsevillamol

Karma: 857

PhD student at Aberdeen University studying Bayesian reasoning

Interested in practical exercises and theoretical considerations related to causal inference, forecasting and prioritization.

• The ideal system would [not] aggregate first into a single number [...] Instead, the ideal system would use the whole distribution of estimates

And I have concluded that the ideal aggregation procedure should compress all the information into a single prediction—our best guess for the actual distribution of the event.

Concretely, I think that in an idealized framework we should be treating the expert predictions $p_1, \ldots, p_N$ as Bayesian evidence for the actual distribution of the event of interest $X$. That is, the idealized aggregation should just match the conditional probability of the event given the predictions: $P(X \mid p_1, \ldots, p_N)$.

Of course, for this procedure to be practical you need to know the generative model for the individual predictions, $P(p_1, \ldots, p_N \mid X)$. This is for the most part not realistic—the generative model needs to take into account details of how each forecaster is generating the prediction and the redundancy of information between the predictions. So in practice we will need to approximate the aggregate measure using some sort of heuristic.
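For concreteness, here is a toy sketch of the idealized procedure in a case where the generative model *is* known. Everything about the model is an assumption I am making up for illustration: a uniform prior over the true probability $q$, and forecasters who report $q$ corrupted by Gaussian noise on the log-odds scale (the `aggregate` helper name and the noise level are arbitrary too).

```python
# Toy illustration: aggregate forecasts as Bayesian evidence under an
# assumed generative model (uniform prior on the true probability q,
# each forecaster reports q with Gaussian noise on the log-odds scale).
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def aggregate(reports, noise_sd=0.5, grid_size=10_001):
    """Posterior mean of q given the reports: P(X = 1 | p_1, ..., p_n)."""
    q = np.linspace(1e-6, 1 - 1e-6, grid_size)  # grid over the true probability
    # likelihood of each report: logit(p_i) ~ Normal(logit(q), noise_sd)
    log_lik = sum(
        -0.5 * ((logit(p) - logit(q)) / noise_sd) ** 2
        for p in reports
    )
    post = np.exp(log_lik - log_lik.max())  # unnormalized posterior (flat prior)
    post /= post.sum()
    return float((q * post).sum())          # P(X = 1) = E[q | reports]

print(aggregate([0.6, 0.7, 0.65]))  # pooled probability lands near the reports
print(aggregate([0.1, 0.9]))        # wide disagreement -> stays near the prior
```

Note that with strongly disagreeing reports the posterior stays close to the prior, which is one sense in which the spread still matters even though the output is a single number.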

But, crucially, the approximation does not depend on the downstream task we intend to use the aggregate prediction for.

This is hard for me to wrap my head around, since I too feel the intuitive pull of wanting to retain information about eg the spread of the individual probabilities. I would feel more nervous making decisions when the forecasters wildly disagree with each other, as opposed to when the forecasters speak with one voice.

What is this intuition telling us, then? What do we need the information about the spread for?

My answer is that we need to understand the resilience of the aggregated prediction to new information. This already plays a role in the aggregated prediction, since it helps us weigh the relative importance we should give to our prior beliefs vs the evidence from the experts: a wider spread or a smaller number of forecaster predictions will lead to weaker evidence, and therefore a higher relative weighting of our priors.

Similarly, the spread of distributions gives us information about how much we would gain from additional predictions.

I think this neatly resolves the tension between aggregating vs not, and clarifies when it is important to retain information about the distribution of forecasts: when value of information is relevant. Which, admittedly, is quite often! But when we cannot acquire new information, or we can rule out value of information as decision-relevant, then we should aggregate first into a single number, and make decisions based on our best guess, regardless of the task.

• I’ve been having some mixed feelings about some recent initiatives in the Forum.

These include things in the space of the creative fiction contest, posting humorous top level content and asking people to share memes.

I am having trouble articulating exactly what is causing my uneasiness. I think it’s something along the lines of: “I use the EA Forum to stay up to date on research, projects and considerations about Effective Altruism. Fun content distracts from that experience, and makes it harder for the work I publish in the Forum to be taken seriously.”

On the other hand, I do see the value of having friendly content around. It makes the community more approachable. And the last thing I would want is to gatekeep people out for wanting to have fun together. I love hanging out with EAs too!

I trust the leadership of the Forum to have thought about these and other considerations. But I am voicing my opinion in case there are more who also share this uneasiness, to see if we can pinpoint it and figure out what to do about it.

Things that I think would help mitigate my uneasiness:

• Create a peer-reviewed forum on top of the EA Forum, which curates research/thoughtful content. An interface like the Alignment Forum / LessWrong would work well for this.

• Create a separate place of discourse (a Facebook group?) for fun content, perhaps linked somehow from the EA Forum.

• Have the fun content be hidden by default, like personal posts, so people need to opt into it.

What do other people think? Do other people feel this way?

# Announcing riesgoscatastroficosglobales.com

14 Sep 2021 15:42 UTC
35 points
• Thank you! I learned too from the examples.

One question:

In particular, that the best approach for practical rationality involves calculating things out according to each of the probabilities and then aggregating from there (or something like that), rather than aggregating first.

I am confused about this part. I think I said exactly the opposite? You need to aggregate first, then calculate whatever you are interested in. Otherwise you lose information (because eg taking the expected value of the individual predictions loses information that was contained in the individual predictions, about for example the standard deviation of the distribution, which depending on the aggregation method might affect the combined expected value).

What am I not seeing?

• Thank you for your thoughts!

I agree with the general point of “different situations will require different approaches”.

From that common ground, I am interested in seeing whether we can tease out when it is appropriate to use one method against the other.

*disclaimer: low confidence from here onwards

I do not find the first example about value 0 vs value 500 entirely persuasive, though I see where you are coming from, and I think I can see when it might work.

The arithmetic mean of probabilities is entirely justified when aggregating predictions from models that start from disjoint and exhaustive conditions (this was first pointed out to me by Ben Snodin, and Owen CB makes the same point in a comment above).

This suggests that if your experts are using radically different assumptions (and are not hedging their bets based on each other’s arguments) then the average probability seems more appealing. I think this is implicitly what is happening in Linch’s and your first and third examples—we are in a sense assuming that only one expert is correct in the assumptions that led them to their estimate, but you do not know which one.

My intuition is that once you have experts who are giving all-considered estimates, the geometric mean takes over again. I realize that this is a poor argument, but it is at least a concrete claim about when it is correct to use the arithmetic vs the geometric mean of probabilities.

In slogan form: the average of probabilities works for aggregating hedgehogs, the geometric mean works for aggregating foxes.

On the second example about institution failure, the argument goes that the expected value of the aggregate probability ought to correspond to the mean of the expected values.

I do not think this is entirely correct—I think you lose information when taking the expected values before aggregating, and thus we should not in general expect this. This is an argument similar to (Lindley, 1983), where the author dismisses marginalization as a desirable property on similar grounds. For a concrete example, see this comment, where I worked out how the expected value of the aggregate of two log normals relates to the aggregate of the expected value.

What I think we should require is that the aggregate of the exponential distributions implied by the annual probabilities matches the exponential distribution implied by the aggregated annual probabilities.

Interestingly, if you take the geometric mean aggregate of two exponential densities $\lambda_1 e^{-\lambda_1 t}$ and $\lambda_2 e^{-\lambda_2 t}$ with associated annual probabilities $\lambda_1, \lambda_2$, then you end up with $\frac{\lambda_1 + \lambda_2}{2} e^{-\frac{\lambda_1 + \lambda_2}{2} t}$.

That is, the geometric mean aggregation of the implied exponentials leads to an exponential whose annual rate is the arithmetic mean of the individual rates.

EDIT: This is wrong, since the annualized probability does not match the rate parameter of an exponential. It still does not work after we correct it by substituting $\lambda_i = -\ln(1 - p_i)$.

I consider this a strong argument against the geometric mean.

Note that the arithmetic mean fails to meet this property too—the mixture distribution is not even an exponential! The harmonic mean does not satisfy this property either.

What is the class of aggregation methods implied by imposing this condition? I do not know.
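The claim that geometric pooling of exponentials returns an exponential with the arithmetic mean of the rates can be checked numerically. Here is a small sketch (the rates 0.3 and 0.9 and the integration grid are arbitrary choices of mine):

```python
# Numerical check: the geometric mean of two exponential densities, once
# renormalized, is again exponential, with the arithmetic mean of the rates.
import numpy as np

def trapz(y, x):
    """Simple trapezoidal integration, to avoid NumPy version differences."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

l1, l2 = 0.3, 0.9
t = np.linspace(0, 60, 600_001)

f1 = l1 * np.exp(-l1 * t)
f2 = l2 * np.exp(-l2 * t)

pooled = np.sqrt(f1 * f2)      # geometric mean of the two densities
pooled /= trapz(pooled, t)     # renormalize into a proper density

mean = trapz(t * pooled, t)    # for Exp(rate), the mean is 1/rate
print(1 / mean, (l1 + l2) / 2) # implied rate vs arithmetic mean of the rates
```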

I do not have much to say about the Jack, Queen, King example. I agree with the general point that yes, there are some implicit assumptions that make the geometric mean work well in practice.

Definitely the JQK example does not feel like “business as usual”. There is an unusual dependence between the beliefs of the experts. For example, had we pooled expert C as well, the example would no longer work.

I’d like to see whether we can derive some more intuitive examples that follow this pattern. There might be—but right now I am drawing a blank.

In sum, I think there is an important point here that needs to be acknowledged—the theoretical and empirical evidence I provided is not enough to pinpoint the conditions where the geometric mean is the better aggregate (as opposed to the arithmetic mean).

I think the intuition behind using mixture probabilities is correct when the experts are reasoning from mutually exclusive assumptions. I feel a lot less confident when aggregating experts giving all-considered views. In that case my current best guess is the geometric mean, but now I feel a lot less confident.

I think that first taking the expected value then aggregating loses you information. When taking a linear mixture this works by happy coincidence, but we should not expect this to generalize to situations where the correct pooling method is different.

I’d be interested in understanding better the class of pooling methods that “respect the exponential distribution” in the sense I defined above: the exponential associated with the pooled annual rate should match the pool of the exponentials implied by the individual annual rates.

And I’d be keen on more work identifying real-life examples where the geometric mean approach breaks, and more work suggesting theoretical conditions where it does (not). Right now we only have external Bayesianity motivating it, which, while compelling, is clearly not enough.

• Let’s work this example through together! (but I will change the quantities to 10 and 20 for numerical stability reasons)

One thing we need to be careful with is not mixing the implied beliefs with the object level claims.

In this case, person A’s claim that the value is 10 is more accurately a claim that the beliefs of person A can be summed up as some distribution over the positive numbers, eg a log-normal with parameters $\mu_A = \ln 10$ and $\sigma_A$. So the density of the beliefs of A is $f_A(x) = \frac{1}{x \sigma_A \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu_A)^2}{2\sigma_A^2}\right)$ (and similarly for person B, with $\mu_B = \ln 20$). The scale parameters $\sigma_A, \sigma_B$ intuitively represent the uncertainty of person A and person B.

Taking $\sigma_A = \sigma_B = 0.1$, these densities look like:

Note that the mean of these distributions is slightly displaced upwards from the median $e^{\mu}$. Concretely, the mean is computed as $e^{\mu + \sigma^2/2}$, and equals 10.05 and 20.10 for person A and person B respectively.

To aggregate the distributions, we can use the generalization of the geometric mean of odds referred to in footnote [1] of the post.

According to that, the aggregated distribution has density $f(x) \propto \sqrt{f_A(x) f_B(x)}$.

The plot of the aggregated density looks like:

I actually notice that I am very surprised about this—I expected the aggregate distribution to be bimodal, but here it seems to have a single peak.

For this particular example, a numerical approximation of the expected value comes out to around 14.21, which matches the geometric mean of the means ($\sqrt{10.05 \times 20.10} \approx 14.21$).

I am not taking away any solid conclusions from this exercise—I notice I am still very confused about what the aggregated distribution looks like, and I encountered serious numerical stability issues when changing the parameters, which make me suspect a bug.

Maybe a Monte Carlo approach for estimating the expected value would solve the stability issues—I’ll see if I can get around to that at some point.

Meanwhile, here is my code for the results above.
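For reference, here is a self-contained re-implementation of the computation (my own sketch, not the original code; the log-spaced grid and its bounds are arbitrary choices made for numerical stability):

```python
# Pool two log-normal belief distributions with the geometric mean of
# densities, renormalize, and compute the expected value numerically.
import numpy as np

def trapz(y, x):
    """Simple trapezoidal integration."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def lognormal_pdf(x, mu, sigma):
    return np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma**2)) / (x * sigma * np.sqrt(2 * np.pi))

sigma = 0.1
mu_a, mu_b = np.log(10), np.log(20)

# log-spaced grid: integrating in log space avoids the stability issues
x = np.exp(np.linspace(np.log(1), np.log(100), 200_001))
pooled = np.sqrt(lognormal_pdf(x, mu_a, sigma) * lognormal_pdf(x, mu_b, sigma))
pooled /= trapz(pooled, x)  # renormalize into a proper density

ev = trapz(x * pooled, x)
geo_mean_of_means = np.sqrt(10 * np.exp(sigma**2 / 2) * 20 * np.exp(sigma**2 / 2))
print(ev, geo_mean_of_means)  # both come out around 14.21
```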

EDIT: Diego Chicharro has pointed out to me that the expected value can be easily computed analytically in Mathematica.

The resulting expected value of the aggregated distribution is $\exp\left(\frac{\mu_A/\sigma_A^2 + \mu_B/\sigma_B^2}{1/\sigma_A^2 + 1/\sigma_B^2} + \frac{\sigma_A^2 \sigma_B^2}{\sigma_A^2 + \sigma_B^2}\right)$.

In the case where we have $\sigma_A = \sigma_B = \sigma$, the expected value is $e^{(\mu_A + \mu_B)/2 + \sigma^2/2} = \sqrt{e^{\mu_A + \sigma^2/2} \cdot e^{\mu_B + \sigma^2/2}}$, which is exactly the geometric mean of the expected values of the individual predictions.

• I think you mean bearish

Oops, yes!

You point out this highly skilled management/leadership/labor is not fungible

Yes, exactly.

I think what I am pointing towards is something like: “if you are one such highly skilled editor, and your plan is to work on something like this part time, delegating work to more junior people, then you are going to find yourself burnt out very soon. Managing a team of junior people / people who do not share your aesthetic sense to do highly skilled labor will be, at least for the first six months or so, much more work than doing it on your own.”

I think an editor will be ten times more likely to succeed if:

1. They have a highly skilled co-founder who shares their vision

2. They have a plan to work on something like this full time, at least for a while

3. They have a plan for training aligned junior people on skills OR to teach taste to experts

In hindsight I think my comment was too negative, since I would still be excited about someone retrying a Distill-like experiment and throwing money at it.

• I am more bullish about this. I think for Distill to succeed it needs to have at least two full-time editors committed to the mission.

Managing people is hard. Managing people, training them and making sure the vision of the project is preserved is insanely hard—a full time job for at least two people.

Plus, the part Distill was bottlenecked on is very highly skilled labour, which needed a special aesthetic sensitivity and commitment.

50 senior hours per draft sounds insane—but I do believe the Distill staff when they say it is needed.

This wraps back to why new journals are so difficult: you need talented researchers with additional entrepreneurial skills to push them forward. But researchers by and large would much rather just work on their research than manage a journal.

• Create a journal of AI safety, and get prestigious people like Russell publishing in it.

Basically many people in academia are stuck chasing publications. Aligning that incentive seems important.

The problem is that journals are hard work, and require a very specific profile to push it forward.

Here is a post mortem of a previous attempt: https://distill.pub/2021/distill-hiatus/

• META: Do you think you could edit this comment to include...

1. The number of questions, and aggregated predictions per question?

2. The information on extremized geometric mean you computed below (I think it is not receiving as much attention due to being buried in the replies)?

3. Possibly a code snippet to reproduce the results?

• You are right and I should be more mindful of this.

I have reformulated the main equations using only commonly known symbols, moved the equations that were not critical for the text to a footnote and added plain language explanations to the rest.

(I hope it is okay that I stole your explanation of the geometric mean!)

I mean in the past people were underconfident (so extremizing would make their predictions better). Since then they’ve stopped being underconfident. My assumption is that this is because the average predictor is now more skilled, or because having more predictors improves the quality of the average.

Gotcha!

The bias isn’t that more questions resolve positively than users expect.

Oh I see!

• but also the average predictor improving their ability also fixed that underconfidence

What do you mean by this?

Metaculus has a known bias towards questions resolving positive

Oh I see!

It is very cool that this works.

One thing that confuses me: when you take the geometric mean of probabilities you end up with $\sqrt[n]{\prod_i p_i} \le \frac{\sqrt[n]{\prod_i p_i}}{\sqrt[n]{\prod_i p_i} + \sqrt[n]{\prod_i (1 - p_i)}}$, since the denominator is at most 1. So the pooled probability gets slightly nudged towards 0 in comparison to what you would get with the geometric mean of odds. Doesn’t that mean that it should be less accurate, given the bias towards questions resolving positively?
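Here is a quick numerical illustration of the nudge (the probabilities and helper names are arbitrary; this is just a sketch):

```python
# Compare pooling via the geometric mean of probabilities against pooling
# via the geometric mean of odds: the former is always at most the latter.
import numpy as np

def pool_geo_prob(ps):
    """Geometric mean of the probabilities themselves."""
    return float(np.prod(ps) ** (1 / len(ps)))

def pool_geo_odds(ps):
    """Geometric mean of odds, mapped back to a probability."""
    odds = np.array(ps) / (1 - np.array(ps))
    g = np.prod(odds) ** (1 / len(odds))
    return float(g / (1 + g))

ps = [0.3, 0.5, 0.8]
print(pool_geo_prob(ps), pool_geo_odds(ps))  # the first is lower
```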

What am I missing?

# When pooling forecasts, use the geometric mean of odds

3 Sep 2021 9:58 UTC
76 points
• Thank you for the superb analysis!

This increases my confidence in the geo mean of the odds, and decreases my confidence in the extremization bit.

I find it very interesting that the extremized version was consistently below by a narrow margin. I wonder if this means that there is a subset of questions where it works well, and another where it underperforms.

One question / nitpick: what do you mean by geometric mean of the probabilities? If you just take the geometric mean of probabilities then you do not get a valid probability: the pooled $p$s and $(1-p)$s do not sum to 1. You need to rescale them, at which point you end up with the geometric mean of odds.

Unexpected values explains this better than me here.
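The rescaling claim can be verified directly. A small sketch (helper names and probabilities are mine, for illustration):

```python
# Verify: normalizing the geometric means of p_i and (1 - p_i) so the pair
# sums to 1 recovers exactly the geometric-mean-of-odds pooled probability.
import numpy as np

def pool_geo_odds(ps):
    odds = np.array(ps) / (1 - np.array(ps))
    g = np.prod(odds) ** (1 / len(odds))
    return float(g / (1 + g))

def pool_rescaled_geo_prob(ps):
    ps = np.array(ps)
    gp = np.prod(ps) ** (1 / len(ps))        # geometric mean of the p_i
    gq = np.prod(1 - ps) ** (1 / len(ps))    # geometric mean of the (1 - p_i)
    return float(gp / (gp + gq))             # rescale so the pair sums to 1

ps = [0.2, 0.65, 0.9]
print(pool_geo_odds(ps), pool_rescaled_geo_prob(ps))  # identical values
```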

• Thank you!

The freedom for side projects is the best—though I should warn other people here that having a supportive supervisor who is okay with this is crucial.

I have definitely heard more than one horror story from colleagues who were constantly fighting their supervisors on the direction of their research, and felt they had little room for side projects.