It sounds like the headline claim is that (A) we are 33.2% to live in a world where the risk of loss-of-control catastrophe is <1%, and 7.6% to live in a world where the risk is >35%, and a whole distribution of values between, and (B) that it follows from A that the correct subjective probability of loss-of-control catastrophe is given by the geometric mean of the risk, over possible worlds.
> The 'headline' result from this analysis is that the geometric mean of all synthetic forecasts of the future is that the Community's current best guess for the risk of AI catastrophe due to an out-of-control AGI is around 1.6%. You could argue the toss about whether this means that the most reliable 'fair betting odds' are 1.6% or not (Future Fund are slightly unclear about whether they'd bet on simple mean, median etc and both of these figures are higher than the geometric mean[9]).
I want to argue that the geometric mean is not an appropriate way of aggregating probabilities across different 'worlds we might live in' into a subjective probability (as requested by the prize). This argument doesn't touch on the essay's main argument in favor of considering distributions, but may move the headline subjective probability that it suggests to 9.65%, effectively outside the range of opinion-change prizes, so I thought it worth clarifying in case I misunderstand.
Consider an experiment where you flip a fair coin A. If A is heads you flip a 99%-heads coin B; if A is tails you flip a 1%-heads coin B. We're interested in forming a subjective probability that B is heads.
The answer I find intuitive for p(B=heads) is 50%, which is achieved by taking the arithmetic average over worlds. The geometric average over worlds gives 9.9% instead, which doesn't seem like 'fair betting odds' for B being heads under any natural interpretation of those words. What's worse, the geometric-mean methodology suggests a 9.9% subjective probability of tails, and then p(H)+p(T) does not add to 1.
(If you're willing to accept probabilities that are 0 and 1, then an even starker experiment is given by a 1% chance to end up in a world with 0% risk and a 99% chance to end up in a world with 100% risk - the geometric mean is 0.)
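For concreteness, here is a quick Python sketch of that arithmetic (just an illustration of the coin example above, not anything from the essay's model):

```python
import math

# Two equally likely worlds, with p(B=heads) of 99% and 1% respectively.
world_p_heads = [0.99, 0.01]
n = len(world_p_heads)

arith = sum(world_p_heads) / n                                   # 0.5
geo_heads = math.prod(world_p_heads) ** (1 / n)                  # ~0.0995
geo_tails = math.prod(1 - p for p in world_p_heads) ** (1 / n)   # ~0.0995

print(arith, geo_heads, geo_tails)
print(geo_heads + geo_tails)          # ~0.199, so the geometric "probabilities" do not sum to 1

# The starker version: worlds with 0% and 100% risk.
print(math.prod([0.0, 1.0]) ** 0.5)   # 0.0, since any world with probability 0 collapses the geomean
```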
Footnote 9 of the post suggests that the operative meaning of 'fair betting odds' is sufficiently undefined by the prize announcement that perhaps it refers to a Brier-score bet, but I believe that it is clear from the prize announcement that a 1-vs-X bet is the kind under consideration. The prize announcement's footnote 1 says "We will pose many of these beliefs in terms of <u>subjective probabilities, which represent betting odds</u> that we consider fair in the sense that we'd be roughly indifferent between betting in favor of the relevant propositions <u>at those odds</u> or betting against them."
I don't know of a natural meaning of 'bet in favor of P at 97:3 odds' other than 'bet to win $97N if P and lose $3N if not P', which the bettor should be indifferent about exactly when p(P) × $97N = (1 - p(P)) × $3N. Is there some other bet that you believe 'bet in favor of P at odds of X:Y' could mean? In particular, is there a meaning which would support forming odds (and subjective probability) according to a geometric mean over worlds?
(I work at the FTX Foundation, but have no connection to the prizes or their judging, and my question-asking here is as an EA Forum user, not in any capacity connected to the prizes.)
Hmm I accidentally deleted a comment earlier, but roughly:
I think there are decent theoretical and empirical arguments for having a prior where you should be using geometric mean of odds over arithmetic mean of probabilities when aggregating forecasts. Jaime has a primer here. However there was some pushback in the comments, especially by Toby Ord. My general takeaway is that geometric mean of odds is a good default when aggregating forecasts by epistemic peers, but there are a number of exceptions where some other aggregation schema is better.
Arguably Froolow's data (which is essentially a glorified survey of rationalists) is closer to a situation where we want to aggregate forecasts than a situation where we have 'objective' probabilities over probabilities (as in your coin example).
So I can see why they used geometric mean as a default, though I think they vastly exaggerated the confidence that we should have in that being the correct modeling decision.
I also don't quite understand why they used geometric mean of probabilities rather than geometric mean of odds.
This comment is exactly right, although it seems I came across more strongly on the point about geometric mean of odds than I intended to. I wanted to say basically exactly what you did in this comment - there are relatively sound reasons to treat geometric mean of odds as the default in this case, but that there was a reasonable argument for simple means too. For example see footnotes 7 and 9 where I make this point. What I wanted to get across was that the argument about simple means vs geometric mean of odds was likely not the most productive argument to be having - point estimates always (necessarily) summarise the underlying distribution of data, and it is dangerous to merely use summary statistics when the distribution itself contains interesting and actionable information.
Just for clarity - I use geometric mean of odds, which I then convert back into probability as an additional step (because people are more familiar with probability than odds). If I said anywhere that I took the geometric mean of probabilities then this is a typo and I will correct it!
> What I wanted to get across was that the argument about simple means vs geometric mean of odds was likely not the most productive argument to be having - point estimates always (necessarily) summarise the underlying distribution of data, and it is dangerous to merely use summary statistics when the distribution itself contains interesting and actionable information.
I agree about this in general but I'm skeptical about treating distributions of probabilities the same way we treat distributions of quantities.
Perhaps more importantly, I assumed that the FTX FF got their numbers for reasons other than deferring to the forecasts of random rationalists. If I'm correct, this leads me to think that sophisticated statistics on top of the forecasts of random rationalists is unlikely to change their minds.
> Just for clarity - I use geometric mean of odds, which I then convert back into probability as an additional step (because people are more familiar with probability than odds). If I said anywhere that I took the geometric mean of probabilities then this is a typo and I will correct it!
Thanks! This is my fault for commenting before checking the math! However, I think you could've emphasized what you actually did more. You did not say 'geometric mean of probabilities.' But you also did not say 'geometric mean of odds' anywhere except in footnote 7 and this comment. In the main text, you only said 'geometric mean', and the word 'probability' was frequently in the surrounding text.
I think that's a fair criticism. For all I know, the FF are not at all uncertain about their estimates (or at least not uncertain over order-of-magnitude) and so the SDO mechanism doesn't come into play. I still think there is value in explicitly and systematically considering uncertainty, even if you end up concluding it doesn't really matter for your specific beliefs - if only because you can't be totally confident it doesn't matter until you have actually done the maths.
I've updated the text to replace 'geometric mean' with 'geometric mean of odds' everywhere it occurs. Thanks so much for the close reading and spotting the error.
> I've updated the text to replace 'geometric mean' with 'geometric mean of odds' everywhere it occurs. Thanks so much for the close reading and spotting the error.
Thanks! Though it's not so much an error as just moderately confusing communication.
As you probably already know, I think one advantage of geometric mean of odds over geometric mean of probabilities is that it directly addresses one of Ross's objections:
> Consider an experiment where you flip a fair coin A. If A is heads you flip a 99%-heads coin B; if A is tails you flip a 1%-heads coin B. We're interested in forming a subjective probability that B is heads.
> The answer I find intuitive for p(B=heads) is 50%, which is achieved by taking the arithmetic average over worlds. The geometric average over worlds gives 9.9% instead, which doesn't seem like 'fair betting odds' for B being heads under any natural interpretation of those words. What's worse, the geometric-mean methodology suggests a 9.9% subjective probability of tails, and then p(H)+p(T) does not add to 1.
Geomean of odds of 99% heads and 1% heads is
sqrt(99 × 1) : sqrt(1 × 99) = 1:1 = 50%
More generally, geomean of X:Y and Y:X is 50%, and geomean of odds is equally sensitive to outlier probabilities in both directions (whereas geomean of probabilities is only sensitive to outlying low probabilities).
I agree that geomean-of-odds performs better than geomean-of-probs!

I still think it has issues for converting your beliefs to actions, but I collected that discussion under a cousin comment here: https://forum.effectivealtruism.org/posts/Z7r83zrSXcis6ymKo/dissolving-ai-risk-parameter-uncertainty-in-ai-future?commentId=9LxG3WDa4QkLhT36r
I think there are good reasons for preferring geometric mean of odds to simple mean when presenting data of this type, but not good enough that I'd take to the barricades over them. Linch (below) links to the same post I do in giving my reasons to believe this. Overall, however, this is an essay about distributions rather than point estimates, so if your main objection is to the summary statistic I used then I think we agree on the material points, but have a disagreement about how the work should be presented.
On the point about betting odds, I note that the contest announcement also states 'Applicants need not agree with or use our same conception of probability'. I think the way in which I actually disagree with the Future Fund is more radical than simple means vs geometric mean of odds - I think they ought to stop putting so much emphasis on summary statistics altogether.
Thanks for clarifying 'geomean of probabilities' versus 'geomean of odds' elsethread. I agree that that resolves some (but not all) of my concerns with geomeaning.
> I think the way in which I actually disagree with the Future Fund is more radical than simple means vs geometric mean of odds - I think they ought to stop putting so much emphasis on summary statistics altogether.
I agree with your pro-distribution position here, but I think you will be pleasantly surprised by how much reasoning over distributions goes into cost-benefit estimates at the Future Fund. This claim is based on nonpublic information, though, as those estimates have not yet been put up for public discussion. I will suggest, though, that it's not an accident that Leopold Aschenbrenner is talking with QURI about improvements to Squiggle: https://github.com/quantified-uncertainty/squiggle/discussions
So my subjective take is that if the true issue is 'you should reason over distributions of core parameters', then in fact there's little disagreement between you and the FF judges (which is good!), but it all adds up to normality (which is bad for the claim 'moving to reasoning over distributions should move your subjective probabilities').
If we're focusing on the Worldview Prize question as posed ('should these probability estimates change?'), then I think the geo-vs-arith difference is totally cruxy - note that the arithmetic summary of your results (9.65%) is in line with the product of the baseline subjective probabilities for the prize (something like 3% for loss-of-control x-risk before 2043; something like 9% before 2100).
I do think it's reasonable to critique the fact that those point probabilities are presented without any indication that the path of reasoning goes through reasoning over distributions, though. So I personally am happy with this post calling attention to distributional reasoning, since it's unclear in this case whether that is an update. I just don't expect it to win the prizes for changing estimates.
Because I do think distributional reasoning is important, though, I do want to zoom in on the arith-vs-geo question (which I think, on reflection, is subtler than the position I took in my top-level comment). Rather than being a minor detail, I think this is important because it influences whether greater uncertainty tends to raise or lower our 'fair betting odds' (which, at the end of the day, are the numbers that matter for how the FF decides to spend money).
I agree with Jaime and you and Linch that when pooling forecasts, it's reasonable (maybe optimal? maybe not?) to use geomeans. So if you're pooling expert forecasts of {1:1000, 1:100, 1:10}, you might have a subjective belief of something like "1:100, but with a 'standard deviation' of 6.5x to either side". This is lower than the arithmean-pooled summary stats, and I think that's directionally right.
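As a rough illustration of the gap, here is a Python sketch using the three forecasts above (purely illustrative, not anything from the post's model):

```python
import math

odds = [1/1000, 1/100, 1/10]                   # the three expert forecasts, as odds
probs = [o / (1 + o) for o in odds]            # the same forecasts, as probabilities

geo_odds = math.prod(odds) ** (1 / len(odds))  # 0.01, i.e. pooled odds of 1:100
geo_as_prob = geo_odds / (1 + geo_odds)        # ~0.0099
arith_prob = sum(probs) / len(probs)           # ~0.034

print(geo_as_prob, arith_prob)                 # geomean-of-odds pooling sits well below arithmean pooling
```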
I think this is an importantly different question from 'how should you act when your subjective belief is a distribution like that'. I think that if you have a subjective belief like "1%, but with a 'standard deviation' of 6.5x to either side", you should push a button that gives you $98.8 if you're right and loses $1.2 if you're wrong. In particular, I think you should take the arithmean over your subjective distribution of beliefs (here, ~1.4%) and take bets that are good relative to that number. This will lead to decision-relevant effective probabilities that are higher than geomean-pooled point estimates (for small probabilities).
If you're combining multiple case parameters multiplicatively, then the arith>geo effect compounds as you introduce uncertainty in more places - if the quantity of interest is x*y, where x and y each had expert estimates of {1:1000, 1:100, 1:10} that we assume independent, then arithmean(x*y) is about twice geomean(x*y). Here's a quick Squiggle showing what I mean: https://www.squiggle-language.com/playground/#code=eNqrVirOyC8PLs3NTSyqVLIqKSpN1QELuaZkluQXwUQy8zJLMhNzggtLM9PTc1KDS4oy89KVrJQqFGwVcvLT8%2FKLchNzNIAsDQM9A0NNHQ0jfWPNOAM9U82YvJi8SqJUVQFVVShoKVQCsaGBQUyeUi0A3tIyEg%3D%3D
For this use-case (e.g., 'what bets should we make with our money'), I'd argue that you need to use a point estimate to decide what bets to make, and that you should make that point estimate by (1) geomean-pooling raw estimates of parameters, (2) reasoning over distributions of all parameters, then (3) taking arithmean of the resulting distribution-over-probabilities and (4) acting according to that mean probability.
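In code, that recipe might look something like the following sketch; the two parameters, their forecasts, and the lognormal spread are all made up for illustration, and it is not the model from the post:

```python
import math
import random

random.seed(0)

def geomean(xs):
    return math.prod(xs) ** (1 / len(xs))

# Hypothetical expert forecasts (as odds) for two independent conditional probabilities.
expert_odds = {
    "p_step1": [1/1000, 1/100, 1/10],
    "p_step2": [1/30, 1/10, 1/3],
}

# (1) geomean-pool the raw odds for each parameter.
pooled = {name: geomean(odds) for name, odds in expert_odds.items()}

# (2) reason over distributions: here, a lognormal over odds around each pooled value.
def sample_prob(centre_odds, sigma=1.0):
    o = math.exp(random.gauss(math.log(centre_odds), sigma))
    return o / (1 + o)

totals = [sample_prob(pooled["p_step1"]) * sample_prob(pooled["p_step2"])
          for _ in range(100_000)]

# (3) take the arithmean of the resulting distribution over total probabilities...
mean_p = sum(totals) / len(totals)

# (4) ...and act (bet) according to that mean probability.
print(mean_p)
```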
In the case of the Worldview Prize, my interpretation is that the prize is described and judged in terms of (3), because that is the most directly valuable thing in terms of producing better (4)s.
An explicit case where I think it's important to arithmean over your subjective distribution of beliefs:
coin A is fair
coin B is either 2% heads or 98% heads, you don't know
you lose if either comes up tails.
So your p(win) is 'either 1% or 49%'.
I claim the FF should push the button that pays us $80 if win, -$20 if lose, and in general make action decisions consistent with a point estimate of 25%. (I'm ignoring here the opportunity to seek value of information, which could be significant!)
It's important not to use geomean-of-odds to produce your actions in this scenario; that gives you odds of roughly 1:10 (about a 9% probability), and would imply you should avoid the +$80;-$20 button, which I claim is the wrong choice.
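A quick Python check of that case (just the arithmetic above, spelled out):

```python
import math

p_win_worlds = [0.5 * 0.02, 0.5 * 0.98]    # p(win) is 1% or 49%, each with 50% credence

arith = sum(p_win_worlds) / 2               # 0.25
odds = [p / (1 - p) for p in p_win_worlds]
geo_odds = math.prod(odds) ** 0.5           # ~0.0985, i.e. odds of roughly 1:10
geo_prob = geo_odds / (1 + geo_odds)        # ~0.09

def button_ev(p):                           # +$80 if win, -$20 if lose
    return 80 * p - 20 * (1 - p)

print(button_ev(arith))                     # +5.0: acting on the arithmean says push the button
print(button_ev(geo_prob))                  # ~-11: acting on geomean-of-odds says avoid it
```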
I agree that the arith-vs-geo question is basically the crux when it comes to whether this essay should move FF's 'fair betting probabilities' - it sounds like everyone is pretty happy with the point about distributions and I'm really pleased about that because it was the main point I was trying to get across. I'm even more pleased that there is background work going on in the analysis of uncertainty space, because that's an area where public statements by AI Risk organisations have sometimes lagged behind the state of the art in other risk management applications.
With respect to the crux, I hate to say it - because I'd love to be able to make as robust a claim for the prize as possible - but I'm not sure there is a principled reason for using geomean over arithmean for this application (or vice versa). The way I view it, they are both just snapshots of what is 'really' going on, which is the full distribution of possible outcomes given in the graphs / model. By analogy, I would be very suspicious of someone who always argued the arithmean would be a better estimate of central tendency than the median for every dataset / use case! I agree with you that the problem of which is best for this particular dataset / use case is subtle, and I think I would characterise it as being a question of whether my manipulations of people's forecasts have retained some essential 'forecast-y' characteristic which means geomean is more appropriate for various features it has, or whether they have been processed into having some sort of 'outcome-y' characteristic in which case arithmean is more appropriate. I take your point below in the coin example and the obvious superiority of arithmeans for that application, but my interpretation is that the FF didn't intend for the 'fair betting odds' position to limit discussion about alternate ways to think about probabilities ('Applicants need not agree with or use our same conception of probability').
However, to be absolutely clear, even if geomean were the right measure of central tendency I wouldn't expect the judges to pay that particular attention - if all I had done was find a novel way of averaging results then my argument would basically be mathematical sophistry, perhaps only one step better than simply redefining 'AI Risk' until I got a result I liked. I think the distribution point is the actually valuable part of the essay, and I'm quite explicit in the essay that neither geomean nor arithmean is a good substitute for the full distribution. While I would obviously be delighted if I could also convince you my weak preference for geomean as a summary statistic was actually robust and considered, I'm actually not especially wedded to the argument for one summary statistic over the other. I did realise after I got my results that the crux for moving probabilities was going to be a very dry debate about different measures of central tendency, but I figured since the Fund was interested in essays on the theme of 'a bunch of this AI stuff is basically right, but we should be focusing on entirely different aspects of the problem' (even if they aren't being strictly solicited for the prize) the distribution bit of the essay might find a readership there anyway.
By the way, I know your four-step argument is intended just as a sketch of why you prefer arithmean for this application, but I do want to just flag up that I think it goes wrong on step 4, because acting according to arithmean probability (or geomean, for that matter) throws away information about distributions. As I mention here and elsewhere, I think the distribution issue is far more important than the geo-vs-arith issue, so while I don't really feel strongly if I lose the prize because the judges don't share my intuition that geomean is a slightly better measure of central tendency, I would be sad to miss out because the distribution point was misunderstood! I describe in Section 5.2.2 how the distribution implied by my model would quite radically change some funding decisions, probably by more than an argument taking the arithmean to 3% (of course, if you're already working on distribution issues then you've probably already reached those conclusions and so I won't be changing your mind by making them - but in terms of publicly available arguments about AI Risk I'd defend the case that the distribution issue implies more radical redistribution of funds than changing the arithmean to 1.6%). So I think 'act according to that mean probability' is wrong for many important decisions you might want to take - analogous to buying a lot of trousers with 1.97 legs in my example in the essay. No additional comment if that is what you meant though and were just using shorthand for that position.
> I'd argue that you need to use a point estimate to decide what bets to make, and that you should make that point estimate by (1) geomean-pooling raw estimates of parameters, (2) reasoning over distributions of all parameters, then (3) taking arithmean of the resulting distribution-over-probabilities and (4) acting according to that mean probability.
> I think 'act according to that mean probability' is wrong for many important decisions you might want to take - analogous to buying a lot of trousers with 1.97 legs in my example in the essay. No additional comment if that is what you meant though and were just using shorthand for that position.
Clarifying, I do agree that there are some situations where you need something other than a subjective p(risk) to compare EV(value|action A) with EV(value|action B). I don't actually know how to construct a clear analogy from the 1.97-legged trousers example if the variable we're averaging over is probabilities (though I agree that there are non-analogous examples; VOI for example).
I'll go further, though, and claim that what really matters is what worlds the risk is distributed over, and that expanding the point-estimate probability to a distribution of probabilities, by itself, doesn't add any real value. If it is to be a valuable exercise, you have to be careful what you're expanding and what you're refusing to expand.
More concretely, you want to be expanding over things your intervention won't control, and then asking about your intervention's effect at each point in things-you-won't-control-space, then integrating back together. If you expand over just any axis of uncertainty, then not only is there a multiplicity of valid expansions, but the natural interpretation will be misleading.
For example, say we have a 10% chance of drawing a dangerous ball from a series of urns, and a 90% chance of drawing a safe one. If we describe it as (1) '50% chance of 9.9% risk, 50% chance of 10.1% risk' or (2) '50% chance of 19% risk, 50% chance of 1% risk' or (3) '10% chance of 99.1% risk, 90% chance of 0.1% risk', how does it change our opinion of <intervention A>? (You can, of course, construct a two-step ball-drawing procedure that produces any of these distributions-over-probabilities.)
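All three descriptions are consistent with the same overall 10%, which a few lines of Python confirm (purely illustrative):

```python
decompositions = {
    "(1)": [(0.5, 0.099), (0.5, 0.101)],
    "(2)": [(0.5, 0.19), (0.5, 0.01)],
    "(3)": [(0.1, 0.991), (0.9, 0.001)],
}
for name, worlds in decompositions.items():
    # weighted average of the per-world risks
    print(name, sum(weight * risk for weight, risk in worlds))   # 0.1 in every case
```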
I think the natural intuition is that interventions are best in (2), because most probabilities of risk are middle-ish, and worst in (3), because probability of risk is near-determined. And this, I think, is analogous to the argument of the post that anti-AI-risk interventions are less valuable than the point-estimate probability would indicate.
But that argument assumes (and requires) that our interventions can only change the second ball-drawing step, and not the first. So using that argument requires that, in the first place, we sliced the distribution up over things we couldn't control. (If the first step is the thing we can control with our intervention, then interventions are best in the world of (3).)
Back to the argument of the original post: You're deriving a distribution over several p(X|Y) parameters from expert surveys, and so the bottom-line distribution over total probabilities reflects the uncertainty in experts' opinions on those conditional probabilities. Is it right to model our potential interventions as influencing the resolution of particular p(X|Y) rolls, or as influencing the distribution of p(X|Y) at a particular stage?
I claim it's possible to argue either side.
Maybe a question like 'p(much harder to build aligned than misaligned AGI | strong incentives to build AGI systems)' (the second survey question) is split between a quarter of the experts saying ~0% and three-quarters of the experts saying ~100%. (This extremizes the example, to sharpen the hypothetical analysis.) We interpret this as saying there's a one-quarter chance we're ~perfectly safe and a three-quarters chance that it's hopeless to develop an aligned AGI instead of a misaligned one.
If we interpret that as if God will roll a die and put us in the 'much harder' world with three-quarters probability and the 'not much harder' world with one-quarter probability, then maybe our work to increase the chance we get an aligned AGI is low-value, because it's unlikely to move either the ~0% or ~100% much lower (and we can't change the die). If this were the only stage, then maybe all of working on AGI risk is worthless.
But 'three-quarters chance it's hopeless' is also consistent with a scenario where there's a three-quarters chance that AGI development will be available to anyone, and many low-resourced actors will not have alignment teams and will find it ~impossible to develop with alignment, but a one-quarter chance that AGI development will be available only to well-resourced actors, who will find it trivial to add on an alignment team and develop alignment. But then working on AGI risk might not be worthless, since we can work on increasing the chance that AGI development is only available to actors with alignment teams.
I claim that it isn't clear, from the survey results, whether the distribution of experts' probabilities at each step reflects something more like the God-rolls-a-die model, or different opinions about the default path of a thing we can intervene on. And if that's not clear, then it's not clear what to do with the distribution-over-probabilities from the main results. Probably they're a step forward in our collective understanding, but I don't think you can conclude from the high chances of low risk that there's a low value to working on risk mitigation.