Expected value estimates you can take (somewhat) literally


A geometric interpretation of tail divergence offers a pseudo-robust way of quantifying the expected ‘regression to the mean’ of a cost-effectiveness estimate (i.e. the expected value given the estimated value). With caveats, this seems superior to current attempts at guesstimation and ‘calling the bottom’ of anticipated regression to the mean. There are a few pleasing corollaries: 1) it vindicates a ‘regressive mindset’ (particularly championed by Givewell), and underscores the value of performing research to achieve more accurate estimation; 2) it suggests (and provides a framework for quantifying) how much an estimated expected value should be discounted for how speculative it is; 3) it suggests empirical approaches one can use to better quantify the degree of regression, using variance in expected-value estimates.

[Cross-posted here. I’m deeply uncertain about this, and wonder if this is an over-long elaboration on a very simple error. If so, please show me!]


Estimates of expected value regress to the mean: something that seems much more valuable than average will generally look not-quite-so-valuable when one has a better look. This problem is particularly acute for effective altruism and its focus on the highest expected value causes, as these are likely to regress the most. Quantifying how much one expects an estimate to regress is not straightforward, and this has in part led to groups like Givewell warning against explicit expected value estimates. More robust estimation has obvious value, especially when comparing between diverse cause areas, but worries about significant and differential over-estimation make these enterprises highly speculative.

‘In practice’, I have seen people offer discounts to try and account for this effect (e.g. “Givewell think AMF averts a DALY for $45-ish, so let’s say it’s $60 to try and account for this”), or mention it as a caveat without quantification. It remains very hard to get an intuitive handle on how big one should expect this effect to be, especially when looking at different causes: one might expect more speculative causes in fields like animal welfare or global catastrophic risks to ‘regress more’ than relatively well-studied ones in public health, but how much more?

I am unaware of any previous attempts to directly quantify this effect. I hope this essay can make some headway on the issue, and thus allow more trustworthy expected value estimates.

A geometric explanation of regression to the mean

(A related post here)

Why is there regression to the mean for expected value calculations? For a toy model, pretend there are some well-behaved, normally distributed estimates of expected value, and that the estimation technique correlates with the actual expected values at r = 0.9. A stylized scatter plot might look something like this, with a fairly tight ellipse covering the points.


Despite the estimates of expected value being closely correlated with the actual expected values, there remains some scatter, and so the tails come apart: the estimated-to-be-highest expected value things are unlikely to have the highest actual expected value (although it should still be pretty high). The better the estimate, the tighter the correlation, and the less the tails diverge:


Another way of representing what is going on is to use two vectors instead of Cartesian coordinates. If the set of estimates and the set of actual expected values are plotted as vectors in n-dimensional space (normalized to a mean of zero), the cosine of the angle between them is equal to the correlation coefficient r between them. [1] So our estimates, correlated with the actual cost-effectiveness values at r = 0.9, look like this:


The two vectors lie in a similar direction to one another, as the arccos of 0.9 is ~26 degrees (in accord with intuition, a correlation of 1, perfect correlation, gives parallel vectors; a correlation of zero, orthogonal vectors; and a correlation of −1, antiparallel vectors). It also supplies a way to see how one can estimate the ‘real’ expected value given an estimated expected value: just project the estimate vector onto the ‘actual’ value vector, and the rest is basic trigonometry. Multiply the hypotenuse (the estimate) by the cosine of the angle (the correlation) to arrive at the actual value. So, in our toy model with r = 0.9, an estimate which is 1 SD above the mean estimate puts the expected value at 0.9 SD above the mean expected value.
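
The projection can be checked numerically. A minimal sketch, assuming (as in the toy model) standardized variables and an assumed correlation of 0.9; the Monte Carlo check simply draws correlated pairs and averages the actual values near an estimate of +1 SD:

```python
import math
import random

random.seed(0)

r = 0.9  # assumed correlation between estimate and actual value
angle_deg = math.degrees(math.acos(r))  # ~26 degrees between the two vectors

def expected_actual_given_estimate(estimate_sd, r):
    """With both variables standardized, projecting the estimate onto the
    'actual value' direction gives E[actual | estimate] = r * estimate."""
    return r * estimate_sd

# Monte Carlo check: draw pairs with the assumed correlation, then look at
# the average actual value among estimates near +1 SD.
n = 200_000
noise = math.sqrt(1 - r * r)
pairs = [(e, r * e + noise * random.gauss(0, 1))
         for e in (random.gauss(0, 1) for _ in range(n))]
near_1sd = [a for e, a in pairs if 0.9 < e < 1.1]
avg_actual_near_1sd = sum(near_1sd) / len(near_1sd)  # sits near 0.9, as the geometry predicts
```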

There are further dividends to this conceptualization: it helps explain why regression to the mean happens across the entire range but is particularly noticeable at the tails: the absolute amount of regression to the mean (taken as the estimate minus the actual value given the estimate) grows linearly with the size of the estimate. [2] Reassuringly, it is order-preserving: although E > A|E if E is above the mean, A|E1 > A|E2 if E1 > E2, and so among a set of estimates of expected value, the highest estimate has the highest expected value.

Given this fairly neat way of getting from an estimate to the expected value given that estimate, can it be exploited to give estimates that ‘pre-empt’ their own regression?

A quick and fairly dirty attempt at ‘regression-proofing’ an estimate

A natural approach would be to try and estimate the expected value given an estimate by correcting for the correlation between the estimated expected value and the actual expected value. Although the geometry above was neat, there are several obstacles in the way of applying it to ‘real’ cost-effectiveness estimates. In ascending order of difficulty:

1) The variances of the two distributions need to be standardized.

Less of a big deal: we’d generally hope, and aim, for our estimates of expected value to be drawn from a similar distribution to the actual expected values; if they’re not, our estimates are systematically wrong somehow.

2) We need to know the means of the distributions to do the standardization; after all, if an intervention was estimated to be below the mean, we should anticipate it to regress upwards.

Trickier for an EA context, as the groups that do evaluation focus their efforts on what appear to be the most promising things, so there isn’t a clear handle on the ‘mean global health intervention’, which may be our distribution of interest. To some extent, though, this problem solves itself if the underlying distributions of interest are log-normal or similarly fat-tailed and you are confident your estimate lies far from the mean (whatever it is): log(X − something small) approximates to log(X).

3) Estimate errors should be at-least-vaguely symmetrical.

If you expect your estimates to be systematically ‘off’, this sort of correction has limited utility. That said, perhaps this is not as burdensome as it first appears: although we should expect our estimates (especially in more speculative fields) to be prone to systematic error, this can be modelled as symmetrical error if we are genuinely uncertain of what sign the systematic error should be. And, if you expect your estimate to be systematically erring one way or another, why haven’t you revised it accordingly?

4) How do you know the correlation between your estimates and the true values?

Generally, these will remain hard to estimate, especially so for more speculative causes, as the true expected values will depend on recondite issues in the speculative cause itself. Further, I don’t think many of us have a good ‘handle’ on what an r = 0.9 relationship looks like compared to an r of 0.5, and our adjustments (especially in fat-tailed distributions) will be highly sensitive to it. However, there is scope for improvement: one can use various approaches to benchmarking (if you think your estimator has an r of 0.9, you could ask yourself whether it really should track that much of the variance in expected value, and one could gain a better ‘feel’ for the strength of various correlations by seeing some representative examples, then use these to anchor an intuition about where the estimator should fall in comparison). Also, at least in cases with more estimates available, there would be pseudo-formal approaches to estimating the relevant angle between estimated and actual expected value (on which more later).

5) What’s the underlying distribution?

Without many estimates, this is hard, and there is a risk of devolving into reference class tennis with different offers as to what the underlying distributions ‘should’ be (broadly, fat-tailed distributions with low means ‘hurt’ an estimate more, and vice versa). This problem gets particularly acute when looking between cause areas, necessitating some shared distribution of actual expected values. I don’t have any particularly good answers to offer here. Ideas welcome!

With all these caveats duly caveated, let’s have a stab at estimating the ‘real’ expected value of malaria nets, given Givewell’s estimate: [3]

  1. (One of) Givewell’s old estimates for malaria nets was $45 per DALY averted. This rate is the ‘wrong way around’, as we are interested in Unit/Cost rather than Cost/Unit. So convert it to something like ‘DALYs averted per $100,000’: 2222 DALYs/$100,000. [4]

  2. I aver (and others agree) that developing world interventions are approximately log-normally distributed, with a correspondingly fat tail to the right. [5] So take logs to get to normality: 3.347.

  3. I think pretty highly of Givewell, and so I rate their estimates pretty highly too: say they correlate with the actual expected values at r = 0.9. So multiply by this factor: 3.012.

  4. Then we do everything in reverse, so the ‘revised’ expected value estimate is 1028 DALYs/$100,000, or, back in the headline figure, $97 per DALY averted.
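
The four steps above can be reproduced directly. A sketch, taking the post’s inputs as given ($45/DALY, an assumed correlation of 0.9, base-10 logs):

```python
import math

cost_per_daly = 45                          # Givewell's old AMF figure, $ per DALY
dalys_per_100k = 100_000 / cost_per_daly    # step 1: invert the rate -> ~2222 DALYs/$100,000
log_estimate = math.log10(dalys_per_100k)   # step 2: take logs (assumed log-normal) -> ~3.347
r = 0.9                                     # step 3: assumed correlation with the truth
log_regressed = r * log_estimate            #         multiply -> ~3.012
dalys_regressed = 10 ** log_regressed       # step 4: reverse the logs -> ~1028 DALYs/$100,000
cost_regressed = 100_000 / dalys_regressed  # back to the headline figure -> ~$97 per DALY
```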

Some middling reflections

Naturally, this ‘attempt’ is far from rigorous, and it may take intolerable technical liberties. That said, I ‘buy’ this bottom-line estimate more than my previous attempts to guesstimate the ‘post-regression’ cost-effectiveness: I had previously been worried I was being too optimistic, whereas now my fears about my estimate are equal and opposite: I wouldn’t be surprised if it was too optimistic, but I’d be similarly unsurprised if it was too pessimistic.

It also has some salutary lessons attached: the correction turns out to be greater than a factor of two, even when you think the estimate is really good. This vindicates my fear that we are prone to under-estimate the risk, and shows how strongly regression can bite on the right of a fat-tailed distribution.

The ‘post-regression’ estimate of expected value depends a lot on the correlation, the ‘angle’ between the two vectors (e.g., if you thought Givewell were ‘only’ as good as a 0.8 correlation to the real effect, then the revised estimate becomes $210 per DALY averted). It is unfortunate to have an estimation method sensitive to a variable that is hard for us to grasp. It suggests that although our central measure for expected value may be $97/DALY, this should have wide credence bounds.
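
The sensitivity can be made explicit by sweeping the assumed correlation; a sketch, using the same log-normal adjustment as the worked example:

```python
import math

def regressed_cost_per_daly(cost_per_daly, r, scale=100_000):
    """Regress a $/DALY estimate by assumed correlation r, under the
    log-normal model used in the worked example above."""
    log_estimate = math.log10(scale / cost_per_daly)
    return scale / 10 ** (r * log_estimate)

# sweep the assumed correlation for the $45/DALY estimate
sensitivity = {r: round(regressed_cost_per_daly(45, r)) for r in (1.0, 0.9, 0.8, 0.5)}
# a perfect estimator (r = 1.0) keeps $45; r = 0.9 gives ~$97; r = 0.8 ~$210
```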

Comparing estimates of differing quality

One may be interested not in the absolute expected value of something, but in how it compares to something else. This problem gets harder when comparing across diverse cause areas: one intuition may be that a more speculative estimate should be discounted when ‘up against’ a more robust estimate when comparing two causes or two interventions. A common venue where I see this is comparing interventions between human and animal welfare. A common line of argument is something like:

For $1000 you might buy 20 or so QALYs, or maybe save a life (though that is optimistic) if you gave to AMF; in comparison, our best guess for an animal charity like The Humane League is it might save 3400 animals from a life in agriculture. Isn’t that even better?

Bracketing all the considerable uncertainties about the right ‘trade-off’ between animal and human welfare, anti- or pro-natalist concerns, trying to cash out ‘a life in agriculture’ in QALY terms, etc., there’s an underlying worry that ACE’s estimate will be more speculative than Givewell’s, and so their bottom-line figures should be adjusted downwards compared to Givewell’s, as they have ‘further to regress’. I think this is correct:


Suppose there’s a more speculative source of expected value estimates, and you want to compare these to your current batch of higher-quality expected value estimates. Say you think the speculative estimates correlate with real expected value at r = 0.4: they leave most of the variance unaccounted for. Geometrically, this means your estimates and the ‘actual’ values diverge quite a lot (the angle is around 66 degrees), and as a consequence your expected value given the estimate is significantly discounted. This can mean an intervention with a high (but speculative) estimate still has a lower expected value than an intervention with a lower but more reliable estimate. (Plotted above: E2 > E1, but A2|E2 < A1|E1.)

This correction wouldn’t be a big deal (it’s only going to be within an order of magnitude) except that we generally think the distributions are fat-tailed, and so these vectors are on something like a log scale. Going back to the previous case, if you thought that Animal Charity Evaluators’ estimates were ‘only’ correlated with real effects at r = 0.4, your ‘post-regression’ estimate of how good The Humane League is should be around 25 animals saved per $1000, a reduction of two orders of magnitude.
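
The two-orders-of-magnitude drop can be checked the same way, taking the 3400 animals per $1000 figure and the assumed 0.4 correlation from above:

```python
import math

animals_per_1000 = 3400   # the speculative estimate, animals spared per $1,000
r = 0.4                   # assumed correlation of the estimate with the true effect
regressed = 10 ** (r * math.log10(animals_per_1000))
# ~26 animals per $1,000: roughly two orders of magnitude below the raw estimate
```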

Another fast and loose example would be something like this: suppose you are considering donating to a non-profit that wants to build a friendly AI. Your estimate looks like it would be very good, but you know it is highly speculative, so estimates of its value might only be very weakly correlated with the truth (maybe r = 0.1). How would it stack up against the AMF estimate made earlier? If you take Givewell to correlate much better with the truth (r = 0.9, as earlier), then you can run the mathematics in reverse to see how ‘good’ your estimate of the AI charity has to be to still ‘beat’ Givewell’s top charity after regression. The estimate for the friendly AI charity would need to be something greater than 10^25 DALYs averted per dollar donated for you to believe it a ‘better deal’ than AMF, at least when Givewell were still recommending it. [6]
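
Running the mathematics in reverse, as described: the speculative estimate must satisfy r_speculative · log(E) ≥ r_givewell · log(E_AMF). A sketch under the stated assumptions (r = 0.1 versus 0.9, the $45/DALY AMF figure):

```python
import math

r_givewell, r_speculative = 0.9, 0.1    # assumed correlations with the truth
amf_log = math.log10(100_000 / 45)      # AMF estimate, in log DALYs per $100,000
amf_regressed_log = r_givewell * amf_log             # ~3.01 after regression

# smallest (log) estimate that still matches AMF once its own, much harsher,
# regression is applied: solve r_speculative * x = amf_regressed_log
breakeven_log_per_100k = amf_regressed_log / r_speculative   # ~30.1
breakeven_log_per_dollar = breakeven_log_per_100k - 5        # ~25.1, i.e. ~10^25 DALYs per $
```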

A vindication of the regressive mindset

Given the above, regression to the mean, particularly on the sort of distributions Effective Altruists are dealing with, looks significant. If you are like me, you might be surprised at how large the downward adjustment is, and how much bigger it gets as your means of estimation get more speculative.

This vindicates the ‘regressive mindset’, championed most prominently by Givewell: not only should we not take impressive-sounding expected value estimates literally, we often should adjust them downwards by orders of magnitude. [7] Insofar as you are surprised by the size of the correction required in anticipation of regression to the mean, you should be less confident in more speculative causes (in a manner commensurate with their speculativeness), you should place a greater emphasis on robustness and reliability of estimation, and you should be more bullish about the value of cause prioritization research: improving the correlation of our estimates with actuality can massively increase the expected value of the best causes or interventions subsequently identified. [8] Givewell seem ‘ahead of the curve’ here, for which they deserve significant credit: much of this post can be taken as a more mathematical recapitulation of principles they already espouse. [9]

Towards more effective cost-effectiveness estimates

These methods are not without problems: principally, a lot depends on the estimate of the correlation, and estimating this is difficult and could easily serve as a fig leaf to cover prior prejudice (e.g. I like animals but dislike far future causes, so I’m bullish on how good we are at animal cause evaluation but talk down our ability to estimate far future causes, possibly giving orders of magnitude of advantage to my pet causes over others). Pseudo-rigorous methods can supply us a false sense of security, both in our estimates and in our estimation of our own abilities.

In this case I believe the dividends outweigh the costs. The approach underscores how big an effect regression to the mean can be, and (if you are anything like me) prompts you to be more pessimistic when confronted with promising estimates. It appears to offer a more robust framework to allow our intuitions on these recondite matters to approach reflective equilibrium, and to better entrain them to the external world: the correlations I offered in the examples above are in fact my own estimates, and on realizing their ramifications my outlook on animal and targeted far future interventions has gone down, but my estimate of the value of research has gone up. [10] If you disagree with me, we could perhaps point to a more precise source of our disagreement (maybe I over-egg how good Givewell is, but under-egg groups like ACE), and although much of our judgement of these matters would still rely on intuitive judgement calls, there’s greater access for data to change our minds (maybe you could show me some representative associations with correlations of 0.9 and 0.4 respectively, and suggest these are too strong or too weak to be analogous to the performance of Givewell and ACE). Prior to the mathematics, it is not clear what we could appeal to if you held that ACE should ‘regress down’ by ~10% and I held it should regress down by >90%. [11]

Besides looking for ‘analogous data’ (quantitative records of the predictions of various bodies would be an obvious, and useful, place to start), there are well-worn frequentist methods to estimate error that could be applied here, albeit with caveats: if there were multiple independent estimates of the same family of interventions, the correlation between the estimates could be used to predict how well they (or their mean) correlate with the ‘true’ effect sizes (perhaps the ‘repeat’ of the Disease Control Priorities project may provide such an opportunity, although I’d guess the estimates wouldn’t be truly independent, nor would what they are estimating be truly stationary). Similarly, but with greater difficulty, one could try to look at the track record of how far your estimates have previously regressed to anticipate how much future ones will. Perhaps these could be fertile avenues for further research. More qualitatively, it speaks in favour of multiple independent approaches to estimation, and so supports ‘cluster-esque’ thinking approaches.
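
One such pseudo-formal approach can be sketched: if two estimators are independent given the truth and each correlates with it at r, their mutual correlation is roughly r², so the square root of an observed inter-estimate correlation gives a guess at r without ever seeing the true values. A hypothetical simulation (the r = 0.8 and the normality assumptions are mine, for illustration):

```python
import math
import random

random.seed(1)

def corr(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

r_true = 0.8  # each estimator's (unknown) correlation with the truth
noise = math.sqrt(1 - r_true ** 2)
truth = [random.gauss(0, 1) for _ in range(50_000)]
est_a = [r_true * t + noise * random.gauss(0, 1) for t in truth]
est_b = [r_true * t + noise * random.gauss(0, 1) for t in truth]

inter = corr(est_a, est_b)     # ~ r_true ** 2 = 0.64
inferred_r = math.sqrt(inter)  # ~ 0.8, recovered without using 'truth'
```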

[1] See here, or here.

[2] Again, recall that this is with the means taken as zero; consequently, estimates below the mean will regress upwards.

[3] Given Givewell’s move away from recommending AMF, this ‘real’ expected value might be more like ‘expected value a few years ago’, depending on how much you agree with their reasoning.

[4] It is worth picking a big multiple, as it gets fiddly dealing with logs that are negative.

[5] There are further lines of supporting evidence: it looks like the factors that contribute to effectiveness multiply rather than add, and correspondingly errors in the estimate tend to range over multiples as well.

[6] This ignores worries about ‘what distribution’: we might think that the large uncertainties about poor meat-eating, broad-run convergence, the value of yet-to-exist beings, etc. mean both Givewell estimates and others are going to be prone to a lot of scatter; these considerations might comprise the bulk of the variance in expected value. You can probably take out sources of uncertainty that are symmetric when comparing between two estimators, but it can be difficult to give ‘fair dealing’ to both, and not colour one’s estimates of estimate accuracy with sources of variance you include for one intervention but not another. Take this as an illustration rather than a careful attempt to evaluate directed far future interventions.

There is an interesting corollary here, though. You might take the future seriously (e.g. you’re a total utilitarian), but also hold there is some convergence between good things now and good things in the future thanks to flow-through effects. You may be uncertain about whether you should give to causes with proven direct impact now (and therefore likely good impact into the future thanks to flow-through) or to a cause with a speculative but very large potential upside across the future.

The foregoing suggests a useful heuristic for these decisions. If you hold there is broad-run convergence, then you could phrase it as something like, “Impact measured over the near term and total impact are positively correlated”. The crucial question is how strong this correlation is: roughly, if you think measured near-term impact correlates more strongly with total impact than your estimation of far future causes does, this weighs in favour of giving to the ‘good thing now’ cause over the ‘speculative large benefit in the future’ cause, and vice versa.

[7] This implies that the sort of ‘back of the envelope’ comparisons EAs are fond of making are probably worthless, and better refrained from.

[8] My hunch is this sort of surprise should also lead one to update towards ‘broad’ in the ‘broad versus targeted’ debate, but I’m not so confident about that.

[9] Holden Karnofsky has also offered a mathematical gloss to illustrate Givewell’s scepticism and regressive mindset (but he is keen to point out this is meant as illustration, that it is not Givewell’s ‘official’ take, and that Givewell’s perspective on these issues does not rely on the illustration he offers). Although I am not the best person to judge, and with all due respect meant, I believe this essay improves on his work:

The model by Karnofsky has the deeply implausible characteristic that your expected value starts to fall as your estimated expected value rises, so the mapping of likelihood onto posterior is no longer order-preserving: if A has a mean estimate of 10 units of value and B an estimate of 20, the model says you should think A has the higher expected value. The reason the model demonstrates this pathological feature is that it stipulates the SD of the estimate should always equal its mean. As a consequence, no matter how high the mean estimate, you should assign ~16% credence (the mass more than 1 SD below the mean) to the real value being negative; worse, as the mean estimate increases, lower percentiles (e.g. the ~−2 SD 5% confidence bound) continue to fall, so that whilst an intervention with a mean expected value of 5 units has <<1% of its probability mass at −10 or less, this ceases to be the case for an intervention with a mean expected value of 50 units, or 500. Another way of looking at what is going on: the steep increase in the SD means the estimate becomes less informative far more rapidly than its mean rises, and so the update ignores the new information more and more as the mean rises (given updating is commutative, it might be clearer the other way around: you have this very vague estimate, and then you attend to the new estimate with ‘only’ an SD of 1 compared to 500 or 1000 or so, and strongly update towards that).
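
The pathology can be exhibited directly. Under a standard normal-normal Bayesian update with a zero-mean prior (I take N(0, 1) purely for illustration; it is not Karnofsky’s exact setup) and an estimate whose SD is stipulated to equal its mean, the posterior mean is E·σ₀²/(σ₀² + E²), which rises and then falls as the estimate E grows:

```python
def posterior_mean(estimate, prior_sd=1.0):
    """Normal-normal Bayesian update with a zero-mean prior, where the
    estimate's SD is stipulated (as in the model criticized) to equal its mean."""
    prior_var = prior_sd ** 2
    est_var = estimate ** 2  # the SD-equals-mean stipulation
    return estimate * prior_var / (prior_var + est_var)

# the update is not order-preserving: a mean estimate of 10 out-scores one of 20
a, b = posterior_mean(10), posterior_mean(20)
```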

This seems both implausible on its face, and it doesn’t concord with our experience of how estimators work in the wild either. The motivating intuition (that the expected error/credence interval/whatever of an estimate should get bigger as the estimate grows) is better captured by using fatter-tailed distributions (e.g. the confidence interval grows in absolute terms with log-normal distributions).

I also think this is one case where the frequentist story gives an easier route to something robust and understandable than talk of Bayesian updating. It is harder to get ‘hard numbers’ (i.e. ‘so how much do I adjust downwards from the estimate?’) out of the Bayesian story than out of the geometric one provided above, and the importance (as discussed above) of multiple, preferably independent, estimates can be seen more easily, at least by my lights.

[10] These things might overlap: it may be that the best way of doing research in cause prioritization is trialing particular causes and seeing how they turn out, and, particularly in far future directed causes, the line between ‘intervention’ and ‘research’ could get fuzzy. However, it suggests that in many cases the primary value of these activities is the value of information gained, rather than the putative ‘direct’ impact (c.f. ‘Giving to Learn’).

[11] It also seems fertile ground for better analysing things like search strategy, and the trade-off between evaluating many things less accurately or fewer things more rigorously, in terms of finding the highest expected value causes.