If you’re talking to two people, one with a small cut and another with multiple sclerosis, everyone present will agree that having multiple sclerosis is much worse. If you offered the two of them some magic option that would restore exactly one of them to full health, they would probably be able to agree on who should get it. But in general, how should we compare across people to figure out whose situation is worse, who would benefit more from treatment, and who, everything else being equal, should be treated first? This example was easy because the difference was nice and large, but what do we do in harder cases?

One approach is to ask people questions like “if there were a surgery that could restore you to full health (without improving your lifespan) but had a 20% chance of killing you, would you take it?” If they say “yes” then this indicates that this disability, for this person, is more than 20% as bad as being dead. Ask these “standard gamble” questions to a lot of people with a lot of disabilities, varying the percentages, and you could build up a list of how bad different ones are, all on a common scale.
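To make the arithmetic concrete, here is a minimal sketch of how one person's yes/no answers could bracket a weight. The function name and the respondent's answers are made up for illustration; they don't come from any actual survey.

```python
def standard_gamble_bounds(answers):
    """Bracket a disability weight from yes/no standard gamble answers.

    answers: list of (risk_of_death, accepted) pairs, with risk in [0, 1].
    Accepting a gamble with risk p suggests the weight is above p;
    refusing suggests it is below p. Returns (lower, upper) on a scale
    where 0 is full health and 1 is death.
    """
    accepted = [p for p, yes in answers if yes]
    refused = [p for p, yes in answers if not yes]
    lower = max(accepted, default=0.0)
    upper = min(refused, default=1.0)
    if lower > upper:
        raise ValueError("answers are inconsistent")
    return lower, upper


# Hypothetical respondent: takes the surgery at 5% and 20% risk,
# refuses it at 40% and 60%.
answers = [(0.05, True), (0.20, True), (0.40, False), (0.60, False)]
bounds = standard_gamble_bounds(answers)
print(bounds)
```

For this respondent the weight would land somewhere between 0.2 and 0.4; asking at more percentages narrows the bracket.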

This would be useful for balancing projects against each other, figuring out what to focus on, and generally setting funding priorities. Unfortunately people are really bad at answering questions like this. Mostly we’re just bad at thinking about percentages and chances of bad things happening, but you also have to wonder about the problems of asking someone with, say, “Schizophrenia: acute state” to answer this sort of question.

You could fix this by asking people about “time tradeoffs”. For example, you could ask someone with a disability whether they would take a medicine that would restore them to full health for a year even if it took two years off their life. Alternatively you can ask people, generally public health professionals, which they would choose if given the choice between curing 1000 people with disability X and 2000 people with disability Y. These “person tradeoffs”, like time tradeoffs, get us out of needing to ask about probabilities, which means we can probably trust the numbers more, but we’re stuck either with collecting data from people with the disabilities in question (hard work, and the disability may affect mental function) or trusting that public health experts fully understand what it’s like to have different disabilities (seems unlikely). And even if we did decide that we were only going to collect data from people actually affected by a disability, remember that in many cases they can’t actually give the comparison we want because they haven’t experienced both having and not having the disability (e.g. blindness from birth).
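Here is a rough sketch of how answers to these two kinds of question could turn into numbers. Both formulas are assumptions on my part: the person tradeoff one treats total benefit as (people cured) × (weight), and the time tradeoff one is the usual textbook form (trade a longer life with the condition for a shorter one in full health), which is phrased a bit differently from the medicine example above. The function names are hypothetical.

```python
from fractions import Fraction


def person_tradeoff_ratio(n_x, n_y):
    """How many times worse X is than Y, given indifference between
    curing n_x people with X and n_y people with Y.

    Assumes benefit = people cured * weight, so indifference means
    n_x * w_x == n_y * w_y, and therefore w_x / w_y == n_y / n_x.
    """
    return Fraction(n_y, n_x)


def time_tradeoff_weight(years_with_condition, years_healthy):
    """Weight implied by indifference between living years_with_condition
    with the disability and a shorter years_healthy in full health."""
    return 1 - Fraction(years_healthy, years_with_condition)


# Indifferent between curing 1000 people with X and 2000 people with Y.
ratio = person_tradeoff_ratio(1000, 2000)

# Indifferent between 10 years with the condition and 8 healthy years.
weight = time_tradeoff_weight(10, 8)
print(ratio, weight)
```

Note that the person tradeoff only gives a ratio between two conditions. To get absolute weights you still need at least one condition pinned to the 0-1 scale some other way.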

The first Global Burden of Disease Report (pdf) attempted to collect these weights for a large number of different diseases. They got a panel of public health experts to come to Geneva and, through discussion around “person tradeoffs”, came to consensus first on weights for 22 “indicator diseases”. Then they agreed on weights for the several hundred remaining diseases by comparing them to these anchor conditions.

You might worry that the final weights would be really strongly affected by the particular consensus the Geneva group happened to get for the 22 anchors, but they ran nine other attempts with different experts and the average of those attempts correlates pretty well with the Geneva results:

For the 2010 update to the Global Burden of Disease weights (pdf, also see the appendix pdf) they decided to take an entirely different approach. Instead of asking experts to figure out tradeoffs they asked lots of people in several countries (Indonesia, Peru, USA, Bangladesh, Tanzania, plus a ‘global’ internet survey) to do lots of comparisons where, given two people, they would say which one was healthier. Asking lots of regular people makes a lot of sense, and the question is much simpler to answer. On the other hand, while I’m not sure experts are all that good at estimating how bad it is to have various disabilities, I would expect regular people to be even worse at it. Still, there was at least pretty good correlation between countries:

But hold on: how did they turn a large number of responses where people said one disability was more or less healthy than another into weightings on a 0-1 scale where 0 is full health and 1 is death? It turns out that a quarter (n=4000) of the people who took the survey on the internet were also asked the “person tradeoff” style questions used in the 1990 version, which they called “population health equivalence questions”. So they first determined an ordering from most to least healthy using their large quantity of comparison data, and then used the tradeoff data to map this ordering onto the “0=healthy 1=death” line.
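To get a feel for this two-step idea, here is a toy sketch: order conditions by how often they were judged healthier, then pin a few of them to the 0-1 scale and interpolate the rest by rank. This is only an illustration under my own simplifying assumptions, not the statistical model the report used, and the condition names, comparisons, and anchor values are all invented.

```python
from collections import Counter


def order_by_comparisons(comparisons):
    """comparisons: list of (healthier, less_healthy) pairs.
    Returns conditions from most to least healthy, ranked by the share
    of their comparisons in which they were judged healthier."""
    wins, total = Counter(), Counter()
    for healthier, worse in comparisons:
        wins[healthier] += 1
        total[healthier] += 1
        total[worse] += 1
    return sorted(total, key=lambda c: wins[c] / total[c], reverse=True)


def anchor_ordering(ordering, anchors):
    """Map an ordering onto 0-1 weights by interpolating linearly over
    rank between anchored conditions. Full health (0) and death (1) are
    treated as implicit end points just outside the ordering."""
    points = [(-1, 0.0)]
    points += [(i, anchors[c]) for i, c in enumerate(ordering) if c in anchors]
    points.append((len(ordering), 1.0))
    weights = {}
    for (i0, w0), (i1, w1) in zip(points, points[1:]):
        for i in range(max(i0, 0), min(i1 + 1, len(ordering))):
            weights[ordering[i]] = w0 + (w1 - w0) * (i - i0) / (i1 - i0)
    return weights


# Invented comparison data: each pair is (judged healthier, judged less healthy).
comparisons = [
    ("mild", "moderate"), ("mild", "severe"), ("moderate", "severe"),
    ("mild", "moderate"), ("moderate", "severe"),
]
ordering = order_by_comparisons(comparisons)

# Invented anchors, standing in for the tradeoff-question results.
weights = anchor_ordering(ordering, {"mild": 0.1, "severe": 0.5})
print(ordering, weights)
```

The thing to notice is that the comparison data only ever decides the order. Every absolute number comes from the anchors.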

This means that when we look at the cross-country correlations above we’re only seeing their agreement on the relative ordering of conditions, not on the absolute differences. If people in Indonesia on average think that the worst disabilities are only 10% as bad as being dead while people in Peru think they’re 90% as bad, this wouldn’t keep them from having perfect correlation on a chart like this. Which is kind of a problem, because we need more than an ordering for prioritization.
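A quick way to convince yourself of this: take one set of relative severities, scale it so the worst condition is 10% as bad as death for one country and 90% for another, and compute the correlation. The numbers below are made up, and the countries are just labels.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical relative severities, with the worst condition at 1.0.
relative = [0.1, 0.3, 0.5, 0.8, 1.0]

country_a = [0.1 * r for r in relative]  # worst is 10% as bad as death
country_b = [0.9 * r for r in relative]  # worst is 90% as bad as death

r = pearson(country_a, country_b)
print(r)
```

The correlation comes out as 1 (up to floating point rounding) even though the two countries disagree by a factor of nine about every single condition.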

It turns out that this method of estimation actually gives pretty different results from the one used in the earlier version: [1]

Yes, there’s a correlation, but it’s pretty weak. And it’s probably not just about the fifteen years between when most of the first estimates were made and when most of the second were; these aren’t disabilities that are quickly changing. This indicates that what we’re trying to measure is just not that well captured by the measurements we’re making.

So, a summary. Getting good answers means asking people questions they’re not good at thinking about, or that they are good at thinking about but don’t have the right experience to be able to answer. It’s not too surprising, then, that the answers you get via different methods don’t agree very well. We do still need a rough way to say “benefit X to N people is better/worse than benefit Y to M people”, but trying to do this in the general case doesn’t seem to have worked out very well.

These disability weights are only one step in estimating $/DALY for various interventions, and the messiness here is in some ways much less than the messiness in the other steps. After reading about how these estimates came to be, I’m pretty glad GiveWell doesn’t put much trust in them in figuring out which charities to recommend:

The resources that have already been invested in these cost-effectiveness estimates are significant. Yet in our view, the estimates are still far too simplified, sensitive, and esoteric to be relied upon. If such a high level of financial and (especially) human-capital investment leaves us this far from having reliable estimates, it may be time to rethink the goal.

All that said, if this sort of analysis were the only way to figure out how to allocate resources for maximal impact, we’d be advocating for more investment in cost-effectiveness analysis and we’d be determined to “get it right”. But in our view, there are other ways of maximizing cost-effectiveness that can work better in this domain: in particular, making limited use of cost-effectiveness estimates while focusing on finding high-quality evidence.

This isn’t to say we should never use disability weights; even if they were just made up by one guy on the spot (and they’re better than that) this would probably still be better in some cases than refusing to make quantitative comparisons at all. Getting rough numbers like this is especially useful for avoiding scope insensitivity problems, where you might be comparing a large number of people with something minor against a small number with something major.
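As a toy example of the kind of comparison where even rough weights help, here is the multiplication spelled out. The head counts and weights are invented, and treating burden as simply people × weight is my simplifying assumption.

```python
def total_burden(people, weight):
    """Rough burden: number of people times a 0-1 disability weight."""
    return people * weight


# Invented numbers: many people with something minor,
# a few people with something major.
minor = total_burden(100_000, 0.01)
major = total_burden(500, 0.5)
print(minor, major)
```

With these made-up numbers the minor condition adds up to about four times the burden of the major one, which is not what most of us would guess by gut feeling.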

(I’d really like to look into the QALY numbers people use and how they get them. I believe the process is similar, but I’m not too sure.)

For curiosity, however, and with all that in mind, what are the actual numbers they found? Here are the 2010 weights:

[1] Technically these are the results from the 2004 update to the 1990 version, but when you look at where their estimates come from (pdf) you see that most just say they’re kept unchanged from the 1990 version.