But hold on: how did they turn a large number of responses, where people said one disability was more or less healthy than another, into weightings on a 0-1 scale where 0 is full health and 1 is death? It turns out that a quarter (n=4000) of the people who took the survey on the internet were also asked the “person tradeoff”-style questions used in the 1990 version, which they called “population health equivalence questions”. So they first determined an ordering from most to least healthy using their large quantity of comparison data, and then used the tradeoff data to map this ordering onto the “0=healthy, 1=death” line.
My understanding is that the large set of comparison data was used to construct a cardinal ordering (via some statistical technique based on the idea that if A is very often preferred to B then B must be substantially worse). The extra data from the population health equivalence questions was purely to work out how to scale this with respect to the 0=full health, 1=death endpoints.
Yes, I thought this too. Something like Elo scores, used to generate cardinal rankings for chess players from the ordinal data of their previous match results.
Elo scores are based on trying to model the probability that A will beat B. They assume every player has some ability, their performance in a game is normally distributed, the player with the higher performance in that game wins, and all players have the same standard deviation. So each player can be represented by a single number, the mean of their normal distribution. If we see A beat B 75% of the time and A beat C 95% of the time, then we can construct normal distributions that would lead to this outcome and predict what fraction of the time B would beat C.
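Here's a minimal sketch of that calculation in Python (the function name, the unit-variance convention, and the numbers are mine; scipy's norm supplies the normal CDF and its inverse):

```python
from scipy.stats import norm

# Model: each player's single-game performance is Normal(mu_i, sigma) with a
# shared sigma, so P(A beats B) = Phi((mu_A - mu_B) / (sigma * sqrt(2))).
# Working in units of sigma, an observed win rate pins down the mean gap:
ROOT2 = 2 ** 0.5  # std dev of the difference of two unit-variance normals

def mean_gap(p_win):
    """Mean-skill gap (in sigma units) implied by an observed win rate."""
    return norm.ppf(p_win) * ROOT2

gap_ab = mean_gap(0.75)  # A beats B 75% of the time -> mu_A - mu_B ~= 0.95
gap_ac = mean_gap(0.95)  # A beats C 95% of the time -> mu_A - mu_C ~= 2.33

# The gaps compose, which is what lets us predict an unseen matchup:
p_bc = norm.cdf((gap_ac - gap_ab) / ROOT2)
print(f"P(B beats C) ~= {p_bc:.2f}")  # ~= 0.83
```

The point is that the model turns win rates into distances on a single scale, and distances compose, which is what makes the unseen B-vs-C prediction possible.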
Trying to apply this to disease weightings means modeling each disease as having some underlying true badness that each person has noisy access to. This means that if A and B are close in badness we will sometimes see people rank A worse and other times rank B worse, while if A and B are far apart we won’t see this overlap. The key idea is that the closer together A and B are in badness, the less consistent people’s judgements of them will be. But look at this case:
Condition A is having moderate back pain, and otherwise perfect health.
Condition B is having moderate back pain, one toe amputated, and otherwise perfect health.
Almost everyone is going to say B is worse than A, and the people who rank A worse probably misunderstood the question. But this doesn’t mean B is much worse than A; it just means it’s an easy comparison for people to make, x vs x+y instead of x vs z.
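To make the failure mode concrete, here's a small simulation (all numbers invented). The model assumes every comparison carries the same judgement noise, so when an easy x vs x+y question produces near-unanimous answers, it reads that unanimity as a large gap in badness:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
ROOT2 = 2 ** 0.5

def win_rate(true_gap, judgement_noise, n=100_000):
    """Fraction of simulated respondents who rank condition B as worse."""
    a = rng.normal(0.0, judgement_noise, n)
    b = rng.normal(true_gap, judgement_noise, n)
    return (b > a).mean()

def inferred_gap(p):
    """Gap the model infers, assuming unit noise on every comparison."""
    return norm.ppf(p) * ROOT2

# Hard comparison: true gap 1.0, judged with the assumed unit noise.
print(inferred_gap(win_rate(1.0, 1.0)))     # ~= 1.0, correctly recovered

# Easy comparison: true gap only 0.01 (the toe), but judged almost
# noiselessly because B is visibly A-plus-something. Near-unanimity gets
# read back through the unit-noise assumption as a gap ~200x too large;
# with even less noise the answers become unanimous and the gap infinite.
print(inferred_gap(win_rate(0.01, 0.005)))  # ~= 2.0
```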
Do we just assume this is rare with disability comparisons? That consistency is only common when things are far apart? Is there a way to test this?
I agree that this is not terribly well-founded.
I asked one of the architects behind the method about this issue. The answer boiled down to an agreement that this could cause problems in principle, but that in practice the overall badness of a condition was based on many comparisons, some easy and some hard, so it comes out okay on average.
“it just means it’s an easy comparison for people to make”
Yes, I completely agree and would have spelled it out just like you did in my comment further down if I’d had the time.
“the large set of comparison data was used to construct a cardinal ordering”

I really should know this, but is a cardinal ordering one where you can say “A is twice as bad as B” or not? My understanding of their process is they used their stats to put everything in order from least-harmful to most-harmful, but this didn’t get them more than just an ordering. The equivalence questions added not only the information necessary to put this on the 0-1 scale, but also what was needed to say “A costs 3x less than B to treat, but we should still treat B because it’s 5x worse”. An ordering by itself doesn’t get us very far.
A cardinal ordering, strictly speaking, is one where you can say “the difference between A and B is twice as large as the difference between A and C”. I’d assumed that if you had “perfect health” on your scale, you could use distance from this to express “twice as bad as”.
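A toy illustration of the distinction, with invented weights: differences, which are what a cardinal (interval) scale gives you, survive shifting the scale, but ratios like “twice as bad” don't, which is why the zero point matters:

```python
# Invented weights on an interval scale (zero point not yet meaningful):
A, B, C = 0.10, 0.40, 0.25

# Cardinal-ordering statements compare *differences*; these survive any
# shift of the scale:
print((B - A) / (C - A))  # 2.0: the A-B gap is twice the A-C gap

# Ratio statements like "B is four times as bad as A" only make sense once
# the zero point (perfect health) is pinned down; shifting the whole scale
# by 0.1 changes the ratio but not the ratio of differences:
A2, B2 = A + 0.1, B + 0.1
print(B / A, B2 / A2)     # 4.0 vs 2.5: the ratio is not shift-invariant
```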
From section 2A of the appendix you linked to:

The implication of this is that the probit regression yields estimates of values for each health state that capture the relative differences in health levels between states, consistent with the paired comparison responses, but that these health-state values are on an arbitrary scale rather than on a unique disability weight scale that ranges between 0 and 1.
However, from looking at the rest of section 2 it seems that they didn’t (as I had thought) just take this cardinal scale at face value, using the population health equivalence questions to anchor the endpoints. Rather they made some assumptions about the shape of the transformation required, and then used the extra data to fill in some of the parameter choices. So “twice as far from perfect health” doesn’t necessarily translate to “twice as bad”, but it does translate to a precise statement about how much worse.
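As a sketch of the general idea, not their exact procedure: suppose, for illustration, the transformation is linear on the log-odds scale, with its parameters fitted from the equivalence-question data (all numbers here are invented):

```python
import numpy as np

# Invented probit-scale values from the paired-comparison regression
# (arbitrary units), plus 0-1 disability weights for three anchor states
# estimated from the population health equivalence questions:
probit_values  = np.array([-1.2, 0.3, 1.8])
anchor_weights = np.array([0.05, 0.30, 0.75])

logit     = lambda p: np.log(p / (1.0 - p))
inv_logit = lambda x: 1.0 / (1.0 + np.exp(-x))

# Fit logit(weight) = a + b * probit_value on the anchor states...
b, a = np.polyfit(probit_values, logit(anchor_weights), 1)

# ...then map every other condition onto the 0-1 weight scale:
def disability_weight(probit_value):
    return inv_logit(a + b * probit_value)

print(disability_weight(0.9))  # ~0.48, between the last two anchors
```

Under a transformation like this, doubling the distance from “perfect health” on the probit scale doesn’t double the weight, but it does pin the weight down exactly.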
If you can turn a bunch of “A is worse than B” statements into a cardinal ordering, then why do you need the population equivalence questions at all? Why not just include “perfect health” and “death” among your disabilities? Then we can eventually say “the difference between perfect health and A is X% of the difference between perfect health and death.”
I guess part of my confusion is I don’t really see how you can get this cardinal ordering from the data. So let’s say we find that condition A is universally considered worse than all other conditions. Perhaps it’s “death”, perhaps it’s just clearly the worst of the conditions we’re looking at. How can statistics give us a ratio by which it’s worse? If it were somehow twice as bad, we would still just see it ranked worst in all its comparisons.
You are correct. You can’t really turn the ordinal data into a cardinal ordering, only into a kind of proxy ordering that has some cardinal structure, and that structure might not correspond to the cardinal structure we care about. For example, if ‘perfect health’ were added and 100% of people ranked it above the other choice, then it would end up very far (possibly infinitely far) from the nearest option on the cardinal scale. What the scale is really measuring is the amount of disagreement at each part of the ordering, which is a proxy for closeness of the health levels; but there are cases, like ‘perfect health’ vs. something only slightly worse, where the states are close yet there is no disagreement.
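The “possibly infinitely far” is literal in this kind of model: the inverse normal CDF sends unanimous agreement to an infinite gap, and near-unanimity to a huge one, regardless of how close the states really are:

```python
from scipy.stats import norm

ROOT2 = 2 ** 0.5

# Gap (in noise units) the probit model reads off an agreement level:
print(norm.ppf(0.75)  * ROOT2)  # ~= 0.95: real disagreement, modest gap
print(norm.ppf(0.999) * ROOT2)  # ~= 4.37: near-unanimity, huge gap
print(norm.ppf(1.0)   * ROOT2)  # inf: unanimity, infinite gap
```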