Elo scores are based on trying to model the probability that A will beat B. They assume every player has some ability, that their performance in a game is normally distributed around that ability, that the player with the higher performance in that game will win, and that all players have the same standard deviation. So each player can be represented by a single number, the mean of their normal distribution. If we see A beat B 75% of the time and A beat C 95% of the time, then we could construct normal distributions that would lead to this outcome and predict what fraction of the time B would beat C.
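To make that last step concrete, here is a minimal sketch of the calculation under exactly the assumptions above (normal performances, equal standard deviations). The function name `implied_gap` is mine, and gaps are measured in units of the standard deviation of the difference between two players' performances, so treat it as an illustration rather than how any real rating system is implemented.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def implied_gap(p_win: float) -> float:
    """Gap in mean ability that produces the given win probability.

    Measured in units of the standard deviation of the difference
    between the two players' performances (an assumption of this sketch).
    """
    return std_normal.inv_cdf(p_win)


gap_ab = implied_gap(0.75)  # how far A sits above B
gap_ac = implied_gap(0.95)  # how far A sits above C
gap_bc = gap_ac - gap_ab    # so B sits this far above C

p_b_beats_c = std_normal.cdf(gap_bc)
print(f"Predicted P(B beats C) = {p_b_beats_c:.3f}")
```

With these numbers the model predicts that B beats C roughly 83% of the time, even though B and C were never observed playing each other.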
Applying this to disease weightings means modeling each disease as having some underlying true badness that each person has noisy access to. This means that if A and B are close in badness we will sometimes see people rank A worse and other times rank B worse, while if A and B are far apart we won’t see this overlap. The key idea is that the closer together A and B are in badness, the less consistent people’s judgements of them will be. But look at this case:
Condition A is having moderate back pain, and otherwise perfect health.
Condition B is having moderate back pain, one toe amputated, and otherwise perfect health.
Almost everyone is going to say B is worse than A, and the people who rank A worse probably misunderstood the question. But this doesn’t mean B is much worse than A; it just means it’s an easy comparison for people to make, x vs x+y instead of x vs z.
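Here is a rough sketch of how much this could matter. All the numbers are made up for illustration: I assume the toe adds a small true badness, that a rater's noise about a whole condition is large, and that their noise about the toe alone is small. In the x vs x+y case the rater's noise about the back pain is shared by both conditions and cancels out, which is the part the equal-variance model has no way to represent.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

# Illustrative assumptions, not measured values.
TRUE_GAP = 0.3         # true extra badness contributed by the toe amputation
CONDITION_NOISE = 1.0  # sd of a rater's noise when judging a whole condition
ADDON_NOISE = 0.1      # sd of a rater's noise about the toe amputation alone

# x vs z: two unrelated conditions the same distance apart,
# each judged with independent noise.
p_unrelated = std_normal.cdf(TRUE_GAP / (CONDITION_NOISE * sqrt(2)))

# x vs x+y: the back-pain noise is common to both sides and cancels,
# leaving only the small noise about the add-on.
p_nested = std_normal.cdf(TRUE_GAP / ADDON_NOISE)

# Gap that an equal-variance, independent-noise model would infer
# from the near-unanimous nested comparison.
inferred_gap = std_normal.inv_cdf(p_nested) * CONDITION_NOISE * sqrt(2)

print(f"x vs z:   {p_unrelated:.1%} of raters say the worse one is worse")
print(f"x vs x+y: {p_nested:.1%} of raters say B is worse")
print(f"true gap {TRUE_GAP}, gap inferred from x vs x+y: {inferred_gap:.2f}")
```

Under these assumptions the same true gap gives barely better than a coin flip for unrelated conditions but near-unanimity for the nested pair, and a model reading the nested result at face value would put B more than ten times further from A than it really is.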
Do we just assume this is rare with disability comparisons? That consistency is only common when things are far apart? Is there a way to test this?
I asked one of the architects behind the method about this issue. The answer boiled down to an agreement that this could cause problems in principle, but that in practice the overall badness of a condition is based on many comparisons, some of which will be easy and some of which will be hard, so it comes out okay on average.
I agree that this is not terribly well-founded.
“it just means it’s an easy comparison for people to make”
Yes, I completely agree and would have spelled it out just like you did in my comment further down if I’d had the time.