Should we be using Likelihood Ratios in everyday conversation the same way we use probabilities?
Disclaimer: Copy-pasting some Slack messages here, so this post is less coherent or well-written than others.
I’ve been thinking that perhaps we should state likelihood ratios in everyday conversation to convey the strength of evidence, the same way we state probabilities in everyday conversation to convey beliefs; that there should be a likelihood ratio calibration game; and that we should have cached likelihood ratios for common types of evidence (e.g., experimental research papers of a given level of quality).
However, maybe this is less useful because different pieces of evidence are often correlated? Or can we just talk about the strength of the uncorrelated portion of additional evidence?
See also: Strong Evidence is Common
Example
Here’s an example with made-up numbers:
Question: Are minimum wages good or bad for low-skill workers?
Theoretical arguments that minimum wages increase unemployment, LR = 1:3
Someone sends an empirical paper and the abstract says it improved the situation, LR = 1.2:1
IGM Chicago Survey results, LR = 5:1
So if you start out with a 50% probability, your prior odds are 1:1. Multiplying by the three likelihood ratios (1:3 × 1.2:1 × 5:1) gives posterior odds of 6:3, or 2:1, so your posterior probability is 67%.
If another person starts out with a 20% probability, their prior odds are 1:4, the same evidence takes them to posterior odds of 1:2, and their posterior probability is 33%.
These two people agree on the strength of evidence but disagree on the prior. So the idea is that you can talk about the strength of the evidence / size of the update instead of the posterior probability (which might mainly depend on your prior).
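To make the arithmetic explicit, here’s a minimal Python sketch of the odds-form update using the made-up likelihood ratios from above (the helper function names are just for illustration):

```python
def prob_to_odds(p):
    """Convert a probability to odds in favor (p : 1-p)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Convert odds in favor back to a probability."""
    return odds / (1 + odds)

# Made-up likelihood ratios from the example, written as single numbers (a:b -> a/b):
# theoretical arguments, empirical paper, IGM Chicago survey.
likelihood_ratios = [1 / 3, 1.2 / 1, 5 / 1]

for prior_prob in (0.5, 0.2):
    odds = prob_to_odds(prior_prob)
    for lr in likelihood_ratios:
        odds *= lr  # Bayes' rule in odds form: posterior odds = prior odds * LR
    print(f"prior {prior_prob:.0%} -> posterior {odds_to_prob(odds):.0%}")
# prints: prior 50% -> posterior 67%, then prior 20% -> posterior 33%
```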
Calibration game
A baseline calibration game proposal:
You get presented with a proposition, and submit a probability. Then you receive a piece of evidence that relates to the proposition (e.g. a sentence from a Wikipedia page about the issue, or a screenshot of a paper/abstract). You submit a likelihood ratio, which implies a certain posterior probability. Then both of these probabilities get scored using a proper scoring rule.
My guess is that you can do something more sophisticated here, but I think the baseline proposal basically works.
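For concreteness, here’s a minimal sketch of how one round could be scored. Assumptions on my part: a log scoring rule, and the LR submitted as a single number (e.g. 5:1 becomes 5.0); neither is fixed by the proposal above.

```python
import math

def log_score(p, outcome):
    """Logarithmic proper scoring rule: higher (less negative) is better."""
    return math.log(p if outcome else 1 - p)

def score_round(prior_prob, likelihood_ratio, outcome):
    """Score one round: the pre-evidence probability and the LR-implied posterior.

    prior_prob: probability submitted before seeing the evidence.
    likelihood_ratio: LR for the evidence as a single number (5:1 -> 5.0, 1:3 -> 1/3).
    outcome: True if the proposition resolved true.
    """
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    posterior_prob = posterior_odds / (1 + posterior_odds)
    return log_score(prior_prob, outcome), log_score(posterior_prob, outcome)

# Example round: 50% prior, evidence judged 5:1 in favor, proposition resolves true.
print(score_round(0.5, 5.0, True))
```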
I really like the proposed calibration game! One thing I’m curious about is whether real-world evidence more often looks like a likelihood ratio or like something else (e.g. pointing towards a specific probability being correct). Maybe you could see this from the structure of priors + likelihood ratios + posteriors in the calibration game, e.g. check whether the long-run top scorers’ likelihood ratios correlate more or less than their posterior probabilities.
(If someone wanted to build this: one option would be to start with pastcasting and then give archived articles or Wikipedia pages as evidence. Maybe a sophisticated version could let you start out with an old relevant Wikipedia page, and then see a Wikipedia page much closer to the resolution date as extra evidence.)
Interesting point, agreed that this would be very interesting to analyze!
Relevant calibration game that was recently posted: I found it surprisingly addictive; maybe they’d be interested in implementing your ideas.
Can you walk through the actual calculations here? Why did the Chicago survey shift the person from 1.2:1 to 5:1, and not a different ratio?
No, this is not a description of the absolute shift (i.e., not from 1.2:1 to 5:1) but of the relative shift (i.e., from 1:x to 5:x).
Yeah. Here’s the example in more detail:
Prior odds: 1:1
Theoretical arguments that minimum wages increase unemployment, LR = 1:3 → posterior odds 1:3
Someone sends an empirical paper and the abstract says it improved the situation, LR = 1.2:1 → posterior odds 1.2:3
IGM Chicago Survey results, LR = 5:1 → posterior odds 6:3 (or 2:1)
Ah yes, thank you, that clears it up.
Follow-up question: it seems like these likelihood ratios are fairly subjective. (Like, why is the LR for the Chicago survey 5:1 and not 10:1 or 20:1? How can you calibrate the likelihood ratio when there is no “right answer”?)
It’s the same as with probabilities. How can probabilities be calibrated, given that they are fairly subjective? The LR can be calibrated the same way given that it’s just a function of two probabilities.
You can check probability estimates against outcomes. If you make 5 different predictions and estimate a 20% probability for each, then if you are well calibrated you expect 1 out of the 5 to happen. If all of them happened, you probably made a mistake in your predictions. I don’t think this is perfect (it’s impractical to test very low-probability predictions like 1 in a million), but there is at least some level of empiricism available.
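A quick sketch of that intuition in code, assuming the five predictions are independent:

```python
# If each of five independent predictions truly had a 20% chance,
# how surprising is it that all five came true?
p = 0.2
n = 5
expected_hits = p * n      # 1.0 -- the "expect 1 out of the 5" above
prob_all_happen = p ** n   # 0.00032, i.e. about 0.03%
print(f"expected hits: {expected_hits}, P(all {n} happen): {prob_all_happen:.4%}")
```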
There is no similar test for likelihood ratios. A question like “what is the chance that the Chicago survey said minimum wages are fine if they actually aren’t” can’t be empirically tested.
There is also the question of whether people assign different strengths to the same evidence. Maybe reporting why you think that the evidence is 1:3 rather than 1:1.5 or 1:6 would help.
Yeah exactly, that’s part of the idea here! E.g., on Metaculus, if someone posts a source and updates their belief, they could display the LR to indicate how much it updated them.
Note that bits might be better because you can sum them.
Yeah fair, although I expect people to have more difficulty converting log odds back into probabilities.
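For what it’s worth, here’s a minimal sketch of the bits version, using the made-up likelihood ratios from the example above:

```python
import math

# Likelihood ratios from the example, as single numbers.
lrs = [1 / 3, 1.2, 5.0]

# In bits: log2 of each LR; evidence treated as independent simply adds.
bits = [math.log2(lr) for lr in lrs]  # approx [-1.58, 0.26, 2.32]
total_bits = sum(bits)                # approx 1.0 bit in favor

# Converting back: start from 1:1 prior odds (50%) and shift by total_bits.
posterior_odds = 2 ** total_bits
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"{total_bits:.2f} bits -> posterior {posterior_prob:.0%}")  # ~67%
```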