I tried to look for writing like this. I think that what people more often do is multiple-hypothesis testing, like Harry in chapter 86 of HPMOR, where he is trying to weigh several different hypotheses against each other to explain his observations. There isn’t really a single train of conditional steps that constitutes the whole hypothesis.
My shoulder-Scott-Alexander is telling me (somewhat similar to my shoulder-Richard-Feynman) that there are a lot of ways to trick myself with numbers, and that I should only do very simple things with them. I looked through some of his posts just now (1, 2, 3, 4, 5).
Here’s an example of a conclusion / belief from Scott’s post Teachers: Much More Than You Wanted to Know:

In summary: teacher quality probably explains 10% of the variation in same-year test scores. A +1 SD better teacher might cause a +0.1 SD year-on-year improvement in test scores. This decays quickly with time and probably disappears entirely after four or five years, though there may also be small lingering effects. It’s hard to rule out the possibility that other factors, like endogenous sorting of students, or students’ genetic potential, contribute to this as an artifact, and most people agree that these sorts of scores combine some signal with a lot of noise. For some reason, even though teachers’ effects on test scores decay very quickly, studies have shown that they have significant impact on earnings as much as 20 or 25 years later, so much so that kindergarten teacher quality can predict thousands of dollars of difference in adult income. This seemingly unbelievable finding has been replicated in quasi-experiments and even in real experiments and is difficult to banish. Since it does not happen through standardized test scores, the most likely explanation is that it involves non-cognitive factors like behavior. I really don’t know whether to believe this and right now I say 50-50 odds that this is a real effect or not – mostly based on low priors rather than on any weakness of the studies themselves. I don’t understand this field very well and place low confidence in anything I have to say about it.
I don’t know of any post where Scott says “there’s a particular 6-step argument, I assign a different probability to each of the 6 steps, and I trust that the resulting number is basically right”. His conclusions read more like one key number with some uncertainty, which never came from a single complex model, but from aggregating loads of little studies and pieces of evidence into a judgment.
I can’t think of a post like this by Scott or Robin or Eliezer or Nick or anyone, but I would be interested in an example that is like this (from other fields or wherever), or that feels similar.
Maybe not ‘insight’, but re. ‘accuracy’, this sort of decomposition is often in the toolbox of better forecasters. I think the longest path I evaluated in a question had 4 steps rather than 6, and I think I’ve seen other forecasters do similar things on occasion. (The general practice of ‘breaking down problems’ to evaluate sub-issues is recommended in Superforecasting, IIRC.)
I guess the story for why this works in geopolitical forecasting is that folks tend to overestimate the chance that ‘something happens’, and tend to be underdamped when increasing the likelihood of something based on suggestive antecedents (e.g. the chance of a war given an altercation, etc.). So attending to “even if A, for it to lead to D one should attend to P(B|A), P(C|B), etc.” tends to lead to downward corrections.
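To make that correction concrete, here’s a minimal sketch of the arithmetic, with made-up step probabilities (none of these numbers come from a real question):

```python
# Hypothetical conditional chain: altercation -> mobilisation -> ultimatum -> war.
# Each step looks "fairly likely" on its own, but the conjunction is much smaller.
p_a = 0.7          # P(A): the altercation escalates at all
p_b_given_a = 0.6  # P(B | A)
p_c_given_b = 0.6  # P(C | A, B)
p_d_given_c = 0.5  # P(D | A, B, C): war actually breaks out

p_d = p_a * p_b_given_a * p_c_given_b * p_d_given_c
print(f"P(D) = {p_d:.2f}")  # 0.13, well below what any individual step suggests
```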
Naturally, you can mess this up, although it’s not obvious that you are at greater risk if you arrange your decomposed considerations conjunctively rather than disjunctively: “all of A-E must be true for P to be true” ~also means “if any of ¬A-¬E are true, then ¬P”. In natural language and heuristics, I can imagine that “here are several different paths to P, and each of these seems not-too-improbable, so P must be highly likely” could also lead one astray.
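A toy comparison of the two framings, with invented numbers, and assuming the disjunctive paths are independent (which real paths usually aren’t):

```python
# Conjunctive framing: P requires all of A..E to hold.
step_probs = [0.8, 0.8, 0.8, 0.8, 0.8]
p_all = 1.0
for p in step_probs:
    p_all *= p
print(f"All five steps hold: {p_all:.2f}")  # ~0.33, even though each step is 0.8

# Disjunctive framing: P goes through if any one of several (assumed independent) paths works.
path_probs = [0.3, 0.3, 0.3]
p_none = 1.0
for p in path_probs:
    p_none *= 1 - p
print(f"At least one path works: {1 - p_none:.2f}")  # ~0.66, even though each path is only 0.3
```

Neither direction is automatically safer; multiplying things out just makes explicit what the chosen framing is assuming.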
Hi Ben,
A few thoughts on this:
It seems possible that attempting to produce “great insight” or “simple arguments of world-shattering importance” warrants a methodology different from the one I’ve used here. But my aim here is humbler: to formulate and evaluate an existing argument that I and various others take seriously, and that lots of resources are being devoted to; and to come to initial, informal, but still quantitative best-guesses about the premises and conclusion, which people can (hopefully) agree/disagree with at a somewhat fine-grained level—e.g., a level that just giving overall estimates, or just saying “significant probability,” “high enough to worry about,” etc., can make more difficult to engage with.
In that vein, I think it’s possible you’re overestimating how robust I take the premises and numbers here to be (I’m thinking here of your comments re: “very accurately carve the key parts of reality that are relevant,” and “trust the outcome number”). As I wrote in response to Rob above, my low-end/high-end range here is 0.1% to 40% (see footnote 179, previously 178), and in general, I hold the numbers here very lightly (I try to emphasize this in section 8).
FWIW, I think Superintelligence can be pretty readily seen as a multi-step argument (e.g., something like: superintelligence will happen eventually; fast take-off is plausible; if fast take-off, then a superintelligence will probably get a decisive strategic advantage; alignment will be tricky; misalignment leads to power-seeking; therefore plausible doom). And more broadly, I think that people make arguments with many premises all the time (though sometimes the premises are suppressed). It’s true that people don’t usually assign probabilities to the premises (and Bostrom doesn’t, in Superintelligence—a fact that leaves the implied p(doom) correspondingly ambiguous), but I think this is centrally because assigning informal probabilities to claims (whether within a multi-step argument, or in general) just isn’t a very common practice, for reasons not centrally to do with e.g. multi-stage-fallacy type problems. Indeed, I expect I’d prefer a world where people assigned informal, lightly-held probabilities to their premises and conclusions (and formulated their arguments in premise-premise-conclusion form) more frequently.
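As a sketch of the kind of practice I have in mind, here is what attaching lightly-held probabilities to a Superintelligence-style chain of premises could look like. The numbers are placeholders for illustration only; Bostrom assigns none, and they are not the estimates from my report:

```python
# Placeholder probabilities for a Superintelligence-style chain of premises.
# Illustrative only: not Bostrom's numbers and not the estimates from the report.
premises = {
    "superintelligence is eventually developed":          0.8,
    "take-off is fast":                                    0.4,
    "fast take-off yields a decisive strategic advantage": 0.7,
    "alignment is not solved in time":                     0.5,
    "misalignment leads to power-seeking":                 0.7,
    "power-seeking leads to existential catastrophe":      0.6,
}

p_doom = 1.0
for claim, p in premises.items():
    p_doom *= p
    print(f"{claim}: {p:.2f}  (running product: {p_doom:.3f})")

print(f"Implied p(doom) under these placeholder premises: {p_doom:.3f}")
```

The point is not the final number, but that each premise becomes something a reader can agree or disagree with at a fine-grained level.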
I’m not sure exactly what you have in mind re: “examining a single worldview to see whether it’s consistent,” but consistency in a strict sense seems too cheap? E.g., “Bob has always been wrong before, but he’ll be right this time”; “Mortimer Snodgrass did it”; etc. are all consistent. That said, my sense is that you have something broader in mind—maybe something like “plausible,” “compelling,” “sense-making,” etc. But it seems like these still leave the question of overall probabilities open...
Overall, my sense is that disagreement here is probably more productively focused on the object level—e.g., on the actual probabilities I give to the premises, and/or on pointing out and giving weight to scenarios that the premises don’t cover—rather than on the methodology in the abstract. In particular, I doubt that people who disagree a lot with my bottom line will end up saying: “If I were to do things your way, I’d roughly agree with the probabilities you gave to the premises; I just disagree that you should assign probabilities to premises in a multi-step argument as a way of thinking about issues like this.” Rather, I expect a lot of it comes down to substantive disagreement about the premises at issue (and perhaps to people assigning significant credence to scenarios that don’t fit these premises, though I don’t feel like I’ve yet heard strong candidates in this regard—e.g., ones that seem to me to plausibly account for, say, >2/3rds of the overall X-risk from power-seeking, misaligned AI by 2070).
Thanks for the thoughtful reply.
I do think I was overestimating how robust you take your numbers and premises to be; it seems like you’re holding them all much more lightly than I’d been envisioning.
FWIW I am more interested in engaging with some of what you wrote in your other comment than engaging on the specific probability you assign, for some of the reasons I wrote about here.
I think I have more I could say on the methodology, but alas, I’m pretty blocked up with other work atm. It’d be neat to spend more time reading the report and leave more comments here sometime.
This links to A Sketch of Good Communication, not whichever comment you were intending to link :)
Fixed, tah.