Joe_Carlsmith comments on Draft report on existential risk from power-seeking AI

Joe_Carlsmith 1 May 2021 0:12 UTC
18 points
0 ∶ 0
(Continued from comment on the main thread)
I’m understanding your main points/objections in this comment as:
1. You think the multiple stage fallacy might be the methodological crux behind our disagreement.
2. You think that >80% of AI safety researchers at MIRI, FHI, CHAI, OpenAI, and DeepMind would assign >10% probability to existential catastrophe from technical problems with AI (at some point, not necessarily before 2070). So it seems like 80k saying 1-10% reflects a disagreement with the experts, which would be strange in the context of e.g. climate change, and at least worth flagging/separating. (Presumably, something similar would apply to my own estimates.)
3. You worry that there are social reasons not to sound alarmist about weird/novel GCRs, and that it can feel “conservative” to low-ball rather than high-ball the numbers. But low-balling (and/or focusing on/making salient lower-end numbers) has serious downsides. And you worry that EA folks have a track record of mistakes in this vein.
(as before, let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p).
Re 1 (and 1c, from my response to the main thread): as I discuss in the document, I do think there are questions about multiple-stage fallacies, here, though I also think that not decomposing a claim into sub-claims can risk obscuring conjunctiveness (and I don’t see “abandon the practice of decomposing a claim into subclaims” as a solution to this). As an initial step towards addressing some of these worries, I included an appendix that reframes the argument using fewer premises (and also, in positive (e.g., “p is false”) vs. negative (“p is true”) forms). Of course, this doesn’t address e.g. the “the conclusion could be true, but some of the premises false” version of the “multiple stage fallacy” worry; but FWIW, I really do think that the premises here capture the majority of my own credence on p, at least. In particular, the timelines premise is fairly weak, premises 4-6 are implied by basically any p-like scenario, so it seems like the main contenders for false premises (even while p is true) are 2: (“There will be strong incentives to build APS systems”) and 3: (“It will be much harder to develop APS systems that would be practically PS-aligned if deployed, than to develop APS systems that would be practically PS-misaligned if deployed (even if relevant decision-makers don’t know this), but which are at least superficially attractive to deploy anyway”). Here, I note the scenarios most salient to me in footnote 173, namely: “we might see unintentional deployment of practical PS-misaligned APS systems even if they aren’t superficially attractive to deploy” and “practical PS-misaligned might be developed and deployed even absent strong incentives to develop them (for example, simply for the sake of scientific curiosity).” But I don’t see these are constituting more than e.g. 50% of the risk. If your own probability is driven substantially by scenarios where the premises I list are false, I’d be very curious to hear which ones (setting aside scenarios that aren’t driven by power-seeking, misaligned AI), and how much credence if you give them. I’d also be curious, more generally, to hear your more specific disagreements with the probabilities I give to the premises I list.
Re: 2, your characterization of the distribution of views amongst AI safety researchers (outside of MIRI) is in some tension with my own evidence; and I consulted with a number of people who fit your description of “specialists”/experts in preparing the document. That said, I’d certainly be interested to see more public data in this respect, especially in a form that breaks down in (rough) quantitative terms the different factors driving the probability in question, as I’ve tried to do in the document (off the top of my head, the public estimates most salient to me are Ord (2020) at 10% by 2100, Grace et al (2017)’s expert survey (5% median, with no target date), and FHI’s (2008) survey (5% on extinction from superintelligent AI by 2100), though we could gather up others from e.g. LW and previous X-risk books.) That said, importantly, and as indicated in my comment on the main thread, I don’t think of the community of AI safety researchers at the orgs you mention as in an epistemic position analogous to e.g. the IPCC, for a variety of reasons (and obviously, there are strong selection effects at work). Less importantly, I also don’t think the technical aspects of this problem the only factors relevant to assessing risk; at this point I have some feeling of having “heard the main arguments”; and >10% (especially if we don’t restrict to pre-2070 scenarios) is within my “high-low” range mentioned in footnote 178 (e.g., .1%-40%).
Re: 3, I do think that the “conservative” thing to do here is to focus on the higher-end estimates (especially given uncertainty/instability in the numbers), and I may revise to highlight this more in the text. But I think we should distinguish between the project of figuring out “what to focus on”/what’s “appropriately conservative,” and what our actual best-guess probabilities are; and just as there are risks of low-balling for the sake of not looking weird/alarmist, I think there are risks of high-balling for the sake of erring on the side of caution. My aim here has been to do neither; though obviously, it’s hard to eliminate biases (in both directions).
- Ben Pace 2 May 2021 3:25 UTC
  20 points
  0 ∶ 1
  Parent
  I think I share Robby’s sense that the methodology seems like it will obscure truth.
  That said, I have neither your (Joe) extensive philosophical background nor have spent substantial time like you on a report like this, and I am interested in evidence to the contrary.
  To me, it seems like you’ve tried to lay out a series of 6 steps of an argument, that you think each very accurately carve the key parts of reality that are relevant, and pondered each step for quite a while.
  When I ask myself whether I’ve seen something like this produce great insight, it’s hard. It’s not something I’ve done much myself explicitly. However, I can think of a nearby example where I think this has produced great insight, which is Nick Bostrom’s work. I think (?) Nick spends a lot of his time considering a simple, single key argument, looking at it from lots of perspectives, scrutinizing wording, asking what people from different scientific fields would think of it, poking and prodding and rotating and just exploring it. Through that work, I think he’s been able to find considerations that were very surprising and invalidated the arguments, and proposed very different arguments instead.
  When I think of examples here, I’m imagining that this sort of intellectual work produced the initial arguments about astronomical waste, and arguments since then about unilateralism and the vulnerable world hypothesis. Oh, and also simulation hypothesis (which became a tripartite structure).
  I think of Bostrom as trying to consider a single worldview, and find out whether it’s a consistent object. One feeling I have about turning it into a multi-step probabilistic argument is that it does the opposite, it does not try to examine one worldview to find falsehoods, but instead integrates over all the parts of the worldview that Bostrom would scrutinize, to make a single clump of lots of parts of different worldviews. I think Bostrom may have literally never published a six-step argument of the form that you have, where it was meant to hold anything of weight in the paper or book, and also never done so assigning each step a probability.
  To be clear, probabilistic discussions are great. Talking about precisely how strong a piece of evidence is – is it 2:1, 10:1, 100:1? Helps a lot in noticing which hypotheses to even pay attention to. The suspicion I have is that they are fairly different from the kind of cognition Bostrom does when doing this sort of philosophical argumentation that produces simple arguments of world-shattering importance. I suspect you’ve set yourself a harder task than Bostrom ever has (a 6-step argument), and thought you’ve made it easier for yourself by making it only probabilistic instead of deductive, whereas in fact this removes most of the tools that Bostrom was able to use to ensure he didn’t take mis-steps.
  But I am pretty interested if there are examples of great work using your methodology that you were inspired by when writing this up, or great works with nearby methodologies that feel similar to you. I’d be excited to read/discuss some.
  - Ben Pace 2 May 2021 4:10 UTC
    10 points
    0 ∶ 0
    Parent
    I tried to look for writing like this. I think that people do multiple hypothesis testing, like Harry in chapter 86 of HPMOR. There Harry is trying to weigh some different hypotheses against each other to explain his observations. There isn’t really a single train of conditional steps that constitutes the whole hypothesis.
    My shoulder-Scott-Alexander is telling me (somewhat similar to my shoulder-Richard-Feynman) that there’s a lot of ways to trick myself with numbers, and that I should only do very simple things with them. I looked through some of his posts just now (1, 2, 3, 4, 5).
    Here’s an example of a conclusion / belief from Scott’s post Teachers: Much More Than You Wanted to Know:
    In summary: teacher quality probably explains 10% of the variation in same-year test scores. A +1 SD better teacher might cause a +0.1 SD year-on-year improvement in test scores. This decays quickly with time and is probably disappears entirely after four or five years, though there may also be small lingering effects. It’s hard to rule out the possibility that other factors, like endogenous sorting of students, or students’ genetic potential, contributes to this as an artifact, and most people agree that these sorts of scores combine some signal with a lot of noise. For some reason, even though teachers’ effects on test scores decay very quickly, studies have shown that they have significant impact on earning as much as 20 or 25 years later, so much so that kindergarten teacher quality can predict thousands of dollars of difference in adult income. This seemingly unbelievable finding has been replicated in quasi-experiments and even in real experiments and is difficult to banish. Since it does not happen through standardized test scores, the most likely explanation is that it involves non-cognitive factors like behavior. I really don’t know whether to believe this and right now I say 50-50 odds that this is a real effect or not – mostly based on low priors rather than on any weakness of the studies themselves. I don’t understand this field very well and place low confidence in anything I have to say about it.
    I don’t know any post where Scott says “there’s a particular 6-step argument, and I assign 6 different probabilities to each step, and I trust that outcome number seems basically right”. His conclusions read more like 1 key number with some uncertainty, which never came from a single complex model, but from aggregating loads of little studies and pieces of evidence into a judgment.
    I think I can’t think of a post like this by Scott or Robin or Eliezer or Nick or anyone. But would be interested in an example that is like this (from other fields or wherever), or feels similar.
    - Gregory Lewis🔸 2 May 2021 9:53 UTC
      30 points
      0 ∶ 0
      Parent
      Maybe not ‘insight’, but re. ‘accuracy’ this sort of decomposition is often in the tool box of better forecasters. I think the longest path I evaluated in a question had 4 steps rather than 6, and I think I’ve seen other forecasters do similar things on occasion. (The general practice of ‘breaking down problems’ to evaluate sub-issues is recommended in Superforecasting IIRC).
      
      I guess the story why this works in geopolitical forecasting is folks tend to overestimate the chance ‘something happens’ and tend to be underdamped in increasing the likelihood of something based on suggestive antecedents (e.g. chance of a war given an altercation, etc.) So attending to “Even if A, for it to lead to D one should attend to P(B|A), P(C|B) etc. etc.”, tend to lead to downwards corrections.
      
      Naturally, you can mess this up. Although it’s not obvious you are at greater risk if you arrange your decomposed considerations conjunctively or disjunctively: “All of A-E must be true for P to be true” ~also means “if any of ¬A-¬E are true, then ¬P”. In natural language and heuristics, I can imagine “Here are several different paths to P, and each of these seem not-too-improbable, so P must be highly likely” could also lead one astray.
    - Joe_Carlsmith 7 May 2021 18:39 UTC
      22 points
      0 ∶ 0
      Parent
      Hi Ben,
      A few thoughts on this:
      It seems possible that attempting to produce “great insight” or “simple arguments of world-shattering importance” warrants a methodology different from the one I’ve used here. But my aim here is humbler: to formulate and evaluate an existing argument that I and various others take seriously, and that lots of resources are being devoted to; and to come to initial, informal, but still quantitative best-guesses about the premises and conclusion, which people can (hopefully) agree/disagree with at a somewhat fine-grained level—e.g., a level that just giving overall estimates, or just saying e.g. “significant probability,” “high enough to worry about,” etc can make more difficult to engage on.
      In that vein, I think it’s possible you’re over-estimating how robust I take the premises and numbers here to be (I’m thinking here of your comments re: “very accurately carve the key parts of reality that are relevant,” and “trust the outcome number”). As I wrote in response to Rob above, my low-end/high-end range here is .1% to 40% (see footnote 179, previously 178), and in general, I hold the numbers here very lightly (I try to emphasize this in section 8).
      FWIW, I think Superintelligence can be pretty readily seen as a multi-step argument (e.g., something like: superintelligence will happen eventually; fast take-off is plausible; if fast-take-off, then a superintelligence will probably get a decisive strategic advantage; alignment will be tricky; misalignment leads to power-seeking; therefore plausible doom). And more broadly, I think that people make arguments with many premises all the time (though sometimes the premises are suppressed). It’s true that people don’t usually assign probabilities to the premises (and Bostrom doesn’t, in Superintelligence—a fact that leaves the implied p(doom) correspondingly ambiguous) -- but I think this is centrally because assigning informal probabilities to claims (whether within a multi-step argument, or in general) just isn’t a very common practice, for reasons not centrally to do with e.g. multi-stage-fallacy type problems. Indeed, I expect I’d prefer a world where people assigned informal, lightly-held probabilities to their premises and conclusions (and formulated their arguments in premise-premise-conclusion form) more frequently.
      I’m not sure exactly what you have in mind re: “examining a single worldview to see whether it’s consistent,” but consistency in a strict sense seems too cheap? E.g., “Bob has always been wrong before, but he’ll be right this time”; “Mortimer Snodgrass did it”; etc are all consistent. That said, my sense is that you have something broader in mind—maybe something like “plausible,” “compelling,” “sense-making,” etc. But it seems like these still leave the question of overall probabilities open...
      Overall, my sense is that disagreement here is probably more productively focused on the object level—e.g., on the actual probabilities I give to the premises, and/or on pointing out and giving weight to scenarios that the premises don’t cover—rather than on the methodology in the abstract. In particular, I doubt that people who disagree a lot with my bottom line will end up saying: “If I was to do things your way, I’d roughly agree with the probabilities you gave to the premises; I just disagree that you should assign probabilities to premises in a multi-step argument as a way of thinking about issues like this.” Rather, I expect a lot of it comes down to substantive disagreement about the premises at issue (and perhaps, to people assigning significant credence to scenarios that don’t fit these premises, though I don’t feel like I’ve yet heard strong candidates—e.g., ones that seem to me to plausibly account for, say, >2/3rds of the overall X-risk from power-seeking, misaligned AI by 2070 -- in this regard).
      - Ben Pace 8 May 2021 6:33 UTC
        2 points
        0 ∶ 0
        Parent
        Thanks for the thoughtful reply.
        I do think I was overestimating how robust you’re treating your numbers and premises, it seems like you’re holding them all much more lightly than I think I’d been envisioning.
        FWIW I am more interested in engaging with some of what you wrote in in your other comment than engaging on the specific probability you assign, for some of the reasons I wrote about here.
        I think I have more I could say on the methodology, but alas, I’m pretty blocked up with other work atm. It’d be neat to spend more time reading the report and leave more comments here sometime.
        anonymous_ea 8 May 2021 21:02 UTC
        2 points
        0 ∶ 0
        Parent
        your other comment
        This links to A Sketch of Good Communication, not whichever comment you were intending to link :)
        Ben Pace 8 May 2021 21:22 UTC
        2 points
        0 ∶ 0
        Parent
        Fixed, tah.
  - richard_ngo 10 Dec 2022 8:20 UTC
    4 points
    0 ∶ 0
    Parent
    Great comment :)
- RyanCarey 2 Jun 2021 9:42 UTC
  11 points
  0 ∶ 0
  Parent
  The upshot seems to be that Joe, 80k, the AI researcher survey (2008), Holden-2016 are all at about a 3% estimate of AI risk, whereas AI safety researchers now are at about 30%. The latter is a bit lower (or at least differently distributed) than Rob expected, and seems higher than among Joe’s advisors.
  The divergence is big, but pretty explainable, because it concords with the direction that apparent biases point in. For the 3% camp, the credibility of one’s name, brand, or field benefits from making a lowball estimates. Whereas the 30% camp is self-selected to have severe concern. And risk perception all-round has increased a bit in the last 5-15 years due to Deep Learning.