Joe_Carlsmith comments on Draft report on existential risk from power-seeking AI

Joe_Carlsmith 30 Apr 2021 23:57 UTC
15 points
0 ∶ 0
Hi Rob,
Thanks for these comments.
Let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p. I’m understanding your main objections in this comment as:
1. It seems to you like we’re in a world where p is true, by default. Hence, 5% on p seems too low to you. In particular:
  1. It implies 95% confidence on not p, which seems to you overly confident.
  2. If p is true by default, you think the world would look like it does now; so if this world isn’t enough to get me above 5%, what would be?
  3. Because p seems true to you by default, you suspect that an analysis that only ends up putting 5% on p involves something more than “the kind of mistake you should make in any ordinary way,” and requires some kind of mistake in methodology.
One thing I’ll note at the outset is the content of footnote 178, which (partly prompted by your comment) I may revise to foreground more in the main text: “In sensitivity tests, where I try to put in ‘low-end’ and ‘high-end’ estimates for the premises above, this number varies between ~.1% and ~40% (sampling from distributions over probabilities narrows this range a bit, but it also fails to capture certain sorts of correlations). And my central estimate varies between ~1-10% depending on my mood, what considerations are salient to me at the time, and so forth. This instability is yet another reason not to put too much weight on these numbers. And one might think variation in the direction of higher risk especially worrying.”
Re 1a: I’m open to 5% being too low. Indeed, I take “95% seems awfully confident,” and related worries in that vein, seriously as an objection. However, as the range above indicates, I also feel open to 5% being too high (indeed, at times it seems that way too me), and I don’t see “it would be strange to be so confident that all of humanity won’t be killed/disempowered because of X” as a forceful argument on its own (quite the contrary): rather, I think we really need to look at the object-level evidence and argument for X, which is what the document tries to do (not saying that quote represents your argument; but hopefully it can illustrate why one might start from a place of being unsurprised if the probability turns out low).
Re 1b: I’m not totally sure I’ve understood you here, but here are a few thoughts. At a high level, one answer to “what sort of evidence would make me update towards p being more likely” is “the considerations discussed in the document that I see as counting against p don’t apply, or seem less plausible” (examples here include considerations related to longer timelines, non-APS/modular/specialized/myopic/constrained/incentivized/not-able-to-easily-intelligence-explode systems sufficing in lots/maybe ~all of incentivized applications, questions about the ease of eliminating power-seeking behavior on relevant inputs during training/testing given default levels of effort, questions about why and in what circumstances we might expect PS-misaligned systems to be superficially/sufficiently attractive to deploy, warning shots, corrective feedback loops, limitations to what APS systems with lopsided/non-crazily-powerful capabilities can do, general incentives to avoid/prevent ridiculously destructive deployment, etc, plus more general considerations like “this feels like a very specific way things could go”).
But we could also imagine more “outside view” worlds where my probability would be higher: e.g., there is a body of experts as large and established as the experts working on climate change, which uses quantitative probabilistic models of the quality and precision used by the IPCC, along with an understanding of the mechanisms underlying the threat as clear and well-established as the relationship between carbon emissions and climate change, to reach a consensus on much higher estimates. Or: there is a significant, well-established track record of people correctly predicting future events and catastrophes of this broad type decades in advance, and people with that track record predict p with >5% probability.
That said, I think maybe this isn’t getting at the core of your objection, which could be something like: “if in fact this is a world where p is true, is your epistemology sensitive enough to that? E.g., show me that your epistemology is such that, if p is true, it detects p as true, or assigns it significant probability.” I think there may well be something to objections in this vein, and I’m interested in thinking about the more; but I also want to flag that at a glance, it feels kind of hard to articulate them in general terms. Thus, suppose Bob has been wrong about ⁹⁹⁄₁₀₀ predictions in the past. And you say: “OK, but if Bob was going to be right about this one, despite being consistently wrong in the past, the world would look just like it does now. Show me that your epistemology is sensitive enough to assign high probability to Bob being right about this one, if he’s about to be.” But this seems like a tough standard; you just should have low probability on Bob being right about this one, even if he is. Not saying that’s the exact form of your objection, or even that it’s really getting at the heart of things, but maybe you could lay out your objection in a way that doesn’t apply to the Bob case?
(Responses to 1c below)