Senior research analyst at Open Philanthropy. Doctorate in philosophy at the University of Oxford. Opinions my own.
Joe_Carlsmith
Hi Ben,
This does seem like a helpful kind of content to include (here I think of Luke’s section on this, in the context of his work on moral patienthood). I’ll consider revising to say more in this vein. In the meantime, here are a few updates off the top of my head:
It now feels more salient to me just how many AI applications may be covered either by systems that aren’t agentic planning/strategically aware (including e.g. interacting modular systems, especially where humans are in the loop for some parts, and/or intuitively “sphexish/brittle” non-APS systems), or by systems that are specialized/myopic/limited in capability in various ways. That is, a generalized learning agent that’s superhuman (let alone better than e.g. all of human civilization) in ~all domains, with objectives as open-ended and long-term as “maximize paperclips,” now seems to me a much more specific type of system, and one whose role in an automated economy—especially early on—seems more unclear. (I discuss this a bit in Section 3, section 4.3.1.3, and section 4.3.2).
Thinking about the considerations discussed in the “unusual difficulties” section generally gave me more clarity about how this problem differs from safety problems arising in the context of other technologies (I think I had previously been putting more weight on considerations like “building technology that performs function F is easier than building some technology that performs function F safely and reliably,” which apply more generally).
I realized how much I had been implicitly conceptualizing the “alignment problem” as “we must give these AI systems objectives that we’re OK seeing pursued with ~arbitrary degrees of capability” (something akin to the “omni test”). Meeting standards in this vicinity (to the extent that they’re well defined in a given case) seems like a very desirable form of robustness (and I’m sympathetic to related comments from Eliezer to the effect that “don’t build systems that are searching for ways to kill you, even if you think the search will come up empty”), but I found it helpful to remember that the ultimate problem is “we need to ensure that these systems don’t seek power in misaligned ways on any inputs they’re in fact exposed to” (e.g., what I’m calling “practical PS-alignment”) -- a framing that leaves more conceptual room, at least, for options that don’t “get the objectives exactly right,” and/or that involve restricting a system’s capabilities/time horizons, preventing it from “intelligence exploding,” controlling its options/incentives, and so on (though I do think options in this vein raise their own issues, of the type that the “omni test” is meant to avoid; see 4.3.1.3, 4.3.2.3, and 4.3.3). I discuss this a bit in section 4.1.
I realized that my thinking re: “races to the bottom on safety” had been driven centrally by abstract arguments/models that could apply in principle to many industries (e.g., pharmaceuticals). It now seems to me a knottier and more empirical question how models of this kind will actually apply in a given real-world case re: AI. I discuss this a bit in section 5.3.1.
Hi Ben,
A few thoughts on this:
It seems possible that attempting to produce “great insight” or “simple arguments of world-shattering importance” warrants a methodology different from the one I’ve used here. But my aim here is humbler: to formulate and evaluate an existing argument that I and various others take seriously, and that lots of resources are being devoted to; and to come to initial, informal, but still quantitative best-guesses about the premises and conclusion, which people can (hopefully) agree/disagree with at a somewhat fine-grained level—e.g., a level that’s more difficult to engage on if one just gives overall estimates, or just says “significant probability,” “high enough to worry about,” etc.
In that vein, I think it’s possible you’re over-estimating how robust I take the premises and numbers here to be (I’m thinking here of your comments re: “very accurately carve the key parts of reality that are relevant,” and “trust the outcome number”). As I wrote in response to Rob above, my low-end/high-end range here is .1% to 40% (see footnote 179, previously 178), and in general, I hold the numbers here very lightly (I try to emphasize this in section 8).
FWIW, I think Superintelligence can be pretty readily seen as a multi-step argument (e.g., something like: superintelligence will happen eventually; fast take-off is plausible; if fast-take-off, then a superintelligence will probably get a decisive strategic advantage; alignment will be tricky; misalignment leads to power-seeking; therefore plausible doom). And more broadly, I think that people make arguments with many premises all the time (though sometimes the premises are suppressed). It’s true that people don’t usually assign probabilities to the premises (and Bostrom doesn’t, in Superintelligence—a fact that leaves the implied p(doom) correspondingly ambiguous) -- but I think this is centrally because assigning informal probabilities to claims (whether within a multi-step argument, or in general) just isn’t a very common practice, for reasons not centrally to do with e.g. multi-stage-fallacy type problems. Indeed, I expect I’d prefer a world where people assigned informal, lightly-held probabilities to their premises and conclusions (and formulated their arguments in premise-premise-conclusion form) more frequently.
I’m not sure exactly what you have in mind re: “examining a single worldview to see whether it’s consistent,” but consistency in a strict sense seems too cheap? E.g., “Bob has always been wrong before, but he’ll be right this time”; “Mortimer Snodgrass did it”; etc are all consistent. That said, my sense is that you have something broader in mind—maybe something like “plausible,” “compelling,” “sense-making,” etc. But it seems like these still leave the question of overall probabilities open...
Overall, my sense is that disagreement here is probably more productively focused on the object level—e.g., on the actual probabilities I give to the premises, and/or on pointing out and giving weight to scenarios that the premises don’t cover—rather than on the methodology in the abstract. In particular, I doubt that people who disagree a lot with my bottom line will end up saying: “If I was to do things your way, I’d roughly agree with the probabilities you gave to the premises; I just disagree that you should assign probabilities to premises in a multi-step argument as a way of thinking about issues like this.” Rather, I expect a lot of it comes down to substantive disagreement about the premises at issue (and perhaps, to people assigning significant credence to scenarios that don’t fit these premises, though I don’t feel like I’ve yet heard strong candidates—e.g., ones that seem to me to plausibly account for, say, >2/3rds of the overall X-risk from power-seeking, misaligned AI by 2070 -- in this regard).
Hi Michael —
I meant, in the post, for the following paragraphs to address the general issue you mention:
Some people don’t think that gratitude of this kind makes sense. Being created, we might say, can’t have been “better for” me, because if I hadn’t been created, I wouldn’t exist, and there would be no one that Wilbur’s choice was “worse for.” And if being created wasn’t better for me, the thought goes, then I shouldn’t be grateful to Wilbur for creating me.
Maybe the issues here are complicated, but at a high level: I don’t buy it. It seems to me very natural to see Wilbur as having done, for me, something incredibly significant — to have given me, on purpose, something that I value deeply. One option, for capturing this, is to say that something can be good for me, without being “better” for me (see e.g. McMahan (2009)). Another option is just to say that being created is better for me than not being created, even if I only exist — at least concretely — in one of the cases. Overall, I don’t feel especially invested in the metaphysics/semantics of “good for” and “better for” in this sort of case. I don’t have a worked out account of these issues, but neither do I see them as an especially forceful reason not to be glad that I’m alive, or grateful to someone who caused me to be so.
That is, I don’t take myself to be advocating directly for comparativism here (though a few bits of the language in the post, in particular the reference to “better off dead,” do suggest that). As the quoted paragraphs note, comparativism is one option; another is to say that creating me is good for me, even if it’s not better for me (a la McMahan).
FWIW, though, I do currently feel intuitively open/sympathetic to comparativism, partly because it seems plausible that we can truly say things like “Joe would prefer to live rather than not to live,” even if Joe doesn’t and never will exist; and clear that we can truly say “Joe prefers to live” in worlds where he does exist; and I tend to think about treating people well as centrally about being responsive to what they care about/would care about. But I haven’t tried to dig in on this stuff, partly because I see things like being glad I’m alive, and grateful to someone who caused me to be so, as on more generally solid ground than things like “betterness for Joe is a relation that requires two concrete Joe lives as relata” (see e.g. the Menagerie argument in Hilary’s powerpoint, p. 13, for the type of thing that makes me think that metaphysical premises like that aren’t a “super solid ground” type area).
At a higher level, though: the point I’m arguing against is specifically that the neutrality intuition is directly intuitive. I don’t see it that way, and the point of “poetically tugging at people’s intuitions” was precisely to try to illustrate and make vivid the intuitive situation as I see it. But as I note at the end — e.g., “direct intuitions about neutrality aren’t the only data available” — it’s a further question whether there is more to be said for neutrality overall (indeed, I think there is — though metaphysical issues like the ones you mention aren’t very central for me here). That said, I tend to see much of person-affecting ethics as driven at least in substantial part by appeal to direct intuition, so I do think it would change the overall dialectical landscape a bit if people come in going “intuitively, we have strong reasons to create happy lives. But there are some metaphysical/semantic questions about how to make sense of this…”
(Continued from comment on the main thread)
I’m understanding your main points/objections in this comment as:
You think the multiple stage fallacy might be the methodological crux behind our disagreement.
You think that >80% of AI safety researchers at MIRI, FHI, CHAI, OpenAI, and DeepMind would assign >10% probability to existential catastrophe from technical problems with AI (at some point, not necessarily before 2070). So it seems like 80k saying 1-10% reflects a disagreement with the experts, which would be strange in the context of e.g. climate change, and at least worth flagging/separating. (Presumably, something similar would apply to my own estimates.)
You worry that there are social reasons not to sound alarmist about weird/novel GCRs, and that it can feel “conservative” to low-ball rather than high-ball the numbers. But low-balling (and/or focusing on/making salient lower-end numbers) has serious downsides. And you worry that EA folks have a track record of mistakes in this vein.
(as before, let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p).
Re 1 (and 1c, from my response to the main thread): as I discuss in the document, I do think there are questions about multiple-stage fallacies, here, though I also think that not decomposing a claim into sub-claims can risk obscuring conjunctiveness (and I don’t see “abandon the practice of decomposing a claim into subclaims” as a solution to this). As an initial step towards addressing some of these worries, I included an appendix that reframes the argument using fewer premises (and also, in positive (e.g., “p is false”) vs. negative (“p is true”) forms). Of course, this doesn’t address e.g. the “the conclusion could be true, but some of the premises false” version of the “multiple stage fallacy” worry; but FWIW, I really do think that the premises here capture the majority of my own credence on p, at least. In particular, the timelines premise is fairly weak, and premises 4-6 are implied by basically any p-like scenario, so it seems like the main contenders for false premises (even while p is true) are 2 (“There will be strong incentives to build APS systems”) and 3 (“It will be much harder to develop APS systems that would be practically PS-aligned if deployed, than to develop APS systems that would be practically PS-misaligned if deployed (even if relevant decision-makers don’t know this), but which are at least superficially attractive to deploy anyway”). Here, I note the scenarios most salient to me in footnote 173, namely: “we might see unintentional deployment of practical PS-misaligned APS systems even if they aren’t superficially attractive to deploy” and “practically PS-misaligned systems might be developed and deployed even absent strong incentives to develop them (for example, simply for the sake of scientific curiosity).” But I don’t see these as constituting more than e.g. 50% of the risk.
If your own probability is driven substantially by scenarios where the premises I list are false, I’d be very curious to hear which ones (setting aside scenarios that aren’t driven by power-seeking, misaligned AI), and how much credence you give them. I’d also be curious, more generally, to hear your more specific disagreements with the probabilities I give to the premises I list.
Re: 2, your characterization of the distribution of views amongst AI safety researchers (outside of MIRI) is in some tension with my own evidence; and I consulted with a number of people who fit your description of “specialists”/experts in preparing the document. That said, I’d certainly be interested to see more public data in this respect, especially in a form that breaks down in (rough) quantitative terms the different factors driving the probability in question, as I’ve tried to do in the document (off the top of my head, the public estimates most salient to me are Ord (2020) at 10% by 2100, Grace et al (2017)’s expert survey (5% median, with no target date), and FHI’s (2008) survey (5% on extinction from superintelligent AI by 2100), though we could gather up others from e.g. LW and previous X-risk books). That said, importantly, and as indicated in my comment on the main thread, I don’t think of the community of AI safety researchers at the orgs you mention as in an epistemic position analogous to e.g. the IPCC, for a variety of reasons (and obviously, there are strong selection effects at work). Less importantly, I also don’t think the technical aspects of this problem are the only factors relevant to assessing risk; at this point I have some feeling of having “heard the main arguments”; and >10% (especially if we don’t restrict to pre-2070 scenarios) is within my “high-low” range mentioned in footnote 178 (e.g., .1%-40%).
Re: 3, I do think that the “conservative” thing to do here is to focus on the higher-end estimates (especially given uncertainty/instability in the numbers), and I may revise to highlight this more in the text. But I think we should distinguish between the project of figuring out “what to focus on”/what’s “appropriately conservative,” and what our actual best-guess probabilities are; and just as there are risks of low-balling for the sake of not looking weird/alarmist, I think there are risks of high-balling for the sake of erring on the side of caution. My aim here has been to do neither; though obviously, it’s hard to eliminate biases (in both directions).
Hi Rob,
Thanks for these comments.
Let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p. I’m understanding your main objections in this comment as:
It seems to you like we’re in a world where p is true, by default. Hence, 5% on p seems too low to you. In particular:
It implies 95% confidence on not p, which seems to you overly confident.
If p is true by default, you think the world would look like it does now; so if this world isn’t enough to get me above 5%, what would be?
Because p seems true to you by default, you suspect that an analysis that only ends up putting 5% on p involves something more than “the kind of mistake you should make in any ordinary way,” and requires some kind of mistake in methodology.
One thing I’ll note at the outset is the content of footnote 178, which (partly prompted by your comment) I may revise to foreground more in the main text: “In sensitivity tests, where I try to put in ‘low-end’ and ‘high-end’ estimates for the premises above, this number varies between ~.1% and ~40% (sampling from distributions over probabilities narrows this range a bit, but it also fails to capture certain sorts of correlations). And my central estimate varies between ~1-10% depending on my mood, what considerations are salient to me at the time, and so forth. This instability is yet another reason not to put too much weight on these numbers. And one might think variation in the direction of higher risk especially worrying.”
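For concreteness, the kind of sensitivity test the footnote describes just multiplies premise probabilities through the conjunction. The low-end and high-end values below are hypothetical placeholders (not the report’s actual estimates), chosen only to illustrate how a range spanning roughly ~.1% to ~40% can arise:

```python
import math

# Hypothetical low-end and high-end probabilities for the six premises.
# These are illustrative placeholders, NOT the estimates from the report.
low_end  = [0.40, 0.50, 0.15, 0.30, 0.25, 0.60]
high_end = [0.85, 0.90, 0.75, 0.85, 0.80, 0.95]

# Treat the premises as a conjunction and multiply through
# (this ignores correlations between premises, as the footnote notes).
p_low  = math.prod(low_end)
p_high = math.prod(high_end)

print(f"low-end estimate:  {p_low:.2%}")   # ~0.1%
print(f"high-end estimate: {p_high:.1%}")  # ~37%
```

Even modest per-premise differences compound multiplicatively across six premises, which is why the low-end and high-end products land orders of magnitude apart.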
Re 1a: I’m open to 5% being too low. Indeed, I take “95% seems awfully confident,” and related worries in that vein, seriously as an objection. However, as the range above indicates, I also feel open to 5% being too high (indeed, at times it seems that way to me), and I don’t see “it would be strange to be so confident that all of humanity won’t be killed/disempowered because of X” as a forceful argument on its own (quite the contrary): rather, I think we really need to look at the object-level evidence and argument for X, which is what the document tries to do (not saying that quote represents your argument; but hopefully it can illustrate why one might start from a place of being unsurprised if the probability turns out low).
Re 1b: I’m not totally sure I’ve understood you here, but here are a few thoughts. At a high level, one answer to “what sort of evidence would make me update towards p being more likely” is “the considerations discussed in the document that I see as counting against p don’t apply, or seem less plausible” (examples here include considerations related to longer timelines, non-APS/modular/specialized/myopic/constrained/incentivized/not-able-to-easily-intelligence-explode systems sufficing in lots/maybe ~all of incentivized applications, questions about the ease of eliminating power-seeking behavior on relevant inputs during training/testing given default levels of effort, questions about why and in what circumstances we might expect PS-misaligned systems to be superficially/sufficiently attractive to deploy, warning shots, corrective feedback loops, limitations to what APS systems with lopsided/non-crazily-powerful capabilities can do, general incentives to avoid/prevent ridiculously destructive deployment, etc, plus more general considerations like “this feels like a very specific way things could go”).
But we could also imagine more “outside view” worlds where my probability would be higher: e.g., there is a body of experts as large and established as the experts working on climate change, which uses quantitative probabilistic models of the quality and precision used by the IPCC, along with an understanding of the mechanisms underlying the threat as clear and well-established as the relationship between carbon emissions and climate change, to reach a consensus on much higher estimates. Or: there is a significant, well-established track record of people correctly predicting future events and catastrophes of this broad type decades in advance, and people with that track record predict p with >5% probability.
That said, I think maybe this isn’t getting at the core of your objection, which could be something like: “if in fact this is a world where p is true, is your epistemology sensitive enough to that? E.g., show me that your epistemology is such that, if p is true, it detects p as true, or assigns it significant probability.” I think there may well be something to objections in this vein, and I’m interested in thinking about them more; but I also want to flag that at a glance, it feels kind of hard to articulate them in general terms. Thus, suppose Bob has been wrong about 99⁄100 predictions in the past. And you say: “OK, but if Bob was going to be right about this one, despite being consistently wrong in the past, the world would look just like it does now. Show me that your epistemology is sensitive enough to assign high probability to Bob being right about this one, if he’s about to be.” But this seems like a tough standard; you just should have low probability on Bob being right about this one, even if he is. Not saying that’s the exact form of your objection, or even that it’s really getting at the heart of things, but maybe you could lay out your objection in a way that doesn’t apply to the Bob case?
(Responses to 1c below)
A few questions about this:
Does this view imply that it is actually not possible to have a world where e.g. a machine creates one immortal happy person per day, forever, who then form an ever-growing line?
How does this view interpret cosmological hypotheses on which the universe is infinite? Is the claim that actually, on those hypotheses, the universe is finite after all?
It seems like lots of the (countable) worlds and cases discussed in the post can simply be reframed as never-ending processes, no? And then similar (identical?) questions will arise? Thus, for example, w5 is equivalent to a machine that creates a1 at −1, then a3 at −1, then a5 at −1, etc. w6 is equivalent to a machine that creates a1 at −1, then a2 at −1, a3 at −1, etc. What would this view say about which of these machines we should create, given the opportunity? How should we compare these to a w8 machine that creates b1 at −1, b2 at −1, b3 at −1, b4 at −1, etc?
Re: the Jaynes quote: I’m not sure I’ve understood the full picture here, but in general, to me it doesn’t feel like the central issues here have to do with dependencies on “how the limit is approached,” such that requiring that each scenario pin down an “order” solves the problems. For example, I think that a lot of what seems strange about Neutrality-violations in these cases is that even if we pin down an order for each case, the fact that you can re-arrange one into the other makes it seem like they ought to be ethically equivalent. Maybe we deny that, and maybe we do so for reasons related to what you’re talking about—but it seems like the same bullet.
Thanks for doing this! I’ve found it useful, and I expect that it will increase my engagement with EA Forum/LW content going forward.
Thanks for making this, Michel :)
I have sympathy for responses like “look, it’s just so clear that you can’t control the past in any practically relevant sense that we should basically just assume the type of arguments in this post are wrong somehow.” But I’m curious where you think the arguments actually go wrong, if you have a view about that? For example, do you think defecting in perfect deterministic twin prisoner’s dilemmas with identical inputs is the way to go?
I’m imagining computers with sufficiently robust hardware to function deterministically at the software level, in the sense of very reliably performing the same computation, even if there’s quantum randomness at a lower level. Imagine two good-quality calculators, manufactured by the same factory using the same process, which add together the same two numbers using the same algorithm, and hence very reliably move through the same high-level memory states and output the same answer. If quantum randomness makes them output different answers, I count that as a “malfunction.”
Thanks for doing this!
I’m sorry to hear about this, Nathan. As I say in the post, I do think that the question how to do gut-stuff right from a practical perspective is distinct from the epistemic angle that the post focuses on, and I think it’s important to attend to both.
Thanks for these comments.
Re: “physics-based priors,” I don’t think I have a full sense of what you have in mind, but at a high level, I don’t yet see how physics comes into the debate. That is, AFAICT everyone agrees about the relevant physics — and in particular, that you can’t causally influence the past, “change” the past, and so on. The question as I see it (and perhaps I should’ve emphasized this more in the post, and/or put things less provocatively) is more conceptual/normative: whether when making decisions we should think of the past the way CDT does — e.g., as a set of variables whose probabilities our decision-making can’t alter — or in the way that e.g. EDT does — e.g., as a set of variables whose probabilities our decision-making can alter (and thus, a set of variables that EDT-ish decision-making implicitly tries to “control” in a non-causal sense). Non-causal decision theories are weird; but they aren’t actually “I don’t believe in normal physics” weird. They’re more “I believe in managing the news about the already-fixed past” weird.
Re: CDT’s domain of applicability, it sounds like your view is something like: “CDT generally works, but it fails in the type of cases that Joe treats as counter-examples to CDT.” I agree with this, and I think most people who reject CDT would agree, too (after all, most decision theories agree on what to do in most everyday cases; the traditional questions have been about what direction to go when their verdicts come apart). I’m inclined to think of this as CDT being wrong, because I’m inclined to think of decision theory as searching for the theory that will get the full range of cases right — but I’m not sure that much hinges on this. That said, I do think that even acknowledging that CDT fails sometimes involves rejecting some principles/arguments one might’ve thought would hold good in general (e.g. “c’mon, man, it’s no use trying to control the past,” the “what would your friend who can see what’s in the boxes say is better” argument, and so on) and thereby saying some striking and weird stuff (e.g. “Ok, it makes sense to try to control the past sometimes, just not that often”).
Re: 1-4, I agree that whether or not CDT leads you astray in a given case is an empirical question. I don’t have strong views about what range of actual cases are like this — though I’m sympathetic to your view re: 1, and as I mention in the post, I generally think we should just err on the side of not doing stuff that looks silly by normal lights. I also don’t have strong views about the relevance of non-causal decision-theory research for AGI safety (this project mostly emerged from personal interest).
I’m glad you liked it, Lukas. It does seem like an interesting question how your current confidence in your own values relates to your interest in further “idealization,” of what kind, and how much convergence makes a difference. Prima facie, it does seem plausible that greater confidence speaks in favor of “conservatism” about what sorts of idealization you go in for, though I can imagine very uncertain-about-their-values people opting for conservatism, too. Indeed, it seems possible that conservatism is just generally pretty reasonable, here.
Thanks! Re: one in five million and .01% -- thanks, edited. And thanks for pointing to the Augenblick piece—does look relevant (though my specific interest in that footnote was in constraints applicable to a model where you can only consider some subset of your evidence at any given time).
Hi Hadyn,
Thanks for your kind words, and for reading.
Thanks for pointing out these pieces. I like the breakdown of the different dimensions of long-term vs. near-term.
Broadly, I agree with you that the document could benefit from more about premise 5. I’ll consider revising to add some.
I’m definitely concerned about misuse scenarios too (and I think lines here can get blurry—see e.g. Katja Grace’s recent post); but I wanted, in this document, to focus on misalignment in particular. The question of how to weigh misuse vs. misalignment risk, and how the two are similar/different more generally, seems like a big one, so I’ll mostly leave it for another time (one big practical difference is that misalignment makes certain types of technical work more relevant).
Eventually, the disempowerment has to scale to ~all of humanity (a la premise 5), so that would qualify as TAI in the “transition as big of a deal as the industrial revolution” sense. However, it’s true that my timelines condition in premise 1 (e.g., APS systems become possible and financially feasible) is weaker than Ajeya’s.
Thanks for this thoughtful comment, Ben. And also, for putting “The Gold Lily” and “Mother and Child” on my radar—they hadn’t been before. I agree that “Mother and Child” evokes some kind of intergenerational project in the way you describe—“it is your turn to address it.” It seems related to the thing I was trying to talk about at the end of the post—e.g., Gluck asking for some kind of directness and intensity of engagement with life.
It’s a good question, and one I considered going into in more detail on in the post (I’ll add a link to this comment). I think it’s helpful to have in mind two types of people: “people who see the exact same evidence you do” (e.g., they look down on the same patterns of wrinkles on your hands, the same exact fading on the jeans they’re wearing, etc) and “people who might, for all you know about a given objective world, see the exact same evidence you do” (an example here would be “the person in room 2”). By “people in your epistemic situation,” I mean the former. The latter I think of as actually a disguised set of objective worlds, which posit different locations (and numbers) of the former-type people. But SIA, importantly, likes them both (though on my gloss, liking the former is more fundamental).
Here are some cases to illustrate. Suppose that God creates either one person in room 1 (if heads) or two people (if tails) in rooms 1 and 2. And suppose that there are two types of people: “Alices” and “Bobs.” Let’s say that any given Alice sees the exact same evidence as the other Alices (the same wrinkles, faded jeans, etc), and that the same holds for Bobs, and that if you’re an Alice or a Bob, you know it. Now consider three cases:
For each person God creates, he flips a second coin. If it’s heads, he creates an Alice. If tails, a Bob.
God flips a second coin. If it’s heads, he makes the person in room 1 Alice; if tails, Bob. But if the first coin was tails and he needs to create a second person, he makes that person different from the first. Thus, if tails-heads, it’s an Alice in room 1, and a Bob in room 2. But if it’s tails-tails, then it’s a Bob in room 1, and an Alice in room 2. (I talk about this case in part 4, XV.)
God creates all Alices no matter what.
Let’s write people’s names with “A” or “B,” in order of room number. And let’s say you wake up as an Alice.
In case one, “coin 1 heads” (I’ll write the coin-1 results in parentheses) corresponds to two objective worlds — A, and B — each with 1⁄4 prior probability. Coin 1 tails corresponds to four objective worlds — AA, AB, BA, and BB — each with 1/8th prior probability. So as Alice, you start by crossing off B and BB, because there are no Alices. So you’re left with 1⁄4 on A, and 1/8th on each of AA, AB, and BA, so an overall odds-ratio of 2:1:1:1. But now, as SIA, you scale the prior in proportion to the number of Alices there are, so AA gets double weight. Now you’re 2:2:1:1. Thus, you end up with 1/3rd on A, 1⁄3 on AA (with 1/6th on each of the corresponding centered worlds), and 1/6th on each of AB and BA. And you’re a “thirder” overall.
Now let’s look at case two. Here, the prior is 1/4 on A, 1/4 on B, 1/4 on AB, and 1/4 on BA. So SIA doesn’t actually do any scaling of the prior: there’s a maximum of one Alice in each world. Rather, it crosses off B, ends up with 1/3 on each of the rest, and stays a “thirder” overall.
Case three is just Sleeping Beauty: SIA scales in proportion to the number of Alices, and ends up a thirder overall.
So in each of these cases, SIA gives the same result, even though the distribution of Alices is in some sense pretty different. And notice, we can redescribe case 1 and 2 in terms of SIA liking “people who, for all you know about a given objective world, might be an Alice” instead of in terms of SIA liking Alices. E.g., in both cases, there are twice as many such people on tails. But importantly, their probability of being an Alice isn’t correlated with coin 1 heads vs. coin 1 tails.
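If it helps, the arithmetic for all three cases can be checked with a short script (my own illustration, not from the original discussion). Objective worlds are keyed by their occupants in room order (“AB” = an Alice in room 1, a Bob in room 2), and P(coin-1 heads) is the total posterior on single-occupant worlds:

```python
from fractions import Fraction

def sia_posterior(prior, evidence="A"):
    # SIA: scale each objective world's prior by the number of people in it
    # who share your evidence (here, Alices), then renormalize.
    # Worlds containing no such people drop out automatically.
    weighted = {w: p * w.count(evidence) for w, p in prior.items()}
    total = sum(weighted.values())
    return {w: p / total for w, p in weighted.items() if p > 0}

# Case 1: a second coin is flipped for each person God creates.
case1 = {"A": Fraction(1, 4), "B": Fraction(1, 4),
         "AA": Fraction(1, 8), "AB": Fraction(1, 8),
         "BA": Fraction(1, 8), "BB": Fraction(1, 8)}
# Case 2: one second coin; any second person differs from the first.
case2 = {"A": Fraction(1, 4), "B": Fraction(1, 4),
         "AB": Fraction(1, 4), "BA": Fraction(1, 4)}
# Case 3: all Alices (Sleeping Beauty).
case3 = {"A": Fraction(1, 2), "AA": Fraction(1, 2)}

post1, post2, post3 = (sia_posterior(c) for c in (case1, case2, case3))
# P(coin-1 heads) = posterior on the single-occupant worlds.
p_heads = [sum(p for w, p in post.items() if len(w) == 1)
           for post in (post1, post2, post3)]
```

Running this gives 1/3 on heads in every case, matching the verdicts above, even though the scaling step only does real work in cases 1 and 3.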
Anthropics cases are sometimes ambiguous about whether they’re talking about cases of type 1 or of type 3. God’s coin toss is closer to case 1: you wake up as a person in a room, but we didn’t specify that God was literally making exact copies of you in the other rooms; your reasoning, though, treats his probability of giving any particular objective-world person your exact evidence as constant across people. Sleeping Beauty is often treated as more like case 3, but it’s compatible with being more of a case-1 type (e.g., if the experimenters also flip another coin on each waking, and leave it for Beauty to see, this doesn’t make a difference; and in general, the Beauties could have different subjective experiences on each waking, as long as — as far as Beauty knows — these variations in experience are independent of the coin toss outcome). I’m not super careful about these distinctions in the post, partly because actually splitting out all of the possible objective worlds in type-1 cases isn’t really doable (there’s no well-defined distribution that God is “choosing from” when he creates each person in God’s coin toss — but his choice is treated, from your perspective, as independent of the coin toss outcome); and, as noted, SIA’s verdicts end up the same.
Cool, this gives me a clearer picture of where you’re coming from. I had meant the central question of the post to be whether it ever makes sense to do the EDT-ish try-to-control-the-past thing, even in pretty unrealistic cases—partly because I think answering “yes” to this is weird and disorienting in itself, even if it doesn’t end up making much of a practical difference day-to-day; and partly because a central objection to EDT is that the past, being already fixed, is never controllable in any practically-relevant sense, even in e.g. Newcomb’s cases. It sounds like your main claim is that in our actual everyday circumstances, with respect to things like the WWI case, EDTish and CDT recommendations don’t come apart—a topic I don’t spend much time on or have especially strong views about.
“you’re going to lean on the difference between ‘cause’ and ‘control’” — indeed, and I had meant the “no causal interaction with” part of the opening sentence to indicate this. It does seem like various readers objected to/were confused by the use of the term “control” here, and I think there’s room for more emphasis early on as to what specifically I have in mind. But at a high level, I’m inclined to keep the term “control,” rather than trying to rephrase things solely in terms of e.g. correlations, because I think it makes sense to think of yourself as, for practical purposes, “controlling” what your copy writes on his whiteboard, what Omega puts in the boxes, etc.; because, more broadly, EDT-ish decision-making is in fact weird in the way that trying to control the past is weird; and because this makes it all the more striking and worth highlighting that EDT-ish decision-making seems, sometimes, like the right way to go.
(Copying over my response from LessWrong)
Thanks for writing this—I’m very excited about people pushing back on/digging deeper re: counting arguments, simplicity arguments, and the other arguments re: scheming I discuss in the report. Indeed, despite the general emphasis I place on empirical work as the most promising source of evidence re: scheming, I also think that there’s a ton more to do to clarify and maybe debunk the more theoretical arguments people offer re: scheming – and I think playing out the dialectic further in this respect might well lead to comparatively fast progress (for all their centrality to the AI risk discourse, I think arguments re: scheming have received way too little direct attention). And if, indeed, the arguments for scheming are all bogus, this is super good news and would be an important update, at least for me, re: p(doom) overall. So overall I’m glad you’re doing this work and think this is a valuable post.
One other note up front: I don’t think this post “surveys the main arguments that have been put forward for thinking that future AIs will scheme.” In particular: both counting arguments and simplicity arguments (the two types of argument discussed in the post) assume we can ignore the path that SGD takes through model space. But the report also discusses two arguments that don’t make this assumption — namely, the “training-game independent proxy goals story” (I think this one is possibly the most common story; see e.g. Ajeya here, and all the talk about the evolution analogy) and the “nearest max-reward goal argument.” I think that the idea that “a wide variety of goals can lead to scheming” plays some role in these arguments as well, but not such that they are just the counting argument restated, and I think they’re worth treating on their own terms.
On counting arguments and simplicity arguments
Focusing just on counting arguments and simplicity arguments, though: Suppose that I’m looking down at a superintelligent model newly trained on diverse, long-horizon tasks. I know that it has extremely ample situational awareness – e.g., it has highly detailed models of the world, the training process it’s undergoing, the future consequences of various types of power-seeking, etc – and that it’s getting high reward because it’s pursuing some goal (the report conditions on this). Ok, what sort of goal?
We can think of arguments about scheming in two categories here.
(I) The first tries to be fairly uncertain/agnostic about what sorts of goals SGD’s inductive biases favor, and it argues that given this uncertainty, we should be pretty worried about scheming.
I tend to think of my favored version of the counting argument (that is, the hazy counting argument) in these terms.
(II) The second type focuses on a particular story about SGD’s inductive biases and then argues that this bias favors schemers.
I tend to think of simplicity arguments in these terms. E.g., the story is that SGD’s inductive biases favor simplicity, schemers can have simpler goals, so schemers are favored.
Let’s focus first on (I), the more-agnostic-about-SGD’s-inductive-biases type. Here’s a way of pumping the sort of intuition at stake in the hazy counting argument:
1. A very wide variety of goals can prompt scheming.
2. By contrast, non-scheming goals need to be much more specific to lead to high reward.
3. I’m not sure exactly what sorts of goals SGD’s inductive biases favor, but I don’t have strong reason to think they actively favor non-schemer goals.
4. So, absent further information, and given how many goals-that-get-high-reward are schemer-like, I should be pretty worried that this model is a schemer.
Now, as I mention in the report, I’m happy to grant that this isn’t a super rigorous argument. But how, exactly, is your post supposed to comfort me with respect to it? We can consider two objections, both of which are present in/suggested by your post in various ways.
(A) This sort of reasoning would lead you to give significant weight to SGD overfitting. But SGD doesn’t overfit, so this sort of reasoning must be going wrong, and in fact you should have low probability on SGD having selected a schemer, even given this ignorance about SGD’s inductive biases.
(B) Premise (3) is false: we know enough about SGD’s inductive biases to know that it actively favors non-scheming goals over scheming goals.
Let’s start with (A). I agree that this sort of reasoning would lead you to give significant weight to SGD overfitting, absent any further evidence. But it’s not clear to me that giving this sort of weight to overfitting was unreasonable ex ante, or that, having learned that SGD doesn’t overfit, you should now end up with low p(scheming) even given your ongoing ignorance about SGD’s inductive biases.
Thus, consider the sort of analogy I discuss in the counting arguments section. Suppose that all we know is that Bob lives in city X, that he went to a restaurant on Saturday, and that city X has a thousand Chinese restaurants, a hundred Mexican restaurants, and one Indian restaurant. What should our probability be that he went to a Chinese restaurant?
My intuitive answer here is: “hefty.”[1] In particular, absent further knowledge about Bob’s food preferences, and given the large number of Chinese restaurants in the city, “he went to a Chinese restaurant” seems like a pretty salient hypothesis. And it seems quite strange to be confident that he went to a non-Chinese restaurant instead.
Ok but now suppose you learn that last week, Bob also engaged in some non-restaurant leisure activity. For such leisure activities, the city offers: a thousand movie theaters, a hundred golf courses, and one escape room. So it would’ve been possible to make a similar argument for putting hefty credence on Bob having gone to a movie. But lo, it turns out that actually, Bob went golfing instead, because he likes golf more than movies or escape rooms.
How should you update about the restaurant Bob went to? Well… it’s not clear to me you should update much. Applied both to leisure and to restaurants, the hazy counting argument is trying to be fairly agnostic about Bob’s preferences, while giving some weight to some type of “count.” Trying to be uncertain and agnostic does indeed often mean putting hefty probabilities on things that end up false. But: do you have a better proposed alternative, such that you shouldn’t put hefty probability on “Bob went to a Chinese restaurant” here, because e.g. you learned that hazy counting arguments don’t work when applied to Bob? If so, what is it? And doesn’t it seem like it’s giving the wrong answer?
Or put another way: suppose you didn’t yet know whether SGD overfits or not, but you knew e.g. about the various theoretical problems with unrestricted uses of the indifference principle. What should your probability have been, ex ante, on SGD overfitting? I’m pretty happy to say “hefty,” here. E.g., it’s not clear to me that the problem, re: hefty-probability-on-overfitting, was some a priori problem with hazy-counting-argument-style reasoning. For example: given your philosophical knowledge about the indifference principle, but without empirical knowledge about ML, should you have been super surprised if it turned out that SGD did overfit? I don’t think so.
Now, you could be making a different, more B-ish sort of argument here: namely, that the fact that SGD doesn’t overfit actively gives us evidence that SGD’s inductive biases also disfavor schemers. This would be akin to having seen Bob, in a different city, actively seek out Mexican restaurants despite there being many more Chinese restaurants available, such that you now have active evidence that he prefers Mexican food and is willing to work for it. This wouldn’t be a case of having learned that Bob’s preferences are such that hazy counting arguments “don’t work on Bob” in general. But it would be evidence that Bob prefers non-Chinese food.
I’m pretty interested in arguments of this form. But I think that pretty quickly, they move into the territory of type (II) arguments above: that is, they start to say something like “we learn, from SGD not overfitting, that it prefers models of type X. Non-scheming models are of type X, schemers are not, so we now know that SGD won’t prefer schemers.”
But what is X? I’m not sure what your answer is (though maybe it will come in a later post). You could say something like “SGD prefers models that are ‘natural’” — but then, are schemers natural in that sense? Or, you could say “SGD prefers models that behave similarly on the training and test distributions” — but in what sense is a schemer violating this standard? On both distributions, a schemer seeks after their schemer-like goal. I’m not saying you can’t make an argument for a good X, here — but I haven’t yet heard it. And I’d want to hear its predictions about non-scheming forms of goal-misgeneralization as well.
Indeed, my understanding is that a quite salient candidate for “X” here is “simplicity” – e.g., that SGD’s not overfitting is explained by its bias towards simpler functions. And this puts us in the territory of the “simplicity argument” above. I.e., we’re now being less agnostic about SGD’s preferences, and instead positing some more particular bias. But there’s still the question of whether this bias favors schemers or not, and the worry is that it does.
This brings me to your take on simplicity arguments. I agree with you that simplicity arguments are often quite ambiguous about the notion of simplicity at stake (see e.g. my discussion here). And I think they’re weak for other reasons too (in particular, the extra cognitive faff scheming involves seems to me more important than its enabling simpler goals).
But beyond “what is simplicity anyway,” you also offer some other considerations, other than SGD-not-overfitting, meant to suggest that we have active evidence that SGD’s inductive biases disfavor schemers. I’m not going to dig deep on those considerations here, and I’m looking forward to your future post on the topic. For now, my main reaction is: “we have active evidence that SGD’s inductive biases disfavor schemers” seems like a much more interesting claim/avenue of inquiry than trying to nail down the a priori philosophical merits of counting arguments/indifference principles, and if you believe we have that sort of evidence, I think it’s probably most productive to just focus on fleshing it out and examining it directly. That is, whatever their a priori merits, counting arguments are attempting to proceed from a position of lots of uncertainty and agnosticism, which only makes sense if you’ve got no other good evidence to go on. But if we do have such evidence (e.g., if (3) above is false), then I think it can quickly overcome whatever “prior” counting arguments set (e.g., if you learn that Bob has a special passion for Mexican food and hates Chinese food, you can update far towards him heading to a Mexican restaurant). In general, I’m very excited for people to take our best current understanding of SGD’s inductive biases (it’s not my area of expertise), and apply it to p(scheming), and am interested to hear your own views in this respect. But if we have active evidence that SGD’s inductive biases point away from schemers, I think that whether counting arguments are good absent such evidence matters way less, and I, for one, am happy to pay them less attention.
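As a toy illustration of the point about evidence swamping the counting “prior” (my own sketch, with made-up preference numbers), model P(venue) as proportional to the venue count times a preference weight:

```python
from fractions import Fraction

# Venue counts from the Bob example.
counts = {"chinese": 1000, "mexican": 100, "indian": 1}

def venue_posterior(counts, pref=None):
    # P(venue) is proportional to count(venue) * preference weight.
    # pref=None is the hazy-counting baseline: indifference across venues.
    pref = pref or {k: 1 for k in counts}
    weighted = {k: Fraction(c) * pref[k] for k, c in counts.items()}
    total = sum(weighted.values())
    return {k: v / total for k, v in weighted.items()}

# Counting baseline: Chinese dominates on numbers alone (1000/1101, ~0.91).
baseline = venue_posterior(counts)

# Hypothetical active evidence: a 100x preference for Mexican food
# swamps the 10x count advantage (10000/11001, ~0.91 on Mexican).
informed = venue_posterior(counts, {"chinese": 1, "mexican": 100, "indian": 1})
```

The specific 100x weight is invented for illustration; the point is just that once the preference evidence is strong enough, the count contributes little to the final answer.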
(One other comment re: your take on simplicity arguments: it seems intuitively pretty non-simple to me to fit the training data on the training distribution, and then cut to some very different function on the test data, e.g. the identity function or the constant function. So not sure your parody argument that simplicity also predicts overfitting works. And insofar as simplicity is supposed to be the property had by non-overfitting functions, it seems somewhat strange if positing a simplicity bias predicts over-fitting after all.)
A few other comments
Re: goal realism, it seems like the main argument in the post is something like:
Michael Huemer says that it’s sometimes OK to use the principle of indifference if you’re applying it to explanatorily fundamental variables.
But goals won’t be explanatorily fundamental. So the principle of indifference is still bad here.
I haven’t yet heard much reason to buy Huemer’s view, so not sure how much I care about debating whether we should expect goals to satisfy his criteria of fundamentality. But I’ll flag that I do feel like there’s a pretty robust way in which explicitly-represented goals appropriately enter into our explanations of human behavior — e.g., I buy a flight to New York because I want to go to New York; I have a representation of that goal and of how my flight-buying achieves it; etc. And it feels to me like your goal reductionism is at risk of not capturing this. (To be clear: I do think that how we understand goal-directedness matters for scheming — more here — and that if models are only goal-directed in a pretty deflationary sense, this makes scheming a way weirder hypothesis. But I think that if models are as goal-directed as strategic and agentic humans reasoning about how to achieve explicitly represented goals, their goal-directedness has met a fairly non-deflationary standard.)
I’ll also flag some broader unclarity about the post’s underlying epistemic stance. You rightly note that the strict principle of indifference has many philosophical problems. But it doesn’t feel to me like you’ve given a compelling alternative account of how to reason “on priors” in the sorts of cases where we’re sufficiently uncertain that there’s a temptation to spread one’s credence over many possibilities in the broad manner that principles-of-indifference-ish reasoning attempts to do.
Thus, for example, how does your epistemology think about a case like “There are 1000 people in this town, one of them is the murderer, what’s the probability that it’s Mortimer P. Snodgrass?” Or: “there are a thousand white rooms, you wake up in one of them, what’s the probability that it’s room number 734?” These aren’t cases like dice, where there’s a random process designed to function in principle-of-indifference-ish ways. But it’s pretty tempting to spread your credence out across the people/rooms (even if in not-fully-uniform ways), in a manner that feels closely akin to the sort of thing that principle-of-indifference-ish reasoning is trying to do. (We can say “just use all the evidence available to you”—but why should this result in such principle-of-indifference-ish results?)
Your critique of counting arguments would be more compelling to me if you had a fleshed-out account of cases like these — e.g., one which captures the full range of cases where we’re pulled towards something principle-of-indifference-ish, such that you can then take that account and explain why it shouldn’t point us towards hefty probabilities on schemers, a la the hazy counting argument, even given very little evidence about SGD’s inductive biases.
More to say on all this, and I haven’t covered various ways in which I’m sympathetic to/moved by points in the vicinity of the ones you’re making here. But for now: thanks again for writing; looking forward to future installments.