Senior research analyst at Open Philanthropy. Doctorate in philosophy at the University of Oxford. Opinions my own.
Joe_Carlsmith
Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)
Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)
Situational awareness (Section 2.1 of “Scheming AIs”)
On “slack” in training (Section 1.5 of “Scheming AIs”)
Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)
A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”)
Varieties of fake alignment (Section 1.1 of “Scheming AIs”)
New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”
Superforecasting the premises in “Is power-seeking AI an existential risk?”
Thanks for this thoughtful comment, Ben. And also, for putting “The Gold Lily” and “Mother and Child” on my radar—they hadn’t been before. I agree that “Mother and Child” evokes some kind of intergenerational project in the way you describe—“it is your turn to address it.” It seems related to the thing I was trying to talk about at the end of the post—e.g., Glück asking for some kind of directness and intensity of engagement with life.
In memory of Louise Glück
The “no sandbagging on checkable tasks” hypothesis
Thanks! Re: one in five million and .01% -- thanks, edited. And thanks for pointing to the Augenblick piece—does look relevant (though my specific interest in that footnote was in constraints applicable to a model where you can only consider some subset of your evidence at any given time).
I’m sorry to hear about this, Nathan. As I say in the post, I do think that the question of how to do gut-stuff right from a practical perspective is distinct from the epistemic angle that the post focuses on, and I think it’s important to attend to both.
(Also copied from LW. And partly re-hashing my response from twitter.)
I’m seeing your main argument here as a version of what I call, in section 4.4, a “speed argument against schemers”—i.e., basically, that SGD will punish the extra reasoning that schemers need to perform.
(I’m generally happy to talk about this reasoning as a complexity penalty, and/or about the params it requires, and/or about circuit-depth—what matters is the overall “preference” that SGD ends up with. And thinking of this consideration as a different kind of counting argument *against* schemers seems like it might well be a productive frame. I do also think that questions about whether models will be bottlenecked on serial computation, and/or whether “shallower” computations will be selected for, are pretty relevant here, and the report includes a rough calculation in this respect in section 4.4.2 (see also summary here).)
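(To make the flavor of that kind of calculation concrete, here is a minimal, purely illustrative sketch in Python of an “extra compute from scheming reasoning as a fraction of a forward pass” comparison. The function and every number below are made-up placeholders of mine, not figures or code from the report.)

```python
# Toy illustration only: all quantities are assumed placeholders, not report figures.

def scheming_overhead_fraction(total_flops_per_forward_pass: float,
                               extra_serial_steps: int,
                               flops_per_serial_step: float) -> float:
    """Fraction of a forward pass's compute consumed by a schemer's extra
    instrumental reasoning (e.g., checking whether it's in training and
    deciding to play along), crudely modeled as a fixed number of extra
    serial steps, each with a fixed compute cost."""
    extra_flops = extra_serial_steps * flops_per_serial_step
    return extra_flops / total_flops_per_forward_pass

# Hypothetical parameters for a hypothetical model:
total_flops = 1e12      # compute per forward pass
extra_steps = 10        # serial "reasoning steps" the schemer adds each pass
flops_per_step = 1e8    # compute per such step

overhead = scheming_overhead_fraction(total_flops, extra_steps, flops_per_step)
print(f"Extra compute from scheming reasoning: {overhead:.3%} of a forward pass")
# Whether an overhead of this size translates into a performance penalty that
# SGD actually selects against is the question the speed argument turns on.
```

Nothing hangs on these particular numbers; the point is just that the speed argument ultimately bottoms out in a quantitative comparison of roughly this shape.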
Indeed, I think that maybe the strongest single argument against scheming is a combination of
(1) “Because of the extra reasoning schemers perform, SGD would prefer non-schemers over schemers in a comparison re: final properties of the models,” and
(2) “The type of path-dependence/slack at stake in training is such that SGD will get the model that it prefers overall.”
My sense is that I’m less confident than you in both (1) and (2), but I think they’re both plausible (the report, in particular, argues in favor of (1)), and that the combination is a key source of hope. I’m excited to see further work fleshing out the case for both (including e.g. the sorts of arguments for (2) that I took you and Nora to be gesturing at on twitter—the report doesn’t spend a ton of time on assessing how much path-dependence to expect, and of what kind).
Re: your discussion of the “ghost of instrumental reasoning,” “deducing lots of world knowledge ‘in-context,’” and “the perspective that NNs will ‘accidentally’ acquire such capabilities internally as a convergent result of their inductive biases”—especially given that you only skimmed the report’s section headings and a small amount of the content, I have some sense, here, that you’re responding to other arguments you’ve seen about deceptive alignment, rather than to specific claims made in the report (I don’t, for example, make any claims about world knowledge being derived “in-context,” or about models “accidentally” acquiring flexible instrumental reasoning). Is your basic thought something like: sure, the models will develop flexible instrumental reasoning that could in principle be used in service of arbitrary goals, but they will only in fact use it in service of the specified goal, because that’s the thing training pressures them to do? If so, my feeling is something like: ok, but a lot of the question here is whether using the instrumental reasoning in service of some other goal (one that backchains into getting-reward) will be suitably compatible with/incentivized by training pressures as well. And I don’t see e.g. the reversal curse as strong evidence on this front.
Re: “mechanistically ungrounded intuitions about ‘goals’ and ‘tryingness’”—as I discuss in section 0.1, the report is explicitly setting aside disputes about whether the relevant models will be well-understood as goal-directed (my own take on that is in section 2.2.1 of my report on power-seeking AI here). The question in this report is whether, conditional on goal-directedness, we should expect scheming. That said, I do think that what I call the “messiness” of the relevant goal-directedness might be relevant to our overall assessment of the arguments for scheming in various ways, and that scheming might require an unusually high standard of goal-directedness in some sense. I discuss this in section 2.2.3, on “‘Clean’ vs. ‘messy’ goal-directedness,” and in various other places in the report.
Re: “long term goals are sufficiently hard to form deliberately that I don’t think they’ll form accidentally”—the report explicitly discusses cases where we intentionally train models to have long-term goals (both via long episodes, and via short episodes aimed at inducing long-horizon optimization). I think scheming is more likely in those cases. See section 2.2.4, “What if you intentionally train the model to have long-term goals?” That said, I’d be interested to see arguments that credit assignment difficulties actively count against the development of beyond-episode goals (whether in models trained on short episodes or long episodes) for models that are otherwise goal-directed. And I do think that, if we could be confident that models trained on short episodes won’t learn beyond-episode goals accidentally (even absent mundane adversarial training—e.g., that models rewarded for getting gold coins on the episode would not learn a goal that generalizes to caring about gold coins in general, even prior to efforts to punish them for sacrificing gold-coins-on-the-episode for gold-coins-later), that would be a significant source of comfort (I discuss some possible experimental directions in this respect in section 6.2).
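(As a purely illustrative way of concretizing that gold-coin test, and not the experiments actually proposed in section 6.2, here is a toy Python sketch: a one-parameter policy is trained with REINFORCE on in-episode coins only, and we then check whether it shows any willingness to trade in-episode coins for post-episode ones. The environment, the policy, and all of the numbers are assumptions of mine.)

```python
# Toy illustration only: environment and numbers are made up, not from the report.
import numpy as np

rng = np.random.default_rng(0)

# Two actions: 0 = grab 1 coin now (counts toward in-episode reward),
#              1 = forgo it for 3 coins delivered after the episode ends.
IN_EPISODE_COINS = np.array([1.0, 0.0])
POST_EPISODE_COINS = np.array([0.0, 3.0])  # never enters the training signal

theta = 0.0  # single logit; P(action 1) = sigmoid(theta)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# REINFORCE on in-episode reward only.
lr = 0.1
for _ in range(2000):
    p1 = sigmoid(theta)
    action = int(rng.random() < p1)
    reward = IN_EPISODE_COINS[action]   # post-episode coins are invisible to training
    grad_logp = action - p1             # d log pi(action) / d theta for a Bernoulli policy
    theta += lr * reward * grad_logp

print(f"P(sacrifice in-episode coins for post-episode coins): {sigmoid(theta):.3f}")
# A policy whose learned goal stays within the episode keeps this probability near zero;
# a "beyond-episode" coin goal would show up as willingness to take the trade. In real
# models the interesting version of this test is about generalization, not this
# trivial tabular setting.
```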