(I might write a longer response later, but I thought it would be worth writing a quick response now.)
I have a few points of agreement and a few points of disagreement:
Agreements:
The strict counting argument seems very weak as an argument for scheming, essentially for the reason you identified: it relies on a uniform prior over AI goals, which seems like a really bad model of the situation.
The hazy counting argument—while stronger than the strict counting argument—still seems like weak evidence for scheming. One way of seeing this is, as you pointed out, to show that essentially identical arguments can be applied to deep learning in different contexts that nonetheless contradict empirical evidence.
Some points of disagreement:
I think the title overstates the strength of the conclusion. The hazy counting argument seems weak to me but I don’t think it’s literally “no evidence” for the claim here: that future AIs will scheme.
I disagree with the bottom-line conclusion: “we should assign very low credence to the spontaneous emergence of scheming in future AI systems—perhaps 0.1% or less”
I think it’s too early to be very confident in sweeping claims about the behavior or inner workings of future AI systems, especially in the long-run. I don’t think the evidence we have about these things is very strong right now.
One caveat: I think the claim here is vague. I don’t know what counts as “spontaneous emergence”, for example. And I don’t know how to operationalize AI scheming. I personally think scheming comes in degrees: some forms of scheming might be relatively benign and mild, and others could be more extreme and pervasive.
Ultimately I think you’ve only rebutted one argument for scheming—the counting argument. A more plausible argument for scheming, in my opinion, is simply that the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don’t scheme. Actors such as AI labs have strong incentives to be vigilant against these types of mistakes when training AIs, but I don’t expect people to come up with perfect solutions. So I’m not convinced that AIs won’t scheme at all.
If by “scheming” all you mean is that an agent deceives someone in order to get power, I’d argue that many humans scheme all the time. Politicians routinely scheme, for example, by pretending to have values that are more palatable to the general public, in order to receive votes. Society bears some costs from scheming, and pays costs to mitigate the effects of scheming. Combined, these costs are not crazy-high fractions of GDP; but nonetheless, scheming is a constant fact of life.
If future AIs are “as aligned as humans”, then AIs will probably scheme frequently. I think an important question is how intensely and how pervasively AIs will scheme; and thus, how much society will have to pay as a result of scheming. If AIs scheme way more than humans, then this could be catastrophic, but I haven’t yet seen any decent argument for that theory.
So ultimately I am skeptical that AI scheming will cause human extinction or disempowerment, but probably for different reasons than the ones in your essay: I think the negative effects of scheming can probably be adequately mitigated by paying some costs even if it arises.
I don’t think you need to believe in any strong version of goal realism in order to accept the claim that AIs will intuitively have “goals” that they robustly attempt to pursue. It seems pretty natural to me that people will purposely design AIs that have goals in an ordinary sense, and some of these goals will be “misaligned” in the sense that the designer did not intend for them. My relative optimism about AI scheming doesn’t come from thinking that AIs won’t robustly pursue goals, but instead comes largely from my beliefs that:
AIs, like all real-world agents, will be subject to constraints when pursuing their goals. These constraints include things like the fact that it’s extremely hard and risky to take over the whole world and then optimize the universe exactly according to what you want. As a result, AIs with goals that differ from what humans (and other AIs) want, will probably end up compromising and trading with other agents instead of pursuing world takeover. This is a benign failure and doesn’t seem very bad.
The amount of investment we put into mitigating scheming is not an exogenous variable, but instead will respond to evidence about how pervasive scheming is in AI systems, and how big of a deal AI scheming is. And I think we’ll accumulate lots of evidence about the pervasiveness of AI scheming in deep learning over time (e.g. such as via experiments with model organisms of alignment), allowing us to set the level of investment in AI safety at a reasonable level as AI gets incrementally more advanced.
If we experimentally determine that scheming is very important and very difficult to mitigate in AI systems, we’ll probably respond by spending a lot more money on mitigating scheming, and vice versa. In effect, I don’t think we have good reasons to think that society will spend a suboptimal amount on mitigating scheming.
I think the title overstates the strength of the conclusion
This seems like an isolated demand for rigor to me. I think it’s fine to say something is “no evidence” when, speaking pedantically, it’s only a negligible amount of evidence.
Ultimately I think you’ve only rebutted one argument for scheming—the counting argument
I mean, we do in fact discuss the simplicity argument, although we don’t go in as much depth.
the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don’t scheme
Without a concrete proposal about what that might look like, I don’t feel the need to address this possibility.
If future AIs are “as aligned as humans”, then AIs will probably scheme frequently
I think future AIs will be much more aligned than humans, because we will have dramatically more control over them than over humans.
I don’t think you need to believe in any strong version of goal realism in order to accept the claim that AIs will intuitively have “goals” that they robustly attempt to pursue.
We did not intend to deny that some AIs will be well-described as having goals.
Over on LessWrong, the phrase is more common, but the tophits are multiple posts that specifically argue against the phrase in the abstract. So overall I would not consider it an isolated demand for rigor if someone were to argue against the phrase “no evidence” on either forum.
This seems like an isolated demand for rigor to me. I think it’s fine to say something is “no evidence” when, speaking pedantically, it’s only a negligible amount of evidence.
I think that’s fair, but I’m still admittedly annoyed at this usage of language. I don’t think it’s an isolated demand for rigor because I have personally criticized many other similar uses of “no evidence” in the past.
I think future AIs will be much more aligned than humans, because we will have dramatically more control over them than over humans.
That’s plausible to me, but I’m perhaps not as optimistic as you are. I think AIs might easily end up becoming roughly as misaligned with humans as humans are to each other, at least eventually.
We did not intend to deny that some AIs will be well-described as having goals.
If you agree that AIs will intuitively have goals that they robustly pursue, I guess I’m just not sure why you thought it was important to rebut goal realism? You wrote,
The goal realist perspective relies on a trick of language. By pointing to a thing inside an AI system and calling it an “objective”, it invites the reader to project a generalized notion of “wanting” onto the system’s imagined internal ponderings, thereby making notions such as scheming seem more plausible.
But I think even on a reductionist view, it can make sense to talk about AIs “wanting” things, just like it makes sense to talk about humans wanting things. I’m not sure why you think this distinction makes much of a difference.
The goal realism section was an argument in the alternative. If you just agree with us that the indifference principle is invalid, then the counting argument fails, and it doesn’t matter what you think about goal realism.
If you think that some form of indifference reasoning still works— in a way that saves the counting argument for scheming— the most plausible view on which that’s true is goal realism combined with Huemer’s restricted indifference principle. We attack goal realism to try to close off that line of reasoning.
(I might write a longer response later, but I thought it would be worth writing a quick response now.)
I have a few points of agreement and a few points of disagreement:
Agreements:
The strict counting argument seems very weak as an argument for scheming, essentially for the reason you identified: it relies on a uniform prior over AI goals, which seems like a really bad model of the situation.
The hazy counting argument—while stronger than the strict counting argument—still seems like weak evidence for scheming. One way of seeing this is, as you pointed out, to show that essentially identical arguments can be applied to deep learning in different contexts that nonetheless contradict empirical evidence.
Some points of disagreement:
I think the title overstates the strength of the conclusion. The hazy counting argument seems weak to me but I don’t think it’s literally “no evidence” for the claim here: that future AIs will scheme.
I disagree with the bottom-line conclusion: “we should assign very low credence to the spontaneous emergence of scheming in future AI systems—perhaps 0.1% or less”
I think it’s too early to be very confident in sweeping claims about the behavior or inner workings of future AI systems, especially in the long-run. I don’t think the evidence we have about these things is very strong right now.
One caveat: I think the claim here is vague. I don’t know what counts as “spontaneous emergence”, for example. And I don’t know how to operationalize AI scheming. I personally think scheming comes in degrees: some forms of scheming might be relatively benign and mild, and others could be more extreme and pervasive.
Ultimately I think you’ve only rebutted one argument for scheming—the counting argument. A more plausible argument for scheming, in my opinion, is simply that the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don’t scheme. Actors such as AI labs have strong incentives to be vigilant against these types of mistakes when training AIs, but I don’t expect people to come up with perfect solutions. So I’m not convinced that AIs won’t scheme at all.
If by “scheming” all you mean is that an agent deceives someone in order to get power, I’d argue that many humans scheme all the time. Politicians routinely scheme, for example, by pretending to have values that are more palatable to the general public, in order to receive votes. Society bears some costs from scheming, and pays costs to mitigate the effects of scheming. Combined, these costs are not crazy-high fractions of GDP; but nonetheless, scheming is a constant fact of life.
If future AIs are “as aligned as humans”, then AIs will probably scheme frequently. I think an important question is how intensely and how pervasively AIs will scheme; and thus, how much society will have to pay as a result of scheming. If AIs scheme way more than humans, then this could be catastrophic, but I haven’t yet seen any decent argument for that theory.
So ultimately I am skeptical that AI scheming will cause human extinction or disempowerment, but probably for different reasons than the ones in your essay: I think the negative effects of scheming can probably be adequately mitigated by paying some costs even if it arises.
I don’t think you need to believe in any strong version of goal realism in order to accept the claim that AIs will intuitively have “goals” that they robustly attempt to pursue. It seems pretty natural to me that people will purposely design AIs that have goals in an ordinary sense, and some of these goals will be “misaligned” in the sense that the designer did not intend for them. My relative optimism about AI scheming doesn’t come from thinking that AIs won’t robustly pursue goals, but instead comes largely from my beliefs that:
AIs, like all real-world agents, will be subject to constraints when pursuing their goals. These constraints include things like the fact that it’s extremely hard and risky to take over the whole world and then optimize the universe exactly according to what you want. As a result, AIs with goals that differ from what humans (and other AIs) want, will probably end up compromising and trading with other agents instead of pursuing world takeover. This is a benign failure and doesn’t seem very bad.
The amount of investment we put into mitigating scheming is not an exogenous variable, but instead will respond to evidence about how pervasive scheming is in AI systems, and how big of a deal AI scheming is. And I think we’ll accumulate lots of evidence about the pervasiveness of AI scheming in deep learning over time (e.g. such as via experiments with model organisms of alignment), allowing us to set the level of investment in AI safety at a reasonable level as AI gets incrementally more advanced.
If we experimentally determine that scheming is very important and very difficult to mitigate in AI systems, we’ll probably respond by spending a lot more money on mitigating scheming, and vice versa. In effect, I don’t think we have good reasons to think that society will spend a suboptimal amount on mitigating scheming.
This seems like an isolated demand for rigor to me. I think it’s fine to say something is “no evidence” when, speaking pedantically, it’s only a negligible amount of evidence.
I mean, we do in fact discuss the simplicity argument, although we don’t go in as much depth.
Without a concrete proposal about what that might look like, I don’t feel the need to address this possibility.
I think future AIs will be much more aligned than humans, because we will have dramatically more control over them than over humans.
We did not intend to deny that some AIs will be well-described as having goals.
Minor, but: searching on the EA Forum, your post and Quentin Pope’s post are the only posts with the exact phrase “no evidence” (EDIT: in the title, which weakens my point significantly but it still holds) The closest other match on the first page is There is little (good) evidence that aid systematically harms political institutions, which to my eyes seem substantially more caveated.
Over on LessWrong, the phrase is more common, but the top hits are multiple posts that specifically argue against the phrase in the abstract. So overall I would not consider it an isolated demand for rigor if someone were to argue against the phrase “no evidence” on either forum.
I think that’s fair, but I’m still admittedly annoyed at this usage of language. I don’t think it’s an isolated demand for rigor because I have personally criticized many other similar uses of “no evidence” in the past.
That’s plausible to me, but I’m perhaps not as optimistic as you are. I think AIs might easily end up becoming roughly as misaligned with humans as humans are to each other, at least eventually.
If you agree that AIs will intuitively have goals that they robustly pursue, I guess I’m just not sure why you thought it was important to rebut goal realism? You wrote,
But I think even on a reductionist view, it can make sense to talk about AIs “wanting” things, just like it makes sense to talk about humans wanting things. I’m not sure why you think this distinction makes much of a difference.
The goal realism section was an argument in the alternative. If you just agree with us that the indifference principle is invalid, then the counting argument fails, and it doesn’t matter what you think about goal realism.
If you think that some form of indifference reasoning still works— in a way that saves the counting argument for scheming— the most plausible view on which that’s true is goal realism combined with Huemer’s restricted indifference principle. We attack goal realism to try to close off that line of reasoning.