I don’t think of total doom as inevitable, but I certainly do see it as a default—without concerted effort to make AI safe, it will not be.
This may come down to a semantic dispute about what we mean by “default”. What I typically mean by “default” is something like “without major intervention from the longtermist community”. That is quite different from the default of “[no] concerted effort to make AI safe”, which I agree would be disastrous.
Under this definition of “default”, I don’t think the default outcome is one without any safety research. I think our understanding of the default outcome can be informed by society’s general level of risk-aversion to new technologies, which is usually pretty high (some counterexamples notwithstanding).
Before anything else, however, I want to note that we have seen nothing about AI motives generalizing, because current systems don’t have motives.
I mostly agree, but I think it makes sense to describe GPT-4 as having some motives, although they are not persistent or open-ended. You can clearly tell that it’s trying to help you when you talk to it, although I’m not making a strong claim about its psychological states. Mostly, our empirical ignorance here is a good reason to fall back on our prior about the likelihood of deceptive alignment. And I do not yet see any good reason to think that prior should be high.
Regarding AI vs. AI and Rogue humans versus AI, we have also seen that animals, overall, have fared very poorly as humanity thrived. In the analogy, I don’t know why you think we’re the dogs kept as pets, not the birds whose habitat is gone, or even the mosquitos humans want to eliminate.
If AI motives are completely different from human motives and we have no ability to meaningfully communicate with them, then yeah, I think it might be better to view our situation with AI as more analogous to humans vs. wild animals. But I don’t think that’s a good model of what plausible AI motives will be like, given that humans will be directly responsible for developing and training AIs, unlike our situation regarding wild animals.
Even in this exceptionally pessimistic analogy, the vast majority of wild animal species have not yet gone extinct from human activities, and humans care at least a little bit about preserving wild animal species (in the sense of spending at least 0.01% of our GDP each year on wildlife conservation). In the contemporary era, richer nations plausibly have more success with conservation efforts, given that they can afford them more easily. Given this, I think as we grow richer, it’s similarly plausible that we will eventually put a stop to species extinction, even for animals that we care very little about.
One thing you don’t really seem to be taking into account is inner alignment failure / goal misgeneralisation / mesaoptimisation. Why don’t you think this will happen?
I think we have doom by default for a number of independent disjunctive reasons. And by “default” I mean “if we keep developing AGI at the rate we currently are, without an indefinite global pause” (regardless of how many resources are poured into x-safety, there just isn’t enough time to solve it without a pause).
Deceptive alignment is a convergent instrumental subgoal. If an AI is clearly misaligned while its creator still has the ability to pull the plug, the plug will be pulled; ergo, pretending to be aligned is worthwhile ~regardless of terminal goal.
Thus, the prior would seem to be that all sufficiently smart AIs appear aligned, but only a proportion X of them are truly aligned, where X is the chance of a randomly selected value system being aligned; the remaining 1-X are deceptively aligned.
GPT-4 being the smartest AI we have and also appearing aligned is not really evidence against this; it’s plausibly smart enough in the specific domain of “predicting humans” for its apparent alignment to be deceptive.
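The argument above can be written out as a Bayesian update. This is only a sketch under the assumptions stated in the comment: $A$ stands for “truly aligned”, $S$ for “appears aligned”, $X$ is the prior from above, and $d$ is a hypothetical symbol (not in the original) for the probability that a misaligned but sufficiently smart AI still appears aligned. It also assumes a truly aligned AI always appears aligned, i.e. $P(S \mid A) = 1$.

$$
P(A \mid S) = \frac{P(S \mid A)\,P(A)}{P(S \mid A)\,P(A) + P(S \mid \neg A)\,P(\neg A)} = \frac{X}{X + d\,(1 - X)}
$$

If $d = 1$, which is what the convergent-subgoal claim amounts to, the posterior is exactly $X$ and observing apparent alignment tells us nothing. If $d < 1$, the posterior is above $X$, so how much GPT-4’s apparent alignment counts as evidence depends on how close to 1 one thinks $d$ is.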