Thanks for the interesting article—very easy to understand, which I appreciated.
Thanks!
If you really don’t think unchecked AI will kill everyone, then I probably agree that the argument for a pause becomes weak and possibly untenable.
I agree this is probably the main crux for a lot of people. Nonetheless, it is difficult for me to fully explain the reasons for optimism in a short post within the context of the pause debate. Mostly, I think AIs will probably just be ethical if we train them hard enough to be, since I haven’t found any strong reason yet to think that AI motives will generalize extremely poorly from the training distribution. But even if AI motives do generalize poorly, I am skeptical of total doom happening as a side effect.
[ETA: My main argument is not “AI will be fine even if it’s misaligned.” I’m not saying that at all. The context here is a brief point in my section on optimism arguing that AI might not literally kill everyone if it didn’t “care much for humans”. Please don’t take this out of context and think that I’m arguing something much stronger.]
For people who confidently believe in total doom by default, I have some questions that I want to see answered:
Why should we expect rogue AIs to kill literally everyone rather than first try to peacefully resolve their conflicts with us, as humans often do with each other (including when there are large differences in power)?
Why should we expect this future conflict to be “AI vs. humanity” rather than “AI vs. AI” (with humanity on the sidelines)?
Why are rogue AI motives so much more likely to lead to disaster than rogue human motives? Yes, AIs will be more powerful than humans, but there are already many people who are essentially powerless (not to mention many non-human animals) who survive despite the fact that their interests are in competition with much more powerful entities. (But again, I stress that this logic is not at all my primary reason for hope.)
I don’t think of total doom as inevitable, but I certainly do see it as a default—without concerted effort to make AI safe, it will not be.
Before anything else, however, I want to note that we have seen nothing about AI motives generalizing, because current systems don’t have motives.
That said, we have seen the unavoidable and universal situation of misalignment between stated goals and actual goals, and between principals and agents. These are fundamental problems, and we aren’t gonna fix them in general. Any ways to avoid them will require very specific effort. Given instrumental convergence, I don’t understand how that leaves room to think we can scale AI indefinitely and not have existential risks by default.
Regarding AI vs. AI and Rogue humans versus AI, we have also seen that animals, overall, have fared very poorly as humanity thrived. In the analogy, I don’t know why you think we’re the dogs kept as pets, not the birds whose habitat is gone, or even the mosquitos humans want to eliminate. Sure, it’s possible, but you seem confident that we’d be in the tiny minority of winners if we become irrelevant.
I don’t think of total doom as inevitable, but I certainly do see it as a default—without concerted effort to make AI safe, it will not be.
This may come down to a semantic dispute about what we mean by “default”. Typically what I mean by “default” is something more like: “without major intervention from the longtermist community”. This default is quite different from the default of “[no] concerted effort to make AI safe”, which I agree would be disastrous.
Under this definition of “default”, I think the default outcome isn’t one without any safety research. I think our understanding of the default outcome can be informed by society’s general level of risk-aversion to new technologies, which is usually pretty high (some counterexamples notwithstanding).
Before anything else, however, I want to note that we have seen nothing about AI motives generalizing, because current systems don’t have motives.
I mostly agree, but I think it makes sense to describe GPT-4 as having some motives, although they are not persistent and open-ended. You can clearly tell that it’s trying to help you when you talk to it, although I’m not making a strong claim about its psychological states. Mostly, our empirical ignorance here is a good reason to fall back on our prior about the likelihood of deceptive alignment. And I do not yet see any good reason to think that prior should be high.
Regarding AI vs. AI and Rogue humans versus AI, we have also seen that animals, overall, have fared very poorly as humanity thrived. In the analogy, I don’t know why you think we’re the dogs kept as pets, not the birds whose habitat is gone, or even the mosquitos humans want to eliminate.
If AI motives are completely different from human motives and we have no ability to meaningfully communicate with them, then yeah, I think it might be better to view our situation with AI as more analogous to humans vs. wild animals. But,
I don’t think that’s a good model of what plausible AI motives will be like, given that humans will be directly responsible for developing and training AIs, unlike our situation regarding wild animals.
Even in this exceptionally pessimistic analogy, the vast majority of wild animal species have not gone extinct from human activities yet, and humans care at least a little bit about preserving wild animal species (in the sense of spending at least 0.01% of our GDP each year on wildlife conservation). In the contemporary era, richer nations plausibly have more success with conservation efforts given that they can afford it more easily. Given this, I think as we grow richer, it’s similarly plausible that we will eventually put a stop to species extinction, even for animals that we care very little about.
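For rough scale (taking world GDP to be very roughly $100 trillion; the exact figure is not the point), that 0.01% floor works out to:

$$
0.01\% \times \$100~\text{trillion} \approx \$10~\text{billion per year}
$$

spent on wildlife conservation, a floor that plausibly grows as the world gets richer.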
One thing you don’t really seem to be taking into account is inner alignment failure / goal misgeneralisation / mesaoptimisation. Why don’t you think this will happen?
I think we have doom by default for a number of independent disjunctive reasons. And by “default” I mean “if we keep developing AGI at the rate we currently are, without an indefinite global pause” (regardless of how many resources are poured into x-safety, there just isn’t enough time to solve it without a pause).
Deceptive alignment is a convergent instrumental subgoal. If an AI is clearly misaligned while its creator still has the ability to pull the plug, the plug will be pulled; ergo, pretending to be aligned is worthwhile ~regardless of terminal goal.
Thus, the prior would seem to be that all sufficiently-smart AI appear aligned, but only a proportion X of them are truly aligned, where X is the chance of a randomly-selected value system being aligned; the other 1-X are deceptively aligned.
GPT-4 being the smartest AI we have and also appearing aligned is not really evidence against this; it’s plausibly smart enough in the specific domain of “predicting humans” for its apparent alignment to be deceptive.
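In symbols, a rough version of that prior argument (treating “appears aligned” as near-certain for sufficiently smart AI, whether or not it is truly aligned):

$$
P(\text{truly aligned} \mid \text{appears aligned})
= \frac{P(\text{appears aligned} \mid \text{truly aligned})\, P(\text{truly aligned})}{P(\text{appears aligned})}
\approx \frac{1 \cdot X}{1} = X
$$

So apparent alignment by itself provides almost no evidence that the underlying values are the ones we wanted.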
First of all, you are goal-post-moving if you make this about “confident belief in total doom by default” instead of the original “if you really don’t think unchecked AI will kill everyone.” You need to defend the position that the probability of existential catastrophe conditional on misaligned AI is <50%.
Secondly, “AI motives will generalize extremely poorly from the training distribution” is a confused and misleading way of putting it. The problem is that it’ll generalize in a way that wasn’t the way we hoped it would generalize.
Third, to answer your questions:
1. The difference in power will be great & growing rapidly, compared to historical cases. I support implementing things like model amnesty, but I don’t expect them to work, and anyhow we are not anywhere close to having such things implemented.
2. It’ll be AI vs. AI with humanity on the sidelines, yes. Humans will be killed off, enslaved, or otherwise misused as pawns. It’ll be like colonialism all over again but on steroids. Unless takeoff is fast enough that there is only one AI faction. Doesn’t really matter, either way humans are screwed.
3. Powerless humans survive because of a combination of (a) many powerful humans actually caring about their wellbeing and empowerment, and (b) those powerful humans who don’t care, having incentives such that it wouldn’t be worth it to try to kill the powerless humans and take their stuff. E.g. if Putin started killing homeless people in Moscow and pawning their possessions, he’d lose way more in expectation than he’d gain. Neither (a) nor (b) will save us in the AI case (at least, keeping acausal trade and the like out of the picture) because until we make significant technical progress on alignment there won’t be any powerful aligned AGIs to balance against the unaligned ones, and because whatever norms and society a bunch of competing unaligned AGIs set up between themselves, it is unlikely to give humans anything close to equal treatment, and what consideration it gives to humans will erode rapidly as the power differential grows.
First of all, you are goal-post-moving if you make this about “confident belief in total doom by default” instead of the original “if you really don’t think unchecked AI will kill everyone.”
I never said “I don’t think unchecked AI will kill everyone”. That quote was not from me.
What I did say was, “Even if AIs end up not caring much for humans, it is dubious that they would decide to kill all of us.” Google informs me that dubious means “not to be relied upon; suspect”.
It’ll be AI vs. AI with humanity on the sidelines, yes. Humans will be killed off, enslaved, or otherwise misused as pawns. It’ll be like colonialism all over again but on steroids.
I don’t see how the first part of that leads to the second part. Humanity could be on the sidelines in a way that doesn’t lead to total oppression and subjugation. The idea that these things will necessarily happen just seems like speculation. I could speculate that the opposite will occur and AIs will leave us alone. That doesn’t get us anywhere.
Neither (a) nor (b) will save us in the AI case (at least, keeping acausal trade and the like out of the picture) because until we make significant technical progress on alignment there won’t be any powerful aligned AGIs to balance against the unaligned ones, and because whatever norms and society a bunch of competing unaligned AGIs set up between themselves, it is unlikely to give humans anything close to equal treatment, and what consideration it gives to humans will erode rapidly as the power differential grows.
The question I’m asking is: why? You have told me what you expect to happen, but I want to see an argument for why you’d expect that to happen. In the absence of some evidence-based model of the situation, I don’t think speculating about specific scenarios is a reliable guide.
Those words were not yours, but you did say you agreed it was the main crux, and in context it seemed like you were agreeing that it was a crux for you too. I see now on reread that I misread you and you were instead saying it was a secondary crux. Here, let’s cut through the semantics and get quantitative:
What is your credence in doom conditional on AIs not caring for humans?
If it’s >50%, then I’m mildly surprised that you think the risk of accidentally creating a permanent pause is worse than the risks from not-pausing. I guess you did say that you think AIs will probably just be ethical if we train them hard enough to be… What is your response to the standard arguments that ‘just train them hard to be ethical’ won’t work? E.g. Ajeya Cotra’s writings on the training game.
Re: “I don’t see how the first part of that leads to the second part”. Come on, of course you do; you just don’t see it NECESSARILY leading to the second part. On that I agree. Few things are certain in this world. What is your credence in doom conditional on AIs not caring for humans & there being multiple competing AIs?
IMO the “Competing factions of superintelligent AIs, none of whom care about humans, may soon arise, but even if so, humans will be fine anyway somehow” hypothesis is pretty silly and the burden of proof is on you to defend it. I could cite formal models as well as historical precedents to undermine the hypothesis, but I’m pretty sure you know about them already.
The question I’m asking is: why? You have told me what you expect to happen, but I want to see an argument for why you’d expect that to happen. In the absence of some evidence-based model of the situation, I don’t think speculating about specific scenarios is a reliable guide.
Why what? I answered your original question:
Why are rogue AI motives so much more likely to lead to disaster than rogue human motives? Yes, AIs will be more powerful than humans, but there are already many people who are essentially powerless (not to mention many non-human animals) who survive despite the fact that their interests are in competition with much more powerful entities.
with:
Powerless humans survive because of a combination of (a) many powerful humans actually caring about their wellbeing and empowerment, and (b) those powerful humans who don’t care, having incentives such that it wouldn’t be worth it to try to kill the powerless humans and take their stuff. E.g. if Putin started killing homeless people in Moscow and pawning their possessions, he’d lose way more in expectation than he’d gain. Neither (a) nor (b) will save us in the AI case (at least, keeping acausal trade and the like out of the picture) because until we make significant technical progress on alignment there won’t be any powerful aligned AGIs to balance against the unaligned ones, and because whatever norms and society a bunch of competing unaligned AGIs set up between themselves, it is unlikely to give humans anything close to equal treatment, and what consideration it gives to humans will erode rapidly as the power differential grows.
My guess is that you disagree with the “whatever norms and society a bunch of competing unaligned AGIs set up between themselves, it is unlikely to give humans anything close to equal treatment...” bit.
Why? Seems pretty obvious to me; I feel like your skepticism is an isolated demand for rigor.
But I’ll go ahead and say more anyway:
Giving humans equal treatment would be worse (for the AIs, which by hypothesis don’t care about humans at all) than other salient available options to them, such as having the humans be second-class in various ways or complete pawns/tools/slaves. Eventually, when the economy is entirely robotic, keeping humans alive at all would be an unnecessary expense.
Historically, if you look at relations between humans and animals, or between colonial powers and native powers, this is the norm. Cases in which the powerless survive and thrive despite none of the powerful caring about them are the exception, and happen for reasons that probably won’t apply in the case of AI. E.g. Putin killing homeless people would be bad for his army’s morale, and that would far outweigh the benefits he’d get from it. (Arguably this is a case of some powerful people in Russia caring about the homeless, so maybe it’s not even an exception after all.)
Can you say more about what model you have in mind? Do you have a model? What about a scenario, can you spin a plausible story in which all the ASIs don’t care at all about humans but humans are still fine?
Wanna meet up sometime to talk this over in person? I’ll be in Berkeley this weekend and next week!
Paul Christiano argues here that AI would only need to have “pico-pseudokindness” (caring about humans one part in a trillion) to take over the universe but not trash Earth’s environment to the point of uninhabitability, and that at least this amount of kindness is likely.
Doesn’t Paul Christiano also have a p(doom) of around 50%? (To me, this suggests “maybe”, rather than “likely”).
See the reply to the first comment on that post. Paul’s “most humans die from AI takeover” is 11%. There are other bad scenarios he considers, like losing control of the future, or most humans die for other reasons, but my understanding is that the 11% most closely corresponds to doom from AI.
Fair. But the other scenarios making up the ~50% are still terrible enough for us to Pause.
What is your credence in doom conditional on AIs not caring for humans?
How much do they care about humans, and what counts as doom? I think these things matter.
If we’re assuming all AIs don’t care at all about humans and doom = human extinction, then I think the probability is pretty high, like 65%.
If we’re allowed to assume that some small minority of AIs cares about humans, or AIs care about humans to some degree, perhaps in the way humans care about wildlife species preservation, then I think the probability is quite a lot lower, at maybe 25%.
To be precise, both of these estimates are over the next 100 years, since I have almost no idea what will happen in the very long run.
What is your response to the standard arguments that ‘just train them hard to be ethical’ won’t work? E.g. Ajeya Cotra’s writings on the training game.
In most of these stories, including in Ajeya’s story IIRC, humanity just doesn’t seem to try very hard to reduce misalignment? I don’t think that’s a very reasonable assumption. (Charitably, it could be interpreted as a warning rather than a prediction.) I think that as systems get more capable, we will see a large increase in our alignment efforts and monitoring of AI systems, even without any further intervention from longtermists.
Can you say more about what model you have in mind? Do you have a model?
I’m happy to meet up some time and explain in person. I’ll try to remember to DM you later about that, but if I forget, then feel free to remind me.
I think that as systems get more capable, we will see a large increase in our alignment efforts and monitoring of AI systems, even without any further intervention from longtermists.
Maybe so. But I can’t really see mechanistic interpretability being solved to a sufficient degree to detect a situationally aware AI playing the training game, in time to avert doom. Not without a long pause first at least!
I’m surprised by your 25%. To me, that really doesn’t match up with “Even if AIs end up not caring much for humans, it is dubious that they would decide to kill all of us” from your essay.
In my opinion, “X is dubious” lines up pretty well with “X is 75% likely to be false”. That said, enough people have objected to this that I think I’ll change the wording.
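Put as arithmetic, that reading makes the essay wording and the 25% figure given earlier in this thread line up:

$$
P(\text{AIs kill everyone} \mid \text{AIs do not care much for humans}) \approx 1 - 0.75 = 0.25
$$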
OK, so our credences aren’t actually that different after all. I’m actually at less than 65%, funnily enough! (But that’s for doom = extinction. I think human extinction is unlikely for reasons to do with acausal trade; there will be a small minority of AIs that care about humans, just not on Earth. I usually use a broader definition of “doom” as “About as bad as human extinction, or worse.”)
I am pretty confident that what happens in the next 100 years will straightforwardly translate to what happens in the long run. If humans are still well-cared-for in 2100 they probably also will be in 2100,000,000.
I agree that if some AIs care about humans, or if all AIs care a little bit about humans, the situation looks proportionately better. Unfortunately that’s not what I expect to happen by default on Earth.
In most of these stories, including in Ajeya’s story IIRC, humanity just doesn’t seem to try very hard to reduce misalignment? I don’t think that’s a very reasonable assumption. (Charitably, it could be interpreted as a warning rather than a prediction.) I think that as systems get more capable, we will see a large increase in our alignment efforts and monitoring of AI systems, even without any further intervention from longtermists.
That’s not really an answer to my question—Ajeya’s argument is about how today’s alignment techniques (e.g. RLHF + monitoring) won’t work even if turbocharged with huge amounts of investment. It sounds like you are disagreeing, and saying that if we just spend lots of $$$ doing lots and lots of RLHF, it’ll work. Or when you say humanity will try harder, do you mean they’ll use some other technique than the ones Ajeya thinks won’t work? If so, which technique?
(Separately, I tend to think humanity will probably invest less in alignment than it does in her stories, but that’s not the crux between us I think.)