Those words were not yours, but you did say you agreed it was the main crux, and in context it seemed like you were agreeing that it was a crux for you too. I see now on reread that I misread you and you were instead saying it was a secondary crux. Here, let’s cut through the semantics and get quantitative:
What is your credence in doom conditional on AIs not caring for humans?
If it’s >50%, then I’m mildly surprised that you think the risk of accidentally creating a permanent pause is worse than the risks from not-pausing. I guess you did say that you think AIs will probably just be ethical if we train them hard enough to be… What is your response to the standard arguments that ‘just train them hard to be ethical’ won’t work? E.g. Ajeya Cotra’s writings on the training game.
Re: “I don’t see how the first part of that leads to the second part”: come on, of course you do; you just don’t see it NECESSARILY leading to the second part. On that I agree. Few things are certain in this world. What is your credence in doom conditional on AIs not caring for humans & there being multiple competing AIs?
IMO the “Competing factions of superintelligent AIs, none of whom care about humans, may soon arise, but even if so, humans will be fine anyway somehow” hypothesis is pretty silly and the burden of proof is on you to defend it. I could cite formal models as well as historical precedents to undermine the hypothesis, but I’m pretty sure you know about them already.
The question I’m asking is: why? You have told me what you expect to happen, but I want to see an argument for why you’d expect that to happen. In the absence of some evidence-based model of the situation, I don’t think speculating about specific scenarios is a reliable guide.
Why what? I answered your original question:
Why are rogue AI motives so much more likely to lead to disaster than rogue human motives? Yes, AIs will be more powerful than humans, but there are already many people who are essentially powerless (not to mention many non-human animals) who survive despite the fact that their interests are in competition with much more powerful entities.
with:
Powerless humans survive because of a combination of (a) many powerful humans actually caring about their wellbeing and empowerment, and (b) those powerful humans who don’t care, having incentives such that it wouldn’t be worth it to try to kill the powerless humans and take their stuff. E.g. if Putin started killing homeless people in Moscow and pawning their possessions, he’d lose way more in expectation than he’d gain. Neither (a) nor (b) will save us in the AI case (at least, keeping acausal trade and the like out of the picture) because until we make significant technical progress on alignment there won’t be any powerful aligned AGIs to balance against the unaligned ones, and because whatever norms and society a bunch of competing unaligned AGIs set up between themselves, it is unlikely to give humans anything close to equal treatment, and what consideration it gives to humans will erode rapidly as the power differential grows.
My guess is that you disagree with the “whatever norms and society a bunch of competing unaligned AGIs set up between themselves, it is unlikely to give humans anything close to equal treatment...” bit.
Why? Seems pretty obvious to me; I feel like your skepticism is an isolated demand for rigor.
But I’ll go ahead and say more anyway:
Giving humans equal treatment would be worse (for the AIs, which by hypothesis don’t care about humans at all) than other salient options available to them, such as having the humans be second-class in various ways, or complete pawns/tools/slaves. Eventually, when the economy is entirely robotic, keeping humans alive at all would be an unnecessary expense.
Historically, if you look at relations between humans and animals, or between colonial powers and native powers, this is the norm. Cases in which the powerless survive and thrive despite none of the powerful caring about them are the exception, and they happen for reasons that probably won’t apply in the case of AI. E.g. Putin killing homeless people would be bad for his army’s morale, and that would far outweigh the benefits he’d get from it. (Arguably this is a case of some powerful people in Russia caring about the homeless, so maybe it’s not even an exception after all.)
Can you say more about what model you have in mind? Do you have a model? What about a scenario, can you spin a plausible story in which all the ASIs don’t care at all about humans but humans are still fine?
Wanna meet up sometime to talk this over in person? I’ll be in Berkeley this weekend and next week!
Paul Christiano argues here that AI would only need to have “pico-pseudokindness” (caring about humans one part in a trillion) to take over the universe but not trash Earth’s environment to the point of uninhabitability, and that at least this amount of kindness is likely.
Doesn’t Paul Christiano also have a p(doom) of around 50%? (To me, this suggests “maybe” rather than “likely”.)
See the reply to the first comment on that post. Paul’s “most humans die from AI takeover” is 11%. There are other bad scenarios he considers, like losing control of the future, or most humans dying for other reasons, but my understanding is that the 11% most closely corresponds to doom from AI.
Fair. But the other scenarios making up the ~50% are still terrible enough for us to Pause.
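A rough arithmetic sketch of that point, treating the split as my reading of the cited numbers rather than Paul’s own decomposition: if his overall p(doom) is around 50% and “most humans die from AI takeover” accounts for 11 points of it, the rest of the probability mass sits in the other bad scenarios.
% 0.50 and 0.11 are the figures cited above; attributing the remainder to the
% other scenarios (loss of control of the future, deaths from other causes) is
% an assumption, not Paul's own breakdown
\[
0.50 - 0.11 \approx 0.39 \quad \text{(probability mass in the other bad scenarios)}
\]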
What is your credence in doom conditional on AIs not caring for humans?
How much do they care about humans, and what counts as doom? I think these things matter.
If we’re assuming all AIs don’t care at all about humans and doom = human extinction, then I think the probability is pretty high, like 65%.
If we’re allowed to assume that some small minority of AIs cares about humans, or AIs care about humans to some degree, perhaps in the way humans care about wildlife species preservation, then I think the probability is quite a lot lower, at maybe 25%.
To be precise, both of these estimates are over the next 100 years, since I have almost no idea what will happen in the very long run.
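To make the arithmetic concrete: these two conditional estimates combine, via the law of total probability, with whatever weight you put on the “no AI cares at all” scenario. The 50/50 weighting below is purely illustrative; it is not a weight I am actually asserting.
% 0.65 and 0.25 are the conditional estimates given above; the 0.5/0.5 weights
% on the two scenarios are purely illustrative
\[
P(\text{doom}) = 0.5 \times 0.65 + 0.5 \times 0.25 = 0.45
\]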
What is your response to the standard arguments that ‘just train them hard to be ethical’ won’t work? E.g. Ajeya Cotra’s writings on the training game.
In most of these stories, including in Ajeya’s story IIRC, humanity just doesn’t seem to try very hard to reduce misalignment? I don’t think that’s a very reasonable assumption. (Charitably, it could be interpreted as a warning rather than a prediction.) I think that as systems get more capable, we will see a large increase in our alignment efforts and monitoring of AI systems, even without any further intervention from longtermists.
Can you say more about what model you have in mind? Do you have a model?
I’m happy to meet up some time and explain in person. I’ll try to remember to DM you later about that, but if I forget, then feel free to remind me.
I think that as systems get more capable, we will see a large increase in our alignment efforts and monitoring of AI systems, even without any further intervention from longtermists.
Maybe so. But I can’t really see mechanistic interpretability being solved to a sufficient degree to detect a situationally aware AI playing the training game in time to avert doom. Not without a long pause first, at least!
I’m surprised by your 25%. To me, that really doesn’t match up with [...] from your essay.
In my opinion, “X is dubious” lines up pretty well with “X is 75% likely to be false”. That said, enough people have objected to this that I think I’ll change the wording.
OK, so our credences aren’t actually that different after all. I’m actually at less than 65%, funnily enough! (But that’s for doom = extinction. I think human extinction is unlikely for reasons to do with acausal trade; there will be a small minority of AIs that care about humans, just not on Earth. I usually use a broader definition of “doom” as “About as bad as human extinction, or worse.”)
I am pretty confident that what happens in the next 100 years will straightforwardly translate to what happens in the long run. If humans are still well-cared-for in 2100, they probably also will be in the year 2,100,000,000.
I agree that if some AIs care about humans, or if all AIs care a little bit about humans, the situation looks proportionately better. Unfortunately that’s not what I expect to happen by default on Earth.
In most of these stories, including in Ajeya’s story IIRC, humanity just doesn’t seem to try very hard to reduce misalignment? I don’t think that’s a very reasonable assumption. (Charitably, it could be interpreted as a warning rather than a prediction.) I think that as systems get more capable, we will see a large increase in our alignment efforts and monitoring of AI systems, even without any further intervention from longtermists.
That’s not really an answer to my question—Ajeya’s argument is about how today’s alignment techniques (e.g. RLHF + monitoring) won’t work even if turbocharged with huge amounts of investment. It sounds like you are disagreeing, and saying that if we just spend lots of $$$ doing lots and lots of RLHF, it’ll work. Or when you say humanity will try harder, do you mean they’ll use some other technique than the ones Ajeya thinks won’t work? If so, which technique?
(Separately, I tend to think humanity will probably invest less in alignment than it does in her stories, but that’s not the crux between us I think.)