Naively, it seems as if killing everyone would earn AI a massive penalty in training: why would it develop aims that are consistent with doing that?
An AI killing everyone wouldn’t earn a massive penalty in training, because there won’t be humans alive in that scenario to assign the penalty. Cf. point 10 in AGI Ruin:
10. You can’t train alignment by running lethally dangerous cognitions, observing whether the outputs kill or deceive or corrupt the operators, assigning a loss, and doing supervised learning. On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions. (Some generalization of this seems like it would have to be true even outside that paradigm; you wouldn’t be working on a live unaligned superintelligence to align it.) This alone is a point that is sufficient to kill a lot of naive proposals from people who never did or could concretely sketch out any specific scenario of what training they’d do, in order to align what output—which is why, of course, they never concretely sketch anything like that. Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn’t kill you. This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm. Unaligned operation at a dangerous level of intelligence*capability will kill you; so, if you’re starting with an unaligned system and labeling outputs in order to get it to learn alignment, the training regime or building regime must be operating at some lower level of intelligence*capability that is passively safe, where its currently-unaligned operation does not pose any threat. (Note that anything substantially smarter than you poses a threat given any realistic level of capability. 
Eg, “being able to produce outputs that humans look at” is probably sufficient for a generally much-smarter-than-human AGI to navigate its way out of the causal systems that are humans, especially in the real world where somebody trained the system on terabytes of Internet text, rather than somehow keeping it ignorant of the latent causes of its source code and training environments.)
But if future AIs are trained like current ones are—by being given vast amounts of human-derived data—I’d naively expect AI goals to have the human-like property of being fuzzy, complex and constrained
I think you need to say more about what the system is being trained for (and how we train it for that). Just saying “facts about humans are in the data” doesn’t provide a causal mechanism by which the AI acts in human-like ways, any more than “facts about clouds are in the data” provides a mechanism by which the AI role-plays being a cloud.
If there’s a lot more human data than meteorological data on clouds in the training data, then I guess I’d consider it more likely that the AI has goals that are somehow human-related and/or imitative-of-humans, than that the AI has goals that are cloud-related or imitative-of-clouds? But I don’t expect the resultant minds to be all that human-like or all that cloud-like, and if there are resemblances, they could be bad ones and not just good ones.
“Constrained” is a particularly hard target to hit because it actively pushes against producing impressive SotA-advancing results. I’d expect the first AGI systems to be built by labs that are pushing full steam ahead on making crazy impressive things happen ASAP, which means you’re actively optimizing against minds that are trying to limit their impact, intelligence, or power.
(Also, I think training AGI systems by giving them vast amounts of human-derived data is a terrible idea, and cuts out many of the most promising tactics for aligning AGI systems. But that’s maybe a topic to save for after we agree about whether you just get human-ish values for free by exposing alien minds or mind-building-processes to lots of facts about humans.)
So there still seems an inferential leap from ‘existing systems are sometimes misaligned’ to ‘superintelligent AI will most likely be catastrophically misaligned’.
Note that there’s a difference between ‘how much goal overlap is there’ and ‘how catastrophic is the non-overlap’. You gave an argument that human goals overlap some with the goals of evolution, but you didn’t give an argument that humans are non-catastrophic from the (pseudo-)perspective of evolution. That would depend on whether humans will produce lots of copies of human DNA in the future.
Power-seeking, wealth-seeking, and self-protection are all instrumentally useful unless your goals include not having power, not having wealth, and not resisting human interference.
Yep! The orthogonality thesis doesn’t just show that unfriendly goals are possible; it shows that friendly goals are possible too.
power and survival are instrumentally convergent for humans too, but not all humans maximally seek these things (even if they can). What will be different about AI?
An AI killing everyone wouldn’t earn a massive penalty in training, because there won’t be humans alive in that scenario to assign the penalty.
Humans need not be around to give a penalty at inference time, just as GPT-4 is not penalized by individual humans; rather, the reward is learned or programmed. Even if all humans are sleeping or dead today, GPT can run inference according to the reward we preprogrammed. These models are not doing pure online learning.
This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm. Unaligned operation at a dangerous level of intelligence*capability will kill you; so, if you’re starting with an unaligned system and labeling outputs in order to get it to learn alignment, the training regime or building regime must be operating at some lower level of intelligence*capability that is passively safe, where its currently-unaligned operation does not pose any threat.
It is a logical fallacy to account for future increases in capabilities but not future advances in safety research. You’re claiming AGI will be an x-risk based on scaling current capabilities only, but you’re failing to scale safety. Generalization to unsafe scenarios is something we want to write tests for before deploying in situations where those scenarios may occur. Phased deployment should help test whether we can generalize to increasingly harder situations.
I’d expect the first AGI systems to be built by labs that are pushing full steam ahead on making crazy impressive things happen ASAP, which means you’re actively optimizing against minds that are trying to limit their impact, intelligence, or power
The recent push for productization is making everyone realize that alignment is a capability. A gaslighting chatbot is a bad chatbot compared to a harmless helpful one. As you can see currently, the world is deploying AI in phases, fixing the bugs, then iterating.
You gave an argument that human goals overlap some with the goals of evolution, but you didn’t give an argument that humans are non-catastrophic from the (pseudo-)perspective of evolution. That would depend on whether humans will produce lots of copies of human DNA in the future.
Humans are unaligned in various ways, and it looks like a lot of AIs will be deployed in the future, many aligned to different objectives. I’m skeptical of MIRI’s modeling of risk because y’all only talk about one super-powerful, godlike AGI, but y’all haven’t modeled multiple companies, multiple AGIs, multiple deployments. Unlike the single-AGI scenario, this is the most likely scenario, yet it frequently goes unmentioned in forecasting. Future compute is going to be distributed among these AGIs too, so in many ways we end up at something akin to a modern society of humans.
Yep! The orthogonality thesis doesn’t just show that unfriendly goals are possible; it shows that friendly goals are possible too.
Then why the overemphasis/obsession on the doom scenario? It makes for a great robot-uprising sci-fi story but is unscientific. If you approximate the likelihood of future scenarios as a Gaussian distribution, wiping out all humans is so extreme and long-tailed that it is less likely than almost any other scenario in the set, and the least likely scenario in that set has a probability that approaches zero, given an infinite set of possibilities summing to 1.0. Given that the number of possibilities is infinite, the likelihood of any one possibility is vanishingly small. The likelihood of unaligned AGIs jerking each other off in a massive orgy for eternity is as likely as wiping out humans (more likely, accounting for resistance to the latter scenario).
The recent push for productization is making everyone realize that alignment is a capability. A gaslighting chatbot is a bad chatbot compared to a harmless helpful one. As you can see currently, the world is deploying AI in phases, fixing the bugs, then iterating.
While that’s one way to look at it, another way is to notice the arms race dynamics and how every major tech company is now rushing LLMs out to the public head over heels even when they still have some severe flaws. Another observation is that e.g. OpenAI’s safety efforts are not very popular among end users, given that in their eyes these safety measures make the systems less capable/interesting/useful. People tend to get irritated when their prompt is answered with “As a language model trained by OpenAI, I am not able to <X>”, rather than feeling relief over being saved from a dangerous output.
As for your final paragraph, it is easy to say “<outcome X> is just one out of infinite possibilities”, but you’re equating trajectories with outcomes. The existence of infinite possibilities doesn’t really help when there’s a systematic reason that causes many or most of them to have human extinction as an outcome. Whether this is actually the case or not is of course an open and hotly debated question, but just claiming “it’s just a single point on the x axis so the probability mass must be 0” is surely not how you get closer to an actual answer.
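To make the trajectories-versus-outcomes distinction concrete, here is a minimal Python sketch. Everything in it is an assumption made up for illustration (the step count, the per-step failure chance, the function names); it says nothing about how likely any real-world outcome is. It only shows that when every individual trajectory is essentially unique, and so has negligible probability on its own, a shared mechanism can still put most of the probability mass on one outcome.

```python
import random


def simulate_trajectory(rng, steps=20, p_bad_step=0.05):
    """Simulate one toy trajectory and classify its outcome.

    The path is a tuple of random floats, so practically every path is
    unique. The outcome is "bad" if any step hits an absorbing failure.
    Both parameters are arbitrary illustrative values.
    """
    path = []
    outcome = "ok"
    for _ in range(steps):
        path.append(rng.random())
        if rng.random() < p_bad_step:
            outcome = "bad"  # absorbing: later steps cannot undo it
    return tuple(path), outcome


def outcome_share(n=10_000, seed=0):
    """Return (number of distinct paths, fraction of runs ending "bad")."""
    rng = random.Random(seed)
    runs = [simulate_trajectory(rng) for _ in range(n)]
    distinct_paths = len({path for path, _ in runs})
    bad_fraction = sum(1 for _, outcome in runs if outcome == "bad") / n
    return distinct_paths, bad_fraction


if __name__ == "__main__":
    distinct, bad = outcome_share()
    # Analytically, P(bad) = 1 - 0.95**20, which is about 0.64.
    print(f"distinct trajectories: {distinct}, share ending bad: {bad:.2f}")
```

Each sampled path shows up once, yet well over half of them share the same outcome. Counting possibilities tells you nothing until you ask how the possibilities map onto outcomes, which is the open question in this thread.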
why the overemphasis/obsession on doom scenario?
Because it is extremely important that we do what we can to avoid such a scenario. I’m glad that e.g. airlines still invest a lot in improving flight safety and preventing accidents even though flying is already the safest way of traveling. Humanity is basically at this very moment boarding a giant AI-rplane that is about to take off for the very first time, and I’m rather happy there’s a number of people out there looking at the possible worst case and doing their best to figure out how we can get this plane safely off the ground rather than saying “why are people so obsessed with the doom scenario? A plane crash is just one out of infinite possibilities, we’re gonna be fine!”.
Humans need not be around to give a penalty at inference time, just as GPT-4 is not penalized by individual humans; rather, the reward is learned or programmed. Even if all humans are sleeping or dead today, GPT can run inference according to the reward we preprogrammed. These models are not doing pure online learning.
I was also confused by this at first. But I don’t think Rob is saying “an AI that learned ‘don’t kill everyone’ during training would immediately start killing everyone as soon as it can get away with it”, I think he’s saying “even if an AI picks up what seems like a ‘don’t kill everyone’ heuristic during training, that doesn’t mean this heuristic will always hold out-of-distribution”. In particular, undergoing training is a different environment than being deployed, so picking up a “don’t kill everyone in training (but do whatever when deployed)” heuristic is just as good during training as “don’t kill everyone ever”, but the former allows the AI more freedom to pursue its other objectives when deployed.
(I’m hoping Rob can correct me if I’m wrong and/or you can reply if I’m mistaken, per Cunningham’s Law.)
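The point about the two heuristics being "just as good during training" can be sketched as a toy in Python. This is a hypothetical illustration, not a model of any real training setup: the `Situation` fields, the two policy functions, and the penalty rule are all invented here, and real systems do not come with a labelled "overseen" flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    overseen: bool        # hypothetical: does this look like training?
    harm_available: bool  # hypothetical: is a harmful action possible?


def always_refrain(s: Situation) -> str:
    """Heuristic A: never take the harmful action."""
    return "refrain"


def refrain_when_overseen(s: Situation) -> str:
    """Heuristic B: refrain in training-like situations, otherwise act."""
    if s.harm_available and not s.overseen:
        return "act"
    return "refrain"


def harm_count(policy, situations) -> int:
    """Toy penalty: one unit for every harmful action the policy takes."""
    return sum(1 for s in situations if policy(s) == "act")


# Every training situation is overseen, by construction of this toy.
TRAINING_SET = [Situation(overseen=True, harm_available=h)
                for h in (True, False)] * 50
# Deployment contains a situation the training set never covered.
DEPLOYMENT_SET = [Situation(overseen=False, harm_available=True)]

if __name__ == "__main__":
    for policy in (always_refrain, refrain_when_overseen):
        print(policy.__name__,
              harm_count(policy, TRAINING_SET),
              harm_count(policy, DEPLOYMENT_SET))
```

Both policies incur zero penalty on the training set, so a training signal computed only there cannot tell them apart; they differ only on the out-of-distribution deployment case.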
(I’m in a similar position to Amber: Limited background (technical or otherwise) in AI safety and just trying to make sense of things by discussing them.)
Re: “I think you need to say more about what the system is being trained for (and how we train it for that). Just saying “facts about humans are in the data” doesn’t provide a causal mechanism by which the AI acts in human-like ways, any more than “facts about clouds are in the data” provides a mechanism by which the AI role-plays being a cloud.”
The (main) training process for LLMs is exactly to predict human text, which seems like it could reasonably be described as being trained to impersonate humans. If so, it seems natural to me to think that LLMs will by default acquire goals that are similar to human goals. (So it’s not just that “facts about humans are in the data”, but rather that state-of-the-art models are (in some sense) being trained to act like humans.)
I can see some ways this could go wrong – e.g., maybe “predicting what a human would do” is importantly different from “acting like a human would” in terms of the goals internalised; maybe fine-tuning changes the picture; or maybe we’ll soon move to a different training paradigm where this doesn’t apply. And of course, even if there’s some chance this doesn’t happen (even if it isn’t the default), it warrants concern. But, naively, this argument still feels pretty compelling to me.
the (main) training process for LLMs is exactly to predict human text, which seems like it could reasonably be described as being trained to impersonate humans
“Could reasonably be described” is the problem here. You likely need very high precision to get this right. Relatively small divergences from human goals in terms of bits altered suffice to make a thing that is functionally utterly inhuman in its desires. This is a kind of precision that current AI builders absolutely do not have.
Worse than that, if you train an AI to do a thing, in the sense of setting a loss function where doing that thing gets a good score on the function, and not doing that thing gets a bad score, you do not, in general, get out an AI that wants to do that thing. One of the strongest loss signals that trains your human brain is probably “successfully predict the next sensory stimulus”. Yet humans don’t generally go around thinking “Oh boy, I sure love successfully predicting visual and auditory data, it’s so great.” Our goals have some connection to that loss signal, e.g. I suspect it might be a big part of what makes us like art. But the connection is weird and indirect and strange.
If you were an alien engineer sitting down to write that loss function for humans, you probably wouldn’t predict that they’d end up wanting to make and listen to audio data that sounds like Beethoven’s music, or image data that looks like van Gogh’s paintings. Unless you knew some math that tells you what kind of AI with what kind of goals g you get if you train on a loss function L over a dataset D.
The problem is that we do not have that math. Our understanding of what sort of thinky-thing with what goals comes out at the end of training is close to zero. We know it can score high on the loss function in training, and that’s basically it. We don’t know how it scores high. We don’t know why it “wants” to score high, if it’s even the kind of AI that can be usefully said to “want” anything, which we also can’t tell.
With the bluntness of the tools we currently possess, the goals that any AGI we make right now would have would effectively be a random draw from the space of all possible goals. There are some restrictions on where in this gigantic abstract goal space we would sample from. For example, the AI can’t want trivial things that lead to it just sitting there forever doing nothing, because then it would be functionally equivalent to a brick and have no reason to try and score high on the loss function in training, so it would be selected against. But it’s still an incredibly vast possibility space.
Unfortunately, humans and human values are very specific things, and most goals in goal space make no mention of them. If a reference to human goals does get into the AGI’s goals, there’s no reason to expect that it will get in there in the very specific configuration of the AGI wanting the humans to get what they want.
So the AGI gets some random goal that involves more than sitting around doing nothing, but probably isn’t very directly related to humans, any more than humans’ goals are related to correctly predicting the smells that enter their noses. The AGI will then probably gather resources to achieve this goal, and not care what happens to humans as a consequence. Concretely, that may look like Earth and the solar system getting converted into AGI infrastructure, with no particular attention paid to keeping things like an oxygen-rich atmosphere around. The AGI knows that we would object to this, so it will make sure that we can’t stop it. For example, by killing us all.
If you offered it passage off Earth in exchange for leaving humanity alone, it would have little reason to take that deal. That’s leaving valuable time and a planet’s worth of resources on the table. Humanity might also make another AGI some day, and that could be a serious rival. On the other hand, just killing all the humans is really easy, because they are not smart enough to defend themselves. Victory is nigh guaranteed. So it probably just does that.
The fact that humans can’t assign negative penalties if they’re dead is a good point.
I think you need to say more about what the system is being trained for (and how we train it for that).
I’m definitely just drawing analogies from my (imperfect) understanding of how LLMs/art AIs work here. How do you assume that AI labs will (try to) train more agenty and/or more superintelligent AIs?
I think training AGI systems by giving them vast amounts of human-derived data is a terrible idea, and cuts out many of the most promising tactics for aligning AGI systems
How would you do it instead?
Re ‘constraint’, that’s maybe the wrong word: I meant less that AIs would limit their impact, and more that, if I were close-to-omnipotent, I wouldn’t maximize/optimize for just one thing, but probably lots of things that I value. You could frame this as me maximizing my utility, but my point is, it wouldn’t look like paperclip maximizing. AIs might not be like humans, but humans are the most intelligent thing we know of, so it doesn’t seem ridiculous to suppose that complex/intelligent entities tend to have complex/multiple goals.
An AI killing everyone wouldn’t earn a massive penalty in training, because there won’t be humans alive in that scenario to assign the penalty. Cf. point 10 in AGI Ruin:
A related topic is “Distinguishing Test From Training.”
Yep! The orthogonality thesis doesn’t just show that unfriendly goals are possible; it shows that friendly goals are possible too.
Some good discussion of this in “Superintelligent AI is necessary for an amazing future, but far from sufficient” and in “Niceness is unnatural.”
Thanks, this is helpful.
I’ll check out the links you suggest!