(I’m in a similar position to Amber: Limited background (technical or otherwise) in AI safety and just trying to make sense of things by discussing them.)
Re: “I think you need to say more about what the system is being trained for (and how we train it for that). Just saying “facts about humans are in the data” doesn’t provide a causal mechanism by which the AI acts in human-like ways, any more than “facts about clouds are in the data” provides a mechanism by which the AI role-plays being a cloud.”
The (main) training process for LLMs is exactly to predict human text, which seems like it could reasonably be described as being trained to impersonate humans. If so, it seems natural to me to think that LLMs will by default acquire goals that are similar to human goals. (So it’s not just that “facts about humans are in the data”, but rather that state-of-the-art models are (in some sense) being trained to act like humans.)
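To make “predict human text” concrete, here’s a minimal sketch of the standard next-token prediction objective, written in PyTorch. (The tiny stand-in model and random token ids below are placeholders for illustration, not anyone’s real setup.)

```python
import torch
import torch.nn.functional as F

vocab_size = 100

# Toy stand-in for a language model: token ids -> logits over the vocabulary.
# A real LLM would be a large transformer, but the training objective is the same.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),
    torch.nn.Linear(32, vocab_size),
)

# A batch of human-written text, already tokenized (random ids as a placeholder).
tokens = torch.randint(0, vocab_size, (8, 128))  # (batch, sequence_length)

# Next-token prediction: given tokens[:, :-1], score the probability the model
# assigns to the token a human actually wrote next, tokens[:, 1:].
logits = model(tokens[:, :-1])                   # (batch, seq_len - 1, vocab)
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)

# This single number is the whole training signal: how closely the model’s
# predictions match what humans actually wrote.
print(loss.item())
```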
I can see some ways this could go wrong – e.g., maybe “predicting what a human would do” is importantly different from “acting like a human would” in terms of the goals internalised; maybe fine-tuning changes the picture; or maybe we’ll soon move to a different training paradigm where this doesn’t apply. And of course, even if there’s some chance this doesn’t happen (even if it isn’t the default), it warrants concern. But, naively, this argument still feels pretty compelling to me.
the (main) training process for LLMs is exactly to predict human text, which seems like it could reasonably be described as being trained to impersonate humans
“Could reasonably be described” is the problem here. You likely need very high precision to get this right. Relatively small divergences from human goals, measured in bits altered, suffice to make a thing that is functionally utterly inhuman in its desires. This is a kind of precision that current AI builders absolutely do not have.
Worse than that, if you train an AI to do a thing, in the sense of setting a loss function where doing that thing gets a good score on the function, and not doing that thing gets a bad score, you do not, in general, get out an AI that wants to do that thing. One of the strongest loss signals that trains your human brain is probably “successfully predict the next sensory stimulus”. Yet humans don’t generally go around thinking “Oh boy, I sure love successfully predicting visual and auditory data, it’s so great.” Our goals have some connection to that loss signal, e.g. I suspect it might be a big part of what makes us like art. But the connection is weird and indirect and strange.
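One way to see the gap: many different programs can score identically on the training loss while behaving, and “wanting”, arbitrarily different things everywhere the loss never looked. A toy illustration in plain Python (the data and “models” here are made up purely for the example):

```python
# Made-up training data that looks like y = x**2.
train_x = [0, 1, 2, 3]
train_y = [0, 1, 4, 9]

def model_a(x):
    return x ** 2

def model_b(x):
    # Identical to model_a on the training inputs, wildly different elsewhere.
    return x ** 2 if x <= 3 else -1_000_000

def loss(model):
    """Squared error on the training data -- the only thing training ever measures."""
    return sum((model(x) - y) ** 2 for x, y in zip(train_x, train_y))

print(loss(model_a), loss(model_b))   # 0 0     -> the loss cannot tell them apart
print(model_a(10), model_b(10))       # 100 -1000000 -> off the training data they diverge completely
```

The loss only ever sees behaviour on the training data; whatever internal objective produces that behaviour is left unconstrained.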
If you were an alien engineer sitting down to write that loss function for humans, you probably wouldn’t predict that they’d end up wanting to make and listen to audio data that sounds like Beethoven’s music, or image data that looks like van Gogh’s paintings. Unless you knew some math that tells you what kind of AI with what kind of goals g you get if you train on a loss function L over a dataset D.
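Written out, that missing math would be something like the following (this is just my own shorthand for the point, not an established result):

```latex
% What training actually pins down: parameters that score well on the loss over the data.
\theta^{*} \;=\; \arg\min_{\theta}\; \mathbb{E}_{x \sim D}\!\left[\, L\!\left(f_{\theta}, x\right) \,\right]

% What we would need: a known map from the training setup to the goals of the
% resulting system, so that L and D could be chosen to hit a target goal g.
G : (L, D) \;\longmapsto\; g
```

We have decent tools for the first line; for the second line, the map G, we have essentially nothing.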
The problem is that we do not have that math. Our understanding of what sort of thinky-thing with what goals comes out at the end of training is close to zero. We know it can score high on the loss function in training, and that’s basically it. We don’t know how it scores high. We don’t know why it “wants” to score high, if it’s even the kind of AI that can usefully be said to “want” anything. We can’t tell that either.
With the bluntness of the tools we currently possess, the goals that any AGI we make right now would have would effectively be a random draw from the space of all possible goals. There are some restrictions on where in this gigantic abstract goal space we would sample from; for example, the AI can’t want trivial things that lead to it just sitting there forever doing nothing, because then it would be functionally equivalent to a brick, have no reason to try to score high on the loss function in training, and so be selected against. But it’s still an incredibly vast possibility space.
Unfortunately, humans and human values are very specific things, and most goals in goal space make no mention of them. If a reference to human goals does get into the AGI’s goals, there’s no reason to expect that it will get in there in the very specific configuration of the AGI wanting the humans to get what they want.
So the AGI gets some random goal that involves more than sitting around doing nothing, but probably isn’t very directly related to humans, any more than humans’ goals are related to correctly predicting the smells that enter their noses. The AGI will then probably gather resources to achieve this goal, and not care what happens to humans as a consequence. Concretely, that may look like Earth and the solar system getting converted into AGI infrastructure, with no particular attention paid to keeping things like an oxygen-rich atmosphere around. The AGI knows that we would object to this, so it will make sure that we can’t stop it. For example, by killing us all.
If you offered it passage off Earth in exchange for leaving humanity alone, it would have little reason to take that deal. That’s leaving valuable time and a planet’s worth of resources on the table. Humanity might also make another AGI some day, and that could be a serious rival. On the other hand, just killing all the humans is really easy, because they are not smart enough to defend themselves. Victory is nigh guaranteed. So it probably just does that.