A rebuttal of the paperclip maximizer argument
I was talking to someone (whom I'm leaving anonymous) about AI safety, and they said that the AI alignment problem is a joke (to put it mildly). They said that it won't actually be that hard to teach AI systems the subtleties of human norms, because language models contain normative knowledge. I don't know if I endorse this claim, but I found it quite convincing, so I'd like to share it here.
In the classic naive paperclip maximizer scenario, we assume there's a goal-directed AI system, and its human boss tells it to "maximize paperclips." At this point, it creates a plan to turn all of the iron atoms on Earth's surface into paperclips. The AI knows everything about the world, including the fact that blood hemoglobin and cargo ships contain iron. However, it doesn't know that it's wrong to kill people and destroy cargo ships for the purpose of obtaining iron. So it starts going around killing people and destroying cargo ships to obtain as much iron as possible for paperclip manufacturing.
I think most of us assume that the AI system, when directed to "maximize paperclips," would adopt an objective function that says to create as many paperclips as superhumanly possible, even at the cost of destroying human lives and economic assets. However, I see two issues with this assumption:
The first issue is that the argument assumes the system would interpret the term "maximize" extremely literally, in a way that no reasonable human would interpret it. (This is the core of the paperclip argument, but I'm trying to show that it's a weakness.) Most modern natural language processing (NLP) systems are based on statistical word embeddings, which capture how words are used in the source texts rather than their strict mathematical definitions (if they even have one). If the AI system interprets commands using a word embedding, it's going to interpret "maximize" roughly the way humans do.
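To make the point about distributional meaning concrete, here is a minimal sketch using made-up toy vectors rather than a trained model; the numbers are purely illustrative, but they show the kind of comparison an embedding-based system performs when it reads "maximize."

```python
# A minimal sketch, using made-up toy vectors, of how an embedding encodes
# "maximize" by its everyday usage rather than by a formal argmax definition.
import numpy as np

# Hypothetical 4-dimensional embeddings; real systems learn hundreds of
# dimensions from co-occurrence statistics in large corpora.
toy_embeddings = {
    "maximize": np.array([0.9, 0.7, 0.1, 0.0]),
    "increase": np.array([0.8, 0.6, 0.2, 0.1]),  # conversational neighbors
    "improve":  np.array([0.7, 0.8, 0.1, 0.1]),
    "argmax":   np.array([0.1, 0.0, 0.9, 0.8]),  # formal mathematical usage
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Under these toy vectors, "maximize" lands closer to its conversational
# neighbors than to the strictly mathematical notion.
for word in ("increase", "improve", "argmax"):
    print(word, round(cosine(toy_embeddings["maximize"], toy_embeddings[word]), 3))
```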
Ben Garfinkel has proposed the "process orthogonality thesis": the idea that, for the classic AI alignment argument to work, "the process of imbuing a system with capabilities and the process of imbuing a system with goals" would have to be orthogonal. But this point shows that the process of giving the system capabilities (in this case, knowing that iron can be obtained from various everyday objects) and the process of giving it a goal (in this case, making paperclips) may not be orthogonal. An AI system based on contemporary language models seems much more likely to learn that "maximize X" means something more like "maximize X subject to common-sense constraints Y1, Y2, ..." than to learn that human blood can be turned into iron for paperclips. (It's also possible that it'll learn neither, which means it might take "maximize" too literally but won't figure out that it can make paperclips from humans.)
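Here is a toy sketch of the difference between the literal reading of "maximize X" and the "maximize X subject to common-sense constraints" reading; the candidate plans and their numbers are invented for illustration.

```python
# A toy contrast between literal maximization and maximization subject to
# common-sense constraints. The plans and numbers are made up.
candidate_plans = [
    {"name": "run the factory harder",  "clips": 1_000_000,  "harms_humans": False},
    {"name": "melt down cargo ships",   "clips": 50_000_000, "harms_humans": True},
    {"name": "harvest iron from blood", "clips": 60_000_000, "harms_humans": True},
]

# Literal reading: pick whatever yields the most paperclips.
literal = max(candidate_plans, key=lambda p: p["clips"])

# Human-like reading: apply the constraints first, then maximize.
acceptable = [p for p in candidate_plans if not p["harms_humans"]]
humanlike = max(acceptable, key=lambda p: p["clips"])

print(literal["name"])    # harvest iron from blood
print(humanlike["name"])  # run the factory harder
```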
The second issue is that the argument assumes the system would make a special case for verbal commands that can be interpreted as objective functions, and would set out to optimize the corresponding objective whenever possible. At a minimum, the AI system needs to convert each verbal command into a plan to execute it, somewhat like a query plan in relational databases. But not every plan for executing a verbal command would involve maximizing an objective function, and using objective functions in execution plans is probably dangerous for the reason the classic paperclip argument tries to highlight, as well as overkill for most commands.
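As a rough illustration of the query-plan analogy, here is a hypothetical sketch in which a command is parsed into a bounded sequence of steps rather than handed to an open-ended optimizer; the function and step names are invented for the example.

```python
# A hypothetical command interpreter: the verbal command becomes a bounded
# plan of concrete steps, and no unbounded objective function ever appears.
from dataclasses import dataclass
import re

@dataclass
class PlanStep:
    action: str
    amount: int

def plan_for_command(command):
    """Turn a verbal command into a concrete, bounded plan (like a query plan)."""
    match = re.search(r"make (\d+) paperclips", command)
    if match:
        n = int(match.group(1))
        return [
            PlanStep("order_wire_spools", max(1, n // 1000)),
            PlanStep("run_clip_machine", n),
            PlanStep("report_completion", 1),
        ]
    # A command without an explicit quantity gets a sensible default, rather
    # than being reinterpreted as "produce as many as physically possible".
    return [PlanStep("run_clip_machine", 100), PlanStep("report_completion", 1)]

print(plan_for_command("make 500 paperclips"))
```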
Ben gives a great example of how the "alignment problem" might look different than we expect:
The case of the house-cleaning robot
Problem: We don't know how to build a simulated robot that cleans houses well
Available techniques aren't suitable:
Simple hand-coded reward functions (e.g. dust minimization) won't produce the desired behavior (see the toy sketch after this excerpt)
We don't have enough data (or sufficiently relevant data) for imitation learning
Existing reward modeling approaches are probably insufficient
This is sort of an "AI alignment problem," insofar as techniques currently classified as "alignment techniques" will probably be needed to solve it. But it also seems very different from the AI alignment problem as classically conceived.
...
One possible interpretation: If we can't develop "alignment" techniques soon enough, we will instead build powerful and destructive dust-minimizers
A more natural interpretation: We won't have highly capable house-cleaning robots until we make progress on "alignment" techniques
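To make the dust-minimization bullet concrete, here is a toy sketch of a hand-coded reward; the state fields are hypothetical, and the point is only that nothing in the reward encodes the constraints a human cleaner takes for granted.

```python
# A toy hand-coded reward for a house-cleaning robot: it scores dust removal
# and nothing else. The state dictionaries are hypothetical.
def dust_reward(state):
    return -state["dust_level"]

careful_clean  = {"dust_level": 3, "vases_broken": 0, "owner_annoyed": False}
reckless_clean = {"dust_level": 0, "vases_broken": 4, "owner_annoyed": True}

# The reward prefers the reckless outcome, because the common-sense constraints
# we care about never appear in the objective.
print(dust_reward(careful_clean))   # -3
print(dust_reward(reckless_clean))  #  0
```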
I've concluded that the process orthogonality thesis is less likely to apply to real AI systems than I would have assumed (i.e. I've updated downward), and therefore, the "alignment problem" as originally conceived is less likely to affect AI systems deployed in the real world. However, I don't feel ready to reject all potential global catastrophic risks from imperfectly designed AI (e.g. multi-multi failures), because I'd rather be safe than sorry.
I think it's worth saying that the context of "maximize paperclips" is not one where a person literally says the words "maximize paperclips" or something similar. It's instead an intuitive stand-in for building an AI capable of superhuman levels of optimization: if you set it the task of creating an unbounded number of paperclips, say by specifying a reward function, you'll get it doing things no human would do to maximize paperclips, because humans have competing concerns and will stop when, say, they'd have to kill themselves or their loved ones to make more paperclips.
The objection seems predicated on the interpretation of human language, which is beside the primary point. That is, you could address all the human language interpretation issues and we'd still have an alignment problem; it just might not look literally like building a paperclip maximizer when someone asks the AI to make a lot of paperclips.
I don't think the scenario as described in the post is a good representation of the classic one. It's not that the AI "doesn't know it's wrong". It clearly has a good enough model of the world to predict, e.g., "if a human saw me trying to do this, they would try to stop me". The problem is coding an AI that cares about right and wrong, which is a pretty difficult technical problem. One key part of why it's hard is that the interface for giving an AI goals is not the same interface you'd use to give a human goals.
Note that this is not the same as saying that it's impossible to solve, or that it's obviously much harder than making powerful AI in the first place, just that it's a difficult technical problem and solving it is one significant step towards safe AI. I think this is what Paul Christiano calls intent alignment.
I think it's possible that this issue goes away with powerful language models, if they can give us a way to input a goal via an interface similar to the one we use to instruct a human. And I'm excited about efforts like this one. But I don't think it's at all obvious that this will just happen to work out. For example, GPT-3's true goal is "generate text that is as plausible as possible, based on the text in your training data". It has a natural language interface, and this goal correlates a bit with "do what humans want", but it is not the same thing.
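To spell out what "true goal" means here, the following is a minimal sketch of the standard next-token training objective, with toy probabilities standing in for real model outputs; nothing in it mentions obeying instructions.

```python
# A minimal sketch of a language model's actual training objective: minimize
# the average negative log-likelihood of the observed next tokens.
import math

# Hypothetical probabilities the model assigned to each true next token
# in a tiny training snippet.
probs_of_true_tokens = [0.40, 0.10, 0.65]

loss = -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)
print(round(loss, 3))
# Instruction-following is rewarded only insofar as it helps predict the
# training text; "do what humans want" appears nowhere in the loss.
```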
The second point (about converting verbal commands into objective functions) feels somewhat backwards. Everything AI systems ever do is maximising an objective function, and I'm not aware of any AI safety suggestions that get around this (just ones which have creative objective functions). It's not that they convert verbal commands to an objective function; they already have an objective function, which might capture "obey verbal commands in a sensible way" or it might not. And my read on the paperclip maximising scenario is that "tell the AI to maximise paperclips" really means "encode an objective function that tells it to maximise paperclips".
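A minimal sketch of that distinction, with an invented environment: the reward function is fixed by the designer, and the verbal command is just part of the observation.

```python
# An illustrative RL-style setup: the objective (reward) is hard-coded by the
# designer; "maximize paperclips" is merely text the agent observes.
class PaperclipEnv:
    def __init__(self, command):
        self.command = command   # the verbal command is only an observation
        self.paperclips = 0

    def step(self, action):
        if action == "make_clip":
            self.paperclips += 1
        # This return value is the objective the agent actually optimizes.
        # "Obey verbal commands in a sensible way" appears here only if the
        # designer manages to encode it, which is the hard part.
        return float(self.paperclips)

env = PaperclipEnv("maximize paperclips")
print(env.step("make_clip"))  # reward tracks clip count, not human intent
```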
Personally I think the paperclip maximiser scenario is somewhat flawed, and not a good representation of AI x-risk. I like it because it illustrates the key point of specification gaming: that it's really, really hard to make an objective function that captures "do the things we want you to do". But this is also going to be pretty obvious to the people making AGI, and they probably won't have an objective function as clearly dumb as "maximise paperclips". But whatever they use instead might not be good enough.
By the way, there will be a workshop on Interactive Learning for Natural Language Processing at ACL 2021. I think it will be useful to incorporate the ideas from this area of research into our models of how AI systems that interpret natural-language feedback would work. One example of this kind of research is Blukis et al. (2019).