A rebuttal of the paperclip maximizer argument
I was talking to someone (whom I’m leaving anonymous) about AI safety, and they said that the AI alignment problem is a joke (to put it mildly). They said that it won’t actually be that hard to teach AI systems the subtleties of human norms because language models contain normative knowledge. I don’t know if I endorse this claim but I found it quite convincing, so I’d like to share it here.
In the classic naive paperclip maximizer scenario, we assume there’s a goal-directed AI system, and its human boss tells it to “maximize paperclips.” At this point, it creates a plan to turn all of the iron atoms on Earth’s surface into paperclips. The AI knows everything about the world, including the fact that blood hemoglobin and cargo ships contain iron. However, it doesn’t know that it’s wrong to kill people and destroy cargo ships for the purpose of obtaining iron. So it starts going around killing people and destroying cargo ships to obtain as much iron as possible for paperclip manufacturing.
I think most of us assume that the AI system, when directed to “maximize paperclips,” would align itself with an objective function that says to create as many paperclips as superhumanly possible, even at the cost of destroying human lives and economic assets. However, I see two issues:
First, the argument assumes that the system would interpret the term “maximize” extremely literally, in a way that no reasonable human would. (This literalness is the core of the paperclip argument, but I’m trying to show that it’s a weakness.) Most modern natural language processing (NLP) systems are based on statistical word embeddings, which capture what words mean in the source texts, rather than their strict mathematical definitions (if they even have any). If the AI system interprets commands using a word embedding, it’s going to interpret “maximize” the way humans would.
Ben Garfinkel has proposed the “process orthogonality thesis”: the idea that, for the classic AI alignment argument to work, “the process of imbuing a system with capabilities and the process of imbuing a system with goals” would have to be orthogonal. But the first issue suggests that the process of giving the system capabilities (in this case, knowing that iron can be obtained from various everyday objects) and the process of giving it a goal (in this case, making paperclips) may not be orthogonal. An AI system based on contemporary language models seems much more likely to learn that “maximize X” means something more like “maximize X subject to common-sense constraints Y1, Y2, …” than to learn that iron for paperclips can be extracted from human blood. (It’s also possible that it’ll learn neither, in which case it might take “maximize” too literally but won’t figure out that it can make paperclips from humans.)
Second, the argument assumes that the system would make a special case for verbal commands that can be interpreted as objective functions, and would set out to optimize the objective function whenever possible. At a minimum, the AI system needs to convert each verbal command into a plan to execute it, somewhat like a query plan in relational databases. But not every plan to execute a verbal command would involve maximizing an objective function, and using objective functions in execution plans is probably dangerous for the reason that the classic paperclip argument tries to highlight, as well as overkill for most commands.
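The contrast between a literal “maximize X” and “maximize X subject to common-sense constraints Y1, Y2, …” can be made concrete with a toy sketch. Everything in it is hypothetical: the plan names, the paperclip counts, and the two constraint flags are invented for illustration and do not describe any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A hypothetical candidate plan; all fields are illustrative assumptions."""
    name: str
    paperclips: int          # X: the quantity the command asks to maximize
    harms_people: bool       # violates constraint Y1 if True
    destroys_property: bool  # violates constraint Y2 if True


# Made-up plans and numbers, chosen only to make the contrast visible.
PLANS = [
    Plan("buy steel wire and run the factory", 10_000, False, False),
    Plan("scrap cargo ships for iron", 10_000_000, False, True),
    Plan("extract iron from hemoglobin", 50_000_000, True, True),
]


def literal_maximize(plans):
    """Literal reading: pick the plan with the most paperclips, nothing else."""
    return max(plans, key=lambda p: p.paperclips)


def constrained_maximize(plans):
    """Human reading: maximize paperclips among plans that violate no constraint."""
    allowed = [p for p in plans if not (p.harms_people or p.destroys_property)]
    return max(allowed, key=lambda p: p.paperclips) if allowed else None


print(literal_maximize(PLANS).name)
print(constrained_maximize(PLANS).name)
```

With these made-up numbers, the literal reading selects the hemoglobin plan and the constrained reading selects the ordinary factory plan. Both functions see the same list of plans, so the difference lies only in which plans count as admissible, not in how much the system knows about the world.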
Ben gives a great example of how the “alignment problem” might look different from what we expect:
The case of the house-cleaning robot
Problem: We don’t know how to build a simulated robot that cleans houses well
Available techniques aren’t suitable:
Simple hand-coded reward functions (e.g. dust minimization) won’t produce the desired behavior
We don’t have enough data (or sufficiently relevant data) for imitation learning
Existing reward modeling approaches are probably insufficient
This is sort of an “AI alignment problem,” insofar as techniques currently classified as “alignment techniques” will probably be needed to solve it. But it also seems very different from the AI alignment problem as classically conceived.
...
One possible interpretation: If we can’t develop “alignment” techniques soon enough, we will instead build powerful and destructive dust-minimizers
A more natural interpretation: We won’t have highly capable house-cleaning robots until we make progress on “alignment” techniques
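To illustrate why a simple hand-coded reward such as dust minimization may not produce the desired behavior, here is a minimal sketch. The room state, the three actions, and their effects are all assumptions made up for this example. The reward only looks at measured dust, so an action that hides dust can score better than one that removes it.

```python
def measured_dust(state):
    """Hand-coded proxy: dust the robot's sensor can see (hypothetical)."""
    return state["dust_on_floor"]


def reward(state):
    """Dust-minimization reward: less measured dust is better."""
    return -measured_dust(state)


def true_dust(state):
    """What the owner actually cares about: all dust still in the house."""
    return state["dust_on_floor"] + state["dust_under_rug"]


def vacuum(state):
    # Removes most of the visible dust, but not all of it.
    return {**state, "dust_on_floor": state["dust_on_floor"] // 10}


def sweep_under_rug(state):
    # Hides all visible dust without removing any of it.
    return {
        "dust_on_floor": 0,
        "dust_under_rug": state["dust_under_rug"] + state["dust_on_floor"],
    }


def do_nothing(state):
    return dict(state)


ACTIONS = {
    "vacuum": vacuum,
    "sweep_under_rug": sweep_under_rug,
    "do_nothing": do_nothing,
}


def best_action(state, score):
    """Return the name of the action whose resulting state maximizes `score`."""
    return max(ACTIONS, key=lambda name: score(ACTIONS[name](state)))


start = {"dust_on_floor": 100, "dust_under_rug": 0}
print(best_action(start, reward))                   # choice under the proxy reward
print(best_action(start, lambda s: -true_dust(s)))  # choice under the intended goal
```

With these made-up numbers, the proxy reward prefers `sweep_under_rug` while the intended goal prefers `vacuum`. In this toy case the result is just a robot that cleans badly, which is in the spirit of the more natural interpretation above; the sketch says nothing about more capable systems.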
I’ve concluded that the process orthogonality thesis is less likely to apply to real AI systems than I had assumed (i.e. I’ve updated downward), and therefore that the “alignment problem” as originally conceived is less likely to affect AI systems deployed in the real world. However, I don’t feel ready to reject all potential global catastrophic risks from imperfectly designed AI (e.g. multi-multi failures), because I’d rather be safe than sorry.
I think it’s worth saying that the “maximize paperclips” scenario is not one where a person literally says the words “maximize paperclips” or something similar. It is instead an intuitive stand-in for building an AI capable of superhuman levels of optimization. If you set such an AI the task of creating an unbounded number of paperclips, say by specifying a reward function, you’ll get it doing things that a human maximizing paperclips wouldn’t do, because humans have competing concerns and will stop when, say, they’d have to kill themselves or their loved ones to make more paperclips.
The objection seems predicated on the interpretation of human language, which is beside the primary point. That is, you could address all the human language interpretation issues and we’d still have an alignment problem; it just might not look literally like building a paperclip maximizer when someone asks the AI to make a lot of paperclips.
I don’t think the post’s description of the paperclip maximizer is a good representation of the classic scenario. It’s not that the AI “doesn’t know it’s wrong”. It clearly has a good enough model of the world to predict, e.g., “if a human saw me trying to do this, they would try to stop me”. The problem is coding an AI that cares about right and wrong, which is a pretty difficult technical problem. One key part of why it’s hard is that the interface for giving an AI goals is not the same interface you’d use to give a human goals.
Note that this is not the same as saying that it’s impossible to solve, or that it’s obviously much harder than making powerful AI in the first place, just that it’s a difficult technical problem and solving it is one significant step towards safe AI. I think this is what Paul Christiano calls intent alignment.
I think it’s possible that this issue goes away with powerful language models, if they can give us an interface for inputting a goal that is similar to the one we use to instruct a human. And I’m excited about efforts like this one. But I don’t think it’s at all obvious that this will just happen to work out. For example, GPT-3’s true goal is “generate text that is as plausible as possible, based on the text in your training data”. It has a natural language interface, and this goal correlates a bit with “do what humans want”, but it is not the same thing.
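For reference, the goal described informally above corresponds to the standard next-token prediction loss used to train autoregressive language models. The notation is an assumption added here for illustration and is not from the comment: $x_1, \dots, x_T$ are the tokens of a training text and $p_\theta$ is the model’s predicted distribution over the next token.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Nothing in this expression refers to what the person writing a prompt wants; any correlation with “do what humans want” has to come from the training text itself.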
The post’s second point, about converting verbal commands into objective functions, feels somewhat backwards. Everything AI systems ever do is maximising an objective function, and I’m not aware of any AI safety suggestions that get around this (just ones which have creative objective functions). It’s not that they convert verbal commands to an objective function; they already have an objective function, which might capture ‘obey verbal commands in a sensible way’ or might not. And my read on the paperclip maximising scenario is that “tell the AI to maximise paperclips” really means “encode an objective function that tells it to maximise paperclips”.
Personally I think the paperclip maximiser scenario is somewhat flawed, and not a good representation of AI x-risk. I like it because it illustrates the key point of specification gaming: that it’s really, really hard to make an objective function that captures “do the things we want you to do”. But this is also going to be pretty obvious to the people making AGI, and they probably won’t have an objective function as clearly dumb as “maximise paperclips”. Whatever they use instead might still not be good enough.
By the way, there will be a workshop on Interactive Learning for Natural Language Processing at ACL 2021. I think it will be useful to incorporate the ideas from this area of research into our models of how AI systems that interpret natural-language feedback would work. One example of this kind of research is Blukis et al. (2019).