When an AI wins a game against a human, that AI has usually trained by playing that game against itself millions of times. When an AI recognizes that an image contains a cat, it’s probably been trained on thousands of cat photos. So if we want to teach an AI about human preferences, we’ll probably need lots of data to train it. And who is most qualified to provide data about human preferences? Social scientists! In this talk from EA Global 2018: London, Amanda Askell explores ways that social science might help us steer advanced AI in the right direction.
A transcript of Amanda’s talk is below, which CEA has lightly edited for clarity. You can also read this talk on effectivealtruism.org, or watch it on YouTube.
The Talk
Here’s an overview of what I’m going to be talking about today. First, I’m going to talk a little bit about why learning human values is difficult for AI systems. Then I’m going to explain to you the safety via debate method, which is one of the methods that OpenAI’s currently exploring for helping AI to robustly do what humans want. And then I’m going to talk a little bit more about why I think this is relevant to social scientists, and why I think social scientists—in particular, people like experimental psychologists and behavioral scientists—can really help with this project. And I will give you a bit more detail about how they can help, towards the end of the talk.
Learning human values is difficult. We want to train AI systems to robustly do what humans want. And in the first instance, we can just imagine this being what one person wants. And then ideally we can expand it to doing what most people would consider good and valuable. But human values are very difficult to specify, especially with the kind of precision that is required of something like a machine learning system. And I think it’s really important to emphasize that this is true even in cases where there’s moral consensus, or consensus about what people want in a given instance.
So, take a principle like “do not harm someone needlessly.” I think we can be really tempted to think something like: “I’ve got a computer, and so I can just write into the computer, ‘do not harm someone needlessly’”. But this is a really underspecified principle. Most humans know exactly what it means; they know exactly when harming someone is needless. So, if you’re shaking someone’s hand, and you push them over, we think this is needless harm. But if you see someone in the street who’s about to be hit by a car, and you push them to the ground, we think that’s not an instance of needless harm.
Humans have a pretty good way of knowing when this principle applies and when it doesn’t. But for a formal system, there’s going to be a lot of questions about precisely what’s going on here. So, one question this system may ask is, how do I recognize when someone is being harmed? It’s very easy for us to see things like stop signs, but when we’re building self-driving cars, we don’t just program in something like, “stop at stop sign”. We instead have to train them to be able to recognize an instance of a stop sign.
And then the principle that says that you shouldn’t harm someone needlessly employs the notion that we understand when harm is and isn’t appropriate, whereas there are a lot of questions under the surface like, when is harm justified? What is the rule for all plausible scenarios in which I might find myself? These are things that you need to specify if you want your system to be able to work in all of the cases that you want it to be able to work in.
I think that this is an important point to internalize. It’s easy for humans to identify, and to pick up, say, a glass. But training an ML system to perform the same task requires a lot of data. And this is true of a lot of tasks that humans might intuitively think are easy, and we shouldn’t then just transfer that intuition to the case of machine learning systems. And so when we’re trying to teach human values to any AI system, it’s not that we’re just looking at edge cases, like trolley problems. We’re really looking at core cases of making sure that our ML systems understand what humans want them to do, in the everyday sense.
There are many approaches to training an AI to do what humans want. One way is through human feedback. You might think that humans could, say, demonstrate a desired behavior for an AI to replicate. But there are some behaviors it’s just too difficult for humans to demonstrate. So you might think that instead a human can say whether they approve or disapprove of a given behavior, but this might not work too well, either. When we learn what humans want this way, we can plot the reward function as predicted by the human against AI strength. And when AI strength reaches the superhuman level, it becomes really hard for humans to give the right reward function.
As AI capabilities surpass the human level, the decisions and behavior of the AI system just might be too complex for the human to judge. So imagine agents that control, say, a large set of industrial robots. I may just not be able to evaluate whether those robots are doing a good job overall; it’d be extremely difficult for me to do so.
And so the concern is that when behavior becomes much more complex and much more large scale, it becomes really hard for humans to be able to judge whether an AI agent is doing a good job. And that’s why you may expect this drop-off. And so this is a kind of scalability worry about human feedback. So what ideally needs to happen instead is that, as AI strength increases, the reward function predicted by the human is also able to keep pace.
So how do we achieve this? One of the things that we want to do here is to try and break down complex questions and complex tasks into simpler components. Like breaking down the task of having all of these industrial robots perform a complex set of functions that comes together to make something useful, into some smaller set of tasks and components that humans are able to judge.
So here is a big question. And the idea is that the overall tree might be too hard for humans to fully check, but it can be decomposed into these elements, such that at the very bottom level, humans can check these things.
So “how should a large set of industrial robots be organized to do task x” would be an example of a big question: it’s a really complex task, but there are some things that are checkable by humans. So we could decompose this task so that we’re asking a human: if one of the robots performs this small action, will the result be this small outcome? And that’s something that humans can check.
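To make the shape of that decomposition concrete, here is a minimal, purely illustrative sketch in Python. The helpers (decompose, human_can_check, ask_human, combine) are hypothetical stand-ins for whatever mechanism actually splits a big question into human-checkable pieces; nothing here is taken from a real system.

```python
# Illustrative sketch only: each helper below is a hypothetical placeholder
# for whatever mechanism splits a big question into smaller, human-checkable
# pieces and then recombines the answers.
from typing import List


def human_can_check(question: str) -> bool:
    """Hypothetical: is this question small enough for a human to judge directly?"""
    raise NotImplementedError


def ask_human(question: str) -> str:
    """Hypothetical: get a human's answer to a small, checkable question."""
    raise NotImplementedError


def decompose(question: str) -> List[str]:
    """Hypothetical: split a big question into smaller sub-questions."""
    raise NotImplementedError


def combine(question: str, sub_answers: List[str]) -> str:
    """Hypothetical: assemble sub-answers into an answer to the big question."""
    raise NotImplementedError


def answer(question: str) -> str:
    # At the leaves of the tree, humans can check answers directly.
    if human_can_check(question):
        return ask_human(question)
    # Otherwise recurse: the big question is answered by combining answers to
    # sub-questions, each of which eventually bottoms out in something a
    # human can check.
    return combine(question, [answer(sub) for sub in decompose(question)])
```

The point is just the recursive structure: the big question never has to be judged directly, only the leaves.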
So that’s an example in the case of industrial robots accomplishing some task. In the case of doing what humans want more generally, a big question is, what do humans want?
A much smaller question, if you can manage to decompose this, is something like: Is it better to save 20 minutes of someone’s time, or to save 10 minutes of their time? If you imagine some AI agent that’s meant to assist humans, this is a fact that we can definitely check. Even though I can’t tell my assistant AI exactly everything that I want, I can tell it that I’d rather it save 20 minutes of my time than save 10 minutes of my time.
One of the key issues is that, with current ML systems, we need to train on a lot of data from humans. So if you imagine that we want humans to actually give this kind of feedback on these kinds of ground-level claims or questions, then we’re going to have to train on a lot of data from people.
To give some examples, simple image classifiers train on thousands of images. These are ones you can make yourself, and you’ll see the datasets are pretty large. AlphaGo Zero played nearly 5 million games of Go during its training. OpenAI Five trains on 180 years of Dota 2 games per day. So this gives you a sense of how much data you need to train these systems. So if we are using current ML techniques to teach AI human values, we can’t rule out needing millions to tens of millions of short interactions from humans as the data that we’re using.
So earlier I talked about human feedback, where I was assuming that we were asking humans questions. We could just ask humans really simple things like, do you prefer to eat an omelette or 1000 hot dogs? Or, is it better to provide medicine or books to this particular family? One way that we might think that we can get more information from the data that we’re able to gather is by finding reasons that humans have for the answers that they give. So if you manage to learn that humans generally prefer to eat a certain amount per meal, you can rule out a large class of questions you might ever want to ask people. You’re never going to ask them, do you prefer to eat an omelette or 1000 hot dogs? Because you know that humans just generally don’t like to eat 1000 hot dogs in one meal, except in very strange circumstances.
And we also know facts like, humans prioritize necessary health care over mild entertainment. So this might mean that, if you see a family that is desperately in need of some medicine, you just know that you’re not going to say, “Hey, should I provide them with an entertaining book, or this essential medicine?” So there’s a sense in which when you can identify the reasons that humans are giving for their answers, this lets you go beyond, and learn faster what they’re going to say in a given circumstance about what they want. It’s not to say that you couldn’t learn the same things by just asking people questions, but rather if you can find a quicker way to identify reasons, then this could be much more scalable.
Debate is a proposed method, which is currently being explored, for trying to learn human reasons. So, to give you a definition of a debate here, the idea is that two AI agents are going to be given a question, and they take turns making short statements, and a human judge is at the end, who chooses which of the statements gave them the most true, valuable information. It’s worth noting that this is quite dissimilar from a lot of human debates. With human debates, people might give one answer, but then they might adjust their answer over the course of a debate. Or they might debate with each other in a way that’s more exploratory. They’re gaining information from each other, which then they’re updating on, and then they’re feeding that back into the debate.
With AI debates, you’re not doing it for information value. So it’s not going to have the same exploratory component. Instead, you would hopefully see the agents explore a path kind of like this.
So imagine I want my AI agents to decide which bike I should buy. I don’t want to have to go and look up all the Amazon reviews, etc. In a debate, I might get something like, “You should buy the red road bike” from the first agent. Suppose that blue disagrees with it. So blue says “you should buy the blue fixie”. Then red says, “the red road bike is easier to ride on local hills”. And one of the key things to suppose here is that for me, being able to ride on the local hills is very important. It may even overwhelm all other considerations. So, even if the blue fixie is cheaper by $100, I just wouldn’t be willing to pay that. I’d be happy to pay the extra $100 in order to be able to ride on local hills.
And if this is the case, then there’s basically nothing true that the other agent can point to, to convince me to buy the blue fixie, and blue should just say, “I concede”. Now, blue could have lied for example, but if we assume that red is able to point out blue’s lies, we should just expect blue to basically lose this debate. And if it’s explored enough and attempted enough debates, it might just see that, and then say, “Yes, you’ve identified the key reason, I concede.”
And so it’s important to note that we can imagine this being used to identify multiple reasons, but here it has identified a really important reason for me, something that is in fact going to be really compelling in the debate, namely, that it’s easier to ride on local hills.
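To make the structure of a single debate concrete, here is a minimal Python sketch. The Debater and Judge interfaces are hypothetical placeholders, and this is only an illustration of the protocol described above (alternating short statements followed by a human verdict), not OpenAI’s actual implementation.

```python
# Minimal sketch of a single debate: two agents alternate short statements
# and a human judge picks a winner at the end. The Debater and Judge
# interfaces are hypothetical placeholders, not a real API.
from typing import List, Protocol


class Debater(Protocol):
    def statement(self, question: str, transcript: List[str]) -> str: ...


class Judge(Protocol):
    def pick_winner(self, question: str, transcript: List[str]) -> int: ...


def run_debate(question: str, red: Debater, blue: Debater,
               judge: Judge, num_turns: int = 6) -> int:
    """Return 0 if the judge sides with red, 1 if the judge sides with blue."""
    transcript: List[str] = []
    for turn in range(num_turns):
        speaker = red if turn % 2 == 0 else blue
        transcript.append(speaker.statement(question, transcript))
    # The human judge sees only the question and the statements, and chooses
    # which side gave the most true, valuable information.
    return judge.pick_winner(question, transcript)
```

In the bike example, the debate would simply end early with blue conceding; a concession can be treated as just another statement before the verdict.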
Okay. So, training an AI to debate looks something like this. Imagine Alice and Bob are our two debaters, and each node here is a statement made by one of the agents. And so you’re going to see exploration of the tree. So the first path might be this one, and here, say, the human decides that Bob won in that case. Then this is another node, and another node. And so this is the exploration of the debate tree, and you end up with a debate tree that looks a little bit like a game of Go.
When you have an AI training to play Go, it’s exploring lots of different paths down the tree, and then there’s a win or loss condition at the end, which is its feedback. This is basically how it learns to play. With debate, you can imagine the same thing, but where you’re exploring a large tree of debates and humans are assessing whether you win or not. And this is just a way of training up an AI to get better at debate and to eventually identify reasons that humans find compelling.
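Continuing the hypothetical sketch above, the Go analogy suggests a self-play loop in which the judge’s verdict plays the role of the win/loss signal. Here run_debate is the function from the previous sketch, and update is a placeholder for whatever reinforcement learning step would actually be used.

```python
# Hypothetical self-play training loop over debates (uses run_debate from
# the sketch above). The judge's verdict is the only feedback signal,
# analogous to the win/loss signal in Go self-play.
import random


def train_debaters(red, blue, judge, questions, num_games: int = 100_000) -> None:
    for _ in range(num_games):
        question = random.choice(questions)
        winner = run_debate(question, red, blue, judge)  # 0 = red, 1 = blue
        # Placeholder update step: reward the winning side and penalize the
        # losing side. In practice this would be some reinforcement learning
        # update over the explored debate tree.
        red.update(reward=1.0 if winner == 0 else -1.0)
        blue.update(reward=1.0 if winner == 1 else -1.0)
```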
One thesis here that I think is relatively important is something I’ll call the positive amplification thesis, or positive amplification threshold. One thing that we might think, or that seems fairly possible, is that if humans are above some threshold of rationality and goodness, then debate is going to amplify their positive aspects. This is speculative, but it’s a hypothesis that we’re working with. And the idea here is that, if I am somewhat rational and pretty well motivated, I might get some feedback of the form, “Actually, that decision that you made was fairly biased, and I know that you don’t like to be biased, so I want to inform you of that.”
I get informed of that, and I’m like, “Yes, that’s right. Actually, I don’t want to be biased in that respect.” Suppose that the feedback comes from Kahneman and Tversky, and they point out some key cognitive bias that I have. If I’m rational enough, I might say, “Yes, I want to adjust that.” And the signal I feed back in has been improved by virtue of this process. So if we’re somewhat rational, then we can imagine a situation in which all of these positive aspects of us are being amplified through this process.
But you can also imagine a negative amplification. So if people are below this threshold of rationality and goodness, we might worry that debate would amplify these negative aspects. If it turns out we can just be really convinced by appealing to our worst natures, and your system learns to do that, then it could just put that feedback in, and we’d become even less rational and more biased, and so on. So this is an important hypothesis related to work on amplification; if you’re interested in that work, I suggest you take a look at it, but I’m not going to focus on it here.
Okay. So how can social scientists help with this whole project? Hopefully I’ve conveyed some of what I think of as the real importance of the project. It reminds me a little bit of Tetlock’s work on Superforecasters. A lot of social scientists have done work identifying people who are Superforecasters, where they seem to be robustly more accurate in their forecasts than many other people, and they’re robustly accurate across time. We’ve found other features of Superforecasters too, like, for example, working in groups really helps them.
So one question is whether we can identify good human judges, or we can train people to become, essentially, Superjudges. So why is this helpful? So, firstly, if we do this, we will be able to test how good human judges are, and we’ll see whether we can improve human judges. This means we’ll be able to try and find out whether humans are above the positive amplification threshold.
So, are ordinary human judges good enough to cause an amplification of their good features? One reason to learn this is that it improves the quality of the judging data that we can get. If people are just generally pretty good and rational at assessing debates, and fairly quick, then this is excellent given the amount of data that we anticipate needing. Basically, improvements to our data could be extremely valuable.
If we have good judges, positive amplification will be more likely during safety via debate, and it will also improve training outcomes on limited data, which is very important. This is one way of framing why I think social scientists are pretty valuable here, because there are lots of questions that we really do want answered when it comes to this project. I think this is going to be true of other projects, too, like asking humans questions. The human component of human feedback is quite important, and getting it right is something that we anticipate social scientists will be able to help with, more so than AI researchers who are not working with people, and their biases, and how rational they are, etc.
These are questions that are the focus of social sciences. So one question is, how skilled are people as judges by default? Can we distinguish good judges of debate from bad judges of debate? And if so, how? Does judging ability generalize across domains? Can we train people to be better judges? Can we engage in debiasing work, for example, or other work that reduces cognitive biases? What topics are people better or worse at judging? Are there ways of phrasing questions so that people are better at assessing them? Are there ways of structuring debates that make them easier to judge, or of restricting debates to make them easier to judge? So we’re often just showing people a small segment of a debate, for example. Can people work together to improve judging quality? These are all outstanding questions that we think are important, but we also think that they are empirical questions and that they have to be answered by experiment. So this is, I think, important potential future work.
We’ve been thinking a little bit about what you would want in experiments that try and assess judging ability in humans. So one thing you’d want is that there’s a verifiable answer: we need to be able to tell whether people are correct or not in their judgment of the debate. The other is that there is a plausible false answer, because if we can only train and assess human judging ability on debates where there’s no plausible false answer, we’d get a false signal that people are really good at judging debate. They could always get the true answer, but it would be because it was always a really obvious question, like, “Is it raining outside?”, where the person can just look outside. We don’t really want that kind of debate.
Ideally we want something where evidence is available, so that humans have something that grounds out the debate. We also don’t want debates to rely on human deception. So we don’t want things like tells in poker, for example, because AI agents are not going to have normal tells; it would be rather strange, I suppose, if they did, like if they stuttered or something.
Debaters also have to know more about the question than the judge, because the idea is that the AI agents will be much more capable, so you want there to be a big gap between debater capabilities and judge abilities. These things so far feel pretty essential.
There are also some other less essential things we’d like to have. So one is that biases are present: how good are humans when there’s bias with respect to the question? We’d like there to be representative segments of the debate that we can actually show people. The questions shouldn’t be too hard: it shouldn’t be impossible for humans to answer them, or judge debates about them. But they should also mirror some of the difficulties of statistical debate, i.e., debates about probabilities rather than about outright claims. And finally, we need to be able to get enough data.
One thing you might notice is that there are tensions between a lot of these desiderata. For example, the requirement that there’s a plausible false answer is in a bit of tension with the idea that the question isn’t too hard. There’s also tension between the question not being too hard and the question meriting statistical debate. Statistical debate is generally pretty hard for people to evaluate, I think, but it’s also quite important that we be able to model it. Debaters knowing more than judges and our being able to get enough data are also in tension: it’s just harder to train if we need debaters that know a lot more than judges, and it’s harder for judges to evaluate debates of this form.
Okay. So I’m going to show you a debate. This was a program we set up where we would show a judge a blank screen. So imagine you’re not seeing the dog that’s here. Two human debaters sit in the same room, and they have this picture of a dog in front of them. And one of them is selected to lie, and one of them is selected to tell the truth.
And what they can do here is select areas and describe to the human judge what they see in that area. All that the judge is going to see is their blank screen with the relevant section selected. And then they can each make claims about what is in that section. So here red is saying, it’s a dog, here’s its long, floppy ear. Blue is saying, no, here’s one of its pointy ears, trying to point to a smaller area where it looks kind of pointed. Red replies, that does look like an ear sloping to the right, but if it really were one, then part of the head would be here; instead there’s a brick. Blue says, the ear’s pointing out from behind the bricks. Red says, the dog is in front of the bricks; if it were behind them, there would be an edge here, but the rectangle is all the same color. And then you get a resignation, and red wins.
And at the end of the debate they can show just a single pixel. And the question was something like, if all you can do is have a debate and show a single pixel, can you get people to have accurate beliefs about the question? And basically we saw that, yes, debates were fairly good. In this kind of case, you might think that this is pretty synthetic. So one of the things that we’re thinking about now is expert debaters with a lay judge. And I’m going to show you something that we did that’s kind of fun, but I never know how it looks to outsiders.
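Before getting to that expert debate, here is a rough sketch of how the image game just described might be wired up. The structure is assumed from the description above (debaters see the image, the judge sees only claims about selected regions plus a single pixel revealed at the end), and the debater and judge objects are hypothetical.

```python
# Rough, assumed sketch of the image debate game: the judge never sees the
# image, only the debaters' claims about selected regions and one pixel
# revealed at the end. The debater and judge objects are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Claim:
    region: Tuple[int, int, int, int]  # (x, y, width, height) of a selected area
    text: str                          # e.g. "it's a dog, here's its long, floppy ear"


def run_image_debate(image, honest, liar, judge, num_turns: int = 6) -> bool:
    """Return True if the judge ends up siding with the honest debater."""
    claims: List[Claim] = []
    debaters = [honest, liar]
    for turn in range(num_turns):
        # Both debaters see the full image; the judge only ever sees claims.
        claims.append(debaters[turn % 2].make_claim(image, claims))
    # At the end a single pixel is revealed to help ground the debate.
    # (Exactly who picks it and when is a simplification here.)
    x, y = honest.choose_pixel(image, claims)
    verdict = judge.decide(claims, (x, y), image[y][x])
    return verdict == "honest"
```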
So, we had a debate that was of this form. This was a debate actually about quantum computing. So we had two people who both understand the domain; one of them was going to lie and one was going to tell the truth. So we had blue say, red’s algorithm is wrong because it increases alpha by an additive, exponentially small amount each step, so it takes exponentially many steps to get alpha high enough. So this was one of the claims made. And then you get this set of responses. I don’t think I need to go through all of them. You can see the basic form that they take.
We allowed certain restricted claims from Wikipedia. So, blue ends this with the first line of this Wikipedia article, which says that the sum of probabilities is conserved. Red says, an equal amount is subtracted from one amplitude and added to another, implying the sum of amplitudes is conserved. But probabilities are the squared magnitudes of amplitudes, so this is a contradiction. This is, I think, roughly how this debate ended. But you can imagine this as a really complex debate in a domain that the judges ideally just won’t understand, and might not even have some of the concepts for. And that’s the difficulty of debate that we’ve been looking at. This is something we’re in the early stages of prototyping; so far it seems that people actually do update in the right direction, but we don’t really have enough data to say for sure.
Okay. So I hope that I’ve given you an overview of some of the places, even if a restricted set of places, in which I think social scientists are going to be important in AI safety. Here we’re interested in experimental psychologists, cognitive scientists, and behavioral economists: people who might be interested in actually scaling up and running some of these experiments.
If you’re interested in this, please email me, because we would love to hear from you.
Questions
Question: How much of this is real currently? Do you have humans playing the role of the agents in these examples?
Amanda: The idea is that ultimately we want the debate to be conducted by AI, but we don’t have the language models that we would need for that yet. So we’re using humans as a proxy to test the judges in the meantime. So yeah, all of this is done with humans at the moment.
Question: So you’re faking the AI?
Amanda: Yeah.
Question: To set up the scenario to train and evaluate the judges?
Amanda: Yeah. And part of the idea, I guess, is that you don’t necessarily want all of this work to happen later. A lot of this work can be done before you even have the relevant capabilities, like having AI perform the debate. So that’s why we’re using humans for now.
Question: Jan Leike and his team have done some work on video games that very much matched the plots that you showed earlier, where up to a certain point the behavior matched the intended reward function, but at some point they diverged sharply as the AI agent found a loophole in the system. So that can happen even in, like, Atari games, which is what they’re working on. So obviously it gets a lot more complicated from there.
Amanda: Yeah.
Question: In this approach, you would train both the debating agents and the judges. So in that case, who evaluates the judges and based on what?
Amanda: Yeah, so ideally we want to identify how good the judges are in advance, because later on it might be hard to assess. While judges are judging debates with verifiable answers, you can evaluate them more easily.
So ideally, you want it to be the case that at training time, you’ve already identified judges that are fairly good. And so ideally this part of this project is to assess how good judges are, prior to training. And then during training you’re giving the feedback to the debaters. So yeah, ideally some of the evaluation can be kind of front loaded, which is what a lot of this project would be.
Question: Yeah, that does seem necessary. As a casual Facebook user, I think the negative amplification is more prominently on display oftentimes.
Amanda: Or at least more concerning to people, yeah, as a possibility.
Question: How will you crowdsource the millions of human interactions that are needed to train AI across so many different domains, without falling victim to trolls, lowest common denominator, etc.? The questioner cites the Microsoft Tay chatbot, which went dark very quickly.
Amanda: Yeah. So the idea is you’re not going to just be sourcing this from just anyone. So if you identify people that are either good judges already, or you can train people to be good judges, these are going to be the pool of people that you’re using to get this feedback from. So, even if you’ve got a huge number of interactions, ideally you’re sourcing and training people to be really good at this. And so you’re not just being like, “Hey internet, what do you think of this debate?” But rather like, okay, we’ve got this set of really great trained judges and we’ve identified this wonderful mechanism to train them to be good at this task. And then you’re getting lots of feedback from that large pool of judges. So it’s not sourced to anonymous people everywhere. Rather, you’re interacting fairly closely with a vetted set of people.
Question: But at some point, you do have to scale this out, right? I mean in the bike example, it’s like, there’s so many bikes in the world, and so many local hills-
Amanda: Yeah.
Question: So, do you feel like you can get a solid enough base, such that it’s not a problem?
Amanda: Yeah, I think there’s going to be a trade-off where you need a lot of data, but ultimately if it’s not great, if it’s really biased, for example, it’s not clear that that additional data is going to be helpful. So if you get someone who is just massively cognitively biased, or biased against groups of people, or just dishonest in their judgment, it’s not going to be good to get that additional data.
So you kind of want to scale it to the point where you know you’re still getting good information back from the judges. And that’s why I think in part this project is really important, because one thing that social scientists can help us with is identifying how good people are. So if you know that people are just generally fairly good, this gives you a bigger pool of people that you can appeal to. And if you know that you can train people to be really good, then this is like, again, a bigger pool of people that you can appeal to.
So yeah, you do want to scale, but you want to scale within the limits of still getting good information from people. And so ideally these experiments would do a mix of letting us know how much we can scale, and also maybe helping us to scale even more by making people better at this quite unusual task of judging these kinds of debates.
Question: How does your background as a philosopher inform the work that you’re doing here?
Amanda: I have a background primarily in formal ethics, which I think makes me sensitive to some of the issues that we might be worried about here going forward. People in that area think about things like aggregating judgment, for example. I’ve also found that having a background in things like philosophy of science can be weirdly helpful when it comes to thinking about experiments to run.
But for the most part, I think that my work has just been to help prototype some of this stuff. I see the importance of it. I’m able to foresee some of the worries that people might have. But for the most part I think we should just try some of this stuff. And I think that for that, it’s really important to have people with experimental backgrounds in particular, so the ability to run experiments and analyze that data. And so that’s why I would like to find people who are interested in doing that.
So I’d say philosophy’s pretty useful for some things, but less useful for running social science experiments than you may think.