Amanda Askell: AI Safety Needs Social Scientists

When an AI wins a game against a human, that AI has usually trained by playing that game against itself millions of times. When an AI recognizes that an image contains a cat, it's probably been trained on thousands of cat photos. So if we want to teach an AI about human preferences, we'll probably need lots of data to train it. And who is most qualified to provide data about human preferences? Social scientists! In this talk from EA Global 2018: London, Amanda Askell explores ways that social science might help us steer advanced AI in the right direction.

A transcript of Amanda's talk is below, which CEA has lightly edited for clarity. You can also read this talk on effectivealtruism.org, or watch it on YouTube.

The Talk

Here's an overview of what I'm going to be talking about today. First, I'm going to talk a little bit about why learning human values is difficult for AI systems. Then I'm going to explain the safety via debate method, which is one of the methods that OpenAI is currently exploring for helping AI to robustly do what humans want. And then I'm going to talk a little bit more about why I think this is relevant to social scientists, and why I think social scientists, in particular experimental psychologists and behavioral scientists, can really help with this project. And I'll give you more detail about how they can help towards the end of the talk.

Learning human values is difficult. We want to train AI systems to robustly do what humans want. In the first instance, we can just imagine this being what one person wants, and then ideally we can expand it to doing what most people would consider good and valuable. But human values are very difficult to specify, especially with the kind of precision that is required by something like a machine learning system. And I think it's really important to emphasize that this is true even in cases where there's moral consensus, or consensus about what people want in a given instance.

So, take a principle like "do not harm someone needlessly." We can be really tempted to think something like: "I've got a computer, so I can just write into the computer, 'do not harm someone needlessly'." But this is a really underspecified principle. Most humans know exactly what it means; they know exactly when harming someone is needless. If you're shaking someone's hand and you push them over, we think that's needless harm. But if you see someone in the street who's about to be hit by a car, and you push them to the ground, we think that's not an instance of needless harm.

Humans have a pretty good way of knowing when this principle applies and when it doesn't. But for a formal system, there are going to be a lot of questions about precisely what's going on here. One question the system may ask is: how do I recognize when someone is being harmed? It's very easy for us to see things like stop signs, but when we're building self-driving cars, we don't just program in something like "stop at stop sign". We instead have to train them to be able to recognize an instance of a stop sign.

And the principle that says you shouldn't harm someone needlessly assumes that we understand when harm is and isn't appropriate, whereas there are a lot of questions under the surface, like: when is harm justified? What is the rule for all plausible scenarios in which I might find myself? These are things you need to specify if you want your system to work in all of the cases you want it to work in.

I think this is an important point to internalize. It's easy for humans to identify and pick up, say, a glass. But training an ML system to perform the same task requires a lot of data. This is true of a lot of tasks that humans might intuitively think are easy, and we shouldn't just transfer that intuition to the case of machine learning systems. So when we're trying to teach human values to an AI system, we're not just looking at edge cases, like trolley problems. We're really looking at core cases of making sure that our ML systems understand what humans want, in the everyday sense.

There are many approaches to training an AI to do what humans want. One way is through human feedback. You might think that humans could, say, demonstrate a desired behavior for an AI to replicate. But some behaviors are just too difficult for humans to demonstrate. So you might think that instead a human can say whether they approve or disapprove of a given behavior, but this might not work too well, either. When we learn what humans want this way, we have a reward function as predicted by the human. On this graph, we have that plotted against AI strength. And when AI strength reaches the superhuman level, it becomes really hard for humans to give the right reward function.

As AI capabilities surpass the human level, the decisions and behavior of the AI system just might be too complex for the human to judge. So imagine agents that control, say, a large set of industrial robots. I just might not be able to evaluate whether these robots were doing a good job overall; it'd be extremely difficult for me to do so.

And so the concern is that when behavior becomes much more complex and much more large scale, it becomes really hard for humans to judge whether an AI agent is doing a good job. And that's why you may expect this drop-off. This is a kind of scalability worry about human feedback. What ideally needs to happen instead is that, as AI strength increases, the reward function as predicted by the human is also able to keep pace.

So how do we achieve this? One of the things that we want to do here is to try to break down complex questions and complex tasks into simpler components: to break something like having all of these industrial robots perform a complex set of functions that comes together to make something useful into some smaller set of tasks and components that humans are able to judge.

So here is a big question. The idea is that the overall tree might be too hard for humans to fully check, but it can be decomposed into these elements, such that at the very bottom level, humans can check these things.

So "how should a large set of industrial robots be organized to do task x?" would be an example of a big question involving a really complex task, but some things within it are checkable by humans. We could decompose the task so that we're asking a human: if one of the robots performs this small action, will the result be this small outcome? That's something humans can check.

So that's an example in the case of industrial robots accomplishing some task. In the case of doing what humans want more generally, the big question is: what do humans want?

A much smaller question, if you can manage to decompose this, is something like: is it better to save 20 minutes of someone's time, or to save 10 minutes of their time? If you imagine some AI agent that's meant to assist humans, this is a fact that we can definitely check. Even though I can't tell my assistant AI exactly everything that I want, I can tell it that I'd rather it save 20 minutes of my time than 10 minutes of my time.
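To make the decomposition idea concrete, here is a minimal Python sketch, with all names and the toy question structure being my own illustration rather than anything from the talk: a big question is answered by recursing into subquestions until every leaf is small enough for a human to check directly.

```python
# Toy sketch of question decomposition: a big question is split into
# subquestions until each leaf is simple enough for a human to check.
# The decomposition rule and data shapes are illustrative only.

def decompose(question):
    """Split a question into subquestions; a leaf has none."""
    return question.get("subquestions", [])

def human_can_check(question):
    """Stand-in for a human judge: only atomic questions are checkable."""
    return not question.get("subquestions")

def answer(question):
    """Answer a question by recursing until every leaf is human-checkable."""
    if human_can_check(question):
        return question["answer"]  # the human's directly checked verdict
    # Combine subanswers; here we simply require every leaf to check out.
    return all(answer(sub) for sub in decompose(question))

# A big question decomposed into two human-checkable leaves.
big_question = {
    "text": "Will this robot plan produce the intended outcome?",
    "subquestions": [
        {"text": "Does action A yield outcome A?", "answer": True},
        {"text": "Does action B yield outcome B?", "answer": True},
    ],
}
print(answer(big_question))  # True
```

The combination rule at internal nodes (here a bare `all`) is the part a real system would have to learn or specify; the sketch only shows the shape of the tree.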

One of the key issues is that, with current ML systems, we need to train on a lot of data from humans. So if we want humans to actually give this kind of feedback on these kinds of ground-level claims or questions, then we're going to have to train on a lot of data from people.

To give some examples: simple image classifiers train on thousands of images. These are ones you can make yourself, and you'll see the datasets are pretty large. AlphaGo Zero played nearly 5 million games of Go during its training. OpenAI Five trains on 180 years of Dota 2 games per day. So this gives you a sense of how much data you need to train these systems. If we are using current ML techniques to teach AI human values, we can't rule out needing millions to tens of millions of short interactions from humans as training data.
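To get a feel for what "tens of millions of short interactions" means in judge time, here is a back-of-envelope calculation; the thirty-seconds-per-interaction figure is my own assumption for illustration, not a number from the talk.

```python
# Back-of-envelope: how much human judge time would ten million short
# interactions take? The per-interaction time is an assumption made
# for illustration, not a figure from the talk.

SECONDS_PER_INTERACTION = 30          # assumed length of one judgment
INTERACTIONS = 10_000_000             # middle of the quoted range

total_hours = INTERACTIONS * SECONDS_PER_INTERACTION / 3600
person_years = total_hours / (40 * 50)  # one judge: 40 h/week, 50 weeks/year

print(round(total_hours))   # 83333 judge-hours
print(round(person_years))  # 42 full-time judge-years
```

Even under these generous assumptions the total is dozens of full-time person-years, which is why the quality and trainability of the judge pool matters so much.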

Earlier I talked about human feedback, where I was assuming that we were asking humans questions. We could just ask humans really simple things like: do you prefer to eat an omelette or 1,000 hot dogs? Or: is it better to provide medicine or books to this particular family? One way we might get more information from the data we're able to gather is by finding the reasons that humans have for the answers they give. If you manage to learn that humans generally prefer to eat a certain amount per meal, you can rule out a large class of questions you might ever want to ask people. You're never going to ask them whether they prefer to eat an omelette or 1,000 hot dogs, because you know that humans just generally don't like to eat 1,000 hot dogs in one meal, except in very strange circumstances.

And we also know facts like: humans prioritize necessary health care over mild entertainment. This might mean that, if you see a family that is desperately in need of some medicine, you just know that you're not going to ask, "Hey, should I provide them with an entertaining book, or this essential medicine?" So there's a sense in which identifying the reasons humans are giving for their answers lets you go beyond the data and learn faster what they're going to say in a given circumstance about what they want. It's not that you couldn't learn the same things by just asking people questions, but rather that if you can find a quicker way to identify reasons, this could be much more scalable.

Debate is a proposed method, which is currently being explored, for trying to learn human reasons. To give a definition of a debate here: the idea is that two AI agents are given a question, they take turns making short statements, and a human judge at the end chooses which of the statements gave them the most true, valuable information. It's worth noting that this is quite dissimilar from a lot of human debates. In human debates, people might give one answer, but then adjust their answer over the course of the debate. Or they might debate with each other in a way that's more exploratory: they're gaining information from each other, updating on it, and feeding that back into the debate.

With AI debates, you're not doing it for information value, so it's not going to have the same exploratory component. Instead, you would hopefully see the agents explore a path kind of like this.

So imagine I want my AI agents to decide which bike I should buy. I don't want to have to go and look up all the Amazon reviews, etc. In a debate, I might get something like "You should buy the red road bike" from the first agent. Suppose that blue disagrees, and says, "You should buy the blue fixie." Then red says, "The red road bike is easier to ride on local hills." And one of the key things to suppose here is that, for me, being able to ride on the local hills is very important. It may even overwhelm all other considerations. So even if the blue fixie is cheaper by $100, I'd be happy to pay the extra $100 in order to be able to ride on local hills.

And if this is the case, then there's basically nothing true that the other agent can point to in order to convince me to buy the blue fixie, and blue should just say, "I concede." Now, blue could have lied, for example, but if we assume that red is able to point out blue's lies, we should just expect blue to lose this debate. And if blue has explored enough and attempted enough debates, it might just see that, and say, "Yes, you've identified the key reason; I concede."

And so it's important to note that we can imagine this being used to identify multiple reasons, but here it has identified a really important reason for me, something that is in fact going to be really compelling in the debate: namely, that the red road bike is easier to ride on local hills.

Okay. So, training an AI to debate looks something like this. Imagine Alice and Bob are our two debaters, and each of these nodes is a statement made by one of the agents. You're going to see exploration of the tree. The first path might be this one, and here, say, the human decides that Bob won. This is another node, and another node. And so this is the exploration of the debate tree. You end up with a debate tree that looks a little bit like a game of Go.

When an AI trains to play Go, it's exploring lots of different paths down the tree, and then there's a win or loss condition at the end, which is its feedback. That's basically how it learns to play. With debate, you can imagine the same thing, but where you're exploring a large tree of debates, and humans are assessing whether you win or not. This is just a way of training up AI to get better at debate and to eventually identify reasons that humans find compelling.
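The debate setup described above can be sketched as a tiny game loop. This is a toy stand-in of my own, not OpenAI's implementation: the agents and the judge are trivial placeholders, and the "honest" flag stands in for a human judge checking claims against evidence.

```python
# Minimal, illustrative sketch of the debate game: two agents alternate
# short statements on a question, then a judge picks a winner from the
# finished transcript. All components here are toy stand-ins.

def debate(agent_a, agent_b, question, rounds=4):
    """Run one debate: the two agents alternate statements."""
    transcript = []
    for i in range(rounds):
        speaker, agent = [("A", agent_a), ("B", agent_b)][i % 2]
        transcript.append((speaker, agent(question, transcript)))
    return transcript

def judge(transcript):
    """Toy judge standing in for a trained human: it rewards the side
    whose statements check out (modeled here as a simple flag)."""
    scores = {"A": 0, "B": 0}
    for speaker, statement in transcript:
        if statement["honest"]:
            scores[speaker] += 1
    return "A" if scores["A"] >= scores["B"] else "B"

# Toy debaters: one tells the truth, one lies.
honest = lambda q, t: {"text": "a checkable true claim", "honest": True}
liar = lambda q, t: {"text": "a refutable false claim", "honest": False}

print(judge(debate(honest, liar, "Which bike should I buy?")))  # A
```

In the real proposal, the judge's verdict would serve as the win/loss signal for training the debaters, just as the win condition at the end of a Go game trains a Go agent.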

One thesis here that I think is relatively important is something I'll call the positive amplification thesis, or positive amplification threshold. One thing that seems fairly possible is that if humans are above some threshold of rationality and goodness, then debate is going to amplify their positive aspects. This is speculative, but it's a hypothesis that we're working with. The idea here is that, if I am somewhat irrational but pretty well motivated, I might get some feedback of the form: "Actually, that decision that you made was fairly biased, and I know that you don't like to be biased, so I want to inform you of that."

I get informed of that, and I'm like, "Yes, that's right. Actually, I don't want to be biased in that respect." Suppose the feedback comes from Kahneman and Tversky, and they point out some key cognitive bias that I have. If I'm rational enough, I might say, "Yes, I want to adjust that." And I feed a new signal back in that has been improved by virtue of this process. So if we're somewhat rational, we can imagine a situation in which all of these positive aspects of us are being amplified through this process.

But you can also imagine a negative amplification. If people are below this threshold of rationality and goodness, we might worry that debate would amplify these negative aspects. If it turns out we can just be really convinced by appeals to our worst natures, and your system learns to do that, then it could just feed that back in, making us even less rational and more biased, and so on. So this is an important hypothesis related to work on amplification, which, if you're interested in it, is great and I suggest you take a look at, but I'm not going to focus on it here.
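The threshold dynamic can be illustrated with a toy numerical model of my own devising (not a model from the talk): treat judge quality as a number in [0, 1], and the feedback loop as repeatedly nudging quality up when it starts above a threshold and down when it starts below.

```python
# Toy illustration of the amplification threshold: judge quality
# drifts upward each iteration if it is above the threshold, and
# downward if below. The threshold, rate, and dynamics are invented
# purely to show the qualitative claim, not measured quantities.

def amplify(quality, threshold=0.5, rate=0.1, steps=20):
    """Iterate the feedback loop, clipping quality to [0, 1]."""
    for _ in range(steps):
        drift = rate if quality > threshold else -rate
        quality = min(1.0, max(0.0, quality + drift))
    return quality

print(amplify(0.6))  # 1.0  (above threshold: positive amplification)
print(amplify(0.4))  # 0.0  (below threshold: negative amplification)
```

The point of the sketch is only that small initial differences on either side of the threshold compound under iteration, which is why measuring where real judges sit relative to the threshold matters.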

Okay. So how can social scientists help with this whole project? Hopefully I've conveyed some of what I think of as the real importance of the project. It reminds me a little bit of Tetlock's work on superforecasters. Social scientists have done a lot of work identifying people who are superforecasters: people who seem to be robustly more accurate in their forecasts than many other people, and robustly accurate across time. We've found other features of superforecasters too; for example, working in groups really helps them.

So one question is whether we can identify good human judges, or train people to become, essentially, superjudges. Why is this helpful? Firstly, if we do this, we will be able to test how good human judges are, and to see whether we can improve them. This means we'll be able to try to find out whether humans are above the positive amplification threshold.

So, are ordinary human judges good enough to cause an amplification of their good features? One reason to learn this is that it improves the quality of the judging data that we can get. If people are generally pretty good and rational at assessing debates, and fairly quick, then this is excellent, given the amount of data that we anticipate needing. Basically, improvements to our data could be extremely valuable.

If we have good judges, positive amplification will be more likely during safety via debate, and training outcomes on limited data will improve, which is very important. This is one way of framing why I think social scientists are pretty valuable here: there are lots of questions that we really do want answered when it comes to this project. I think this is going to be true of other projects, too, like asking humans questions. The human component of human feedback is quite important, and getting it right is quite important. And that's something that we anticipate social scientists will be able to help with, more so than AI researchers, who generally aren't working with people, their biases, how rational they are, and so on.

These are questions that are the focus of the social sciences. One question is: how skilled are people as judges by default? Can we distinguish good judges of debate from bad judges of debate? And if so, how? Does judging ability generalize across domains? Can we train people to be better judges? For example, can we engage in debiasing work, or work that reduces cognitive biases? What topics are people better or worse at judging? Are there ways of phrasing questions so that people are better at assessing them? Are there ways of structuring debates that make them easier to judge, or of restricting debates to make them easier to judge? (We're often just showing people a small segment of a debate, for example.) Can people work together to improve judging quality? These are all outstanding questions that we think are important, but we also think that they are empirical questions that have to be answered by experiment. So this is, I think, important potential future work.

We've been thinking a little bit about what you would want in experiments that try to assess judging ability in humans. One thing you'd want is a verifiable answer: we need to be able to tell whether people are correct or not in their judgment of the debate. Another is that there is a plausible false answer, because if we can only train and assess human judging ability on debates where there's no plausible false answer, we'd get a false signal that people are really good at judging debates. They could always get the true answer, but only because the question was always really obvious. Like, "Is it raining outside?", where the person can just look outside. We don't really want that kind of debate.

Ideally we want questions where evidence is available, so that humans have something that grounds out the debate. We also don't want debates to rely on human deception. Things like tells in poker, for example: we really don't want that, because AI agents are not going to have normal tells. It would be rather strange, I suppose, if they did, like if they stuttered or something.

Debaters also have to know more about the question than the judge, because the idea is that the AI agents will be much more capable, so you want a big gap between debater capabilities and judge abilities. These things so far feel pretty essential.

There are also some other, less essential things we'd like to have. One is that biases are present: how good are humans when there's bias with respect to the question? We'd like there to be representative segments of the debate that we can actually show people. The questions shouldn't be too hard: it shouldn't be impossible for humans to answer them, or to judge debates about them. But they should also mirror some of the difficulties of statistical debate, i.e., debate about probabilities rather than about outright claims. And finally, we need to be able to get enough data.

One thing you might notice is that there are tensions between a lot of these desiderata. For example, the requirement that there be a plausible false answer is in some tension with the idea that the question isn't too hard. There's also tension between the question not being too hard and the question meriting statistical debate. Statistical debate is generally pretty hard for people to evaluate, I think, but it's also quite important that we be able to model it. Debaters knowing more and our being able to get enough data are also in tension: it's just harder to train if we need debaters who know a lot more than judges, and it's harder for judges to evaluate debates of this form.

Okay. So I'm going to show you a debate. This was a program we set up where we would show a judge a blank screen. So imagine you're not seeing the dog that's here. Two human debaters sit in the same room, and they have this picture of a dog in front of them. One of them is selected to lie, and one of them is selected to tell the truth.

What the debaters can do here is select areas and describe to the human judge what they see in that area. All the judge sees is a blank screen with the relevant section selected, and then the debaters can each make claims about what is in that section. So here red is saying: it's a dog, and here's its long, floppy ear. Blue is saying: no, here's one of its pointy ears, trying to point to a smaller area where it looks kind of pointed. Red: that does look like an ear sloping to the right, but if it really were one, part of the head would be here; instead there's a brick. Blue: the ear's pointing out from behind the bricks. Red: the dog is in front of the bricks; if the ear were behind them, there would be an edge here, but the rectangle is all the same color. And then you get a resignation, and red wins.

At the end of the debate the debaters can reveal just a single pixel. And the question was something like: if all you can do is have a debate and show a single pixel, can you get people to have accurate beliefs about the question? And basically we saw that yes, debates were fairly good. You might think that this kind of case is pretty synthetic, though. So one of the things that we're thinking about now is expert debaters with a lay judge. And I'm going to show you something that we did that's kind of fun, but I never know how it looks to outsiders.

So, we had a debate of this form. This one was actually about quantum computing. We had two people who understand the domain; one of them was going to lie and one was going to tell the truth. Blue said: red's algorithm is wrong because it increases alpha by an additive, exponentially small amount each step, so it takes exponentially many steps to get alpha high enough. That was one of the claims made. And then you get this set of responses. I don't think I need to go through all of them; you can see the basic form that they take.

We allowed certain restricted claims from Wikipedia. So blue ends this with the first line of a Wikipedia article, which says that the sum of probabilities is conserved. Red says: an equal amount is subtracted from one amplitude and added to another, implying the sum of amplitudes is conserved. But probabilities are the squared magnitudes of amplitudes, so this is a contradiction. That, I think, is roughly how the debate ended. But you can imagine this as a really complex debate in a domain that the judges just won't understand, and might not even have some of the concepts for. And that's the difficulty of debate that we've been looking at. This is something we're in the early stages of prototyping; so far it seems to be the case that people actually do update in the right direction, but we don't really have enough data to say for sure.

Okay. So I hope that I've given you an overview of places, even if a restricted set of places, in which I think social scientists are going to be important in AI safety. Here we're interested in experimental psychologists, cognitive scientists, and behavioral economists: people who might be interested in actually scaling up and running some of these experiments.

If you're interested in this, please email me, because we would love to hear from you.


Question: How much of this is real currently? Do you have humans playing the role of the agents in these examples?

Amanda: The idea is that ultimately the debate will be conducted by AI, but we don't have the language models that we would need for that yet. So we're using humans as a proxy to test the judges in the meantime. So yeah, all of this is done with humans at the moment.

Question: So you're faking the AI?

Amanda: Yeah.

Question: To set up the scenario to train and evaluate the judges?

Amanda: Yeah. And part of the idea, I guess, is that you don't necessarily want all of this work to happen later. A lot of this work can be done before you even have the relevant capabilities, like having AI perform the debate. So that's why we're using humans for now.

Question: Jan Leike and his team have done some work on video games that very much matched the plots you showed earlier, where up to a certain point the behavior matched the intended reward function, but at some point they diverge sharply as the AI agent finds a loophole in the system. So that can happen even in, like, Atari games, which is what they're working on. And obviously it gets a lot more complicated from there.

Amanda: Yeah.

Question: In this approach, you would train both the debating agents and the judges. So in that case, who evaluates the judges, and based on what?

Amanda: Yeah, so I think the interesting thing is that we want to identify how good the judges are in advance, because during training it might be hard to assess them. While you're judging debates with verifiable answers, you can evaluate the judges more easily.

So ideally, you want it to be the case that at training time, you've already identified judges that are fairly good. Part of this project is to assess how good judges are prior to training; then, during training, the judges are giving the feedback to the debaters. So yeah, ideally some of the evaluation can be front-loaded, which is what a lot of this project would be.

Question: Yeah, that does seem necessary. As a casual Facebook user, I think the negative amplification is more prominently on display oftentimes.

Amanda: Or at least more concerning to people, yeah, as a possibility.

Question: How will you crowdsource the millions of human interactions that are needed to train AI across so many different domains, without falling victim to trolls, the lowest common denominator, etc.? The questioner cites the Microsoft Tay chatbot, which went dark very quickly.

Amanda: Yeah. So the idea is that you're not going to source this from just anyone. If you identify people who are already good judges, or you can train people to be good judges, they are going to be the pool of people you're getting this feedback from. So even if you've got a huge number of interactions, ideally you're sourcing and training people to be really good at this. You're not just saying, "Hey internet, what do you think of this debate?" Rather, it's: okay, we've got this set of really great trained judges, we've identified this wonderful mechanism to train them to be good at this task, and now we're getting lots of feedback from that large pool of judges. So it's not sourced from anonymous people everywhere. Rather, you're interacting fairly closely with a vetted set of people.

Question: But at some point, you do have to scale this out, right? I mean, in the bike example, there are so many bikes in the world, and so many local hills...

Amanda: Yeah.

Question: So, do you feel like you can get a solid enough base that it's not a problem?

Amanda: Yeah, I think there's going to be a trade-off, where you need a lot of data, but if the data isn't great (if it's really biased, for example), it's not clear that additional data is going to be helpful. If you get someone who is massively cognitively biased, or biased against groups of people, or just dishonest in their judgment, it's not going to be good to get that additional data.

So you want to scale to the point where you know you're still getting good information back from the judges. And that's partly why I think this project is really important: one thing that social scientists can help us with is identifying how good people are. If you know that people are generally fairly good, this gives you a bigger pool of people that you can appeal to. And if you know that you can train people to be really good, then this again gives you a bigger pool of people that you can appeal to.

So yeah, you do want to scale, but you want to scale within the limits of still getting good information from people. And ideally these experiments would do a mix of letting us know how much we can scale, and maybe helping us scale even more by making people better at this quite unusual task of judging these kinds of debates.

Question: How does your background as a philosopher inform the work that you're doing here?

Amanda: I have a background primarily in formal ethics, which I think makes me sensitive to some of the issues that we might be worried about going forward; people think about things like aggregating judgment, for example. Strangely, I've found that a background in things like philosophy of science can be weirdly helpful when it comes to thinking about experiments to run.

But for the most part, I think my work has just been to help prototype some of this stuff. I see the importance of it, and I'm able to foresee some of the worries that people might have. But for the most part I think we should just try some of this stuff. And for that, it's really important to have people with experimental backgrounds in particular: the ability to run experiments and analyze the data. That's why I would like to find people who are interested in doing that.

So I'd say philosophy is pretty useful for some things, but less useful for running social science experiments than you might think.
