A conversation with Rohin Shah


  • Ro­hin Shah — PhD stu­dent at the Cen­ter for Hu­man-Com­pat­i­ble AI, UC Berkeley

  • Asya Ber­gal – AI Impacts

  • Robert Long – AI Impacts

  • Sara Hax­hia — In­de­pen­dent researcher


We spoke with Ro­hin Shah on Au­gust 6, 2019. Here is a brief sum­mary of that con­ver­sa­tion:

  • Before taking into account other researchers’ opinions, Shah guesses an extremely rough ~90% chance that even without any additional intervention from current longtermists, advanced AI systems will not cause human extinction by adversarially optimizing against humans. He gives the following reasons, ordered by how heavily they weigh in his consideration:

    • Grad­ual de­vel­op­ment and take-off of AI sys­tems is likely to al­low for cor­rect­ing the AI sys­tem on­line, and AI re­searchers will in fact cor­rect safety is­sues rather than hack­ing around them and re­de­ploy­ing.

      • Shah thinks that in­sti­tu­tions de­vel­op­ing AI are likely to be care­ful be­cause hu­man ex­tinc­tion would be just as bad for them as for ev­ery­one else.

    • As AI sys­tems get more pow­er­ful, they will likely be­come more in­ter­pretable and eas­ier to un­der­stand be­cause they will use fea­tures that hu­mans also tend to use.

    • Many ar­gu­ments for AI risk go through an in­tu­ition that AI sys­tems can be de­com­posed into an ob­jec­tive func­tion and a world model, and Shah thinks this isn’t likely to be a good way to model fu­ture AI sys­tems.

  • Shah be­lieves that con­di­tional on mis­al­igned AI lead­ing to ex­tinc­tion, it al­most cer­tainly goes through de­cep­tion.

  • Shah very uncertainly guesses that there’s a ~50% chance that we will get AGI within two decades:

    • He gives a ~30% – 40% chance that it will be via es­sen­tially cur­rent tech­niques.

    • He gives a ~70% chance that, conditional on the two previous claims, it will be a mesa optimizer.

    • Shah’s model for how we get to AGI soon has the fol­low­ing fea­tures:

      • AI will be trained on a huge variety of tasks, addressing the usual difficulty of generalization in ML systems.

      • AI will learn the same kinds of use­ful fea­tures that hu­mans have learned.

      • This pro­cess of re­search and train­ing the AI will mimic the ways that evolu­tion pro­duced hu­mans who learn.

      • Gra­di­ent de­scent is sim­ple and in­effi­cient, so in or­der to do so­phis­ti­cated learn­ing, the outer op­ti­miza­tion al­gorithm used in train­ing will have to pro­duce a mesa op­ti­mizer.

  • Shah is skep­ti­cal of more ‘na­tivist’ the­o­ries where hu­man ba­bies are born with a lot of in­duc­tive bi­ases, rather than learn­ing al­most ev­ery­thing from their ex­pe­riences in the world.

  • Shah thinks there are sev­eral things that could change his be­liefs, in­clud­ing:

    • If he learned that evolu­tion ac­tu­ally baked a lot into hu­mans (‘na­tivism’), he would lengthen the amount of time he thinks there will be be­fore AGI.

    • In­for­ma­tion from his­tor­i­cal case stud­ies or analy­ses of AI re­searchers could change his mind around how the AI com­mu­nity would by de­fault han­dle prob­lems that arise.

    • Hav­ing a bet­ter un­der­stand­ing of the dis­agree­ments he has with MIRI:

      • Shah be­lieves that slow take­off is much more likely than fast take­off.

      • Shah doesn’t be­lieve that any suffi­ciently pow­er­ful AI sys­tem will look like an ex­pected util­ity max­i­mizer.

      • Shah believes less in crisp formalizations of intelligence than MIRI does.

      • Shah has more faith in AI re­searchers fix­ing prob­lems as they come up.

      • Shah has less faith than MIRI in our abil­ity to write proofs of the safety of our AI sys­tems.

This tran­script has been lightly ed­ited for con­ci­sion and clar­ity.


Asya Ber­gal: We haven’t re­ally planned out how we’re go­ing to talk to peo­ple in gen­eral, so if any of these ques­tions seem bad or not use­ful, just give us feed­back. I think we’re par­tic­u­larly in­ter­ested in skep­ti­cism ar­gu­ments, or safe by de­fault style ar­gu­ments– I wasn’t sure from our con­ver­sa­tion whether you par­tially en­dorse that, or you just are fa­mil­iar with the ar­gu­men­ta­tion style and think you could give it well or some­thing like that.

Ro­hin Shah: I think I par­tially en­dorse it.

Asya Ber­gal: Okay, great. If you can, it would be use­ful if you gave us the short ver­sion of your take on the AI risk ar­gu­ment and the place where you feel you and peo­ple who are more con­vinced of things dis­agree. Does that make sense?

Robert Long: Just to clar­ify, maybe for my own… What’s ‘con­vinced of things’? I’m think­ing of the tar­get propo­si­tion as some­thing like “it’s ex­tremely high value for peo­ple to be do­ing work that aims to make AGI more safe or benefi­cial”.

Asya Ber­gal: Even that state­ment seems a lit­tle im­pre­cise be­cause I think peo­ple have differ­ing opinions about what the high value work is. But that seems like ap­prox­i­mately the right propo­si­tion.

Ro­hin Shah: Okay. So there are some very ob­vi­ous ones which are not the ones that I en­dorse, but things like, do you be­lieve in longter­mism? Do you buy into the to­tal view of pop­u­la­tion ethics? And if your an­swer is no, and you take a more stan­dard ver­sion, you’re go­ing to dras­ti­cally re­duce how much you care about AI safety. But let’s see, the ones that I would en­dorse-

Robert Long: Maybe we should work on this set of ques­tions. I think this will only come up with peo­ple who are into ra­tio­nal­ism. I think we’re pri­mar­ily fo­cused just on em­piri­cal sources of dis­agree­ment, whereas these would be eth­i­cal.

Ro­hin Shah: Yup.

Robert Long: Which again, you’re com­pletely right to men­tion these things.

Ro­hin Shah: So, there’s… okay. The first one I had listed is that con­tinual or grad­ual or slow take­off, what­ever you want to call it, al­lows you to cor­rect the AI sys­tem on­line. And also it means that AI sys­tems are likely to fail in not ex­tinc­tion-level ways be­fore they fail in ex­tinc­tion-level ways, and pre­sum­ably we will learn from that and not just hack around it and fix it and re­de­ploy it. I think I feel fairly con­fi­dent that there are sev­eral peo­ple who will dis­agree with ex­actly the last thing I said, which is that peo­ple won’t just hack around it and de­ploy it– like fix the sur­face-level prob­lem and then just re­de­ploy it and hope that ev­ery­thing’s fine.

I am not sure what drives the differ­ence be­tween those in­tu­itions. I think they would point to neu­ral ar­chi­tec­ture search and things like that as ex­am­ples of, “Let’s just throw com­pute at the prob­lem and let the com­pute figure out a bunch of heuris­tics that seem to work.” And I would point at, “Look, we no­ticed that… or, some­one no­ticed that AI sys­tems are not par­tic­u­larly fair and now there’s just a ton of re­search into fair­ness.”

And it’s true that we didn’t stop de­ploy­ing AI sys­tems be­cause of fair­ness con­cerns, but I think that is ac­tu­ally just the cor­rect de­ci­sion from a so­cietal per­spec­tive. The benefits from AI sys­tems are in fact– they do in fact out­weigh the cons of them not be­ing fair, and so it doesn’t re­quire you to not de­ploy the AI sys­tem while it’s be­ing fixed.

Asya Ber­gal: That makes sense. I feel like an­other com­mon thing, which is not just “hack around and fix it”, is that peo­ple think that it will fail in ways that we don’t rec­og­nize and then we’ll re­de­ploy some big­ger cooler ver­sion of it that will be de­cep­tively al­igned (or what­ever the prob­lem is). How do you feel about ar­gu­ments of that form: that we just won’t re­al­ize all the ways in which the thing is bad?

Ro­hin Shah: So I’m think­ing: the AI sys­tem tries to de­ceive us, so I guess the ar­gu­ment would be, we don’t re­al­ize that the AI sys­tem was try­ing to de­ceive us and in­stead we’re like, “Oh, the AI sys­tem just failed be­cause it was off dis­tri­bu­tion or some­thing.”

It seems strange that we wouldn’t see an AI sys­tem de­liber­ately hide in­for­ma­tion from us. And then we look at this and we’re like, “Why the hell didn’t this in­for­ma­tion come up? This seems like a clear prob­lem.” And then do some sort of in­ves­ti­ga­tion into this.

I sup­pose it’s pos­si­ble we wouldn’t be able to tell it’s in­ten­tion­ally do­ing this be­cause it thinks it could get bet­ter re­ward by do­ing so. But that doesn’t… I mean, I don’t have a par­tic­u­lar ar­gu­ment why that couldn’t hap­pen but it doesn’t feel like…

Asya Ber­gal: Yeah, to be fair I’m not sure that one is what you should ex­pect… that’s just a thing that I com­monly hear.

Ro­hin Shah: Yes. I also hear that.

Robert Long: I was sur­prised at your de­cep­tion com­ment… You were talk­ing about, “What about sce­nar­ios where noth­ing seems wrong un­til you reach a cer­tain level?”

Asya Bergal: Right. Sorry, that doesn’t have to be deception. I think maybe I mentioned deception because I feel like I also commonly see it.

Rohin Shah: I guess if I imagine “How did AI lead to extinction?”, I don’t really imagine a scenario that doesn’t involve deception. And then I claim that conditional on that scenario having happened, I am very surprised by the fact that we did not notice this deception in any earlier scenario that didn’t lead to extinction. And I don’t really get people’s intuitions for why that would be the case. I haven’t tried to figure that one out though.

Sara Hax­hia: So do you have no model of how peo­ple’s in­tu­itions differ? You can’t see it go­ing wrong aside from if it was de­cep­tively al­igned? Why?

Ro­hin Shah: Oh, I feel like most peo­ple have the in­tu­ition that con­di­tional on ex­tinc­tion, it hap­pened by the AI de­ceiv­ing us. [Note: In this in­ter­view, Ro­hin was only con­sid­er­ing risks aris­ing be­cause of AI sys­tems that try to op­ti­mize for goals that are not our own, not other forms of ex­is­ten­tial risks from AI.]

Asya Ber­gal: I think there’s an­other class of things which is some­thing not nec­es­sar­ily de­ceiv­ing us, as in it has a model of our goals and in­ten­tion­ally pre­sents us with de­cep­tive out­put, and just like… it has some no­tion of util­ity func­tion and op­ti­mizes for that poorly. It doesn’t nec­es­sar­ily have a model of us, it just op­ti­mizes the pa­per­clips or some­thing like that, and we didn’t re­al­ize be­fore that it is op­ti­miz­ing. I think when I hear de­cep­tive, I think “it has a model of hu­man be­hav­ior that is in­ten­tion­ally try­ing to do things that sub­vert our ex­pec­ta­tions”. And I think there’s also a ver­sion where it just has goals un­al­igned with ours and doesn’t spend any re­sources in mod­el­ing our be­hav­ior.

Ro­hin Shah: I think in that sce­nario, usu­ally as an in­stru­men­tal goal, you need to de­ceive hu­mans, be­cause if you don’t have a model of hu­man be­hav­ior– if you don’t model the fact that hu­mans are go­ing to in­terfere with your plans– hu­mans just turn you off and noth­ing, there’s no ex­tinc­tion.

Robert Long: Be­cause we’d no­tice. You’re think­ing in the non-de­cep­tion cases, as with the de­cep­tion cases, in this sce­nario we’d prob­a­bly no­tice.

Sara Hax­hia: That clar­ifies my ques­tion. Great.

Ro­hin Shah: As far as I know, this is an ac­cepted thing among peo­ple who think about AI x-risk.

Asya Ber­gal: The ac­cepted thing is like, “If things go badly, it’s be­cause it’s ac­tu­ally de­ceiv­ing us on some level”?

Ro­hin Shah: Yup. There are some other sce­nar­ios which could lead to us not be­ing de­ceived and bad things still hap­pen. Th­ese tend to be things like, we build an econ­omy of AI sys­tems and then slowly hu­mans get pushed out of the econ­omy of AI sys­tems and…

They’re still mod­el­ing us. I just can’t re­ally imag­ine the sce­nario in which they’re not mod­el­ing us. I guess you could imag­ine one where we slowly cede power to AI sys­tems that are do­ing things bet­ter than we could. And at no point are they ac­tively try­ing to de­ceive us, but at some point they’re just like… they’re run­ning the en­tire econ­omy and we don’t re­ally have much say in it.

And per­haps this could get to a point where we’re like, “Okay, we have lost con­trol of the fu­ture and this is effec­tively an x-risk, but at no point was there re­ally any de­cep­tion.”

Asya Ber­gal: Right. I’m happy to move on to other stuff.

Ro­hin Shah: Cool. Let’s see. What’s the next one I have? All right. This one’s a lot sketchier-

Asya Ber­gal: So sorry, what is the thing that we’re list­ing just so-

Ro­hin Shah: Oh, rea­sons why AI safety will be fine by de­fault.

Asya Ber­gal: Right. Gotcha, great.

Ro­hin Shah: Okay. Th­ese two points were both re­ally one point. So then the next one was… I claimed that as AI sys­tems get more pow­er­ful, they will be­come more in­ter­pretable and eas­ier to un­der­stand, just be­cause they’re us­ing– they will prob­a­bly be able to get and learn fea­tures that hu­mans also tend to use.

I don’t think this has re­ally been de­bated in the com­mu­nity very much and– sorry, I don’t mean that there’s agree­ment on it. I think it is just not a hy­poth­e­sis that has been pro­moted to at­ten­tion in the com­mu­nity. And it’s not to­tally clear what the safety im­pli­ca­tions are. It sug­gests that we could un­der­stand AI sys­tems more eas­ily and sort of in com­bi­na­tion with the pre­vi­ous point it says, “Oh, we’ll no­tice things– we’ll be more able to no­tice things than to­day where we’re like, ‘Here’s this image clas­sifier. Does it do good things? Who the hell knows? We tried it on a bunch of in­puts and it seemed like it was do­ing the right stuff, but who knows what it’s do­ing in­side.’”

Asya Ber­gal: I’m cu­ri­ous why you think it’s likely to use fea­tures that hu­mans tend to use. It’s pos­si­ble the an­swer is some in­tu­ition that’s hard to de­scribe.

Ro­hin Shah: In­tu­ition that I hope to de­scribe in a year. Partly it’s that in the very toy straw model, there are just a bunch of fea­tures in the world that an AI sys­tem can pay at­ten­tion to in or­der to make good pre­dic­tions. When you limit the AI sys­tem to make pre­dic­tions on a very small nar­row dis­tri­bu­tion, which is like all AI sys­tems to­day, there are lots of fea­tures that the AI sys­tem can use for that task that we hu­mans don’t use be­cause they’re just not very good for the rest of the dis­tri­bu­tion.

Asya Ber­gal: I see. It seems like im­plic­itly in this ar­gu­ment is that when hu­mans are run­ning their own clas­sifiers, they have some like nat­u­ral op­ti­mal set of fea­tures that they use for that dis­tri­bu­tion?

Ro­hin Shah: I don’t know if I’d say op­ti­mal, but yeah. Bet­ter than the fea­tures that the AI sys­tem is us­ing.

Robert Long: In the space of bet­ter fea­tures, why aren’t they go­ing past us or into some other op­ti­mal space of fea­ture world?

Ro­hin Shah: I think they would even­tu­ally.

Robert Long: I see, but they might have to go through ours first?

Ro­hin Shah: So A) I think they would go through ours, B) I think my in­tu­ition is some­thing like the fea­tures– and this one seems like more just raw in­tu­ition and I don’t re­ally have an ar­gu­ment for it– but the fea­tures… things like agency, op­ti­miza­tion, want, de­cep­tion, ma­nipu­la­tion seem like things that are use­ful for mod­el­ing the world.

I would be surprised if an AI system went so far beyond that those features didn’t even enter into its calculations. Or, I’d be surprised if that happened very quickly, maybe. I don’t want to make claims about how far past those AI systems could go, but I do think that… I guess I’m also saying that we should be aiming for AI systems that are like… This is a terrible way to operationalize it, but AI systems that are 10x as intelligent as humans, what do we have to do for them? And then once we’ve got AI systems that are 10x smarter than us, then we’re like, “All right, what more problems could arise in the future?” And ask the AI systems to help us with that as well.

Asya Ber­gal: To clar­ify, the thing you’re say­ing is… By the time AI sys­tems are good and more pow­er­ful, they will have some con­cep­tion of the kind of fea­tures that hu­mans use, and be able to de­scribe their de­ci­sions in terms of those fea­tures? Or do you think in­her­ently, there’ll be a point where AI sys­tems use the ex­act same fea­tures that hu­mans use?

Ro­hin Shah: Not the ex­act same fea­tures, but broadly similar fea­tures to the ones that hu­mans use.

Robert Long: Where ex­am­ples of those fea­tures would be like ob­jects, cause, agent, the things that we want in­ter­preted in deep nets but usu­ally can’t.

Ro­hin Shah: Yes, ex­actly.

Asya Ber­gal: Again, so you think in some sense that that’s a nat­u­ral way to de­scribe things? Or there’s only one path through get­ting bet­ter at de­scribing things, and that has to go through the way that hu­mans de­scribe things? Does that sound right?

Ro­hin Shah: Yes.

Asya Ber­gal: Okay. Does that also feel like an in­tu­ition?

Ro­hin Shah: Yes.

Robert Long: Sorry, I think I did a bad in­ter­viewer thing where I started list­ing things, I should have just asked you to list some of the fea­tures which I think-

Ro­hin Shah: Well I listed them, like, op­ti­miza­tion, want, mo­ti­va­tion be­fore, but I agree causal­ity would be an­other one. But yeah, I was think­ing more the things that safety re­searchers of­ten talk about. I don’t know, what other fea­tures do we tend to use a lot? Ob­ject’s a good one… the con­cep­tion of 3D space is one that I don’t think these clas­sifiers have and that we definitely have.

And the con­cept of 3D space seems like it’s prob­a­bly go­ing to be use­ful for an AI sys­tem no mat­ter how smart it gets. Cur­rently, they might have a con­cept of 3D space, but it’s not ob­vi­ous that they do. And I wouldn’t be sur­prised if they don’t.

At some point, I want to take this in­tu­ition and run with it and see where it goes. And try to ar­gue for it more.

Robert Long: But I think for the purposes of this interview, I think we do understand how this is something that would make things safe by default. At least, inasmuch as interpretability conduces to safety. Because we could be able to interpret them and still fuck shit up.

Ro­hin Shah: Yep. Agreed. Cool.

Sara Hax­hia: I guess I’m a lit­tle bit con­fused about how it makes the code more in­ter­pretable. I can see how if it uses hu­man brains, we can model it bet­ter be­cause we can just say, “Th­ese are hu­man things and this means we can make pre­dic­tions bet­ter.” But if you’re look­ing at a neu­ral net or some­thing, it doesn’t make it more in­ter­pretable.

Ro­hin Shah: If you mean the code, I agree with that.

Sara Hax­hia: Okay. So, is this kind of like ex­ter­nal, like you be­ing able to model that thing?

Ro­hin Shah: I think you could look at the… you take a par­tic­u­lar in­put to neu­ral net, you pass it through lay­ers, you see what the ac­ti­va­tions are. I don’t think if you just look di­rectly at the ac­ti­va­tions, you’re go­ing to get any­thing sen­si­ble, in the same way that if you look at elec­tri­cal sig­nals in my brain you’re not go­ing to be able to un­der­stand them.

Sara Hax­hia: So, is your point that the rea­son it be­comes more in­ter­pretable is some­thing more like, you un­der­stand its mo­ti­va­tions?

Ro­hin Shah: What I mean is… Are you fa­mil­iar with Chris Olah’s work?

Sara Hax­hia: I’m not.

Ro­hin Shah: Okay. So Chris Olah does in­ter­pretabil­ity work with image clas­sifiers. One tech­nique that he uses is: Take a par­tic­u­lar neu­ron in the neu­ral net, say, “I want to max­i­mize the ac­ti­va­tion of this neu­ron,” and then do gra­di­ent de­scent on your in­put image to see what image max­i­mally ac­ti­vates that neu­ron. And this gives you some in­sight into what that neu­ron is de­tect­ing. I think things like that will be eas­ier as time goes on.
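[Note: The technique described here is often called activation maximization. The sketch below is an illustration added to this transcript, not Olah's actual code. It assumes a hypothetical two-layer network with random weights (`W1`, `b1`, `w2`) and a 16-pixel "image", and it takes gradient steps on the input to raise one neuron's activation. Because the goal is to maximize, the steps go up the gradient, which is equivalent to gradient descent on the negated activation.]

```python
import math
import random

def forward(x, W1, b1, w2):
    """Return (hidden activations, activation of the neuron of interest)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return hidden, sum(w * h for w, h in zip(w2, hidden))

def input_gradient(x, W1, b1, w2):
    """Gradient of the neuron's activation with respect to each input pixel."""
    hidden, _ = forward(x, W1, b1, w2)
    # Chain rule: d(act)/dx_j = sum_i w2[i] * (1 - hidden[i]^2) * W1[i][j]
    return [sum(w2[i] * (1.0 - hidden[i] ** 2) * W1[i][j] for i in range(len(W1)))
            for j in range(len(x))]

def maximize_activation(x0, W1, b1, w2, steps=300, lr=0.01):
    """Gradient ascent on the input; returns the best input seen and its activation."""
    x = list(x0)
    best_x, best_act = list(x), forward(x, W1, b1, w2)[1]
    for _ in range(steps):
        grad = input_gradient(x, W1, b1, w2)
        # Step uphill, then clip so every "pixel" stays in [0, 1].
        x = [min(1.0, max(0.0, xi + lr * g)) for xi, g in zip(x, grad)]
        act = forward(x, W1, b1, w2)[1]
        if act > best_act:
            best_x, best_act = list(x), act
    return best_x, best_act

# Hypothetical random network standing in for a trained classifier.
rng = random.Random(0)
n_pixels, n_hidden = 16, 8
W1 = [[rng.gauss(0, 1) for _ in range(n_pixels)] for _ in range(n_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(n_hidden)]
w2 = [rng.gauss(0, 1) for _ in range(n_hidden)]

x0 = [0.5] * n_pixels  # start from a uniform grey "image"
start_act = forward(x0, W1, b1, w2)[1]
best_x, best_act = maximize_activation(x0, W1, b1, w2)
```

With a real trained classifier, the same loop would typically use a framework's automatic differentiation rather than a hand-written gradient, and `best_x` would be rendered as an image to guess what the neuron detects.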

Robert Long: Even if it’s not just that par­tic­u­lar tech­nique, right? Just the gen­eral task?

Ro­hin Shah: Yes.

Sara Hax­hia: How does that re­late to the hu­man val­ues thing? It felt like you were say­ing some­thing like it’s go­ing to model the world in a similar way to the way we do, and that’s go­ing to make it more in­ter­pretable. And I just don’t re­ally see the link.

Ro­hin Shah: A straw ver­sion of this, which isn’t ex­actly what I mean but sort of is the right in­tu­ition, would be like maybe if you run the same… What’s the in­put that max­i­mizes the out­put of this neu­ron? You’ll see that this par­tic­u­lar neu­ron is a de­cep­tion clas­sifier. It looks at the in­put and then based on some­thing, does some com­pu­ta­tion with the in­put, maybe the in­put’s like a di­alogue be­tween two peo­ple and then this neu­ron is tel­ling you, “Hey, is per­son A try­ing to de­ceive per­son B right now?” That’s an ex­am­ple of the sort of thing I am imag­in­ing.

Asya Ber­gal: I’m go­ing to do the bad in­ter­viewer thing where I put words in your mouth. I think one prob­lem right now is you can go a few lay­ers into a neu­ral net­work and the first few lay­ers cor­re­spond to things you can eas­ily tell… Like, the first layer is clearly look­ing at all the differ­ent pixel val­ues, and maybe the sec­ond layer is find­ing lines or some­thing like that. But then there’s this worry that later on, the neu­rons will cor­re­spond to con­cepts that we have no hu­man in­ter­pre­ta­tion for, so it won’t even make sense to in­ter­pret them. Whereas Ro­hin is say­ing, “No, ac­tu­ally the neu­rons will cor­re­spond to, or the ar­chi­tec­ture will cor­re­spond to some hu­man un­der­stand­able con­cept that it makes sense to in­ter­pret.” Does that seem right?

Ro­hin Shah: Yeah, that seems right. I am maybe not sure that I tie it nec­es­sar­ily to the ar­chi­tec­ture, but ac­tu­ally prob­a­bly I’d have to one day.

Asya Ber­gal: Definitely, you don’t need to. Yeah.

Ro­hin Shah: Any­way, I haven’t thought about that enough, but that’s ba­si­cally that. If you look at cur­rent late lay­ers in image clas­sifiers they are of­ten like, “Oh look, this is a de­tec­tor for lemon ten­nis balls,” and you’re just like, “That’s a strange con­cept you’ve got there, neu­ral net, but sure.”

Robert Long: Alright, cool. Next way of be­ing safe?

Ro­hin Shah: They’re get­ting more and more sketchy. I have an in­tu­ition that… I should rephrase this. I have an in­tu­ition that AI sys­tems are not well-mod­eled as, “Here’s the ob­jec­tive func­tion and here is the world model.” Most of the clas­sic ar­gu­ments are: Sup­pose you’ve got an in­cor­rect ob­jec­tive func­tion, and you’ve got this AI sys­tem with this re­ally, re­ally good in­tel­li­gence, which maybe we’ll call it a world model or just gen­eral in­tel­li­gence. And this in­tel­li­gence can take in any util­ity func­tion, and op­ti­mize it, and you plug in the in­cor­rect util­ity func­tion, and catas­tro­phe hap­pens.

This does not seem to be the way that cur­rent AI sys­tems work. It is the case that you have a re­ward func­tion, and then you sort of train a policy that op­ti­mizes that re­ward func­tion, but… I ex­plained this the wrong way around. But the policy that’s learned isn’t re­ally… It’s not re­ally perform­ing an op­ti­miza­tion that says, “What is go­ing to get me the most re­ward? Let me do that thing.”

It has been given a bunch of heuris­tics by gra­di­ent de­scent that tend to cor­re­late well with get­ting high re­ward and then it just ex­e­cutes those heuris­tics. It’s kind of similar to… If any of you are fans of the se­quences… Eliezer wrote a se­quence on evolu­tion and said… What was it? Hu­mans are not fit­ness max­i­miz­ers, they are adap­ta­tion ex­ecu­tors, some­thing like this. And that is how I view neu­ral nets to­day that are trained by RL. They don’t re­ally seem like ex­pected util­ity max­i­miz­ers the way that it’s usu­ally talked about by MIRI or on LessWrong.

I mostly ex­pect this to con­tinue, I think con­di­tional on AGI be­ing de­vel­oped soon-ish, like in the next decade or two, with some­thing kind of like cur­rent tech­niques. I think it would be… AGI would be a mesa op­ti­mizer or in­ner op­ti­mizer, whichever term you pre­fer. And that that in­ner op­ti­mizer will just sort of have a mish­mash of all of these heuris­tics that point in a par­tic­u­lar di­rec­tion but can’t re­ally be de­com­posed into ‘here are the ob­jec­tives, and here is the in­tel­li­gence’, in the same way that you can’t re­ally de­com­pose hu­mans very well into ‘here are the ob­jec­tives and here is the in­tel­li­gence’.
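[Note: A toy sketch, added to this transcript and not from the interview, of the distinction being drawn. It contrasts an agent that decomposes into "objective plus general-purpose search" with one that only executes learned heuristics. Every name in it (`WORLD_MODEL`, `planner`, `heuristic_agent`, the two objectives) is hypothetical, and real trained policies are far messier than a lookup table.]

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy world model: (state, action) -> resulting outcome.
WORLD_MODEL: Dict[Tuple[str, str], str] = {
    ("start", "left"): "apples",
    ("start", "right"): "paperclips",
}
ACTIONS: List[str] = ["left", "right"]

def planner(state: str, objective: Callable[[str], float]) -> str:
    """Decomposed agent: a generic search that optimizes whatever objective is plugged in."""
    return max(ACTIONS, key=lambda action: objective(WORLD_MODEL[(state, action)]))

# Heuristic policy: a fixed state -> action table, standing in for behaviour
# that training happened to reinforce. It contains no objective to swap out.
HEURISTIC_POLICY: Dict[str, str] = {"start": "left"}

def heuristic_agent(state: str) -> str:
    """Adaptation-executor agent: runs the learned rule without consulting any objective."""
    return HEURISTIC_POLICY[state]

def likes_apples(outcome: str) -> float:
    return 1.0 if outcome == "apples" else 0.0

def likes_paperclips(outcome: str) -> float:
    return 1.0 if outcome == "paperclips" else 0.0

# Swapping the objective redirects the planner but leaves the heuristic agent unchanged.
planner_choices = (planner("start", likes_apples), planner("start", likes_paperclips))
heuristic_choice = heuristic_agent("start")
```

The classic arguments assume something like `planner`, where plugging in a wrong objective redirects all of the system's competence. The view described above is that trained policies look more like `heuristic_agent`, with no separable objective slot.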

Robert Long: And why does that lead to bet­ter safety?

Ro­hin Shah: I don’t know that it does, but it leads to not be­ing as con­fi­dent in the origi­nal ar­gu­ments. It feels like this should be push­ing in the di­rec­tion of ‘it will be eas­ier to cor­rect or mod­ify or change the AI sys­tem’. Many of the ar­gu­ments for risk are ‘if you have a util­ity max­i­mizer, it has all of these con­ver­gent in­stru­men­tal sub-goals’ and, I don’t know, if I look at hu­mans they kind of sort of pur­sued con­ver­gent in­stru­men­tal sub-goals, but not re­ally.

You can definitely con­vince them that they should have differ­ent goals. They change the thing they are pur­su­ing rea­son­ably of­ten. Mostly this just re­duces my con­fi­dence in ex­ist­ing ar­gu­ments rather than gives me an ar­gu­ment for safety.

Robert Long: It’s like a defeater for AI safety ar­gu­ments that rely on a clean sep­a­ra­tion be­tween util­ity…

Rohin Shah: Yeah, which seems like all of them. All of the most crisp ones. Not all of them. I keep forgetting about the… I keep not taking into account the one where your god-like AI slowly replaces humans and humans lose control of the future. That one still seems totally possible in this world.

Robert Long: If AGI is through cur­rent tech­niques, it’s likely to have sys­tems that don’t have this clean sep­a­ra­tion.

Rohin Shah: Yep. A separate claim that I would argue for separately– I don’t think they interact very much– is that I would also claim that we will get AGI via essentially current techniques. I don’t know if I should put a timeline on it, but two decades seems plausible. Not saying it’s likely, maybe 50% or something. And that the resulting AGI will look like a mesa optimizer.

Asya Ber­gal: Yeah. I’d be very cu­ri­ous to delve into why you think that.

Robert Long: Yeah, me too. Let’s just do that be­cause that’s fast. Also your… What do you mean by cur­rent tech­niques, and what’s your cre­dence in that be­ing what hap­pens?

Sara Hax­hia: And like what’s your model for how… where is this com­ing from?

Ro­hin Shah: So on the meta ques­tions, first, the cur­rent tech­niques would be like deep learn­ing, gra­di­ent de­scent broadly, maybe RL, maybe meta-learn­ing, maybe things sort of like it, but back prop­a­ga­tion or some­thing like that is still in­volved.

I don’t think there’s a clean line here. Some­thing like, we don’t look back and say: That. That was where the ML field just to­tally did a U-turn and did some­thing else en­tirely.

Robert Long: Right. Every­thing that’s in­volved in the build­ing of the AGI is some­thing you can roughly find in cur­rent text­books or like con­fer­ence pro­ceed­ings or some­thing. Maybe com­bined in new cool ways.

Ro­hin Shah: Yeah. Maybe, yeah. Yup. And also you throw a bunch of com­pute at it. That is part of my model. So that was the first one. What is cur­rent tech­niques? Then you asked cre­dence.

Cre­dence in AGI de­vel­oped in two decades by cur­rent-ish tech­niques… Depends on the defi­ni­tion of cur­rent-ish tech­niques, but some­thing like 30, 40%. Cre­dence that it will be a mesa op­ti­mizer, maybe con­di­tional on this be­ing… The pre­vi­ous thing be­ing true, the cre­dence on it be­ing a mesa op­ti­mizer, 60, 70%. Yeah, maybe 70%.
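[Note: If the 70% is read as conditional on the 30 to 40% event, as Shah suggests, the implied joint credence is the product of the two. The snippet below is arithmetic added to this transcript, not a figure Shah stated, and the variable names are illustrative.]

```python
# Rough credences quoted in the conversation (guesses, not precise estimates).
p_agi_current_low, p_agi_current_high = 0.30, 0.40  # AGI within two decades via current-ish techniques
p_mesa_given_agi = 0.70                             # mesa optimizer, conditional on the above

# Joint credence: P(AGI via current techniques AND mesa optimizer)
joint_low = p_agi_current_low * p_mesa_given_agi
joint_high = p_agi_current_high * p_mesa_given_agi

print(f"{joint_low:.2f} to {joint_high:.2f}")  # prints "0.21 to 0.28"
```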

And then the ac­tual model for why this is… it’s sort of re­lated to the pre­vi­ous points about fea­tures wherein there are lots and lots of fea­tures and hu­mans have set­tled on the ones that are broadly use­ful across a wide va­ri­ety of con­texts. I think that in that world, what you want to do to get AGI is train an AI sys­tem on a very broad… train an AI sys­tem maybe by RL or some­thing else, I don’t know. Prob­a­bly RL.

On a very large dis­tri­bu­tion of tasks or a large dis­tri­bu­tion of some­thing, maybe they’re tasks, maybe they’re not like, I don’t know… Hu­man ba­bies aren’t re­ally train­ing on some par­tic­u­lar task. Maybe it’s just a bunch of un­su­per­vised learn­ing. And in do­ing so over a lot of time and a lot of com­pute, it will con­verge on the same sorts of fea­tures that hu­mans use.

I think the nice part of this story is that it doesn’t re­quire that you ex­plain how the AI sys­tem gen­er­al­izes– gen­er­al­iza­tion in gen­eral is just a very difficult prop­erty to get out of ML sys­tems if you want to gen­er­al­ize out­side of the train­ing dis­tri­bu­tion. You mostly don’t re­quire that here be­cause, A) it’s be­ing trained on a very wide va­ri­ety of tasks and B) it’s sort of mimick­ing the same sort of pro­ce­dure that was used to cre­ate hu­mans. Where, with hu­mans you’ve also got the sort of… evolu­tion did a lot of op­ti­miza­tion in or­der to cre­ate crea­tures that were able to work effec­tively in the en­vi­ron­ment, the en­vi­ron­ment’s su­per com­pli­cated, es­pe­cially be­cause there are other crea­tures that are try­ing to use the same re­sources.

And so that’s where you get the wide va­ri­ety or, the very like broad dis­tri­bu­tion of things. Okay. What have I not said yet?

Robert Long: That was your model. Are you done with the model of how that sort of thing hap­pens or-

Ro­hin Shah: I feel like I’ve for­got­ten as­pects, for­got­ten to say as­pects of the model, but maybe I did say all of it.

Robert Long: Well, just to recap: One thing you really want is generalization, but this is in some sense taken care of because you’re just training on a huge bunch of tasks. Secondly, you’re likely to get them learning useful features. And one-

Ro­hin Shah: And thirdly, it’s mimick­ing what evolu­tion did, which is the one ex­am­ple we have of a pro­cess that cre­ated gen­eral in­tel­li­gence.

Asya Ber­gal: It feels like im­plicit in this sort of claim for why it’s soon is that com­pute will grow suffi­ciently to ac­com­mo­date this pro­cess, which is similar to evolu­tion. It feels like there’s im­plicit there, a claim that com­pute will grow and a claim that how­ever com­pute will grow, that’s go­ing to be enough to do this thing.

Ro­hin Shah: Yeah, that’s fair. I think ac­tu­ally I don’t have good rea­sons for be­liev­ing that, maybe I should re­duce my cre­dences on these a bit, but… That’s ba­si­cally right. So, it feels like for the first time I’m like, “Wow, I can ac­tu­ally use es­ti­mates of hu­man brain com­pu­ta­tion and it ac­tu­ally makes sense with my model.”

I’m like, “Yeah, existing AI systems seem more expensive to run than the human brain…” Sorry, if you compare dollars per hour of human brain equivalent. Hiring a human is what? Maybe we call it $20 an hour or something if we’re talking about relatively simple tasks. And then, I don’t think you could get an equivalent amount of compute for $20 for a while, but maybe I forget what number it came out to, I got to recently. Yeah, actually the compute question feels like a thing I don’t actually know the answer to.

Asya Ber­gal: A re­lated ques­tion– this is just to clar­ify for me– it feels like maybe the rele­vant thing to com­pare to is not the amount of com­pute it takes to run a hu­man brain, but like-

Ro­hin Shah: Evolu­tion also mat­ters.

Asya Ber­gal: Yeah, the amount of com­pute to get to the hu­man brain or some­thing like that.

Ro­hin Shah: Yes, I agree with that, that that is a rele­vant thing. I do think we can be way more effi­cient than evolu­tion.

Asya Ber­gal: That sounds right. But it does feel like that’s… that does seem like that’s the right sort of quan­tity to be look­ing at? Or does it feel like-

Ro­hin Shah: For train­ing, yes.

Asya Ber­gal: I’m cu­ri­ous if it feels like the train­ing is go­ing to be more ex­pen­sive than the run­ning in your model.

Ro­hin Shah: I think the… It’s a good ques­tion. It feels like we will need a bunch of ex­per­i­men­ta­tion, figur­ing out how to build es­sen­tially the equiv­a­lent of the hu­man brain. And I don’t know how ex­pen­sive that pro­cess will be, but I don’t think it has to be a sin­gle pro­gram that you run. I think it can be like… The re­search pro­cess it­self is part of that.

At some point I think we build a sys­tem that is ini­tially trained by gra­di­ent de­scent, and then the train­ing by gra­di­ent de­scent is com­pa­rable to hu­mans go­ing out in the world and act­ing and learn­ing based on that. A pretty big un­cer­tainty here is: How much has evolu­tion put in a bunch of im­por­tant pri­ors into hu­man brains? Ver­sus how much are hu­man brains ac­tu­ally just learn­ing most things from scratch? Well, scratch or learn­ing from their par­ents.

Peo­ple would claim that ba­bies have lots of in­duc­tive bi­ases, I don’t know that I buy it. It seems like you can learn a lot with a month of just look­ing at the world and ex­plor­ing it, es­pe­cially when you get way more data than cur­rent AI sys­tems get. For one thing, you can just move around in the world and no­tice that it’s three di­men­sional.

Another thing is you can ac­tu­ally in­ter­act with stuff and see what the re­sponse is. So you can get causal in­ter­ven­tion data, and that’s prob­a­bly where causal­ity be­comes such an in­grained part of us. So I could imag­ine that these things that we see as core to hu­man rea­son­ing, things like hav­ing a no­tion of causal­ity or hav­ing a no­tion, I think ap­par­ently we’re also sup­posed to have as ba­bies an in­tu­ition about statis­tics and like coun­ter­fac­tu­als and prag­mat­ics.

But all of these are done with brains that have been in the world for a long time, rel­a­tively speak­ing, rel­a­tive to AI sys­tems. I’m not ac­tu­ally sure if I buy that this is be­cause we have re­ally good pri­ors.

Asya Ber­gal: I re­cently heard… Some­one was talk­ing to me about an ar­gu­ment that went like: Hu­mans, in ad­di­tion to hav­ing pri­ors, built-ins from evolu­tion and learn­ing things in the same way that neu­ral nets do, learn things through… you go to school and you’re taught cer­tain con­cepts and al­gorithms and stuff like that. And that seems dis­tinct from learn­ing things in a gra­di­ent de­scenty way. Does that seem right?

Ro­hin Shah: I definitely agree with that.

Asya Ber­gal: I see. And does that seem like a plau­si­ble thing that might not be en­com­passed by some gra­di­ent de­scenty thing?

Rohin Shah: I think the idea there would be, you do the gradient descenty thing for some time. That gets you an AI system that now has inside of it a way to learn. That’s sort of what it means to be a mesa optimizer. And then that mesa optimizer can go and do its own thing to do better learning. And maybe at some point you just say, “To hell with this gradient descent, I’ll turn it off.” Probably humans don’t do that. Maybe humans do that, I don’t know.

Asya Ber­gal: Right. So you do gra­di­ent de­scent to get to some place. And then from there you can learn in the same way– where you just read ar­ti­cles on the in­ter­net or some­thing?

Rohin Shah: Yeah. Oh, another reason that I think this… Another part of my model for why this is more likely– I knew there was more– is exactly that point, which is that learning probably requires some more deliberate active process than gradient descent. Gradient descent feels really relatively dumb, not as dumb as evolution, but close. And the only plausible way I’ve seen so far for how that could happen is by mesa optimization. And it also seems to be how it happened with humans. I guess you could imagine the meta-learning system that’s explicitly trying to develop this learning algorithm.

And then… okay, by the defi­ni­tion of mesa op­ti­miz­ers, that would not be a mesa op­ti­mizer, it would be an in­ner op­ti­mizer. So maybe it’s an in­ner op­ti­mizer in­stead if we use-

Asya Bergal: I think I don’t quite understand what it means that learning requires, or that the only way to do learning is through mesa optimization.

Ro­hin Shah: I can give you a brief ex­pla­na­tion of what it means to me in a minute or two. I’m go­ing to go and open my sum­mary be­cause that says it bet­ter than I can.

Learned op­ti­miza­tion, that’s what it was called. All right. Sup­pose you’re search­ing over a space of pro­grams to find one that plays tic-tac-toe well. And ini­tially you find a pro­gram that says, “If the board is empty, put some­thing in the cen­ter square,” or rather, “If the cen­ter square is empty, put some­thing there. If there’s two in a row some­where of yours, put some­thing to com­plete it. If your op­po­nent has two in a row some­where, make sure to block it,” and you learn a bunch of these heuris­tics. Those are some nice, in­ter­pretable heuris­tics but maybe you’ve got some un­in­ter­pretable ones too.

But as you search more and more, eventually someday you stumble upon the minimax algorithm, which just says, “Play out the game all the way until the end. Look at all possible moves that you could make, and all possible moves your opponent could make, and search for the path where you are guaranteed to win.”

And then you’re like, “Wow, this al­gorithm, it just always wins. No one can ever beat it. It’s amaz­ing.” And so ba­si­cally you have this outer op­ti­miza­tion loop that was search­ing over a space of pro­grams, and then it found a pro­gram, so one el­e­ment of the space, that was it­self perform­ing op­ti­miza­tion, be­cause it was search­ing through pos­si­ble moves or pos­si­ble paths in the game tree to find the ac­tual policy it should play.

And so your outer op­ti­miza­tion al­gorithm found an in­ner op­ti­miza­tion al­gorithm that is good, or it solves the task well. And the main claim I will make, and I’m not sure if… I don’t think the pa­per makes it, but the claim I will make is that for many tasks if you’re us­ing gra­di­ent de­scent as your op­ti­mizer, be­cause gra­di­ent de­scent is so an­noy­ingly slow and sim­ple and in­effi­cient, the best way to ac­tu­ally achieve the task will be to find a mesa op­ti­mizer. So gra­di­ent de­scent finds pa­ram­e­ters that them­selves take an in­put, do some sort of op­ti­miza­tion, and then figure out an out­put.
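The tic-tac-toe example can be made concrete with a small Python sketch. Nothing in it comes from the conversation or the paper: the function names (`heuristic_move`, `minimax`, `worst_case`), the 9-character board encoding, and the two-element "space of programs" are all illustrative assumptions. It contrasts a fixed-rule program with one that searches the game tree at decision time, then lets a toy outer selection step pick between them. In tic-tac-toe, perfect play guarantees at least a draw rather than a win, so the search-based program is scored by its worst-case outcome.

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board stored as a 9-character string.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def other(player):
    return "O" if player == "X" else "X"


def place(board, i, player):
    return board[:i] + player + board[i + 1:]


def heuristic_move(board, player):
    """Fixed rules (hypothetical ordering): complete, block, center, first free."""
    for who in (player, other(player)):
        for line in LINES:
            marks = [board[i] for i in line]
            if marks.count(who) == 2 and marks.count(" ") == 1:
                return line[marks.index(" ")]
    return 4 if board[4] == " " else board.index(" ")


@lru_cache(maxsize=None)
def minimax(board, player):
    """Search the game tree; return (value, move) for the player to move.

    Value is +1 for a forced win, 0 for a draw, -1 for a forced loss.
    """
    won = winner(board)
    if won is not None:
        return (1 if won == player else -1), None
    if " " not in board:
        return 0, None
    best_value, best_move = -2, None
    for i, cell in enumerate(board):
        if cell == " ":
            value = -minimax(place(board, i, player), other(player))[0]
            if value > best_value:
                best_value, best_move = value, i
    return best_value, best_move


def minimax_move(board, player):
    return minimax(board, player)[1]


def worst_case(policy, board=" " * 9):
    """Worst result for X when X follows `policy` and O tries every reply."""
    won = winner(board)
    if won is not None:
        return 1 if won == "X" else -1
    if " " not in board:
        return 0
    if board.count("X") == board.count("O"):  # X moves first, so X is to move
        return worst_case(policy, place(board, policy(board, "X"), "X"))
    return min(worst_case(policy, place(board, i, "O"))
               for i, cell in enumerate(board) if cell == " ")


# Toy "outer optimizer": choose the candidate program with the best worst case.
best_program = max([heuristic_move, minimax_move], key=worst_case)
```

Here the outer step only compares two hand-written candidates, whereas the setting described above involves gradient descent searching a vast parameter space. The point of the sketch is just the structural one: the selected program, `minimax_move`, itself performs a search each time it is called.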

Asya Ber­gal: Got you. So I guess part of it is di­vid­ing into sub-prob­lems that need to be op­ti­mized and then run­ning… Does that seem right?

Ro­hin Shah: I don’t know that there’s nec­es­sar­ily a di­vi­sion into sub prob­lems, but it’s a spe­cific kind of op­ti­miza­tion that’s tai­lored for the task at hand. Maybe an­other ex­am­ple would be… I don’t know, that’s a bad ex­am­ple. I think the anal­ogy to hu­mans is one I lean on a lot, where evolu­tion is the outer op­ti­mizer and it needs to build things that repli­cate a bunch.

It turns out hav­ing things repli­cate a bunch is not some­thing you can re­ally get by heuris­tics. What you need to do is to cre­ate hu­mans who can them­selves op­ti­mize and figure out how to… Well, not repli­cate a bunch, but do things that are very cor­re­lated with repli­cat­ing a bunch. And that’s how you get very good repli­ca­tors.

Asya Ber­gal: So I guess you’re say­ing… of­ten the gra­di­ent de­scent pro­cess will– it turns out that hav­ing an op­ti­mizer as part of the pro­cess is of­ten a good thing. Yeah, that makes sense. I re­mem­ber them in the mesa op­ti­miza­tion stuff.

Ro­hin Shah: Yeah. So that in­tu­ition is one of the rea­sons I think that… It’s part of my model for why AGI will be a mesa op­ti­mizer. Though I do– in the world where we’re not us­ing cur­rent ML tech­niques I’m like, “Oh, any­thing can hap­pen.”

Asya Ber­gal: That makes sense. Yeah, I was go­ing to ask about that. Okay. So con­di­tioned on cur­rent ML tech­niques lead­ing to it, it’ll prob­a­bly go through mesa op­ti­miz­ers?

Ro­hin Shah: Yeah. I might en­dorse the claim with much weaker con­fi­dence even with­out cur­rent ML tech­niques, but I’d have to think a lot more about that. There are ar­gu­ments for why mesa op­ti­miza­tion is the thing you want– is the thing that hap­pens– that are sep­a­rate from deep learn­ing. In fact, the whole pa­per doesn’t re­ally talk about deep learn­ing very much.

Robert Long: Cool. So that was dig­ging into the model of why and how con­fi­dent we should be on cur­rent tech­nique AGI, pro­saic AI I guess peo­ple call it? And seems like the ma­jor sources of un­cer­tainty there are: does com­pute ac­tu­ally go up, con­sid­er­a­tions about evolu­tion and its re­la­tion to hu­man in­tel­li­gence and learn­ing and stuff?

Ro­hin Shah: Yup. So the Me­dian Group, for ex­am­ple, will agree with most of this anal­y­sis… Ac­tu­ally no. The Me­dian Group will agree with some of this anal­y­sis but then say, and there­fore, AGI is ex­tremely far away, be­cause evolu­tion threw in some hor­rify­ing amount of com­pu­ta­tion and there’s no way we can ever match that.

Asya Ber­gal: I’m cu­ri­ous if you still have things on your list of like safety by de­fault ar­gu­ments, I’m cu­ri­ous to go back to that. Maybe you cov­ered them.

Ro­hin Shah: I think I have cov­ered them. The way I’ve listed this last one is ‘AI sys­tems will be op­ti­miz­ers in the same way that hu­mans are op­ti­miz­ers, not like Eliezer-style EU max­i­miz­ers’… which is ba­si­cally what I’ve just been say­ing.

Sara Haxhia: But it seems like it still feels dangerous… if a human had loads of power, it could do things that… even if they aren’t maximizing some utility.

Ro­hin Shah: Yeah, I agree, this is not an ar­gu­ment for com­plete safety. I for­get where I was ini­tially go­ing with this point. I think my main point here is that mesa op­ti­miz­ers don’t nice… Oh, right, they don’t nicely fac­tor into util­ity func­tion and in­tel­li­gence. And that re­duces my cre­dence in ex­ist­ing ar­gu­ments, and there are still is­sues which are like, with a mesa op­ti­mizer, your ca­pa­bil­ities gen­er­al­ize with dis­tri­bu­tional shift, but your ob­jec­tive doesn’t.

Hu­mans are not re­ally op­ti­miz­ing for re­pro­duc­tive suc­cess. And ar­guably, if some­one had wanted to cre­ate things that were re­ally good at re­pro­duc­ing, they might have used evolu­tion as a way to do it. And then hu­mans showed up and were like, “Oh, whoops, I guess we’re not do­ing that any­more.”

I mean, the mesa op­ti­miz­ers pa­per is a very pes­simistic pa­per. In their view, mesa op­ti­miza­tion is a bad thing that leads to dan­ger and that’s… I agree that all of the rea­sons they point out for mesa op­ti­miza­tion be­ing dan­ger­ous are in fact rea­sons that we should be wor­ried about mesa op­ti­miza­tion.

I think mostly I see this as… con­ver­gent in­stru­men­tal sub-goals are less likely to be ob­vi­ously a thing that this pur­sues. And that just feels more im­por­tant to me. I don’t re­ally have a strong ar­gu­ment for why that con­sid­er­a­tion dom­i­nates-

Robert Long: The con­ver­gent in­stru­men­tal sub-goals con­sid­er­a­tion?

Ro­hin Shah: Yeah.

Asya Ber­gal: I have a meta cre­dence ques­tion, maybe two lay­ers of them. The first be­ing, do you con­sider your­self op­ti­mistic about AI for some ran­dom qual­i­ta­tive defi­ni­tion of op­ti­mistic? And the fol­low-up is, what do you think is the cre­dence that by de­fault things go well, with­out ad­di­tional in­ter­ven­tion by us do­ing safety re­search or some­thing like that?

Ro­hin Shah: I would say rel­a­tive to AI al­ign­ment re­searchers, I’m op­ti­mistic. Rel­a­tive to the gen­eral pub­lic or some­thing like that, I might be pes­simistic. It’s hard to tell. I don’t know, cre­dence that things go well? That’s a hard one. In­tu­itively, it feels like 80 to 90%, 90%, maybe. 90 feels like I’m be­ing way too con­fi­dent and like, “What? You only as­sign 10%, even though you have liter­ally no… you can’t pre­dict the fu­ture and no one can pre­dict the fu­ture, why are you try­ing to do it?” It still does feel more like 90%.

Asya Ber­gal: I think that’s fine. I guess the fol­low-up is sort of like, be­tween the sort of things that you gave, which were like: Slow take­off al­lows for cor­rect­ing things, things that are more pow­er­ful will be more in­ter­pretable, and I think the third one be­ing, AI sys­tems not ac­tu­ally be­ing… I’m cu­ri­ous how much do you feel like your ac­tual be­lief in this leans on these ar­gu­ments? Does that make sense?

Ro­hin Shah: Yeah. I think the slow take­off one is the biggest one. If I be­lieve that at some point we would build an AI sys­tem that within the span of a week was just way smarter than any hu­man, and be­fore that the most pow­er­ful AI sys­tem was be­low hu­man level, I’m just like, “Shit, we’re doomed.”

Robert Long: Be­cause there it doesn’t mat­ter if it goes through in­ter­pretable fea­tures par­tic­u­larly.

Rohin Shah: There I’m like, “Okay, once we get to something that’s super intelligent, it feels like the human-ant analogy is basically right.” And unless we… Maybe we could still be fine because people thought about it and put in… Maybe I’m still like, “Oh, AI researchers would have been able to predict that this would’ve happened and so were careful.”

I don’t know, in a world where fast take­off is true, lots of things are weird about the world, and I don’t re­ally un­der­stand the world. So I’m like, “Shit, it’s quite likely some­thing goes wrong.” I think the slow take­off is definitely a crux. Also, we keep call­ing it slow take­off and I want to em­pha­size that it’s not nec­es­sar­ily slow in cal­en­dar time. It’s more like grad­ual.

Asya Ber­gal: Right, like ‘enough time for us to cor­rect things’ take­off.

Ro­hin Shah: Yeah. And there’s no dis­con­ti­nu­ity be­tween… you’re not like, “Here’s a 2X hu­man AI,” and a cou­ple of sec­onds later it’s now… Not a cou­ple of sec­onds later, but like, “Yeah, we’ve got 2X AI,” for a few months and then sud­denly some­one de­ploys a 10,000X hu­man AI. If that hap­pened, I would also be pretty wor­ried.

It’s more like there’s a 2X hu­man AI, then there’s like a 3X hu­man AI and then a 4X hu­man AI. Maybe this hap­pens from the same AI get­ting bet­ter and learn­ing more over time. Maybe it hap­pens from it de­sign­ing a new AI sys­tem that learns faster, but starts out lower and so then over­takes it sort of con­tin­u­ously, stuff like that.

So that I think, yeah, with­out… I don’t re­ally know what the al­ter­na­tive to it is, but in the one where it’s not hu­man level, and then 10,000X hu­man in a week and it just sort of hap­pened, that I’m like, I don’t know, 70% of doom or some­thing, maybe more. That feels like I’m… I en­dorse that cre­dence even less than most just be­cause I feel like I don’t know what that world looks like. Whereas on the other ones I at least have a plau­si­ble world in my head.

Asya Ber­gal: Yeah, that makes sense. I think you’ve men­tioned, in a slow take­off sce­nario that… Some peo­ple would dis­agree that in a world where you no­tice some­thing was wrong, you wouldn’t just hack around it, and keep go­ing.

Asya Ber­gal: I have a sug­ges­tion which it feels like maybe is a differ­ence and I’m very cu­ri­ous for your take on whether that seems right or seems wrong. It seems like peo­ple be­lieve there’s go­ing to be some kind of pres­sure for perfor­mance or com­pet­i­tive­ness that pushes peo­ple to try to make more pow­er­ful AI in spite of safety failures. Does that seem un­true to you or like you’re un­sure about it?

Ro­hin Shah: It seems some­what un­true to me. I re­cently made a com­ment about this on the Align­ment Fo­rum. Peo­ple make this anal­ogy be­tween AI x-risk and risk of nu­clear war, on mu­tu­ally as­sured de­struc­tion. That par­tic­u­lar anal­ogy seems off to me be­cause with nu­clear war, you need the threat of be­ing able to hurt the other side whereas with AI x-risk, if the de­struc­tion hap­pens, that af­fects you too. So there’s no mu­tu­ally as­sured de­struc­tion type dy­namic.

You could imag­ine a situ­a­tion where for some rea­son the US and China are like, “Who­ever gets to AGI first just wins the uni­verse.” And I think in that sce­nario maybe I’m a bit wor­ried, but even then, it seems like ex­tinc­tion is just worse, and as a re­sult, you get sig­nifi­cantly less risky be­hav­ior? But I don’t think you get to the point where peo­ple are just liter­ally rac­ing ahead with no thought to safety for the sake of win­ning.

I also don’t think that you would… I don’t think that differ­ences in who gets to AGI first are go­ing to lead to you win the uni­verse or not. I think it leads to pretty con­tin­u­ous changes in power bal­ance be­tween the two.

I also don’t think there’s a dis­crete point at which you can say, “I’ve won the race.” I think it’s just like ca­pa­bil­ities keep im­prov­ing and you can have more ca­pa­bil­ities than the other guy, but at no point can you say, “Now I have won the race.” I sup­pose if you could get a de­ci­sive strate­gic ad­van­tage, then you could do it. And that has noth­ing to do with what your AI ca­pa­bil­ity… If you’ve got a de­ci­sive strate­gic ad­van­tage that could hap­pen.

I would be sur­prised if the first hu­man-level AI al­lowed you to get any­thing close to a de­ci­sive strate­gic ad­van­tage. Maybe when you’re at 1000X hu­man level AI, per­haps. Maybe not a thou­sand. I don’t know. Given slow take­off, I’d be sur­prised if you could know­ably be like, “Oh yes, if I de­velop this piece of tech­nol­ogy faster than my op­po­nent, I will get a de­ci­sive strate­gic ad­van­tage.”

Asya Ber­gal: That makes sense. We dis­cussed a lot of cruxes you have. Do you feel like there’s ev­i­dence that you already have pre-com­puted that you think could move you in one di­rec­tion or an­other on this? Ob­vi­ously, if you’ve got ev­i­dence that X was true, that would move you, but are there con­crete things where you’re like, “I’m in­ter­ested to see how this will turn out, and that will af­fect my views on the thing?”

Ro­hin Shah: So I think I men­tioned the… On the ques­tion of timelines, they are like the… How much did evolu­tion ac­tu­ally bake in to hu­mans? It seems like a ques­tion that could put… I don’t know if it could be an­swered, but maybe you could an­swer that one. That would af­fect it… I lean on the side of not re­ally, but it’s pos­si­ble that the an­swer is yes, ac­tu­ally quite a lot. If that was true, I just lengthen my timelines ba­si­cally.

Sara Hax­hia: Can you also ex­plain how this would change your be­hav­ior with re­spect to what re­search you’re do­ing, or would it not change that at all?

Ro­hin Shah: That’s a good ques­tion. I think I would have to think about that one for longer than two min­utes.

As back­ground on that, a lot of my cur­rent re­search is more try­ing to get AI re­searchers to be think­ing about what hap­pens when you de­ploy, when you have AI sys­tems work­ing with hu­mans, as op­posed to solv­ing al­ign­ment. Mostly be­cause I for a while couldn’t see re­search that felt use­ful to me for solv­ing al­ign­ment. I think I’m now see­ing more things that I can do that seem more rele­vant and I will prob­a­bly switch to do­ing them pos­si­bly af­ter grad­u­at­ing be­cause the­sis, and need­ing to grad­u­ate, and stuff like that.

Ro­hin Shah: Yes, but you were ask­ing ev­i­dence that would change my mind-

Asya Ber­gal: I think it’s also rea­son­able to be not sure ex­actly about con­crete things. I don’t have a good an­swer to this ques­tion off the top of my head.

Ro­hin Shah: It’s worth at least think­ing about for a cou­ple of min­utes. I think I could imag­ine get­ting more in­for­ma­tion from ei­ther his­tor­i­cal case stud­ies of how peo­ple have dealt with new tech­nolo­gies, or analy­ses of how AI re­searchers cur­rently think about things or deal with stuff, could change my mind about whether I think the AI com­mu­nity would by de­fault han­dle prob­lems that arise, which feels like an im­por­tant crux be­tween me and oth­ers.

I think cur­rently my sense is if the like… You asked me this, I never an­swered it. If the AI safety field just sort of van­ished, but the work we’ve done so far re­mained and con­scien­tious AI re­searchers re­mained, or peo­ple who are already AI re­searchers and already do­ing this sort of stuff with­out be­ing in­fluenced by EA or ra­tio­nal­ity, then I think we’re still fine be­cause peo­ple will no­tice failures and cor­rect them.

I did an­swer that ques­tion. I said some­thing like 90%. This was a sce­nario I was say­ing 90% for. And yeah, that one feels like a thing that I could get ev­i­dence on that would change my mind.

I can’t re­ally imag­ine what would cause me to be­lieve that AI sys­tems will ac­tu­ally do a treach­er­ous turn with­out ever try­ing to de­ceive us be­fore that. But there might be some­thing there. I don’t re­ally know what ev­i­dence would move me, any sort of plau­si­ble ev­i­dence I could see that would move me in that di­rec­tion.

Slow takeoff versus fast takeoff… I feel like MIRI still apparently believes in fast takeoff. I don’t have a clear picture of these reasons; I expect those reasons would move me towards fast takeoff.

Oh, on the ex­pected util­ity max or the… my per­cep­tion of MIRI, or of Eliezer and also maybe MIRI, is that they have this po­si­tion that any AI sys­tem, any suffi­ciently pow­er­ful AI sys­tem, will look to us like an ex­pected util­ity max­i­mizer, there­fore con­ver­gent in­stru­men­tal sub-goals and so on. I don’t buy this. I wrote a post ex­plain­ing why I don’t buy this.

Yeah, there’s a lot of just like… MIRI could say their reasons for believing things and that would probably cause me to update. Actually, I have enough disagreements with MIRI that they may not update me, but it could in theory update me.

Asya Ber­gal: Yeah, that’s right. What are some dis­agree­ments you have with MIRI?

Ro­hin Shah: Well, the ones I just men­tioned. There is this great post from maybe not a year ago, but in 2018, called ‘Real­ism about Ra­tion­al­ity’, which is ba­si­cally this per­spec­tive that there is the one true learn­ing al­gorithm or the one cor­rect way of do­ing ex­plo­ra­tion, or just, there is a pla­tonic ideal of in­tel­li­gence. We could in prin­ci­ple find it, code it up, and then we would have this ex­tremely good AI al­gorithm.

Then there is like, to the ex­tent that this was a dis­agree­ment back in 2008, Robin Han­son would have been on the other side say­ing, “No, in­tel­li­gence is just like a broad… just like con­glomer­ate of a bunch of differ­ent heuris­tics that are all task spe­cific, and you can’t just take one and ap­ply it on the other space. It is just messy and com­pli­cated and doesn’t have a nice crisp for­mal­iza­tion.”

And, I fall not ex­actly on Robin Han­son’s side, but much more on Robin Han­son’s side than the ‘ra­tio­nal­ity is a real for­mal­iz­able nat­u­ral thing in the world’.

Sara Hax­hia: Do you have any idea where the cruxes of dis­agree­ment are at all?

Ro­hin Shah: No, that one has proved very difficult to…

Robert Long: I think that’s an AI Im­pacts pro­ject, or like a dis­ser­ta­tion or some­thing. I feel like there’s just this gen­eral do­main speci­fic­ity de­bate, how gen­eral is ra­tio­nal­ity de­bate…

I think there are these very cru­cial con­sid­er­a­tions about the na­ture of in­tel­li­gence and how do­main spe­cific it is and they were an is­sue be­tween Robin and Eliezer and no one… It’s hard to know what ev­i­dence, what the ev­i­dence is in this case.

Ro­hin Shah: Yeah. But I ba­si­cally agree with this and that it feels like a very deep dis­agree­ment that I have never had any suc­cess in com­ing to a re­s­olu­tion to, and I read ar­gu­ments by peo­ple who be­lieve this and I’m like, “No.”

Sara Hax­hia: Have you spo­ken to peo­ple?

Ro­hin Shah: I have spo­ken to peo­ple at CHAI, I don’t know that they would re­ally be on board this train. Hold on, Daniel prob­a­bly would be. And that hasn’t helped that much. Yeah. This dis­agree­ment feels like one where I would pre­dict that con­ver­sa­tions are not go­ing to help very much.

Robert Long: So, the gen­eral ques­tion here was dis­agree­ments with MIRI, and then there’s… And you’ve men­tioned fast take­off and maybe re­lat­edly, the Yud­kowsky-Han­son–

Ro­hin Shah: Real­ism about Ra­tion­al­ity is how I’d phrase it. There’s also the– are AI re­searchers con­scien­tious? Well, ac­tu­ally I don’t know that they would say they are not con­scien­tious. Maybe they’d say they’re not pay­ing at­ten­tion or they have mo­ti­vated rea­son­ing for ig­nor­ing the is­sues… lots of things like that.

Robert Long: And this is­sue of do ad­vanced in­tel­li­gences look enough like EU max­i­miz­ers…

Ro­hin Shah: Oh, yes. That one too. Yeah, sorry. That’s one of the ma­jor ones. Not sure how I for­got that.

Robert Long: I re­mem­ber it be­cause I’m writ­ing it all down, so… again, you’ve been talk­ing about very com­pli­cated things.

Rohin Shah: Yeah. Related to the Realism about Rationality point is the use of formalism and proof. Not formalism, but proof at least. I don’t know that MIRI actually believes that what we need to do is write a bunch of proofs about our AI system, but it sure sounds like it, and that seems like a too difficult, and basically impossible, task to me, if the proofs that we’re trying to write are about alignment or beneficialness or something like that.

They also seem to… No, maybe all the other dis­agree­ments can be traced back to these dis­agree­ments. I’m not sure.