Rohin Shah: What’s been happening in AI alignment?

While we haven’t yet built al­igned AI, the field of al­ign­ment has steadily gained ground in the past few years, pro­duc­ing many use­ful out­puts. In this talk, Ro­hin Shah, a sixth-year PhD stu­dent at UC Berkeley’s Cen­ter for Hu­man-Com­pat­i­ble AI (CHAI), sur­veys con­cep­tual progress in AI al­ign­ment over the last two years.

While Ro­hin started his PhD work­ing on pro­gram syn­the­sis, he be­came con­vinced that it was im­por­tant to build safe, al­igned AI, and so moved to CHAI at the start of his fourth year. He now thinks about how to provide speci­fi­ca­tions of good be­hav­ior in ways other than re­ward func­tions. He is best known for the Align­ment Newslet­ter, a pop­u­lar weekly pub­li­ca­tion with con­tent rele­vant to AI al­ign­ment.

Below is a tran­script of Ro­hin’s talk, which we’ve lightly ed­ited for clar­ity. You can also watch it on YouTube and read it on effec­tivealtru­

The Talk

Hi, ev­ery­one. My name is Ro­hin Shah. I’m a sixth-year PhD stu­dent at the Cen­ter for Hu­man-Com­pat­i­ble AI at UC Berkeley. My re­search is gen­er­ally on what hap­pens when you try to do deep re­in­force­ment learn­ing in en­vi­ron­ments that in­volve hu­mans. More broadly, I work on tech­ni­cal AI safety. I also write the Align­ment Newslet­ter.

To­day, I’ll cover what’s been hap­pen­ing in AI al­ign­ment. I should warn you: While this talk doesn’t as­sume any tech­ni­cal knowl­edge of AI, it does as­sume ba­sic fa­mil­iar­ity with the ar­gu­ments for AI risk.

I’ll be sur­vey­ing a broad swath of work rather than fo­cus­ing on my per­sonal in­ter­ests. I’m hop­ing that this will help you figure out which parts of AI al­ign­ment you find ex­cit­ing and would like to delve into more deeply.

A lot of the talk is based on a liter­a­ture re­view I wrote a few months ago. You can find refer­ences and de­tails in that re­view.


With that, let’s get started. Tak­ing a high-level, out­side view, the rea­son that most peo­ple work on AI safety is that pow­er­ful AI sys­tems are go­ing to be a big deal. They’re go­ing to rad­i­cally trans­form the world that we live in. There­fore, we should prob­a­bly put some effort into mak­ing sure that this trans­for­ma­tion goes well.

In particular, if AI systems are smarter than we are, then they could become the dominant force on the planet, which could be bad for us, in the same way that gorillas probably aren’t [thrilled] about how we have taken over all of their habitats. This doesn’t necessarily mean that [AI will create] an x-risk [existential risk]. It just means that we should have a sound technical reason to expect that the powerful AI systems we build are actually beneficial for us. And I would argue that we currently do not have such a reason. Therefore, the case for working on AI alignment is that we really should be creating this reason.

I want to note that there’s a lot of dis­agree­ment over spe­cific sub-ques­tions in AI safety. That will be­come more ev­i­dent over the rest of this talk. But my im­pres­sion is that vir­tu­ally ev­ery­one in the field agrees with the ba­sic, high-level ar­gu­ment [that we should have a good rea­son for ex­pect­ing AI sys­tems to be benefi­cial].

What are the spe­cific risks we’re wor­ried about with AI? One is­sue is that hu­mans aren’t ready to deal with the im­pacts of AI. Peo­ple tend to be in con­flict a lot, and the US-China re­la­tion­ship is a big con­cern [in the AI com­mu­nity]. AI will en­able bet­ter and bet­ter ways of fight­ing. That seems pretty bad. Maybe our fights will lead to big­ger and big­ger im­pacts; at some point, that could re­sult in ex­tinc­tion-level events. Or per­haps AI leads to tech­nolog­i­cal progress at such a fast pace that we’re un­able to [ad­just]. As a re­sult, we could lock in some sub­op­ti­mal val­ues [that AI would act on for the rest of hu­man­ity’s fu­ture]. In both of these sce­nar­ios, the AI sys­tem wouldn’t in­ten­tion­ally cause x-risk, but it nonethe­less would hap­pen.

I’m not go­ing to fo­cus too much on this, but will note that some peo­ple are talk­ing about prefer­ence ag­gre­ga­tion. This is the idea that the AI sys­tem ag­gre­gates prefer­ences across all stake­hold­ers and does its thing — and then ev­ery­one agrees not to [op­pose] the re­sults. Similarly, we could try to [ar­rive at a] bet­ter metaphilos­o­phy to avoid prob­lems like value lock-in.

Another out­side view that peo­ple take, aside from “AI is pow­er­ful and a big deal,” is that op­ti­miza­tion leads to ex­treme out­comes. To take a very sim­ple ex­am­ple, men in the US are, on av­er­age, about five feet, 10 inches tall. But very few bas­ket­ball play­ers, who are se­lected for height, are five feet, 10 inches. Most are well over six feet. When you se­lect for some­thing and have op­ti­miza­tion pres­sure, you tend to get ex­treme out­comes. And pow­er­ful AI sys­tems are go­ing to be pow­er­ful op­ti­miz­ers. As a re­sult, we prob­a­bly shouldn’t ex­pect our ev­ery­day rea­son­ing to prop­erly ac­count for what these op­ti­miz­ers will do.
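As a rough illustration of the selection effect described above (not part of the talk), the sketch below draws heights from an assumed normal distribution and then keeps only the tallest few. The mean of 70 inches comes from the example above; the standard deviation, population size, and team size are made-up numbers chosen for illustration.

```python
import random
import statistics

def simulate_selection(population_size=100_000, team_size=500, seed=0):
    """Compare the average height of a population with the average of its tallest members."""
    rng = random.Random(seed)
    # Assumed toy distribution: mean 70 inches (5'10"), standard deviation 3 inches.
    heights = [rng.gauss(70.0, 3.0) for _ in range(population_size)]
    # "Optimization pressure": keep only the tallest individuals.
    selected = sorted(heights, reverse=True)[:team_size]
    return statistics.mean(heights), statistics.mean(selected)

population_mean, selected_mean = simulate_selection()
print(f"population mean: {population_mean:.1f} in")
print(f"selected mean:   {selected_mean:.1f} in")
```

Even though nothing about the underlying distribution changes, the selected group sits far out in the tail, which is the sense in which optimization produces outcomes that everyday averages do not predict.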

Therefore, we need to [cultivate] more of a security mindset and look for arguments that quantify over every possibility, as opposed to the average possibility. This mindset inspires researchers, especially at MIRI [the Machine Intelligence Research Institute], to try to understand how intelligence really works, so that we can then make well-designed AI systems that we understand. This has led to research on embedded agency, partial agency, and abstraction.

A bit about em­bed­ded agency: This is one of MIRI’s main re­search pro­grams. The ba­sic idea is that, ac­cord­ing to the stan­dard model of re­in­force­ment learn­ing and [our un­der­stand­ing of] AI more gen­er­ally, an en­vi­ron­ment takes in ac­tions and pro­duces [ob­serv­able phe­nom­ena] and re­wards. Then, com­pletely sep­a­rate from the en­vi­ron­ment, an agent [ob­serves these phe­nom­ena] and takes ac­tions as a re­sult. But that’s not how agents work. I’m an agent, yet I am not sep­a­rate from the en­vi­ron­ment; I am a part of it. This leads to many philo­soph­i­cal prob­lems. I would love to go into more de­tail, but don’t have too much time. There’s a great se­quence on the AI Align­ment Fo­rum that I strongly recom­mend.
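For readers unfamiliar with the standard model mentioned above, here is a minimal sketch of it (not part of the talk). The `Environment` and `Agent` classes and the parity task are hypothetical and chosen only to show the structure: the environment takes in actions and returns observations and rewards, while the agent is a completely separate object.

```python
class Environment:
    """Toy environment, separate from the agent: it holds a counter as its state."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Reward the agent for reporting the parity of the current state.
        reward = 1.0 if action == self.state % 2 else 0.0
        self.state += 1
        observation = self.state
        return observation, reward


class Agent:
    """Toy agent: maps the latest observation to an action."""

    def act(self, observation):
        return observation % 2


def run_episode(steps=5):
    env, agent = Environment(), Agent()
    observation, total_reward = env.state, 0.0
    for _ in range(steps):
        action = agent.act(observation)
        observation, reward = env.step(action)
        total_reward += reward
    return total_reward


total = run_episode()
```

Note that nothing inside `Environment` refers to the agent's code or memory. Embedded agency asks what happens when that clean boundary does not exist.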


The next prob­lem I want to talk about is one that I call “the speci­fi­ca­tion prob­lem.” It’s also called “outer al­ign­ment.” Ba­si­cally, the way we build AI sys­tems right now is by as­sum­ing that we have some in­fal­lible speci­fi­ca­tion of the op­ti­mal be­hav­ior in all pos­si­ble situ­a­tions, as though it were handed down to us from God. Then, we must figure out how to meet that speci­fi­ca­tion. But of course, we can never ac­tu­ally get such a speci­fi­ca­tion. The clas­sic pa­per­clip max­i­mizer thought ex­per­i­ment shows that it’s quite hard to spec­ify the be­hav­ior of an AI mak­ing pa­per­clips in a rea­son­able and sane way. This is also the main prob­lem that Stu­art Rus­sell dis­cusses in his book Hu­man Com­pat­i­ble. Or­ga­ni­za­tions [whose work in­cludes ad­dress­ing] this speci­fi­ca­tion prob­lem in­clude CHAI, OpenAI, Deep­Mind, and Ought.

The main pro­posed way of solv­ing the speci­fi­ca­tion prob­lem is to do some form of value learn­ing. One thing I want to note: Value doesn’t nec­es­sar­ily mean “nor­ma­tive value.” You don’t nec­es­sar­ily need to be think­ing about pop­u­la­tion ethics. For ex­am­ple, a robot that learned how to clean your room, and then re­li­ably did so, would count as [an ex­am­ple of] value learn­ing. Maybe we should be call­ing it “speci­fi­ca­tion learn­ing,” but value learn­ing seems to be the name that has stuck.

The types of value learn­ing in­clude CIRL (or “as­sis­tance games”). CIRL stands for “co­op­er­a­tive in­verse re­in­force­ment learn­ing.” This is a par­tic­u­lar for­mal­iza­tion of how you could ap­proach value learn­ing, in which the world con­tains a sin­gle hu­man who knows the re­ward func­tion — the true speci­fi­ca­tion — but, for some rea­son, can’t com­mu­ni­cate that ex­plic­itly to the agent. There is also an agent whose goal is to in­fer what the hu­man’s speci­fi­ca­tion is, and then op­ti­mize for it. And be­cause the agent no longer has a definite speci­fi­ca­tion that it’s try­ing to op­ti­mize, and it’s in­stead un­cer­tain over what it’s try­ing to op­ti­mize, this re­sults in many nice prop­er­ties.

For example, the agent might ask you about what you want; it may try to clarify what your preferences are. If you try to shut it down, it will reason that it must have been doing a poor job of helping you. Therefore, it’s going to allow you to shut it down, unlike a classic expected utility maximizer, which will say, “No, I’m not going to shut down, because if I am shut down, then I can’t achieve my goal.”
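The shutdown reasoning above can be made concrete with a small worked example (not part of the talk). This is a simplified toy model under strong assumptions: the agent is uncertain about how much the human values its planned action, and the human is assumed to know the true value and to allow the action only when it is beneficial. The function name and the numbers are hypothetical.

```python
def expected_values(beliefs):
    """Expected value of three options for an agent that is uncertain about utility.

    beliefs: list of (probability, utility) pairs describing the agent's
    uncertainty about how much the human values its planned action.
    """
    # Option 1: act without consulting the human.
    act = sum(p * u for p, u in beliefs)
    # Option 2: switch itself off (utility 0 by convention).
    shut_down = 0.0
    # Option 3: defer. Assumption: the human knows the true utility and lets the
    # action go ahead only when it is positive; otherwise they shut the agent down.
    defer = sum(p * max(u, 0.0) for p, u in beliefs)
    return {"act": act, "shut_down": shut_down, "defer": defer}


# The agent thinks the action is probably good (+1) but might be quite bad (-2).
beliefs = [(0.6, 1.0), (0.4, -2.0)]
values = expected_values(beliefs)
print(values)
```

Under these assumptions, deferring is never worse than acting unilaterally or shutting down, because the human's veto removes exactly the cases where the action would have been harmful. An agent with a single fixed objective has no such uncertainty and therefore no reason to defer.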

The unfortunate thing about assistance games is that they are [exceptionally] computationally intractable. It’s very expensive to solve a CIRL game. In addition, it requires a good model of how human preferences relate to human behavior, which, as many of the social sciences show, is a very difficult problem. And there is a theorem that says this is impossible to do in the super-general case. Although, of course, we don’t actually need the super-general case; we only need the case that applies in the real world. Instead of being impossible, [the real-world case] is merely very, very difficult.

Next, we have [strate­gies based on agents] learn­ing hu­man in­tent. This is a broad cat­e­gory of pos­si­ble com­mu­ni­ca­tion pro­to­cols that a hu­man could use to com­mu­ni­cate the speci­fi­ca­tion to the agent. So per­haps a hu­man could demon­strate the op­ti­mal be­hav­ior to the agent, and from that, the agent could learn what it’s sup­posed to do. (This is the idea be­hind in­verse re­in­force­ment learn­ing and imi­ta­tion learn­ing.) Alter­na­tively, per­haps the hu­man could eval­u­ate pro­posed hy­po­thet­i­cal be­hav­iors that the agent might ex­e­cute, and then the agent could rea­son out what it should be do­ing.
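As one hedged sketch of learning from demonstrations (not part of the talk), the code below does a Bayesian update over two candidate reward functions. It assumes a particular model of the human, in which actions are chosen with probability proportional to `exp(beta * reward)`; that model, the `beta` value, and the candidate names are all assumptions made for illustration.

```python
import math

def posterior_over_rewards(candidates, prior, demonstrations, beta=2.0):
    """Bayesian update over candidate reward functions given observed human actions.

    candidates: dict mapping a name to a dict of action -> reward
    prior: dict mapping the same names to prior probabilities
    demonstrations: list of actions the human was observed to choose
    beta: assumed rationality parameter (higher means closer to optimal)
    """
    posterior = dict(prior)
    for action in demonstrations:
        for name, rewards in candidates.items():
            # Assumed human model: P(action) is proportional to exp(beta * reward).
            normalizer = sum(math.exp(beta * r) for r in rewards.values())
            posterior[name] *= math.exp(beta * rewards[action]) / normalizer
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}


candidates = {
    "likes_tidy_room": {"tidy": 1.0, "leave_mess": 0.0},
    "likes_mess": {"tidy": 0.0, "leave_mess": 1.0},
}
prior = {"likes_tidy_room": 0.5, "likes_mess": 0.5}
posterior = posterior_over_rewards(candidates, prior, ["tidy", "tidy", "tidy"])
print(posterior)
```

After three tidying demonstrations, almost all of the probability mass moves to the hypothesis that the human wants a tidy room. The result depends heavily on the assumed human model, which is the difficulty raised earlier about relating preferences to behavior.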

Now we come to in­tent al­ign­ment, or “cor­rigi­bil­ity.” This is some­what differ­ent. While the pre­vi­ous ap­proaches try to spec­ify an al­gorithm that learns val­ues, with in­tent al­ign­ment we in­stead build an agent that tries to do what we want it to do. Put an­other way, we’re try­ing to bake into the agent the mo­ti­va­tion to be helpful to us. Then, if we have an agent [whose sole mo­ti­va­tion] is to be helpful to [a hu­man], that will nat­u­rally mo­ti­vate it to do many other things that we want. For ex­am­ple, it’s go­ing to try to clar­ify what my [travel] prefer­ences are in the same way that a good per­sonal as­sis­tant would, so that it doesn’t have to bother me when I ask it to book me a flight.

That cov­ers a broad spec­trum of ap­proaches to value learn­ing. How­ever, there are still a few prob­lems that arise. In­tu­itively, one big one is that, since the agent is learn­ing from our feed­back, it’s not go­ing to be able to do bet­ter than we can; it won’t be able to scale to su­per­hu­man perfor­mance. If we demon­strate the task to the agent, it won’t be able to perform the task any bet­ter than we could, be­cause it’s re­ceiv­ing no in­for­ma­tion on how to [go about that]. Similarly, if we’re eval­u­at­ing the agent’s be­hav­ior, it won’t be able to find good be­hav­iors that we wouldn’t rec­og­nize as good.

An ex­am­ple is AlphaGo’s move 37 [in its match against Go cham­pion Lee Sedol]. That was a fa­mous move that AlphaGo made, which no hu­man ever would have made. It seemed crazy. I think it was as­signed a less than one-in-10,000 chance of suc­ceed­ing, and yet that move ended up be­ing cru­cial to AlphaGo’s suc­cess. And why could AlphaGo do this? Be­cause AlphaGo wasn’t rely­ing on our abil­ity to de­ter­mine whether a par­tic­u­lar move was good. AlphaGo was just rely­ing on a re­ward func­tion to tell it when it had won and when it had lost, and that was a perfect speci­fi­ca­tion of what counts as win­ning or los­ing in Go. So ideally, we would like to build su­per­in­tel­li­gent AI sys­tems that can ac­tu­ally ex­ceed hu­man perfor­mance at tasks, but it’s not clear how we do this with value learn­ing.

The key idea that allows current approaches to get around this is: Our AI systems are never going to exceed the supervision that we give them, but maybe we can train our AI systems to approximate what we would do if we had an extremely long time to think. Imagine I had 1,000 years to think about what the best thing to do was in a certain scenario, and then I shared that with an AI system, and then the AI system properly approximated my suggestion, but could do so in a few minutes as opposed to 1,000 years. That would presumably be a superintelligent AI.

The de­tails for how we take this in­sight and ar­rive at an al­gorithm so that we can try it soon — not in 1,000 years — are a bit in­volved. I’m not go­ing to go into them. But the tech­niques to look for are iter­ated am­plifi­ca­tion, de­bate, and re­cur­sive re­ward mod­el­ing.

Another prob­lem with value learn­ing is the in­formed over­sight prob­lem: Even if we’re smarter than the agent that we’re train­ing, we won’t be able to effec­tively su­per­vise it in the event that we don’t un­der­stand why it chose a cer­tain ac­tion. The clas­sic ex­am­ple is an agent tasked to write a new novel. Per­haps it has ac­cess to a library where it’s sup­posed to learn about how to write books, and it can use this in or­der to write the novel, but the novel is sup­posed to be new; [the task re­quires more than] just mem­o­riz­ing a novel from the library and spit­ting it back out again. It’s pos­si­ble that the agent will look at five books in the library, pla­gia­rize chunks from all of them, and put those to­gether into a book that reads very nicely to us, but doesn’t re­ally solve the task be­cause [the novel is un­o­rigi­nal]. How are we sup­posed to tell the agent that this was bad? In or­der to catch the agent look­ing at the five books and steal­ing sen­tences from them, we’d have to read the en­tire library — thou­sands of books — and search for ev­i­dence of pla­gia­rism. This seems too ex­pen­sive for over­sight.

So, it may be significantly more costly for us to provide oversight than it is for the agent to take actions if we cannot see how the agent is taking those actions. The key to solving this is almost obvious. It’s simply to make sure you know how the agent is taking its actions. Again, there are many details on exactly how we think about this, but the term to look for is “ascription universality.” Essentially, this means that the supervisor knows everything that the agent knows, including any facts about how the agent chose its output.

[In the novel-writ­ing ex­am­ple], if we were as­crip­tion-uni­ver­sal with re­spect to the agent, then we would know that it had taken sen­tences from five books, be­cause the agent knows that. And if we knew that, then we could ap­pro­pri­ately an­a­lyze it and tell it not to pla­gia­rize in the fu­ture.

How do we cre­ate this prop­erty? Sadly, I’m not go­ing to tell you, be­cause again, I have limited time. But there’s a great set of blog posts and a sum­mary in the Align­ment Newslet­ter, and all of those items are in my liter­a­ture re­view. Really, I just want you to read that link; I put a lot of work into it, and I think it’s good.


Let’s move on to an­other top-level prob­lem: the prob­lem of mesa op­ti­miza­tion. I’m go­ing to illus­trate mesa op­ti­miza­tion with a non-AI ex­am­ple. Sup­pose you’re search­ing for a Python pro­gram that plays tic-tac-toe well. Ini­tially you find some pro­grams that have good heuris­tics. Maybe you find a pro­gram that always starts at the cen­ter square, and that one tends to win a lit­tle more of­ten than the oth­ers. Later, you find a pro­gram that makes sure that any­time it has two spots in a row and the third spot is empty, it plays in that third spot and wins. One that does that in a sin­gle step starts to win a bit more.

Even­tu­ally, you come across the min­i­max al­gorithm, which plays op­ti­mally by search­ing for the best ac­tion to take in ev­ery situ­a­tion. What hap­pened here was that in your search for op­ti­mal Python pro­grams, you ended up find­ing a pro­gram that was it­self an op­ti­mizer that searched pos­si­ble moves in tic tac toe.

This is mesa op­ti­miza­tion. You have a base [or “outer”] op­ti­mizer — in this case, the search over Python pro­grams — and in the course of run­ning that base op­ti­mizer, you find a new op­ti­mizer, which in this case is the min­i­max al­gorithm.
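For concreteness, here is a minimal sketch of the kind of program the search in this example might eventually find: a minimax player for tic-tac-toe (not part of the talk). The board encoding as a nine-character string and the function names are choices made for this illustration.

```python
from functools import lru_cache

# Index triples for the three rows, three columns, and two diagonals.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (value, move) with `player` to move, assuming optimal play.

    The value is from X's point of view: +1 if X wins, -1 if O wins, 0 for a draw.
    """
    won = winner(board)
    if won is not None:
        return (1 if won == "X" else -1), None
    if " " not in board:
        return 0, None
    best_value, best_move = None, None
    for i, cell in enumerate(board):
        if cell != " ":
            continue
        child = board[:i] + player + board[i + 1:]
        value, _ = minimax(child, "O" if player == "X" else "X")
        # X maximizes the value; O minimizes it.
        better = (best_value is None
                  or (player == "X" and value > best_value)
                  or (player == "O" and value < best_value))
        if better:
            best_value, best_move = value, i
    return best_value, best_move

value, move = minimax(" " * 9, "X")
print(value, move)
```

The point of the example is that `minimax` is itself a search procedure. A search over programs that rewards good play can therefore land on a program that runs its own internal search.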

Why is this weird ex­am­ple about pro­grams rele­vant to AI? Well, of­ten we think about AI sys­tems that are trained us­ing gra­di­ent de­scent. And gra­di­ent de­scent is an op­ti­miza­tion al­gorithm that searches over the space of neu­ral net pa­ram­e­ters to find some set of pa­ram­e­ters that performs well on a loss func­tion.
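As a minimal sketch of gradient descent as a search procedure (not part of the talk), the code below minimizes a one-parameter quadratic loss. A real neural network has many parameters and a far more complicated loss; the loss function, learning rate, and step count here are assumptions chosen to keep the example tiny.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Assumed toy loss: loss(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta)
```

The procedure only ever sees how the loss changes near the current parameters. It selects for parameters that score well on the loss, without any direct say over what computation those parameters end up implementing.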

Let’s say that gra­di­ent de­scent is the outer op­ti­mizer. It seems plau­si­ble that mesa op­ti­miza­tion could hap­pen even with gra­di­ent de­scent, where gra­di­ent de­scent finds an in­stan­ti­a­tion of the neu­ral net pa­ram­e­ters, such that then the neu­ral net it­self, when it runs, performs some sort of op­ti­miza­tion. Then the neu­ral net would be a mesa op­ti­mizer that is op­ti­miz­ing some ob­jec­tive, which we would call the mesa ob­jec­tive. And while we know that the mesa ob­jec­tive should lead to similar be­hav­ior as the origi­nal ob­jec­tive on the train­ing dis­tri­bu­tion, be­cause that’s what it was se­lected to do, it may be ar­bi­trar­ily differ­ent [out­side] the train­ing dis­tri­bu­tion. For ex­am­ple, if you trained it on tic tac toe, then you know it’s go­ing to win at tic tac toe — but if you switch to Con­nect Four, it might do some­thing crazy. Maybe in Con­nect Four, it will con­tinue to look for three in a row in­stead of four in a row, and there­fore it will lose badly at Con­nect Four, even though it was work­ing well with tic tac toe.
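The tic-tac-toe and Connect Four example above can be sketched in a few lines (not part of the talk). The two objective functions are hypothetical stand-ins: `base_objective` is what we actually want, and `mesa_objective` is an assumed learned objective that happens to agree with it on the training game.

```python
def has_run(cells, length, mark="X"):
    """True if `cells` contains `length` consecutive copies of `mark`."""
    run = 0
    for cell in cells:
        run = run + 1 if cell == mark else 0
        if run >= length:
            return True
    return False

def base_objective(cells, game):
    """What we actually want: the run length that counts as a win in this game."""
    needed = {"tic-tac-toe": 3, "connect-four": 4}[game]
    return has_run(cells, needed)

def mesa_objective(cells, game):
    """Hypothetical learned objective: always look for three in a row."""
    return has_run(cells, 3)

training_row = "XXX"          # a tic-tac-toe row (training distribution)
shifted_row = "XXX.OOO"       # a Connect Four row (outside the training distribution)

print(base_objective(training_row, "tic-tac-toe"), mesa_objective(training_row, "tic-tac-toe"))
print(base_objective(shifted_row, "connect-four"), mesa_objective(shifted_row, "connect-four"))
```

On the training game the two objectives are indistinguishable, so selecting on training performance cannot tell them apart. They only come apart once the game changes.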

Let’s say that this hap­pened with gra­di­ent de­scent, and that we had a very pow­er­ful, in­tel­li­gent neu­ral net. Even if we had solved the speci­fi­ca­tion prob­lem, and had the ideal re­ward func­tion to train this agent, it might be that the neu­ral net model that we come up with op­ti­mizes for a differ­ent ob­jec­tive, which may once again be mis­al­igned with what we want. The outer-in­ner dis­tinc­tion is why the speci­fi­ca­tion prob­lem is called “outer al­ign­ment,” and why mesa op­ti­miza­tion is called “in­ner al­ign­ment.”

How do people solve mesa optimization? There’s one main proposal: adversarial training. The basic idea is that in addition to training an AI system that’s trying to perform well on your specifications, you also have an adversary (an AI system or AI-human team) that’s trying to find situations in which the agent you’re training would perform badly, or would optimize for something other than the specification.

In the case where you’re try­ing to get a cor­rigible AI sys­tem, maybe your ad­ver­sary is look­ing for situ­a­tions in which the AI sys­tem ma­nipu­lates you or de­ceives you into think­ing some­thing is true, when it is ac­tu­ally false. Then, if you can find all of those situ­a­tions and pe­nal­ize the agent for them, the agent will stop be­hav­ing badly. You’ll have an agent that ro­bustly does the right thing across all set­tings. Ver­ifi­ca­tion would [in­volve us­ing] that agent to ver­ify an­other prop­erty that you care about.
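The adversarial training loop described above can be sketched with a deliberately tiny toy problem (not part of the talk). Everything here is an assumption for illustration: the "agent" is a one-dimensional threshold classifier, the specification is directly checkable through a hypothetical `spec` function, and the adversary can search every candidate input. None of those conveniences hold for real systems.

```python
def spec(x):
    """Intended behavior (assumed to be checkable here): accept x >= 10."""
    return x >= 10

def fit_threshold(examples):
    """Fit a 1-D classifier that accepts x >= threshold, placing the threshold
    midway between the largest rejected and the smallest accepted example."""
    accepted = [x for x, label in examples if label]
    rejected = [x for x, label in examples if not label]
    return (max(rejected) + min(accepted)) / 2

def adversary(threshold, candidates):
    """Search for inputs where the classifier disagrees with the spec."""
    return [x for x in candidates if (x >= threshold) != spec(x)]

def adversarial_training(examples, candidates, max_rounds=10):
    examples = list(examples)
    threshold = fit_threshold(examples)
    for _ in range(max_rounds):
        failures = adversary(threshold, candidates)
        if not failures:
            break
        # Add the adversary's findings, labeled by the spec, and refit.
        examples.extend((x, spec(x)) for x in failures)
        threshold = fit_threshold(examples)
    return threshold

initial = fit_threshold([(0, False), (30, True)])
final = adversarial_training([(0, False), (30, True)], candidates=range(0, 31))
print(initial, final)
```

With only two training examples the classifier behaves correctly on them but wrongly on inputs in between. Each round, the adversary surfaces those failures and training removes them, until the adversary can no longer find any within its search space.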

Ideally, we would like to say, “I have for­mally ver­ified that the agent is go­ing to re­li­ably pur­sue the speci­fi­ca­tion that I out­lined.” Whether this is pos­si­ble or not — whether peo­ple are ac­tu­ally op­ti­mistic or not — I’m not to­tally clear on. But it is a plau­si­ble ap­proach that one could take.

There are also other ar­eas of re­search re­lated to less ob­vi­ous solu­tions. Ro­bust­ness to dis­tri­bu­tional shift is par­tic­u­larly im­por­tant, be­cause mesa op­ti­miza­tion be­comes risky with dis­tri­bu­tional shift. On your train­ing dis­tri­bu­tion, your agent is go­ing to perform well; it’s only when the world changes that things could plau­si­bly go badly.


A notable thing that I haven’t talked about yet is interpretability. Interpretability is a field of research which entails trying to make sure that we understand the AI systems we train. The reason I haven’t included it yet is that it’s useful for everything. For example, you could use interpretability to help your adversary [identify] the situations in which your agent will do bad things. This helps adversarial training work better. But interpretability is also useful for value learning. It allows you to provide better feedback to the agent; if you better understand what the agent is doing, you can better correct it. And it’s especially relevant to informed oversight or ascription universality. So while interpretability is obviously not a solution in and of itself, it makes other solutions way better.

There’s also the op­tion of try­ing to pre­vent catas­tro­phes. Some­one else can deal with whether the AI sys­tem will be use­ful; we’re just go­ing to stop it from kil­ling ev­ery­body. Ap­proaches in this area in­clude im­pact reg­u­lariza­tion, where the AI sys­tem is pe­nal­ized for hav­ing large im­pacts on the world. Some tech­niques are rel­a­tive reach­a­bil­ity and at­tain­able util­ity preser­va­tion. The hope here would be that you could cre­ate pow­er­ful AI sys­tems that can do some­what im­pact­ful things like pro­vid­ing ad­vice on writ­ing new laws, but wouldn’t be able to do ex­tremely im­pact­ful things like en­g­ineer a pan­demic that kills ev­ery­body. There­fore, even if an AI sys­tem were mo­ti­vated to harm us, the im­pact penalty would pre­vent it from do­ing some­thing truly catas­trophic.
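As a hedged sketch of what an impact penalty might look like (not part of the talk), the function below subtracts a penalty from the task reward based on how much an action changes a set of auxiliary values compared with doing nothing. This is loosely in the spirit of the techniques named above and is not their published formulation; the function name, the auxiliary values, and the penalty weight are all hypothetical.

```python
def penalized_reward(task_reward, aux_after_action, aux_after_noop, penalty_weight=1.0):
    """Task reward minus a penalty for changing what could be attained afterwards.

    aux_after_action / aux_after_noop: assumed estimates of how well several
    auxiliary goals could be achieved after the action versus after doing nothing.
    """
    differences = [abs(a - b) for a, b in zip(aux_after_action, aux_after_noop)]
    impact = sum(differences) / len(differences)
    return task_reward - penalty_weight * impact


# A modest action: useful, and it leaves the auxiliary values unchanged.
modest = penalized_reward(1.0, [1.0, 2.0], [1.0, 2.0], penalty_weight=2.0)
# A drastic action: higher task reward, but it changes the auxiliary values a lot.
drastic = penalized_reward(5.0, [0.0, 10.0], [1.0, 2.0], penalty_weight=2.0)
print(modest, drastic)
```

With these made-up numbers the drastic action ends up with a lower penalized reward than the modest one, even though its raw task reward is higher. Choosing the auxiliary values and the weight so that useful actions survive while catastrophic ones do not is the hard part.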

Another [approach to preventing catastrophes] is oracles. The idea here is to restrict the AI system’s action space so that all it does is answer questions. This doesn’t immediately provide safety, but hopefully it makes it a lot harder for an AI system to cause a catastrophe. Alternatively, you could try to box the AI system, so that it can’t have much of an impact on the world. One example of recent work on this is BoMAI, or boxed myopic artificial intelligence. In that case, you put both the human and the AI system in a box so that they have no communication with the outside world while the AI system is operating. And then the AI system shuts down, and the human leaves the box and is able to use any information that the AI system gave them.

So that’s most of [the ma­te­rial] I’ll cover in this prob­lem-solu­tion for­mat. There’s also a lot of other work on AI safety and al­ign­ment that’s more difficult to cat­e­go­rize. For ex­am­ple, there’s work on safe ex­plo­ra­tion, ad­ver­sar­ial ex­am­ples, and un­cer­tainty. Th­ese all seem pretty rele­vant to AI al­ign­ment, but it’s not ob­vi­ous to me where, ex­actly, they fit in the graph [above]. So I haven’t put them in.

There’s also a lot of work on fore­cast­ing, which is ex­tremely rele­vant to [iden­ti­fy­ing] which re­search agen­das you want to pur­sue. For ex­am­ple, there has been a lot of dis­agree­ment over whether or not there will be dis­con­ti­nu­ities in AI progress — in other words, whether at some point in the fu­ture, AI ca­pa­bil­ities shoot up in a way that we couldn’t have pre­dicted by ex­trap­o­lat­ing from past progress.

Another com­mon dis­agree­ment is over whether ad­vanced AI sys­tems will provide com­pre­hen­sive ser­vices. Here’s a very short and ba­sic de­scrip­tion of what that means: Each task that you might want an AI sys­tem to do is performed by one ser­vice; you don’t have a sin­gle agent that’s do­ing all of the tasks. On the other hand, you could imag­ine a sin­gle mono­lithic AI agent that is able to do all tasks. Which of these two wor­lds are we likely to live in?

A third dis­agree­ment is over whether it is pos­si­ble to get to pow­er­ful AI sys­tems by just in­creas­ing the amounts of com­pute that we use with cur­rent meth­ods. Or do we ac­tu­ally need some deep in­sights in or­der to get to pow­er­ful AI sys­tems?

This is all very rele­vant to de­cid­ing what type of re­search you want to do. Many re­search agen­das only make sense un­der some pos­si­ble wor­lds. And if you find out that one world [doesn’t seem very likely], then per­haps you switch to a differ­ent re­search agenda.

That con­cludes my talk. Again, here’s the link to the liter­a­ture re­view that I wrote. There is both a short ver­sion and a long ver­sion. I re­ally en­courage you to read it. It goes into more de­tail than I could in this pre­sen­ta­tion. Thank you so much.
