The academic contribution to AI safety seems large

Sum­mary: I model the con­tri­bu­tion to AI safety by aca­demics work­ing in ad­ja­cent ar­eas. I ar­gue that this con­tri­bu­tion is at least on the same or­der as the EA bet, and seek a lower bound. Guessti­mates here and here. I fo­cus on pre­sent lev­els of aca­demic work, but the trend is even more im­por­tant.

Con­fi­dence: High in a no­table con­tri­bu­tion, low in the par­tic­u­lar es­ti­mates. Lots of Fermi es­ti­mates.

A big rea­son for the EA fo­cus on AI safety is its ne­glect­ed­ness:

...less than $50 mil­lion per year is de­voted to the field of AI safety or work speci­fi­cally tar­get­ing global catas­trophic biorisks.

80,000 Hours (2019)

...we es­ti­mate fewer than 100 peo­ple in the world are work­ing on how to make AI safe.

80,000 Hours (2017)

Grand to­tal: $9.09m… [Foot­note: this] doesn’t in­clude any­one gen­er­ally work­ing on ver­ifi­ca­tion/​con­trol, au­dit­ing, trans­parency, etc. for other rea­sons.

Seb Far­quhar (2018)

...what we are do­ing is less than a pit­tance. You go to some ran­dom city… Along the high­way you see all these huge build­ings for com­pa­nies… Maybe they are de­sign­ing a new pub­lic­ity cam­paign for a ra­zor blade. You drive past hun­dreds of these… Any one of those has more re­sources than the to­tal that hu­man­ity is spend­ing on [AI safety].

Nick Bostrom (2016)

Num­bers like these helped con­vince me that AI safety is the best thing to work on. I now think that these are un­der­es­ti­mates, be­cause of non-EA lines of re­search which weren’t counted.

I use “EA safety” for the whole umbrella of work done at organisations like FHI, MIRI, DeepMind and OpenAI’s safety teams, and by independent researchers. A lot of this (maybe a third) is conducted at universities; to avoid double counting, I count it as EA and not academia.

The ar­gu­ment:

  1. EA safety is small, even rel­a­tive to a sin­gle aca­demic sub­field.

  2. There is over­lap be­tween ca­pa­bil­ities and short-term safety work.

  3. There is over­lap be­tween short-term safety work and long-term safety work.

  4. So AI safety is less ne­glected than the open­ing quotes im­ply.

  5. Also, on pre­sent trends, there’s a good chance that academia will do more safety over time, even­tu­ally dwarfing the con­tri­bu­tion of EA.

What’s ‘safety’?

EA safety is best read as being about “AGI alignment”: work on assuring that the actions of an extremely advanced system are sufficiently close to human-friendly goals.

EA focusses on AGI because weaker AI systems aren’t thought to be directly tied to existential risk. However, Critch and Krueger note that “prepotent” AI (unstoppably advanced, but not necessarily human-level) could still pose x-risks. The potential for this latter type is key to the argument that short-term work is relevant to us: the scaling curves for some systems seem to be holding up, so those systems might reach prepotence.

“ML safety” could mean just mak­ing ex­ist­ing sys­tems safe, or us­ing ex­ist­ing sys­tems as a proxy for al­ign­ing an AGI. The lat­ter is some­times called “mid-term safety”, and this is the key class of work for my pur­poses.

In the fol­low­ing “AI safety” means any­thing which helps us solve the AGI con­trol prob­lem.

De facto AI safety work

The line be­tween safety work and ca­pa­bil­ities work is some­times blurred. A clas­sic ex­am­ple is ‘ro­bust­ness’: it is both a safety prob­lem and a ca­pa­bil­ities prob­lem if your sys­tem can be re­li­ably bro­ken by noise. Trans­parency (in­creas­ing di­rect hu­man ac­cess to the goals and prop­er­ties of learned sys­tems) is the most ob­vi­ous case of work rele­vant to ca­pa­bil­ities, short-term safety, and AGI al­ign­ment. As well as be­ing a huge aca­demic fad, it’s a core mechanism in 6 out of the 11 live AGI al­ign­ment pro­pos­als re­cently sum­marised by Hub­inger.

More con­tro­ver­sial is whether there’s sig­nifi­cant over­lap be­tween short-term safety and AGI al­ign­ment. All we need for now is:

The mid-term safety hy­poth­e­sis (weak form): at least some work on cur­rent sys­tems will trans­fer to AGI al­ign­ment.

Some researchers who seem to put a lot of stock in this view: Shah, Christiano, Krakovna, Olsson, Olah, Steinhardt, Amodei, Krueger. (Note that I haven’t polled them; this is guessed from public statements and revealed preferences.) In fact, since 2016 “about half” of MIRI’s research has been on their ML agenda, apparently to cover the chance of prosaic AGI. [EDIT: This was their plan in 2016, but personnel changes and new agendas took over. There’s still some related work but I don’t know what fraction of their effort it is, nor the rationale.]

Here are some al­ign­ment-rele­vant re­search ar­eas dom­i­nated by non-EAs. I won’t ex­plain these: I use the in­cred­ibly de­tailed tax­on­omy (and 30 liter­a­ture re­views) of Critch and Krueger (2020). Look there, and at re­lated agen­das for ex­pla­na­tions and biblio­gra­phies.

Th­ese are nar­rowly drawn from ML, robotics, and game the­ory: this is just a sam­ple of rele­vant work! Work in so­cial sci­ence, moral un­cer­tainty, or de­ci­sion the­ory could be just as rele­vant as the above di­rect tech­ni­cal work; Richard Ngo lists many ques­tions for non-AI peo­ple here.

Work in these fields could help di­rectly, if the even­tual AGI paradigm is not too dis­similar from the cur­rent one (that is, if the weak mid-term hy­poth­e­sis holds). But there are also in­di­rect benefits: if they help us to use AIs to al­ign AGI; if they help to build the field; if they help con­vince peo­ple that there re­ally is an AGI con­trol prob­lem (for in­stance, Vic­to­ria Krakovna’s speci­fi­ca­tion gam­ing list has been helpful to me in in­ter­act­ing with scep­ti­cal spe­cial­ists). Th­ese im­ply an­other view un­der which much aca­demic work has al­ign­ment value:

The mid-term safety hy­poth­e­sis (very weak form): at least some work on cur­rent sys­tems will prob­a­bly help with AGI al­ign­ment in some way, not limited to di­rect tech­ni­cal trans­fer.

A nat­u­ral ob­jec­tion is that most of the above ar­eas don’t ad­dress the AGI case: they’re not even try­ing to solve our prob­lem. I dis­cuss this and other dis­counts be­low.

How large is EA Safety?

Some over­lap­ping lists:

  • Number of people with posts on the Alignment Forum since late 2018: 94. To my knowledge, 37 of these are full-time.

  • 80k AI Safety Google Group: 400, al­most en­tirely ju­nior peo­ple.

  • Larks’ great 2019 roundup con­tained ~110 AI re­searchers (who pub­lished that year), most of whom could be de­scribed as EA or ad­ja­cent.

  • Issa Rice’s AI Watch: “778” (raw count, but there’s lots of false pos­i­tives for gen­eral x-risk peo­ple and in­ac­tive peo­ple. Last big up­date 2018).

In the top-down model I start with all EAs and then filter them by in­ter­est in AI risk, di­rect work, and % of time work­ing on safety. (EA safety has a lot of hob­by­ists.) The bot­tom-up model at­tempts a head­count.

How large is non-EA Safety?

A rough top-down estimate of the number of academic computer scientists working on AI gives 84k to 103k, with caveats summarised in the Guesstimate. Then define a (very) rough relevance filter:

C = % of AI work on ca­pa­bil­ities
S = % of AI work on short-term safe­ty
CS = % of ca­pa­bil­ities work that over­laps with short-term safe­ty
SL = % of short-term safety that over­laps with long-term safety

Then, we could de­com­pose the safety-rele­vant part of aca­demic AI as:

SR = (C * CS * SL) + (S * SL)

Then the non-EA safety size is sim­ply the field size * SR.

None of those pa­ram­e­ters is ob­vi­ous, but I make an at­tempt in the model (bot­tom-left cor­ner).
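The decomposition above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Guesstimate model itself: the function names are hypothetical, and the parameter values are placeholders chosen to show the mechanics, not my estimates. Only the field-size range (84k to 103k) comes from the text.

```python
def safety_relevant_fraction(c, s, cs, sl):
    """Return SR = (C * CS * SL) + (S * SL).

    All inputs are fractions in [0, 1]:
      c  -- share of AI work on capabilities
      s  -- share of AI work on short-term safety
      cs -- share of capabilities work overlapping short-term safety
      sl -- share of short-term safety overlapping long-term safety
    """
    for name, value in {"c": c, "s": s, "cs": cs, "sl": sl}.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction in [0, 1], got {value}")
    return (c * cs * sl) + (s * sl)


def non_ea_safety_size(field_size, sr):
    """Non-EA safety size is the field size multiplied by SR."""
    return field_size * sr


if __name__ == "__main__":
    # PLACEHOLDER values for illustration only (assumptions, not estimates).
    params = {"c": 0.9, "s": 0.05, "cs": 0.1, "sl": 0.2}
    sr = safety_relevant_fraction(**params)

    # Field-size range quoted in the text: 84k to 103k academics.
    for field_size in (84_000, 103_000):
        size = non_ea_safety_size(field_size, sr)
        print(f"field size {field_size}: SR = {sr:.3f}, non-EA safety size = {size:.0f}")
```

Swapping in your own values for the four parameters shows how sensitive the result is to the two overlap terms, since SL multiplies everything.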

This just counts academia, and just tech­ni­cal AI within that. It’s harder to es­ti­mate the amount of in­dus­trial effort, but the AI In­dex re­port sug­gests that com­mer­cial AI re­search is about 10% as large as aca­demic re­search (by num­ber of pa­pers, not im­pact). But we don’t need this if we’re just ar­gu­ing that the non-EA lower bound is large.

What’s a good dis­count fac­tor for de facto safety work?

In EA safety, it’s com­mon to be cyn­i­cal about academia and em­piri­cal AI safety. There’s some­thing to it: the amount of pa­per­work and com­mu­ni­ca­tion over­head is no­to­ri­ous; there are per­verse in­cen­tives around pub­lish­ing tempo, short-ter­mism, and con­for­mity; it is very com­mon to em­pha­sise only the pos­i­tive effects of your work; and, as the GPT-2 story shows, there is a strong dogma about au­to­matic dis­clo­sure of all work. Also, in­so­far as AI safety is ‘pre-paradig­matic’, you might not ex­pect nor­mal sci­ence to make much head­way. (But note that sev­eral agent-foun­da­tion-style mod­els are from academia—see ‘A cur­sory check’ be­low.)

But this is only half of the ledger. One of the big ad­van­tages of aca­demic work is the much bet­ter dis­tri­bu­tion of se­nior re­searchers: EA Safety seems bot­tle­necked on peo­ple able to guide and train ju­niors. Another fac­tor is in­creased in­fluence: the av­er­age aca­demic has se­ri­ous op­por­tu­ni­ties to af­fect policy, hun­dreds of stu­dents, and the gen­eral at­ti­tude of their field to­ward al­ign­ment, in­clud­ing non-aca­demic work on al­ign­ment. Lastly, you get ac­cess to gov­ern­ment-scale fund­ing. I ig­nore these pos­i­tives in the fol­low­ing.


Here’s a top-down model ar­gu­ing that tech­ni­cal AI aca­demics could have the same or­der of effect as EA, even un­der a heavy im­pact dis­count, even when ig­nor­ing other fields and the use­ful fea­tures of academia. Here’s an (in­com­plete) bot­tom-up model to check if it’s roughly sen­si­ble. As you can see from the var­i­ance, the out­put means are not to be trusted.

A “con­fi­dence” in­ter­val

Again, the model is con­ser­va­tive: I don’t count the most promi­nent safety-rele­vant aca­demic in­sti­tu­tions (FHI, CHAI, etc); I don’t count con­tri­bu­tions from in­dus­try, just the sin­gle most rele­vant aca­demic field; I don’t count non-tech­ni­cal aca­demic con­tri­bu­tions; and a high dis­count is ap­plied to aca­demic work. For the sake of ar­gu­ment I’ve set the dis­count very high: a unit of ad­ja­cent aca­demic work is said to be 80% less effec­tive than a unit of ex­plicit AGI work. The mod­els rely on my pri­ors; cus­tomise them be­fore draw­ing con­clu­sions (see ‘Pa­ram­e­ters’ be­low).
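
To make the discount concrete, here is a worked version under one simplifying assumption that is mine rather than the models’: the 80% discount acts as a plain multiplier of 0.2 on the safety-relevant academic headcount. The symbols F (field size), N_EA (EA safety headcount) and d (the multiplier) are introduced here for illustration only; SR is the relevance fraction defined above.

```latex
\begin{align*}
\text{effective academic contribution} &= d \cdot F \cdot SR, \qquad d = 1 - 0.8 = 0.2 \\
d \cdot F \cdot SR \ge N_{EA} &\iff SR \ge \frac{N_{EA}}{d \cdot F} \\
\text{with } F = 84{,}000: \quad SR &\ge \frac{N_{EA}}{0.2 \times 84{,}000} = \frac{N_{EA}}{16{,}800}
\end{align*}
```

So, at the lower end of the field-size estimate, academia matches EA on this accounting whenever the relevant fraction exceeds the EA headcount divided by 16,800. The Guesstimate models propagate uncertainty rather than single point values, so treat this only as a reading aid.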

A cur­sory check of the model

The above im­plies that there should be a lot of main­stream work with al­ign­ment im­pli­ca­tions—maybe as much as EA pro­duces. A sys­tem­atic study would be a big un­der­tak­ing, but can we at least find ex­am­ples? Yes:

How much does EA safety pro­duce? In Larks’ ex­haus­tive an­nual round-up of EA safety work in 2019, he iden­ti­fied about 50 pa­per-sized chunks (not count­ing MIRI’s pri­vate efforts). Of them, both CAIS and mesa-op­ti­misers seem more sig­nifi­cant than the above. Re­cent years have seen similarly im­por­tant EA work (e.g. De­bate, quan­tiliz­ers, or the Arm­strong/​Shah dis­cus­sion of value learn­ing).

What does this change?

I argue that AI safety (AIS) is less neglected than it seems, because some academic work is related, and academia is enormous. (My confidence interval for the academic contribution is vast, but I didn’t quite manage to zero out the lower bound even by being conservative.) Does this change the cause’s priority?

Prob­a­bly not. Even if the field is big­ger than we thought, it’s still ex­tremely small rel­a­tive to the in­vest­ment in AI ca­pa­bil­ities, and highly ne­glected rel­a­tive to its im­por­tance. The point of the above is to cor­rect your model, to draw at­ten­tion to other sources of use­ful work, and to help sharpen a per­sis­tent dis­agree­ment within EA safety about the role of mid-term safety and academia.

This might change your view of effec­tive in­ter­ven­tions within AIS (for in­stance, ways to bring AGI al­ign­ment fur­ther within the Over­ton win­dow), but my model doesn’t get you there on its own. A key quan­tity I don’t re­ally dis­cuss is the ra­tio of ca­pa­bil­ities to al­ign­ment work. It seems pro­hibitively hard to re­duce ca­pa­bil­ities in­vest­ment. But a large, cred­ible aca­demic field of al­ign­ment is one way to re­place some work on ca­pa­bil­ities.

A naive ex­trap­o­la­tion im­plies that AIS ne­glect­ed­ness will de­crease fur­ther: in the last 10 years, Safety has moved from the fringe of the in­ter­net into the heart of great uni­ver­si­ties and NGOs. We have mo­men­tum: the pro­gramme is sup­ported by some of the most in­fluen­tial AI re­searchers—e.g. Rus­sell, Ben­gio, Sutskever, Shana­han, Rossi, Sel­man, McAllester, Pearl, Sch­mid­hu­ber, Horvitz. (Often only ver­bal ap­proval.)

In ad­di­tion, from per­sonal ex­pe­rience, ju­nior aca­demics are much more favourable to­wards al­ign­ment and are much more likely to want to work on it di­rectly.

Lastly: In­tu­itively, the eco­nomic in­cen­tive to solve AGI-safety-like prob­lems scales as ca­pa­bil­ities in­crease, and as mid-term prob­lems draw at­ten­tion. Or­di­nary le­gal li­a­bil­ity dis­in­cen­tivises all the sub-ex­is­ten­tial risks. (The in­cen­tive may not scale prop­erly, from a longter­mist per­spec­tive, but the di­rec­tion still seems helpful.)

If this con­tinues, then even the EA bet on di­rect AGI al­ign­ment could be to­tally out­stripped by nor­mal aca­demic in­cen­tives (pres­tige, so­cial proof, herd­ing around the agen­das of top re­searchers).

A cool fore­cast­ing com­pe­ti­tion is cur­rently run­ning on a re­lated ques­tion.

This ar­gu­ment de­pends on our luck hold­ing, and more­over, on peo­ple (e.g. me) not naively an­nounc­ing vic­tory and so dis­cour­ag­ing in­vest­ment. But to the ex­tent that you trust the trend, this should af­fect your pri­ori­ti­sa­tion of AI safety, since its ex­pected ne­glect­ed­ness is a great deal smaller.

Parameters

  • Your prob­a­bil­ity of pro­saic AGI (i.e. where we get there by just scal­ing up black-box al­gorithms). Whether it’s pos­si­ble to al­ign pro­saic AGI. Your prob­a­bil­ity that agent foun­da­tions is the only way to pro­mote real al­ign­ment.

  • The per­centage of main­stream work which is rele­vant to AGI al­ign­ment. Sub­sumes the ca­pa­bil­ities/​safety over­lap and the short/​long term safety over­lap. The idea of a con­tin­u­ous dis­count on work ad­ja­cent to al­ign­ment would be mis­guided if there were re­ally two classes of safety prob­lem, short- and long-term, and if short-term work had neg­ligible im­pact on the long-term prob­lems. The rele­vance would then be near 0.

  • The above is ex­tremely sen­si­tive to your fore­cast for AGI. Given very short timelines, you should fo­cus on other things than climb­ing up through academia, even if you think it’s gen­er­ally well-suited to this task; con­versely, if you think we have 100 years, then you can have pretty strong views on aca­demic in­ad­e­quacy and still agree that their im­pact will be sub­stan­tial.

  • If you have an extremely negative view of academia’s efficiency, then the above may not move you much. (See, for instance, the dramatically diminishing return on inputs in mature fields like physics.)

Caveats, fu­ture work

  • To es­ti­mate academia fairly, you’d need a more com­pli­cated model, in­volv­ing sec­ond-or­der effects like availa­bil­ity of se­nior re­searchers, policy in­fluence, op­por­tu­nity to spread ideas to stu­dents and col­leagues, fund­ing. That is, academia has ex­tremely clear paths to global im­pact. But since academia is stronger on the sec­ond or­der, omit­ting it doesn’t hurt my lower-bound ar­gu­ment.

  • A ques­tion which de­serves a post of its own is: “How of­ten do sci­en­tists in­ad­ver­tently solve a prob­lem?” (The gen­eral form—“how of­ten does seem­ingly un­re­lated work help? Provide cru­cial help?”—seems triv­ial: many solu­tions are helped by seem­ingly un­re­lated prior work.) I’m rely­ing on the dis­counts to cover the effect of “ac­tu­ally try­ing to solve the prob­lem”, but this might not be apt. Maybe av­er­age academia is to re­search as the av­er­age char­ity is to im­pact: maybe di­rectly tar­get­ing im­pact is that im­por­tant.

  • I haven’t thought much about po­ten­tial harms from aca­demic al­ign­ment work. Short-ter­mists crowd­ing out long-ter­mists and a lack of at­ten­tion to info haz­ards might be two.

  • In­tel­lec­tual im­pact is not lin­ear in peo­ple. Also, the above treats all (non-EA) aca­demic in­sti­tu­tions as equally con­ducive to safety work, which is not true.

  • Even more caveats.

  • Con­flict of in­ter­est: I’m a PhD stu­dent.

Thanks to Jan Brauner for the idea. Thanks to Vo­jta Ko­vařík, Aaron Gertler, Ago La­jko, Tomáš Gavenčiak, Misha Yagudin, Rob Kirk, Matthijs Maas, Nandi Schoots, and Richard Ngo for helpful com­ments.