My current thoughts on MIRI’s “highly reliable agent design” work

Interpreting this writeup:

I lead the Open Philanthropy Project’s work on technical AI safety research. In our MIRI grant writeup last year, we said that we had strong reservations about MIRI’s research, and that we hoped to write more about MIRI’s research in the future. This writeup explains my current thinking about the subset of MIRI’s research referred to as “highly reliable agent design” in the Agent Foundations Agenda. My hope is that this writeup will help move the discussion forward, but I definitely do not consider it to be any kind of final word on highly reliable agent design. I’m posting the writeup here because I think this is the most appropriate audience, and I’m looking forward to reading the comments (though I probably won’t be able to respond to all of them).

After writing the first version of this writeup, I received comments from other Open Phil staff, technical advisors, and MIRI staff. Many comments were disagreements with arguments or credences stated here; some of these disagreements seem plausible to me, some comments disagree with one another, and I place significant weight on all of them because of my confidence in the commentators. Based on these comments, I think it’s very likely that some aspects of this writeup will turn out to have been miscalibrated or mistaken – i.e. incorrect given the available evidence, and not just cases where I assign a reasonable credence or make a reasonable argument that may turn out to be wrong – but I’m not sure which aspects these will turn out to be.

I considered spending a lot of time heavily revising this writeup to take these comments into account. However, it seems pretty likely to me that I could continue this comment/revision process for a long time, and this process offers very limited opportunities for others outside of a small set of colleagues to engage with my views and correct me where I’m wrong. I think there’s significant value in instead putting an imperfect writeup into the public record, and giving others a chance to respond in their own words to an unambiguous snapshot of my beliefs at a particular point in time.


  1. What is “highly reliable agent design”?

  2. What’s the basic case for HRAD?

  3. What do I think about HRAD?

    1. Low credence that HRAD will be applicable (25%?)

    2. HRAD has few advocates among AI researchers

    3. Other research, especially “learning to reason from humans,” looks more promising than HRAD (75%?)

    4. MIRI staff are thoughtful, aligned with our values, and have a good track record

  4. How much should Open Phil support HRAD work?

1. What is “highly reliable agent design”?

I understand MIRI’s “highly reliable agent design” work (coined in this research agenda, “HRAD” for short) as work that aims to describe basic aspects of reasoning and decision-making in a complete, principled, and theoretically satisfying way. Here’s a non-exhaustive list of research topics in this area:

  • Epistemology: developing a formal theory of induction that accounts for the facts that an AI system will be implemented in the physical world it is reasoning about (“naturalistic world models”) and that other intelligent agents may be simulating the AI system (“benign universal prior”).

  • Decision theory: developing a decision theory that behaves appropriately when an agent’s decisions are logically entangled with other parts of the environment (e.g. in the presence of other copies of the agent, other very similar systems, or other agents that can predict the agent), and that can’t be profitably threatened by other agents.

  • Logical uncertainty: developing a rigorous, satisfying theory of probabilistic reasoning over facts that are logical consequences of an agent’s current beliefs, but that are too expensive to reason out deductively.

  • Vingean reflection: developing a theory of formal reasoning that allows an agent to reason with high reliability about similar agents, including agents with considerably more computational resources, without simulating those agents.
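To give a concrete (toy) flavor of the logical uncertainty topic above, here is a small sketch of my own, not MIRI’s actual formalism: a bounded reasoner assigns a probability to a logical claim it hasn’t yet deduced (“the 100,003rd Fibonacci number is even”) by surveying cheap analogous cases, and deduction later settles the question exactly.

```python
# Toy illustration (mine, not MIRI's formalism): a bounded reasoner assigns
# a probability to a logical fact before paying the cost of deducing it,
# by sampling cheap analogous cases.

def fib_parity(n):
    """Parity (0 = even, 1 = odd) of the n-th Fibonacci number, computed mod 2."""
    a, b = 0, 1  # F(0), F(1) mod 2
    for _ in range(n):
        a, b = b, (a + b) % 2
    return a

# Step 1: a cheap empirical survey of small cases stands in for the
# "reasonable credence" a logically uncertain reasoner would assign.
samples = [fib_parity(n) for n in range(1, 301)]
credence_even = samples.count(0) / len(samples)   # ~1/3: F(n) is even iff 3 | n

# Step 2: deduction later settles the question exactly.
fact = (fib_parity(100003) == 0)

print(f"credence before deduction: {credence_even:.2f}")   # 0.33
print(f"settled by deduction: {fact}")                     # False (100003 is not divisible by 3)
```

The point of the toy is only that the credence is reasonable given limited computation, even though the claim has a definite truth value; a real theory of logical uncertainty would need to say what makes such credences rational in general.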

To be really satisfying, it should be possible to put these descriptions together into a full and principled description of an AI system that reasons and makes decisions in pursuit of some goal in the world, not taking into account issues of efficiency; this description might be understandable as a modified/expanded version of AIXI. Ideally this research would also yield rigorous explanations of why no other description is satisfying.
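For reference, the kind of “full and principled description” being pointed at is exemplified by AIXI’s action rule (this is the standard textbook form, after Hutter, not anything specific to MIRI’s program): at each step the agent takes the action maximizing expected future reward under a Solomonoff-style mixture over all computable environments.

```latex
% Schematic AIXI action rule: U is a universal Turing machine, q ranges over
% environment programs, \ell(q) is the length of q, m is the horizon, and
% o_i, r_i are observations and rewards.
a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
  \bigl[\, r_k + \cdots + r_m \,\bigr]
  \sum_{q \,:\; U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

HRAD can be read as asking what a similarly complete expression would look like once the idealizations AIXI relies on (unlimited computation, an environment separate from the agent) are removed.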

2. What’s the basic case for HRAD?

My understanding is that MIRI (or at least Nate and Eliezer) believe that if there is not significant progress on many problems in HRAD, the probability that an advanced AI system will cause catastrophic harm is very high. (They reserve some probability for other approaches being found that could render HRAD unnecessary, but they aren’t aware of any such approaches.)

I’ve engaged in many conversations about why MIRI believes this, and have often had trouble coming away with crisply articulated reasons. So far, the basic case that I think is most compelling and most consistent with the majority of the conversations I’ve had is something like this (phrasing is mine / Holden’s):

  1. Advanced AI systems are going to have a huge impact on the world, and for many plausible systems, we won’t be able to intervene after they become sufficiently capable.

  2. If we fundamentally “don’t know what we’re doing” because we don’t have a satisfying description of how an AI system should reason and make decisions, then we will probably make lots of mistakes in the design of an advanced AI system.

  3. Even minor mistakes in an advanced AI system’s design are likely to cause catastrophic misalignment.

  4. Because of 1, 2, and 3, if we don’t have a satisfying description of how an AI system should reason and make decisions, we’re likely to make enough mistakes to cause a catastrophe. The right way to get to advanced AI that does the right thing instead of causing catastrophes is to deeply understand what we’re doing, starting with a satisfying description of how an AI system should reason and make decisions.

  5. This case does not revolve around any specific claims about specific potential failure modes, or their relationship to specific HRAD subproblems. This case revolves around the value of fundamental understanding for avoiding “unknown unknown” problems.

I also find it helpful to see this case as asserting that HRAD is one kind of “basic science” approach to understanding AI. Basic science in other areas – i.e. work based on some sense of being intuitively, fundamentally confused and unsatisfied by the lack of explanation for something – seems to have an outstanding track record of uncovering important truths that would have been hard to predict in advance, including the work of Faraday/Maxwell, Einstein, Nash, and Turing. Basic science can also provide a foundation for high-reliability engineering, e.g. by giving us a language to express guarantees about how an engineered system will perform in different circumstances or by improving an engineer’s ability to design good empirical tests. Our lack of satisfying explanations for how an AI system should reason and make decisions and the importance of “knowing what we’re doing” in AI make a basic science approach appealing, and HRAD is one such approach. (I don’t think MIRI would say that there couldn’t be other kinds of basic science that could be done in AI, but they don’t know of similarly valuable-looking approaches.)

We’ve spent a lot of effort (100+ hours) trying to write down more detailed cases for HRAD work. This time included conversations with MIRI, conversations among Open Phil staff and technical advisors, and writing drafts of these arguments. These other cases didn’t feel like they captured MIRI’s views very well and were not very understandable or persuasive to me and other Open Phil staff members, so I’ve fallen back on this simpler case for now when thinking about HRAD work.

3. What do I think about HRAD?

I have several points of agreement with MIRI’s basic case:

  • I agree that existing formalisms like AIXI, Solomonoff induction, and causal decision theory are unsatisfying as descriptions of how an AI system should reason and make decisions, and I agree with most (maybe all) of the ways that MIRI thinks they are unsatisfying.

  • I agree that advanced AI is likely to have a huge impact on the world, and that for certain advanced AI systems there will be a point after which we won’t be able to intervene.

  • I agree that some plausible kinds of mistakes in an AI system’s design would cause catastrophic misalignment.

  • I agree that without some kind of description of “what an advanced AI system is doing” that makes us confident that it will be aligned, we should be very worried that it will cause a catastrophe.

The fact that MIRI researchers (who are thoughtful, very dedicated to this problem, aligned with our values, and have a good track record in thinking about existential risks from AI) and some others in the effective altruism community are significantly more positive than I am about HRAD is an extremely important factor to me in favor of HRAD. These positive views significantly raise the minimum credence I’m willing to put on HRAD research being very helpful.

In addition to these positive factors, I have several reservations about HRAD work. In relation to the basic case, these reservations make me think that HRAD isn’t likely to be significantly helpful for getting a confidence-generating description of how an advanced AI system reasons and makes decisions.

1. It seems pretty likely that early advanced AI systems won’t be understandable in terms of HRAD’s formalisms, in which case HRAD won’t be useful as a description of how these systems should reason and make decisions.

Note: I’m not sure to what extent MIRI and I disagree about how likely HRAD is to be applicable to early advanced AI systems. It may be that our overall disagreement about HRAD is more about the feasibility of other AI alignment research options (see 3 below), or possibly about strategic questions outside the scope of this document (e.g. to what extent we should try to address potential risks from advanced AI through strategy, policy, and outreach rather than through technical research).

2. HRAD has gained fewer strong advocates among AI researchers than I’d expect it to if it were very promising—including among AI researchers whom I consider highly thoughtful about the relevant issues, and whom I’d expect to be more excited if HRAD were likely to be very helpful.

Together, these two concerns give me something like a 20% credence that if HRAD work reached a high level of maturity (and relatively little other AI alignment research were done) HRAD would significantly help AI researchers build aligned AI systems around the time it becomes possible to build any advanced AI system.

3. The above considers HRAD in a vacuum, instead of comparing it to other AI alignment research options. My understanding is that MIRI thinks it is very unlikely that other AI alignment research can make up for a lack of progress in HRAD. I disagree; HRAD looks significantly less promising to me (in terms of solving object-level alignment problems, ignoring factors like field-building value) than learning to reason and make decisions from human-generated data (described more below), and HRAD seems unlikely to be helpful on the margin if reasonable amounts of other AI alignment research are done.

This reduces my credence in HRAD being very helpful to around 10%. I think this is the decision-relevant credence.
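To make the bookkeeping explicit: the multipliers below are my own reconstruction of how these rough credences combine, not a calculation the argument depends on. In particular, the roughly factor-of-two discount for other alignment research covering what HRAD would add is an assumption I am reading into the 20%-to-10% step, not a number stated above.

```python
# My own rough reconstruction (assumed multipliers, not a formal model) of
# how the credences in this section combine into the ~10% headline figure.
p_applicable_and_helpful = 0.20   # sections 3a-3b: HRAD matures AND applies
p_needed_on_margin = 0.50         # assumed: other alignment research doesn't
                                  # already cover what HRAD would add (sec. 3c)

p_decision_relevant = p_applicable_and_helpful * p_needed_on_margin
print(f"{p_decision_relevant:.0%}")   # 10%, the decision-relevant credence
```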

In the next few sections, I’ll go into more detail about the factors I just described. Afterward, I’ll say what I think this implies about how much we should support HRAD research, briefly summarizing the other factors that I think are most relevant.

3a. Low credence that HRAD will be applicable (25%?)

The basic case for HRAD being helpful depends on HRAD producing a description of how an AI system should reason and make decisions that can be productively applied to advanced AI systems. In this section, I’ll describe my reasons for thinking this is not likely. (As noted above, I’m not sure to what extent MIRI and I disagree about how likely HRAD is to be applicable to early advanced AI systems; nevertheless, it’s an important factor in my current beliefs about the value of HRAD work.)

I understand HRAD work as aiming to describe basic aspects of reasoning and decision-making in a complete, principled, and theoretically satisfying way, and ideally to have arguments that no other description is more satisfying. I’ll refer to this as a “complete axiomatic approach,” meaning that an end result of HRAD-style research on some aspect of reasoning would be a set of axioms that completely describe that aspect and that are chosen for their intrinsic desirability or for the desirability of the properties they entail. This property of HRAD work is the source of several of my reservations:

  • I haven’t found any instances of complete axiomatic descriptions of AI systems being used to mitigate problems in those systems (e.g. to predict, postdict, explain, or fix them) or to design those systems in a way that avoids problems they’d otherwise face. AIXI and Solomonoff induction are particularly strong examples of work that is very close to HRAD, but don’t seem to have been applicable to real AI systems. While I think the most likely explanation for this lack of precedent is that complete axiomatic description is not a very promising approach, it could be that not enough effort has been spent in this direction for contingent reasons; I think that attempts at this would be very informative about HRAD’s expected usefulness, and seem like the most likely way that I’ll increase my credence in HRAD’s future applicability. (Two very accomplished machine learning researchers have told me that AIXI is a useful source of inspiration for their work; I think it’s plausible that e.g. logical uncertainty could serve a similar role, but this is a much weaker case for HRAD than the one I understand MIRI as making.) If HRAD work were likely to be applicable to advanced AI systems, it seems likely to me that some complete axiomatic descriptions (or early HRAD results) should be applicable to current AI systems, especially if advanced AI systems are similar to today’s.

  • From conversations with researchers and from my own familiarity with the literature, my understanding is that it would be extremely difficult to relate today’s cutting-edge AI systems to complete axiomatic descriptions. It seems to me that very few researchers think this approach is promising relative to other kinds of theory work, and that when researchers have tried to describe modern machine learning methods in this way, their work has generally not been very successful (compared to other theoretical and experimental work) in increasing researchers’ understanding of the AI systems they are developing.

  • It seems plausible that the kinds of axiomatic descriptions that HRAD work could produce would be too taxing to be usefully applied to any practical AI system. HRAD results would have to be applied to actual AI systems via theoretically satisfying approximation methods, and it seems plausible that this will not be possible (or that the approximation methods will not preserve most of the desirable properties entailed by the axiomatic descriptions). I haven’t gathered evidence about this question.

  • It seems plausible that the conceptual framework and axioms chosen during HRAD work will be very different from the conceptual framework that would best describe how early advanced AI systems work. In theory, it may be possible to describe a recurrent neural network learning to predict future inputs as a particular approximation of Solomonoff induction, but in practice the differences in conceptual framework may be significant enough that this description would not actually be useful for understanding how neural networks work or how they might fail.

Overall, this makes me think it’s unlikely that HRAD work will apply well to advanced AI systems, especially if advanced AI is reached soon (which would make it more likely to resemble today’s machine learning methods). A large portion of my credence in HRAD being applicable to advanced AI systems comes from the possibility that advanced AI systems won’t look much like today’s. I don’t know how to gain much evidence about HRAD’s applicability in this case.

3b. HRAD has few advocates among AI researchers

HRAD has gained fewer strong advocates among AI researchers than I’d expect it to if it were very promising, despite other aspects of MIRI’s research (the alignment problem, value specification, corrigibility) being strongly supported by a few prominent researchers. Our review of five of MIRI’s HRAD papers last year provided more detailed examples of how a small number of AI researchers (seven computer science professors, one graduate student, and our technical advisors) respond to HRAD research; these reviews made it seem to us that HRAD research has little potential to decrease potential risks from advanced AI relative to other technical work with the same goal, though we noted that this conclusion was “particularly tentative, and some of our advisors thought that versions of MIRI’s research direction could have significant value if effectively pursued”.

I interpret these unfavorable reviews and lack of strong advocates as evidence that:

  1. HRAD is less likely to be good basic science of AI; I’d expect a reasonable number of external AI researchers to recognize good basic science of AI, even if its aesthetic is fairly different from the most common aesthetics in AI research.

  2. HRAD is less likely to be applicable to AI systems that are similar to today’s; I would expect applicability to AI systems similar to today’s to make HRAD research significantly more interesting to AI researchers, and our technical advisors agreed strongly that HRAD is especially unlikely to apply to AI systems that are similar to today’s.

I’m frankly not sure how many strong advocates among AI researchers it would take to change my mind on these points – I think a lot would depend on details of who they were and what story they told about their interest in HRAD.

I do believe that some of this lack of interest should be explained by social dynamics and communication difficulties – MIRI is not part of the academic system, and the way MIRI researchers write about their work and motivation is very different from many academic papers; both of these could cause mainstream AI researchers to be less interested in HRAD research than they would be if these factors weren’t in play. However, I think our review process and conversations with our technical advisors each provide some evidence that this isn’t likely to be sufficient to explain AI researchers’ low interest in HRAD.

Reviewers’ descriptions of the papers’ main questions, conclusions, and intended relationship to potential risks from advanced AI generally seemed thoughtful and (as far as I can tell) accurate, and in several cases (most notably Fallenstein and Kumar 2015) some reviewers thought the work was novel and impressive; if reviewers’ opinions were more determined by social and communication issues, I would expect reviews to be less accurate, less nuanced, and more broadly dismissive.

I only had enough interaction with external reviewers to be moderately confident that their opinions weren’t significantly attributable to social or communication issues. I’ve had much more extensive, in-depth interaction with our technical advisors, and I’m significantly more confident that their views are mostly determined by their technical knowledge and research taste. I think our technical advisors are among the very best-qualified outsiders to assess MIRI’s work, and that they have genuine understanding of the importance of alignment as well as being strong researchers by traditional standards. Their assessment is probably the single biggest data point for me in this section.

Outside of HRAD, some other research topics that MIRI has proposed have been the subject of much more interest from AI researchers. For example, researchers and students at CHAI have published papers on and are continuing to work on value specification and error-tolerance (particularly corrigibility), these topics have consistently seemed more promising to our technical advisors, and Stuart Russell has adopted the value alignment problem as a central theme of his work. In light of this, I am more inclined to take AI researchers’ lack of interest in HRAD as evidence about its promisingness than as evidence of severe social or communication issues.

The most convincing argument I know of for not treating other researchers’ lack of interest as significant evidence about the promisingness of HRAD research is:

  1. I’m pretty sure that MIRI’s work on decision theory is a very significant step forward for philosophical decision theory. This is based mostly on conversations with a very small number of philosophers who I know to have seriously evaluated MIRI’s work, partially on an absence of good objections to their decision theory work, and a little on my own assessment of the work (which I’d discard if the first two considerations had gone the other way).

  2. MIRI’s decision theory work has gained significantly fewer advocates among professional philosophers than I’d expect it to if it were very promising.

I’m strongly inclined to resolve this conflict by continuing to believe that MIRI’s decision theory work is good philosophy, and to explain 2 by appealing to social dynamics and communication difficulties. I think it’s reasonable to consider an analogous situation with HRAD and AI researchers to be plausible a priori, but the analogue of point 1 above doesn’t apply to HRAD work, and the other reasons I’ve given in this section lead me to think that this is not likely.

3c. Other research, especially “learning to reason from humans,” looks more promising than HRAD (75%?)

How promising does HRAD look compared to other AI alignment research options? The most significant factor to me is the apparent promisingness of designing advanced AI systems to reason and make decisions from human-generated data (“learning to reason from humans”); if an approach along these lines is successful, it doesn’t seem to me that much room would be left for HRAD to help on the margin. My views here are heavily based on Paul Christiano’s writing on this topic, but I’m not claiming to represent his overall approach, and in particular I’m trying to sketch out a broader set of approaches that includes Paul’s. It’s plausible to me that other kinds of alignment research could play a similar role, but I have a much less clear picture of how that would work, and finding out about significant problems with learning to reason from humans would make me both more pessimistic about technical work on AI alignment in general and more optimistic that HRAD would be helpful. The arguments in this section are pretty loose, but the basic idea seems promising enough to me to justify high credence that something in this general area will work.

“Learning to reason from humans” is different from the most common approaches in AI today, where decision-making methods are implicitly learned in the process of approximating some function – e.g. a reward-maximizing policy, an imitative policy, a Q-function or model of the world, etc. Instead, learning to reason from humans would involve directly training a system to reason in ways that match human demonstrations or are approved of by human feedback, as in Paul’s article here.
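A cartoon of what that training signal might look like, at toy scale: a model of per-step human approval is fit from labeled examples, and the agent then reasons using the step its model predicts humans would most approve of. This is my own illustration under stated assumptions (the `human_approves` stub and the list of candidate steps are hypothetical stand-ins), not a description of any actual system.

```python
# Minimal sketch (my illustration): learn which reasoning steps humans
# approve of from binary feedback, then reason using the most-approved step.
import random

random.seed(0)

STEPS = ["cite evidence", "ask the user", "guess and commit", "manipulate the user"]

def human_approves(step):
    """Hypothetical stand-in for a human giving binary approval feedback."""
    return step in ("cite evidence", "ask the user")

# Fit per-step approval estimates as running means over labeled feedback,
# starting from an uninformed prior of 0.5.
approval_model = {s: 0.5 for s in STEPS}
counts = {s: 1 for s in STEPS}
for _ in range(200):
    s = random.choice(STEPS)
    counts[s] += 1
    approval_model[s] += (human_approves(s) - approval_model[s]) / counts[s]

# At decision time, reason using the step predicted to be most approved.
chosen = max(STEPS, key=lambda s: approval_model[s])
print(chosen)
```

The real proposal involves ML-scale training on rich demonstrations and feedback rather than a four-option lookup, but the structure is the same: the decision-making procedure itself, not just its outputs, is shaped by human endorsement.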

If we are able to become confident that an AI system is learning to reason in ways that meet human approval or match human demonstrations, it seems to me that we could also become confident that the AI system would be aligned overall; a very harmful decision would need to be generated by a series of human-endorsed reasoning steps (and unless human reasoning endorses a search for edge cases, edge cases won’t be sought). Human endorsement of reasoning and decision-making could not only incorporate valid instrumental reasoning (in parts of epistemology and decision theory that we know how to formalize), but also rules of thumb and sanity checks that allow humans to navigate uncertainty about which epistemology and decision theory are correct, as well as human value judgements about which decisions, actions, short-term consequences, and long-term consequences are desirable, undesirable, or of uncertain value.

Another factor that is important to me here is the potential to design systems to reason and make decisions in ways that are calibrated or conservative. The idea here is that we can become more confident that AI systems will not make catastrophic decisions if they can reliably detect when they are operating in unfamiliar domains or situations, have low confidence that humans would approve of their reasoning and decisions, have low confidence in predicted consequences, or are considering actions that could cause significant harm; in those cases, we’d like AI systems to “check in” with humans more intensively and to act more conservatively. It seems likely to me that these kinds of properties would contribute significantly to alignment and safety, and that we could pursue these properties by designing systems to learn to reason and make decisions in human-approved ways, or by directly studying statistical properties like calibration or “conservativeness”.
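The “check in when uncertain” idea can be sketched in a few lines. This is a toy of my own (the threshold and confidence numbers are illustrative assumptions): the agent acts autonomously only when its confidence that the best action is human-approved clears a threshold, and otherwise defers to a human.

```python
# Toy sketch (mine) of conservative decision-making: defer to a human
# whenever confidence in the best available action falls below a threshold.

def decide(action_probs, threshold=0.9):
    """action_probs: dict mapping candidate actions to the agent's
    confidence that the action's outcome is human-approved."""
    best_action = max(action_probs, key=action_probs.get)
    if action_probs[best_action] < threshold:
        return ("check in with human", best_action)   # conservative branch
    return ("act", best_action)

print(decide({"proceed": 0.97, "shut down": 0.40}))   # familiar case: acts
print(decide({"proceed": 0.55, "shut down": 0.52}))   # unfamiliar case: defers
```

The hard research problems are hidden inside `action_probs`: producing confidence estimates that are actually calibrated, including in unfamiliar domains, is exactly the statistical property the paragraph above points at.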

“Learning to reason and make decisions from human examples and feedback” and “learning to act ‘conservatively’ where ‘appropriate’” don’t seem to me to be many orders of magnitude more difficult than the kinds of learning tasks AI systems are good at today. If it were necessary for an AI system to imitate human judgement perfectly, I would be much more skeptical of this approach, but that doesn’t seem to be necessary, as Paul argues:

“You need only the vaguest understanding of humans to guess that killing the user is: (1) not something they would approve of, (2) not something they would do, (3) not in line with their instrumental preferences.

So in order to get bad outcomes here you have to really mess up your model of what humans want (or more likely mess up the underlying framework in an important way).

If we imagine a landscape of possible interpretations of human preferences, there is a ‘right’ interpretation that we are shooting for. But if you start with a wrong answer that is anywhere in the neighborhood, you will do things like ‘ask the user what to do, and don’t manipulate them.’ And these behaviors will eventually get you where you want to go.

That is to say, the ‘right’ behavior is surrounded by a massive crater of ‘good enough’ behaviors, and in the long-term they all converge to the same place. We just need to land in the crater.”

Learning to reason from humans is a good fit with today’s AI research, and is broad enough that it would be very surprising to me if it were not productively applicable to early advanced AI systems.

It seems to me that this kind of approach is also much more likely to be robust to unanticipated problems than a formal, HRAD-style approach would be, since it explicitly aims to learn how to reason in human-endorsed ways instead of relying on researchers to notice and formally solve all critical problems of reasoning before the system is built. There are significant open questions about whether and how we could make machine learning robust and theoretically well-understood enough for high confidence, but it seems to me that this will be the case for any technical pathway that relies on learning about human preferences in order to act desirably.

Finally, it seems to me that if a lack of HRAD-style understanding does leave us exposed to many important “unknown unknown” problems, there is a good chance that some of those problems will be revealed by failures or difficulties in achieving alignment in earlier AI systems, and that researchers who are actively thinking about the goal of aligning advanced AI systems will be able to notice these failings and relate them to a need for better HRAD-style understanding. This kind of process seems very likely to be applicable to learning to reason from humans, but could also apply to other approaches to AI alignment. I do not think that this process is guaranteed to reveal a need for HRAD-style understanding in the case that it is needed, and I am fairly sure that some failure modes will not appear in earlier advanced AI systems (the failure modes Bostrom calls “treacherous turns”, which only appear when an AI system has a large range of general-purpose capabilities, can reason very powerfully, etc.). It’s possible that earlier failure modes will be too rare, too late, or not clearly enough related to a need for HRAD-style research. However, if a lack of fundamental understanding does expose us to many important “unknown unknown” failure modes, it seems more likely to me that some informative failures will happen early than that all such failures will appear only after systems are advanced enough to be extremely high-impact, and that researchers motivated by alignment of advanced AI will notice if those failures could be addressed through HRAD-style understanding.

(I’m uncertain about how researchers who aren’t thinking actively about alignment of advanced AI would respond, and I think one of the most valuable things we can do today is to increase the number of researchers who are thinking actively about alignment of advanced AI and are therefore more likely to respond appropriately to evidence.)

My credence for this section isn’t higher for three basic reasons:

  • It may be significantly harder to build an aligned AI system that’s much more powerful than a human if we use learned reasoning rules instead of formally specified ones. Very little work has been done on this topic.

  • It may be that some parts of HRAD – e.g. logical uncertainty or benign universal priors – will turn out to be necessary for reliability. This currently looks unlikely to me, but seems like the main way that parts of HRAD could turn out to be prerequisites for learning to reason from humans.

  • Unknown unknowns; my arguments in this section are pretty loose, and little work has been done on this topic.

3d. MIRI staff are thoughtful, aligned with our values, and have a good track record

As I noted above, I believe that MIRI staff are thoughtful, very dedicated to this problem, aligned with our values, and have a good track record in thinking about existential risk from AI. The fact that some of them are much more optimistic than I am about HRAD research is a very significant factor in favor of HRAD. I think it would be incorrect to place a very low credence (e.g. 1%) on their views being closer to the truth than mine are.

I don’t think it is helpful to try to list a large amount of detail here; I’m including this as its own section in order to emphasize its importance to my reasoning. My views come from many in-person and online conversations with MIRI researchers over the past 5 years, reports of many similar conversations by other thoughtful people I trust, and a large amount of online writing about existential risk from AI spread over several sites, most notably agentfoundations.org, arbital.com, and intelligence.org.

The most straightforward thing to list is that MIRI was among the first groups to strongly articulate the case for existential risk from artificial intelligence and the need for technical and strategic research on this topic, as noted in our last writeup:

“We believe that MIRI played an important role in publicizing and sharpening the value alignment problem. This problem is described in the introduction to MIRI’s Agent Foundations technical agenda. We are aware of MIRI writing about this problem publicly and in-depth as early as 2001, at a time when we believe it received substantial attention from very few others. While MIRI was not the first to discuss potential risks from advanced artificial intelligence, we believe it was a relatively early and prominent promoter, and generally spoke at more length about specific issues such as the value alignment problem than more long-standing proponents.”

4. How much should Open Phil support HRAD work?

My 10% credence that “if HRAD reached a high level of maturity it would significantly help AI researchers build aligned AI systems” doesn’t fully answer the question of how much we should support HRAD work (with our funding and with our outreach to researchers) relative to other technical work on AI safety. It seems to me that the main additional factors are:

Field-building value: I expect that the majority of the value of our current funding in technical AI safety research will come from its effect of increasing the total number of people who are deeply knowledgeable about technical research on artificial intelligence and machine learning, while also being deeply versed in issues relevant to potential risks. HRAD work appears to be significantly less useful for this goal than other kinds of AI alignment work, since HRAD has not gained much support among AI researchers. (I do think that in order to be effective for field-building, AI safety research directions should be among the most promising we can think of today; this is not an argument for working on unpromising but superficially attractive “AI safety” research.)

Replaceability: HRAD work seems much more likely than other AI alignment work to be neglected by AI researchers and funders. If HRAD work turns out to be significantly helpful, we could make a significant counterfactual difference by supporting it.

Shovel-readiness: My understanding is that HRAD work is currently funding-constrained (i.e. MIRI could scale up its program given more funds). This is not generally true of technical AI safety work, which in my experience has also required significant staff time.

The difference in field-building value between HRAD and the other technical AI safety work we support makes me significantly more enthusiastic about supporting other technical AI safety work than about supporting HRAD. However, HRAD’s low replaceability and my 10% credence in HRAD being useful make me excited to support at least some HRAD work.

In my view, enough HRAD work should be supported to continue building evidence about its chance of applicability to advanced AI, to have opportunities for other AI researchers to encounter it and become advocates, and to generally make it reasonably likely that if it is more important than it currently appears then we can learn this fact. MIRI’s current size seems to me to be approximately right for this purpose, and as far as I know MIRI staff don’t think MIRI is too small to continue making steady progress. Given this, I am ambivalent (along the lines of our previous grant writeup) about recommending that Good Ventures funds be used to increase MIRI’s capacity for HRAD research.