Cognitive Science/Psychology As a Neglected Approach to AI Safety

All of the advice on getting into AI safety research that I’ve seen recommends studying computer science and mathematics: for example, the 80,000 Hours AI safety syllabus provides a computer science-focused reading list, and mentions that “Ideally your undergraduate degree would be mathematics and computer science”.

There are obvious good reasons for recommending these two fields, and I agree that anyone wishing to make an impact in AI safety should have at least a basic proficiency in them. However, I find it a little concerning that cognitive science/psychology are rarely even mentioned in these guides. I believe that it would be valuable to have more people working in AI safety whose primary background is in cognitive science or psychology, or who have at least done a minor in one of them.

Here are examples of four lines of research into AI safety which I think could benefit from such a background:

  • The psychology of developing an AI safety culture. Besides the technical problem of “how can we create safe AI”, there is the social problem of “how can we ensure that the AI research community develops a culture where safety concerns are taken seriously”. At least two existing papers draw on psychology to consider this problem: Eliezer Yudkowsky’s “Cognitive Biases Potentially Affecting Judgment of Global Risks” uses cognitive psychology to discuss why people might misjudge the probability of risks in general, and Seth Baum’s “On the promotion of safe and socially beneficial artificial intelligence” uses social psychology to discuss the specific challenge of motivating AI researchers to choose beneficial AI designs.

  • Developing better analyses of “AI takeoff” scenarios. Currently humans are the only general intelligence we know of, so any analysis of what “expertise” consists of and how it can be acquired would benefit from the study of humans. Eliezer Yudkowsky’s “Intelligence Explosion Microeconomics” draws on a number of fields to analyze the possibility of a hard takeoff, including some knowledge of human intelligence differences as well as the history of human evolution, whereas my “How Feasible is the Rapid Development of Artificial Superintelligence?” draws extensively on the work of a number of psychologists to make the case that, based on what we know of human expertise, scenarios in which AI systems become major actors within timescales on the order of mere days or weeks remain within the range of plausibility.

  • Defining just what it is that human values are. The project of AI safety can roughly be defined as “the challenge of ensuring that AIs remain aligned with human values”, but it’s also widely acknowledged that nobody really knows what exactly human values are, at least not to a sufficient extent that they could be given a formal definition and programmed into an AI. This seems like one of the core problems of AI safety, and one which can only be understood with a psychology-focused research program. Luke Muehlhauser’s article “A Crash Course in the Neuroscience of Human Motivation” offers one look at human values from the perspective of neuroscience, and my “Defining Human Values for Value Learners” sought to provide a preliminary definition of human values in a computational language, drawing from the intersection of artificial intelligence, moral psychology, and emotion research. (A toy sketch of what one such computational framing could look like appears after this list.) Both of these are very preliminary papers, and it would take a full research program to pursue this question in more detail.

  • Better understanding multi-level world-models. MIRI defines the technical problem of “multi-level world-models” as “How can multi-level world-models be constructed from sense data in a manner amenable to ontology identification?”. In other words, suppose that we had built an AI to make diamonds (or anything else we care about) for us. How should that AI be programmed so that it could still accurately estimate the number of diamonds in the world after it had learned more about physics, and after it had learned that the things it calls “diamonds” are actually composed of protons, neutrons, and electrons? While I haven’t yet seen any papers that explicitly tackle this question, a reasonable starting point would seem to be the question of “well, how do humans do it?”. There, psych/cogsci may offer some clues; a toy sketch right after this list illustrates where the difficulty sits. For instance, in the book Cognitive Pluralism, the philosopher Steven Horst offers an argument for believing that humans have multiple different, mutually incompatible mental models / reasoning systems (ranging from core knowledge systems to scientific theories) that they flexibly switch between depending on the situation. (Unfortunately, Horst approaches this as a philosopher, so he’s mostly content to make the argument that this is the case in general, leaving it up to actual cognitive scientists to work out how exactly it works.) I previously offered a general argument along these lines in my article World-models as tools, suggesting that at least part of the choice of a mental model may be driven by reinforcement learning in the basal ganglia. But this isn’t saying much, given that all human thought and behavior seems to be driven at least in part by reinforcement learning in the basal ganglia. Again, this would take a dedicated research program.
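To make the diamond example above a bit more concrete, here is a minimal toy sketch in Python. Everything in it (the Atom class, the hand-written “a diamond is a pure-carbon cluster” bridge law) is my own illustrative assumption rather than anything from MIRI’s agenda; the point is only to show where the difficulty sits: the same world gets described in a high-level ontology and a low-level one, and the correspondence between “diamond” and some pattern of atoms has to come from outside the low-level model itself.

```python
from dataclasses import dataclass
from typing import Dict, List

# --- High-level ontology: the world is a list of named objects. ---
def count_diamonds_high_level(objects: List[str]) -> int:
    # "diamond" is a primitive concept in this world-model.
    return sum(1 for obj in objects if obj == "diamond")

# --- Low-level ontology: the world is a list of atoms grouped into lumps of matter. ---
@dataclass
class Atom:
    element: str
    cluster_id: int  # which contiguous lump of matter the atom belongs to

def count_diamonds_low_level(atoms: List[Atom]) -> int:
    # The bridge law "a diamond is a pure-carbon cluster" is supplied by hand here;
    # getting the AI to find and maintain such bridges itself is the open problem.
    clusters: Dict[int, List[str]] = {}
    for atom in atoms:
        clusters.setdefault(atom.cluster_id, []).append(atom.element)
    return sum(1 for elements in clusters.values() if all(e == "C" for e in elements))

# The same toy world, described in the two ontologies:
high_level_world = ["diamond", "rock", "diamond"]
low_level_world = [
    Atom("C", 0), Atom("C", 0),    # first diamond
    Atom("Si", 1), Atom("O", 1),   # the rock
    Atom("C", 2), Atom("C", 2),    # second diamond
]

assert count_diamonds_high_level(high_level_world) == 2
assert count_diamonds_low_level(low_level_world) == 2
# Nothing in the low-level model alone tells the agent which clusters correspond
# to the old "diamond" concept; that correspondence is an extra assumption.
```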

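Similarly, to give a very rough sense of what “a computational language” for human values might even mean in the bullet on defining values, here is a deliberately naive sketch. It is not the model from “Defining Human Values for Value Learners”; every modeling choice in it (a fixed feature list, a linear notion of “utility”, perceptron-style updates) is an illustrative assumption. It simply treats “values” as weights over outcome features and infers them from observed choices.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]

def utility(weights: Features, outcome: Features) -> float:
    # Assumed linear form of preferences -- itself a strong simplifying assumption.
    return sum(weights.get(f, 0.0) * value for f, value in outcome.items())

def learn_weights(choices: List[Tuple[Features, Features]],
                  feature_names: List[str],
                  lr: float = 0.1,
                  epochs: int = 50) -> Features:
    """Perceptron-style updates: whenever the current weights rank the rejected
    option at least as high as the chosen one, nudge the weights toward the
    chosen option. Each (chosen, rejected) pair is one observed human decision."""
    weights = {f: 0.0 for f in feature_names}
    for _ in range(epochs):
        for chosen, rejected in choices:
            if utility(weights, chosen) <= utility(weights, rejected):
                for f in feature_names:
                    weights[f] += lr * (chosen.get(f, 0.0) - rejected.get(f, 0.0))
    return weights

# Hypothetical observed decisions: the person keeps trading money for health.
observed_choices = [
    ({"health": 1.0, "money": 0.0}, {"health": 0.0, "money": 1.0}),
    ({"health": 0.8, "money": 0.1}, {"health": 0.2, "money": 0.9}),
]
print(learn_weights(observed_choices, ["health", "money"]))
# -> roughly {'health': 0.1, 'money': -0.1}: "values" as inferred feature weights
```

Nearly everything interesting lives in what this sketch assumes away (which features matter, whether preferences are stable, how emotion and moral psychology shape the observed choices), which is exactly why a full research program would be needed.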
From these four special cases, you could derive more general use cases for psychology and cognitive science within AI safety:

  • Psychology, as the study and understanding of human thought and behavior, helps guide actions aimed at understanding people’s behavior and influencing it in a more safety-aligned direction (related example: the psychology of developing an AI safety culture)

  • The study of the only general intelligence we know about may provide information about the properties of other general intelligences (related example: developing better analyses of “AI takeoff” scenarios)

  • A better understanding of how human minds work may help figure out how we want the cognitive processes of AIs to work so that they end up aligned with our values (related examples: defining human values, better understanding multi-level world-models)

Here I would ideally offer reading recommendations, but the fields are so broad that any given book can only give a rough idea of the basics; the topic of the world-models that human brains use, for instance, is just one of many, many subquestions that these fields cover. Hence my suggestion to have some safety-interested people actually study these fields as a major, or at least a minor.

Still, if I had to suggest a couple of books, with the main idea of getting a basic grounding in the mindsets and theories of the fields so that it becomes easier to read more specialized research: on the cognitive psychology/cognitive science side I’d suggest Cognitive Science by Jose Luis Bermudez (I haven’t read it, but Luke Muehlhauser recommends it and it looked good to me based on the table of contents; see also Luke’s follow-up recommendations behind that link); Cognitive Psychology: A Student’s Handbook by Michael W. Eysenck & Mark T. Keane; and maybe Sensation and Perception by E. Bruce Goldstein. I’m afraid that I don’t know of any good introductory textbooks on the social psychology side.