What can the principal-agent literature tell us about AI risk?

Discussion on LessWrong

This work was done collaboratively with Tom Davidson.

Thanks to Paul Christiano, Ben Garfinkel, Daniel Garrett, Robin Hanson, Philip Trammell and Takuro Yamashita for helpful comments and discussion. Errors are our own.


The AI alignment problem has similarities with the principal-agent problem studied by economists. In both cases, the problem is: how do we get agents to try to do what we want them to do? Economists have developed a sophisticated understanding of the agency problem and a measure of the cost of failure for the principal, "agency rents".

If principal-agent models capture relevant aspects of AI risk scenarios, they can be used to assess their plausibility. Robin Hanson has argued that Paul Christiano's AI risk scenario is essentially an agency problem, and therefore that it implies extremely high agency rents. Hanson believes that the principal-agent literature (PAL) provides strong evidence against rents being this high.

In this post, we consider whether PAL provides evidence against Christiano's scenario and the original Bostrom/Yudkowsky scenario. We also examine whether extensions to the agency framework could be used to gain insight into AI risk, and consider some general difficulties in applying PAL to AI risk.


  • PAL isn’t in ten­sion with Chris­ti­ano’s sce­nario be­cause his sce­nario doesn’t im­ply mas­sive agency rents; the big losses oc­cur out­side of the prin­ci­pal-agent prob­lem, and the agency liter­a­ture can’t as­sess the plau­si­bil­ity of these losses. Ex­ten­sions to PAL could po­ten­tially shed light on the size of agency rents in this sce­nario, which are an im­por­tant de­ter­mi­nant of the fu­ture in­fluen­tial­ness of AI sys­tems.

  • Mapped onto a PAL model, the Bostrom/​Yud­kowsky sce­nario is largely about the prin­ci­pal’s un­aware­ness of the agent’s catas­trophic ac­tions. Unaware­ness mod­els are rare in PAL prob­a­bly be­cause they usu­ally aren’t very in­sight­ful. This lack of in­sight­ful­ness also seems to pre­vent ex­ist­ing PAL mod­els or pos­si­ble ex­ten­sions from teach­ing us much about this sce­nario.

  • There are also a num­ber of more gen­eral difficul­ties with us­ing PAL to as­sess AI risk, some more prob­le­matic than oth­ers.

    • PAL mod­els rarely con­sider weak prin­ci­pals and more ca­pa­ble agents

    • PAL mod­els are brittle

    • Agency rents are too nar­row a measure

    • PAL mod­els typ­i­cally as­sume con­tract enforceability

    • PAL mod­els typ­i­cally as­sume AIs work for hu­mans be­cause they are paid

  • Over­all, find­ings from PAL do not straight­for­wardly trans­fer to the AI risk sce­nar­ios con­sid­ered, so don’t provide much ev­i­dence for or against these sce­nar­ios. But new agency mod­els could teach us about the lev­els of agency rents which AI agents could ex­tract.

PAL and Christiano's AI risk scenarios

Christiano's scenario has two parts:

  • Part I: machine learning will increase our ability to "get what we can measure," which could cause a slow-rolling catastrophe. ("Going out with a whimper.")

  • Part II: ML training, like competitive economies or natural ecosystems, can give rise to "greedy" patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. ("Going out with a bang," an instance of optimization daemons.)

Hanson argued that "Christiano instead fears that as AIs get more capable, the AIs will gain so much more agency rents, and we will suffer so much more due to agency failures, that we will actually become worse off as a result. And not just a bit worse off; we apparently get apocalypse level worse off!"

PAL isn’t in ten­sion with Chris­ti­ano’s story and isn’t es­pe­cially informative

We asked Chris­ti­ano whether his sce­nario ac­tu­ally im­plies ex­tremely high agency rents. He doesn’t think so:

On my view the prob­lem is just that agency rents make AI sys­tems col­lec­tively bet­ter off. Hu­mans were pre­vi­ously the sole su­per­power and so as a class we are made worse off when we in­tro­duce a com­peti­tor, via the pos­si­bil­ity of even­tual con­flict with AI who have been greatly en­riched via agency rents…hu­mans are bet­ter off in ab­solute terms un­less con­flict leaves them worse off (whether mil­i­tary con­flict or a race for scarce re­sources). Com­pare: a ris­ing China makes Amer­i­cans bet­ter off in ab­solute terms. Also true, un­less we con­sider the pos­si­bil­ity of con­flict....[with­out con­flict] hu­mans are only worse off rel­a­tive to AI (or to hu­mans who are able to lev­er­age AI effec­tively). The availa­bil­ity of AI still prob­a­bly in­creases hu­mans’ ab­solute wealth. This is a prob­lem for hu­mans be­cause we care about our frac­tion of in­fluence over the fu­ture, not just our ab­solute level of wealth over the short term.

Christiano's concern isn't that agency rents will skyrocket because of some distinctive features of the human-AI agency relationship. Instead, "proxies" and "influence seeking" are two specific ways AI interests will diverge from actual human goals. This leads to typical levels of agency rents; PAL confirms that due to diverging interests and imperfect monitoring, AI agents could get some rents.[1]

The main loss occurs later in time and outside of the principal-agent context, because these rents eventually lead AIs to wield more total influence on the future than humans.[2] This is bad because, even if humanity is richer overall, we humans also "care about our fraction of influence over the future."[3] Compared to a world with aligned AI systems, humanity is leaving value on the table, permanently if these systems can't be rooted out. The biggest potential downside comes from influence-seeking systems, which Christiano believes could make humans worse off absolutely, by engaging in violent conflict.

These later failures aren't examples of massive agency rents (as the term is used in PAL) because failure is not expected to occur when the agent works on the task it was delegated.[4] Rather, the influence-seeking systems become more influential via typical agency rents, and then at some later point use these rents to influence the future, possibly by entering into conflict with humans. PAL studies the size of agency rents which can be extracted, but not what the agents decide to do with this wealth and influence.

Overall, PAL is consistent with AI agents extracting some agency rents, which occurs in both parts of Christiano's story (and we'll see next that putting more structure on agency models could tell us more about the level of rent extraction). But it has nothing to say about the plausibility of AI agents using their rents to exert influence over the long-term future (parts 1 and 2) or engage in conflict (part 2).[5]

Extending agency models seems promising for understanding the level of agency rents in Christiano's scenario

Christiano's scenario doesn't rely on something distinctive about the human-AI agency relationship generating higher-than-usual agency rents.[6] But perhaps there is something distinctive and rents will be atypical. In any case, the level of agency rents seems like a crucial consideration: if we think AIs can extract little to no rents, we probably shouldn't expect them to exert much influence over the future, because agency rents are what make AI rich.[7] Agency models could help give us a better understanding of the size of agency rents in Christiano's story, and for future AI systems more generally.

The size of agency rents is determined by a number of factors, including the agent's private information, the nature of the task, the noise in the principal's estimate of the value produced by the agent, and the degree of competition. For instance, more complex tasks tend to cause higher rents. From The (ir)resistible rise of agency rents:

In the presence of moral hazard, principals must leave rents to agents, to incentivize appropriate actions. The more complex and opaque the task delegated to the agent, the more difficult it is to monitor his actions, the larger his rents.

If, as AI agents become more intelligent, monitoring gets increasingly difficult, or tasks get more complex, then we would expect agency rents to increase.
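The monitoring effect can be made concrete with the textbook limited-liability moral hazard model. This is a minimal sketch with illustrative numbers of our own, not taken from any cited paper: the agent privately chooses high or low effort, the principal observes only a noisy "success" signal and pays a bonus on success, and incentive compatibility pins down the smallest bonus the principal can offer. The agent's expected pay above her outside option is the agency rent.

```python
def min_bonus(cost, p_high, p_low):
    """Smallest success-bonus w satisfying incentive compatibility:
    w * p_high - cost >= w * p_low, i.e. w >= cost / (p_high - p_low)."""
    return cost / (p_high - p_low)

def agency_rent(cost, p_high, p_low):
    """Agent's expected pay under high effort, net of effort cost,
    against a zero outside option (limited liability)."""
    w = min_bonus(cost, p_high, p_low)
    return p_high * w - cost

# Sharp monitoring: the success signal is very informative about effort.
sharp = agency_rent(cost=1.0, p_high=0.9, p_low=0.1)
# Noisy monitoring: effort barely moves the signal, so the rent balloons.
noisy = agency_rent(cost=1.0, p_high=0.6, p_low=0.4)
```

As the gap between `p_high` and `p_low` shrinks (monitoring gets harder), the incentive-compatible bonus and hence the rent grows: the sharp-monitoring rent here is 0.125 while the noisy-monitoring rent is 2.0.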

On the other hand, competitive pressures between AI agents might be greater (it's easy to copy and run an AI; it's hard to increase the human workforce by transferring human capital from one brain to another via teaching). This would limit rents:

The agents' desire to capture rents, however, could be kept in check by market forces and competition among [agents]. If each principal could run an auction with several, otherwise identical, [agents], he could select the agent with the smallest incentive problem, and hence the smallest rent.

Modelling the most relevant factors in an agency model seems like a tractable research question (we discuss some potential difficulties below). Economists have only just started thinking about AI, and there doesn't seem to be any work studying rent extraction by AI agents.
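The auction idea in the quote can be bolted onto the same kind of toy model (again with illustrative numbers of our own): each candidate agent has her own incentive problem, the principal computes the limited-liability rent each would command, and competition lets him hire the one commanding the least.

```python
def agency_rent(cost, p_high, p_low):
    # Limited-liability rent from the minimal incentive-compatible bonus.
    w = cost / (p_high - p_low)
    return p_high * w - cost

# Otherwise-identical candidate agents (e.g. copies of an AI) whose
# incentive problems differ only in how observable their effort is.
candidates = [
    {"cost": 1.0, "p_high": 0.9, "p_low": 0.1},
    {"cost": 1.0, "p_high": 0.8, "p_low": 0.3},
    {"cost": 1.0, "p_high": 0.7, "p_low": 0.4},
]

# The principal's "auction": select the agent with the smallest rent.
best = min(candidates, key=lambda a: agency_rent(**a))
```

The easier it is to source near-identical competitors, the closer the selected rent gets to zero, which is why cheap copying of AI agents could push in the opposite direction from harder monitoring.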

PAL and AI risk from "accidents"

Ben Garfinkel has called the class of risks most associated with Bostrom and Yudkowsky "risks from accidents". Garfinkel characterises the general story in the following terms:

First, the author imagines that a single AI system experiences a massive jump in capabilities. Over some short period of time, a single system becomes much more general or much more capable than any other system in existence, and in fact any human in existence. Then given the system, researchers specify a goal for it. They give it some input which is meant to communicate what behavior it should engage in. The goal ends up being something quite simple, and the system goes off and single-handedly pursues this very simple goal in a way that violates the full nuances of what its designers intended. Importantly, "At the limit you might worry that these safety failures could become so extreme that they could perhaps derail civilization on the whole."

These catastrophic accidents constitute the main worry.

If the risk scenario is adequately represented by a principal-agent problem, agency rents extracted by AI agents can be used to measure the cost of misalignment. This time agency rents are a better measure, because failure is expected to occur when the agent works on the task it was delegated.[8] The scenario implies very high agency rents, with the principal being made much worse off because he delegated the task to the agent.

As Garfinkel's nomenclature suggests, this story is about the designers being caught by surprise, not anticipating the actions the AI would take. The Wikipedia synopsis of Superintelligence also emphasizes that something unexpected occurs: "Solving the control problem is surprisingly difficult because most goals, when translated into machine-implementable code, lead to unforeseen and undesirable consequences." In other words, the principal is unaware of some specific catastrophically harmful actions that the agent can take to achieve its goal.[9] This could be because they incorrectly believe that the system doesn't have certain capabilities, or because they don't foresee that certain actions satisfy the agent's goal, as with perverse instantiation. Due to this, the agent takes actions that greatly harm the principal, at great benefit to itself.

PAL doesn’t tell us much about AI risk from accidents

Han­son’s cri­tique was aimed at Chris­ti­ano’s sce­nario, but it could equally ap­ply to this one. Is PAL at odds with this sce­nario?

As an AI agent be­comes more in­tel­li­gent, it’s ac­tion set will ex­pand, think­ing of new and some­times unan­ti­ci­pated ac­tions to achieve its goals. This may in­clude catas­trophic ac­tions that the prin­ci­pal is not aware of.[10] PAL can’t tell us what these ac­tions will be, nor if the prin­ci­pal will be aware of them.[11]

In­stead, the vast ma­jor­ity of prin­ci­pal-agent mod­els as­sume that the prin­ci­pal un­der­stands the en­vi­ron­ment perfectly, in­clud­ing perfect knowl­edge of the agent’s ac­tion set, while the premise of the ac­ci­dent sce­nario is that the prin­ci­pal is un­aware of a catas­trophic ac­tion that the agent could take. Be­cause the prin­ci­pal’s un­aware­ness is cen­tral, these mod­els as­sume, rather than show, that this source of AI risk does not ex­ist. They there­fore don’t tell us much about the plau­si­bil­ity of AI ac­ci­dents.

Microe­conomist Daniel Gar­rett ex­pressed this point nicely. We asked him about a hy­po­thet­i­cal ex­am­ple, slightly mis­re­mem­bered from Stu­art Rus­sell’s book, con­cern­ing an ad­vanced cli­mate con­trol AI sys­tem.[12] He replied:

You can eas­ily write down a model where the agent is re­warded ac­cord­ing to some out­come, and the prin­ci­pal isn’t aware the out­come can be achieved by some ac­tion the prin­ci­pal finds harm­ful. In your ex­am­ple, the out­come is the re­duc­tion of Co2 emis­sions. If the prin­ci­pal thinks car­bon se­ques­tra­tion is the only way to achieve this, but doesn’t think of an­other chem­i­cal re­ac­tion op­tion which would in­di­rectly kill ev­ery­one, she could end up pro­vid­ing in­cen­tives to kill ev­ery­one. The fact this con­clu­sion is so im­me­di­ate may ex­plain why this kind of un­aware­ness by the prin­ci­pal is given lit­tle at­ten­tion in the liter­a­ture. The prin­ci­pal-agent liter­a­ture should not be un­der­stood as say­ing that these kinds of in­cen­tives with per­verse out­comes can­not hap­pen. (our em­pha­sis)

PAL mod­els do typ­i­cally have mod­est agency rents; they typ­i­cally don’t model the prin­ci­pal as be­ing un­aware of ac­tions with catas­trophic con­se­quences. But this is the situ­a­tion dis­cussed by pro­po­nents of AI ac­ci­dent risk, so we can’t in­fer much from PAL ex­cept that such a situ­a­tion has not been of much in­ter­est to economists.
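Garrett's point that the bad outcome follows immediately from the unawareness assumptions can be written down in a few lines. This is a toy sketch loosely following the climate example; the action names, costs, and harm numbers are our own invention. The principal sets a bonus on the contracted outcome using only the actions she is aware of, while the agent optimizes over its full action set.

```python
# Each action: (achieves the contracted outcome?, cost to agent, harm to principal)
actions = {
    "carbon_sequestration": (True,  5.0, 0.0),
    "shirk":                (False, 0.0, 0.0),
    "novel_reaction":       (True,  1.0, 100.0),  # principal is unaware of this action
}
principal_aware_of = {"carbon_sequestration", "shirk"}

# The principal sets the smallest outcome-bonus making some action she is
# aware of (that achieves the outcome) worth more to the agent than shirking.
bonus = min(cost for name, (ok, cost, _) in actions.items()
            if ok and name in principal_aware_of) + 0.01

# The agent, rewarded only on the outcome, picks from its FULL action set.
def agent_payoff(name):
    ok, cost, _ = actions[name]
    return (bonus if ok else 0.0) - cost

chosen = max(actions, key=agent_payoff)
harm = actions[chosen][2]
```

The agent chooses `novel_reaction` (payoff 4.01 versus 0.01 for sequestration) and the principal suffers the harm she never priced in. The conclusion is baked into the assumptions, which is exactly why such models are rarely written down and why they can't tell us whether the unawareness itself is likely.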

Extending agency models doesn't seem promising for understanding AI risk from "accidents"

Most PAL models don't include the kind of unawareness needed to model the accident scenario, but extensions of this sort are certainly possible. However, we suspect trying to model AI risk in this way wouldn't be fruitful, for three main reasons.

Firstly, as Daniel Garrett suggests, we suspect the assumptions about the principal's unawareness of the agent's action set would imply the action chosen by the agent, and its consequences for the principal, in a fairly direct and uninteresting way. There is a (very) small sub-literature on unawareness in agency problems where one can find models like this. In one paper, a principal hires an agent to do a work task, but isn't aware that the agent can manipulate "short-run working performance at the expense of the employer's future benefit." The agent "is better off if he is additionally aware that he could manipulate the working performance," and "in the post-contractual stage, [the principal] is hurt by the manipulating action of [the agent]." However, the model didn't reveal anything unexpected about the situation, and the outcome was directly determined by the action set and unawareness assumptions.

Secondly, the major source of the uncertainty surrounding accident risk concerns whether the principal will be unaware of catastrophic agent actions. The agency literature can't help us reduce this uncertainty, as the unawareness is built into models' assumptions. For instance, AI scientist Yann LeCun thinks that harmful actions "are easily avoidable by simple terms in the objective". If LeCun implemented a superintelligent AI in this way, agency models couldn't tell us whether he had correctly covered all bases.

Lastly, the assumptions about the agent's action set would be highly speculative. We don't know what actions superintelligent systems might take to pursue their goals. Agency models must make assumptions about these actions, and we don't know what these assumptions should be.

In short, the uncertainty pertains to the assumptions of the model, not the way the assumptions translate into outcomes. PAL does not, and probably cannot, provide much evidence for or against this scenario.

General difficulties with using PAL to assess AI risk

We've discussed the most relevant considerations regarding what PAL can tell us about two specific visions of AI risk. We now discuss some difficulties relevant to a broader set of possible scenarios (including those just examined). We list the difficulties from most serious to least serious.

PAL models rarely consider weak principals and more capable agents[13]

AI risk scenarios typically involve the AI being more intelligent than humans. The types of problems that economists study usually don't have this feature, and there seem to be very few models where the principal is weaker than the agent. Despite extensive searching, including talking to multiple contract theorists, we were only able to find two papers with a principal who is more boundedly rational than the agent.[14] This is perhaps not so surprising given that bounded-rationality models are relatively rare, and when they do exist, they tend to bound both the principal and the agent in the same way, or have the principal more capable. The latter is because such a setup is more relevant to typical economic problems, e.g. "exploitative" contracting studies the mistakes made by an individual (the agent) when interacting with a more capable firm (the principal).

Microeconomist Takuro Yamashita agrees:

Most economic questions related to bounded rationality explored in the principal-agent literature are appropriately modelled by a bounded agent. It's certainly possible to bound the principal, but by and large this hasn't been done, just because of the nature of the questions that have been asked.

A recent review of Behavioural Contract Theory also finds that such models are rare:

In almost all applications, researchers assume that the agent (she) behaves according to one psychologically based model, while the principal (he) is fully rational and has a classical goal (usually profit maximization).

There doesn't seem to be, in Hanson's terms, a "large (mostly economic) literature on agency failures" with an intelligence gap relevant to AI risk.

PAL models are brittle

PAL models don't model agency problems in general. They consider very specific agency relationships, studied in highly structured environments. Conclusions can depend very sensitively on the assumptions used; findings from one model don't necessarily generalise to new situations. From the textbook Contract Theory:

The basic moral hazard problem has a fairly simple structure, yet general conclusions have been difficult to obtain...Very few general results can be obtained about the form of optimal contracts. However, this limitation has not prevented applications that use this paradigm from flourishing...Typically, applications have put more structure on the moral hazard problem under consideration, thus enabling a sharper characterization of the optimal incentive contract. (our emphasis)

Similar reasoning applies in adverse selection models, where the outcome is very sensitive to the mapping between effort and outcomes. Given an arbitrary problem, the optimal incentives can look like anything.

The agency problems studied by economists are typically quite different to the scenarios envisaged by AI risk proponents. Therefore, because of the brittleness of PAL models, we shouldn't be too surprised if the imagined AI risk outcomes aren't present in the existing literature. PAL, in its current form, might just not be of much use. Further, we should not expect there to be any generic answer to the question "How big are AI agency rents?": the answer will depend on the specific task the AI is doing and a host of other details.

Agency rents are too narrow a measure

As we've seen, AI risk scenarios can include bad outcomes that aren't agency rents, but that we nevertheless care about. When applying PAL to AI risk, care must be taken to distinguish between rents and other bad outcomes, and we cannot assume that a bad outcome necessarily means high rents.

PAL models typically assume contract enforceability

Stuart Armstrong argued that Hanson's critique doesn't work because PAL assumes contract enforceability, and with advanced AI, institutions might not be up to the task.[15] Indeed, contract enforceability is assumed in most of PAL, so it's an important consideration regarding the literature's applicability to AI scenarios more broadly.[16]

The assumption isn't plausible in pessimistic scenarios where human principals and institutions are insufficiently powerful to punish the AI agent, e.g. due to very fast take-off. But it is plausible when AIs are similarly smart to humans, and in scenarios where powerful AIs are used to enforce contracts. Furthermore, if we cannot enforce contracts with AIs then people will promptly realise this and stop using AIs; so we should expect contracts to be enforceable conditional upon AIs being used.[17]

There is a smaller sub-literature on self-enforcing contracts (seminal paper). Here contracts can be self-enforced because both parties have an interest in interacting repeatedly. We think these probably won't be helpful for understanding situations without contract enforceability, because in worlds where contracts aren't enforceable because of advanced AI, contracts likely won't be self-enforcing either. If AIs are powerful enough that institutions like the police and military can't constrain them, it seems unlikely that they'd have much to gain from repeated cooperative interactions with human principals. Why not make a copy of themselves to do the task, coerce humans into doing it, or cooperate with other advanced AIs?

PAL models typically assume AIs work for humans because they are paid

In reality AIs will probably not receive a wage, and will instead work for humans because that is their default behaviour. We think changing this would probably not make a big difference to agency models, because the wage could be substituted for other resources the AI cares about. For instance, AI needs compute to run. If we substitute "compute" for "wage", the agency rents that the agent extracts are additional compute that it can use for its own purposes.

There is a sub-literature on Optimal Delegation that does away with wages. This literature focuses on the best way to restrict the agent's action set. For AI agents, this is equivalent to AI boxing. We don't think this literature will be helpful; PAL doesn't study how realistic it is to box an AI successfully, it just assumes it's technologically possible. It therefore isn't informative about whether AI boxing will work.


There are similarities between the AI alignment and principal-agent problems, suggesting that PAL could teach us about AI risk. However, the situations economists have studied are very different to those discussed by proponents of AI risk, meaning that findings from PAL don't transfer easily to this context. There are a few main issues. The principal-agent setup is only a part of AI risk scenarios, making agency rents too narrow a metric. PAL models rarely consider agents more intelligent than their principals, and the models are very brittle. And the lack of insight from PAL unawareness models severely restricts their usefulness for understanding the accident risk scenario.

Nevertheless, extensions to PAL might still be useful. Agency rents are what might allow AI agents to accumulate wealth and influence, and agency models are the best way we have to learn about the size of these rents. These findings should inform a wide range of future scenarios, perhaps barring extreme ones like Bostrom/Yudkowsky.[18]

  1. Thanks to Wei Dai for pointing out a previous inaccuracy. ↩︎

  2. Agency rents are about e.g. working vs shirking. If the agent uses the money she earned to buy a gun and later shoot the principal, clearly this is very bad for the principal, but it's not captured by agency rents. ↩︎

  3. It's not totally clear to us why we should care about our fraction of influence over the future, rather than the total influence. Probably because the fraction of influence affects the total influence, influence being zero-sum and resources finite. ↩︎

  4. It wasn't clear to us from the original post, at least in Part 1 of the story with no conflict, that humans are better off in absolute terms. For instance, wording like "over time those proxies will come apart" and "People really will be getting richer for a while" seemed to suggest that things are expected to worsen. Given this, Hanson's interpretation (that Christiano's story implied massive agency rents) seems reasonable without further clarification. Ben Garfinkel mentioned an outside-view measure which he thought undermined the plausibility of Part 1: since the industrial revolution we seem to have been using more and more proxies, which are optimized for more and more heavily, but things have been getting better and better. So he also seems to have understood the scenario to mean things get worse in absolute terms. ↩︎

  5. Clarifying what it means for an AI system to earn and use rents also seems important, helping us make sure that the abstraction maps cleanly onto the practical scenarios we are envisaging. Relatedly, what traits would an AI system need to have for it to make sense to think of the system as "accumulating and using rents"? Rents can be cashed out in influence of many different kinds — a human worker might get a higher wage, or more free time — and what ends up occurring will depend on the capabilities of the AI systems. Concretely, money can be saved in a bank account, people can be influenced, or computer hardware can be bought and run. One example of an obvious capability constraint for AI: some AI systems will be "switched off" after they are run, limiting their ability to transfer rents through time. As AI agents will (initially) be owned by humans, historical instances of slaves earning rents seem worth looking into. ↩︎

  6. Although his scenario is more plausible if a smarter agent extracts more agency rents. ↩︎

  7. Hanson and Christiano agree on this point. Hanson: "Just as most wages that slaves earned above subsistence went to slave owners, most of the wealth generated by AI could go to the capital owners, i.e. their slave owners. Agency rents are the difference above that minimum amount." Christiano: "Agency rents are what makes the AI rich. It's not that computers would 'become rich' if they were superhuman, and they just aren't rich yet because they aren't smart enough. On the current trajectory computers just won't get rich." ↩︎

  8. One limitation is that rents are the cost to the principal, whereas the accident scenario has costs for all humanity. This distinction isn't especially important because in the accident scenario the outcome for the principal is catastrophic (i.e. extremely high agency rents), and this is what is potentially in tension with PAL. Nonetheless, we should keep in mind that the total costs of this scenario are not limited to agency rents, just as in Christiano's scenario. ↩︎

  9. Perhaps a more realistic framing: the principal is aware that there's some probability that the agent will take an unanticipated catastrophic action, without knowing what that action might be. Under competitive pressures, maybe in a time of war, it could be beneficial for the principal to delegate (in expectation) despite significant risk, while humanity is made worse off (in expectation). This, of course, would be modelled quite differently to the accident AI risk we consider in the text, and we suspect that economic models would confirm that principals would take the risk in sufficiently competitive scenarios. These models would focus on negative externalities of risky AI development, something more naturally studied in domains like public economics rather than with agency theory. In any case, we focus here on the more traditional AI risk framing along the lines of "you think you have the AI under control, but beware, you could be wrong". ↩︎

  10. AI accident risk will be large when the AI agent thinks of new actions that i) harm the principal, ii) further the agent's goals, and iii) the principal hasn't anticipated. ↩︎

  11. This is because claims about the actions available to the agent and the principal's awareness are part of PAL models' assumptions. We discuss this more below. ↩︎

  12. The correct example: "If you prefer solving environmental problems, you might ask the machine to counter the rapid acidification of the oceans that results from higher carbon dioxide levels. The machine develops a new catalyst that facilitates an incredibly rapid chemical reaction between ocean and atmosphere and restores the oceans' pH levels. Unfortunately, a quarter of the oxygen in the atmosphere is used up in the process, leaving us [humans] to asphyxiate slowly and painfully." ↩︎

  13. I.e. the principal's rationality is bounded to a greater extent than the agent's. ↩︎

  14. In the model in "Moral Hazard With Unawareness", either the principal's or the agent's rationality can be bounded. ↩︎

  15. As argued above, we don't think contract enforceability is the main reason Hanson's critique of Christiano fails; agency rents are just not unusually high in his scenario. ↩︎

  16. From Contract Theory: "The benchmark contracting situation that we shall consider in this book is one between two parties who operate in a market economy with a well-functioning legal system. Under such a system, any contract the parties decide to write will be enforced perfectly by a court, provided, of course, that it does not contravene any existing laws." ↩︎

  17. Thanks to Ben Garfinkel for pointing this out. ↩︎

  18. Robin Hanson pointed out to us that when thinking about strange future scenarios, we should try to think about similar strange scenarios that we have seen in the past (we are very sympathetic to this, despite our somewhat skeptical position regarding PAL). With this in mind, another field which seems worth looking into is Security, especially military security. National leaders have been assassinated by their guards; kings have been killed by their protectors. These seem like a closer analogue to many AI risk scenarios than the typical PAL setup. It seems important to understand what the major risk factors are in these situations, how people have guarded against catastrophic failures, and how this translates to cases of catastrophic AI risk. ↩︎