Success without dignity: a nearcasting story of avoiding catastrophe by luck

I’ve been trying to form a nearcast-based picture of what it might look like to suffer or avoid an AI catastrophe. I’ve written a hypothetical “failure story” (How we might stumble into AI catastrophe) and two “success stories” (one presuming a relatively gradual takeoff, one assuming a more discontinuous one).

Those success stories rely on a couple of key actors (a leading AI lab and a standards-and-monitoring organization) making lots of good choices. But I don’t think stories like these are our only hope. Contra Eliezer, I think we have a nontrivial1 chance of avoiding AI takeover even in a “minimal-dignity” future—say, assuming essentially no growth from here in the size or influence of the communities and research fields focused specifically on existential risk from misaligned AI, and no highly surprising research or other insights from these communities/fields either. (There are further risks beyond AI takeover; this post focuses on AI takeover.)

This is not meant to make anyone relax! Just the opposite—I think we’re in the “This could really go lots of different ways” zone where marginal effort is most valuable. (Though I have to link to my anti-burnout take after saying something like that.) My point is nothing like “We will be fine”—it’s more like “We aren’t stuck at the bottom of the logistic success curve; every bit of improvement in the situation helps our odds.”

I think “Luck could be enough” should be the strong default on priors,2 so in some sense I don’t think I owe tons of argumentation here (I think the burden is on the other side). But in addition to thinking “I haven’t heard knockdown arguments for doom,” I think it’s relevant that I feel like I can at least picture success with minimal dignity (while granting that many people will think my picture is vague, wishful and wildly unrealistic, and they may be right). This post will try to spell that out a bit.

It won’t have security mindset, to say the least: I’ll be sketching things out that “could work,” and it will be easy (for me and others) to name ways they could fail. But I think having an end-to-end picture of how this could look might be helpful for understanding my picture (and pushing back on it!).

I’ll go through:

  • How we could navigate the initial alignment problem:3 getting to the first point of having very powerful (human-level-ish), yet safe, AI systems.

    • For human-level-ish AIs, I think it’s plausible that the alignment problem is easy, trivial or nonexistent. (Also plausible that it’s fiendishly hard!)

    • If so, it could end up cheap and easy to intent-align human-level-ish AIs, such that such AIs end up greatly outnumbering misaligned ones—putting us in good position for the deployment problem (next point).

  • How we could navigate the deployment problem:4 reducing the risk that someone in the world will deploy irrecoverably dangerous systems, even though the basic technology exists to make powerful (human-level-ish) AIs safe. (This is often discussed through the lens of “pivotal acts,” though that’s not my preferred framing.5)

    • You can think of this as containing two challenges: stopping misaligned human-level-ish AI, and maintaining alignment as AI goes beyond human level.

    • A key point is that once we have aligned human-level-ish AI, the world will probably be transformed enormously, to the point where we should consider ~all outcomes in play.

  • (Briefly) The main arguments I’ve heard for why this picture is unrealistic/doomed.

  • A few more thoughts on the “success without dignity” idea.

As with many of my posts, I don’t claim personal credit for any new ground here. I’m leaning heavily on conversations with others, especially Paul Christiano and Carl Shulman.

The initial alignment problem

What happens if you train an AI using the sort of process outlined here—essentially, generative pretraining followed by reinforcement learning, with the latter refereed by humans?

I think danger is likely by default—but not assured. It seems to depend on a number of hard-to-predict things:

  • How accurate is reinforcement?

    • The greater an AI’s ability to get better performance by deceiving, manipulating or overpowering supervisors, the greater the danger.

    • There are a number of reasons (beyond explicit existential risk concern) that AI labs might invest heavily in accurate reinforcement, via techniques like task decomposition/amplification, recursive reward modeling, mechanistic interpretability, and using AIs to debate or supervise other AIs. Relatively moderate investments here could imaginably lead to highly accurate reinforcement.

  • How “natural” are intended generalizations (like “Do what the supervisor is hoping I’ll do, in the sense that most humans would mean this phrase rather than in a precise but malign sense”) vs. unintended ones (like “Do whatever maximizes reward”)?

    • It seems plausible that large amounts of generative pretraining could result in an AI having a suite of well-developed humanlike concepts, such as “Do what the supervisor is hoping I’ll do, in the sense that most humans would mean this phrase rather than in a precise technical sense”—and also such as “Fool the supervisor into thinking I did well.” But the latter could be hard enough to pull off successfully in the presence of a basic audit regime (especially for merely human-level-ish AI), and/or sufficiently in conflict with various learned heuristics, that it could be disadvantaged in training.

    • In this case, a relatively small amount of reinforcement learning could be enough to orient an AI toward policies that generalize as intended.

  • How much is training “outcomes-based vs. process-based”? That is, how much does it look like “An AI goes through a long episode, taking many steps that aren’t supervised or necessarily understood, and is ultimately subject to gradient descent based on whether humans approve of the outcome” vs. “Each local step the AI takes is subject to human supervision and approval”?

    • The former leaves a lot of scope for mistaken feedback that trains deception and manipulation. The latter could still in some sense train “doing what humans think they want rather than what they actually want,” but that’s quite different from training “Do whatever results in a seemingly good outcome,” and I think it’s noticeably less vulnerable to some of the key risks.

    • Outcomes-based training seems abstractly more “powerful,” and likely to be a big part of training the most powerful systems—but this isn’t assured. Today, training AIs based on outcomes of long episodes is unwieldy, and the most capable AIs haven’t had much of it.

  • How natural/necessary is it for a sufficiently capable AI to form ambitious goals and act like a “maximizer”?

    • There has been a lot of ink spilled about this, and I think it remains up in the air. I’ll just say that modern AI development methods are not clearly set up to produce “maximizers.”

For what it’s worth, it seems to me like the very shortest, straightest-line imaginable path to transformative AI from today looks relatively favorable on the above dimensions. Today’s most capable models mostly look like “lots of generative pretraining, tiny amounts of reinforcement learning,” which seems like a good thing at least from the point of view of (a) developing a large suite of human-like concepts; (b) a good chance that the RL part of the training can be highly “accurate” (few opportunities to get better performance by deceiving, manipulating or overpowering supervisors). It’s imaginable that AI models could become transformative with a handful of additions (see footnote6) that—while greatly changing capabilities and introducing serious alignment risks—wouldn’t clearly change what I just said.

I’d guess that the situation will get worse (e.g., more outcome-based training) compared to that picture, but I don’t know that. And even if it does, it still seems like we could end up with models that have human-level and human-like capabilities and lack ambitious goals of their own. (Even without a special effort.)

For an opposing view, see my dialogue with Nate Soares. I could imagine being about 70% confident of Nate’s view if I’d done a lot of thinking I haven’t done, but I doubt most people are in that sort of position or should trust Nate (or Eliezer) to be, to the point of being certain of danger by default.

Basic countermeasures

There are also possible countermeasures to alignment risk, identifiable today, that I can imagine going quite well, without needing new breakthrough insights (we would need luck, though). Examples:

Simple checks and balances. An example: we might train “AI watchdogs” to make the best arguments they can (ideally including discussions of internal states) that other AIs are dangerous, that specific measures can make them less dangerous, etc. Gradient descent could be used both to push against spurious arguments (as assessed by randomized deep dives) and to push toward more important and correct (according to human judges and auditors) critiques. There are tons of reasons this might not work, but I don’t think we should be shocked if it produces huge dividends, even coming from simply human-level AIs that are well-suited to finding flaws and dangers in models and outputs similar to themselves. These dividends could lead to highly accurate training that makes it hard to learn unintended generalizations, and/or to deployment scenarios where it’s overly hard for human-level-ish AIs to coordinate against humans.

Intense red-teaming. I’ve written about how adversarial training could “undo” a good deal of misalignment that has popped up during training, especially if something like relaxed adversarial training (simulating inputs that one can’t actually produce) turns out to be feasible. It’s plausible to me that AI companies might invest heavily in this kind of work without needing to be mostly motivated by existential risk reduction (they might be seeking strong guarantees against, e.g., lawsuit-driving behavior by AI systems).

Training on internal states. I think interpretability research could be useful in many ways, but some require more “dignity” than I’m assuming here7 and/or pertain to the “continuing alignment problem” (next section).8 If we get lucky, though, we could end up with some way of training AIs on their own internal states that works at least well enough for the initial alignment problem.

Training AIs on their own internal states risks simply training them to manipulate and/or obscure their own internal states, but this may be too hard for human-level-ish AI systems, so we might at least get off the ground with something like this.

A related idea is finding a regularizer that penalizes e.g. dishonesty, as in Eliciting Latent Knowledge.

It’s pretty easy for me to imagine that a descendant of the Burns et al. 2022 method, or an output of the Eliciting Latent Knowledge agenda, could fit this general bill without needing any hugely surprising breakthroughs. I also wouldn’t feel terribly surprised if, say, 3 more equally promising approaches emerged in the next couple of years.

The deployment problem

Once someone has developed safe, powerful (human-level-ish) AI, the threat remains that:

  • More advanced AI will be developed (including with the help of the human-level-ish AI), and it will be less safe, due to different development methods and less susceptibility to the basic countermeasures above.9

  • As it gets cheaper and easier for anyone in the world to build powerful AI systems, someone will do so especially carelessly and/or maliciously.

The situation has now changed in a few ways:

  • There’s now a lot more capacity for alignment research, threat assessment research (to make a more convincing case for danger and contribute to standards and monitoring), monitoring and enforcing standards, and more (because these things can be done by AIs). I think interpretability looks like a particularly promising area for “automated research”—AIs might grind through large numbers of analyses relatively quickly and reach a conclusion about the thought process of some larger, more sophisticated system.

  • There’s also a lot more capacity for capabilities research that could lead to more advanced, more dangerous AI.

  • For a good outcome, alignment research or threat assessment research doesn’t have to “keep up with” capabilities research for a long time—a strong demonstration of danger, or a decisive/scalable alignment solution, could be enough.

It’s hard to say how all these factors will shake out. But it seems plausible that one of these things will happen:

  • Some relatively cheap, easy, “scalable” solution to AI alignment (the sort of thing ARC is currently looking for) is developed and becomes widely used.

  • Some decisive demonstration of danger is achieved, and AIs also help to create a successful campaign to persuade key policymakers to aggressively work toward a standards and monitoring regime. (This could be a very aggressive regime if some particular government, coalition or other actor has a lead in AI development that it can leverage into a lot of power to stop others’ AI development.)

  • Something else happens to decisively change dynamics—for example, AIs turn out to be good enough at finding and patching security holes that the offense-defense balance in cybersecurity flips, and it becomes possible to contain even extremely capable AIs.

Any of these could lead to a world in which misaligned AI in the wild is at least rare relative to aligned AI. The advantage for humans+aligned-AIs could be self-reinforcing, as they use their greater numbers to push measures (e.g., standards and monitoring) to suppress misaligned AI systems.

I concede that we wouldn’t be totally out of the woods in this case—things might shake out such that highly-outnumbered misaligned AIs can cause existential catastrophe. But I think we should be optimistic by default from such a point. A footnote elaborates on this, addressing Steve Byrnes’s discussion of a related topic (which I quite liked and think raises good concerns, but isn’t decisive for the scenario I’m contemplating).10

More generally, I think it’s very hard to reason about a world with human-level-ish aligned AIs widely available (and initially outnumbering comparably powerful misaligned AIs), so I think we should not be too confident of doom starting from that point.

Some objections to this picture

The most common arguments I’ve heard for why this picture is hopeless involve some combination of:

  • AI systems could quickly become very powerful relative to their supervisors, which means we have to confront a harder version of the alignment problem without first having human-level-ish aligned systems.

    • I think it’s certainly plausible this could happen, but I haven’t seen a reason to put it at >50%.

    • To be clear, I expect an explosive “takeoff” by historical standards. I want to give Tom Davidson’s analysis more attention, but it implies that there could be mere months between human-level-ish AI and far more capable AI (though even that could be enough for a lot of work by human-level-ish AI).

    • One key question: to the extent that we can create a feedback loop with AI systems doing research to improve hardware and/or software efficiency (which then increases the size and/or capability of the “automated workforce,” enabling further research…), will this mostly be via increasing the number of AIs or by increasing per-AI capabilities? There could be a feedback loop with human-level-ish AI systems exploding in number, which seems to present fewer (though still significant) alignment challenges than a feedback loop with AI systems exploding past human capability.11

  • It’s arguably very hard to get even human-level-ish capabilities without ambitious misaligned aims. I discussed this topic at some length with Nate Soares—notes here. I disagree with this as a default (though, again, it’s plausible) for reasons given at that link.

  • Expecting “offense-defense” asymmetries (as in this post) such that we’d get catastrophe even if aligned AIs greatly outnumber misaligned ones. Again, this seems plausible, but not the right default guess for how things will go, as discussed at the end of the previous section.

I think all of these arguments are plausible, but very far from decisive (and indeed each seems individually <50% likely to me).

Success without dignity

This section is especially hand-wavy and conversational. I probably don’t stand by what you’d get from reading any particular sentence super closely and taking it super seriously. I do stand by the vague gesture this section is trying to make.

I have a high-level intuition that most successful human ventures look—from up close—like dumpster fires. I’m thinking of successful organizations—including those I’ve helped build—as well as cases where humans took highly effective interventions against global threats, e.g. smallpox eradication; recent advances in solar power that I’d guess are substantially traceable to subsidy programs; whatever reasons we haven’t had a single non-test nuclear detonation since 1945.

I expect the way AI risk is “handled by society” to look like a dumpster fire, in the sense that lots of good interventions will be left on the table, lots of very silly things will be done, and no intervention will be satisfyingly robust. Alignment measures will be fallible, standards regimes will be gameable, security setups will be imperfect, and even the best AI labs will have lots of incompetent and/or reckless people inside them doing scary things.

But I don’t think that automatically translates to existential catastrophe, and this distinction seems important. (An analogy: “that bednet has lots of gaping holes in it” vs. “that bednet won’t help” or “that person will get malaria.”) The future is uncertain; we could get lucky and stumble our way into a good outcome.

Furthermore, there are a number of interventions that could interact favorably with some baseline good luck. (I’ll discuss this more in a future post.)

One key strategic implication of this view that I think is particularly worth noting:

  • I think there’s a common headspace that says something like: “We’re screwed unless we get a miracle. Hence, ~nothing matters except for (a) buying time for that miracle to happen and (b) optimizing heavily for attracting and supporting unexpectedly brilliant people with unexpectedly great ideas.”

  • My headspace is something more like: “We could be doomed even in worlds where our interventions go as well as could be reasonably expected; we could be fine in worlds where they go ~maximally poorly; every little bit (of alignment research, of standards and monitoring, of security research, etc.) helps; and a lot of key interventions would benefit from things other than time and top intellectual talent—they’d benefit from alignment-concerned people communicating well, networking well, being knowledgeable about the existing AI state of the art, having good reputations with regulators and the general public, etc. etc. etc.”

  • That is, in my headspace, there are lots of things that can help—which also means that there are lots of factors we need to worry about. Many are quite ugly and unpleasant to deal with (e.g., PR and reputation). And there are many gnarly tradeoffs with no clear answer—e.g., I think there are things that hurt community epistemics12 and/or risk making the situation worse13 that still might be right to do.

  • I have some suspicion that the first headspace is self-serving for people who really don’t like dealing with that stuff and would rather focus exclusively on trying to do/support/find revolutionary intellectual inquiry. I don’t normally like making accusations like this (they rarely feel constructive), but in this case it feels like a bit of an elephant in the room—it seems like quite a strange view on priors to believe that revolutionary intellectual inquiry is the “whole game” for ~any goal, especially on the relatively short timelines many people have for transformative AI.

I don’t feel emotionally attached to my headspace. It’s nice to not think we’re doomed, but that’s not a very big deal for me,14 and I think I’d enjoy work premised on the first headspace above at least as much as work premised on the second one.

The second headspace is just what seems right at the moment. I haven’t seen convincing arguments that we won’t get lucky, and it seems to me like lots of things can amplify that luck into better odds of success. If I’m missing something correctible, I hope this will prompt discussion that leads there.

Notes


  1. Like >10%.

  2. Since another way of putting it is: “AI takeover (a pretty specific event) is not certain (conditioned on the ‘minimal-dignity’ conditions above, which don’t seem to constrain the future a ton).”

  3. Phase 1 in this analysis.

  4. Phase 2 in this analysis.

  5. I think there are alternative ways things could go well, which I’ll cover in the relevant section, so I don’t want to be stuck with a “pivotal acts” frame.

  6. Salient possible additions to today’s models:

    • Greater scale (more parameters, more pretraining)

    • Multimodality (training the same model on language + images, or perhaps video)

    • Memory/long contexts: it seems plausible that some relatively minor architectural modification could make today’s language models much better at handling very long contexts than today’s cutting-edge systems; e.g., they could efficiently identify which parts of an even very long context ought to be paid special attention at any given point. This could imaginably be sufficient for them to be “taught” to do tasks, in roughly the way humans are (e.g., I might give an AI a few examples of a successfully done task, ask it to try, critique it, and repeat this loop over the course of hundreds of pages of “teaching”—note that the “teaching” is simply building up a context it can consult for its next step; it is not using gradient descent).

    • Scaffolding: a model somewhat like today’s cutting-edge models could be put in a setting where it’s able to delegate tasks to copies of itself. Such tasks might include things like “Think about how to accomplish X, and send me some thoughts” and “That wasn’t good enough, think more please.” In this way, it could vary the amount of “thought” and effort it puts into different aspects of its task. It could also be given access to some basic actuators (shell access might be sufficient). None of this need involve further training, and it could imaginably give an AI enough of the functionality of things like “memory” to be quite capable.

    It’s not out of the question to me that we could get to transformative AI with additions like this, and with the vast bulk of the training still just being generative pretraining.
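    To make the “teaching via context, not gradient descent” idea concrete, here is a minimal, purely illustrative sketch. Everything in it is hypothetical: `model` is a trivial stand-in for a frozen long-context model (not any real API), and the point is only the shape of the loop—each round of attempt-plus-critique is appended to a growing context, and the model’s weights are never updated.

    ```python
    # Hypothetical sketch: "teaching" a frozen model by growing its context.
    # `model` stands in for a long-context language model; here it is just a
    # toy function of the context, so the loop's mechanics are visible.
    def model(context: str) -> str:
        # A frozen model: output depends only on the context it is given.
        # (Toy behavior: number its attempts by how many critiques it has seen.)
        return f"attempt-{context.count('CRITIQUE')}"

    def teach(task_examples, critiques):
        # Seed the context with worked examples, as a human teacher might.
        context = "\n".join(f"EXAMPLE: {e}" for e in task_examples)
        transcript = []
        for critique in critiques:
            attempt = model(context)  # the model consults the growing context
            transcript.append(attempt)
            # "Teaching" is only appending text -- no weight updates occur.
            context += f"\nATTEMPT: {attempt}\nCRITIQUE: {critique}"
        return model(context), transcript
    ```

    The design point the sketch isolates: all “learning” state lives in the context string, so nothing here requires gradient descent—only a model good enough at consulting a very long context.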

  7. E.g., I think interpretability could be very useful for demonstrating danger, which could lead to a standards-and-monitoring regime, but such a regime would be a lot more “dignified” than the worlds I’m picturing in this post.

  8. I think interpretability is very appealing as something that large numbers of relatively narrow “automated alignment researchers” could work on.

  9. Debate-type setups seem like they would get harder for humans to adjudicate as AI systems advance; more advanced AI seems harder to red-team effectively without its noticing “tells” re: whether it’s in training; and internal-state-based training seems more likely to result in “manipulating one’s own internal states” for more advanced AI.

  10. Byrnes’s post seems to assume there are relatively straightforward destruction measures that require draconian, scary “plans” to stop. (Contrast with my discussion here, in which AIs can be integrated throughout the economy in ways that make it harder for misaligned AIs to “get off the ground” with respect to being developed, escaping containment and acquiring resources.)

    • I don’t think this is the right default/prior expectation, given that we see little evidence of this sort of dynamic in history to date. (Relatively capable people who want to cause widespread destruction even at cost to themselves are rare, but do periodically crop up, and they don’t seem to have been able to effect these sorts of dynamics to date. Individuals have done a lot of damage by building followings and particularly via government power, but this seems very different from the type of dynamic discussed in Byrnes’s post.)

    • One could respond by pointing to particular vulnerabilities and destruction plans that seem hard to stop, but I haven’t been sold on anything along these lines, especially when considering that a relatively small number of biological humans’ surviving could still be enough to stop misaligned AIs (if we posit that aligned AIs greatly outnumber misaligned AIs). And I think misaligned AIs are less likely to cause any damage if the odds are against ultimately achieving their aims.

    • I note that Byrnes’s post also seems to assume that it’s greatly expensive and difficult to align an AI (I conjecture above that it may not be).

  11. The latter, more dangerous possibility seems more likely to me, but it seems quite hard to say. (There could also of course be a hybrid situation, as the number and capabilities of AIs grow.)

  12. I think optimizing for community epistemics has real downsides, both via infohazards/empowering bad actors and via reputational risks/turning off people who could be helpful. I wish this weren’t the case, and in general I heuristically tend to want to value epistemic virtue very highly, but it seems like it’s a live issue—I (reluctantly) don’t think it’s reasonable to treat “X is bad for community epistemics” as an automatic argument-ender about whether X is bad (though I do think it tends to be a very strong argument).

  13. E.g., working for an AI lab and speeding up AI (I plan to write more about this).

    More broadly, it seems to me like essentially all attempts to make the most important century go better also risk making it go a lot worse, and for anyone out there who might’ve done a lot of good to date, there are also arguments that they’ve done a lot of harm (e.g., by raising the salience of the issue overall).

    Even “Aligned AI would be better than misaligned AI” seems merely like a strong bet to me, not like a >95% certainty, given what I see as the appropriate level of uncertainty about topics like “What would a misaligned AI actually do, incorporating acausal trade considerations and suchlike?”; “What would humans actually do with intent-aligned AI, and what kind of universe would that lead to?”; and “How should I value various outcomes against each other, and in particular how should I think about hopes of very good outcomes vs. risks of very bad ones?”

    To reiterate, on balance I come down in favor of aligned AI, but I think the uncertainties here are massive—multiple key questions seem broadly “above our pay grade” as people trying to reason about a very uncertain future.

  14. I’m a person who just doesn’t pretend to be emotionally scope-sensitive or to viscerally feel the possibility of impending doom. I think it would be hard to do these things if I tried, and I don’t try because I don’t think that would be good for anyone.

    I like doing worthy-feeling work (I would be at least as happy with work premised on a “doomer” worldview as on my current one) and hanging out with my family. My estimated odds that I get to live a few more years vs. ~50 more years vs. a zillion more years are quite volatile and don’t seem to impact my daily quality of life much.

Crossposted from LessWrong