An even deeper atheism
(Cross-posted from my website. Podcast version here, or search for “Joe Carlsmith Audio” on your podcast app.
This essay is part of a series I’m calling “Otherness and control in
the age of AGI.” I’m hoping that individual essays can be read fairly
well on their own, but see
here
for brief summaries of the essays that have been released thus far.
Minor spoilers for Game of Thrones.)
In my last essay, I discussed Robin Hanson’s critique of the AI risk
discourse – and in particular, the accusation that this discourse
“others” the AIs, and seeks too much control over the values that steer
the future. I find some aspects of Hanson’s critique uncompelling and
implausible, but I do think he’s pointing at a real discomfort. In fact,
I think that when we bring certain other Yudkowskian vibes into view – and in particular, vibes related to the “fragility of value,” “extremal
Goodhart,” and “the tails come apart” – this discomfort should deepen
yet further. In this essay I explain why.
The fragility of value
Engaging with Yudkowsky’s work, I think it’s easy to take away something
like the following broad lesson: “extreme optimization for a
slightly-wrong utility function tends to lead to valueless/horrible
places.”
Thus, in justifying his claim that “any Future not shaped by a goal
system with detailed reliable inheritance from human morals and
metamorals, will contain almost nothing of worth,” Yudkowsky argues that
value is
“fragile.”
There is more than one dimension of human value, where if just that
one thing is lost, the Future becomes null. A single blow
and all value shatters. Not every single blow will
shatter all value—but more than one possible “single blow” will
do so.
For example, he suggests: suppose you get rid of boredom, and so spend
eternity “replaying a single highly optimized experience, over and over
and over again.” Or suppose you get rid of “contact with
reality,”
and so put people into experience machines. Or suppose you get rid of
consciousness, and so make a future of non-sentient flourishing.
Now, as Katja Grace points
out,
these are all pretty specific sorts of “slightly different.”[1] But
at times, at
least, Yudkowsky seems to suggest that the point generalizes to many
directions of subtle permutation: “if you have a 1000-byte exact
specification of worthwhile happiness, and you begin to mutate it,
the value created by the corresponding AI with the mutated definition
falls off rapidly.”
ChatGPT imagines “slightly mutated happiness.”
Can we give some sort of formal argument for expecting value fragility
of this kind? The closest I’ve seen is the literature on “extremal
Goodhart” – a specific variant of Goodhart’s law (Yudkowsky gives his description
here).[2]
Imprecisely, I think the thought would be something like: even if the
True Utility Function is similar enough to the Slightly-Wrong Utility
Function to be correlated within a restricted search space, extreme
optimization searches much harder over a much larger space – and within
that much larger space, the correlation between the True Utility and the
Slightly-Wrong Utility breaks down, such that getting maximal
Slightly-Wrong Utility is no update about the True Utility. Rather,
conditional on maximal Slightly-Wrong Utility, you should expect the
mean True Utility for a random point in the space. And if you’re bored,
in expectation, by a random point in the space (as Yudkowsky is, for
example, by a random arrangement of matter and energy in the lightcone),
then you’ll be disappointed by the results of extreme but Slightly-Wrong
optimization.
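To make the shape of that thought concrete, here is a minimal numerical sketch – my own toy, not anything drawn from the Goodhart literature. In it, a “true” utility likes worlds near a particular point, and a “slightly wrong” proxy agrees with it inside a familiar region but mistakenly rewards sheer extremity outside it. That outside-the-region divergence is built in by hand, which is exactly the contested assumption. With that assumption in place, though, the pattern above falls out: optimize the proxy over a small search space and you land somewhere the true utility likes; optimize it over a vastly larger one and you land somewhere roughly as bad as a random world.

```python
import numpy as np

# Toy illustration only: "worlds" are points in R^10, the "true" utility
# likes worlds near a particular point a, and the "slightly wrong" proxy
# agrees with it inside the familiar ball (radius 1) but mistakenly rewards
# sheer extremity outside it. That outside-the-ball divergence is an
# assumption of the toy, not something derived.
rng = np.random.default_rng(0)
dim = 10
a = rng.normal(size=dim)
a *= 0.5 / np.linalg.norm(a)   # the hypothetical "truly good" world, inside the familiar region

def true_utility(x):
    return -np.linalg.norm(x - a, axis=-1)

def proxy_utility(x):
    r = np.linalg.norm(x, axis=-1)
    return true_utility(x) + 10.0 * np.maximum(0.0, r - 1.0)

def search(radius, n=200_000):
    # Crude stand-in for "optimization pressure": sample n candidate worlds
    # uniformly in a ball of the given radius, keep the proxy's favorite.
    x = rng.normal(size=(n, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    x *= radius * rng.uniform(size=(n, 1)) ** (1 / dim)
    best = np.argmax(proxy_utility(x))
    return true_utility(x[best]), true_utility(x).mean()

for radius in [1.0, 100.0]:
    at_best, at_random = search(radius)
    print(f"search radius {radius:>5}: true utility at proxy-optimum = {at_best:7.2f}, "
          f"average over random worlds = {at_random:7.2f}")
```

The point isn’t the particular numbers; it’s that once the proxy and the true utility decorrelate over the larger space, harder search stops being good news.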
Now, this is not, in itself, any kind of airtight argument that any
utility function subject to extreme and unchecked optimization pressure
has to be exactly right. But amidst all this talk of edge
instantiation
and the hidden complexity of
wishes
and the King Midas
problem
and so on, it’s easy to take away that vibe.[3] That is, if it’s not
aimed precisely at the True Utility, intense optimization – even for
something kinda-like True Utility – can seem likely to grab the
universe and drive it in some ultimately orthogonal and
as-good-as-random direction (this is the generalized meaning of
“paperclips”). The tails come way, way apart.
I won’t, here, try to dive deep on whether value is fragile in this
sense (note that, at the least, we need to say a lot more about when and
why the correlation between the True Utility and the Slightly-Wrong
Utility breaks down). Rather, I want to focus on the sort of yang this
picture can prompt. In particular: Yudkowskian-ism generally assumes
that at least absent civilizational destruction or very active
coordination, the future will be driven by extreme optimization pressure
of some kind. Something is going to foom, and then drive the
accessible universe hard in its favored direction. Hopefully, it’s “us.”
But the more the direction in question has to be exactly right, lest
value shatter into paperclips, the tighter, it seems, we must grip the
wheel – and the more exacting our standards for who’s driving.
Human paperclippers?
And now, of course, the question arises: how different, exactly, are
human hearts from each other? And in particular: are they sufficiently
different that, when they foom, and even “on reflection,” they don’t end
up pointing in exactly the same direction? After all, Yudkowsky said,
above, that in order for the future to be non-trivially “of worth,”
human hearts have to be in the driver’s seat. But even setting aside the
insult, here, to the dolphins, bonobos, nearest grabby aliens, and so on – still, that’s only to specify a necessary condition. Presumably,
though, it’s not a sufficient condition? Presumably some human hearts
would be bad drivers, too? Like, I dunno, Stalin?
Now: let’s be clear, the AI risk folks have heard this sort of question
before. “Ah, but aligned with whom?” Very deep. And the Yudkowskians
respond with frustration. “I just told you that we’re all about to be
killed, and your mind goes to monkey politics? You’re fighting over the
poisoned
banana!”
And even if you don’t have Yudkowsky’s probability on doom, it is,
indeed, a potentially divisive and race-spurring frame – and one that
won’t matter if we all end up dead. There are, indeed, times to set
aside your differences – and especially, weird philosophical questions
about how much your differences diverge once they’re systematized into
utility functions and subjected to extreme optimization pressure—and
to unite in a common cause. Sometimes, the white walkers are invading,
and everyone in the realm needs to put down their disputes and head
north to take a stand together; and if you, like Cersei, stay behind, and
weaken the collective effort, and focus on making sure that your favored
lineage sits the Iron Throne if the white walkers are defeated – well,
then you are a serious asshole, and an ally of Moloch. If winter is
indeed coming, let’s not be like Cersei.
Let’s hope we can get this kind of evidence ahead of time.
Still: I think it’s important to ask, with Hanson, how the abstract
conceptual apparatus at work in various simple arguments for “AI
alignment” applies to “human alignment,” too. In particular: the human
case is rich with history, intuition, and hard-won-heuristics that the
alien-ness of the AI case can easily elide. And when yang goes wrong,
it’s often via giving in, too readily, to the temptations of
abstraction, to the neglect of something messier and more concrete (cf
communism, high-modernism-gone-wrong, etc). But the human case, at
least, offers more data to collide with – and various lessons, I’ll
suggest, worth learning. And anyway, even to label the AIs as the white
walkers is already to take for granted large swaths of the narrative
that Hanson is trying to contest. We should meet the challenge on its
own terms.
Plus, there are already some worrying flags about the verdicts that a
simplistic picture of value fragility will reach about “human
alignment.” Consider, for example, Yudkowsky’s examples above, of
utility functions that are OK with repeating optimal stuff over and over
(instead of getting “bored”), or with people having optimal experiences
inside experience machines, even without any “contact with
reality.”
Even setting aside questions about whether a universe filled to the brim
with bliss should count as non-trivially “of worth,”[4] there’s a
different snag: namely, that these are both value systems that a decent
number of humans actually endorse – for example, various of my friends
(though admittedly, I hang out in strange circles). Yet Yudkowsky seems
to think that the ethics these friends profess would shatter all value – and if they would endorse it on reflection, that makes them,
effectively, paperclippers relative to him. (Indeed, I even know
illusionist-ish
folks who are much less excited than Yudkowsky about deep ties between
consciousness and moral-importance. But this is a fringe-er view.)
Now, of course, the “on reflection” bit is important. And one route to
optimism about “human alignment” is to claim that most humans will
converge, on reflection, to sufficiently similar values that their
utility functions won’t be “fragile” relative to each other. In the
light of Reason, for example, maybe Yudkowsky and my friends would come
to agree about the importance of preserving boredom and reality-contact.
But even setting aside problems for the notion of “reflection” at
stake,
and questions about who will be disposed to “reflect” in the relevant
way, positing robust convergence in this respect is a strong,
convenient, and thus-far-undefended empirical hypothesis – and one
that, absent a defense, might prompt questions, from the atheists, about
wishful thinking.
Indeed, while it’s true that humans have various important similarities
to each other (bodies, genes, cognitive architectures, acculturation
processes) that do not apply to the AI case, nothing has yet been said
to show that these similarities are enough to overcome the “extremal
Goodhart” argument for value fragility. That argument, at least as I’ve
stated it, was offered with no obvious bounds on the values-differences
to which it applies – the problem statement, rather, was extremely
general. So while, yes, it condemned the non-human hearts – still, one
wonders: how many human hearts did it condemn along the way?
A quick glance at what happens when human values get “systematized” and
then “optimized super hard for” isn’t immediately encouraging. Thus,
here’s
Scott Alexander on the difference between the everyday cases
(“mediocristan”) on which our morality is trained, and the strange
generalizations the resulting moral concepts can imply:
The morality of Mediocristan is mostly uncontroversial. It doesn’t
matter what moral system you use, because all moral systems were
trained on the same set of Mediocristani data and give mostly the same
results in this area. Stealing from the poor is bad. Donating to
charity is good. A lot of what we mean when we say a moral system
sounds plausible is that it best fits our Mediocristani data that we
all agree upon...
The further we go toward the tails, the more extreme the divergences
become. Utilitarianism agrees that we should give to charity and
shouldn’t steal from the poor, because Utility, but take it far enough
to the tails and we should tile the universe with rats on heroin.
Religious morality agrees that we should give to charity and shouldn’t
steal from the poor, because God, but take it far enough to the tails
and we should spend all our time in giant cubes made of semiprecious
stones singing songs of praise. Deontology agrees that we should give
to charity and shouldn’t steal from the poor, because Rules, but take
it far enough to the tails and we all have to be libertarians.
From Alexander: “Mediocristan is like the route from Balboa Park to
West Oakland, where it doesn’t matter what line you’re on because
they’re all going to the same place. Then suddenly you enter
Extremistan, where if you took the Red Line you’ll end up in Richmond,
and if you took the Green Line you’ll end up in Warm Springs, on totally
opposite sides of the map...”
That is, Alexander suggests a certain pessimism about extremal Goodhart
in the human case. Different human value systems are similar, and
reasonably aligned with each other, within a limited distribution of
familiar cases, partly because they were crafted in order to capture
the same intuitive data-points. But systematize them and amp them up to
foom, and they decorrelate hard. Cf, too, the classical utilitarians and
the negative utilitarians. On the one hand, oh-so-similar – not just in
having human bodies, genes, cognitive architectures, etc, but in many
more specific ways (thinking styles, blogging communities, etc). And
yet, and yet – amp them up to foom, and they seek such different
extremes (the one, Bliss; and the other, Nothingness).
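Alexander’s subway picture can be turned into a toy model, too. Below is another hand-built sketch of mine (made-up features, made-up weights): two value systems share their weights on the handful of features that everyday cases actually exercise – they were “fit to the same Mediocristani data” – and have independent, unconstrained weights everywhere else. On familiar cases they correlate almost perfectly; optimized over the full option space, each system’s favorite corner looks roughly no better than random by the other’s lights.

```python
import numpy as np

# Another hand-built toy (not Alexander's model, or anyone's): two "value
# systems" fit to the same everyday data. They share weights on the few
# features that actually vary in Mediocristan, and have independent weights
# on features that everyday cases never exercise; that split is the
# assumption doing all the work here.
rng = np.random.default_rng(1)
d_seen, d_unseen = 5, 45
d = d_seen + d_unseen

shared = rng.normal(size=d_seen)
value_A = np.concatenate([shared, rng.normal(size=d_unseen)])
value_B = np.concatenate([shared, rng.normal(size=d_unseen)])

def score(weights, options):
    return options @ weights

# Mediocristan: the unexercised features barely move, so A and B agree closely.
familiar = 0.05 * rng.normal(size=(10_000, d))
familiar[:, :d_seen] = rng.normal(size=(10_000, d_seen))
corr = np.corrcoef(score(value_A, familiar), score(value_B, familiar))[0, 1]
print(f"correlation of A and B on familiar cases: {corr:.3f}")

# Extremistan: optimize each value system over the whole option box [-1, 1]^d.
# For a linear score, the optimum is just the corner matching the weight signs.
best_for_A = np.sign(value_A)
best_for_B = np.sign(value_B)
print(f"B's score at B's optimum:   {score(value_B, best_for_B):6.1f}")
print(f"B's score at A's optimum:   {score(value_B, best_for_A):6.1f}")
print(f"B's score at a random spot: {score(value_B, rng.uniform(-1, 1, size=d)):6.1f}")
```

That the disagreement lives entirely in the dimensions Mediocristan never probes is, of course, the design choice carrying the conclusion.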
Or consider this
diagnosis,
from Nate Soares of the Yudkowsky-founded Machine Intelligence Research
Institute, about how the AIs will end up with misaligned goals:
The first minds humanity makes will be a terrible
spaghetti-code mess,
with no clearly-factored-out “goal” that the surrounding cognition
pursues in a unified way. The mind will be more like a pile of
complex, messily interconnected kludges, whose ultimate behavior is
sensitive to
the particulars of
how it reflects and irons out the tensions within itself over time.
Sound familiar? Human minds, too, seem pretty spaghetti-code and
interconnected kludge-ish. We, too, are reflecting on and ironing-out
our internal tensions, in
sensitive-to-particulars
ways.[5] And remind me why this goes wrong in the AI case, especially
for AIs trained to be nice in various familiar human contexts? Well,
there are
various stories – but a core issue, for Yudkowsky and Soares, is the meta-ethical
anti-realism thing (though: less often named as such).
Here’s
Yudkowsky:
There’s something like a single answer, or a single bucket of answers,
for questions like ‘What’s the environment really like?’ and ‘How do I
figure out the environment?’ and ‘Which of my possible outputs
interact with reality in a way that causes reality to have certain
properties?‘… When you have a wrong belief, reality hits back at
your wrong predictions… In contrast, when it comes to a choice of
utility function, there are unbounded degrees of freedom and multiple
reflectively coherent fixpoints. Reality doesn’t ‘hit back’ against
things that are locally aligned with the loss function on a particular
range of test cases, but globally misaligned on a wider range of test
cases.[6]
That is, the instrumental reasoning bit – that part is constrained by
reality. But the utility function – that part is unconstrained. So even
granted a particular, nice-seeming pattern of behavior on a particular
limited range of cases, an agent reflecting on its values and “ironing
out its internal tensions” can just go careening off in a zillion
possible directions, with nothing except “coherence” (a very minimal
desideratum) and the contingencies of its starting-point to nudge the
process down any particular path. Ethical reflection, that is, is
substantially a free-for-all. So once the AI is powerful enough to
reflect, and to prevent you from correcting it, its reflection spins
away, unmoored and untethered, into the land where extremal Goodhart
bites, and value shatters into paperclips.
But: remind me what part of that doesn’t apply to humans? Granted,
humans and AIs work from different contingent starting-points – indeed,
worryingly much. But so, too, do different humans. Less, perhaps – but
how much less is necessary? What force staves off extremal Goodhart in
the human-human case, but not in the AI-human one? For example: what
prevents the classical utilitarians from splitting, on reflection, into
tons of slightly-different variants, each of whom uses a
slightly-different conception of optimal pleasure (hedonium-1,
hedonium-2, etc)?[7] And wouldn’t they, then, be paperclippers to each
other, what with their slightly-mutated conceptions of perfect
happiness? I hear the value of mutant happiness drops off fast...
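For what it’s worth, the “multiple reflectively coherent fixpoints” worry is easy to exhibit in toy form. In the sketch below – again my own construction, with a made-up one-dimensional “coherence” landscape standing in for whatever ironing-out-of-tensions reflection really involves – reflection is modeled as hill-climbing toward greater internal coherence, and agents who begin with only slightly different values settle into different reflective equilibria: hedonium-1 and hedonium-2, as it were.

```python
import numpy as np

def coherence(v):
    # A made-up, multimodal "internal consistency" score over a 1-D space of
    # candidate value systems. The multiple peaks stand in for "multiple
    # reflectively coherent fixpoints"; nothing about the shape is realistic.
    return np.cos(4.0 * v) - 0.05 * v ** 2

def reflect(v, steps=2000, lr=0.01, eps=1e-4):
    # "Reflection" as hill-climbing toward greater coherence via a numerical
    # gradient. Which peak you settle on depends on where you start.
    for _ in range(steps):
        grad = (coherence(v + eps) - coherence(v - eps)) / (2 * eps)
        v += lr * grad
    return v

for start in (0.75, 0.82):   # two agents with only slightly different starting values
    print(f"starting values {start:.2f} -> reflective equilibrium at {reflect(start):+.2f}")
```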
And we can worry about the human-human case for more mundane reasons,
too. Thus, for example, it’s often thought that a substantial part of
what’s going on with human values is either selfish or quite “partial.”
That is, many humans want pleasure, status, flourishing, etc for
themselves, and then also for their family, local community, and so on.
We can posit that this aspect of human values will disappear or
constrain itself on reflection, or that it will “saturate” to the point
where more impartial and cosmopolitan values start to dominate in
practice – but see above re: “convenient and substantive empirical
hypothesis” (and if “saturation” helps with extremal-Goodhart problems,
can you make the AI’s values saturate, too?). And absent such comforts,
“alignment” between humans looks harder to come by. Full-scale egoists,
for example, are famously “unaligned” with each other – Bob wants
blah-for-Bob, and Sally, blah-for-Sally. And the same dynamic can easily
re-emerge with respect to less extreme partialities. Cf, indeed, lots of
“alignment problems” throughout history.
Of course, we haven’t, throughout history, had to worry much about
alignment problems of the form “suppose that blah agent fooms, irons out
its contradictions into a consistent utility function, then becomes
dictator of the accessible universe and re-arranges all the matter and
energy to the configuration that maxes out that utility function.”
Yudkowsky’s mainline narrative asks us to imagine facing this problem
with respect to AI – and no surprise, indeed, that it looks unlikely to
go well. Indeed, on such a narrative, and absent the ability to make
your AI something other than an aspiring-dictator (cf “corrigibility,”
or as Yudkowsky puts
it,
building an AI that “doesn’t want exactly what we want, and yet
somehow fails to kill us and take over the galaxies despite that being a
convergent incentive there”[8]), the challenge of AI alignment amounts,
as Yudkowsky puts it, to the challenge of building a “Sovereign which
wants exactly what we extrapolated-want and is therefore safe to let
optimize all the future galaxies without it accepting any human input
trying to stop it.”
But assuming that humans are not “corrigible” (Yudkowsky, at least,
wants to eat the galaxies), then, especially if you’re taking extremal
Goodhart seriously, any given human does not appear especially “safe to
let optimize all the future galaxies without accepting any input,”
either – that’s, erm, a very high standard. But if that’s the standard
for being a “paperclipper,” then are most humans paperclippers relative
to each other?
Deeper into godlessness
We can imagine a view that answers “yes, most humans are paperclippers
relative to each other.” Indeed, we can imagine a view that takes
extremal Goodhart and “the tails come apart” so seriously that it
decides all the hearts, except its own, are paperclippers. After all,
those other hearts aren’t exactly the same as its own. And isn’t value
fragile, under extreme optimization pressure, to small differences? And
isn’t the future one of extreme optimization? Apparently, the only path
to a non-paperclippy future is for my heart, in particular, to be
dictator. It’s bleak, I know. My p(doom) is high. But one must be a
scout about such things.
In fact, we can be even more mistrusting. For example: you know what
might happen to your heart over time? It might change even a tiny bit!
Like: what happens if you read a book, or watch a documentary, or fall
in love, or get some kind of indigestion – and then your heart is
never exactly the same ever again, and not because of Reason, and then
the only possible vector of non-trivial long-term value in this bleak
and godless lightcone has been snuffed out?! Wait, OK, I have a plan:
this precise person-moment needs to become dictator. It’s rough, but
it’s the only way. Do you have the nano-bots ready? Oh wait, too late.
(OK, how about now? Dammit: doom again.)
Doom soon?
Now, to be clear: this isn’t Yudkowsky’s view. And one can see the
non-appeal. Still, I think some of the abstract commitments driving
Yudkowsky’s mainline AI alignment narrative have a certain momentum in
this direction. Here I’m thinking of e.g. the ubiquity of power-seeking
among smart-enough agents; the intense optimization to which a post-AGI
future will be subjected; extremal Goodhart; the fragility of value; and
the unmoored quality of ethical reflection given anti-realism. To avoid
seeing the hearts of others as paperclippy, one must either
reject/modify/complicate these commitments, or introduce some further,
more empirical element (e.g., “human hearts will converge to blah degree
on reflection”) that softens their blow. This isn’t, necessarily,
difficult – indeed, I think these commitments are
questionable/complicate-able along tons of dimensions, and that a
variety of open empirical and ethical questions can easily alter the
narrative at stake. But the momentum towards deeming more and more
agents (and agent-moments) paperclippers seems worth bearing in mind.
We can see this momentum as leading to a yet-deeper atheism. Yudkowsky’s
humanism, at least, has some trust in human hearts, and thus, in some
uncontrolled Other. But the atheism I have in mind, here, trusts only in
the Self, at least as the power at stake scales – and in the limit,
only in this slice of Self, the Self-Right-Now. Ultimately, indeed,
this Self is the only route to a good future. Maybe the Other matters as
a patient – but like God, they can’t be trusted with the wheel.
We can also frame this sort of atheism in Hanson’s language. In what
sense, actually, does Yudkowsky “other” the AIs? Well, basically, he
says that they can’t be trusted with power – and in particular, with
complete power over the trajectory of the future, which is what he
thinks they’re on track to get – because their values are too different
from ours. Hanson replies: aren’t the default future humans like that,
too? But this sort of atheism replies: isn’t everyone except for me
(or me-right-now) like that? Don’t I stand alone, surrounded on all
sides by orthogonality, as the only actual member of “us”? That is, to
whatever extent Yudkowsky “others” the paperclippers, this sort of
atheism “others” everyone.
Balance of power problems
Now: I don’t, here, actually want to debate, in depth, who exactly is
how-much-of-a-paperclipper, relative to whom. Indeed, I think that “how
much would I, on reflection, value the lightcone resulting from this
agent’s becoming superintelligent, ironing out their motivations into a
consistent utility function, and then optimizing the galaxies into the
configuration that maximizes that utility function?” is a question we
should be wary about focusing on – both in thinking about each other,
and in thinking about our AIs. And even if we ask it, I do actually
think that tons of humans would do way better-than-paperclips – both
with respect to not-killing-everyone (more in my next essay), and with
respect to making the future, as Yudkowsky puts it, a “Nice Place To
Live.”
Still, I think that noticing the way in which questions about AI
alignment arise with respect to our alignment-with-each-other can help
reframe some of the issues we face as we enter the age of AGI. For one
thing, to the extent extremal Goodhart doesn’t actually bite, with
respect to differences-between-humans, this might provide clues about
how much it bites with respect to different sorts of AIs, and to help us
notice places where over-quick talk of the “fragility of value” might
mislead. But beyond this, I think that bringing to mind the extremity of
the standard at stake in “how much do I like the optimal light-cone
according to a foomed-up and utility-function-ified version of this
agent” can help humble us about the sort of alignment-with-us we should
be expecting or hoping for from fellow-creatures – human and digital
alike – and to reframe the sorts of mechanisms at play in ensuring it.
In particular: pretty clearly, a lot of the problem here is coming from
the fact that you’re imagining any agent fooming, becoming dictator of
the lightcone, and then optimizing oh-so-hard. Yes, it’s scary (read:
catastrophic) when the machine minds do this. But it’s scary period.
And viewed in this light, the “alignment problem” begins to seem less
like a story about values, and more like a story about the balance of
power. After all: it’s not as though, before the AIs showed up, we were
all sitting around with exactly-the-same communal utility function – that famous foundation of our social order. And while we might or might
not be reasonably happy with what different others-of-us would do as
superintelligent dictators, our present mode of co-existence involves a
heavy dose of not having to find out. And intentionally so. Cf “checks
and balances,” plus a zillion other incentives, hard power constraints,
etc. Yes, shared ethical norms and values do some work, too (though
not, I think, in an especially utility-function shaped way). But we are,
at least partly, as atheists towards each other. How much is it a “human
values” thing, then, if we don’t trust an AI to be God?
Of course, a huge part of the story here is that AI might throw various
balances-of-power out the window, so a re-framing from “values problem”
to “balance of power problem” isn’t, actually, much comfort. And indeed,
I think it sometimes provides false comfort to people, in a way that
obscures the role that values still have to play. Thus, for example,
some people say “I reject Yudkowsky’s story that some particular AI will
foom and become dictator-of-the-future; rather, I think there will be a
multi-polar ecosystem of different AIs with different values. Thus:
problem solved?” Well, hmm: what values in particular? Is it all still
ultimately an office-supplies thing? If so, it depends how much you like
a complex ecosystem of staple-maximizers, thumb-tack-maximizers, and so
on – fighting, trading, etc. “Better than a monoculture.” Maybe, but
how much?[9] Also, are all the humans still dead?
Ok ok it wouldn’t be quite like this...
Clearly, not-having-a-dictator isn’t enough. Some stuff also needs to
be, you know, good. And this means that even in the midst of
multi-polarity, goodness will need some share of strength – enough, at
least, to protect itself. Indeed, herein lies Yudkowsky’s pessimism
about humans ever sharing the world peacefully with misaligned AIs. The
AIs, he assumes, will be vastly more powerful than the humans – sufficiently so that the humans will have basically nothing to offer in
trade or to protect themselves in conflict. Thus, on Yudkowsky’s model,
perhaps different AIs will strike some sort of mutually-beneficial deal,
and find a way to live in comparative harmony; but the humans will be
too weak to bargain for a place in such a social contract. Rather,
they’ll be nano-botted, recycled for their atoms, etc (or, if they’re
lucky, scanned and used in trade with aliens).
We can haggle about some of the details of Yudkowsky’s pessimism here
(see, e.g., this
debate
about the probability that misaligned AIs would be nice enough to at
least give us some tiny portion of lightcone; or
these
sorts of questions about whether the AIs will form a natural coalition
or find it easy to cooperate), but I’m sympathetic to the broad vibe: if
roughly all the power is held by agents entirely indifferent to your
welfare/preferences, it seems unsurprising if you end up getting treated
poorly. Indeed, a lot of the alignment problem comes down to this.
So ultimately, yes, goodness needs at least some meaningful hard power
backing and protecting it. But this doesn’t mean goodness needs to be
dictator; or that goodness seeks power in the same way that a
paperclip-maximizer does; or that goodness relates to
agents-with-different-values the way a paperclip-maximizer relates to
us. I think this difference is important, at least, from a purely
ethical perspective. But I think it might be important from a more
real-politik perspective as well. In the next essay, I’ll say more about
what I mean.
[1] “You could very analogously say ‘human faces are fragile’ because if you just leave out the nose it suddenly doesn’t look like a typical human face at all. Sure, but is that the kind of error you get when you try to train ML systems to mimic human faces? Almost none of the faces on thispersondoesnotexist.com are blatantly morphologically unusual in any way, let alone noseless.”
[2] I think Stuart Russell’s comment here – “A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable” – really doesn’t cut it.
[3] See also: “The tails come apart” and “Beware surprising and suspicious convergence.” Plus Yudkowsky’s discussion of Corrigible and Sovereign AIs here, both of which appeal to the notion of wanting “exactly what we extrapolated-want.”
[4] I’m no fan of experience machines, but still – yes? Worth paying a lot for over paperclips, I think.
[5] Indeed, Soares gives various examples of humans doing similar stuff here.
[6] See also Soares here.
[7] Thanks to Carl Shulman for suggesting this example, years ago. One empirical hypothesis here is that in fact, human reflection will specifically try to avoid leading to path-dependent conclusions of this kind. But again, this is a convenient and substantive empirical hypothesis about where our meta-reflection process will lead (and note that anti-realism assumes that some kind of path dependence must be OK regardless – e.g., you need ways of not caring about the fact that in some possible worlds, you ended up caring about paperclips).
[8] My sense is that Yudkowsky deems this behavior roughly as anti-natural as believing that 222+222=555, after exposure to the basics of math.
[9] And note that “having AI systems with lots of different values systems increases the chances that those values overlap with ours” doesn’t cut it, at least in the context of extremal Goodhart, because sufficient similarity with human values requires hitting such a narrow target so precisely that throwing more not aimed-well-enough darts doesn’t help much. And the same holds if we posit that the AI values will be “complex” rather than “simple.” Sure, human values are complex, so AIs with complex values are at least still in the running for alignment. But the space of possible complex value systems is also gigantic – so the narrow target problem still applies.