An even deeper atheism
(Cross-posted from my website. Podcast version here, or search for “Joe Carlsmith Audio” on your podcast app.
This essay is part of a series I’m calling “Otherness and control in
the age of AGI.” I’m hoping that individual essays can be read fairly
well on their own, but see
here
for brief summaries of the essays that have been released thus far.
Minor spoilers for Game of Thrones.)
In my last essay, I discussed Robin Hanson’s critique of the AI risk
discourse – and in particular, the accusation that this discourse
“others” the AIs, and seeks too much control over the values that steer
the future. I find some aspects of Hanson’s critique uncompelling and
implausible, but I do think he’s pointing at a real discomfort. In fact,
I think that when we bring certain other Yudkowskian vibes into view – and in particular, vibes related to the “fragility of value,” “extremal
Goodhart,” and “the tails come apart” – this discomfort should deepen
yet further. In this essay I explain why.
The fragility of value
Engaging with Yudkowsky’s work, I think it’s easy to take away something
like the following broad lesson: “extreme optimization for a
slightly-wrong utility function tends to lead to valueless/horrible
places.”
Thus, in justifying his claim that “any Future not shaped by a goal
system with detailed reliable inheritance from human morals and
metamorals, will contain almost nothing of worth,” Yudkowsky argues that
value is
“fragile.”
There is more than one dimension of human value, where if just that
one thing is lost, the Future becomes null. A single blow
and all value shatters. Not every single blow will
shatter all value—but more than one possible “single blow” will
do so.
For example, he suggests: suppose you get rid of boredom, and so spend
eternity “replaying a single highly optimized experience, over and over
and over again.” Or suppose you get rid of “contact with
reality,”
and so put people into experience machines. Or suppose you get rid of
consciousness, and so make a future of non-sentient flourishing.
Now, as Katja Grace points
out,
these are all pretty specific sorts of “slightly different.”[1] But
at times, at
least, Yudkowsky seems to suggest that the point generalizes to many
directions of subtle permutation: “if you have a 1000-byte exact
specification of worthwhile happiness, and you begin to mutate it,
the value created by the corresponding AI with the mutated definition
falls off rapidly.”
ChatGPT imagines “slightly mutated happiness.”
Can we give some sort of formal argument for expecting value fragility
of this kind? The closest I’ve seen is the literature on “extremal
Goodhart” – a specific variant of Goodhart’s law (Yudkowsky gives his description
here).[2]
Imprecisely, I think the thought would be something like: even if the
True Utility Function is similar enough to the Slightly-Wrong Utility
Function to be correlated within a restricted search space, extreme
optimization searches much harder over a much larger space – and within
that much larger space, the correlation between the True Utility and the
Slightly-Wrong Utility breaks down, such that getting maximal
Slightly-Wrong Utility is no update about the True Utility. Rather,
conditional on maximal Slightly-Wrong Utility, you should expect the
mean True Utility for a random point in the space. And if you’re bored,
in expectation, by a random point in the space (as Yudkowsky is, for
example, by a random arrangement of matter and energy in the lightcone),
then you’ll be disappointed by the results of extreme but Slightly-Wrong
optimization.
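To make the shape of that thought concrete, here is a minimal numerical sketch – my own toy, not anything drawn from the Goodhart literature. In it, a “true” utility likes worlds near a particular point, and a “slightly wrong” proxy agrees with it inside a familiar region but mistakenly rewards sheer extremity outside it. That outside-the-region divergence is built in by hand, which is exactly the contested assumption. With that assumption in place, though, the pattern above falls out: optimize the proxy over a small search space and you land somewhere the true utility likes; optimize it over a vastly larger one and you land somewhere roughly as bad as a random world.

```python
import numpy as np

# Toy illustration only: "worlds" are points in R^10, the "true" utility
# likes worlds near a particular point a, and the "slightly wrong" proxy
# agrees with it inside the familiar ball (radius 1) but mistakenly rewards
# sheer extremity outside it. That outside-the-ball divergence is an
# assumption of the toy, not something derived.
rng = np.random.default_rng(0)
dim = 10
a = rng.normal(size=dim)
a *= 0.5 / np.linalg.norm(a)   # the hypothetical "truly good" world, inside the familiar region

def true_utility(x):
    return -np.linalg.norm(x - a, axis=-1)

def proxy_utility(x):
    r = np.linalg.norm(x, axis=-1)
    return true_utility(x) + 10.0 * np.maximum(0.0, r - 1.0)

def search(radius, n=200_000):
    # Crude stand-in for "optimization pressure": sample n candidate worlds
    # uniformly in a ball of the given radius, keep the proxy's favorite.
    x = rng.normal(size=(n, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    x *= radius * rng.uniform(size=(n, 1)) ** (1 / dim)
    best = np.argmax(proxy_utility(x))
    return true_utility(x[best]), true_utility(x).mean()

for radius in [1.0, 100.0]:
    at_best, at_random = search(radius)
    print(f"search radius {radius:>5}: true utility at proxy-optimum = {at_best:7.2f}, "
          f"average over random worlds = {at_random:7.2f}")
```

The point isn’t the particular numbers; it’s that once the proxy and the true utility decorrelate over the larger space, harder search stops being good news.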
Now, this is not, in itself, any kind of airtight argument that any
utility function subject to extreme and unchecked optimization pressure
has to be exactly right. But amidst all this talk of edge
instantiation
and the hidden complexity of
wishes
and the King Midas
problem
and so on, it’s easy to take away that vibe.[3] That is, if it’s not
aimed precisely at the True Utility, intense optimization – even for
something kinda-like True Utility – can seem likely to grab the
universe and drive it in some ultimately orthogonal and
as-good-as-random direction (this is the generalized meaning of
“paperclips”). The tails come way, way apart.
I won’t, here, try to dive deep on whether value is fragile in this
sense (note that, at the least, we need to say a lot more about when and
why the correlation between the True Utility and the Slightly-Wrong
Utility breaks down). Rather, I want to focus on the sort of yang this
picture can prompt. In particular: Yudkowskian-ism generally assumes
that at least absent civilizational destruction or very active
coordination, the future will be driven by extreme optimization pressure
of some kind. Something is going to foom, and then drive the
accessible universe hard in its favored direction. Hopefully, it’s “us.”
But the more the direction in question has to be exactly right, lest
value shatter into paperclips, the tighter, it seems, we must grip the
wheel – and the more exacting our standards for who’s driving.
Human paperclippers?
And now, of course, the question arises: how different, exactly, are
human hearts from each other? And in particular: are they sufficiently
different that, when they foom, and even “on reflection,” they don’t end
up pointing in exactly the same direction? After all, Yudkowsky said,
above, that in order for the future to be non-trivially “of worth,”
human hearts have to be in the driver’s seat. But even setting aside the
insult, here, to the dolphins, bonobos, nearest grabby aliens, and so on – still, that’s only to specify a necessary condition. Presumably,
though, it’s not a sufficient condition? Presumably some human hearts
would be bad drivers, too? Like, I dunno, Stalin?
Now: let’s be clear, the AI risk folks have heard this sort of question
before. “Ah, but aligned with whom?” Very deep. And the Yudkowskians
respond with frustration. “I just told you that we’re all about to be
killed, and your mind goes to monkey politics? You’re fighting over the
poisoned
banana!”
And even if you don’t have Yudkowsky’s probability on doom, it is,
indeed, a potentially divisive and race-spurring frame – and one that
won’t matter if we all end up dead. There are, indeed, times to set
aside your differences – and especially, weird philosophical questions
about how much your differences diverge once they’re systematized into
utility functions and subjected to extreme optimization pressure—and
to unite in a common cause. Sometimes, the white walkers are invading,
and everyone in the realm needs to put down their disputes and head
north to take a stand together; and if you, like Cersei, stay behind, and
weaken the collective effort, and focus on making sure that your favored
lineage sits the Iron Throne if the white walkers are defeated – well,
then you are a serious asshole, and an ally of Moloch. If winter is
indeed coming, let’s not be like Cersei.
Let’s hope we can get this kind of evidence ahead of time.
Still: I think it’s important to ask, with Hanson, how the abstract
conceptual apparatus at work in various simple arguments for “AI
alignment” applies to “human alignment,” too. In particular: the human
case is rich with history, intuition, and hard-won-heuristics that the
alien-ness of the AI case can easily elide. And when yang goes wrong,
it’s often via giving in, too readily, to the temptations of
abstraction, to the neglect of something messier and more concrete (cf
communism, high-modernism-gone-wrong, etc). But the human case, at
least, offers more data to collide with – and various lessons, I’ll
suggest, worth learning. And anyway, even to label the AIs as the white
walkers is already to take for granted large swaths of the narrative
that Hanson is trying to contest. We should meet the challenge on its
own terms.
Plus, there are already some worrying flags about the verdicts that a
simplistic picture of value fragility will reach about “human
alignment.” Consider, for example, Yudkowsky’s examples above, of
utility functions that are OK with repeating optimal stuff over and over
(instead of getting “bored”), or with people having optimal experiences
inside experience machines, even without any “contact with
reality.”
Even setting aside questions about whether a universe filled to the brim
with bliss should count as non-trivially “of worth,”[4] there’s a
different snag: namely, that these are both value systems that a decent
number of humans actually endorse – for example, various of my friends
(though admittedly, I hang out in strange circles). Yet Yudkowsky seems
to think that the ethics these friends profess would shatter all value – and if they would endorse it on reflection, that makes them,
effectively, paperclippers relative to him. (Indeed, I even know
illusionist-ish
folks who are much less excited than Yudkowsky about deep ties between
consciousness and moral-importance. But this is a fringe-er view.)
Now, of course, the “on reflection” bit is important. And one route to
optimism about “human alignment” is to claim that most humans will
converge, on reflection, to sufficiently similar values that their
utility functions won’t be “fragile” relative to each other. In the
light of Reason, for example, maybe Yudkowsky and my friends would come
to agree about the importance of preserving boredom and reality-contact.
But even setting aside problems for the notion of “reflection” at
stake,
and questions about who will be disposed to “reflect” in the relevant
way, positing robust convergence in this respect is a strong,
convenient, and thus-far-undefended empirical hypothesis – and one
that, absent a defense, might prompt questions, from the atheists, about
wishful thinking.
Indeed, while it’s true that humans have various important similarities
to each other (bodies, genes, cognitive architectures, acculturation
processes) that do not apply to the AI case, nothing has yet been said
to show that these similarities are enough to overcome the “extremal
Goodhart” argument for value fragility. That argument, at least as I’ve
stated it, was offered with no obvious bounds on the values-differences
to which it applies – the problem statement, rather, was extremely
general. So while, yes, it condemned the non-human hearts – still, one
wonders: how many human hearts did it condemn along the way?
A quick glance at what happens when human values get “systematized” and
then “optimized super hard for” isn’t immediately encouraging. Thus,
here’s
Scott Alexander on the difference between the everyday cases
(“mediocristan”) on which our morality is trained, and the strange
generalizations the resulting moral concepts can imply:
The morality of Mediocristan is mostly uncontroversial. It doesn’t
matter what moral system you use, because all moral systems were
trained on the same set of Mediocristani data and give mostly the same
results in this area. Stealing from the poor is bad. Donating to
charity is good. A lot of what we mean when we say a moral system
sounds plausible is that it best fits our Mediocristani data that we
all agree upon...
The further we go toward the tails, the more extreme the divergences
become. Utilitarianism agrees that we should give to charity and
shouldn’t steal from the poor, because Utility, but take it far enough
to the tails and we should tile the universe with rats on heroin.
Religious morality agrees that we should give to charity and shouldn’t
steal from the poor, because God, but take it far enough to the tails
and we should spend all our time in giant cubes made of semiprecious
stones singing songs of praise. Deontology agrees that we should give
to charity and shouldn’t steal from the poor, because Rules, but take
it far enough to the tails and we all have to be libertarians.
From Alexander: “Mediocristan is like the route from Balboa Park to
West Oakland, where it doesn’t matter what line you’re on because
they’re all going to the same place. Then suddenly you enter
Extremistan, where if you took the Red Line you’ll end up in Richmond,
and if you took the Green Line you’ll end up in Warm Springs, on totally
opposite sides of the map...”
That is, Alexander suggests a certain pessimism about extremal Goodhart
in the human case. Different human value systems are similar, and
reasonably aligned with each other, within a limited distribution of
familiar cases, partly because they were crafted in order to capture
the same intuitive data-points. But systematize them and amp them up to
foom, and they decorrelate hard. Cf, too, the classical utilitarians and
the negative utilitarians. On the one hand, oh-so-similar – not just in
having human bodies, genes, cognitive architectures, etc, but in many
more specific ways (thinking styles, blogging communities, etc). And
yet, and yet – amp them up to foom, and they seek such different
extremes (the one, Bliss; and the other, Nothingness).
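Alexander’s subway picture can be turned into a toy model, too. Below is another hand-built sketch of mine (made-up features, made-up weights): two value systems share their weights on the handful of features that everyday cases actually exercise – they were “fit to the same Mediocristani data” – and have independent, unconstrained weights everywhere else. On familiar cases they correlate almost perfectly; optimized over the full option space, each system’s favorite corner looks roughly no better than random by the other’s lights.

```python
import numpy as np

# Another hand-built toy (not Alexander's model, or anyone's): two "value
# systems" fit to the same everyday data. They share weights on the few
# features that actually vary in Mediocristan, and have independent weights
# on features that everyday cases never exercise; that split is the
# assumption doing all the work here.
rng = np.random.default_rng(1)
d_seen, d_unseen = 5, 45
d = d_seen + d_unseen

shared = rng.normal(size=d_seen)
value_A = np.concatenate([shared, rng.normal(size=d_unseen)])
value_B = np.concatenate([shared, rng.normal(size=d_unseen)])

def score(weights, options):
    return options @ weights

# Mediocristan: the unexercised features barely move, so A and B agree closely.
familiar = 0.05 * rng.normal(size=(10_000, d))
familiar[:, :d_seen] = rng.normal(size=(10_000, d_seen))
corr = np.corrcoef(score(value_A, familiar), score(value_B, familiar))[0, 1]
print(f"correlation of A and B on familiar cases: {corr:.3f}")

# Extremistan: optimize each value system over the whole option box [-1, 1]^d.
# For a linear score, the optimum is just the corner matching the weight signs.
best_for_A = np.sign(value_A)
best_for_B = np.sign(value_B)
print(f"B's score at B's optimum:   {score(value_B, best_for_B):6.1f}")
print(f"B's score at A's optimum:   {score(value_B, best_for_A):6.1f}")
print(f"B's score at a random spot: {score(value_B, rng.uniform(-1, 1, size=d)):6.1f}")
```

That the disagreement lives entirely in the dimensions Mediocristan never probes is, of course, the design choice carrying the conclusion.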
Or consider this
diagnosis,
from Nate Soares of the Yudkowsky-founded Machine Intelligence Research
Institute, about how the AIs will end up with misaligned goals:
The first minds humanity makes will be a terrible
spaghetti-code mess,
with no clearly-factored-out “goal” that the surrounding cognition
pursues in a unified way. The mind will be more like a pile of
complex, messily interconnected kludges, whose ultimate behavior is
sensitive to
the particulars of
how it reflects and irons out the tensions within itself over time.
Sound familiar? Human minds, too, seem pretty spaghetti-code and
interconnected kludge-ish. We, too, are reflecting on and ironing-out
our internal tensions, in
sensitive-to-particulars
ways.[5] And remind me why this goes wrong in the AI case, especially
for AIs trained to be nice in various familiar human contexts? Well,
there are
various stories – but a core issue, for Yudkowsky and Soares, is the meta-ethical
anti-realism thing (though: less often named as such).
Here’s
Yudkowsky:
There’s something like a single answer, or a single bucket of answers,
for questions like ‘What’s the environment really like?’ and ‘How do I
figure out the environment?’ and ‘Which of my possible outputs
interact with reality in a way that causes reality to have certain
properties?‘… When you have a wrong belief, reality hits back at
your wrong predictions… In contrast, when it comes to a choice of
utility function, there are unbounded degrees of freedom and multiple
reflectively coherent fixpoints. Reality doesn’t ‘hit back’ against
things that are locally aligned with the loss function on a particular
range of test cases, but globally misaligned on a wider range of test
cases.[6]
That is, the instrumental reasoning bit – that part is constrained by
reality. But the utility function – that part is unconstrained. So even
granted a particular, nice-seeming pattern of behavior on a particular
limited range of cases, an agent reflecting on its values and “ironing
out its internal tensions” can just go careening off in a zillion
possible directions, with nothing except “coherence” (a very minimal
desideratum) and the contingencies of its starting-point to nudge the
process down any particular path. Ethical reflection, that is, is
substantially a free-for-all. So once the AI is powerful enough to
reflect, and to prevent you from correcting it, its reflection spins
away, unmoored and untethered, into the land where extremal Goodhart
bites, and value shatters into paperclips.
But: remind me what part of that doesn’t apply to humans? Granted,
humans and AIs work from different contingent starting-points – indeed,
worryingly much. But so, too, do different humans. Less, perhaps – but
how much less is necessary? What force staves off extremal Goodhart in
the human-human case, but not in the AI-human one? For example: what
prevents the classical utilitarians from splitting, on reflection, into
tons of slightly-different variants, each of whom uses a
slightly-different conception of optimal pleasure (hedonium-1,
hedonium-2, etc)?[7] And wouldn’t they, then, be paperclippers to each
other, what with their slightly-mutated conceptions of perfect
happiness? I hear the value of mutant happiness drops off fast...
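For what it’s worth, the “multiple reflectively coherent fixpoints” worry is easy to exhibit in toy form. In the sketch below – again my own construction, with a made-up one-dimensional “coherence” landscape standing in for whatever ironing-out-of-tensions reflection really involves – reflection is modeled as hill-climbing toward greater internal coherence, and agents who begin with only slightly different values settle into different reflective equilibria: hedonium-1 and hedonium-2, as it were.

```python
import numpy as np

def coherence(v):
    # A made-up, multimodal "internal consistency" score over a 1-D space of
    # candidate value systems. The multiple peaks stand in for "multiple
    # reflectively coherent fixpoints"; nothing about the shape is realistic.
    return np.cos(4.0 * v) - 0.05 * v ** 2

def reflect(v, steps=2000, lr=0.01, eps=1e-4):
    # "Reflection" as hill-climbing toward greater coherence via a numerical
    # gradient. Which peak you settle on depends on where you start.
    for _ in range(steps):
        grad = (coherence(v + eps) - coherence(v - eps)) / (2 * eps)
        v += lr * grad
    return v

for start in (0.75, 0.82):   # two agents with only slightly different starting values
    print(f"starting values {start:.2f} -> reflective equilibrium at {reflect(start):+.2f}")
```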
And we can worry about the human-human case for more mundane reasons,
too. Thus, for example, it’s often thought that a substantial part of
what’s going on with human values is either selfish or quite “partial.”
That is, many humans want pleasure, status, flourishing, etc for
themselves, and then also for their family, local community, and so on.
We can posit that this aspect of human values will disappear or
constrain itself on reflection, or that it will “saturate” to the point
where more impartial and cosmopolitan values start to dominate in
practice – but see above re: “convenient and substantive empirical
hypothesis” (and if “saturation” helps with extremal-Goodhart problems,
can you make the AI’s values saturate, too?). And absent such comforts,
“alignment” between humans looks harder to come by. Full-scale egoists,
for example, are famously “unaligned” with each other – Bob wants
blah-for-Bob, and Sally, blah-for-Sally. And the same dynamic can easily
re-emerge with respect to less extreme partialities. Cf, indeed, lots of
“alignment problems” throughout history.
Of course, we haven’t, throughout history, had to worry much about
alignment problems of the form “suppose that blah agent fooms, irons out
its contradictions into a consistent utility function, then becomes
dictator of the accessible universe and re-arranges all the matter and
energy to the configuration that maxes out that utility function.”
Yudkowsky’s mainline narrative asks us to imagine facing this problem
with respect to AI – and no surprise, indeed, that it looks unlikely to
go well. Indeed, on such a narrative, and absent the ability to make
your AI something other than an aspiring-dictator (cf “corrigibility,”
or as Yudkowsky puts
it,
building an AI that “doesn’t want exactly what we want, and yet
somehow fails to kill us and take over the galaxies despite that being a
convergent incentive there”[8]), the challenge of AI alignment amounts,
as Yudkowsky puts it, to the challenge of building a “Sovereign which
wants exactly what we extrapolated-want and is therefore safe to let
optimize all the future galaxies without it accepting any human input
trying to stop it.”
But assuming that humans are not “corrigible” (Yudkowsky, at least,
wants to eat the galaxies), then, especially if you’re taking extremal
Goodhart seriously, any given human does not appear especially “safe to
let optimize all the future galaxies without accepting any input,”
either – that’s, erm, a very high standard. But if that’s the standard
for being a “paperclipper,” then are most humans paperclippers relative
to each other?
Deeper into godlessness
We can imagine a view that answers “yes, most humans are paperclippers
relative to each other.” Indeed, we can imagine a view that takes
extremal Goodhart and “the tails come apart” so seriously that it
decides all the hearts, except its own, are paperclippers. After all,
those other hearts aren’t exactly the same as its own. And isn’t value
fragile, under extreme optimization pressure, to small differences? And
isn’t the future one of extreme optimization? Apparently, the only path
to a non-paperclippy future is for my heart, in particular, to be
dictator. It’s bleak, I know. My p(doom) is high. But one must be a
scout about such things.
In fact, we can be even more mistrusting. For example: you know what
might happen to your heart over time? It might change even a tiny bit!
Like: what happens if you read a book, or watch a documentary, or fall
in love, or get some kind of indigestion – and then your heart is
never exactly the same ever again, and not because of Reason, and then
the only possible vector of non-trivial long-term value in this bleak
and godless lightcone has been snuffed out?! Wait, OK, I have a plan:
this precise person-moment needs to become dictator. It’s rough, but
it’s the only way. Do you have the nano-bots ready? Oh wait, too late.
(OK, how about now? Dammit: doom again.)
Doom soon?
Now, to be clear: this isn’t Yudkowsky’s view. And one can see the
non-appeal. Still, I think some of the abstract commitments driving
Yudkowsky’s mainline AI alignment narrative have a certain momentum in
this direction. Here I’m thinking of e.g. the ubiquity of power-seeking
among smart-enough agents; the intense optimization to which a post-AGI
future will be subjected; extremal Goodhart; the fragility of value; and
the unmoored quality of ethical reflection given anti-realism. To avoid
seeing the hearts of others as paperclippy, one must either
reject/modify/complicate these commitments, or introduce some further,
more empirical element (e.g., “human hearts will converge to blah degree
on reflection”) that softens their blow. This isn’t, necessarily,
difficult – indeed, I think these commitments are
questionable/complicate-able along tons of dimensions, and that a
variety of open empirical and ethical questions can easily alter the
narrative at stake. But the momentum towards deeming more and more
agents (and agent-moments) paperclippers seems worth bearing in mind.
We can see this momentum as leading to a yet-deeper atheism. Yudkowsky’s
humanism, at least, has some trust in human hearts, and thus, in some
uncontrolled Other. But the atheism I have in mind, here, trusts only in
the Self, at least as the power at stake scales – and in the limit,
only in this slice of Self, the Self-Right-Now. Ultimately, indeed,
this Self is the only route to a good future. Maybe the Other matters as
a patient – but like God, they can’t be trusted with the wheel.
We can also frame this sort of atheism in Hanson’s language. In what
sense, actually, does Yudkowsky “other” the AIs? Well, basically, he
says that they can’t be trusted with power – and in particular, with
complete power over the trajectory of the future, which is what he
thinks they’re on track to get – because their values are too different
from ours. Hanson replies: aren’t the default future humans like that,
too? But this sort of atheism replies: isn’t everyone except for me
(or me-right-now) like that? Don’t I stand alone, surrounded on all
sides by orthogonality, as the only actual member of “us”? That is, to
whatever extent Yudkowsky “others” the paperclippers, this sort of
atheism “others” everyone.
Balance of power problems
Now: I don’t, here, actually want to debate, in depth, who exactly is
how-much-of-a-paperclipper, relative to whom. Indeed, I think that “how
much would I, on reflection, value the lightcone resulting from this
agent’s becoming superintelligent, ironing out their motivations into a
consistent utility function, and then optimizing the galaxies into the
configuration that maximizes that utility function?” is a question we
should be wary about focusing on – both in thinking about each other,
and in thinking about our AIs. And even if we ask it, I do actually
think that tons of humans would do way better-than-paperclips – both
with respect to not-killing-everyone (more in my next essay), and with
respect to making the future, as Yudkowsky puts it, a “Nice Place To
Live.”
Still, I think that noticing the way in which questions about AI
alignment arise with respect to our alignment-with-each-other can help
reframe some of the issues we face as we enter the age of AGI. For one
thing, to the extent extremal Goodhart doesn’t actually bite, with
respect to differences-between-humans, this might provide clues about
how much it bites with respect to different sorts of AIs, and to help us
notice places where over-quick talk of the “fragility of value” might
mislead. But beyond this, I think that bringing to mind the extremity of
the standard at stake in “how much do I like the optimal light-cone
according to a foomed-up and utility-function-ified version of this
agent” can help humble us about the sort of alignment-with-us we should
be expecting or hoping for from fellow-creatures – human and digital
alike – and to reframe the sorts of mechanisms at play in ensuring it.
In particular: pretty clearly, a lot of the problem here is coming from
the fact that you’re imagining any agent fooming, becoming dictator of
the lightcone, and then optimizing oh-so-hard. Yes, it’s scary (read:
catastrophic) when the machine minds do this. But it’s scary period.
And viewed in this light, the “alignment problem” begins to seem less
like a story about values, and more like a story about the balance of
power. After all: it’s not as though, before the AIs showed up, we were
all sitting around with exactly-the-same communal utility function – that famous foundation of our social order. And while we might or might
not be reasonably happy with what different others-of-us would do as
superintelligent dictators, our present mode of co-existence involves a
heavy dose of not having to find out. And intentionally so. Cf “checks
and balances,” plus a zillion other incentives, hard power constraints,
etc. Yes, shared ethical norms and values do some work, too (though
not, I think, in an especially utility-function shaped way). But we are,
at least partly, as atheists towards each other. How much is it a “human
values” thing, then, if we don’t trust an AI to be God?
Of course, a huge part of the story here is that AI might throw various
balances-of-power out the window, so a re-framing from “values problem”
to “balance of power problem” isn’t, actually, much comfort. And indeed,
I think it sometimes provides false comfort to people, in a way that
obscures the role that values still have to play. Thus, for example,
some people say “I reject Yudkowsky’s story that some particular AI will
foom and become dictator-of-the-future; rather, I think there will be a
multi-polar ecosystem of different AIs with different values. Thus:
problem solved?” Well, hmm: what values in particular? Is it all still
ultimately an office-supplies thing? If so, it depends how much you like
a complex ecosystem of staple-maximizers, thumb-tack-maximizers, and so
on – fighting, trading, etc. “Better than a monoculture.” Maybe, but
how much?[9] Also, are all the humans still dead?
Ok ok it wouldn’t be quite like this...
Clearly, not-having-a-dictator isn’t enough. Some stuff also needs to
be, you know, good. And this means that even in the midst of
multi-polarity, goodness will need some share of strength – enough, at
least, to protect itself. Indeed, herein lies Yudkowsky’s pessimism
about humans ever sharing the world peacefully with misaligned AIs. The
AIs, he assumes, will be vastly more powerful than the humans – sufficiently so that the humans will have basically nothing to offer in
trade or to protect themselves in conflict. Thus, on Yudkowsky’s model,
perhaps different AIs will strike some sort of mutually-beneficial deal,
and find a way to live in comparative harmony; but the humans will be
too weak to bargain for a place in such a social contract. Rather,
they’ll be nano-botted, recycled for their atoms, etc (or, if they’re
lucky, scanned and used in trade with aliens).
We can haggle about some of the details of Yudkowsky’s pessimism here
(see, e.g., this
debate
about the probability that misaligned AIs would be nice enough to at
least give us some tiny portion of lightcone; or
these
sorts of questions about whether the AIs will form a natural coalition
or find it easy to cooperate), but I’m sympathetic to the broad vibe: if
roughly all the power is held by agents entirely indifferent to your
welfare/preferences, it seems unsurprising if you end up getting treated
poorly. Indeed, a lot of the alignment problem comes down to this.
So ultimately, yes, goodness needs at least some meaningful hard power
backing and protecting it. But this doesn’t mean goodness needs to be
dictator; or that goodness seeks power in the same way that a
paperclip-maximizer does; or that goodness relates to
agents-with-different-values the way a paperclip-maximizer relates to
us. I think this difference is important, at least, from a purely
ethical perspective. But I think it might be important from a more
real-politik perspective as well. In the next essay, I’ll say more about
what I mean.
[1] “You could very analogously say ‘human faces are fragile’ because if you just leave out the nose it suddenly doesn’t look like a typical human face at all. Sure, but is that the kind of error you get when you try to train ML systems to mimic human faces? Almost none of the faces on thispersondoesnotexist.com are blatantly morphologically unusual in any way, let alone noseless.”
[2] I think Stuart Russell’s comment here – “A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable” – really doesn’t cut it.
[3] See also: “The tails come apart” and “Beware surprising and suspicious convergence.” Plus Yudkowsky’s discussion of Corrigible and Sovereign AIs here, both of which appeal to the notion of wanting “exactly what we extrapolated-want.”
[4] I’m no fan of experience machines, but still – yes? Worth paying a lot for over paperclips, I think.
[5] Indeed, Soares gives various examples of humans doing similar stuff here.
[6] See also Soares here.
[7] Thanks to Carl Shulman for suggesting this example, years ago. One empirical hypothesis here is that in fact, human reflection will specifically try to avoid leading to path-dependent conclusions of this kind. But again, this is a convenient and substantive empirical hypothesis about where our meta-reflection process will lead (and note that anti-realism assumes that some kind of path dependence must be OK regardless – e.g., you need ways of not caring about the fact that in some possible worlds, you ended up caring about paperclips).
[8] My sense is that Yudkowsky deems this behavior roughly as anti-natural as believing that 222+222=555, after exposure to the basics of math.
[9] And note that “having AI systems with lots of different values systems increases the chances that those values overlap with ours” doesn’t cut it, at least in the context of extremal Goodhart, because sufficient similarity with human values requires hitting such a narrow target so precisely that throwing more not aimed-well-enough darts doesn’t help much. And the same holds if we posit that the AI values will be “complex” rather than “simple.” Sure, human values are complex, so AIs with complex values are at least still in the running for alignment. But the space of possible complex value systems is also gigantic – so the narrow target problem still applies.