Some thoughts on “AI could defeat all of us combined”
This week I found myself tracing back from Zvi’s To predict what happens, ask what happens (a) to Ajeya’s Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover (a) to Holden’s AI could defeat all of us combined (a).
A few thoughts on that last one.
First off, I’m really grateful that someone is putting in the work to clearly make the case to a skeptical audience that AI poses an existential risk. Noble work!
I notice that different parts of me (“parts” in the Internal Family Systems sense) have very different reactions to the topic. I can be super-egoically onboard with a point but other parts of my awareness (usually less conceptual, more “lower-down-in-the-body” parts) are freaked out and/or have serious objections.
I notice also an impulse to respond to these lower-level objections dismissively: “Shut up you stupid reptile brain! Can’t you see the logic checks out?! This is what matters!”
This… hasn’t been very productive.
Greg knows what’s up:
I’ve noticed that engaging AI-doomer content tends to leave pretty strong traces of anxiety-ish-ness in the body.
I’ve been finding it quite helpful to sit still and feel all this. Neither pushing away nor engaging thought.
The body knows how to do this.
I’m generally interested in how to weave together the worlds of healing/dharma/valence and EA/rationality/x-risk.
There’s a lot to say about that; one noticing is that arguments for taking seriously something charged and fraught like AI x-risk are received by an internally-fractured audience – different parts of a reader’s psychology react differently to the message, and it’s not enough to address just their super-egoic parts.
(Not a novel point but the IFS-style parts framework has helped me think about it more crisply.)
Now to the meat of Holden’s post. He gives this beautiful analogy, which I’m going to start using more:
At a high level, I think we should be worried if a huge (competitive with world population) and rapidly growing set of highly skilled humans on another planet was trying to take down civilization just by using the Internet. So we should be worried about a large set of disembodied AIs as well.
He then spends a lot of time drawing a distinction between “superintelligence risk” and “how AIs could defeat humans without superintelligence,” e.g.:
To me, this is most of what we need to know: if there’s something with human-like skills, seeking to disempower humanity, with a population in the same ballpark as (or larger than) that of all humans, we’ve got a civilization-level problem. [Holden’s emphasis]
But this assumes that the AI systems are able to coordinate fluidly (superhumanly?) across their population. Indeed he takes that as a premise:
So, for what follows, let’s proceed from the premise: “For some weird reason, humans consistently design AI systems (with human-like research and planning abilities) that coordinate with each other to try and overthrow humanity.”
A lot of his arguments for why an AI population like this would pose an existential threat to humanity (bribing/convincing/fooling/blackmailing humans, deploying military robots, developing infrastructure to secure themselves from being unplugged) seem to assume a central coordinating body, something like a strategy engine that’s able to maintain a high-fidelity, continually-updating world model and then develop and execute coordinated action plans on the basis of that world model. Something like the Diplomacy AI (a), except that instead of playing Diplomacy it’s playing real-world geopolitics.
Two thoughts on that:
(1) I don’t see how a coordinated population of AIs like that would be different from a superintelligence, so it’s unclear why the distinction matters (or I’m misunderstanding some nuance of it).
(2) It seems like someone would need to build at least a beta version of the real-world strategy engine to catalyze the feedback loops and the coordinated actions across an AI population.
I’ve been wondering about a broader version of (2) for a while now… a lot of the superintelligence risk arguments seem to implicitly assume a “waking up” point, at which a frontier AI system gains enough situational awareness to start power-seeking, or whatever other deviation from its intended purpose we’re worried about.
To be clear I’m not saying that this is impossible – that kind of self-awareness could well be an emergent capability of GPT-N, or AutoGPT++ could realize that it needs to really improve its world model and start to power-seek in order to achieve whatever goal. (It does seem like those sorts of moves would trigger a bunch of fire alarms though.)
I just wish that these assumptions were made more explicit in the AI risk discourse, especially as we start making the case to increasingly mainstream audiences.
e.g. Rob Bensinger wrote up a nice piecewise argument for AGI ruin (a), but piece (3) of his argument rolls what seem to me to be very particular, crux-y capabilities (e.g. something like this “waking up”) into the general category of capabilities improvement:
(3) High Early Capabilities. As a strong default, absent alignment breakthroughs or global coordination breakthroughs, early STEM-level AGIs will be scaled to capability levels that allow them to understand their situation, and allow them to kill all humans if they want.
It’s similar in Carlsmith’s six-step model (a), where advanced abilities are considered all together in step one:
Advanced capability: they outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering, and persuasion/manipulation).
Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.
Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.
I feel like all this could use more parsing out.
Which specific forms of awareness and planning would be required to develop and keep up-to-date a world model as good as e.g. the US military’s?
What fire alarms would progress along those dimensions trigger along the way?
How plausible is it that marginal development of the various current AI approaches unlocks these abilities?
n.b. I’m not setting this up as a knockdown argument for why superintelligence risk isn’t real, and I’m not an AI risk skeptic. Rather I’m presenting a research direction I’d like to understand better.
Cross-posted to my blog.