How could we know that an AGI system will have good consequences?

(Note: This was languishing in a drafts folder for a while, and probably isn’t quite right in various ways. I’m posting it because I expect it’s better to share flawed thoughts than to sit on the post until I’m satisfied with it, i.e., forever.)

Let’s play a game of “what do you think you know, and why do you think you know it?”.

Imagine that you’re about to launch an AGI. What you think you know is that, with at least 50% confidence (we’re of course not looking for proofs — that would be crazy), the AGI is going to execute some pivotal act that ends the acute risk period in a good way. Why do you think you know that?

Insofar as people’s alignment proposals can be construed as answers to this question, we have the option of answering with one of these proposals. I might very roughly classify the existing proposals into the following bins:

1. Output evaluation approaches. You know what the AGI is going to do with enough precision that this knowledge screens off any alignment concerns. For example, your AGI system only outputs plans in the first place, you’ve already reviewed the plan, and you’re confident it will work, in a way that screens off any other worry about the AGI being misaligned.

2. Cognitive interpretability approaches. You understand the AGI’s cognition sufficiently well that, while you may not be sure what it’s going to do, you’re confident that it’s going to be good. You aren’t worried that it will kill all humans, because you understand how its plan came to be and what solution-spaces it was searching to solve various sub-problems and so on, and you’re confident no consideration was ever given to human-killing.

3. Heavy-precedent approaches. You have run this AGI before on many similar tasks, and trained out all the hiccups. While you might not know precisely what it’s going to do, and you might not know what’s going on inside its mind, you’ve been around the block a few times, and the task it’s about to perform is sufficiently similar to other tasks it has empirically succeeded at that your confidence is justified.

Roughly speaking, I think that alignment approaches with a heavy reliance on output evaluation are doomed, both because humans can’t reliably evaluate the effectiveness of a plan capable of ending the acute risk period, and because the real plan is less like a story and more like a tree.

For an example of “humans can’t reliably evaluate the effectiveness of this class of plans”, imagine that the plan is an enormous bitstring that’s going to be sent to the motor outputs. If you decode the string, you find that it figures out how to make long DNA strands that allegedly code for a protein factory that can be used to build a general-purpose nanofactory. You’re hard-pressed, however, to confirm that this is actually (all and only) what the plan does.

For an example of “the real plan is less like a story and more like a tree”, imagine that the AI’s plan is “I’m going to build a wetlab, then do a bunch of experimentation, then think about the results of the experiments in various ways and build a protein factory that builds a nanofactory that I’m going to experiment with until I figure out how to build nanomachines that can be used for some good pivotal act”. In order to trust that this sort of abstract plan doesn’t kill you when put into practice, you have to trust the system’s thinking and its notion of ‘goodness’, which is going to dump you pretty quickly into cognitive-interpretability-style justification.
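The story/tree contrast can be made a little more concrete with a toy data structure. This is only an illustrative sketch: `PlanNode`, `is_story`, and `count_deferred` are hypothetical names invented here, not part of any actual proposal. A “story” is modeled as a single fixed chain of steps, while a “tree” has steps whose continuation depends on observations nobody has made yet, or is deferred to the system’s later thinking.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """One step of a toy plan (hypothetical structure, for illustration only)."""

    action: str
    # Maps a possible observation to the follow-up step taken if it occurs.
    branches: dict[str, PlanNode] = field(default_factory=dict)
    # True when the follow-up will be decided later by the system's own
    # thinking, so there is nothing concrete for a reviewer to read yet.
    deferred: bool = False


def is_story(node: PlanNode) -> bool:
    """A 'story' is one fixed chain: no deferral, at most one branch per step."""
    if node.deferred or len(node.branches) > 1:
        return False
    return all(is_story(child) for child in node.branches.values())


def count_deferred(node: PlanNode) -> int:
    """Count steps whose continuation cannot be reviewed up front."""
    own = 1 if node.deferred else 0
    return own + sum(count_deferred(child) for child in node.branches.values())


# A fully spelled-out chain of steps: everything is on the page.
story = PlanNode("step A", {"done": PlanNode("step B", {"done": PlanNode("step C")})})

# A plan shaped like the wetlab example: what happens next depends on
# results that do not exist yet.
tree = PlanNode(
    "build a wetlab",
    {
        "wetlab works": PlanNode("experiment, then decide what to build", deferred=True),
        "wetlab fails": PlanNode("diagnose and retry", deferred=True),
    },
)
```

In this toy, reviewing `tree` up front tells you nothing about the deferred steps. For those, you have to trust whatever will fill them in, which is the point of the paragraph above.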

Roughly speaking, I think that cognitive interpretability approaches are doomed, at least in the modern paradigm. We’re not building minds but rather training them, and we have very little grasp of their internal thinking. There are convergent instrumental reasons to expect things to go wrong by default. And the social environment doesn’t seem to me to be fighting against those defaults with anything nearing the force I expect is necessary.

Roughly speaking, I think that heavy-precedent approaches are doomed because I haven’t myself been able to think of any pivotal action that has safe analogs we can do a bunch of empiricism on; nor have I heard a concrete proposal like this that strikes me as realistic from anyone else. “Well, it never killed all humans in the toy environments we trained it in (at least, not after the first few sandboxed incidents, after which we figured out how to train blatantly adversarial-looking behavior out of it)” doesn’t give me much confidence. If you’re smart enough to design nanotech that can melt all GPUs or whatever (disclaimer: this is a toy example of a pivotal act, and I think better pivotal-act options than this exist) then you’re probably smart enough to figure out when you’re playing for keeps, and all AGIs have an incentive not to kill all “operators” in the toy games once they start to realize they’re in toy games.

So that’s not a great place to be.

Of these three, the case that cognitive interpretability approaches are doomed seems to me to be the weakest. And indeed, this is where it seems to me that many people are focusing their efforts, from one angle or another.

If I may continue coarsely classifying proposals in ways their advocates might not endorse, I’d bin a bunch of proposals I’ve heard as hybrid approaches that try to get cognitive-interpretability-style justification by way of heavy-precedent-style justification.

E.g., Paul Christiano’s plan prior to ELK was (very roughly, as I understood it) to somehow get ourselves into a position where we can say “I know the behavior of this system will be fine because I know that its cognition was only seeking fine outcomes, and I know its cognition was only seeking fine outcomes because its cognition is composed of human-esque parts, and I know that those human-esque parts are human-esque because we have access to the ground truth of short human thoughts, and because we have heavy-precedent-style empirical justification that the components of the overall cognition operate as intended.”

(This post was mostly drafted before ELK. ELK looks more to me like a different kind of interpretability+precedent hybrid approach — one that tries to get AGI-comprehension tools (for cognitive interpretability), and tries to achieve confidence in those tools via “we tried it and saw” arguments.)

I’m not very optimistic about such plans myself, mostly because I don’t expect the first working AGI systems to have architectures compatible with this plan, but secondarily because of the cognitive-interpretability parts of the justification. How do we string locally-human-esque reasoning chunks together in a way that can build nanotech for the purpose of a good pivotal act? And why can that sort of chaining not similarly result in a system that builds nanotech to Kill All Humans? And what made us confident we’re in the former case and not the latter?

But I digress. Maybe I’ll write more about that some other time.

Cf. Evan Hubinger’s post on training stories. From my perspective, training stories are focused pretty heavily on the idea that justification is going to come from a style more like heavily precedented black boxes than like cognitive interpretability, so I’m not too sold on his decomposition, but I endorse thinking about the question of how and where we could (allegedly) end up knowing that the AGI is good to deploy.

(Note that it’s entirely possible that I misunderstood Evan, and/or that Evan’s views have changed since that post.)

An implicit background assumption that’s loud in my models here is the assumption that early AGI systems will exist in an environment where they can attain a decisive strategic advantage over the rest of the world.

I believe this because of how the world looks “brittle” (e.g., nanotech exists) and because lots of technological progress seems cognition-constrained (such as, again, nanotech). This is a big part of why I think heavy-precedent-style justifications are doomed.

Many locals seem to expect a smoother and slower transition from here to superhumanly capable general-purpose science AI — a transition that somehow leaves no window where the world’s most competent AGI can unilaterally dominate the strategic landscape. I admit I have no concrete visualization of how that could go (and hereby solicit implausibly-detailed stories to make such scenarios seem more plausible to me, if you think outcomes like this are likely!). Given that I have a lot of trouble visualizing such worlds, I’m not a good person to talk about where our justifications could come from in those worlds.

I might say more on this topic later, but for now I just want to share this framing, and solicit explicit accounts of how we’re supposed to believe that your favorite flavor of AGI is going to do good stuff.