I only briefly skimmed this response, and will respond even more briefly.
Re: “Re: ‘AIs are white boxes’”
You apparently completely misunderstood the point we were making with the white box thing. It has ~nothing to do with mech interp. It’s entirely about whitebox optimization being better at controlling a system than blackbox optimization. This is true even if the person using the optimizer has no idea how the system functions internally.
Re: “Re: ‘Black box methods are sufficient’” (and the other stuff about evolution)
Evolution analogies are bad. There are many specific differences between ML optimization processes and biological evolution that predictably result in very different high-level dynamics. You should not rely on one to predict the other, as I have argued extensively elsewhere.
Trying to draw inferences about ML from biological evolution is only slightly less absurd than trying to draw inferences about cheesy humor from actual dairy products. Even though both can be called “optimization processes”, they’re completely different things with different causal structures, and crucially, those differences in causal structure explain their different outcomes. There’s thus no valid inference from “X happened in biological evolution” to “X will eventually happen in ML”, because X happening in biological evolution is explained by evolution-specific details that don’t appear in ML (at least for most alignment-relevant Xs that I see MIRI people reference often, like the sharp left turn).
Re: “Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between ‘AI will be able to figure out what humans want’ (yes; obviously; this was never under dispute) and ‘AI will care’”
This wasn’t the point we were making in that section at all. We were arguing about the order in which concepts are learned, and about how easy it is to internalize human values relative to other features an AI might base its decisions on. Our claim was that human values are easy features to learn / internalize / hook up to decision making, so on any natural progression up the capability ladder, you end up with an AI that’s aligned before you end up with one so capable it can destroy the entirety of human civilization by itself.
Re: “Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are.”
I think you badly misunderstood the post (e.g., multiple times assuming we’re making an argument we’re not, based on shallow pattern matching of the words used: interpreting “whitebox” as meaning mech interp, and “values are easy to learn” as “it will know human values”). I wish you’d either take the time to actually read and engage with the post in sufficient depth to avoid these sorts of mistakes, or not engage at all (or at least not be so rude when you do).
(Note that this next paragraph is speculation, but a possibility worth bringing up, IMO):
As it is, your response feels like you skimmed just long enough to pattern match our content to arguments you’ve previously dismissed, then regurgitated your cached responses to those arguments. Without further commenting on the merits of our specific arguments, I’ll just note that this is a very bad habit to have if you want to actually change your mind in response to new evidence/arguments about the feasibility of alignment.
Re: “Overall take: unimpressed.”
I’m more frustrated and annoyed than “unimpressed”. But I also did not find this response impressive.
(Didn’t consult Nora on this; I speak for myself.)