The positive case is pretty straightforward: we're trying very hard to make these systems aligned, and almost all the data we're dumping into them is generated by humans and is therefore dripping with human values and concepts.
I also think we have strong evidence from ML research that ANN generalization is due to symmetries in the parameter-function map which seem generic enough that they would apply mutatis mutandis to human brains, which also have a singular parameter-function map (see e.g. here).
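To make the symmetry point concrete, here's a minimal sketch of my own (a toy illustration, not anything specific from the linked work): in a one-hidden-layer ReLU network, rescaling a unit's incoming weight up and its outgoing weight down by the same factor leaves the computed function unchanged, so many distinct parameter settings map to the exact same function.

```python
import numpy as np

# Toy one-hidden-layer ReLU net: f(x) = sum_i a[i] * relu(w[i] * x).
# Illustrative only -- a minimal example of a parameter-function-map symmetry.
def f(x, w, a):
    return a @ np.maximum(w * x, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(size=5)
a = rng.normal(size=5)
c = 3.0  # any positive rescaling factor

x = rng.normal(size=100)
y1 = np.array([f(xi, w, a) for xi in x])
# Scale each unit's input weight up by c and its output weight down by c:
# ReLU is positively homogeneous, so the overall function is identical.
y2 = np.array([f(xi, c * w, a / c) for xi in x])

print(np.allclose(y1, y2))  # True: different parameters, same function
```

This rescaling (together with permutations of hidden units) makes the parameter-function map many-to-one, which is the kind of degeneracy I mean when I say the map is "singular."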
I do in fact think that evidence from evolution suggests that values are strongly contingent on the kinds of selection pressures which produced various species.
I'm not really sure what you're getting at here, or why this is supposed to help your side.
What does this even mean? I’m pretty skeptical of the realist attitude toward “goals” that seems to be presupposed in this statement. Goals are just somewhat useful fictions for predicting a system’s behavior in some domains. But the abstraction is leaky, and it will lead you astray if you take it too seriously or apply it outside the domain it was designed for.
We clearly can steer an AI's behavior really well in the training environment. The question is whether that steering generalizes, so it becomes a question of deep learning generalization. I think our current evidence from LLMs strongly suggests they'll generalize pretty well to unseen domains. And as I said in the essay, I don't think the whole jailbreaking thing is any evidence for pessimism: it's exactly what you'd expect of aligned human mind uploads in the same situation.