On the first point, my objection is that the human-level regime is special (because human-level systems are capable of self-reflection, deception, etc.) regardless of which methods ultimately produce systems in that regime, or how “spiky” they are.
A small, relatively gradual jump within the human-level regime is plausibly more than enough to enable an AI to outsmart, hide from, or deceive humans, via e.g. a few key insights gleaned from reading a corpus of neuroscience, psychology, and computer security papers, over the course of a few hours of wall-clock time.
The second point is exactly what I’m saying is unsupported, unless you have already rejected the SLT argument. You say in the post that you don’t expect catastrophic interference between current alignment methods, but you don’t consider that a human-level AI will be capable of reflecting on those methods (and on their actual implementation, which might be buggy).
Similarly, elsewhere in the piece you say:
Once you condition on this specific failure mode of evolution, you can easily predict that humans would undergo a sharp left turn at the point where we could pass significant knowledge across generations. I don’t think there’s anything else to explain here, and no reason to suppose some general tendency towards extreme sharpness in inner capability gains.
And
In my frame, we’ve already figured out and applied the sharp left turn to our AI systems, in that we don’t waste our compute on massive amounts of incredibly inefficient neural architecture search, hyperparameter tuning, or meta optimization.
But again, the actual SLT argument is not about “extreme sharpness” in capability gains. It is an argument that applies to the human-level regime and above, so we can’t already be past it, no matter what frame you use. The version of the SLT argument you argue against is a strawman, which is what my original LW comment was pointing out.
I think readers can see this for themselves if they just re-read the SLT post carefully, particularly footnotes 3-5, and then re-read the parts of your post where you talk about it.
[edit: I also responded further on LW here.]