I think this is a crux. GPT-4 is only safe because it is weak. It is very far from being 100% aligned (see e.g. this boast from OpenAI, which is far from reassuring (“29% more often”), or the many, many jailbreaks), and 100% alignment is what will be needed for us to survive in the limit of superintelligence!
You go on to talk about robustness (to misuse) and how jailbreaks are a separate issue, but whilst that distinction may be important from the perspective of ML research (or AI capabilities research), the bottom line, ultimately, for all of us, is existential safety (x-safety).
I’ve folded all of the ways things could go wrong in terms of x-safety into my concept of alignment here[1]. Solving misuse (e.g. jailbreaks) is very much part of this! If we don’t, then in the limit of superintelligence all it takes is one bad actor directing their (to them “aligned”, by your definition) AI toward wiping out humanity, and we’re all dead (and yes, there are people who would press such a button if they had access to one).
Perhaps this broader concept of alignment would just be better referred to as x-safety.