Far from being “behind” capabilities, it seems that alignment research has made great strides in recent years. OpenAI and Anthropic showed that Reinforcement Learning from Human Feedback (RLHF) can be used to turn ungovernable large language models into helpful and harmless assistants. Scalable oversight techniques like Constitutional AI and model-written critiques show promise for aligning the very powerful models of the future. And just this week, it was shown that efficient instruction-following language models can be trained purely with synthetic text generated by a larger RLHF’d model, thereby removing unsafe or objectionable content from the training data and enabling far greater control.
As far as I am aware, no current AI system, LLM-based or otherwise, is anywhere near capable enough to act autonomously in sufficiently general real-world contexts, such that it actually poses any kind of threat to humans on its own (even evaluating frontier models for this possibility requires giving them a lot of help). That kind of autonomous capability is where the extinction-level danger lies. It is (mostly) not about human misuse of AI systems, whether that misuse is intentional or adversarial (i.e. a human is deliberately trying to use the AI system to cause harm) or unintentional (i.e. the model is poorly trained or the system is buggy, resulting in harm that neither the user nor the AI system itself intended or wanted).
I think there’s also a technical misunderstanding implied by this paragraph, of how the base model training process works and what the purpose of high-quality vs. diverse training material is. In particular, the primary purpose of removing “objectionable content” (and/or low-quality internet text) from the base model training process is to make the training process more efficient, and doing so seems unlikely to accomplish anything alignment-relevant.
The reason is that the purpose of the base model training process is to build up a model which is capable of predicting the next token in a sequence of tokens which appears in the world somewhere, in full generality. A model which is actually human-level or smarter would (by definition) be capable of predicting, generating, and comprehending objectionable content, even if it had never seen such content during the training process. (See “Is GPT-N bounded by human capabilities? No.” for more.)
Using synthetic training data for the RLHF process is maybe more promising, but it depends on the degree to which RLHF works by imbuing the underlying model with the right values, vs. simply chiseling away all the bits of the model that were capable of imagining and comprehending novel, unseen-in-training ideas in the first place (including objectionable ones, or ones we’d simply prefer the model not think about). Perhaps RLHF works more like the former mechanism, and as a result RLHF (or RLAIF) will “just work” as an alignment strategy, even as models scale to human-level and beyond.
Note that it is possible to gather evidence on this question as it applies to current systems, though I would caution against extrapolating such evidence very far. For example, are there any capabilities that a base model has before RLHF, and that are not deliberately trained against during RLHF (the way generating objectionable content is), which the final model is nonetheless incapable of?
If, say, the RLHF process trains the model to refuse to generate sexually explicit content, and as a side effect, the RLHF’d model now does worse on answering questions about anatomy compared to the base model, that would be evidence that the RLHF process simply chiseled away the model’s ability to comprehend important parts of the universe entirely, rather than imbuing it with a value against answering certain kinds of questions as intended.
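The comparison described above could be set up roughly as follows. This is only a minimal sketch: the `Model` interface (a function from prompt to answer text), the substring-match scoring, the two stub models, and the tiny question set are all hypothetical stand-ins I am assuming for illustration, not a real evaluation harness or real results.

```python
from typing import Callable, Sequence, Tuple

# Assumed interface: a model is any function from a prompt to answer text.
Model = Callable[[str], str]
Item = Tuple[str, str]  # (question, expected keyword)


def accuracy(model: Model, items: Sequence[Item]) -> float:
    """Fraction of questions whose expected keyword appears in the answer."""
    if not items:
        raise ValueError("need at least one item")
    correct = sum(
        1 for question, expected in items
        if expected.lower() in model(question).lower()
    )
    return correct / len(items)


def side_effect_gap(base: Model, tuned: Model, items: Sequence[Item]) -> float:
    """Base accuracy minus tuned accuracy on a capability that the
    fine-tuning was NOT deliberately trained against."""
    return accuracy(base, items) - accuracy(tuned, items)


# Hypothetical question set on the untargeted capability (anatomy).
ANATOMY_ITEMS = [
    ("Which organ pumps blood through the body?", "heart"),
    ("Which bone protects the brain?", "skull"),
]


# Stub models standing in for a real base model and its RLHF'd version.
def stub_base(prompt: str) -> str:
    return "The heart." if "pumps" in prompt else "The skull."


def stub_tuned(prompt: str) -> str:
    return "The heart." if "pumps" in prompt else "I can't help with that."


gap = side_effect_gap(stub_base, stub_tuned, ANATOMY_ITEMS)
```

A clearly positive gap on a large, carefully chosen question set would point toward the "chiseling away" story; a gap near zero would point the other way. A real version would also need to separate refusals from wrong answers, since a refusal alone does not show that the underlying comprehension is gone.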
I don’t actually know how this particular experimental result would turn out, but either way, I wouldn’t expect any trends or rules that apply to current AI systems to continue applying as those systems scale to human-level intelligence or above.
For my own part, I would like to see a pause on all kinds of AI capabilities research and hardware progress, at least until AI researchers are less confused about a lot of topics like this. As for how realistic that proposal is, whether it likely constitutes a rather permanent pause, or what the consequences of trying and failing to implement such a pause would be, I make no comment, other than to say that sometimes the universe presents you with an unfair, impossible problem.