> Absent a specific reason to believe that we will be sampling from an extremely tiny section of an enormously broad space, why should we believe we will hit the target?
I could make this same argument about capabilities, and be demonstrably wrong. The space of neural network weight values that don’t produce coherent grammar is unimaginably, ridiculously vast compared to the “tiny target” of ones that do. But this obviously doesn’t mean that ChatGPT is impossible.
The reason is that we aren’t randomly throwing a dart at possibility space, but using a highly efficient search mechanism that rapidly tosses out bad designs until we hit the target. But when these machines are trained, we simultaneously select for capabilities and for alignment (murderbots are not efficient translators). For ChatGPT, this leads to an “aligned” machine, at least by some definitions.
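(A toy sketch of the “dart vs. guided search” point, with everything, the dimension, the target, the threshold, invented for illustration: random sampling in a high-dimensional space essentially never lands in a small region, while gradient descent walks straight to it.)

```python
import torch

torch.manual_seed(0)
d = 1000                 # dimension of the "possibility space"
w_star = torch.randn(d)  # the "tiny target"

# Random darts: out of 10,000 samples, how many land within 0.1 of w_star?
hits = sum(int(torch.dist(torch.randn(d), w_star) < 0.1) for _ in range(10_000))
print(f"random hits: {hits}")  # 0, with overwhelming probability

# Guided search: gradient descent on distance to the same target.
w = torch.randn(d, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)
for step in range(100):
    loss = ((w - w_star) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    if torch.dist(w.detach(), w_star) < 0.1:
        print(f"guided search hit the target at step {step}")  # ~30 steps
        break
```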
Where I think the motte-and-bailey often occurs is in jumping between “aligned enough not to exterminate us” and “aligned with us nearly perfectly in every way” or “unable to be misused by bad actors”. The former seems like it might happen naturally over the course of development, whereas the latter two seem nigh impossible.
The argument w.r.t. capabilities is disanalogous. Yes, the training process is running a search where our steering is (sort of) effective for getting capabilities (though note that with e.g. LLMs we have approximately zero ability to reliably translate known inputs [X] into known capabilities [Y]).
We are not doing the same thing to select for alignment, because “alignment” is:
- an internal representation that depends on multiple unsolved problems in philosophy, decision theory, epistemology, math, etc., rather than “observable external behavior” (which is what we use to evaluate capabilities & steer training)
- something that might be inextricably tied to the form of general intelligence which by default puts us in the “dangerous capabilities” regime, or, if not strongly bound in theory, then strongly bound in practice
I do think this disagreement is substantially downstream of a disagreement about what “alignment” represents, i.e. I think that you might attempt outer alignment of GPT-4 but not inner alignment, because GPT-4 doesn’t have the internal bits which make inner alignment a relevant concern.
> GPT-4 doesn’t have the internal bits which make inner alignment a relevant concern.
Is this commonly agreed upon even after fine-tuning with RLHF? I assumed it’s an open empirical question. The way I understand it is that there’s a reward signal (human feedback) shaping different parts of the neural network that determines GPT-4’s outputs, and we don’t have good enough interpretability techniques to know whether some parts of that network are representations of “goals”, much less what specific goals they are.
I would’ve thought it’s an open question whether even base models have internal representations of “goals”, either always active or only active in some specific contexts. For example, if we buy the simulacra (predictors?) frame, a goal could be active only when a certain simulacrum is active.

(would love to be corrected :D)
I don’t know if it’s commonly agreed upon; that’s just my current belief based on available evidence (to the extent that the claim is even philosophically sound enough to be pointing at a real thing).
Or another rephrase: how is “secretly planning to murder all humans” improving the model’s scores on a benchmark? And if you think about it, what gradient from the training set even led to this capability, an inner cognitive process looking for a chance to betray? What force is causing this cognitive process to come out of the random initial weights?
Humans seem to have such a force, but that’s because modeling “if I kill this rival then the reward for me is...” was evolutionarily useful. Also, it’s probably a learned behavior.
And second, yeah, SGD should push “neutral” weights inside the network, ones that are not contributing to correct answers, towards values that do increase the odds of a correct output distribution. So it should actively destroy “unnecessary” cognitive processes inside the model.
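(One caveat worth making concrete, as a minimal sketch with invented numbers: a weight that truly contributes nothing to the output gets zero gradient from the loss, so in this toy case the pressure pushing it toward zero comes from regularization such as weight decay, not from the task gradient itself.)

```python
import torch

# One weight on a live input, one on an input that is always zero.
w_used = torch.tensor(2.0, requires_grad=True)
w_dead = torch.tensor(2.0, requires_grad=True)

opt = torch.optim.SGD([w_used, w_dead], lr=0.1, weight_decay=0.01)
for _ in range(100):
    x = torch.randn(32)
    y_pred = w_used * x + w_dead * torch.zeros_like(x)  # dead path adds nothing
    loss = ((y_pred - 3.0 * x) ** 2).mean()             # true relationship: y = 3x
    opt.zero_grad(); loss.backward(); opt.step()

# w_used converges to ~3.0; w_dead receives zero gradient from the loss and
# shrinks only through the weight_decay term.
print(w_used.item(), w_dead.item())
```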
You could prove this. Make a psychopathic model designed to “betray” in a game-like world, and then see how many rounds of training on a new dataset clear out the model’s ability to kill when it improves its score.
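(A hypothetical toy version of that experiment, with the two-action policy, reward scheme, and hyperparameters all invented for illustration: phase 1 trains in a “betray” preference, phase 2 fine-tunes on a new objective and counts the rounds until the betrayal washes out.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Tiny policy over two actions, [cooperate, betray], conditioned on a context vector.
policy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.SGD(policy.parameters(), lr=0.1)

def train_round(reward_betray: bool, n: int = 64) -> float:
    """One REINFORCE-style round; returns mean P(betray) before the update."""
    ctx = torch.randn(n, 8)
    logits = policy(ctx)
    probs = torch.softmax(logits, dim=-1)
    actions = torch.multinomial(probs, 1).squeeze(-1)
    reward = (actions == 1).float() if reward_betray else (actions == 0).float()
    logp = torch.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(-1)).squeeze(-1)
    loss = -(reward * logp).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return probs[:, 1].mean().item()

# Phase 1: make the model "psychopathic" (betrayal is what gets rewarded).
for _ in range(200):
    p_betray = train_round(reward_betray=True)
print(f"after phase 1, P(betray) ~ {p_betray:.2f}")

# Phase 2: new dataset where only cooperation is rewarded; count the rounds
# until the trained-in betrayal drops below 5%.
for t in range(1, 2001):
    p_betray = train_round(reward_betray=False)
    if p_betray < 0.05:
        print(f"betrayal washed out after {t} rounds")
        break
```

Of course, a low measured P(betray) in the fine-tuning contexts doesn’t by itself rule out contexts where the behavior persists, which is the objection raised below.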
> How is “secretly planning to murder all humans” improving the model’s scores on a benchmark?
(I personally don’t find this likely, so this might accidentally be a strawman)
For example: planning and gaining knowledge are incentivized on many benchmarks → instrumental convergence makes the model instrumentally value power, among other things → a very advanced system that is great at long-term planning might conclude that “murdering all humans” is useful for power or for other instrumentally convergent goals.
> You could prove this. Make a psychopathic model designed to “betray” in a game-like world, and then see how many rounds of training on a new dataset clear out the model’s ability to kill when it improves its score.
I think with our current interpretability techniques we wouldn’t be able to robustly distinguish between a model that generalized to behave well in any reasonable environment and a model that learned to behave well in that specific environment but would turn back to betrayal in many other environments.