GPT-4 doesn’t have the internal bits which make inner alignment a relevant concern.
Is this commonly agreed upon even after fine-tuning with RLHF? I assumed it’s an open empirical question. The way I understand it is that there’s a reward signal (human feedback) shaping different parts of the neural network that determines GPT-4’s outputs, and we don’t have good enough interpretability techniques to know whether some parts of the network are representations of “goals”, much less what specific goals they are.
I would’ve thought it’s an open question whether even base models have internal representations of “goals”, either always active or only active in some specific contexts. For example, if we buy the simulacra (predictors?) frame, a goal could be active only when a certain simulacrum is active.
(would love to be corrected :D)
I don’t know if it’s commonly agreed upon; that’s just my current belief based on available evidence (to the extent that the claim is even philosophically sound enough to be pointing at a real thing).
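For concreteness, one way the interpretability question above often gets operationalized is a linear probe over cached activations: if a “goal is active” label were linearly decodable from some layer’s residual stream, that would be (weak) evidence for an internal representation. The sketch below is purely illustrative: the activations are random stand-ins, and the hidden size, dataset size, and labels are made-up placeholders; a real experiment would cache activations from the model (e.g. via forward hooks) and label prompts by whether the candidate goal-relevant context is active.

```python
# Purely illustrative sketch of a linear probe for a hypothesised "goal" feature.
# The activations here are random stand-ins; a real run would cache residual-stream
# activations from the model (e.g. via forward hooks) and label each prompt by
# whether the candidate goal-relevant context (e.g. a particular simulacrum) is active.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_prompts = 512, 2000                # made-up hidden size / dataset size
X = rng.normal(size=(n_prompts, d_model))     # stand-in for cached activations
y = rng.integers(0, 2, size=n_prompts)        # stand-in labels: "goal active?"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Well-above-chance held-out accuracy would suggest the feature is linearly decodable
# at this layer; chance-level accuracy (as with this random data) only rules out
# *linear* decodability here, not the existence of the representation.
print(f"held-out probe accuracy: {probe.score(X_te, y_te):.2f}")
```

Even a positive probe result wouldn’t settle the disagreement above, since a decodable feature isn’t the same thing as a goal the model is pursuing.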