Hubinger et al.’s definition of unidentifiability, which I’m referring to in this post:
Unidentifiability. It is a common problem in machine learning for a dataset to not contain enough information to adequately pinpoint a specific concept. This is closely analogous to the reason that machine learning models can fail to generalize or be susceptible to adversarial examples (19)—there are many more ways of classifying data that do well in training than any specific way the programmers had in mind.
I’m referring to unidentifiability in terms of the goals of a model in a (pre-trained) reinforcement learning context. I think the internet contains enough information to adequately pinpoint following directions. Do you disagree, or are you using this term some other way?
Pre-trained models having weird output probabilities for carefully designed gibberish inputs doesn’t seem relevant to me. Wouldn’t that be more of a capability failure than goal misalignment? It doesn’t seem to indicate that the model is optimizing for something other than next token prediction. I’m arguing that models are unlikely to be deceptively aligned, not that they are immune to all adversarial inputs. I haven’t read the post you linked to in full, so let me know if I’m missing something.
My unidentifiability argument is that if a model:
Has been pre-trained on ~the whole internet
Is sophisticated/complex enough to have the potential to become TAI if (more) RL training occurs
Is told to follow directions subject to ethical considerations, then given directions
Then it would be really weird if it didn’t understand that it’s designed to follow directions subject to ethical considerations. If there’s a way for this to happen, I haven’t seen it described anywhere.
It might still occasionally misinterpret your directions, but it should generally understand that the training goal is to follow directions subject to non-consequentialist ethical considerations before RL training could turn it into a proxy goal optimizer. Deception gets even less likely when you factor in that, to be deceptive, it would need a very long-term goal and situational awareness before, or around the same time as, it comes to understand that it needs to follow directions subject to ethical considerations. What’s the story for how this happens?
I was using unidentifiability in the Hubinger way. I do believe that if you try to get an AI trained in the way you mention here to follow directions subject to ethical considerations, by default, the things it considers “maximally ethical” will be approximately as strange as the sentences from above.
That said, this is not actually related to the problem of deceptive alignment, so I realise now that this is very much a side point.