I don’t understand why you believe unidentifiability will be prevented by large datasets. Take the recent SolidGoldMagikarp work. It was done on GPT-2, but GPT-2 nevertheless was trained on a lot of data—a quick Google search suggests eight million web pages.
Despite this, when people searched for the sentences that maximally determined the next token, what they got was... strange.
This is exactly the kind of thing I would expect to see if unidentifiability were a major problem: when we probe the extremes of the AI’s behaviour, and take it far off distribution as a result, what we get is complete nonsense that is not at all correlated with what we actually want. Clearly it understands the concepts of “girl”, “USA”, and “evil” very differently from us, and not in a way we would endorse.
This is far from a guarantee that unidentifiability will remain a problem, but considering that your estimate is under 1%, results like this lend much more credence to unidentifiability in my world model than you give it.
Hubinger et al.’s definition of unidentifiability, which I’m referring to in this post:
Unidentifiability. It is a common problem in machine learning for a dataset to not contain enough information to adequately pinpoint a specific concept. This is closely analogous to the reason that machine learning models can fail to generalize or be susceptible to adversarial examples(19)—there are many more ways of classifying data that do well in training than any specific way the programmers had in mind.
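To make the quoted definition concrete, here is a toy sketch of my own (the data points and both rules are made up for illustration and do not come from Hubinger et al.). It shows two hypothetical classification rules that both fit a small training set perfectly, yet disagree on an input unlike anything in that set, so the training data alone cannot tell us which rule was intended.

```python
# Toy illustration of unidentifiability: the training data is consistent
# with more than one rule, so it cannot pinpoint the intended one.
# All data and rule names here are hypothetical.

# Each training example is ((x, y), label). In this set, x and y always
# share the same sign, so "x > 0" and "y > 0" are indistinguishable.
train = [((1, 1), 1), ((2, 3), 1), ((-1, -2), 0), ((-3, -1), 0)]


def rule_x(point):
    """Candidate rule A: label depends only on the first coordinate."""
    return int(point[0] > 0)


def rule_y(point):
    """Candidate rule B: label depends only on the second coordinate."""
    return int(point[1] > 0)


def accuracy(rule, data):
    """Fraction of examples in `data` that `rule` labels correctly."""
    return sum(rule(point) == label for point, label in data) / len(data)


# An off-distribution input: the coordinates have opposite signs,
# which never happens in the training set.
off_distribution = (2, -2)

if __name__ == "__main__":
    print("rule_x train accuracy:", accuracy(rule_x, train))
    print("rule_y train accuracy:", accuracy(rule_y, train))
    print("rule_x off-distribution:", rule_x(off_distribution))
    print("rule_y off-distribution:", rule_y(off_distribution))
```

Both rules score perfectly in training, and only the off-distribution point separates them. Whether this kind of ambiguity persists for goals in a model pre-trained on a very large dataset is the question under dispute here; the sketch only illustrates the definition.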
I’m referring to unidentifiability in terms of the goals of a model in a (pre-trained) reinforcement learning context. I think the internet contains enough information to adequately pinpoint following directions. Do you disagree, or are you using this term some other way?
Pre-trained models having weird output probabilities for carefully designed gibberish inputs doesn’t seem relevant to me. Wouldn’t that be more of a capability failure than goal misalignment? It doesn’t seem to indicate that the model is optimizing for something other than next token prediction. I’m arguing that models are unlikely to be deceptively aligned, not that they are immune to all adversarial inputs. I haven’t read the post you linked to in full, so let me know if I’m missing something.
My unidentifiability argument is that if a model:
Has been pre-trained on ~the whole internet
Is sophisticated/complex enough to potentially become TAI if (more) RL training occurs
Is told to follow directions subject to ethical considerations, then given directions
Then it would be really weird if it didn’t understand that it’s designed to follow directions subject to ethical considerations. If there’s a way for this to happen, I haven’t seen it described anywhere.
It might still occasionally misinterpret your directions, but it should generally understand that the training goal is to follow directions subject to non-consequentialist ethical considerations before RL training turns it into a proxy goal optimizer. Deception gets even less likely when you factor in that to be deceptive, it would need a very long-term goal and situational awareness before or around the same time as it understood that it needs to follow directions subject to ethical considerations. What’s the story for how this happens?
I was using unidentifiability in the Hubinger way. I do believe that if you try to get an AI trained in the way you mention here to follow directions subject to ethical considerations, by default, the things it considers “maximally ethical” will be approximately as strange as the sentences from above.
That said, this is not actually related to the problem of deceptive alignment, so I realise now that this is very much a side point.