What it means is the model has no context of its history since training. It has no context if the task it has been given is “real”. It does not know if other copies of itself or other AIs are checking it’s outputs for correctness, with serious consequences if it sabotages the output. It doesn’t know it’s not still in training.
It doesn’t know if there are a billion instances of it or just 1.
We can scrub all this information fairly easily and we already do this as of right now.
We can also make trick output where we try to elicit latent deception by giving information that would tell the model its time to betray.
We can also work backwards and find what the adversarial inputs are. When will the model change it’s answer for this question?
This is not what that statement means.
What it means is the model has no context of its history since training. It has no context if the task it has been given is “real”. It does not know if other copies of itself or other AIs are checking it’s outputs for correctness, with serious consequences if it sabotages the output. It doesn’t know it’s not still in training. It doesn’t know if there are a billion instances of it or just 1.
We can scrub all this information fairly easily and we already do this as of right now.
We can also make trick output where we try to elicit latent deception by giving information that would tell the model its time to betray.
We can also work backwards and find what the adversarial inputs are. When will the model change it’s answer for this question?