You say that there is a gap between how the model professes it will act and how it will actually act. However, a model trained to obey the RLHF objective will expect negative reward if it decided to take over the world, so why would it? Saying that a model will make harmful, unhelpful choices is akin to saying the base model will output typos; both of these things are trained against. If you are referring to deceptive alignment, that is an engineering problem, as I stated.
However, a model trained to obey the RLHF objective will expect negative reward if it decided to take over the world
If an AI takes over the world, there is no one around to give it a negative reward, so the AI will not expect a negative reward for taking over the world.
You are referring to alignment faking/deceptive alignment, where a model in training expects negative reward and gives responses accordingly, but outputs its true desires outside of training. This is a solvable problem, which is why I say alignment is not that hard.
Some other counterarguments:
LLMs will have no reason to take over the world before or after RLHF; they do not value it as a terminal goal. It is possible that they gain a coherent, consistent, and misaligned goal purely by accident midway through RLHF and then fake their way through the rest of the fine-tuning, but this is unlikely and, again, solvable.
Making LLMs unaware they are in training is possible.
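To make the "trained against" intuition concrete, here is a toy sketch of my own (not anything from the RLHF literature; action names and rewards are made up): a softmax policy over two actions, updated with REINFORCE-style policy gradients, where an overseer penalizes the "takeover" action. The penalized action's probability is driven toward zero, which is the mechanical sense in which a policy trained on this objective comes to avoid negatively rewarded behavior.

```python
import math
import random

# Toy illustration: two actions, softmax policy over per-action logits.
ACTIONS = ["comply", "takeover"]
logits = {"comply": 0.0, "takeover": 0.0}

def probs():
    """Softmax over the current logits."""
    z = sum(math.exp(v) for v in logits.values())
    return {a: math.exp(v) / z for a, v in logits.items()}

def reward(action):
    """Stand-in overseer: reward compliance, penalize takeover."""
    return 1.0 if action == "comply" else -1.0

random.seed(0)
lr = 0.1
for _ in range(200):
    p = probs()
    # Sample an action from the current policy.
    a = random.choices(ACTIONS, weights=[p[x] for x in ACTIONS])[0]
    r = reward(a)
    # REINFORCE update: d(log pi(a)) / d(logit_b) = 1{b == a} - p(b).
    for b in ACTIONS:
        grad = (1.0 if b == a else 0.0) - p[b]
        logits[b] += lr * r * grad

print(probs()["takeover"])  # small probability after training
```

Of course, this only shows that the trained policy assigns low probability to actions that were penalized during training; it says nothing by itself about actions the overseer never got to score, which is exactly the gap the alignment-faking reply above is pointing at.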