However, a model trained to obey the RLHF objective will expect negative reward if decided taking over the world
If an AI takes over the world there is no-one around to give it a negative reward. So the AI will not expect a negative reward for taking over the world
You refer to alignment faking/deceptive alignment, where a model in training expects negative reward and gives responses accordingly but outputs it’s true desires outside of training. This a solvable problem which is why I say alignment is not that hard.
Some other counterarguements:
LLMs will have no reason to take over the world before or after RLHF. They do not value it as a terminal goal. It is possible that they gain a cohorent, consistent, and misaligned goal purely by accident midway through RLHF and then fake it’s way through the rest of the fine-tuning. But this is unlikely and again solvable.
Making LLMs unaware they are in training is possible.
If an AI takes over the world there is no-one around to give it a negative reward. So the AI will not expect a negative reward for taking over the world
You refer to alignment faking/deceptive alignment, where a model in training expects negative reward and gives responses accordingly but outputs it’s true desires outside of training. This a solvable problem which is why I say alignment is not that hard.
Some other counterarguements:
LLMs will have no reason to take over the world before or after RLHF. They do not value it as a terminal goal. It is possible that they gain a cohorent, consistent, and misaligned goal purely by accident midway through RLHF and then fake it’s way through the rest of the fine-tuning. But this is unlikely and again solvable.
Making LLMs unaware they are in training is possible.