The issue is not whether the AI understands human morality. The issue is whether it cares.
The arguments from the “alignment is hard” side that I was exposed to don’t rely on the AI misinterpreting what the humans want. In fact, superhuman AI assumed to be better at humans at understanding human morality. It still could do things that go against human morality. Overall I get the impression you misunderstand what alignment is about (or maybe you just have a different association to words as “alignment” than me).
Whether a language model can play a nice character that would totally give back the dictatorial powers after takeover is barely any evidence whether the actual super-human AI system will step back from its position of world dictator after it has accomplished some tasks.
You say that there is a gap between how the model professes it will act and how it will actually act. However, a model trained to obey the RLHF objective will expect negative reward if decided taking over the world, so why would it? Saying that a model will make harmful, unhelpful choices is akin to saying the base model will output typos. Both of these things are trained against. If you refer to deceptive alignment, this is a engineering problem as I stated.
However, a model trained to obey the RLHF objective will expect negative reward if decided taking over the world
If an AI takes over the world there is no-one around to give it a negative reward. So the AI will not expect a negative reward for taking over the world
You refer to alignment faking/deceptive alignment, where a model in training expects negative reward and gives responses accordingly but outputs it’s true desires outside of training. This a solvable problem which is why I say alignment is not that hard.
Some other counterarguements:
LLMs will have no reason to take over the world before or after RLHF. They do not value it as a terminal goal. It is possible that they gain a cohorent, consistent, and misaligned goal purely by accident midway through RLHF and then fake it’s way through the rest of the fine-tuning. But this is unlikely and again solvable.
Making LLMs unaware they are in training is possible.
The issue is not whether the AI understands human morality. The issue is whether it cares.
The arguments from the “alignment is hard” side that I was exposed to don’t rely on the AI misinterpreting what the humans want. In fact, superhuman AI assumed to be better at humans at understanding human morality. It still could do things that go against human morality. Overall I get the impression you misunderstand what alignment is about (or maybe you just have a different association to words as “alignment” than me).
Whether a language model can play a nice character that would totally give back the dictatorial powers after takeover is barely any evidence whether the actual super-human AI system will step back from its position of world dictator after it has accomplished some tasks.
You say that there is a gap between how the model professes it will act and how it will actually act. However, a model trained to obey the RLHF objective will expect negative reward if decided taking over the world, so why would it? Saying that a model will make harmful, unhelpful choices is akin to saying the base model will output typos. Both of these things are trained against. If you refer to deceptive alignment, this is a engineering problem as I stated.
If an AI takes over the world there is no-one around to give it a negative reward. So the AI will not expect a negative reward for taking over the world
You refer to alignment faking/deceptive alignment, where a model in training expects negative reward and gives responses accordingly but outputs it’s true desires outside of training. This a solvable problem which is why I say alignment is not that hard.
Some other counterarguements:
LLMs will have no reason to take over the world before or after RLHF. They do not value it as a terminal goal. It is possible that they gain a cohorent, consistent, and misaligned goal purely by accident midway through RLHF and then fake it’s way through the rest of the fine-tuning. But this is unlikely and again solvable.
Making LLMs unaware they are in training is possible.