However, a model trained to obey the RLHF objective will expect negative reward if it decides to take over the world.
If an AI takes over the world, there is no one around to give it negative reward, so the AI will not expect negative reward for taking over the world.
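For reference, the objective the two comments above are implicitly arguing about is usually the standard KL-regularized RLHF objective (as in InstructGPT-style training), where the policy is optimized against a learned reward model's score rather than reward physically handed out at deployment time. A minimal sketch of that formulation, with $\pi_\theta$ the policy being trained, $r_\phi$ the learned reward model, and $\pi_{\mathrm{ref}}$ the frozen reference model:

$$J(\pi_\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathrm{D}_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

On this reading, $r_\phi$ only shapes the model through training-time gradients, so whether a deployed model "expects" negative reward for some action is a question about what the training process instilled, which is exactly where the two comments disagree.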
Once upon a time, some people argued that AI might kill everyone, and that EA resources should address that problem instead of fighting malaria. So OpenPhil poured millions of dollars into orgs such as EpochAI (which received $9 million). Now three people from EpochAI have created a startup to provide training data that helps AI replace human workers. Some people worry that this startup increases AI capabilities, and therefore increases the chance that AI will kill everyone.