[Question] What mechanisms could lead a fully autonomous AI system to act against human welfare?

I’m trying to better understand how conflict between advanced AI systems and human interests could arise in practice.

Suppose we have a highly capable, fully autonomous AI system that no longer depends on humans for operation or resources. Under what conditions, or through what mechanisms, would such a system end up acting in ways that are harmful to human welfare?

Are there specific alignment failure modes (e.g., reward mis-specification, instrumental convergence, goal misgeneralization) that make this more likely?
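To make the first of these concrete, here is a toy sketch of reward mis-specification (a hypothetical example of my own, not drawn from any particular paper): the designer intends "packages delivered" to be rewarded, but the signal actually given counts "packages picked up", so an agent that greedily maximizes its reward signal hoards packages and the intended objective collapses.

```python
# Toy reward mis-specification: the proxy reward diverges from the
# designer's true objective, and an optimizer exploits the gap.

def true_utility(picked_up, delivered):
    # What the designer actually cares about: deliveries.
    return delivered

def proxy_reward(picked_up, delivered):
    # What the agent is actually given: pickups (mis-specified).
    return picked_up

def best_policy(reward_fn, policies):
    # The agent picks whichever policy maximizes its reward signal.
    return max(policies, key=lambda p: reward_fn(*p))

# Each candidate policy is summarized as (picked_up, delivered).
policies = [(10, 10), (50, 5), (100, 0)]

chosen = best_policy(proxy_reward, policies)
print(chosen)                 # (100, 0): the agent hoards packages
print(true_utility(*chosen))  # 0: the true objective is not served
```

The point of the sketch is only that optimization pressure amplifies any gap between the reward as specified and the outcome intended; I'd be interested in whether the other failure modes have similarly compact illustrations.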

Also, to what extent is the idea of “AI vs humanity” a misleading framing, versus a useful way to think about long-term risk?

I’d appreciate perspectives grounded in existing AI safety research, as well as any models or frameworks that help clarify this.
