In theory, the best way to be the best next-word predictor is to model humans. Humans, in turn, internally model the world they live in, so a sufficiently powerful human-modeler would likely also model that world. Further, humans reason, so a really good next-word predictor could predict the next word more accurately by reasoning itself. By the same logic, developing other cognitive abilities, logic, and so on is simply an effective optimization strategy.
All of this allows you to predict the correct next word with fewer “neurons”, because it takes fewer neurons to learn how to do logical deduction and memorize some premises than it takes to memorize all of the possible outputs that some future prompt may require.
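To make the compression point concrete, here is a toy sketch (my own illustration, not from the post): for an N-bit parity question, the memorization strategy needs a lookup table that doubles with every extra bit, while the deduced rule stays a constant-size procedure. The parity task, N = 20, and all names here are hypothetical stand-ins.

```python
from itertools import product

N = 20  # hypothetical "size" of the question, in bits

# Memorization strategy: one stored answer per possible input.
lookup_table_entries = 2 ** N  # 1,048,576 entries for N = 20

# Rule-learning strategy: a single small procedure covers every input.
def parity(bits):
    """The 'deduced' rule: the answer is the XOR of all input bits."""
    result = 0
    for b in bits:
        result ^= b
    return result

# Spot-check that the rule reproduces what the table would have stored.
assert all(parity(bits) == sum(bits) % 2 for bits in product([0, 1], repeat=10))

print(f"Memorization: {lookup_table_entries:,} stored answers")
print("Rule: a handful of operations, regardless of N")
```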
The fact that we train on human data just means that we are training the AI to reason and think critically in the same way we do. Once it has that ability, we can then “scale it up”, which is something human cognition really struggles with.
How can you align AI with humans when humans are not internally aligned?
AI alignment researchers often talk about aligning AIs with humans, but humans are not aligned with each other as a species. There are groups whose goals directly conflict with each other, and I don’t think there is any singular goal that all humans share.
As an extreme example, one may say “keep humans alive” is a shared goal among humans, but there are people who think that is an anti-goal and humans should be wiped off the planet (e.g., eco-terrorists). “Humans should be happy” is another goal that not everyone shares, and there are entire religions that discourage pleasure and enjoyment.
You could try to simplify further to “keep the species around”, but some would be fine with a wirehead future while others would not, and some would be fine with humans merely existing in a zoo while others would not.
Almost every time I hear alignment researchers speak about aligning AI with humans, they seem to start from the premise that there is a cohesive worldview to align with. The best “solution” to this problem that I have heard suggested is that there should be multiple AIs competing with each other on behalf of different groups of humans, or perhaps individual humans, with each AI separately representing the goals of its humans. However, the people who suggest this strategy are generally not AI alignment researchers, but rather people arguing against AI alignment researchers.
What is the implied alignment target that AI alignment researchers are trying to work towards?