I don’t think the ‘next-token’ aspect has any bearing at all. That models emit one token at a time is just about the interface we allow them to have; it doesn’t limit the model’s internal processing to considering only one token at a time. Indeed, the remarkable coherence and quality of LLM responses (including rarely, if ever, getting stuck where a sentence can’t be meaningfully completed) is evidence that the model IS considering more than just the next token. And there’s now direct evidence that LLMs think far ahead: https://www.anthropic.com/research/tracing-thoughts-language-model. Just one example: when asked to write rhyming verse, the model, while still writing out the first line, has already internally considered which words could form the rhyme at the end of the second line.
Our use and training of LLMs is focused on the next token, and for a simple model with few parameters the prediction will indeed be very simple, just looking at the frequency distribution given the previous word, etc. But when you search for the best model with billions of parameters, things radically change: here, the best way for the model to predict the next token is to develop ACTUAL intelligence, which includes thinking further ahead, even though our interface to the model is simpler.
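To make that contrast concrete, here’s a toy sketch (my own illustration in Python, not anything from the linked research; the corpus and names are made up) of the kind of frequency-lookup prediction that is all a tiny model can really do:

```python
from collections import Counter, defaultdict

# Toy corpus, purely for illustration.
corpus = "the cat sat on the mat and the cat slept".split()

# Count how often each word follows each previous word (a bigram table).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(prev_word):
    """Return the most frequent word seen after prev_word, or None if unseen."""
    followers = bigram_counts.get(prev_word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))  # -> 'cat' ("cat" follows "the" twice, "mat" once)
```

A model like this genuinely does nothing but look one word ahead; the point is that a billion-parameter model trained on the same next-token objective has no reason to stay that shallow.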