There’s an important distinction here between predicting the next token in a piece of text and predicting the next action in a causal chain. If you have a computation that is represented by a causal graph, and you train a predictor to predict nodes conditional on previous nodes, then it’s true that the predictor won’t end up being able to do better than the original computational process. But text is not ordered that way! Texts often describe outcomes before describing the details of the events which generated them. If you train on texts like those, you get something more powerful than an imitator. If you train a good enough next-token predictor on chess games where the winner is mentioned before the list of moves, you can get superhuman play by prepending “This is a game which white/black wins:”. If you train a good enough next-token predictor on texts that have the outputs of circuits listed before the inputs, you get an NP-oracle. You’re almost certainly not going to get an NP-oracle from GPT-9, but that’s because of the limitations of the training processes and architectures that this universe can support; it’s not a limitation of the loss function.
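To make the circuit case concrete, here is a minimal sketch of what the optimum of the next-token loss would look like on output-before-input texts. Everything in it is an assumption for illustration: the three-input `circuit` is a made-up toy, the corpus is assumed to contain every input equally often, and `ideal_predictor` builds the exact conditional distribution by brute-force enumeration rather than by training anything. The enumeration is exponential in the number of inputs, which is exactly the "limitations of the training process" part; the sketch only shows what the loss is asking for.

```python
from collections import defaultdict
from itertools import product


def circuit(bits):
    """Hypothetical toy circuit: (a OR b) AND (NOT c)."""
    a, b, c = bits
    return int((a or b) and not c)


def ideal_predictor(n_inputs=3):
    """Exact P(inputs | output) for a corpus where each text lists the
    circuit's output first and every input appears equally often.
    This stands in for a perfectly trained next-token predictor."""
    counts = defaultdict(lambda: defaultdict(int))
    for bits in product((0, 1), repeat=n_inputs):
        counts[circuit(bits)][bits] += 1
    return {
        out: {bits: c / sum(dist.values()) for bits, c in dist.items()}
        for out, dist in counts.items()
    }


predictor = ideal_predictor()
# "Prompting" with output = 1 puts probability only on satisfying inputs.
satisfying = sorted(predictor[1])
print(satisfying)
```

Conditioning on the output and then reading off the inputs is the same move as prepending “This is a game which white wins:” to a chess game: the predictor is never asked to imitate the process that produced the text, only to complete a text whose outcome is already fixed.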
I think there very much is a limitation in the loss function once you consider how efficiently results are obtained. In chess, Stockfish and AlphaZero don’t just match the best human players; they exceed them by a ridiculous margin, and that’s right now. GPT, meanwhile, with the same level of computation, still hasn’t figured out how not to make illegal moves.
I can’t rule out that a future GPT version will be able to beat the best human through really good pattern matching on what a “winning” game looks like. But that’s still pattern matching on human games. Stockfish has no such limitation.