So, if I accept Ryan’s framing of the inconsistent triad, I’d reject the 3rd one, and say that “Current LLMs never ‘learn’ at runtime (e.g. the in-context learning they can do isn’t real learning)”.
You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.
I’m quite confused why it’s called learning at all, given that all of the weights in the transformer are frozen after training and RLHF.
In RLHF and training, no aspect of the GPU hardware is being updated at all; it’s all frozen. So why does that count as learning? I would say that a system can (potentially!) be learning as long as there is some evolving state. In the case of transformers and in-context learning, that state is the activations.
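To make “evolving state” concrete, here is a deliberately toy sketch (plain numpy, nothing transformer-specific; the sizes and update rule are made up for illustration): the weight matrix is only ever read, while the state that carries information forward changes with every token processed.

```python
# Toy sketch of "frozen weights, evolving activations" (numpy, not a real
# transformer; sizes and the update rule are arbitrary illustration choices).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy hidden size
W = rng.standard_normal((d, d))         # "weights": created once, only ever read
W_snapshot = W.copy()

def process_token(state, token_vec):
    """Mix one new token into the running state. W is read, never written."""
    return np.tanh(state @ W + token_vec)

state = np.zeros(d)
for t in range(5):                      # five tokens of "context"
    token_vec = rng.standard_normal(d)
    state = process_token(state, token_vec)
    print(t, np.round(state[:3], 3))    # the evolving activation state

assert np.array_equal(W, W_snapshot)    # the weights were untouched the whole time
```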
You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.
Ah sorry, I misread the trilemma, my bad! I think I’d still hold the 3rd to be true (Current LLMs never “learn” at runtime), though I’m open to changing my mind on that after looking at further research. I guess I could see ways to reject 1 (e.g. if I copied the answers and just used a lookup table I’d get 100%, but I don’t think there’s any learning there, so it’s certainly feasible for this to be false, but agreed it doesn’t feel satisfying), or 2 (maybe Chollet would say selection-from-memorised-templates doesn’t count as learning, also agreed unsatisfying). It’s a good challenge!
In RLHF and training, no aspect of the GPU hardware is being updated at all; it’s all frozen. So why does that count as learning?
I’m not really referring to hardware here. In pre-training and RLHF the model weights are being changed and updated, and that’s where the ‘learning’ (if we want to call it that) comes in—the model is ‘learning’ to store and generate information through some combination of accurately predicting the next token in its training data and satisfying the reward model created from human reward labelling. Which is my issue with calling ICL ‘learning’: since the model weights are fixed, the model isn’t learning anything. Similarly, all the activation functions between the layers do not change either. It also doesn’t make intuitive sense to me to call the outputs of layers ‘learning’ - the activations are ‘just matmul’, which I know is reductionist, but they aren’t a thing that acquires a new state in my mind.
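To pin down the distinction I’m drawing, a minimal toy sketch (numpy, not an LLM; the loss and learning rate are placeholder choices): in the training regime the weights themselves change, while at inference they are only read and the only new object is the activation.

```python
# Toy contrast of the two regimes (numpy, not an LLM; loss/learning rate are
# placeholder choices): training updates the weights, inference does not.
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4))              # the model "weights"
x, y = rng.standard_normal(4), rng.standard_normal(4)

# Pre-training/RLHF-style step: the weights themselves are changed.
grad = np.outer(W @ x - y, x)                # grad of 0.5 * ||W x - y||^2 w.r.t. W
W = W - 0.1 * grad                           # W is genuinely different afterwards

# Inference-style pass: W is only read; what varies from prompt to prompt
# is the activation it produces, not W.
W_before = W.copy()
activation = np.tanh(W @ x)
assert np.array_equal(W, W_before)           # weights untouched at runtime
```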
But again, this is something I want to do a deep dive into myself, so I accept that my thoughts on ICL might not be very clear.
I’m not really referring to hardware here. In pre-training and RLHF the model weights are being changed and updated
Sure, I was just using this as an example. I should have made this clearer.
Here is a version of the exact same paragraph you wrote, but for activations and in-context learning:
At runtime, the model activations are being changed and updated by each layer, and that’s where the ‘in-context learning’ (if we want to call it that) comes in—the activations are being updated/optimized to better predict the next token and understand the text. The layers learned to in-context learn (update the activations) across a wide variety of data in pre-training.
(We can show transformers learning to do optimization in [very toy cases](https://www.lesswrong.com/posts/HHSuvG2hqAnGT5Wzp/no-convincing-evidence-for-gradient-descent-in-activation#Transformers_Learn_in_Context_by_Gradient_Descent__van_Oswald_et_al__2022_).)

Fair enough if you want to say “the model isn’t learning, the activations are learning”, but then you should also say “short-term (<1 minute) learning in humans isn’t the brain learning, it is the transient neural state learning”.
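To give a flavour of what “very toy cases” means here (my own numpy simplification, not the construction from the linked post): for linear regression, one gradient-descent step on the in-context examples produces exactly the same prediction as a linear-attention-style readout over those examples, so a fixed stack of matmuls over activations can literally implement a learning update.

```python
# Toy illustration (numpy; my own simplification, NOT the construction in the
# linked post): one gradient-descent step on in-context linear-regression
# examples equals a linear-attention-style readout over those examples.
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 16
X = rng.standard_normal((n, d))          # in-context inputs  x_1..x_n
w_true = rng.standard_normal(d)
y = X @ w_true                           # in-context targets y_1..y_n
x_query = rng.standard_normal(d)         # the point we want a prediction for
lr = 0.1

# View 1: explicit "learning". One GD step from w = 0 on 0.5 * sum_i (w.x_i - y_i)^2.
w = np.zeros(d)
w = w - lr * (X.T @ (X @ w - y))         # = lr * sum_i y_i * x_i
pred_gd = w @ x_query

# View 2: no weight ever changes. Linear-attention-style readout with
# keys x_i, values y_i, query x_query: lr * sum_i y_i * <x_i, x_query>.
pred_attn = lr * (y @ (X @ x_query))

print(np.isclose(pred_gd, pred_attn))    # True: the same computation
```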
I’ll have to dive into the technical details here I think, but the mystery of in-context learning has certainly shot up my reading list, and I really appreciate that link btw! It seems Blaine has some of the same a priori scepticism that I do towards it, but the right way for me to proceed is to dive into the empirical side and see if my ideas hold water there.