I’m quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it’s called learning at all. The model certainly isn’t learning anything.
I would frame it as: the model is learning but then forgetting what it’s learned, because it can’t move anything from working/short-term memory into long-term memory. We see the same thing in human learning (one example: I’ve learned an enormous number of six-digit confirmation codes, each of which I remember just long enough to enter it into the website that’s asking for it), although of course not so consistently.
If this is true, then substituting in a less capable model should produce equally good results; would you predict that to be the case? I claim that plugging in an older/smaller model would produce much worse results, and if that’s the case then we should attribute a substantial part of the performance to the model itself.
This seems to me to be Chollet trying to have it both ways. Either a) ARC is an important measure of ‘true’ intelligence (or at least of the ability to reason over novel problems), and so we should consider LLMs’ poor performance on it a sign that they’re not generally intelligent, or b) ARC isn’t a very good measure of true intelligence, in which case LLMs’ performance on it isn’t very important. Those can’t both be true. I think that nearly everywhere but in the quote, Chollet has claimed (and continues to claim) that a) is true.