I can buy that GPT-4o would be best, but perhaps other LLMs might reach 'ok' scores on ARC-AGI if directly swapped in? I'm not sure what you mean by 'careful optimization' here, though.
I think much worse LLMs like GPT-2 or GPT-3 would virtually eliminate performance. This is clear because these models basically can't code at all. If you instead consider LLMs that are only somewhat less powerful, like llama-3-70b (which is perhaps 10x less effective compute?), the reduction in performance will be smaller.