LLMs have made no progress on any of these problems
Can we bet on this? I propose: we give a video model of your choice from 2023 and one of my choice from 2025 two prompts (one your choice, one my choice), then ask some neutral panel of judges (I'm happy to just ask random people in a coffee shop) which model produced more realistic videos.
I'll bet you a $10 donation to the charity of your/my choice that a judge we agree on with formal/credentialed expertise in deep learning research (e.g., an academic or corporate AI researcher) will say that typical autoregressive large language models like GPT-4/GPT-5 or Claude 2/Claude 4.5 have not made or constituted non-trivial progress on the AI research problem of learning from video data via approaches that don't rely on pixel-level prediction.
I'm open to counter-offers.
I'll also say yes to anyone who wants to take the other side of this bet.
I didn't say that pixel-to-pixel prediction or other low-level techniques haven't made incremental progress. I said that this approach is ultimately forlorn (if the goal is human-level computer vision for robotics applications or AGI that can see) and that LLMs haven't made any progress on any alternative approaches.
What are examples of what you would consider to be progress on "effective video prediction"?

Possibly something like V-JEPA 2, but in that case I'm just going off of Meta touting its own results, and I would want to hear opinions from independent experts.
Sorry, I don't mean models that you consider to be better, but rather metrics/behaviors. Like, what can V-JEPA 2 (or any model) do that previous models couldn't, which you would consider to be a sign of progress?

The V-JEPA 2 abstract explains this:
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
Again, the caveat here is that this is Meta touting their own results, so I take it with a grain of salt.
I don't think higher scores on the benchmarks mentioned automatically imply progress on the underlying technical challenge. It's more about the underlying technical ideas in V-JEPA 2 (Yann LeCun has explained the rationale for these ideas) and where they could ultimately go given further research.
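To make the contrast with pixel-level prediction concrete, here's a rough toy sketch of the core joint-embedding-predictive idea as I understand it: the model is trained to predict the representation of missing or future content, not the raw pixels. This is just my own illustration in PyTorch, not V-JEPA 2's actual architecture or code; the simple MLP modules, dimensions, and EMA rate are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 128

# Toy stand-ins for the real transformer encoders and predictor.
context_encoder = nn.Sequential(nn.Linear(768, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
target_encoder = nn.Sequential(nn.Linear(768, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
predictor = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

# The target encoder gets no gradients; it is kept as a slow (EMA) copy of the
# context encoder, which is one common way to avoid representational collapse.
target_encoder.load_state_dict(context_encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def training_step(context_patches, target_patches, ema=0.996):
    """context_patches, target_patches: (batch, 768) toy flattened video patches."""
    z_context = context_encoder(context_patches)   # embed the visible context
    z_pred = predictor(z_context)                  # predict the target's *embedding*
    with torch.no_grad():
        z_target = target_encoder(target_patches)  # the target is a representation, not pixels
    loss = F.smooth_l1_loss(z_pred, z_target)      # the loss lives in representation space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                          # EMA update of the target encoder
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(ema).add_((1 - ema) * p_c)
    return loss.item()

# Toy usage: random tensors standing in for real masked/unmasked video patches.
print(training_step(torch.randn(32, 768), torch.randn(32, 768)))
```

The point of the sketch is just that the loss never touches pixel space: what counts as "getting the missing content right" is defined in a learned representation. Whether that turns out to be the better bet is exactly the open research question.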
I'm very skeptical of AI benchmarks in general because I tend to think they have poor construct validity, depending on how you interpret them; i.e., insofar as they attempt to measure cognitive abilities or aspects of general intelligence, they mostly don't measure those things successfully.[1]
The clearest and crudest example to illustrate this point is LLM performance on IQ tests. The naive interpretation is that if an LLM scores above average on an IQ test, i.e., above 100, then it must have the cognitive properties a human has when they score above average on an IQ test; that is, such an LLM must be a general intelligence. But many LLMs, such as GPT-4 and Claude 3 Opus, score well above 100 on IQ tests. Are GPT-4 and Claude 3 Opus therefore AGIs? No, of course not. So IQ tests don't have construct validity when applied to LLMs, at least if you interpret them as measuring general intelligence in AI systems.
I don't think anybody really believes IQ tests actually prove LLMs are AGIs, which is why it's a useful example. But people often do use benchmarks to compare LLM intelligence to human intelligence based on similar reasoning. I don't think the reasoning is any more valid with those benchmarks than it is for IQ tests.
Benchmarks are useful for measuring certain things; I'm not trying to argue with narrow interpretations. I'm specifically arguing against the use of benchmarks to put general intelligence on a number line, such that a lower score on a benchmark means an AI system is further from general intelligence and a higher score means it is closer. This isn't valid with IQ tests, and it isn't valid with most benchmarks.
Researchers can validly use benchmarks as a measure of performance, but I want to warn against the overbroad interpretation of benchmarks as if they were scientific tests of cognitive ability or general intelligence, which they aren't.
Just one example of what I mean: if you show AI models an image of a 3D model of an object, such as a folding chair, in a typical pose, they will correctly classify the object 99.6% of the time. You might conclude that these AI models have a good visual understanding of these objects: of what they are and of how they look. But if you just rotate the 3D models into an atypical pose, such as showing the folding chair upside-down, object recognition accuracy drops to 67.1%. The error rate increases 82-fold, from 0.4% to 32.9%. (Humans perform equally well regardless of whether the pose is typical or atypical: good robustness!)
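For anyone who wants to check that arithmetic, here it is spelled out (the accuracy figures are the ones from the study described above; the few lines of Python are just my illustration):

```python
# The same accuracy drop looks far more dramatic once you look at error rates.
typical_acc = 0.996    # accuracy on objects in typical poses
atypical_acc = 0.671   # accuracy on the same objects in atypical poses (e.g. upside-down)

typical_err = 1 - typical_acc      # 0.4% error
atypical_err = 1 - atypical_acc    # 32.9% error

print(f"Error rate goes from {typical_err:.1%} to {atypical_err:.1%}, "
      f"an {atypical_err / typical_err:.0f}x increase.")
# Error rate goes from 0.4% to 32.9%, an 82x increase.
```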
Usually, when we measure AI performance on some dataset or set of tasks, we don't do this kind of perturbation to test robustness. And this is just one way you can call the construct validity of benchmarks into question, at least when benchmarks are construed more broadly than their creators probably intend, which in most cases they are.
Economic performance is a more robust test of AI capabilities than almost anything else. However, it's also a harsh and unforgiving test, which doesn't allow us to measure early progress.
Edited on Monday, December 8, 2025 at 12:00pm Eastern to add: I just realized I've been lumping criterion validity in with construct validity, but they are two different concepts. Both are important in this context, and both fall under the umbrella of measurement validity.