Sorry, I don’t mean models that you consider to be better, but rather metrics/behaviors. Like what can V-JEPA-2 (or any model) do that previous models couldn’t which you would consider to be a sign of progress?
The V-JEPA 2 abstract explains this:
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
Again, the caveat here is that this is Meta touting their own results, so I take it with a grain of salt.
I don’t think higher scores on the benchmarks mentioned automatically imply progress on the underlying technical challenge. What I see as a sign of progress is more about the technical ideas behind V-JEPA 2 — Yann LeCun has explained the rationale for these ideas — and where they could ultimately go given further research.
I’m very skeptical of AI benchmarks in general because I tend to think they have poor construct validity, depending on how you interpret them: insofar as they attempt to measure cognitive abilities or aspects of general intelligence, they mostly don’t measure those things successfully.[1]
The clearest and crudest example to illustrate this point is LLM performance on IQ tests. The naive interpretation is that if an LLM scores above average on an IQ test, i.e., above 100, then it must have the cognitive properties a human has when they score above average on an IQ test; that is, such an LLM must be a general intelligence. But many LLMs, such as GPT-4 and Claude 3 Opus, score well above 100 on IQ tests. Are GPT-4 and Claude 3 Opus therefore AGIs? No, of course not. So IQ tests don’t have construct validity when applied to LLMs, if you interpret them as measuring general intelligence in AI systems.
I don’t think anybody really believes IQ tests actually prove LLMs are AGIs, which is why it’s a useful example. But people often do use benchmarks to compare LLM intelligence to human intelligence based on similar reasoning. I don’t think the reasoning is any more valid with those benchmarks than it is for IQ tests.
Benchmarks are useful for measuring certain things; I’m not trying to argue against narrow interpretations of them. I’m specifically arguing against the use of benchmarks to put general intelligence on a number line, such that a lower score on a benchmark means an AI system is further from general intelligence and a higher score means it is closer. This isn’t valid for IQ tests and it isn’t valid for most benchmarks.
Researchers can validly use benchmarks as a measure of performance, but I want to guard against the overly broad interpretation of benchmarks, as if they were scientific tests of cognitive ability or general intelligence — which they aren’t.
Just one example of what I mean: if you show AI models an image of a 3D model of an object, such as a folding chair, in a typical pose, they will correctly classify the object 99.6% of the time. You might conclude: these AI models have a good visual understanding of these objects, of what they are, of how they look. But if you just rotate the 3D models into an atypical pose, such as showing the folding chair upside-down, object recognition accuracy drops to 67.1%. The error rate increases roughly 82-fold, from 0.4% to 32.9%. (Humans perform equally well regardless of whether the pose is typical or atypical — good robustness!)
Usually, when we measure AI performance on some dataset or some set of tasks, we don’t do this kind of perturbation to test robustness. And this is just one way you can call the construct validity of benchmarks into question (at least insofar as benchmarks are being construed more broadly than their creators, in most cases, probably intend).
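To make the idea concrete, here is a minimal sketch of what such a perturbation check might look like, assuming a pretrained ImageNet classifier from torchvision and renders of the same object in a typical and an atypical pose. The file paths are hypothetical and the figures in the comments are just the numbers quoted above, not a reproduction of that study.

```python
# A rough sketch of a pose-perturbation robustness check, not the exact
# procedure from the study quoted above. Assumes torchvision's pretrained
# ResNet-50 and hypothetical folders of rendered images.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()  # resize/crop/normalize pipeline for these weights

# Look up the ImageNet class index rather than hard-coding it.
FOLDING_CHAIR = weights.meta["categories"].index("folding chair")

def top1_accuracy(samples):
    """samples: iterable of (image_path, imagenet_class_index) pairs."""
    samples = list(samples)
    correct = 0
    with torch.no_grad():
        for path, label in samples:
            x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            correct += int(model(x).argmax(dim=1).item() == label)
    return correct / len(samples)

# Hypothetical renders of the same object in a typical and an atypical pose.
typical = [("renders/folding_chair_upright.png", FOLDING_CHAIR)]
atypical = [("renders/folding_chair_upside_down.png", FOLDING_CHAIR)]

acc_typ = top1_accuracy(typical)    # the study quoted above reports ~99.6%
acc_atyp = top1_accuracy(atypical)  # ...and ~67.1% on atypical poses
error_ratio = (1 - acc_atyp) / max(1 - acc_typ, 1e-9)
print(f"typical: {acc_typ:.1%}  atypical: {acc_atyp:.1%}  "
      f"error-rate ratio: {error_ratio:.0f}x")
```

With one image per condition this obviously proves nothing on its own; the point is just that running the same evaluation harness on perturbed inputs can give a very different picture of a model’s “understanding” than the headline benchmark number.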
Economic performance is a more robust test of AI capabilities than almost anything else. However, it’s also a harsh and unforgiving test, which doesn’t allow us to measure early progress.
Edited on Monday, December 8, 2025 at 12:00pm Eastern to add: I just realized I’ve been lumping criterion validity in with construct validity, but they are two different concepts. Both are important in this context, and both fall under the umbrella of measurement validity.