Can you *really* train AI to “get” videos just by showing it a million of them?

Video models have become astonishingly capable. Sora and its peers can generate spatiotemporally coherent video sequences that look photorealistic, maintain object continuity across frames, and respect basic physical constraints. By conventional measures, they’re superhuman at video production.

But there’s a gap nobody has been measuring systematically. Can these models actually reason about what’s happening in a video? Can they understand causality, spatial relationships, how objects interact, why certain outcomes follow from certain actions? Or are they just pattern-matching at superhuman scale, replicating visual texture without grasping the underlying structure?

The distinction matters. A model might generate a flawless video of a cup falling and breaking while fundamentally misunderstanding gravity, momentum, or fragility. It might produce spatiotemporally perfect sequences while reasoning about them in ways that would fail immediately on variations it hasn’t seen before. Video modeling research has so far optimized for what’s easy to measure, not what matters.

This measurement blind spot exists because existing video reasoning benchmarks are tiny: a few thousand samples spread across a handful of task types, rarely exceeding 50 distinct reasoning problems. You can’t study scaling behavior on datasets that small. You can’t distinguish between genuine understanding and pattern memorization. You can’t watch reasoning abilities emerge as models grow larger and more sophisticated.
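
A quick back-of-the-envelope calculation shows how little a 50-problem benchmark can resolve. The sketch below is a minimal Python illustration, assuming a simple binomial model of per-problem accuracy; the sample sizes and the 60% accuracy are made-up numbers, not figures from any real benchmark:

```python
import math

def ci_halfwidth(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% confidence interval
    for an accuracy estimated from n independent problems."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

# How precisely can each benchmark size pin down a model's true accuracy?
for n in (50, 1_000, 100_000):
    hw = ci_halfwidth(accuracy=0.6, n=n)
    print(f"n={n:>6}: measured 60% accuracy is 60% +/- {hw * 100:.1f} points")
```

With 50 problems the error bars come out to roughly ±14 points, wide enough to swallow the very differences a scaling study is trying to detect; at 100,000 problems they shrink below half a point.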

Right now we’re building increasingly capable video models while remaining almost entirely ignorant about whether they’re actually reasoning about the spatiotemporal world or just performing statistical compression on visual data at superhuman fidelity.

Rethinking how to measure reasoning

Before building a dataset, researchers need to ask a prior question: what exactly should we measure?

This is where conventional benchmarking approaches break down. Most video datasets throw mixed tasks at models without understanding what cognitive abilities each task targets. There’s no underlying theory of what “video reasoning” actually consists of, so there’s no principled way to know whether you’re measuring the right things or just chasing whatever scores highest on your metric.
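
To make the alternative concrete, here is one possible shape a principled benchmark could take: tag every task with the cognitive abilities it is meant to exercise, then report scores per ability rather than as a single aggregate number. Everything in this sketch (the taxonomy, the task names, the scores) is invented for illustration and comes from no existing benchmark:

```python
from collections import defaultdict

# Hypothetical taxonomy: each task type is tagged with the reasoning
# abilities it is meant to exercise. These labels are illustrative only.
TASK_ABILITIES = {
    "predict_trajectory": ["physical_causality", "spatial"],
    "count_occlusions":   ["object_permanence", "spatial"],
    "order_events":       ["temporal", "physical_causality"],
}

def per_ability_scores(task_scores: dict[str, float]) -> dict[str, float]:
    """Average task accuracy per targeted ability, so a model's
    profile is visible instead of a single aggregate number."""
    totals, counts = defaultdict(float), defaultdict(int)
    for task, score in task_scores.items():
        for ability in TASK_ABILITIES[task]:
            totals[ability] += score
            counts[ability] += 1
    return {ability: totals[ability] / counts[ability] for ability in totals}

print(per_ability_scores({
    "predict_trajectory": 0.82,
    "count_occlusions":   0.41,
    "order_events":       0.67,
}))
```

The payoff is in the structure, not the particular labels: a model that aces trajectory prediction while failing occlusion counting shows up as strong on physical causality but weak on object permanence, a pattern a single leaderboard number would hide.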

