Current text-to-video models excel at one thing: generating a stunning 10-second clip. Give them a prompt and they'll create visually impressive footage with coherent motion and lighting. But ask them to make a short film, and something breaks. Characters shift appearance mid-scene. Backgrounds contradict themselves. A character who was sitting down suddenly stands up for no reason. The narrative falls apart because the model was never designed to think about continuity across multiple shots.
The root cause is architectural. Models like Sora, Kling, and Vidu treat each generation as an independent task. You provide a prompt, they generate a clip, and that's the end of the matter. There's no mechanism for persistence, no way for the model to remember who the characters are or where the story is going. When researchers tried asking these models to generate multi-shot sequences by providing multiple shot descriptions at once, the results were predictable: either the models ignored the multi-shot instructions entirely and produced one continuous clip, or, if they respected the structure, the resulting shots featured wildly inconsistent characters and settings.
Three distinct problems layer on top of each other. First is consistency: a character's face, clothing, and position in space must persist across cuts. Second is spatial reasoning: if shot one shows a character entering a room from the left, shot two must respect that spatial relationship, positioning the character logically relative to where they entered. Third is causal logic: if shot one shows someone picking up an empty glass and shot two shows them pouring water, then shot three should show that glass now containing water. The model needs to track state changes, not just visual repetition.
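To make that third requirement concrete, here is a minimal, hypothetical sketch of the kind of continuity state a multi-shot generator would have to carry between cuts. The class and field names are illustrative, not taken from any existing model.

```python
from dataclasses import dataclass, field

@dataclass
class ShotState:
    """Continuity ledger carried from one shot to the next (illustrative only)."""
    character_appearance: dict = field(default_factory=dict)  # e.g. {"hero": "red coat"}
    character_position: dict = field(default_factory=dict)    # e.g. {"hero": "left of doorway"}
    object_state: dict = field(default_factory=dict)          # e.g. {"glass": "empty"}

# Shot one: the character picks up an empty glass.
state = ShotState(object_state={"glass": "empty"})

# Shot two: they pour water, so the tracked state changes, not just the pixels.
state.object_state["glass"] = "full of water"

# Shot three should be generated against the updated state, so the glass
# appears full without the prompt having to restate it.
```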
The tempting workaround is to stitch independent clips together with a separate continuity model. This fails because the damage is already done. If shot one has a character looking left and shot two (generated independently) has them looking right, no stitching algorithm fixes that contradiction. You're editing your way out of a generation problem, which is like trying to fix a bad take by cutting around it. Sometimes that masks the problem, but usually you end up with a bad film that has merely been disguised.
What does it take to maintain coherence?
Before diving into technical solutions, consider how film directors actually work. They don’t shoot scene one, leave, and return weeks later for scene two. Instead, they hold an entire vision of the film in their head. They block out scenes knowing how they connect, maintain detailed continuity notes, and think several shots ahead to set up visual and narrative payoffs. They’re thinking holistically about the entire film.
The traditional approach to longer-form video generation is sequential: generate shot one, freeze it, then generate shot two conditional on shot one’s output. This is like translating a book one sentence at a time, where each sentence is optimized in isolation and later sentences can’t go back and fix earlier ones.
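In code, that sequential strategy looks roughly like the sketch below. The `generate_shot` function is a hypothetical stand-in for any clip generator that can condition on a previously generated clip.

```python
def generate_film_sequentially(shot_prompts, generate_shot):
    """Generate shots one after another, each conditioned only on the previous clip."""
    shots = []
    previous = None
    for prompt in shot_prompts:
        # Once a shot is generated it is frozen; later shots cannot go back and revise it.
        clip = generate_shot(prompt, condition_on=previous)
        shots.append(clip)
        previous = clip
    return shots
```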
An alternative exists: hold all the shot representations in a shared memory space and update them together repeatedly before rendering any final pixels. This is more like a writer who sketches the entire plot first, then refines all chapters simultaneously, weaving threads throughout.
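A rough sketch of that alternative, again with hypothetical helpers (`init_latents`, `refine_jointly`, `decode`), makes the contrast clear: nothing is frozen until every shot has been refined together.

```python
def generate_film_jointly(shot_prompts, init_latents, refine_jointly, decode, steps=50):
    """Keep every shot's latent in one shared state and refine them all together."""
    latents = [init_latents(prompt) for prompt in shot_prompts]
    for _ in range(steps):
        # Each refinement pass sees all latents at once, so earlier shots can
        # still adapt to what later shots need, and vice versa.
        latents = refine_jointly(latents, shot_prompts)
    # Only after joint refinement are any pixels rendered.
    return [decode(latent) for latent in latents]
```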
Why processing all shots together changes everything
A new paper proposes a solution to this problem: HoloCine. Its breakthrough is architectural. Rather than generating shots sequentially or independently, the model processes all shots' latent representations jointly in a unified context. This means that when the model works out what shot two should look like, it is literally attending to the representations of shots one, three, and four at the same time. The shots' representations communicate with one another throughout generation rather than being finalized one at a time.
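Mechanically, "a unified context" means something like the toy example below: concatenate every shot's latent tokens into one sequence and run full self-attention over it, so tokens from any shot can attend to tokens from any other. This is an illustrative sketch using a single PyTorch attention call, not HoloCine's actual implementation.

```python
import torch
import torch.nn.functional as F

num_shots, tokens_per_shot, dim = 4, 256, 64

# One latent tensor per shot (random placeholders standing in for encoded video latents).
shot_latents = [torch.randn(tokens_per_shot, dim) for _ in range(num_shots)]

# The "unified context": all shots concatenated into a single token sequence.
context = torch.cat(shot_latents, dim=0)            # shape (1024, 64)

# Full self-attention over the joint sequence: every token can attend to
# every other token, regardless of which shot it came from.
q = k = v = context.unsqueeze(0)                    # add a batch dimension
joint = F.scaled_dot_product_attention(q, k, v)     # shape (1, 1024, 64)

# Split back into per-shot representations after they have exchanged information.
updated_shots = joint.squeeze(0).chunk(num_shots, dim=0)
```

The toy uses a single attention layer; a real video diffusion backbone would presumably apply this kind of cross-shot attention inside many layers and at every refinement step, which is what lets the shots negotiate with each other before any final pixels are rendered.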


