Imagine watching a video call where the other person nods at exactly the moment you start talking, but their expression stays blank until you finish. That’s what current talking head avatars do. They excel at lip-syncing to audio, generating convincing mouth movements from sound alone. But they fail at something more fundamental: they don’t react. A real conversation partner tilts their head when confused, smiles when you share good news, nods along as you speak. Current avatars are frozen statues that only move their mouths.
This kills the illusion of genuine interaction. When you talk to someone who doesn’t react, you stop believing they’re listening. The uncanny valley isn’t only about photorealism or animation quality; it’s also about responsiveness.
The root cause traces back to architecture. Existing models like INFP (the current baseline) use bidirectional processing: they look at the entire temporal window of a conversation to generate motion, which means they need the full context before reacting. It’s like watching a film you’ve already seen, where you know what’s coming. This approach has a fatal cost for real-time interaction: latency. To generate facial reactions properly, the model needs 500ms or more of temporal context. But humans perceive conversation partners as responsive only when reactions arrive within 200-300ms. Past that threshold, it stops feeling like conversation and starts feeling like broadcast performance.
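To make the bidirectional-versus-causal point concrete, here is a minimal sketch in plain NumPy (not the INFP code; the 25 fps frame rate and 16-frame window are assumptions chosen for illustration). It computes how much future context each output frame must buffer under each kind of attention mask, which is where the latency gap comes from.

```python
import numpy as np

FPS = 25            # assumed motion frame rate
WINDOW_FRAMES = 16  # assumed temporal window (~640 ms at 25 fps)

def bidirectional_mask(n):
    # Every frame attends to every other frame: no frame in the window
    # can be emitted until all n frames have been observed.
    return np.ones((n, n), dtype=bool)

def causal_mask(n):
    # Frame t attends only to frames 0..t: it can be emitted as soon as
    # frame t itself arrives.
    return np.tril(np.ones((n, n), dtype=bool))

def lookahead_ms(mask, fps):
    # For each frame t, count how many *future* frames its row attends to;
    # that count is the buffering delay before frame t can be produced.
    n = mask.shape[0]
    future = [int(mask[t, t + 1:].sum()) for t in range(n)]
    return [f * 1000.0 / fps for f in future]

bi = lookahead_ms(bidirectional_mask(WINDOW_FRAMES), FPS)
ca = lookahead_ms(causal_mask(WINDOW_FRAMES), FPS)
print(f"bidirectional: frame 0 waits on {bi[0]:.0f} ms of future context")
print(f"causal:        frame 0 waits on {ca[0]:.0f} ms of future context")
```

Under these assumed numbers the bidirectional mask forces the first frame of each window to wait roughly 600ms for future context, well past the 200-300ms responsiveness threshold, while the causal mask needs zero lookahead and leaves only model compute time as latency.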
There’s also an expressiveness problem. Even when these models do react, they’re timid. A person listening to good news shows genuine delight; current models produce neutral micro-movements. No one teaches them what an expressive reaction looks like, so they default to cautious, muted responses. But collecting thousands of labeled examples of “good reaction vs. bad reaction” would be expensive and impractical.


