Discussion about this post

Phil Chacko

I’ve been trying to understand how models that output pixels, rather than intermediate representations, could offer enough creative control. Editing via chat alone isn’t going to be it. You make a really useful point that these representations live in latent space.

Seems like there’s plenty of work to do on editing UI on top of latent space?

KBS Sidhu

Fascinating and sharply argued piece — especially the framing of multimodal “composition” as the real unlock rather than raw text-to-video. The comparison between explicit (engine-based) and implicit (world model) approaches is also a helpful lens for thinking about where this is headed.

For a future article, I’d love to see three issues explored more rigorously:

Benchmarking methodology. When we say Seedance or Kling are “a generation ahead,” what were the prompts, seeds, rejection rates, and post-processing steps? A transparent comparison framework would make the case much stronger.

Enterprise viability and IP risk. If looser data regimes are part of the advantage, how does that affect global deployment, licensing, and brand-safe adoption outside China?

Editability vs. spectacle. How close are these models to deterministic, frame-level control suitable for production workflows, as opposed to impressive but stochastic outputs?

Overall, this was a compelling synthesis — it would be great to see the next piece go one layer deeper into the structural constraints behind the hype.
