Pyramid Flow Introduces a New Approach to Generative Video
October 14, 2024
Generative video models seem to be debuting daily. Pyramid Flow, among the latest, aims for realism, producing dynamic video sequences that have temporal consistency and rich detail while being open source and free. The model can create clips of up to 10 seconds using both text and image prompts. It offers a cinematic look, supporting 1280×768 pixel resolution clips at 24 fps. Developed by a consortium of researchers from Peking University, Beijing University of Posts and Telecommunications and Kuaishou Technology, Pyramid Flow harnesses a new technique that starts with low-resolution video, outputting at full-res only at the end of the process.
The raw code for Pyramid Flow is available for download from GitHub and Hugging Face. A separate GitHub repository offers an optimized inference shell, but the code must be downloaded and run locally.
“At inference, the model can generate a 5-second, 384p video in just 56 seconds — on par with or faster than many full-sequence diffusion counterparts — though Runway’s Gen 3-Alpha Turbo still takes cake in terms of speed of AI video generation, coming in at under one minute and often times 10-20 seconds in our tests,” VentureBeat writes, lauding the videos posted by the Pyramid model creators as “incredibly lifelike, high enough resolution, and compelling.”
Examples are posted on the model’s GitHub project page (the screen grab above is from the sample described as “snowy Tokyo city is bustling”).
It’s worth noting that Kuaishou is the Beijing-based creator of the well-received Kling AI generative video model released in June. “Compared to closed-source generators like Sora and [Kling], Pyramid Flow holds its own, delivering results that are sometimes indistinguishable from those of its proprietary competitors,” writes one AI enthusiast on Medium in an article that is objective enough to cite flaws like “deformations in complex scenes” and “high memory requirements.”
VentureBeat, too, takes issue with Pyramid’s limitations, noting “it lacks some of the advanced fine-tuning capabilities found in models like Runway Gen-3 Alpha, which offers precise control over cinematic elements like camera angles, keyframes, and human gestures. Similarly, Luma’s Dream Machine provides advanced camera control options that Pyramid Flow is still catching up to.”
A technical paper details the novel approach. “This work introduces a unified pyramidal flow matching algorithm,” the researchers explain. “It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling.”
“As they write, the ability to compress and optimize video generation at different stages leads to faster convergence during training, allowing Pyramid Flow to generate more samples per training batch,” reports VentureBeat. “The proposed pyramidal flow reduces the token count by a factor of four compared to traditional diffusion models, which results in more efficient training.”
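To give a rough sense of why working at lower resolution saves so much computation, here is a minimal Python sketch. It is not taken from the Pyramid Flow code. The function name, the three-stage setup and the grid dimensions are all hypothetical, and it assumes each earlier stage halves the height and width of the one after it. It only illustrates the arithmetic and is not meant to reproduce the paper's reported figures.

```python
def tokens_per_stage(frames, height, width, num_stages):
    """Return token counts from the coarsest stage to the full-resolution stage.

    Assumption (not from the source): each earlier stage halves both
    spatial dimensions relative to the stage that follows it.
    """
    counts = []
    for stage in range(num_stages):
        # Stage 0 is the coarsest; the last stage runs at full resolution.
        scale = 2 ** (num_stages - 1 - stage)
        counts.append(frames * (height // scale) * (width // scale))
    return counts


# Hypothetical grid: 8 frames of 48 x 80 tokens at full resolution.
stage_tokens = tokens_per_stage(frames=8, height=48, width=80, num_stages=3)

for index, count in enumerate(stage_tokens):
    print(f"stage {index}: {count} tokens")
```

Under these assumptions, halving both height and width leaves a stage with one quarter of the tokens of the stage after it, so only the final stage pays the full-resolution cost.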