Stability Introduces GenAI Video Model: Stable Video Diffusion

Stability AI has opened a research preview of its first foundation model for generative video, Stable Video Diffusion, offering text-to-video and image-to-video capabilities. Based on the company’s Stable Diffusion text-to-image model, the new open-source model generates video by animating existing still frames, including “multi-view synthesis.” While the company plans to enhance and extend the model’s capabilities, it currently comes in two versions: SVD, which transforms stills into 576×1024 videos of 14 frames, and SVD-XT, which generates up to 24 frames, with each capable of running at between three and 30 frames per second.

SVD and SVD-XT both “generate fairly high-quality four-second clips,” according to TechCrunch, which writes that the samples embedded in a Stable Video Diffusion news release “could go toe-to-toe with outputs from Meta’s recent video-generation model as well as AI-produced examples we’ve seen from Google and AI startups Runway and Pika Labs.”

“Generative models for different modalities promise to revolutionize the landscape of media creation and use,” Stability explains in a research paper that focuses on the influence of data selection on generative video, calling it uncharted territory.

Taking it as established that “pretraining on a large and diverse dataset and finetuning on a smaller but higher quality dataset significantly improves performance,” Stability’s work directly addresses “the separation of video pretraining at lower resolutions and high-quality finetuning.”

The research paper also discloses concerns and limitations, writing that “reducing the potential to use [generative video models] for creating misinformation and harm are crucial aspects before real-world deployment.”

With regard to long video synthesis, Stability says that while its latent approach provides some efficiency benefits, “generating multiple key frames at once is expensive,” suggesting “future work on long video synthesis should either try a cascade of very coarse frame generation, or build dedicated tokenizers for video generation.”

Stability researchers also note that “videos generated with our approach sometimes suffer from too little generated motion” and that video diffusion models, including this one, “are typically slow to sample and have high VRAM requirements.”

The paper identifies diffusion distillation methods as “promising candidates” for addressing those performance issues. TechCrunch, for its part, is critical of the fact that Stable Video Diffusion “can’t generate videos without motion or slow camera pans, be controlled by text, render text (at least not legibly) or consistently generate faces and people ‘properly.’”

Though an area of intense interest, generative video still has a long way to go. To showcase what it has accomplished so far, Stability plans an upcoming web browser-based text-to-video experience that will allow some practical application in sectors including advertising, education and entertainment. Interested users are invited to sign up for the waitlist.

Developers can apply for access to the code for the Stable Video Diffusion research release at the company’s GitHub repository, with the weights required to run the model locally housed on its Hugging Face page.
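
For readers curious how the downloaded weights might be run locally, the following minimal sketch uses the Hugging Face diffusers library, which the article does not name; the checkpoint identifier, resolution, seed and frame-rate settings are assumptions based on the SVD-XT specs described above, not an official workflow from Stability.

# Minimal image-to-video sketch (assumes the diffusers library and the
# "stabilityai/stable-video-diffusion-img2vid-xt" checkpoint; neither is
# specified in the article).
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the SVD-XT weights in half precision to keep VRAM usage manageable.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Conditioning still frame, resized to the model's native 1024x576 resolution.
image = load_image("input_still.png").resize((1024, 576))

# Animate the still; decode_chunk_size trades VRAM for decoding speed.
generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

# Write the frames out as a short clip at 7 fps (within the stated 3-30 fps range).
export_to_video(frames, "generated.mp4", fps=7)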

Earlier this month, Stability announced Stable 3D, a model for generating 3D content the company says can be suitable for use in areas like graphic design and video game development.
