Microsoft’s VASA-1 Can Generate Talking Faces in Real Time

By ETCentric Staff
April 22, 2024

Microsoft has developed VASA, a framework for generating lifelike virtual characters with vocal capabilities including speaking and singing. The first model, VASA-1, can perform the feat in real time from a single static image and a vocalization clip. The research demo showcases realistic talking faces, driven by audio, that can be adjusted to look in different directions or change expression. Clips run up to one minute at 512 x 512 pixels and up to 40fps “with negligible starting latency,” according to Microsoft, which says “it paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.”

VASA stands for “visual affective skills avatar.” VASA-1 can produce not only lip sync movements but also “a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness,” Microsoft Research Asia researchers write in the introduction to a scientific paper.

The landing page showcases Mona Lisa rapping as well as multiple examples of a single headshot converted into a video of that avatar talking or singing.

“The project marks a significant shift in what has been achieved in AI-generated content as it works with very minimal input,” writes VentureBeat, emphasizing that it works with one image, and describing how emotions can be adjusted “by simply moving a slider up and down.”

VASA-1 can also handle content that was not included in the training dataset, including artistic photos, singing audio and non-English speech.

This year has seen a flurry of audio-centric artificial intelligence. ElevenLabs demoed a text-to-sound app used to add vocalization to faces generated by third-party visual apps, which Pika Labs adapted into a product called Pika Lip Sync that makes Pika-generated images speak.

SiliconANGLE says that although Nvidia’s Audio2Face and Runway’s Lip Sync can be used for similar results, “VASA-1 seems to be able to create much more realistic talking heads, with reduced mouth artifacts.”

VentureBeat, however, is critical of VASA-1’s output, which can reach 45fps in offline batch processing mode. It suggests the “movement does not appear smooth” in some of the clips, making them look AI-generated.

Microsoft feels the VASA-1 output is realistic enough to facilitate misuse through deepfakes, and therefore has “no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly.”

Related:
Microsoft’s New AI Video Tool Could Be the Next Internet Revolution, Android Authority, 4/18/24

