Alibaba’s EMO Can Generate Performance Video from Images

Alibaba is touting a new artificial intelligence system that can animate portraits, making people sing and talk in realistic fashion. Researchers at the Alibaba Group’s Institute for Intelligent Computing developed the generative video framework, calling it EMO, short for Emote Portrait Alive. Input a single reference image along with “vocal audio,” as in talking or singing, and “our method can generate vocal avatar videos with expressive facial expressions and various head poses,” the researchers say, adding that EMO can generate videos of any duration, “depending on the length of input audio.”
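Alibaba has released no code or public API for EMO, but the workflow the researchers describe reduces to a simple contract: one portrait still plus one vocal-audio track in, one avatar video out, with the video’s length following the audio. The Python stub below is a purely hypothetical sketch of that contract; every name in it is invented for illustration.

```python
# Purely hypothetical interface sketch: Alibaba has published no code or
# API for EMO, so this stub only mirrors the input/output contract the
# researchers describe. All names below are invented for illustration.
from dataclasses import dataclass

@dataclass
class EmoRequest:
    reference_image: str  # a single portrait still, e.g. "portrait.png"
    vocal_audio: str      # a talking or singing track, e.g. "song.wav"

def generate_avatar_video(request: EmoRequest) -> str:
    """Stand-in for the model: in the real system, the output video's
    duration would track the length of the input audio."""
    return (f"avatar video of {request.reference_image} "
            f"lip-synced to {request.vocal_audio}")

print(generate_avatar_video(EmoRequest("portrait.png", "song.wav")))
```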

VentureBeat suggests EMO “represents a major advance in audio-driven talking head video generation, an area that has challenged AI researchers for years,” describing it as “able to create fluid and expressive facial movements and head poses that closely match the nuances of a provided audio track.”

A key aspect of EMO is its ability to “sync lips in a synthesized video clip with real audio”; because the mouth movements are driven directly by the audio signal, “the model supports songs across multiple languages.” Visually, it can simulate a range of artistic styles, from photographic to painterly to animated, and it handles a variety of audio inputs.

“The model, built on a Stable Diffusion backbone, is not the first of its kind but arguably the most effective,” writes PetaPixel, declaring “the overall accuracy of the lip movements in response to the input audio is remarkable.”

“YouTube channel RINKI compiled all of Alibaba’s demo clips and upscaled them to 4K,” notes PetaPixel.

“Traditional techniques often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles,” the researchers explain in an EMO paper that describes “a novel framework that utilizes a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks.”
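To make the “direct audio-to-video” idea concrete, the PyTorch sketch below shows the conditioning path in its simplest form: a block of video-frame latents attends to audio features and is iteratively denoised, with no 3D face model or landmark stage in between. Everything here (module names, dimensions, the crude update rule) is an illustrative assumption rather than Alibaba’s published code.

```python
# Illustrative sketch of an audio-conditioned denoiser in the spirit of
# "direct audio-to-video" synthesis. Module names, shapes, and the update
# rule are assumptions for demonstration, not Alibaba's EMO code.
import torch
import torch.nn as nn

class AudioConditionedDenoiser(nn.Module):
    """Predicts noise for a block of video-frame latents, conditioned
    directly on audio features; no 3D face model or landmarks involved."""
    def __init__(self, latent_dim: int = 64, audio_dim: int = 128, heads: int = 4):
        super().__init__()
        self.proj_in = nn.Linear(latent_dim, latent_dim)
        # Cross-attention: each frame latent attends to audio feature tokens.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=heads,
            kdim=audio_dim, vdim=audio_dim, batch_first=True)
        self.proj_out = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_latents: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        h = self.proj_in(noisy_latents)
        h, _ = self.cross_attn(h, audio_feats, audio_feats)
        return self.proj_out(h)  # predicted noise, same shape as the latents

# Toy reverse-diffusion loop for F frames of latent video.
F, latent_dim, audio_dim = 16, 64, 128
model = AudioConditionedDenoiser(latent_dim, audio_dim)
latents = torch.randn(1, F, latent_dim)      # start from pure noise
audio_feats = torch.randn(1, F, audio_dim)   # stand-in for per-frame audio features
with torch.no_grad():
    for step in range(50):
        pred_noise = model(latents, audio_feats)
        latents = latents - 0.02 * pred_noise  # crude update; a real scheduler goes here
```

In EMO itself the denoiser is a Stable Diffusion-style network and the update would be handled by a proper diffusion scheduler; the sketch compresses both to keep the audio-conditioning path visible.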

Numerous examples can be seen on the Institute for Intelligent Computing’s GitHub page: an AI Leonardo DiCaprio synced to a cover of Eminem’s “Godzilla,” various female AI creations (including OpenAI’s now-famous sunglasses-clad “Sora Lady”), and Leonardo da Vinci’s “Mona Lisa” reciting Shakespeare.

“The EMO research hints at a future where personalized video content can be synthesized from just a photo and an audio clip,” reports VentureBeat, flagging ethical concerns about deepfakes and “potential misuse of such technology to impersonate people without consent or spread misinformation.”

The Alibaba researchers hint at plans to develop synthetic media detection methods.
