DeepMind’s V2A Generates Music, Sound Effects, Dialogue

Google DeepMind has unveiled new research on AI tech it calls V2A (“video-to-audio”) that can generate soundtracks for videos. The initiative complements the wave of AI video generators from companies ranging from biggies like OpenAI and Alibaba to startups such as Luma and Runway, all of which require a separate app to add sound. V2A technology “makes synchronized audiovisual generation possible” by combining video pixels with natural language text prompts “to generate rich soundscapes for the on-screen action,” DeepMind writes, explaining that it can “create shots with a dramatic score, realistic sound effects or dialogue.”

V2A “can also generate soundtracks for a range of traditional footage, including archival material, silent films and more,” DeepMind notes in a blog post explaining how features like unlimited output and positive and negative prompts offer the flexibility of rapid experimentation along with enhanced creative control.

The post includes samples based on prompts including “jellyfish pulsating under water,” “car skidding” and “mellow harmonica plays as the sun goes down.” The examples are paired with Veo, which DeepMind describes as its “most capable generative video model.”

“AI-powered sound-generating tools aren’t novel,” writes TechCrunch. This summer, Stability AI released the free Stable Audio Open, while ElevenLabs launched Sound Effects.

TechCrunch elaborates on sound generators from Pika Labs and GenreX, which “have trained models to take a video and make a best guess at what music or effects are appropriate in a given scene,” and a Microsoft project produces “talking and singing videos from a still image.”

The optional text prompts can, however, “be used to shape and refine the final product so that it’s as accurate and as realistic as possible,” Engadget reports, explaining “positive prompts steer the output towards sounds you want” while “negative prompts steer it away from sounds you don’t want.”
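DeepMind has not published how V2A applies these prompts internally. In diffusion-based generators generally, though, positive and negative prompts are often combined via classifier-free guidance: the model's noise prediction is pushed toward the positive-prompt condition and away from the negative one. The sketch below is a minimal, hypothetical illustration of that general technique (the function name and weights are assumptions, not V2A's actual method):

```python
import numpy as np

def guided_prediction(eps_uncond, eps_pos, eps_neg, pos_scale=7.5, neg_scale=1.0):
    """Classifier-free-guidance-style combination of noise predictions.

    eps_uncond: model output with no prompt conditioning
    eps_pos:    output conditioned on the positive prompt ("sounds you want")
    eps_neg:    output conditioned on the negative prompt ("sounds you don't")
    Steers the result toward the positive condition and away from the negative.
    """
    return (eps_uncond
            + pos_scale * (eps_pos - eps_uncond)   # pull toward desired sounds
            - neg_scale * (eps_neg - eps_uncond))  # push away from unwanted sounds

# Toy usage with dummy 3-dimensional predictions:
out = guided_prediction(np.zeros(3), np.ones(3), np.zeros(3),
                        pos_scale=1.0, neg_scale=0.0)
```

With `pos_scale=1.0` and `neg_scale=0.0` the guided output reduces to the positive-conditioned prediction, which is why larger scales are used in practice to amplify prompt adherence.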

DeepMind “trained its AI tool on video, audio, and annotations containing ‘detailed descriptions of sound and transcripts of spoken dialogue,’” which “allows the video-to-audio generator to match audio events with visual scenes,” explains The Verge, adding that such features can help set V2A apart from competing tools.

V2A is being introduced as “Google DeepMind shifts from research lab to AI product factory,” writes Bloomberg. Last year, Google merged its Google Brain AI unit with that of London-based DeepMind, which it acquired in 2014.
