Nvidia AI Model Fugatto a Breakthrough in Generative Sound

Nvidia has unveiled an AI sound model research project called Fugatto that “can create any combination of music, voices and sounds” based on text and audio inputs. Nvidia describes the model as “the world’s most flexible sound machine,” and many appear to agree that it represents an audio breakthrough, with the potential to generate a wide array of sounds that have not previously existed. While popular sound models from companies including Suno and ElevenLabs “can compose a song or modify a voice, none have the dexterity of the new offering,” Nvidia claims.

Fugatto (short for Foundational Generative Audio Transformer Opus 1) offers “a completely new paradigm in how sound and audio are manipulated and transformed by AI,” writes Tom’s Guide, noting “it goes way beyond converting text to speech or producing music from text prompts and delivers some genuinely innovative features we haven’t seen before.”

“What does a screaming saxophone sound like? The Fugatto model has an answer,” reports Ars Technica, citing “a sample-filled website showcasing how Fugatto can be used to dial a number of distinct audio traits and descriptions up or down, resulting in everything from the sound of saxophones barking to people speaking underwater to ambulance sirens singing in a kind of choir.”
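The coverage doesn’t explain how that “dialing” works internally. One common recipe in conditional generative models is classifier-free-guidance-style composition, where each trait’s conditioning gets its own scalar weight. The Python sketch below illustrates only that generic idea; the function and trait names are hypothetical, and Fugatto’s actual mechanism may differ.

```python
import numpy as np

def compose_guidance(uncond, trait_conds, weights):
    """Blend model outputs conditioned on different trait prompts.

    uncond      -- unconditioned model output, shape (d,)
    trait_conds -- mapping of trait name -> conditioned output, shape (d,)
    weights     -- mapping of trait name -> scalar "dial": 0 mutes a
                   trait, 1 applies it fully, values above 1 exaggerate it

    Generic classifier-free-guidance-style composition; NOT Nvidia's code.
    """
    out = uncond.copy()
    for trait, cond in trait_conds.items():
        # Push the output along the direction the trait's conditioning
        # adds on top of the unconditioned output.
        out += weights[trait] * (cond - uncond)
    return out

# Toy example: exaggerate "barking" while keeping "saxophone" at full weight.
rng = np.random.default_rng(0)
d = 8  # stand-in for a latent/feature dimension
uncond = rng.normal(size=d)
conds = {"saxophone": rng.normal(size=d), "barking": rng.normal(size=d)}
mix = compose_guidance(uncond, conds, {"saxophone": 1.0, "barking": 1.5})
print(mix.shape)  # (8,)
```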

Nvidia researchers have published a technical paper on Fugatto’s approach, and Ars Technica takes a detailed look at the platform’s underpinnings, starting with the “difficulty in crafting a training dataset that can ‘reveal meaningful relationships between audio and language.’”

To achieve its goal, “the researchers start by using an LLM to generate a Python script that can create a large number of template-based and free-form instructions describing different audio ‘personas’ (e.g., ‘standard, young-crowd, thirty-somethings, professional’),” Ars Technica writes. “They then generate a set of both absolute (e.g., ‘synthesize a happy voice’) and relative (e.g., ‘increase the happiness of this voice’) instructions that can be applied to those personas.”
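As a rough illustration of that data-generation step, here is a minimal Python sketch that expands template-based absolute and relative instructions across a few audio personas. The personas echo the paper’s examples, but the templates, adjective table, and function names are hypothetical stand-ins; Nvidia’s actual script is generated by an LLM and operates at far larger scale.

```python
import itertools
import random

# Persona descriptions, echoing the paper's examples
# ("standard, young-crowd, thirty-somethings, professional").
PERSONAS = ["standard", "young-crowd", "thirty-somethings", "professional"]

# Audio attributes the instructions manipulate (illustrative only).
ATTRIBUTES = ["happiness", "anger", "calm"]

# Template-based instruction forms: absolute ones describe a target
# state; relative ones describe a change to an existing sound.
ABSOLUTE_TEMPLATES = [
    "synthesize a {attr_adj} voice in a {persona} style",
    "generate {persona} speech that sounds {attr_adj}",
]
RELATIVE_TEMPLATES = [
    "increase the {attr} of this {persona} voice",
    "decrease the {attr} of this {persona} voice",
]

# Crude adjective lookup for the absolute templates (hypothetical).
ADJECTIVES = {"happiness": "happy", "anger": "angry", "calm": "calm"}


def generate_instructions():
    """Expand every (template, persona, attribute) combination into a
    flat list of text instructions, tagged absolute or relative."""
    dataset = []
    for persona, attr in itertools.product(PERSONAS, ATTRIBUTES):
        for tmpl in ABSOLUTE_TEMPLATES:
            dataset.append({
                "type": "absolute",
                "text": tmpl.format(attr_adj=ADJECTIVES[attr], persona=persona),
            })
        for tmpl in RELATIVE_TEMPLATES:
            dataset.append({
                "type": "relative",
                "text": tmpl.format(attr=attr, persona=persona),
            })
    return dataset


if __name__ == "__main__":
    instructions = generate_instructions()
    print(f"{len(instructions)} instructions generated, e.g.:")
    for row in random.sample(instructions, 3):
        print(f"  [{row['type']}] {row['text']}")
```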

“We wanted to create a model that understands and generates sound like humans do,” orchestral conductor and composer Rafael Valle, a manager of applied audio research at Nvidia and one of the dozen-plus researchers behind Fugatto, wrote in a company blog post.

SiliconANGLE cites an Nvidia YouTube video that “demonstrates how Fugatto can generate the sound of a train that slowly morphs into an orchestral performance, change happy voices into angry ones, and so on.”

With Fugatto, Nvidia is joining Meta Platforms, OpenAI and Runway “in releasing a generative artificial intelligence model that’s designed to create ‘new’ music and audio from human language prompts,” SiliconANGLE points out.
