Meta Creates Voicebox Generative AI Model for Audio Synth

Meta Platforms has unveiled Voicebox, a generative AI model that can produce high-quality audio clips and edit pre-recorded audio. It relies on what Meta calls “in-context learning” to handle speech generation tasks it was not specifically trained for, including sampling, stylizing and editing. The company says Voicebox is the first model with this type of generalized learning for audio. As an editor, it can isolate and remove sounds like car horns and background animal noise while preserving the content and style of the source audio. The multilingual model generates speech in six languages.

Voicebox is “a text-guided generative model for speech at scale,” Meta researchers write in a scientific paper that further describes “a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are neither filtered nor enhanced.”

Like GPT, “Voicebox can perform many different tasks through in-context learning but is more flexible as it can also condition on future context,” according to the researchers.

Meta is not making Voicebox generally available, citing concerns about potential misuse, but it is positioning the technology to help developers with future applications.

Like other large language models, Voicebox “has been trained on a very general task that can be used for many applications,” writes VentureBeat, noting a difference in that “while LLMs try to learn the statistical regularities of words and text sequences, Voicebox has been trained to learn the patterns that map voice audio samples to their transcripts.”

Such a model can then be adapted to many downstream tasks with minimal fine-tuning. “The goal is to build a single model that can perform many text-guided speech generation tasks through in-context learning,” Meta’s research staff explains.

The model was trained using a technique Meta calls “flow matching,” claimed to be “more efficient and generalizable than diffusion-based learning methods used in other generative models,” according to VentureBeat, which provides an explainer of the process of “text-guided speech infilling.”
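As a rough illustration of what a flow-matching training step involves, here is a minimal sketch in plain Python. It assumes the straight-line path between a noise sample and a data sample that is commonly used in the flow-matching literature; the article does not spell out Meta's exact objective, and the function names, the `sigma_min` default and the toy vectors are all hypothetical choices made for this example.

```python
def flow_matching_pair(x0, x1, t, sigma_min=1e-5):
    """Build one training example for flow matching (illustrative only).

    x0: noise sample, x1: data sample (equal-length lists of floats).
    t:  time in [0, 1]; t=0 is pure noise, t=1 is (almost) the data.
    Returns the interpolated point x_t and the velocity the model
    should predict at that point.
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    # Point on the straight line from noise to data at time t.
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]
    # Time derivative of x_t: the regression target for the model.
    target = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return x_t, target


def flow_matching_loss(predict, x0, x1, t, sigma_min=1e-5):
    """Mean squared error between predicted and target velocity.

    predict: any callable (x_t, t) -> list of floats, standing in
    for the neural network (an assumption of this sketch).
    """
    x_t, target = flow_matching_pair(x0, x1, t, sigma_min)
    pred = predict(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


if __name__ == "__main__":
    noise = [1.0, -2.0]   # toy stand-in for a noise sample
    data = [0.5, 0.25]    # toy stand-in for audio features
    zero_model = lambda x_t, t: [0.0] * len(x_t)
    print(flow_matching_loss(zero_model, noise, data, t=0.5))
```

In a real system the `predict` callable would be a neural network that also receives the text and the surrounding audio context, and the loss would be restricted to the masked frames being infilled; none of that is modeled here.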

Given a sample of someone’s speech and a passage of text in English, French, German, Spanish, Polish or Portuguese, the model can perform cross-lingual style transfer, reading the text aloud in that person’s voice even when the sample and the text are in different languages, Meta details in an introductory announcement.

Voicebox can create outputs from scratch as well as modify a sample it’s given, generating outputs across a wide variety of styles. But Voicebox has its limits, notes VentureBeat, explaining that the audiobook training data “does not transfer well to conversational speech that is casual and contains non-verbal sounds.”
