Amazon Claims ‘Emergent Abilities’ for Text-to-Speech Model

Researchers at Amazon have trained what they are calling the largest text-to-speech model ever created, and they claim it exhibits “emergent” qualities: the ability to speak textually complex sentences with natural prosody, a skill it was not explicitly trained for. Called BASE TTS, for Big Adaptive Streamable TTS with Emergent abilities, the new model could pave the way for more human-like interactions with AI, reports suggest. Trained on 100,000 hours of public domain speech data, BASE TTS offers “state-of-the-art naturalness” in English as well as some German, Dutch and Spanish. Text-to-speech models are used to build voice assistants for smart devices and apps, and to power accessibility tools.

Regarding the “emergent” aspect, TechCrunch writes that there is a certain “leap in ability” that is sometimes observed in language models that surpass a certain size, noting that “for reasons unknown to us, once LLMs grow past a certain point, they start being way more robust and versatile, able to perform tasks they weren’t trained to.”

Emergent abilities refer to novel behaviors or skills that emerge in advanced AI systems — particularly LLMs — without being specifically trained, arising seemingly spontaneously. “That is not to say they are gaining sentience or anything, just that past a certain point their performance on certain conversational AI tasks hockey sticks,” TechCrunch explains, adding that the team at Amazon AGI “suggests this is in fact the case” with BASE TTS.

To process the 100,000 hours of public domain audio sourced from the public web, “researchers split the audio into small files that each included no more than 40 seconds of speech,” SiliconANGLE reports.
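The segmentation step described above can be pictured with a short sketch. This is not Amazon’s pipeline; the sample rate, the fixed-window cutting, and the function name are all assumptions made for illustration (a production pipeline would more likely cut at silences rather than at arbitrary sample boundaries):

```python
# Illustrative sketch (not Amazon's preprocessing code): splitting a long
# recording into clips of at most 40 seconds of speech each.
# `samples` is a flat list of audio samples; `sample_rate` is assumed.

def split_audio(samples, sample_rate=16_000, max_seconds=40):
    """Cut a recording into consecutive clips no longer than max_seconds.

    Uses naive fixed windows; a real pipeline would prefer to cut at
    pauses so words are not chopped in half.
    """
    max_len = sample_rate * max_seconds
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]
```

For example, a 100-second recording at 16 kHz would come out as two full 40-second clips plus one 20-second remainder.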

“Echoing the widely-reported ‘emergent abilities’ of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences,” Amazon researchers wrote in a scientific paper about BASE TTS.

In a web post offering audio samples from the new model, Amazon explains “it deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes (‘speechcodes’) followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner.”

BASE TTS “comprises two separate AI models,” SiliconANGLE explains: “the first turns text entered by the user into abstract mathematical representations dubbed speechcodes. The second neural network, in turn, transforms those mathematical representations into audio.”
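The two-stage structure described above can be sketched with toy stand-ins. This is purely illustrative and is not Amazon’s code: the character-to-code mapping, the chunk size, and every function name here are invented; the real stages are a billion-parameter autoregressive Transformer and a convolutional decoder. What the sketch does preserve is the shape of the pipeline, text into discrete speechcodes, then speechcodes decoded into waveform samples chunk by chunk so audio can stream before synthesis finishes:

```python
# Illustrative sketch (not the BASE TTS implementation): the two-stage
# text -> speechcodes -> waveform pipeline, with streamable decoding.
from typing import Iterator, List

def text_to_speechcodes(text: str) -> List[int]:
    """Stage 1 stand-in: the real model emits discrete codes
    autoregressively; here we just map each character to a code
    drawn from a small vocabulary."""
    vocab_size = 256
    return [ord(ch) % vocab_size for ch in text]

def speechcodes_to_waveform(codes: List[int], chunk: int = 4) -> Iterator[List[float]]:
    """Stage 2 stand-in: the real decoder is convolutional; decoding
    in fixed-size chunks mimics the incremental, streamable behavior
    Amazon describes."""
    for start in range(0, len(codes), chunk):
        block = codes[start:start + chunk]
        # Pretend each code maps to one normalized audio sample.
        yield [c / 255.0 for c in block]

def synthesize(text: str) -> List[float]:
    """Run both stages, consuming the stream chunk by chunk."""
    samples: List[float] = []
    for chunk_samples in speechcodes_to_waveform(text_to_speechcodes(text)):
        samples.extend(chunk_samples)  # a real system would play these as they arrive
    return samples
```

The design point the sketch illustrates is the decoupling: because stage two consumes a stream of discrete codes rather than the whole utterance, playback can begin while the first stage is still generating.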
