Microsoft’s VASA-1 Can Generate Talking Faces in Real Time

Microsoft has developed VASA, a framework for generating lifelike virtual characters with vocal capabilities including speaking and singing. The premiere model, VASA-1, can perform the feat in real time from a single static image and a vocalization clip. The research demo showcases realistic audio-enhanced faces that can be fine-tuned to look in different directions or change expression in video clips of up to one minute at 512 x 512 pixels and up to 40fps “with negligible starting latency,” according to Microsoft, which says “it paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.”

Audio-First Social Platform Airchat Has Successful Relaunch

Airchat is the latest app to take tech leaders in Silicon Valley by storm. Described as a “combination of voice notes and Twitter,” Airchat lets you follow other users and scroll through posts (adding replies, likes and shares), but the twist is that the content is generated through audio recordings that the app then transcribes. Airchat ranked 27th on the App Store’s social networking chart, even though users must be invited to join. Launched last year by Naval Ravikant, founder of AngelList, and erstwhile Tinder product exec Brian Norgard, Airchat was just relaunched on iOS and Android.

Deepgram’s Speech Portfolio Now Includes Human-Like Aura

Deepgram’s new Aura software turns text into generative audio with a “human-like voice.” The 9-year-old voice recognition company has raised nearly $86 million to date on the strength of its Voice AI platform. Aura is an extremely low-latency text-to-speech voice AI that can be used for voice AI agents, the company says. Paired with Deepgram’s Nova-2 speech-to-text API, developers can use it to “easily (and quickly) exchange real-time information between humans and LLMs to build responsive, high-throughput AI agents and conversational AI applications,” according to Deepgram.

ElevenLabs Promotes Its Latest Advances in AI Audio Effects

“What if you could describe a sound and generate it with AI?” asks startup ElevenLabs, which set out to do just that, and says it has succeeded. The two-year-old company explains it “used text prompts like ‘waves crashing,’ ‘metal clanging,’ ‘birds chirping,’ and ‘racing car engine’ to generate audio.” Best known for using machine learning to clone voices, the AI firm founded by Google and Palantir alums has yet to make its new text-to-sound model publicly available but began teasing it by releasing online demos this week. Some see the technology as a natural complement to the latest wave of image generators.

CES: Voiseed Upgrades Its Platform for Expressive AI Voices

Milan-based Voiseed demonstrated its web-based Revoiceit platform at CES, pitched as the best way to manage synthetic voice actors, particularly ensuring that synthetic voices present realistic emotions. The company describes it as a cloud-based solution that uses “generative AI to infuse virtual voices with human emotions and prosody, creating highly expressive, lifelike audio experiences.” While Revoiceit’s most obvious feature is its Studio (imagine Adobe Audition devoted to second-by-second management of voices), it may well be the product’s forthcoming API that provides real value to developers of entertaining technology products.

Meta AI Seamless Translator Converts Nearly 100 Languages

The research division of Meta AI has developed Seamless Communication, a suite of artificial intelligence models that generate what the company says is natural and authentic communication across languages, facilitating what amounts to real-time universal speech translation. The models were released with accompanying research papers and data. The flagship model, Seamless, merges capabilities from a trio of models (SeamlessExpressive, SeamlessStreaming and SeamlessM4T v2) into a single system that can translate between almost 100 spoken and written languages, preserving idioms, emotion and the speaker’s vocal style, Meta says.

Adobe Reveals Its New AI Tool for Editing Problematic Audio

Adobe has unveiled Project Sound Lift, an AI-powered technology that separates speech recordings into discrete tracks of voices, non-speech sounds and other background noise in video. The company describes Project Sound Lift as “a one-click solution” that leverages AI to help users easily manipulate audio recordings “across a range of scenarios” to “enhance, transform, and control speech and sound independently.” Adobe’s existing Enhance Speech technology, available in the company’s Premiere Pro editing program, has been integrated within Project Sound Lift to aid creators in producing studio-quality audio content.

Meta’s WhatsApp Launches Voice Chat for Up to 128 People

Meta Platforms-owned instant messaging and VoIP service WhatsApp has updated its Voice Chat feature for mobile so it can now host group calls of up to 128 participants. Voice chats allow WhatsApp users to instantly talk live with members of a group chat while still being able to message within the group. The new feature, which is being compared to a Discord server, is being rolled out globally. The idea is to make Voice Chat less disruptive than group calling, which rings all group members. Voice chats can be quietly started with an in-chat bubble that users tap to join. The updated version will have end-to-end encryption by default.

Music Industry Considers Impact of AI as New Tools Emerge

Alphabet is developing an AI tool that would let creators generate music in the voice of famous recording artists. Lyor Cohen, global head of music for Google and its YouTube subsidiary, has reportedly been in discussions with music labels for several months about obtaining the rights to use songs by major artists to train an AI model in this manner. The discussions continue, but not without raising concerns in the music business. Meanwhile, other AI tools are already generating new content, but not without facing some resistance. The use of artificial intelligence to generate creative works in the style of others is being hashed out in the courts.

ChatGPT Goes Multimodal: OpenAI Adds Vision, Voice Ability

OpenAI began previewing vision capabilities for GPT-4 in March, and the company is now starting to roll out the image input and output to users of its popular ChatGPT. The multimodal expansion also includes audio functionality, with OpenAI proclaiming late last month that “ChatGPT can now see, hear and speak.” The upgrade vaults GPT-4 into the multimodal category with what OpenAI is apparently calling GPT-4V (for “Vision,” though equally applicable to “Voice”). “We’re rolling out voice and images in ChatGPT to Plus and Enterprise users,” OpenAI announced.

OpenAI’s ChatGPT Upgraded with ‘Talk’ Tech, Image Search

OpenAI is experimenting with new voice and image capabilities in ChatGPT. According to the company, users can now “speak with ChatGPT and have it talk back,” thanks to an intuitive new interface that, in addition to facilitating voice conversations, will allow users to show ChatGPT an image to discuss. “Snap a picture of a landmark while traveling and have a live conversation about what’s interesting about it,” OpenAI explains, alternatively suggesting you “snap pictures of your fridge and pantry to figure out what’s for dinner” or have it help with homework based on pictures of a math problem.

Google’s MusicLM AI Can Generate Tunes from Text Prompts

Google is introducing a new artificial intelligence app called MusicLM that creates music in any style or genre based on text prompts and can translate a whistled melody or casually hummed snippet into instrument sounds. TechCrunch calls the technology “impressive” but says the Alphabet company, “fearing the risks, has no immediate plans to release it,” in recognition of the controversy surrounding AI models trained using copyrighted material. MusicLM was created using a dataset of 280,000 hours of music, resulting in the ability to generate minutes-long songs of “significant complexity.”

CES: Startup Leverages AI to Address Problematic Acoustics

A growing number of companies are working on technologies that strive to make a person’s voice more intelligible to the listener over speakers, headphones, hearing aids and other consumer audio devices. Augmented Hearing, a Danish startup launched two years ago, is one of the more interesting companies at CES 2023 focusing on this space. The firm’s software-based solution runs on iOS, Windows and other CE operating systems. Its solution could mitigate the current trend of people across all age groups turning on closed captioning because they often find video dialogue difficult to understand.

WhatsApp Debuts Communities with End-to-End Encryption

Meta Platforms is globally releasing a major update for WhatsApp called Communities, which doubles the number of group chat members to 1,024, and adds video (and voice) for up to 32. Designed for schools, clubs, churches, the workplace and other organizations, Communities features include support for sub-groups, admin controls and in-chat polls. “We’re aiming to raise the bar for how organizations communicate with a level of privacy and security not found anywhere else,” the company said of the upgrade, stressing end-to-end encryption. In fact, Communities are not publicly discoverable, requiring an invitation.

Meta Says Its AI-Compressed Audio Codec Beats MP3 by 10x

Meta Platforms says its vision for the metaverse will rely heavily on compression technology “to deliver high-quality, uninterrupted experiences for everyone.” With that in mind, it’s trained its Fundamental AI Research (FAIR) lab on developing “hypercompression” solutions. First up is EnCodec, an audio technology it says compresses at 64 kbps, with no loss in quality, and at 10 times the efficiency of MP3. The EnCodec protocol has the potential to greatly improve the sound and reliability of speech over low-bandwidth connections (such as when your mobile phone is only getting one bar). It also works for music.