OpenAI Pushes Conversational Agents with Three New Models
March 24, 2025
OpenAI has debuted three new models for transcription and voice generation — gpt-4o-transcribe, gpt-4o-mini-transcribe and gpt-4o-mini-tts. The text-to-speech and speech-to-text AI models are designed to help developers create AI agents with highly customizable voices. OpenAI claims these models will power natural and responsive voice agents, moving AI out of the text-based communications stage and into intuitive spoken conversations. The suite outperforms existing solutions in accuracy and reliability, OpenAI says, especially with “accents, noisy environments, and varying speech speeds,” making the models well-suited for customer call centers and meeting notes.
“These models will initially be available through the ChatGPT maker’s application programming interface (API) for third-party software developers to build their own apps,” writes VentureBeat, noting “they will also be available on a custom demo site, OpenAI.fm, that individual users can access for limited testing and fun.”
The new gpt-4o-mini-tts model has “better steerability,” allowing developers to “‘instruct’ the model not just on what to say but how to say it,” a first that will enable “more customized experiences for use cases ranging from customer service to creative storytelling,” according to OpenAI.
The gpt-4o-mini-tts model is available through the text-to-speech API. The text-to-speech models are limited to artificial, preset voices that perform in more than 100 languages.
The presets can be customized “via text prompt to change their accents, pitch, tone and other vocal qualities — including conveying whatever emotions the user asks them to,” reports VentureBeat, explaining that “using text alone on the demo site, a user could get the same voice to sound like a cackling mad scientist or a zen, calm yoga teacher.”
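The steering described above can be sketched in code. The following is a minimal sketch assuming the official OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the voice name, instruction text, and output filename are illustrative choices, not prescriptions from OpenAI.

```python
# Sketch: steering gpt-4o-mini-tts with an "instructions" prompt.
# The preset voice says the "input" text; "instructions" controls how
# it is said -- the steerability knob described in the article.
import os


def build_speech_request(text: str, style: str) -> dict:
    """Assemble keyword arguments for the text-to-speech endpoint."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": "coral",       # one of the preset voices (illustrative)
        "input": text,          # what to say
        "instructions": style,  # how to say it
    }


if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    request = build_speech_request(
        "Thanks for calling. How can I help you today?",
        "Speak like a calm, reassuring support agent.",
    )
    # Stream the generated audio straight to a file.
    with client.audio.speech.with_streaming_response.create(**request) as resp:
        resp.stream_to_file("greeting.mp3")
```

Swapping the `instructions` string is all it takes to move the same preset voice between, say, a “cackling mad scientist” and a “zen, calm yoga teacher,” as in the demo-site example above.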
The new models were built on GPT-4o, launched by OpenAI in May 2024. They succeed Whisper, the two-year-old open-source speech-to-text model. The transcription models beat Whisper in word error rate benchmarks across 33 languages, scoring a low 2.46 percent in English, the company says.
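For comparison with the Whisper workflow, a transcription call looks much the same. This is a minimal sketch assuming the official OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the filename and language hint are illustrative.

```python
# Sketch: transcribing an audio file with gpt-4o-transcribe, the
# successor to Whisper in OpenAI's speech-to-text API.
import os


def build_transcription_request(model: str = "gpt-4o-transcribe") -> dict:
    """Assemble keyword arguments for the speech-to-text endpoint."""
    return {
        "model": model,              # or "gpt-4o-mini-transcribe" for lower cost
        "language": "en",            # optional ISO-639-1 hint
        "response_format": "text",   # plain text rather than JSON
    }


if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    with open("meeting.wav", "rb") as audio:  # illustrative filename
        transcript = client.audio.transcriptions.create(
            file=audio, **build_transcription_request()
        )
    print(transcript)
```

Existing Whisper-based code typically only needs the model name changed, which is presumably part of the upgrade path OpenAI intends for developers.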
“For OpenAI, the models fit into its broader ‘agentic’ vision: building automated systems that can independently accomplish tasks on behalf of users,” writes TechCrunch. While the definition of “agent” varies, OpenAI Head of Product Olivier Godement “described one interpretation as a chatbot that can speak with a business’s customers.”
This month OpenAI released a new API and an agents SDK to advance agent creation. “So far, OpenAI’s agents have all been text-based,” but the company envisions them responding to “natural spoken language,” Inc. reports.