OpenAI has released what it calls its “most advanced speech-to-speech model yet.”
Dubbed gpt-realtime, the model is better at following complex instructions, calling tools with precision, and producing speech that sounds more natural and expressive, the company said in a Thursday (Aug. 28) blog post.
“We trained the model in close collaboration with customers to excel at real-world tasks like customer support, personal assistance and education—aligning the model to how developers build and deploy voice agents,” OpenAI said in the post.
OpenAI also said in the post that it has made the Realtime API (application programming interface) generally available after introducing it in public beta in October 2024 and seeing thousands of developers build with it.
The API now has new features that help developers build voice agents. These include support for remote Model Context Protocol (MCP) servers, image inputs and phone calling through Session Initiation Protocol (SIP), according to the post.
The company said these features make voice agents “more capable through access to additional tools and context.”
“Unlike traditional pipelines that chain together multiple models across speech-to-text and text-to-speech, the Realtime API processes and generates audio directly through a single model and API,” the post said. “This reduces latency, preserves nuance in speech and produces more natural, expressive responses.”
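The contrast OpenAI describes can be illustrated with a toy sketch. This is not OpenAI's API or implementation: every name below (`Audio`, `speech_to_text`, `text_model`, `text_to_speech`, `chained_pipeline`, `speech_to_speech`) is hypothetical, and the "tone" field is a simplified stand-in for the nuance carried in real speech. The sketch only shows why a transcript step can drop information that a single audio-in, audio-out model still receives, and why the chained approach involves more model hops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Audio:
    """Toy stand-in for an audio clip: the words plus how they were said."""
    words: str
    tone: str  # e.g. "neutral" or "frustrated" (hypothetical labels)


def speech_to_text(audio: Audio) -> str:
    # A transcript keeps the words but drops the tone.
    return audio.words


def text_model(transcript: str) -> str:
    # Sees only text, so it cannot react to how the words were spoken.
    return f"Reply to: {transcript}"


def text_to_speech(text: str) -> Audio:
    # The output tone is chosen without knowing the caller's tone.
    return Audio(words=text, tone="neutral")


def chained_pipeline(audio: Audio) -> tuple[Audio, int]:
    """Three model hops: speech-to-text, text model, text-to-speech."""
    reply = text_to_speech(text_model(speech_to_text(audio)))
    return reply, 3


def speech_to_speech(audio: Audio) -> tuple[Audio, int]:
    """One hop: the toy model receives the audio itself, tone included."""
    tone = "empathetic" if audio.tone == "frustrated" else "neutral"
    return Audio(words=f"Reply to: {audio.words}", tone=tone), 1


caller = Audio(words="My order never arrived", tone="frustrated")
chained_reply, chained_hops = chained_pipeline(caller)
direct_reply, direct_hops = speech_to_speech(caller)

print(chained_hops, chained_reply.tone)
print(direct_hops, direct_reply.tone)
```

In this toy setup the chained pipeline answers in a neutral tone because the caller's frustration never reaches the text model, while the single-hop function can respond to it. Real latency and quality differences depend on the actual models and network conditions, which this sketch does not measure.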
Both the Realtime API and gpt-realtime were made available to all developers starting Thursday, per the post.
OpenAI introduced the Realtime API in October, saying the tool enables developers to build low-latency, multimodal experiences in their apps.
PYMNTS reported at the time that the Realtime API was among the product announcements that showed the company was doubling down on making artificial intelligence more accessible and developer-friendly.
“It’s clear they’re focusing on empowering developers to build innovative applications rather than just competing in the consumer space,” aiRESULTS CEO Matt Hasan told PYMNTS at the time.
Venture capital firm Andreessen Horowitz said in June that voice-based AI agents are advancing to such a degree that they are now outperforming call centers.
“Voice is one of the most powerful unlocks for AI application companies,” Olivia Moore, a partner at Andreessen Horowitz, wrote at the time in a blog post. “It is the most frequent and information-dense form of communication, made programmable for the first time due to AI.”
