Microsoft has upgraded Azure AI Speech so that users can rapidly generate a voice replica with just a few seconds of sampled speech.
The personal voice feature for AI Speech became generally available on May 21, 2024. It was impressive but required some training to get the best out of it. According to Microsoft, the feature has been upgraded to a new zero-shot text-to-speech model named “DragonV2.1Neural” with “more natural-sounding and expressive voices.” It will also generate audio in any of the more than 100 supported languages.
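For developers, personal voice is driven through the Azure Speech service's SSML interface: you reference the zero-shot model as the voice and attach a speaker profile created from the sampled speech. The sketch below is a minimal, hedged illustration of how that SSML might be assembled — the `mstts:ttsembedding` element and profile-ID plumbing follow Azure's documented personal-voice pattern, but the exact markup, and the placeholder profile ID, should be checked against Microsoft's current docs before use.

```python
def build_personal_voice_ssml(profile_id: str, text: str, lang: str = "en-US") -> str:
    """Assemble SSML that selects the DragonV2.1Neural zero-shot model
    and binds it to a previously enrolled speaker profile (assumed pattern)."""
    return (
        f"<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        f"xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='{lang}'>"
        f"<voice name='DragonV2.1Neural'>"
        f"<mstts:ttsembedding speakerProfileId='{profile_id}'>"
        f"{text}"
        f"</mstts:ttsembedding></voice></speak>"
    )

# Actually synthesizing audio requires the Azure Speech SDK and real
# credentials, roughly along these lines (not runnable as-is):
#
#   import azure.cognitiveservices.speech as speechsdk
#   config = speechsdk.SpeechConfig(subscription="YOUR-KEY", region="eastus")
#   synthesizer = speechsdk.SpeechSynthesizer(speech_config=config)
#   ssml = build_personal_voice_ssml("your-profile-guid", "Hello there")
#   synthesizer.speak_ssml_async(ssml).get()
```

The speaker profile itself is created in a separate enrollment step from the few seconds of consented sample audio; the GUID-style `profile_id` here is purely a placeholder.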
Microsoft said the upgrade, compared to the previous model, “brings improvements to the naturalness of speech, offering more realistic and stable prosody while maintaining better pronunciation accuracy.”
The system, which was already pretty good, is now even more worryingly accurate. “This capability unlocks a wide range of applications, from customizing chatbot voices to dubbing video content in an actor’s original voice across multiple languages, enabling truly immersive and individualized audio experiences,” Microsoft said.
It could also be a boon for people with malicious or deceptive intent, and audio deepfakes produced with the service may become ever harder to spot.
But not to fear: in addition to watermarks that make the generated audio easier to identify (though not by human ears), Microsoft insists that "all customers must agree to our usage policies, which include requiring explicit consent from the original speaker, disclosing the synthetic nature of the content created, and prohibiting impersonation of any person or deceiving people using the personal voice service."
So that’s all right then.
Microsoft is not the first to offer a service capable of cloning a user’s voice with only a few seconds of audio. Earlier this year, Palo Alto-based AI startup Zyphra unveiled a pair of open text-to-speech models claimed to require just a few seconds of sample audio. In our testing, we found that approximately 30 seconds of sample speech was needed to create something that was eerily accurate.
AI voice cloning has become a serious problem in recent years, as the technology has outpaced safeguards. In March, Consumer Reports called out four companies offering AI voice cloning software for failing to provide meaningful safeguards, while the FBI warned that scammers were using deepfaked voices of senior US government officials as part of a major fraud campaign. ®