{"id":117197,"date":"2025-10-12T10:33:29","date_gmt":"2025-10-12T10:33:29","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/117197\/"},"modified":"2025-10-12T10:33:29","modified_gmt":"2025-10-12T10:33:29","slug":"open-asr-leaderboard-tests-more-than-60-speech-recognition-models-for-accuracy-and-speed","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/117197\/","title":{"rendered":"Open ASR Leaderboard tests more than 60 speech recognition models for accuracy and speed"},"content":{"rendered":"<p>                                    <a class=\"article-menu__content__link\" href=\"#summary\"><br \/>\n                        <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/the-decoder.com\/resources\/icons\/summary.svg\" alt=\"summary\" width=\"27\" height=\"24\" data-no-lazy=\"1\"\/><br \/>\n                        Summary<br \/>\n                    <\/a><\/p>\n<p><strong>A research group from Hugging Face, Nvidia, the University of Cambridge, and Mistral AI has released the Open ASR Leaderboard, an evaluation platform for automatic speech recognition systems.<\/strong><\/p>\n<p>The leaderboard is meant to provide a clear comparison of open source and commercial models. According to the project&#8217;s <a target=\"_blank\" rel=\"noopener nofollow\" href=\"https:\/\/arxiv.org\/abs\/2405.13497\" data-type=\"editable-link\">study<\/a>, more than 60 models from 18 companies have been tested so far. The evaluation covers three main categories: English transcription, multilingual recognition (German, French, Italian, Spanish, and Portuguese), and long audio files over 30 seconds. The last category highlights how some systems perform differently on long versus short recordings.<\/p>\n<p>Two main benchmarks are used:<\/p>\n<ul>\n<li>Word Error Rate (WER) measures the number of incorrect words. Lower is better.<\/li>\n<li>Inverse Real-Time Factor (RTFx) measures speed. 
For example, an RTFx of 100 means one minute of audio is transcribed in 0.6 seconds.<\/li>\n<\/ul>\n<p>To keep comparisons fair, transcripts are normalized before scoring. The process removes punctuation and capitalization, spells out numbers, and drops filler words like &#8220;uh&#8221; and &#8220;mhm.&#8221; This matches the normalization standard used by OpenAI&#8217;s Whisper.<\/p>\n<p>Accuracy vs. speed<\/p>\n<p>The leaderboard shows clear differences between model types in English transcription. Systems built on large language models deliver the most accurate results. Nvidia&#8217;s Canary Qwen 2.5B leads with a WER of 5.63 percent.<\/p>\n<p><a href=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/10\/open-asr-leaderboard-word-error-rate.png\"><img data-lazyloaded=\"1\" fetchpriority=\"high\" decoding=\"async\" class=\"wp-image-27969 size-full\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/10\/open-asr-leaderboard-word-error-rate.png\" alt=\"Table showing results from the Open ASR leaderboard for English speech recognition. Shows model name, average error rate (WER), speed (RTFx), whether open source, technology used, and supported languages. NVIDIA Canary Qwen 2.5B leads with an error rate of 5.63%.\" width=\"1062\" height=\"496\"\/><\/a>\u00a0Top-performing speech recognition models for English transcription in the Open ASR Leaderboard. | Image: Srivastav et al.<\/p>\n<p>However, these accurate models are slower to process audio. 
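As a rough illustration of the two metrics and the normalization step described above (a minimal sketch, not the leaderboard's actual evaluation code, which relies on Whisper's normalizer and standard WER tooling; the filler-word list and example sentences here are made up):

```python
import re

def normalize(text):
    # Rough sketch of Whisper-style normalization: lowercase,
    # strip punctuation, drop common filler words.
    text = re.sub(r"[^\w\s]", "", text.lower())
    fillers = {"uh", "um", "mhm"}  # illustrative, not the full list
    return [w for w in text.split() if w not in fillers]

def wer(reference, hypothesis):
    # Word Error Rate: word-level edit distance (substitutions,
    # deletions, insertions) divided by the reference word count.
    ref, hyp = normalize(reference), normalize(hypothesis)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

def rtfx(audio_seconds, processing_seconds):
    # Inverse real-time factor: seconds of audio transcribed per
    # second of compute. Higher is better.
    return audio_seconds / processing_seconds

print(wer("Uh, the cat sat on the mat.", "the cat sat on a mat"))
print(rtfx(60.0, 0.6))  # 100.0
```

In this example the hypothesis swaps one of six reference words ("the" for "a"), giving a WER of about 16.7 percent, while a system that needs 0.6 seconds for a one-minute clip scores an RTFx of 100.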
Simpler systems, like Nvidia&#8217;s Parakeet CTC 1.1B, transcribe audio 2,728 times faster than real time, but only rank 23rd in accuracy.<\/p>\n<p>Multilingual models lose some specialization<\/p>\n<p>Tests across several languages show a trade-off between versatility and accuracy. Models narrowly trained on one language outperform broader multilingual models on that language, but struggle with the others. Whisper models trained only on English beat the multilingual Whisper Large v3 at English transcription, but can&#8217;t reliably transcribe other languages.<\/p>\n<p>In multilingual tests, Microsoft&#8217;s Phi-4 multimodal instruct leads in German and Italian. Nvidia&#8217;s Parakeet TDT v3 covers 25 languages, while v2 supports only English, yet the wider model performs worse on English than the specialized version.<\/p>\n<p><a href=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/10\/open-asr-leaderboard-average-multiple-languages.png\"><img loading=\"lazy\" data-lazyloaded=\"1\" decoding=\"async\" class=\"wp-image-27970 size-full\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2025\/10\/open-asr-leaderboard-average-multiple-languages.png\" alt=\"Microsoft Phi 4 Multimodal Instruct leads with scores between 3.59 and 5.15 percent, while Elevenlabs Scribe v1 shows significantly worse results.\" width=\"745\" height=\"252\"\/><\/a>Multilingual performance of selected speech recognition models in five European languages. | Image: Srivastav et al.<\/p>\n<p>Open source outperforms commercial models on short audio<\/p>\n<p>Open source models take the top spots for short audio. The highest-ranking commercial system, Aqua Voice Avalon, is sixth. 
Speed comparisons for paid services aren&#8217;t fully reliable, since upload times and other factors can distort results.<\/p>\n<p>For longer audio, commercial providers do better. 
Elevenlabs Scribe v1 (4.33 percent WER) and RevAI Fusion (5.04 percent) top the list, likely due to targeted optimization for long-form content and stronger infrastructure.<\/p>\n<p>The entire leaderboard and codebase are available on <a target=\"_blank\" rel=\"noopener nofollow\" href=\"https:\/\/github.com\/huggingface\/open_asr_leaderboard\" data-type=\"editable-link\">GitHub<\/a>. Developers can submit new models by providing scripts that run on the official test set. The datasets are hosted on the <a target=\"_blank\" rel=\"noopener nofollow\" href=\"https:\/\/huggingface.co\/spaces\/hf-audio\/open_asr_leaderboard\" data-type=\"editable-link\">Hugging Face Hub<\/a> and can be explored directly online.<\/p>\n<p>The team plans to add more languages, applications, and metrics in future updates, including new combinations of system components that haven&#8217;t been widely tested. As large language models become more common, the expectation is that even more speech recognition systems will adopt this technology.<\/p>\n","protected":false},"excerpt":{"rendered":"Summary A research group from Hugging Face, Nvidia, the University of Cambridge, and Mistral AI has released 
the&hellip;\n","protected":false},"author":2,"featured_media":117198,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[74],"tags":[71885,1735,18,19,17,82],"class_list":{"0":"post-117197","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-technology","8":"tag-ai-and-audio","9":"tag-audio","10":"tag-eire","11":"tag-ie","12":"tag-ireland","13":"tag-technology"},"share_on_mastodon":{"url":"","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/117197","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/comments?post=117197"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/117197\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media\/117198"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media?parent=117197"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/categories?post=117197"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/tags?post=117197"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}