ElevenLabs, the highly valued AI voice cloning and generation startup from Palantir alumni, has introduced Scribe v1, a new speech-to-text model that reportedly achieves the highest accuracy across multiple languages. Launched today, the model is said to significantly outperform competitors including Google’s Gemini 2.0 Flash, OpenAI’s Whisper v3, and Deepgram Nova-3.
Scribe v1 is engineered for high-precision transcription and, according to results on the FLEURS and Common Voice benchmarks, posts the lowest word error rates (WER) recorded. In English the model reaches 96.7% accuracy, and it scores higher still in some other languages, with Italian at 98.7%. Scribe also detects non-verbal events such as laughter and background noise, which gives a transcript more context about what is happening in the audio.
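For readers unfamiliar with the metric, the following is a minimal Python sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. It assumes that the accuracy figures quoted above correspond to 1 minus WER, which the announcement does not spell out, and it uses a simple lowercase-and-split normalization that may differ from what the benchmarks apply. The function name `word_error_rate` is illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    # Assumption: lowercase + whitespace split is enough normalization
    # for illustration; real benchmarks often also strip punctuation.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

As a usage example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, giving a WER of 2/6, or roughly 33%. Under the assumption above, a 96.7% accuracy figure would correspond to a WER of about 3.3%.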
Flavio Schneider, ElevenLabs’ lead researcher, wrote on social media, "Scribe doesn’t just transcribe, it understands audio." He added, "It can detect non-verbal events (like laughter, sound effects, music and background noise) and analyze long audio contexts for accurate diarization, even in the most challenging environments." Diarization, the process of separating speakers based on their vocal qualities, is key for multi-speaker recordings, and Scribe can distinguish up to 32 different speakers within the same audio file.
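To make the idea of diarized output concrete, here is a small Python sketch that merges word-level speaker labels into speaker turns. The `Word` and `Turn` structures and the `speaker_0`-style labels are hypothetical and chosen for illustration; they are not the ElevenLabs API response format, which the announcement does not describe.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Word:
    # Hypothetical shape of one diarized token (not a real API schema).
    text: str
    start: float  # seconds from the start of the audio
    end: float
    speaker: str


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def to_turns(words: List[Word]) -> List[Turn]:
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    # groupby only groups adjacent items, so a speaker who talks again
    # later gets a new turn rather than being merged with the earlier one.
    for speaker, group in groupby(words, key=lambda w: w.speaker):
        chunk = list(group)
        turns.append(
            Turn(
                speaker=speaker,
                start=chunk[0].start,
                end=chunk[-1].end,
                text=" ".join(w.text for w in chunk),
            )
        )
    return turns
```

Given word-level labels in time order, `to_turns` yields a compact list of who spoke when, which is the form most meeting or interview transcripts are read in.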
Scribe is available now through the ElevenLabs website and API, priced at $0.40 per hour of audio input. To encourage early adoption, ElevenLabs is offering a 50% discount for the first six weeks after launch. The company is also working on a low-latency version, which would make Scribe suitable for real-time applications.
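As a rough guide to what those prices mean in practice, the Python sketch below estimates the cost of a transcription job from the figures quoted above. It assumes billing is pro-rated by the second and rounded to the nearest cent, neither of which the announcement confirms; the function and constant names are illustrative.

```python
from decimal import Decimal

LIST_PRICE_PER_HOUR = Decimal("0.40")  # USD per hour of audio, as quoted
LAUNCH_DISCOUNT = Decimal("0.50")      # 50% off during the launch window


def estimate_cost(audio_seconds: int, discounted: bool = False) -> Decimal:
    """Estimate the transcription cost in USD for a given audio duration."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = Decimal(audio_seconds) / Decimal(3600)
    rate = LIST_PRICE_PER_HOUR
    if discounted:
        rate = rate * (Decimal(1) - LAUNCH_DISCOUNT)
    # Assumption: pro-rated billing, rounded to whole cents.
    return (hours * rate).quantize(Decimal("0.01"))
```

Under these assumptions, ten hours of audio (`estimate_cost(36000)`) comes to $4.00 at list price and $2.00 with the launch discount applied.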
The launch coincides with rival Hume’s release of Octave, its own LLM-powered text-to-speech model. While Scribe focuses on transcription, Octave generates AI voices with customizable emotional delivery for creative projects such as audiobooks and podcasts. Although the two models serve different purposes, their arrival at the same time highlights the growing competition in AI-driven audio tools for communication and content creation.
Scribe’s high-accuracy transcription positions it as a useful tool for enterprises that need scalable documentation, meeting transcription, and content accessibility solutions. It is particularly relevant to multinational businesses, media companies, and customer support applications, where precise communication is key.
For professionals who rely on transcripts, Scribe promises greater efficiency and significantly better accuracy in digital communication and documentation. ElevenLabs will also host a virtual event next week on the development behind Scribe, where it plans to share more insights, benchmarks, and documentation for users.