Mistral AI launched the second iteration of its speech-to-text technology, Voxtral Transcribe 2, on February 4, 2026. The release pairs batch transcription at $0.003 per minute with a separate real-time model, positioning Mistral as a formidable competitor by significantly undercutting alternatives such as ElevenLabs' Scribe v2, which is priced five times higher.
The latest Voxtral Transcribe 2 includes two distinct models tailored for various applications. The Voxtral Mini Transcribe V2 is designed for batch processing and features speaker diarization along with word-level timestamps. In contrast, the Voxtral Realtime model focuses on live applications, boasting a latency that can be adjusted to below 200 milliseconds. This rapid response time is particularly beneficial for interactive voice agents, ensuring a seamless user experience.
Mistral AI's performance metrics show that the new models achieve a word error rate of approximately 4% on the FLEURS dataset, outperforming notable competitors like GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova. These claims indicate that Voxtral Realtime not only matches the quality of its rivals but also processes data three times faster than ElevenLabs' offering. For live subtitling, Voxtral Realtime maintains accuracy with a delay of just 2.4 seconds. Should the delay be reduced to 480 milliseconds, a slight increase in the word error rate is observed, which Mistral has deemed acceptable for conversational AI purposes.
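For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system output, divided by the number of reference words. The sketch below illustrates that standard definition in Python. It is not Mistral's evaluation code, the function name is hypothetical, and benchmark evaluations usually normalize casing and punctuation before scoring, which this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; no text normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic dynamic-programming row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a 25-word reference transcribed with a single wrong word scores 1/25, or 4%, which is roughly the error rate reported above.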
One of the standout features of the Voxtral Realtime model is its open weights, released under the Apache 2.0 license. This accessibility allows enterprises to deploy the technology on their own infrastructure without relying on external API calls. With a 4 billion parameter model capable of running on edge devices, this development is especially significant for sectors such as healthcare and finance, where sensitive audio data must remain within secure internal systems. Furthermore, both models are compliant with GDPR and HIPAA regulations, alleviating concerns that have previously hindered the adoption of AI in enterprise environments.
Since the first launch of Voxtral in July 2025, several enhancements have been made. The introduction of speaker diarization addresses a previously noted gap, providing accurate speaker identification with start and end times. The updated release also includes context biasing, allowing users to input up to 100 domain-specific terms to enhance the system's accuracy regarding proper nouns and specialized vocabulary. Language support has expanded to encompass 13 languages, including Chinese, Hindi, Arabic, Japanese, and Korean. Additionally, the maximum audio file length for processing has increased from 30-40 minutes to a substantial 3 hours per request.
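A team preparing batch jobs might check the two limits mentioned above (100 context-biasing terms and 3 hours of audio per request) before submitting anything. The following is a minimal sketch of such a pre-flight check. The constants come from the figures in this article; the function name and its behavior are hypothetical and are not part of any Mistral SDK.

```python
import math

# Limits as reported in the article; treat them as assumptions, not API constants.
MAX_CONTEXT_TERMS = 100
MAX_AUDIO_SECONDS = 3 * 60 * 60  # 3 hours per request


def check_batch_request(duration_seconds: float, context_terms: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the request fits.

    Hypothetical helper for illustration only.
    """
    problems = []
    if duration_seconds <= 0:
        problems.append("audio duration must be positive")
    elif duration_seconds > MAX_AUDIO_SECONDS:
        # Number of requests needed if the file were split into 3-hour pieces.
        parts = math.ceil(duration_seconds / MAX_AUDIO_SECONDS)
        problems.append(f"audio exceeds 3 hours; split into at least {parts} requests")
    if len(context_terms) > MAX_CONTEXT_TERMS:
        extra = len(context_terms) - MAX_CONTEXT_TERMS
        problems.append(f"{extra} context terms over the limit of {MAX_CONTEXT_TERMS}")
    return problems
```

For example, a 7-hour recording would need to be split into at least three requests under the stated 3-hour limit.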
The pricing strategy of Voxtral Transcribe 2 presents intriguing possibilities for contact centers and meeting intelligence platforms that have been incurring high costs for transcription services. Transcribing a million minutes of audio now costs only $3,000, a figure that makes many previously unfeasible use cases attainable. While Voxtral Realtime's streaming applications are priced at $0.006 per minute, this remains competitively low compared to other offerings, facilitating real-time sentiment analysis and live assistance for agents.
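The arithmetic behind these figures is simple enough to check directly. The sketch below uses the per-minute rates quoted in this article and Python's decimal module to avoid floating-point rounding on currency values. The constant and function names are hypothetical, and actual billing may differ in rounding or minimum charges.

```python
from decimal import Decimal

# Per-minute rates as quoted in the article (USD).
BATCH_USD_PER_MINUTE = Decimal("0.003")
REALTIME_USD_PER_MINUTE = Decimal("0.006")


def transcription_cost(minutes: int, usd_per_minute: Decimal) -> Decimal:
    """Cost in USD for a given number of audio minutes (illustrative only)."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return Decimal(minutes) * usd_per_minute


if __name__ == "__main__":
    one_million = 1_000_000
    print("batch:", transcription_cost(one_million, BATCH_USD_PER_MINUTE))
    print("realtime:", transcription_cost(one_million, REALTIME_USD_PER_MINUTE))
```

At these rates, a million minutes comes to $3,000 for batch transcription and twice that for streaming.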
Developers interested in exploring these new capabilities can immediately access both models via Mistral Studio's audio playground or download the Realtime weights directly from Hugging Face.