Microsoft AI Introduces MAI-Transcribe-1.5: 2.4% WER on Artificial Analysis, Best-in-Class FLEURS Accuracy, and Up to 5x Faster Long-Audio Transcription
Last week Microsoft AI introduced MAI-Transcribe-1.5. It is the second iteration of the company’s in-house speech-to-text family. The model targets accuracy across 43 languages, accents, and noisy environments. The Microsoft team positions it for production transcription workloads.
What is MAI-Transcribe-1.5
MAI-Transcribe-1.5 is an automatic speech recognition (ASR) model. It takes audio as input and returns text. Microsoft built it in-house, not on a third-party base. The model handles 43 languages with a single system. It is optimized for diverse accents, dialects, and real-world acoustic conditions.
Microsoft is integrating it into Copilot, Teams, GitHub, and Dynamics 365 Contact Center. It is also available in Foundry, Microsoft’s model platform.
The Accuracy Case
Accuracy here is measured by word error rate (WER). Lower WER means fewer errors per transcribed word. Microsoft reports best-in-class WER across 43 languages on FLEURS. FLEURS is a standard multilingual transcription benchmark.
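For readers who want to see what the metric computes, here is a minimal sketch of WER as word-level edit distance divided by the number of reference words. The function name `word_error_rate` is illustrative. Benchmark harnesses usually normalize casing and punctuation before scoring; this sketch does not, and the normalization used by FLEURS or Artificial Analysis is not described in this article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer:.3f}")  # 0.167
```

On this definition, a WER of 2.4% corresponds to roughly 2.4 word-level errors per 100 reference words.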
On the Artificial Analysis leaderboard, the model posts a WER of 2.4%. That places it third on a competitive open benchmark. So the picture is split: the Microsoft team claims first place on FLEURS but ranks third on Artificial Analysis.
The language expansion is the other accuracy story. Coverage grew from 25 languages to 43. The 18 new languages were added without compromising accuracy. Ten of them are South Asian, including Bengali, Tamil, and Telugu. Eight are European, such as Ukrainian, Greek, and Catalan.
Speed
MAI-Transcribe-1.5 leads on accuracy-times-speed on the Artificial Analysis leaderboard. It runs up to 5x faster than models of comparable accuracy. The effect is largest on long audio files. The model can transcribe an hour of audio in under 15 seconds.
Microsoft cites up to 5x speedups over Gemini 3.1, Scribe v2, and GPT-4o-Transcribe on long audio. Against the prior MAI-Transcribe-1, the Azure card lists up to 5.7x faster long-form inference. For batch pipelines processing large archives, that latency gap compounds quickly.
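To put the hour-in-15-seconds figure in perspective, the back-of-the-envelope sketch below converts it into a real-time factor and a rough batch estimate. The 1,000-hour archive is a hypothetical example, the helper names are illustrative, and the estimate assumes strictly sequential processing with no upload time, queueing, or rate limits. Because the article says "under 15 seconds", the results are a lower bound on speed and an upper bound on time.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(archive_hours: float,
                                 seconds_per_audio_hour: float) -> float:
    """Sequential processing time for an archive, ignoring all overheads."""
    return archive_hours * seconds_per_audio_hour


# Figure from the article: one hour (3600 s) of audio in under 15 s.
rtf = real_time_factor(3600, 15)
print(rtf)  # 240.0, i.e. at least 240x faster than real time

# Hypothetical 1,000-hour archive at the same rate.
total = estimated_processing_seconds(1000, 15)
print(total / 3600)  # hours of processing, a little over four
```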
Keyword (Entity) Biasing: The Feature Worth Understanding
Generic transcribers often fail on domain-specific terms. These include people’s names, product names, medical terms, and internal acronyms. Those terms usually matter most to enterprise users.
MAI-Transcribe-1.5 adds keyword biasing, also called entity biasing. You supply a list of domain-specific keywords. The Azure card supports up to 200 keywords. The model biases its predictions toward that list. Critically, it does not blindly force matches. It uses shared context to decide when biasing should apply. Microsoft reports a 30% WER reduction on FLEURS when biasing is used.
A short example shows the effect. Without biasing, names render as “Sean,” “Oif,” and “Societal.” With a supplied name list, the model recovers “Shaun,” “Aoife,” and “Xochitl.” This is relevant for meetings, healthcare, and call centers with niche vocabulary.
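Since the keyword list is capped at 200 entries, it can help to clean the list before sending it. The sketch below is a hypothetical client-side helper, not part of any Microsoft SDK: `prepare_keywords` and `MAX_KEYWORDS` are illustrative names, and case-insensitive de-duplication is a design choice made here rather than documented service behavior. How the list is actually passed to the service is not shown; the Foundry API documentation listed under Sources is the place to check.

```python
MAX_KEYWORDS = 200  # cap quoted from the Azure model card


def prepare_keywords(candidates: list[str],
                     limit: int = MAX_KEYWORDS) -> list[str]:
    """Trim, drop blanks, de-duplicate case-insensitively, enforce the cap."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for raw in candidates:
        term = " ".join(raw.split())  # strip ends, collapse inner whitespace
        key = term.casefold()
        if not term or key in seen:
            continue
        seen.add(key)
        cleaned.append(term)  # keep the first spelling encountered
    if len(cleaned) > limit:
        raise ValueError(
            f"{len(cleaned)} keywords exceed the limit of {limit}"
        )
    return cleaned


names = prepare_keywords([" Shaun ", "Aoife", "aoife", "", "Xochitl"])
print(names)  # ['Shaun', 'Aoife', 'Xochitl']
```

Raising an error instead of silently truncating keeps the caller in control of which terms are dropped when a vocabulary outgrows the cap.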
Use Cases
The Azure model card lists concrete production scenarios. Each maps to a common engineering workload:
- Video captions for media and content platforms.
- Accessibility tools that depend on accurate captions.
- Meeting transcription for Teams-style collaboration tools.
- Call analysis for contact centers and support analytics.
- Content creation workflows that need fast draft transcripts.
- Voice agents that convert speech to text before reasoning.
Automatic language identification helps when the input language is unknown. The model detects the spoken language with no manual setting.
MAI-Transcribe-1.5 vs MAI-Transcribe-1
The table below compares the two generations using stated facts only.
| Attribute | MAI-Transcribe-1 | MAI-Transcribe-1.5 |
|---|---|---|
| Languages covered | 25 | 43 |
| Keyword/entity biasing | Not listed | Up to 200 keywords |
| Long-form inference speed | Baseline | Up to 5.7x faster |
| Artificial Analysis WER | Not specified | 2.4% (ranked #3) |
| FLEURS position (per Microsoft) | State-of-the-art | Best-in-class across 43 languages |
| Automatic language identification | Not specified | Yes |
| Lifecycle | Prior release | Generally available (GA) |
| Input / Output | Audio / Text | Audio / Text |
Strengths and Limitations
Strengths:
- 43-language coverage from a single model, up from 25.
- Keyword/entity biasing yields up to 30% WER reduction on FLEURS.
- Sub-15-second transcription for an hour of audio.
- Generally available now through Azure AI Foundry.
- Robust on noisy, real-world audio, per Microsoft.
Limitations:
- No diarization yet, so speaker labels are unavailable.
- No native streaming API, so real-time use is limited.
- Several accuracy, speed, and cost claims are first-party.
- Ranked third on Artificial Analysis, behind two competitors.
Sources
- Introducing MAI-Transcribe-1.5 — Microsoft AI
- MAI-Transcribe-1.5 model card — Azure AI Foundry
- MAI-Transcribe-1.5 Foundry API documentation
- MAI-Transcribe-1.5 Cookbook
- MAI Playground
The post Microsoft AI Introduces MAI-Transcribe-1.5: 2.4% WER on Artificial Analysis, Best-in-Class FLEURS Accuracy, and Up to 5x Faster Long-Audio Transcription appeared first on MarkTechPost.
