xAI Launches Standalone Grok Speech-to-Text and Text-to-Speech APIs, Targeting Enterprise Voice Developers
Elon Musk’s AI company xAI has launched two standalone audio APIs, a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API, both built on the same infrastructure that powers Grok Voice in mobile apps, Tesla vehicles, and Starlink customer support. The launch moves xAI squarely into the competitive speech API market currently occupied by ElevenLabs, Deepgram, and AssemblyAI.
What Is the Grok Speech-to-Text API?
Speech-to-Text is the technology that converts spoken audio into written text. For developers building meeting transcription tools, voice agents, call center analytics, or accessibility features, an STT API is a core building block. Rather than developing this from scratch, developers call an endpoint, send audio, and receive a structured transcript in return.
The Grok STT API is now generally available, offering transcription across 25 languages with both batch and streaming modes. Batch mode is designed for processing pre-recorded audio files, while streaming enables real-time transcription as audio is captured. Pricing is kept simple: Speech-to-Text costs $0.10 per hour of audio for batch and $0.20 per hour for streaming.
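To make the pricing concrete, here is a minimal sketch of a cost estimator built from the two per-hour rates quoted above. The function name and the assumption that billing is prorated linearly by audio duration are illustrative; the article does not describe xAI's billing granularity or rounding.

```python
# Per-hour prices quoted in the article (USD).
STT_PRICE_PER_HOUR = {"batch": 0.10, "streaming": 0.20}


def estimate_stt_cost(audio_seconds: float, mode: str = "batch") -> float:
    """Estimate the USD cost of transcribing `audio_seconds` of audio.

    Assumes linear proration by duration, which is an assumption,
    not documented billing behavior.
    """
    if mode not in STT_PRICE_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = audio_seconds / 3600
    return hours * STT_PRICE_PER_HOUR[mode]


if __name__ == "__main__":
    # Example workload: 1,000 hours of recorded calls.
    seconds = 1_000 * 3600
    print(f"batch:     ${estimate_stt_cost(seconds, 'batch'):.2f}")
    print(f"streaming: ${estimate_stt_cost(seconds, 'streaming'):.2f}")
```

At these rates, streaming costs exactly twice as much as batch for the same audio, so a pipeline that does not need live results may prefer batch mode.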
The API includes word-level timestamps, speaker diarization, and multichannel support, along with intelligent Inverse Text Normalization that correctly handles numbers, dates, currencies, and more. It also accepts 12 audio formats: 9 container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) and 3 raw formats (PCM, µ-law, A-law), with a maximum file size of 500 MB per request.
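A client can check the format and size limits locally before uploading. The sketch below is one way to do that. The mapping from format names to file extensions, the `check_upload` helper, and the reading of "500 MB" as 500 × 10^6 bytes are all assumptions, not documented API behavior. Raw formats (PCM, µ-law, A-law) have no standard extension and are left out of the check.

```python
from pathlib import Path

# Container formats listed in the article; the extension mapping is an assumption.
CONTAINER_EXTENSIONS = {
    ".wav", ".mp3", ".ogg", ".opus", ".flac", ".aac", ".mp4", ".m4a", ".mkv",
}
# Assumes 1 MB = 10^6 bytes; the article does not specify the unit.
MAX_BYTES = 500 * 1000 * 1000


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the local checks passed."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in CONTAINER_EXTENSIONS:
        problems.append(f"extension {ext or '(none)'} is not a listed container format")
    if size_bytes > MAX_BYTES:
        problems.append(f"{size_bytes} bytes exceeds the {MAX_BYTES}-byte limit")
    return problems


if __name__ == "__main__":
    print(check_upload("standup.m4a", 48_000_000))
    print(check_upload("notes.txt", 600_000_000))
```

Passing this check does not guarantee the service will accept a file, because an extension says nothing about the actual encoding inside it.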
Speaker diarization is the process of separating audio by individual speakers, answering the question ‘who said what.’ This is essential for multi-speaker recordings like meetings, interviews, or customer calls. Word-level timestamps assign precise start and end times to each word in the transcript, enabling use cases like subtitle generation, searchable recordings, and legal documentation. Inverse Text Normalization converts spoken forms like ‘one hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents’ into readable structured output: “$167,983.15”.
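To show how diarization and word-level timestamps combine, the sketch below merges consecutive words from the same speaker into turns, each with its own time range. The `Word` structure and its field names are hypothetical. The article does not show the API's actual response schema, so a real integration would first map the response into something like this.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical shape of one transcribed word; not the API's real schema.
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    speaker: str  # diarization label, e.g. "S1"


def speaker_turns(words: list[Word]) -> list[tuple[str, float, float, str]]:
    """Merge consecutive same-speaker words into (speaker, start, end, text) turns."""
    turns: list[tuple[str, float, float, str]] = []
    for w in words:
        if turns and turns[-1][0] == w.speaker:
            # Same speaker as the previous word: extend the current turn.
            speaker, start, _, text = turns[-1]
            turns[-1] = (speaker, start, w.end, f"{text} {w.text}")
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append((w.speaker, w.start, w.end, w.text))
    return turns


if __name__ == "__main__":
    words = [
        Word("Hi", 0.0, 0.3, "S1"),
        Word("there", 0.3, 0.6, "S1"),
        Word("Hello", 0.9, 1.3, "S2"),
    ]
    for speaker, start, end, text in speaker_turns(words):
        print(f"[{start:.1f}s to {end:.1f}s] {speaker}: {text}")
```

The same per-turn time ranges could feed subtitle generation or a searchable index of a recording.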
Benchmark Performance
xAI's research team is making strong claims on accuracy. On phone call entity recognition (names, account numbers, dates), Grok STT claims a 5.0% error rate versus ElevenLabs at 12.0%, Deepgram at 13.5%, and AssemblyAI at 21.3%. That is a substantial margin if it holds in production. For video and podcast transcription, Grok and ElevenLabs tied at a 2.4% error rate, with Deepgram and AssemblyAI trailing at 3.0% and 3.2% respectively. xAI's team also reports a 6.9% word error rate on general audio benchmarks.
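For readers who want to check such figures on their own audio, word error rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below implements that standard definition. It is not xAI's benchmark methodology, which the article does not describe, and it applies no text normalization (casing, punctuation) before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.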


What Is the Grok Text-to-Speech API?
Text-to-Speech converts written text into spoken audio. Developers use TTS APIs to power voice assistants, read-aloud features, podcast generation, IVR (interactive voice response) systems, and accessibility tools.
The Grok TTS API delivers fast, natural speech synthesis with detailed control via speech tags, and is priced at $4.20 per 1 million characters. The API accepts up to 15,000 characters per REST request; for longer content, a WebSocket streaming endpoint is available that has no text length limit and begins returning audio before the full input is processed. The API supports 20 languages and 5 distinct voices: Ara, Eve, Leo, Rex, and Sal, with Eve set as the default.
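If a client stays on the REST endpoint instead of the WebSocket stream, it has to split longer text itself. The sketch below is one possible approach, using the 15,000-character limit and the $4.20 per million characters price quoted above. The helper names are illustrative, and two details are assumptions: that the service counts characters the same way Python's `len` does, and that speech tags count toward the billed total.

```python
REST_CHAR_LIMIT = 15_000        # per-request limit quoted in the article
PRICE_PER_MILLION_CHARS = 4.20  # USD, quoted in the article


def split_for_rest(text: str, limit: int = REST_CHAR_LIMIT) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring spaces."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Last space at or before the limit, so the chunk fits in one request.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: fall back to a hard split
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


def estimate_tts_cost(num_chars: int) -> float:
    """Estimate the USD cost of synthesizing `num_chars` characters."""
    return num_chars / 1_000_000 * PRICE_PER_MILLION_CHARS


if __name__ == "__main__":
    article = "word " * 10_000  # about 50,000 characters of filler text
    chunks = split_for_rest(article)
    total_chars = sum(len(c) for c in chunks)
    print(f"{len(chunks)} requests, about ${estimate_tts_cost(total_chars):.4f}")
```

Splitting at arbitrary spaces can leave an audible seam mid-sentence, so a production version would more likely cut at sentence or paragraph boundaries.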
Beyond voice selection, developers can inject inline and wrapping speech tags to control delivery. These include inline tags like [laugh], [sigh], and [breath], and wrapping tags like <whisper>text</whisper> and <emphasis>text</emphasis>, letting developers create engaging, lifelike delivery without complex markup. This expressiveness addresses one of the core limitations of traditional TTS systems, which often produce technically correct but emotionally flat output.
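Because the tags are plain text inside the input string, small helpers can keep them well-formed. The sketch below only builds strings and makes no API call. The `inline` and `wrap` helpers are illustrative, and the tag sets contain only the tags named in this article, which may not be the complete list the API supports.

```python
# Only the tags named in the article; the full supported set may be larger.
INLINE_TAGS = {"laugh", "sigh", "breath"}
WRAPPING_TAGS = {"whisper", "emphasis"}


def inline(tag: str) -> str:
    """Return an inline tag such as [laugh]."""
    if tag not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {tag!r}")
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Wrap text in a paired tag such as <whisper>...</whisper>."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag!r}")
    return f"<{tag}>{text}</{tag}>"


if __name__ == "__main__":
    script = " ".join([
        "You actually fixed it?",
        inline("laugh"),
        wrap("whisper", "Don't tell the on-call team."),
        wrap("emphasis", "Ship it."),
    ])
    print(script)
```

Validating tag names on the client catches typos before they reach the synthesizer, where an unrecognized tag might otherwise be read aloud as literal text.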
Key Takeaways
- xAI has launched two standalone audio APIs, Grok Speech-to-Text (STT) and Text-to-Speech (TTS), built on the same production stack already serving millions of users across Grok mobile apps, Tesla vehicles, and Starlink customer support.
- The Grok STT API offers real-time and batch transcription across 25 languages with speaker diarization, word-level timestamps, Inverse Text Normalization, and support for 12 audio formats, priced at $0.10/hour for batch and $0.20/hour for streaming.
- On phone call entity recognition benchmarks, Grok STT reports a 5.0% error rate, significantly outperforming ElevenLabs (12.0%), Deepgram (13.5%), and AssemblyAI (21.3%), with notably strong performance in medical, legal, and financial use cases.
- The Grok TTS API supports 5 expressive voices (Ara, Eve, Leo, Rex, Sal) across 20 languages, with inline and wrapping speech tags like [laugh], [sigh], and <whisper> giving developers fine-grained control over vocal delivery, priced at $4.20 per 1 million characters.
The post xAI Launches Standalone Grok Speech-to-Text and Text-to-Speech APIs, Targeting Enterprise Voice Developers appeared first on MarkTechPost.
