Meet OmniVoice Studio: A Local, Open-Source Alternative to ElevenLabs
ElevenLabs costs between $5 and $330 per month for voice AI services, and every audio file you process goes through its cloud servers. For those looking for an open-source alternative to ElevenLabs, OmniVoice Studio is a good fit: an open-source desktop application that runs the same categories of tasks locally. It is an interesting individual project that handles voice cloning, video dubbing, real-time dictation, vocal isolation, and speaker diarization, all without sending data to an external server.
What OmniVoice Studio Does
The application bundles six distinct capabilities. Understanding each one helps clarify what the system is doing under the hood.
Voice cloning works from a 3-second audio clip. The system uses zero-shot learning, meaning it clones a voice it has never been trained on. It does this by conditioning a diffusion-based TTS model on the short reference audio. The underlying model, OmniVoice from k2-fsa, supports 600+ languages.
Voice design lets you build a new voice from parameters (gender, age, accent, pitch, speed, emotion, and dialect) without cloning any existing voice.
Video dubbing takes a YouTube URL or a local video file. It runs transcription using WhisperX, translates the transcript, synthesizes new audio using the TTS engine, and exports an MP4. The entire pipeline runs locally.
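To make the order of those stages concrete, here is a minimal sketch of the translate-then-synthesize step, assuming transcription has already produced timed segments. The `Segment` type, the `dub` function, and the two stand-in stages are hypothetical names chosen for illustration; they are not OmniVoice Studio's actual code, and the real pipeline also handles download, alignment, and MP4 export.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    """One transcribed span of speech (hypothetical shape)."""
    start: float  # seconds
    end: float    # seconds
    text: str


def dub(
    segments: List[Segment],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> List[Tuple[float, float, bytes]]:
    """Translate each segment, then synthesize audio for the translation.

    Returns (start, end, audio) triples that a later muxing step could
    place back onto the video timeline.
    """
    dubbed = []
    for seg in segments:
        translated = translate(seg.text)
        audio = synthesize(translated)
        dubbed.append((seg.start, seg.end, audio))
    return dubbed


# Stand-in stages so the sketch runs without any models installed.
def fake_translate(text: str) -> str:
    return text.upper()


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


segments = [Segment(0.0, 1.5, "hello"), Segment(1.5, 3.0, "world")]
result = dub(segments, fake_translate, fake_synthesize)
```

Passing the stages in as callables keeps the ordering logic separate from whichever translation model or TTS engine is plugged in.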
The dictation widget is a system-wide floating overlay. On macOS it activates via ⌘+⇧+Space from any application. It streams transcription over WebSocket and auto-pastes the result into whatever app is in focus.
The Batch Queue lets you drop up to 50 videos and walk away, with per-job progress bars tracking each one through the entire pipeline.
The MCP Server exposes OmniVoice Studio’s capabilities to any MCP client, including Claude, Cursor, or your own tooling.
The Architecture
The project uses a React frontend talking to a FastAPI backend. The backend exposes 97 API endpoints, uses Server-Sent Events (SSE) for streaming updates, and stores data in SQLite.
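For readers unfamiliar with SSE, the wire format is plain text: each event is a few `field: value` lines followed by a blank line. The sketch below shows that framing with the standard library only. The function names and the `job_id`/`percent` payload are assumptions for illustration, not the project's actual event schema.

```python
import json
from typing import Iterator


def format_sse(data: dict, event: str = "message") -> str:
    """Serialize one SSE frame: field lines terminated by a blank line."""
    payload = json.dumps(data)
    return f"event: {event}\ndata: {payload}\n\n"


def progress_stream(job_id: str, steps: int) -> Iterator[str]:
    """Yield one 'progress' frame per completed step of a hypothetical job."""
    for done in range(1, steps + 1):
        yield format_sse(
            {"job_id": job_id, "percent": round(100 * done / steps)},
            event="progress",
        )


frames = list(progress_stream("job-1", 4))
```

In a FastAPI app, a generator like this would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`; whether OmniVoice Studio does it that way is not stated in the article.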
Four core ML libraries handle the heavy work:
- WhisperX handles automatic speech recognition (ASR) with word-level alignment. It supports 99 languages for transcription.
- Demucs (Meta) handles source separation. It splits speech from background music and preserves both stems independently.
- Pyannote handles speaker diarization: identifying which speaker said which words in a multi-speaker audio file. It is used in conjunction with WhisperX.
- AudioSeal (Meta) embeds an invisible neural watermark into generated audio. This watermark survives compression and serves as AI provenance metadata.
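Combining word-level timestamps with diarization output comes down to matching time intervals. As a rough illustration, the sketch below labels each word with the speaker turn it overlaps most. It is a simplified stand-in written for this article, not the WhisperX or Pyannote API, and the tuple layouts are assumptions.

```python
from typing import List, Optional, Tuple

Word = Tuple[float, float, str]  # (start, end, text), as from word-level ASR
Turn = Tuple[float, float, str]  # (start, end, speaker), as from diarization


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    words: List[Word], turns: List[Turn]
) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it the most.

    Words that overlap no turn at all are labeled None.
    """
    labeled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(w_start, w_end, t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labeled.append((text, best_speaker))
    return labeled


words = [(0.0, 0.4, "hi"), (0.5, 0.9, "there"), (1.1, 1.6, "hello")]
turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 2.0, "SPEAKER_01")]
labeled = assign_speakers(words, turns)
```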
The desktop wrapper is built with Tauri, a Rust-based framework for cross-platform native apps. The codebase is 56% Python, 23.6% JavaScript, 11% CSS, 3.4% Shell, 3.3% Rust, and 2.6% TypeScript.
For GPU support, the backend auto-detects CUDA (NVIDIA), MPS (Apple Silicon Metal), and ROCm (AMD). With 8 GB of VRAM or less, TTS automatically offloads to CPU during transcription. No configuration is required.
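The article does not show the detection code, but the logic it describes can be sketched as follows, assuming a PyTorch-based backend. All function names here are hypothetical. ROCm builds of PyTorch report through the same `torch.cuda` interface, so a single check covers both NVIDIA and AMD in this sketch; applying the 8 GB rule to every non-CPU device is also an assumption.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA/ROCm, then Apple's MPS, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def tts_device_during_transcription(
    device: str, vram_gb: float, threshold_gb: float = 8.0
) -> str:
    """Move TTS to CPU while transcribing if GPU memory is at or below the threshold."""
    if device != "cpu" and vram_gb <= threshold_gb:
        return "cpu"
    return device


def detect_device() -> str:
    """Probe PyTorch if it is installed; otherwise report CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)
    return pick_device(
        torch.cuda.is_available(),
        bool(mps is not None and mps.is_available()),
    )


device = detect_device()
```

Keeping `pick_device` free of any PyTorch import makes the priority order easy to test on a machine with no GPU.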
Six TTS Engines, One Backend Registry
OmniVoice Studio ships a pluggable multi-engine TTS backend. You can switch engines in Settings → TTS Engine or by setting the OMNIVOICE_TTS_BACKEND environment variable.
The six built-in engines are OmniVoice (default, 600+ languages), CosyVoice 3 (9 languages plus 18 dialects, Apache-2.0), MLX-Audio (Apple Silicon-only, includes Kokoro and Qwen3-TTS among others), VoxCPM2 (30 languages, Apache-2.0), MOSS-TTS-Nano (20 languages, runs in real time on CPU), and KittenTTS (English-only, CPU-only, MIT).
Adding a custom engine takes roughly 50 lines of Python. You subclass TTSBackend in backend/services/tts_backend.py and register it in the _REGISTRY dictionary at the bottom of that file.
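The general shape of such a registry might look like the sketch below. Only the names `TTSBackend`, `_REGISTRY`, and `OMNIVOICE_TTS_BACKEND` come from the article. The `synthesize` method, the registry keys, the example engines, and `get_backend` are assumptions made for illustration; check the actual base class in the repository for the real interface.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict, Type


class TTSBackend(ABC):
    """Stand-in base class; the real one defines the project's interface."""

    @abstractmethod
    def synthesize(self, text: str) -> bytes:  # hypothetical method name
        ...


class DefaultEngine(TTSBackend):
    """Placeholder for a built-in engine."""

    def synthesize(self, text: str) -> bytes:
        return b"default:" + text.encode("utf-8")


class MyCustomEngine(TTSBackend):
    """A custom engine: subclass the base and implement its methods."""

    def synthesize(self, text: str) -> bytes:
        return b"custom:" + text.encode("utf-8")


# Registration step: map a backend name to its class (keys are hypothetical).
_REGISTRY: Dict[str, Type[TTSBackend]] = {
    "default": DefaultEngine,
    "my-custom": MyCustomEngine,
}


def get_backend(default: str = "default") -> TTSBackend:
    """Instantiate the backend named by OMNIVOICE_TTS_BACKEND, or the default."""
    name = os.environ.get("OMNIVOICE_TTS_BACKEND", default)
    try:
        return _REGISTRY[name]()
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"Unknown TTS backend {name!r}; known: {known}") from None


os.environ["OMNIVOICE_TTS_BACKEND"] = "my-custom"
audio = get_backend().synthesize("hello")
```

A dictionary keyed by name means adding an engine touches only the new class and one registry entry, which fits the small line count the article quotes.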
Language Coverage
ElevenLabs supports 32 languages. OmniVoice Studio supports 646 languages for TTS and 99 languages for transcription via WhisperX. Translation coverage depends on the target language pair.
Getting Started
Prerequisites are ffmpeg, Bun, and uv. Clone the repo, then run:
uv sync
bun install
bun dev
The frontend loads at http://localhost:5173 and the API runs on port 8000. Model weights download automatically on first generation.
Key Takeaways
- OmniVoice Studio runs voice cloning, dubbing, diarization, and dictation fully locally, with no API keys or cloud account needed.
- It supports 646 languages for TTS and 99 for transcription via WhisperX; ElevenLabs supports 32 languages.
- The backend is FastAPI + SQLite + WhisperX + Demucs + Pyannote + AudioSeal, wrapped in a Tauri desktop app.
- An MCP Server is built in, making OmniVoice usable from Claude, Cursor, or any MCP client.
- Adding a custom TTS engine requires subclassing TTSBackend in roughly 50 lines of Python.
The post Meet OmniVoice Studio: A Local, Open-Source Alternative to ElevenLabs appeared first on MarkTechPost.
