
Meet OmniVoice Studio: A Local, Open-Source Alternative to ElevenLabs

ElevenLabs costs between $5 and $330 per month for its voice AI services, and every audio file you process goes through its cloud servers. For those looking for an open-source alternative to ElevenLabs, OmniVoice Studio is a good fit: an open-source desktop application that runs the same categories of tasks locally. It is an interesting individual project that handles voice cloning, video dubbing, real-time dictation, vocal isolation, and speaker diarization, all without sending data to an external server.

What OmniVoice Studio Does

The application bundles six distinct capabilities. Understanding each one helps clarify what the system is doing under the hood.

Voice cloning works from a 3-second audio clip. The system uses zero-shot learning, meaning it clones a voice it has never been trained on before. It does this by conditioning a diffusion-based TTS model on the short reference audio. The underlying model, OmniVoice from k2-fsa, supports 600+ languages.

Voice design lets you build a new voice from parameters (gender, age, accent, pitch, speed, emotion, and dialect) without cloning any existing voice.

Video dubbing takes a YouTube URL or a local video file. It runs transcription using WhisperX, translates the transcript, synthesizes new audio using the TTS engine, and exports an MP4. The entire pipeline runs locally.
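
The data flow of such a pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the `Segment` type, the `dub` function, and the stand-in translate and TTS stages are all hypothetical, and a real run would call WhisperX, a translator, and a TTS engine instead.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Segment:
    """One transcribed utterance with its position on the video timeline."""
    start: float  # seconds
    end: float    # seconds
    text: str


def dub(
    segments: List[Segment],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> List[Tuple[Segment, bytes]]:
    """Translate each segment and synthesize audio for the translated text.

    The original start/end times are kept so that a later mux step can
    place every synthesized clip at the right point in the video.
    """
    dubbed = []
    for seg in segments:
        translated = replace(seg, text=translate(seg.text))
        dubbed.append((translated, synthesize(translated.text)))
    return dubbed


# Hypothetical stand-in stages, used only to make the sketch runnable.
def fake_translate(text: str) -> str:
    return {"hello": "hola", "world": "mundo"}[text]


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


transcript = [Segment(0.0, 1.2, "hello"), Segment(1.5, 2.4, "world")]
result = dub(transcript, fake_translate, fake_tts)
for seg, audio in result:
    print(f"{seg.start:.1f}-{seg.end:.1f}s  {seg.text!r}  ({len(audio)} bytes)")
```

Passing the stages in as callables keeps the orchestration independent of any particular engine, which is one plausible way to support swappable TTS backends.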

The dictation widget is a system-wide floating overlay. On macOS it activates via ⌘+⇧+Space from any application. It streams transcription over WebSocket and auto-pastes the result into whatever app is in focus.

The Batch Queue lets you drop up to 50 videos and walk away, with per-job progress bars tracking each one through the entire pipeline.

The MCP Server exposes OmniVoice Studio's capabilities to any MCP client, including Claude, Cursor, or your own tooling.

The Architecture

The project uses a React frontend talking to a FastAPI backend. The backend exposes 97 API endpoints, uses Server-Sent Events (SSE) for streaming updates, and stores data in SQLite.
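
To make the SSE part concrete, here is a small Python sketch of how a client could parse a Server-Sent Events stream. It follows the standard SSE wire format (fields such as `event:` and `data:`, with a blank line ending each event). The event name and JSON payloads in the sample are invented for illustration and are not taken from the project's API.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per Server-Sent Event from an iterable of text lines."""
    event: Dict[str, str] = {}
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; events without data are dropped.
            if data:
                event["data"] = "\n".join(data)
                yield event
            event, data = {}, []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # the format strips one leading space
            if field == "data":
                data.append(value)  # multiple data lines are joined with "\n"
            else:
                event[field] = value


# Hypothetical stream content: a job reporting progress twice.
stream = [
    ": keep-alive\n",
    "event: progress\n",
    'data: {"job": 1, "pct": 40}\n',
    "\n",
    "event: progress\n",
    'data: {"job": 1, "pct": 100}\n',
    "\n",
]
events = list(parse_sse(stream))
for ev in events:
    print(ev["event"], json.loads(ev["data"])["pct"])
```

In practice the lines would come from a streaming HTTP response rather than a list, but the parsing logic stays the same.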

Four core ML libraries handle the heavy work:

  • WhisperX handles automatic speech recognition (ASR) with word-level alignment. It supports 99 languages for transcription.
  • Demucs (Meta) handles source separation. It splits speech from background music and preserves both stems independently.
  • Pyannote handles speaker diarization: identifying which speaker said which words in a multi-speaker audio file. It is used in conjunction with WhisperX.
  • AudioSeal (Meta) embeds an invisible neural watermark into generated audio. This watermark survives compression and serves as AI provenance metadata.
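
Combining diarization with word-level ASR usually comes down to matching time intervals. The Python sketch below shows one common approach, assuming each word is given the speaker whose turn overlaps it the most. The tuple layouts, function names, and speaker labels are illustrative and do not reproduce the WhisperX or Pyannote APIs.

```python
from typing import List, Optional, Tuple

Word = Tuple[float, float, str]  # (start, end, text), e.g. from word-level ASR
Turn = Tuple[float, float, str]  # (start, end, speaker label), e.g. from diarization


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    words: List[Word], turns: List[Turn]
) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it most.

    Words that overlap no turn at all are labeled None.
    """
    labeled = []
    for w_start, w_end, text in words:
        best, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(w_start, w_end, t_start, t_end)
            if ov > best_overlap:
                best, best_overlap = speaker, ov
        labeled.append((text, best))
    return labeled


words = [(0.0, 0.4, "hi"), (0.5, 0.9, "there"), (1.0, 1.4, "hello"), (3.0, 3.2, "um")]
turns = [(0.0, 0.95, "SPEAKER_00"), (0.95, 2.0, "SPEAKER_01")]
labeled = assign_speakers(words, turns)
for text, speaker in labeled:
    print(f"{speaker}: {text}")
```

The nested loop is quadratic, which is fine for a sketch; a production version would typically sort both lists and sweep through them once.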

The desktop wrapper is built with Tauri, a Rust-based framework for cross-platform native apps. The codebase is 56% Python, 23.6% JavaScript, 11% CSS, 3.4% Shell, 3.3% Rust, and 2.6% TypeScript.

For GPU support, the backend auto-detects CUDA (NVIDIA), MPS (Apple Silicon Metal), and ROCm (AMD). With 8 GB of VRAM or less, TTS automatically offloads to CPU during transcription. No configuration is required.
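
The selection rule described above can be sketched as a small Python function. This is an assumption-laden illustration rather than the project's logic: `pick_devices`, `probe`, and the `offload_at_gb` parameter are hypothetical names, and treating an unknown VRAM size (as on MPS) as "no offload" is a guess. The probe relies on the fact that PyTorch ROCm builds report AMD GPUs through `torch.cuda`.

```python
from typing import Optional, Tuple


def pick_devices(
    cuda: bool, mps: bool, vram_gb: Optional[float], offload_at_gb: float = 8.0
) -> Tuple[str, str]:
    """Return (asr_device, tts_device) for a run where ASR and TTS share a GPU."""
    if cuda:
        accel = "cuda"  # also covers ROCm builds of PyTorch
    elif mps:
        accel = "mps"
    else:
        return "cpu", "cpu"
    # With limited VRAM, keep ASR on the accelerator and move TTS to the CPU.
    if vram_gb is not None and vram_gb <= offload_at_gb:
        return accel, "cpu"
    return accel, accel


def probe() -> Tuple[bool, bool, Optional[float]]:
    """Detect accelerators with PyTorch if it is installed; otherwise report none."""
    try:
        import torch
    except ImportError:
        return False, False, None
    cuda = torch.cuda.is_available()
    mps_backend = getattr(torch.backends, "mps", None)
    mps = bool(mps_backend and mps_backend.is_available())
    vram = None
    if cuda:
        vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return cuda, mps, vram


print(pick_devices(*probe()))
```

Keeping the decision in a pure function makes the offload threshold easy to test without any GPU present.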

Six TTS Engines, One Backend Registry

OmniVoice Studio ships a pluggable multi-engine TTS backend. You can swap engines in Settings → TTS Engine or by setting the OMNIVOICE_TTS_BACKEND environment variable.
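
A backend might resolve that variable along the following lines. This Python sketch is only a guess at the mechanism: the `resolve_engine` helper, the set of known keys, and the assumption that the default key is `omnivoice` are all hypothetical. Only the variable name and the `cosyvoice` value appear in the project's material.

```python
import os
from typing import Mapping, Optional

# Illustrative engine keys; "omnivoice" as the default key is an assumption.
KNOWN_ENGINES = {"omnivoice", "cosyvoice"}
DEFAULT_ENGINE = "omnivoice"


def resolve_engine(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the TTS engine key from OMNIVOICE_TTS_BACKEND, falling back to the default."""
    env = os.environ if env is None else env
    name = env.get("OMNIVOICE_TTS_BACKEND", DEFAULT_ENGINE).strip().lower()
    if name not in KNOWN_ENGINES:
        raise ValueError(
            f"unknown TTS backend {name!r}; expected one of {sorted(KNOWN_ENGINES)}"
        )
    return name


print(resolve_engine({}))
print(resolve_engine({"OMNIVOICE_TTS_BACKEND": "cosyvoice"}))
```

Failing loudly on an unknown key, rather than silently using the default, makes a mistyped variable easier to spot.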

The six built-in engines are OmniVoice (default, 600+ languages), CosyVoice 3 (9 languages plus 18 dialects, Apache-2.0), MLX-Audio (Apple Silicon only, includes Kokoro and Qwen3-TTS among others), VoxCPM2 (30 languages, Apache-2.0), MOSS-TTS-Nano (20 languages, runs in real time on CPU), and KittenTTS (English-only, CPU-only, MIT).

Adding a custom engine takes roughly 50 lines of Python. You subclass TTSBackend in backend/services/tts_backend.py and register it in the _REGISTRY dictionary at the bottom of that file.

Language Coverage

ElevenLabs supports 32 languages. OmniVoice Studio supports 646 languages for TTS and 99 languages for transcription via WhisperX. Translation coverage depends on the target language pair.

Getting Started

Prerequisites are ffmpeg, Bun, and uv. Clone the repo, then run:

uv sync
bun install
bun dev

The frontend loads at http://localhost:5173 and the API runs on port 8000. Model weights download automatically on first generation.

Marktechpost’s Visual Explainer

OmniVoice Studio — How to Use It
What Is OmniVoice Studio?
OmniVoice Studio is an open-source desktop application for voice cloning, video dubbing, real-time dictation, and speaker diarization. Everything runs locally on your machine. No API keys, no cloud account, no subscription required.
  • 646 languages supported for TTS via the default OmniVoice engine
  • 99 languages for transcription via WhisperX
  • Available on macOS, Windows, and Linux
  • GPU is optional; the full pipeline runs on CPU
  • Free for personal, educational, and research use (FSL-1.1-ALv2)

System Requirements
A GPU is optional. Without one, TTS runs roughly 3× slower on CPU. With ≤8 GB VRAM, TTS automatically offloads to CPU during transcription, with no configuration needed.
Component | Minimum | Recommended
OS | Win 10 / macOS 12+ / Ubuntu 20.04+ | Any modern 64-bit OS
RAM | 8 GB | 16 GB+
VRAM | 4 GB (auto-offloads) | 8 GB+ (RTX 3060+)
Disk | 10 GB free | 20 GB+ SSD
Python | 3.10+ | 3.11–3.12
GPU | Optional | CUDA / MPS / ROCm

Installation
The project recommends running from source. Install three prerequisites first: ffmpeg, Bun (JS runtime), and uv (Python package manager).

git clone https://github.com/debpalash/OmniVoice-Studio.git
cd OmniVoice-Studio
uv sync
bun install
bun dev

Frontend loads at http://localhost:5173  |  API runs on port 8000.
Model weights download automatically on first generation.
Pre-built installers are available as a macOS DMG, Windows MSI, Linux AppImage, and .deb; see the Releases page on GitHub.

Voice Cloning
Voice cloning uses zero-shot learning: it clones a voice from a clip as short as 3 seconds, without prior training on that voice. The default OmniVoice engine conditions a diffusion-based TTS model on the reference audio.
  • Go to the Voice Clone tab in the UI
  • Upload or record a 3-second audio clip of the target voice
  • Enter your text and select a target language (646 available)
  • Click Generate; the output is saved to your project library
Voice Gallery: Search YouTube, browse categories, and download reference clips directly inside the app to build your voice library.

Video Dubbing
The full dubbing pipeline runs locally: transcribe → translate → synthesize → mux. Demucs isolates vocals so the original background audio is preserved in the final export.
  • Go to the Dub tab, then paste a YouTube URL or upload a local file
  • WhisperX transcribes speech with word-level alignment
  • Select a target language; translation runs automatically
  • The TTS engine re-voices the transcript; Demucs preserves background audio
  • Export the final MP4 with the dubbed audio mixed in
Batch Queue: Drop up to 50 videos and walk away. Each job has its own progress bar tracking it through the entire pipeline.

Dictation & Speaker Diarization
Dictation works system-wide from any application. Diarization identifies individual speakers in a multi-speaker audio file using Pyannote + WhisperX.
  • Press ⌘+⇧+Space (macOS) to open the floating dictation widget
  • Speech streams over WebSocket and auto-pastes into the active input field
  • Upload a multi-speaker file to the Diarization tab
  • Pyannote identifies who said what; each speaker gets an auto-extracted voice profile
  • Assign a TTS voice per speaker for per-speaker dubbing
A Hugging Face token is required for Pyannote diarization. See docs/setup/huggingface-token.md in the repo.

TTS Engines
Six TTS engines are built in. Switch via Settings → TTS Engine or the env var:
OMNIVOICE_TTS_BACKEND=cosyvoice
Engine | Languages | Clone | Platform
OmniVoice (default) | 600+ | | CUDA / MPS / CPU
CosyVoice 3 | 9 + 18 dialects | | CUDA / MPS / CPU
MLX-Audio | Multi | Varies | Apple Silicon only
VoxCPM2 | 30 | | CUDA / MPS / CPU
MOSS-TTS-Nano | 20 | | CUDA / CPU
KittenTTS | English | | CPU only
Custom engine: Subclass TTSBackend in backend/services/tts_backend.py and add it to _REGISTRY. ~50 lines of Python.

MCP Server & Resources
OmniVoice Studio ships a built-in MCP Server, exposing voice and dubbing capabilities to any MCP-compatible client (Claude, Cursor, or your own tooling) without opening the desktop UI.
  • MCP Server starts alongside the FastAPI backend on bun dev
  • Point your MCP client at the local server to access all endpoints
  • AudioSeal (Meta) embeds an invisible neural watermark in all generated audio for AI provenance
  • GitHub: github.com/debpalash/OmniVoice-Studio
  • Install docs: docs/install/ (macos / windows / linux / docker)
  • Troubleshooting: docs/install/troubleshooting.md
  • Discord: discord.gg/bzQavDfVV9

Key Takeaways

  • OmniVoice Studio runs voice cloning, dubbing, diarization, and dictation fully locally, with no API keys or cloud account needed.
  • It supports 646 languages for TTS and 99 for transcription via WhisperX; ElevenLabs supports 32 languages.
  • The backend is FastAPI + SQLite + WhisperX + Demucs + Pyannote + AudioSeal, wrapped in a Tauri desktop app.
  • An MCP Server is built in, making OmniVoice usable from Claude, Cursor, or any MCP client.
  • Adding a custom TTS engine requires subclassing TTSBackend in roughly 50 lines of Python.



The post Meet OmniVoice Studio: A Local, Open-Source Alternative to ElevenLabs appeared first on MarkTechPost.