Meet OmniVoice Studio: A Local, Open-Source Alternative to ElevenLabs

ElevenLabs costs between $5 and $330 per month for its voice AI services, and every audio file you process goes through its cloud servers. For those looking for an open-source alternative to ElevenLabs, OmniVoice Studio is a good fit: an open-source desktop application that runs the same categories of tasks locally. It is an interesting individual project that handles voice cloning, video dubbing, real-time dictation, vocal isolation, and speaker diarization without sending data to an external server.

What OmniVoice Studio Does

The application bundles six distinct capabilities. Understanding each one helps clarify what the system is doing under the hood.

Voice cloning works from a 3-second audio clip. The system uses zero-shot learning, meaning it clones a voice it has never been trained on. It does this by conditioning a diffusion-based TTS model on the short reference audio. The underlying model, OmniVoice from k2-fsa, supports 600+ languages.

Voice design lets you build a new voice from parameters (gender, age, accent, pitch, speed, emotion, and dialect) without cloning any existing voice.

Video dubbing takes a YouTube URL or a local video file. It runs transcription using WhisperX, translates the transcript, synthesizes new audio using the TTS engine, and exports an MP4. The full pipeline runs locally.

The dictation widget is a system-wide floating overlay. On macOS it activates via ⌘+⇧+Space from any application. It streams transcription via WebSocket and auto-pastes the result into whatever app is in focus.

The Batch Queue lets you drop up to 50 videos and walk away, with per-job progress bars tracking each one through the full pipeline.

The MCP Server exposes OmniVoice Studio’s capabilities to any MCP client, including Claude, Cursor, or your own tooling.

The Architecture

The project uses a React frontend talking to a FastAPI backend. The backend exposes 97 API endpoints, uses Server-Sent Events (SSE) for streaming updates, and stores data in SQLite.
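To make the SSE part concrete, here is a minimal sketch of how a client could parse a Server-Sent Events stream in Python. The wire format (field lines, comment lines starting with a colon, a blank line ending each event) follows the SSE standard, but the event name and JSON payload shown here are assumptions for illustration, not the project's actual schema.

```python
import json
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per SSE event; a blank line terminates an event."""
    event: Dict[str, str] = {}
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data_lines or event:
                event["data"] = "\n".join(data_lines)
                yield event
            event, data_lines = {}, []
            continue
        if line.startswith(":"):
            continue  # comment or keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the standard strips one leading space
        if field == "data":
            data_lines.append(value)
        else:
            event[field] = value


# Hypothetical progress stream; the real payload shape may differ.
stream = [
    "event: progress\n",
    'data: {"job": "dub-1", "percent": 40}\n',
    "\n",
    ": keep-alive\n",
    "event: progress\n",
    'data: {"job": "dub-1", "percent": 100}\n',
    "\n",
]
events = list(parse_sse(stream))
percents = [json.loads(e["data"])["percent"] for e in events]
```

SSE is one-directional (server to client), which suits progress updates where the client only needs to listen.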

Four core ML libraries handle the heavy work:

WhisperX handles automatic speech recognition (ASR) with word-level alignment. It supports 99 languages for transcription.
Demucs (Meta) handles source separation. It splits speech from background music and preserves both stems independently.
Pyannote handles speaker diarization: identifying which speaker said which words in a multi-speaker audio file. It is used in conjunction with WhisperX.
AudioSeal (Meta) embeds an invisible neural watermark into generated audio. This watermark survives compression and serves as AI provenance metadata.

The desktop wrapper is built with Tauri, a Rust-based framework for cross-platform native apps. The codebase is 56% Python, 23.6% JavaScript, 11% CSS, 3.4% Shell, 3.3% Rust, and 2.6% TypeScript.

For GPU support, the backend auto-detects CUDA (NVIDIA), MPS (Apple Silicon Metal), and ROCm (AMD). With 8 GB VRAM or less, TTS automatically offloads to CPU during transcription. No configuration is required.
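The selection logic described above might look roughly like the following. This is a sketch under stated assumptions, not the project's code: the function names are hypothetical, and the availability flags are passed in as plain booleans so the example runs without any ML framework. In a PyTorch-based backend they would typically come from torch.cuda.is_available() (which also returns True on ROCm builds) and torch.backends.mps.is_available().

```python
from typing import Optional


def pick_device(cuda: bool, mps: bool) -> str:
    """Prefer CUDA (ROCm builds of PyTorch also report as CUDA), then MPS, else CPU."""
    if cuda:
        return "cuda"
    if mps:
        return "mps"
    return "cpu"


def tts_device_during_transcription(
    device: str, vram_gb: Optional[float], threshold_gb: float = 8.0
) -> str:
    """Move TTS to CPU while ASR runs if VRAM is at or below the threshold.

    The 8 GB threshold is the figure quoted in the article; applying it only
    to CUDA devices is an assumption of this sketch.
    """
    if device == "cuda" and vram_gb is not None and vram_gb <= threshold_gb:
        return "cpu"
    return device
```

For example, a 6 GB card would keep transcription on the GPU while TTS temporarily runs on the CPU, so the two models never compete for the same memory.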

Six TTS Engines, One Backend Registry

OmniVoice Studio ships a pluggable multi-engine TTS backend. You can swap engines in Settings → TTS Engine or by setting the OMNIVOICE_TTS_BACKEND environment variable.

The six built-in engines are OmniVoice (default, 600+ languages), CosyVoice 3 (9 languages plus 18 dialects, Apache-2.0), MLX-Audio (Apple Silicon only, includes Kokoro and Qwen3-TTS among others), VoxCPM2 (30 languages, Apache-2.0), MOSS-TTS-Nano (20 languages, runs in real time on CPU), and KittenTTS (English-only, CPU-only, MIT).

Adding a custom engine takes roughly 50 lines of Python. You subclass TTSBackend in backend/services/tts_backend.py and register it in the _REGISTRY dictionary at the bottom of that file.
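As a rough illustration of the subclass-and-register pattern, the sketch below defines a stand-in TTSBackend base class and a toy engine. Only the names TTSBackend, _REGISTRY, and OMNIVOICE_TTS_BACKEND come from the article; the synthesize method, its signature, the EchoBackend class, and the get_backend helper are all hypothetical, so check the real base class in the repo before copying anything.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict, Type


class TTSBackend(ABC):
    """Hypothetical base class; the real one may define different methods."""

    name: str = "base"

    @abstractmethod
    def synthesize(self, text: str, language: str) -> bytes:
        ...


class EchoBackend(TTSBackend):
    """Toy engine that returns the text as bytes instead of audio."""

    name = "echo"

    def synthesize(self, text: str, language: str) -> bytes:
        return f"{language}:{text}".encode("utf-8")


# Registry mapping backend keys to classes, mirroring the pattern described above.
_REGISTRY: Dict[str, Type[TTSBackend]] = {
    EchoBackend.name: EchoBackend,
}


def get_backend(default: str = "echo") -> TTSBackend:
    """Instantiate the backend named by OMNIVOICE_TTS_BACKEND, or the default."""
    key = os.environ.get("OMNIVOICE_TTS_BACKEND", default)
    try:
        return _REGISTRY[key]()
    except KeyError:
        raise ValueError(
            f"Unknown TTS backend {key!r}; known: {sorted(_REGISTRY)}"
        ) from None
```

A dictionary registry keeps engine selection to a single lookup, which is why one environment variable can switch the whole backend.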

Language Coverage

ElevenLabs supports 32 languages. OmniVoice Studio supports 646 languages for TTS and 99 languages for transcription via WhisperX. Translation coverage depends on the target language pair.

Getting Started

Prerequisites are ffmpeg, Bun, and uv. Clone the repo, then run:

uv sync
bun install
bun dev

The frontend loads at http://localhost:5173 and the API runs on port 8000. Model weights download automatically on first generation.
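Before running the commands above, it can help to confirm that the three prerequisites are actually on your PATH. The following is a small convenience sketch, not part of the project; the function names are made up for illustration, and it only checks that the ffmpeg, bun, and uv executables can be found, not their versions.

```python
import shutil
from typing import Dict, Iterable, List, Optional

PREREQUISITES = ("ffmpeg", "bun", "uv")


def find_tools(names: Iterable[str] = PREREQUISITES) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(found: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of tools that could not be located."""
    return [name for name, path in found.items() if path is None]


if __name__ == "__main__":
    missing = missing_tools(find_tools())
    if missing:
        print(f"Missing: {', '.join(missing)}")
    else:
        print("All prerequisites found.")
```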

Marktechpost’s Visual Explainer

What Is OmniVoice Studio?

OmniVoice Studio is an open-source desktop application for voice cloning, video dubbing, real-time dictation, and speaker diarization. Everything runs locally on your machine. No API keys, no cloud account, no subscription required.

646 languages supported for TTS via the default OmniVoice engine
99 languages for transcription via WhisperX
Available on macOS, Windows, and Linux
GPU is optional; the full pipeline runs on CPU
Free for personal, educational, and research use (FSL-1.1-ALv2)

System Requirements

A GPU is optional. Without one, TTS runs roughly 3× slower on CPU. With ≤8 GB VRAM, TTS automatically offloads to CPU during transcription, with no configuration needed.

Component	Minimum	Recommended
OS	Win 10 / macOS 12+ / Ubuntu 20.04+	Any modern 64-bit OS
RAM	8 GB	16 GB+
VRAM	4 GB (auto-offloads)	8 GB+ (RTX 3060+)
Disk	10 GB free	20 GB+ SSD
Python	3.10+	3.11–3.12
GPU	Optional	CUDA / MPS / ROCm

Installation

The project recommends running from source. Install three prerequisites first: ffmpeg, Bun (JS runtime), and uv (Python package manager).

git clone https://github.com/debpalash/OmniVoice-Studio.git
cd OmniVoice-Studio
uv sync
bun install
bun dev

Frontend loads at http://localhost:5173 | API runs on port 8000.
Model weights download automatically on first generation.

Pre-built installers are available (macOS DMG, Windows MSI, Linux AppImage and .deb); see the Releases page on GitHub.

Voice Cloning

Voice cloning uses zero-shot learning: it clones a voice from a clip as short as 3 seconds, without prior training on that voice. The default OmniVoice engine conditions a diffusion-based TTS model on the reference audio.

Go to the Voice Clone tab in the UI
Upload or record a 3-second audio clip of the target voice
Enter your text and select a target language (646 available)
Click Generate; the output is saved to your project library
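If you prepare reference clips outside the app, a quick length check can save a failed attempt. The sketch below is an illustration only, using Python's standard wave module, so it handles uncompressed PCM WAV files and nothing else. The 3-second minimum is the figure quoted above; the function names are hypothetical, and the app itself may accept other formats and apply its own validation.

```python
import io
import wave


def wav_duration_seconds(source) -> float:
    """Return the duration of a PCM WAV file given a path or file-like object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(source, minimum_seconds: float = 3.0) -> bool:
    """Check a clip against the minimum reference length."""
    return wav_duration_seconds(source) >= minimum_seconds


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer
```

Calling is_long_enough("reference.wav") on a real file works the same way as on the in-memory buffers produced by make_silent_wav.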

Voice Gallery: Search YouTube, browse categories, and download reference clips directly inside the app to build your voice library.

Video Dubbing

The full dubbing pipeline runs locally: transcribe → translate → synthesize → mux. Demucs isolates vocals so the original background audio is preserved in the final export.

Go to the Dub tab and paste a YouTube URL or upload a local file
WhisperX transcribes speech with word-level alignment
Select a target language; translation runs automatically
TTS engine re-voices the transcript; Demucs preserves background audio
Export the final MP4 with dubbed audio mixed in
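The steps above can be sketched as a chain of swappable stages. This is a structural illustration under stated assumptions, not the project's implementation: every name here (Segment, dub, and the fake_* stand-ins) is hypothetical, and the stand-in stages return canned values so the example runs without WhisperX, a translator, or a TTS engine installed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


# Each stage is a plain callable so real engines could be swapped in.
Transcriber = Callable[[str], List[Segment]]
Translator = Callable[[List[Segment], str], List[Segment]]
Synthesizer = Callable[[List[Segment]], List[str]]


def dub(video_path: str, target_lang: str, transcribe: Transcriber,
        translate: Translator, synthesize: Synthesizer) -> dict:
    """Run transcribe -> translate -> synthesize and collect inputs for muxing."""
    source_segments = transcribe(video_path)
    translated = translate(source_segments, target_lang)
    clips = synthesize(translated)
    # Muxing is not performed here; a real pipeline would hand these to ffmpeg.
    return {"video": video_path, "lang": target_lang,
            "segments": translated, "audio_clips": clips}


# Stand-in stages (hypothetical) so the sketch runs without any models.
def fake_transcribe(path: str) -> List[Segment]:
    return [Segment(0.0, 1.5, "hello"), Segment(1.5, 3.0, "world")]


def fake_translate(segments: List[Segment], lang: str) -> List[Segment]:
    # Timestamps are carried over so dubbed audio can stay aligned with the video.
    return [Segment(s.start, s.end, f"[{lang}] {s.text}") for s in segments]


def fake_synthesize(segments: List[Segment]) -> List[str]:
    return [f"clip_{i}.wav" for i, _ in enumerate(segments)]


result = dub("talk.mp4", "es", fake_transcribe, fake_translate, fake_synthesize)
```

Keeping segment timestamps through translation is what lets the synthesized clips be placed back at the right points in the video.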

Batch Queue: Drop up to 50 videos and walk away. Each job has its own progress bar tracking it through the full pipeline.
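One simple way to model per-job progress is to count completed pipeline stages. The sketch below is an assumption about how such a queue could be structured, not a description of the app's internals; the class names are hypothetical, while the stage names and the 50-job limit are taken from the text above.

```python
from dataclasses import dataclass, field
from typing import List

STAGES = ("transcribe", "translate", "synthesize", "mux")
MAX_JOBS = 50  # limit quoted in the article


@dataclass
class Job:
    source: str
    completed_stages: int = 0

    @property
    def progress(self) -> float:
        """Fraction of pipeline stages finished, from 0.0 to 1.0."""
        return self.completed_stages / len(STAGES)


@dataclass
class BatchQueue:
    jobs: List[Job] = field(default_factory=list)

    def add(self, source: str) -> Job:
        """Enqueue a video, refusing anything beyond the job limit."""
        if len(self.jobs) >= MAX_JOBS:
            raise OverflowError(f"queue is limited to {MAX_JOBS} jobs")
        job = Job(source)
        self.jobs.append(job)
        return job

    def advance(self, job: Job) -> None:
        """Mark the job's next stage as finished."""
        if job.completed_stages < len(STAGES):
            job.completed_stages += 1
```

A progress bar in the UI would then just render job.progress for each entry in the queue.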

Dictation & Speaker Diarization

Dictation works system-wide from any application. Diarization identifies individual speakers in a multi-speaker audio file using Pyannote + WhisperX.

Press ⌘+⇧+Space (macOS) to open the floating dictation widget
Speech streams via WebSocket and auto-pastes into the active input field
Upload a multi-speaker file to the Diarization tab
Pyannote identifies who said what; each speaker gets an auto-extracted voice profile
Assign a TTS voice per speaker for per-speaker dubbing

A Hugging Face token is required for Pyannote diarization. See docs/setup/huggingface-token.md in the repo.
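To illustrate what combining diarization with word-level alignment means in practice, the sketch below labels each word with the speaker whose turn overlaps it the most. This is a common general approach shown on hand-made data; it is not claimed to be the exact method WhisperX, Pyannote, or OmniVoice Studio uses, and the tuple layouts and speaker labels are assumptions for the example.

```python
from typing import List, Optional, Tuple

Word = Tuple[float, float, str]  # (start, end, text), as from word-level alignment
Turn = Tuple[float, float, str]  # (start, end, speaker label), as from diarization


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    words: List[Word], turns: List[Turn]
) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(w_start, w_end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((text, best_speaker))  # None if no turn overlaps the word
    return labeled


words = [(0.0, 0.4, "hi"), (0.5, 0.9, "there"), (1.1, 1.6, "hello")]
turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 2.0, "SPEAKER_01")]
labeled = assign_speakers(words, turns)
```

Once every word carries a speaker label, grouping consecutive words by speaker gives the per-speaker transcript needed for per-speaker dubbing.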

TTS Engines

Six TTS engines are built in. Switch via Settings → TTS Engine or the env var:
OMNIVOICE_TTS_BACKEND=cosyvoice

Engine	Languages	Clone	Platform
OmniVoice (default)	600+	✓	CUDA / MPS / CPU
CosyVoice 3	9 + 18 dialects	✓	CUDA / MPS / CPU
MLX-Audio	Multi	Varies	Apple Silicon only
VoxCPM2	30	✓	CUDA / MPS / CPU
MOSS-TTS-Nano	20	✓	CUDA / CPU
KittenTTS	English	✗	CPU only

Custom engine: Subclass TTSBackend in backend/services/tts_backend.py and add it to _REGISTRY. ~50 lines of Python.

MCP Server & Resources

OmniVoice Studio ships a built-in MCP Server, exposing voice and dubbing capabilities to any MCP-compatible client (Claude, Cursor, or your own tooling) without opening the desktop UI.

MCP Server starts alongside the FastAPI backend on bun dev
Point your MCP client at the local server to access all endpoints
AudioSeal (Meta) embeds an invisible neural watermark in all generated audio for AI provenance

GitHub: github.com/debpalash/OmniVoice-Studio
Install docs: docs/install/ (macos / windows / linux / docker)
Troubleshooting: docs/install/troubleshooting.md
Discord: discord.gg/bzQavDfVV9

Key Takeaways

OmniVoice Studio runs voice cloning, dubbing, diarization, and dictation fully locally, with no API keys or cloud account needed.
It supports 646 languages for TTS and 99 for transcription via WhisperX; ElevenLabs supports 32 languages.
The backend is FastAPI + SQLite + WhisperX + Demucs + Pyannote + AudioSeal, wrapped in a Tauri desktop app.
An MCP Server is built in, making OmniVoice usable from Claude, Cursor, or any MCP client.
Adding a custom TTS engine requires subclassing TTSBackend in roughly 50 lines of Python.


The post Meet OmniVoice Studio: A Local, Open-Source Alternative to ElevenLabs appeared first on MarkTechPost.
