
Maya1: A New Open Source 3B Voice Model For Expressive Text To Speech On A Single GPU

Maya Research has launched Maya1, a 3B-parameter text-to-speech model that turns text plus a short voice description into controllable, expressive speech while running in real time on a single GPU.

What Maya1 Actually Does

Maya1 is a state-of-the-art speech model for expressive voice generation. It is built to capture real human emotion and precise voice design from text inputs.

The core interface has two inputs:

  1. A natural language voice description, for example "Female voice in her 20s with a British accent, energetic, clear diction" or "Demon character, male voice, low pitch, gravelly timbre, slow pacing".
  2. The text that should be spoken

The model combines both signals and generates audio that matches the content and the described style. You can insert inline emotion tags inside the text, such as <laugh>, <sigh>, <whisper>, <angry>, <giggle>, <gasp> and <cry>, with more than 20 emotions supported in total.
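For illustration, here is what the two inputs could look like as plain Python strings, using a description from the examples above and a few of the documented emotion tags; the variable names are only for this sketch.

```python
# Illustrative inputs only; the tag names are taken from the list above.
voice_description = (
    "Female voice in her 20s with a British accent, "
    "energetic, clear diction"
)
text_to_speak = (
    "Wait, we actually shipped it on time? <laugh> "
    "I honestly did not expect that. <sigh> What a week."
)
```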

Maya1 outputs 24 kHz mono audio and supports real-time streaming, which makes it suitable for assistants, interactive agents, games, podcasts and live content.

The Maya Research team claims that the model outperforms top proprietary systems while remaining fully open source under the Apache 2.0 license.

Architecture and SNAC Codec

Maya1 is a 3B-parameter decoder-only transformer with a Llama-style backbone. Instead of predicting raw waveforms, it predicts tokens from a neural audio codec named SNAC.

The generation flow is:

text → tokenize → generate SNAC codes (7 tokens per frame) → decode → 24 kHz audio

SNAC uses a multi-scale hierarchical structure at about 12, 23 and 47 Hz. This keeps the autoregressive sequence compact while preserving detail. The codec is designed for real-time streaming at about 0.98 kbps.

The important point is that the transformer operates on discrete codec tokens instead of raw samples. A separate SNAC decoder, for example hubertsiuzdak/snac_24khz, reconstructs the waveform. This separation makes generation more efficient and easier to scale than direct waveform prediction.
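As a minimal sketch of the codec side, the snac Python package exposes encode and decode methods; the example below round-trips a dummy waveform through hubertsiuzdak/snac_24khz to show that the transformer only ever has to produce the discrete codes, while the codec handles waveform reconstruction.

```python
import torch
from snac import SNAC  # pip install snac

# Load the 24 kHz SNAC codec used as Maya1's decoder.
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# One second of dummy mono audio at 24 kHz, shaped (batch, channels, samples).
audio = torch.randn(1, 1, 24000)

with torch.inference_mode():
    codes = codec.encode(audio)      # list of code tensors, coarse to fine
    audio_hat = codec.decode(codes)  # reconstructed 24 kHz waveform

print([c.shape for c in codes], audio_hat.shape)
```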

Training Data And Voice Conditioning

Maya1 is pretrained on an internet-scale English speech corpus to learn broad acoustic coverage and natural coarticulation. It is then fine-tuned on a curated proprietary dataset of studio recordings that include human-verified voice descriptions, more than 20 emotion tags per sample, multiple English accents, and character or role variations.

The documented data pipeline includes:

  1. 24 kHz mono resampling with loudness normalized to about −23 LUFS (a rough sketch of this step follows the list)
  2. Voice activity detection with silence trimming, keeping clips between 1 and 14 seconds
  3. Forced alignment using the Montreal Forced Aligner for word boundaries
  4. MinHash LSH text deduplication
  5. Chromaprint-based audio deduplication
  6. SNAC encoding with 7-token frame packing
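The team's pipeline code is not published, but the first step can be approximated with common open source audio tools. The sketch below assumes librosa for 24 kHz mono resampling and pyloudnorm for −23 LUFS loudness normalization, which is an assumption about tooling rather than the team's actual stack.

```python
import librosa
import pyloudnorm as pyln
import soundfile as sf

def resample_and_normalize(in_path: str, out_path: str,
                           target_lufs: float = -23.0) -> None:
    # Resample to 24 kHz mono, matching the documented target format.
    audio, sr = librosa.load(in_path, sr=24000, mono=True)

    # Measure integrated loudness and shift it toward the target LUFS.
    meter = pyln.Meter(sr)
    loudness = meter.integrated_loudness(audio)
    normalized = pyln.normalize.loudness(audio, loudness, target_lufs)

    sf.write(out_path, normalized, sr)
```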

The Maya Research team evaluated several ways to condition the model on a voice description. Simple colon formats and key-value tag formats either caused the model to speak the description aloud or did not generalize well. The best-performing format uses an XML-style attribute wrapper that encodes the description and text in a natural way while remaining robust.

In practice, this means developers can describe voices in free-form text, close to how they would brief a voice actor, instead of learning a custom parameter schema.
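A rough sketch of such an attribute-style wrapper is shown below; the exact tag name and quoting that Maya1 expects come from the model card and reference script, so treat this helper as illustrative rather than the official prompt format.

```python
def build_prompt(description: str, text: str) -> str:
    # Hypothetical XML-style attribute wrapper: the description sits in an
    # attribute so the model can style the speech without reading it aloud.
    return f'<description="{description}"> {text}'

prompt = build_prompt(
    "Demon character, male voice, low pitch, gravelly timbre, slow pacing",
    "You dare enter my domain? <laugh> How very brave of you.",
)
```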

https://huggingface.co/maya-research/maya1

Inference And Deployment On A Single GPU

The reference Python script on Hugging Face loads the model with AutoModelForCausalLM.from_pretrained("maya-research/maya1", torch_dtype=torch.bfloat16, device_map="auto") and uses the SNAC decoder from SNAC.from_pretrained("hubertsiuzdak/snac_24khz").
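A condensed sketch of that flow is shown below. It reuses the hypothetical build_prompt helper from the previous section, uses placeholder sampling settings, and leaves out the model-specific step that unpacks generated token IDs into the three SNAC code streams, which the reference script handles.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

model = AutoModelForCausalLM.from_pretrained(
    "maya-research/maya1", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("maya-research/maya1")
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

prompt = build_prompt(  # hypothetical helper from the sketch above
    "Female voice in her 20s with a British accent, energetic, clear diction",
    "Welcome back! <giggle> Let's pick up where we left off.",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=2048,
                               do_sample=True, temperature=0.8, top_p=0.9)

# The newly generated tokens encode SNAC frames (7 tokens per frame).
# Unpacking them into the codec's three hierarchical code streams is model
# specific; follow the reference script for the exact mapping, then call
# codec.decode(codes) to obtain the 24 kHz waveform.
```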

The Maya Research team recommends a single GPU with 16 GB or more of VRAM, for example an A100, an H100 or a consumer RTX 4090-class card.

For production, they provide a vllm_streaming_inference.py script that integrates with vLLM. It supports Automatic Prefix Caching for repeated voice descriptions, a WebAudio ring buffer, multi-GPU scaling and sub-100-millisecond latency targets for real-time use.
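The streaming script itself ships with the repository, but the prefix caching idea can be illustrated with vLLM's offline API: when every request shares the same voice description prefix, vLLM can reuse the cached KV state for that prefix. The snippet below is a sketch under that assumption, not the contents of vllm_streaming_inference.py, and it reuses the hypothetical prompt format from earlier.

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching lets vLLM reuse KV cache for the shared description prefix.
llm = LLM(model="maya-research/maya1", dtype="bfloat16",
          enable_prefix_caching=True)

description = ("Female voice in her 20s with a British accent, "
               "energetic, clear diction")
lines = ["Good morning!", "Here is today's schedule.", "Shall we begin? <giggle>"]

# Only the spoken text changes between requests; the generated codec tokens
# still need SNAC decoding, as in the earlier sketches.
prompts = [f'<description="{description}"> {line}' for line in lines]
outputs = llm.generate(prompts, SamplingParams(max_tokens=2048, temperature=0.8))
```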

Beyond the core repository, they have released:

  • A Hugging Face Space with an interactive browser demo where users enter text and voice descriptions and listen to the output
  • GGUF quantized variants of Maya1 for lighter deployments using llama.cpp
  • A ComfyUI node that wraps Maya1 as a single node, with emotion tag helpers and SNAC integration

These projects reuse the official model weights and interface, so they stay consistent with the main implementation.

Key Takeaways

  1. Maya1 is a 3B-parameter, decoder-only, Llama-style text-to-speech model that predicts SNAC neural codec tokens instead of raw waveforms, and outputs 24 kHz mono audio with streaming support.
  2. The model takes two inputs, a natural language voice description and the target text, and supports more than 20 inline emotion tags such as <laugh>, <cry>, <whisper> and <gasp> for local control of expressiveness.
  3. Maya1 is trained with a pipeline that combines large-scale English pretraining and studio-quality fine-tuning, with loudness normalization, voice activity detection, forced alignment, text deduplication, audio deduplication and SNAC encoding.
  4. The reference implementation runs on a single GPU with 16 GB or more of VRAM using torch_dtype=torch.bfloat16, integrates with a SNAC decoder, and has a vLLM-based streaming server with Automatic Prefix Caching for low-latency deployment.
  5. Maya1 is released under the Apache 2.0 license, with official weights, a Hugging Face Space demo, GGUF quantized variants and ComfyUI integration, which makes expressive, emotion-rich, controllable text-to-speech accessible for commercial and local use.

Editorial Comments

Maya1 pushes open source text-to-speech into territory that was previously dominated by proprietary APIs. A 3B-parameter Llama-style decoder that predicts SNAC codec tokens, runs on a single 16 GB GPU with vLLM streaming and Automatic Prefix Caching, and exposes more than 20 inline emotions with natural language voice design is a practical building block for real-time agents, games and tools. Overall, Maya1 shows that expressive, controllable TTS can be both open and production-ready.

