Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Interpretation Across 60 Languages at 2.8-Second Latency

Simultaneous interpretation is one of the harder problems in applied AI. You’re asking a model to translate speech before the speaker has finished a sentence. Every additional second of delay breaks the illusion of real-time communication. Alibaba’s Qwen team has been chipping away at this with each release. Their latest model, Qwen3.5-LiveTranslate-Flash, brings that latency down to 2.8 seconds and expands input language coverage to 60 languages.

https://qwen.ai/weblog?id=qwen3.5-livetranslate

A Meaningful Jump From the Previous Release

The previous release, Qwen3-LiveTranslate-Flash, handled 18 input languages at roughly three seconds of latency. Qwen3.5-LiveTranslate-Flash brings that down to 2.8 seconds, expands input coverage to 60 languages, and adds speech output in 29 languages. That’s more than a 3× expansion in language coverage on the input side. For developers building multilingual products, this reduces the need for per-language model switching in most global business scenarios.

The latency improvement comes from a technique for processing what the team calls ‘reading units.’ Rather than waiting for a full sentence to arrive before generating output, the model decides when enough meaning has accumulated in a segment to commit to a translation. It streams output continuously while the speaker is still talking. This is the same underlying logic as semantic unit prediction, but with a tighter implementation that shaves off roughly 200 milliseconds.
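To make the idea of committing early more concrete, here is a toy sketch of a commit policy for streaming output. It is not the Qwen team's algorithm, whose internals the announcement does not describe. The boundary rule (punctuation or a maximum buffer length) and the names `commit_units`, `BOUNDARIES`, and `max_pending` are assumptions chosen purely for illustration.

```python
from typing import Iterable, Iterator, List

# ASSUMPTION: punctuation marks stand in for "enough meaning has accumulated"
BOUNDARIES = {",", ".", "?", "!", ";"}

def commit_units(tokens: Iterable[str], max_pending: int = 6) -> Iterator[List[str]]:
    """Yield a segment as soon as it looks complete, without waiting for the sentence end."""
    pending: List[str] = []
    for tok in tokens:
        pending.append(tok)
        # Commit on a boundary token, or when too many tokens have piled up
        if tok in BOUNDARIES or len(pending) >= max_pending:
            yield pending
            pending = []
    if pending:
        # Flush whatever is left when the stream ends
        yield pending

if __name__ == "__main__":
    stream = "when the meeting starts , we will review the quarterly numbers .".split()
    for unit in commit_units(stream):
        print(" ".join(unit))
```

A real system would decide boundaries from meaning rather than punctuation, but the control flow is similar: output is emitted per segment while input keeps arriving.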

Vision Is Now a First-Class Input

Most translation systems treat audio as the only input signal. That works fine in clean studio conditions. It breaks down in a crowded conference room, a noisy trade floor, or anywhere with overlapping voices and bad acoustics.

Qwen3.5-LiveTranslate-Flash takes a different approach. It analyzes visual information in parallel with audio: on-screen text, physically shown objects, lip movements, and gestures. When a word is phonetically ambiguous or the audio stream degrades, the visual context fills the gap and sharpens the translation decision. This is not a minor feature. In real-world deployment, audio quality isn’t guaranteed. Having a vision channel means the model handles the messy reality of live interpretation more gracefully than audio-only systems.

Voice Cloning Happens in Real Time

This is the part that stands out most in the Qwen3.5 release. Standard translation systems replace the speaker’s voice with a generic synthesized voice. Qwen3.5-LiveTranslate-Flash instead clones the characteristic voice features of the original speaker during the translation itself. A single spoken sentence is enough for the model to perform this acoustic adaptation.

For listeners on the receiving end, the translated output sounds like the same person speaking the target language, not a robotic substitute. In live conference interpretation, multilingual livestreams, or international customer calls, this matters. The experience feels noticeably more human than what current systems deliver.

Configure Domain-Specific Keywords

One persistent failure mode for translation models in professional settings is proper nouns and specialized vocabulary. A model translating a medical briefing might consistently mistranslate a drug name. A legal interpretation session breaks down over a technical statutory term.

Qwen3.5-LiveTranslate-Flash addresses this with dynamic keyword configuration at runtime. Developers can inject a glossary of brand names, medical terms, legal terminology, or technical vocabulary, and the model handles those terms significantly more reliably. This isn’t available in most general-purpose translation APIs, and it closes a real gap for domain-specific enterprise deployments.

Benchmark Performance

On FLEURS and CoVoST2, two established benchmarks for multilingual speech translation, Qwen3.5-LiveTranslate-Flash outperforms leading commercial alternatives. FLEURS tests translation quality across a wide variety of language pairs under real acoustic conditions. CoVoST2 covers 21 translation directions from speech, making it a practical proxy for multilingual pipeline performance.

Marktechpost’s Visual Explainer

How to Use Qwen3.5-LiveTranslate-Flash

A step-by-step integration guide, from setup to production-ready real-time translation

What it does

Qwen3.5-LiveTranslate-Flash at a glance

Qwen3.5-LiveTranslate-Flash is an API-only, closed-weight real-time translation model from Alibaba’s Qwen team. It takes audio and video frames as simultaneous inputs and outputs translated text and speech. The model uses a WebSocket-based protocol over Alibaba Cloud Model Studio.

Latency: 2.8s (per token to audio out)
Input languages: 60 (speech + visual input)
Speech output: 29 (languages with voice)
Protocol: WebSocket (persistent connection)

✓ Vision-enhanced comprehension: lip movements, gestures, and on-screen text all feed into the translation decision alongside audio
◆ Real-time voice cloning: clones the original speaker’s voice profile in the translated output from a single spoken sentence
◆ Semantic unit prediction: commits to output segments before a full sentence ends, enabling continuous streaming without waiting for complete utterances
◆ Dynamic keyword configuration: inject domain-specific glossaries at runtime for technical, medical, or legal terminology

Before you begin

Prerequisites

You need an Alibaba Cloud account with Model Studio access and a valid DashScope API key. The model is accessible through the qwen3-livetranslate-flash-realtime model ID.

Create an Alibaba Cloud account

Get your DashScope API key

Navigate to Model Studio → API Keys. Generate a key and store it as the environment variable DASHSCOPE_API_KEY. Never hardcode it in source files.

Install the Python dependency

Install the websocket-client package for the WebSocket connection. For audio capture, also install pyaudio.

Check your audio setup

The model accepts 16kHz, 16-bit PCM mono audio as input. Confirm your microphone or audio source can output in this format before connecting.

BASH

# Install dependencies
pip install websocket-client pyaudio

# Set your API key as an environment variable
export DASHSCOPE_API_KEY="your_key_here"
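Before wiring up a live microphone, it can help to sanity-check a recorded file against the required 16kHz, 16-bit, mono format. The sketch below uses only Python's standard `wave` module. The helper names `check_wav_format` and `chunk_bytes` are illustrative and are not part of the DashScope API.

```python
import io
import wave

REQUIRED_RATE = 16000     # 16 kHz sample rate
REQUIRED_WIDTH = 2        # 16-bit samples are 2 bytes each
REQUIRED_CHANNELS = 1     # mono

def check_wav_format(source) -> list:
    """Return a list of mismatches between a WAV file (path or file object) and the required format."""
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != REQUIRED_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {REQUIRED_RATE}")
        if wav.getsampwidth() != REQUIRED_WIDTH:
            problems.append(f"sample width is {wav.getsampwidth()} bytes, expected {REQUIRED_WIDTH}")
        if wav.getnchannels() != REQUIRED_CHANNELS:
            problems.append(f"channel count is {wav.getnchannels()}, expected {REQUIRED_CHANNELS}")
    return problems

def chunk_bytes(duration_ms: int = 100) -> int:
    """Size in bytes of one PCM chunk of the given duration in the required format."""
    samples = REQUIRED_RATE * duration_ms // 1000
    return samples * REQUIRED_WIDTH * REQUIRED_CHANNELS

if __name__ == "__main__":
    # Build a 100 ms silent clip in memory and validate it
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 1600)
    buf.seek(0)
    print(check_wav_format(buf), chunk_bytes(100))
```

An empty list means the file already matches. Anything else needs resampling or downmixing before it is streamed.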

Step 3 — Connection

Establish the WebSocket connection

The model uses the WebSocket protocol for a persistent, bidirectional connection. You authenticate via a Bearer token in the connection header using your DashScope API key.

PYTHON

import json, os, websocket

API_KEY = os.getenv("DASHSCOPE_API_KEY")
API_URL = (
    "wss://dashscope-intl.aliyuncs.com"
    "/api-ws/v1/realtime"
    "?model=qwen3-livetranslate-flash-realtime"
)

def on_open(ws):
    print("Connected to Qwen3.5-LiveTranslate-Flash")

def on_message(ws, message):
    # Each server message is a JSON-encoded event
    data = json.loads(message)
    print("Translation event:", data)

def on_error(ws, error):
    print("Error:", error)

ws = websocket.WebSocketApp(
    API_URL,
    header=["Authorization: Bearer " + API_KEY],
    on_open=on_open,
    on_message=on_message,
    on_error=on_error
)
ws.run_forever()  # blocks until the connection closes

ⓘ The connection stays open for the full session. You don’t reconnect per utterance. Send audio chunks and image frames continuously over the same socket.
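The snippet above concatenates the key straight into the header, which raises a `TypeError` if `DASHSCOPE_API_KEY` is unset. A small helper can fail earlier with a clearer message. This is a minimal sketch: the endpoint and model ID are taken from the article, while `build_connection` is a hypothetical name, not a DashScope SDK function.

```python
import os
from urllib.parse import urlencode

BASE_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"  # endpoint from the article
MODEL_ID = "qwen3-livetranslate-flash-realtime"                      # model ID from the article

def build_connection(api_key=None, model=MODEL_ID):
    """Return (url, headers) suitable for websocket.WebSocketApp, failing early on a missing key."""
    api_key = api_key or os.getenv("DASHSCOPE_API_KEY")
    if not api_key:
        raise RuntimeError("DASHSCOPE_API_KEY is not set")
    url = f"{BASE_URL}?{urlencode({'model': model})}"
    headers = [f"Authorization: Bearer {api_key}"]
    return url, headers
```

The returned pair can be passed as `websocket.WebSocketApp(url, header=headers, ...)`, keeping key handling in one place.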

Step 4 — Audio streaming

Configure and stream audio input

After connecting, send a session configuration event to set the source and target languages. Then stream PCM audio chunks continuously. The model uses session.input_audio_transcription.language to identify the input language.

PYTHON

import base64, json, threading, pyaudio

# Audio input config: 16kHz, 16-bit PCM mono
INPUT_RATE     = 16000
INPUT_CHUNK    = 1600  # 1600 samples = 100ms per chunk at 16kHz
INPUT_FORMAT   = pyaudio.paInt16
INPUT_CHANNELS = 1

def stream_microphone(ws):
    # Read 100ms PCM chunks from the microphone and send them base64-encoded
    pa = pyaudio.PyAudio()
    stream = pa.open(
        rate=INPUT_RATE, channels=INPUT_CHANNELS,
        format=INPUT_FORMAT, input=True,
        frames_per_buffer=INPUT_CHUNK
    )
    while True:
        chunk = stream.read(INPUT_CHUNK)
        audio_b64 = base64.b64encode(chunk).decode()
        ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": audio_b64
        }))

def on_open(ws):
    # 1. Send session config first
    session_cfg = {
        "type": "session.update",
        "session": {
            "input_audio_transcription": {
                "language": "zh"  # source: Chinese
            },
            "translation": {
                "target_language": "en"  # target: English
            }
        }
    }
    ws.send(json.dumps(session_cfg))

    # 2. Stream microphone audio on a background thread so this callback
    #    returns and the socket can keep receiving server events
    threading.Thread(
        target=stream_microphone,
        args=(ws,), daemon=True
    ).start()

⚠ Do not send audio before the session.update event is acknowledged. Wait for the server’s session confirmation event before streaming audio chunks.
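One way to enforce this ordering is to gate the audio sender on a `threading.Event` that the message handler sets. The sketch below assumes the server's confirmation arrives as a JSON event whose `type` is `session.updated`. That event name is an assumption, so check the Model Studio documentation for the actual value. `SessionGate` is a hypothetical helper, not part of any SDK.

```python
import json
import threading

ACK_EVENT_TYPE = "session.updated"  # ASSUMPTION: confirm the real name in the official docs

class SessionGate:
    """Lets a sender thread wait until the session config has been acknowledged."""

    def __init__(self):
        self._ready = threading.Event()

    def handle_message(self, message: str) -> dict:
        """Call from on_message; marks the session ready when the acknowledgement arrives."""
        event = json.loads(message)
        if event.get("type") == ACK_EVENT_TYPE:
            self._ready.set()
        return event

    def wait_until_ready(self, timeout: float = 5.0) -> bool:
        """Block until the acknowledgement arrives; returns False on timeout."""
        return self._ready.wait(timeout)
```

In practice the audio thread would call `gate.wait_until_ready()` before its first send, and `on_message` would route every incoming message through `gate.handle_message(message)`.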

Step 5 — Vision input

Send video frames for vision-enhanced comprehension

Qwen3.5-LiveTranslate-Flash reads lip movements, gestures, and on-screen text from video frames alongside audio. Send base64-encoded JPEG frames at a regular interval during the session. Even a low frame rate significantly improves accuracy in noisy audio conditions.

PYTHON

import base64, json, threading, time
import cv2

def stream_video_frames(ws):
    cap = cv2.VideoCapture(0)  # 0 = default camera
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # Encode frame as JPEG → base64
        ok, buf = cv2.imencode(".jpg", frame)
        if not ok:
            continue
        img_b64 = base64.b64encode(buf.tobytes()).decode()
        ws.send(json.dumps({
            "type": "input_image_buffer.append",
            "image": img_b64
        }))
        time.sleep(0.5)  # ~2fps is sufficient
    cap.release()  # free the camera once capture stops

# Run video streaming in a separate thread
threading.Thread(
    target=stream_video_frames,
    args=(ws,), daemon=True
).start()

ⓘ Vision input is optional but recommended for live human speech scenarios. For pre-recorded audio files with no camera feed, you can omit image frames entirely and rely on audio alone.
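If frames come from a source that produces them faster than you want to send them, a small throttle keeps the upload rate near the roughly 2fps suggested above. The sketch below is standard-library only. The event type and field name follow the article's snippet, while `build_image_event` and `FrameThrottle` are hypothetical helpers.

```python
import base64
import json

JPEG_MAGIC = b"\xff\xd8"  # every JPEG stream starts with these two bytes

def build_image_event(jpeg_bytes: bytes) -> str:
    """Wrap already-encoded JPEG bytes in the JSON event used for image frames."""
    if not jpeg_bytes.startswith(JPEG_MAGIC):
        raise ValueError("frame is not JPEG-encoded")
    return json.dumps({
        "type": "input_image_buffer.append",  # event name as given in the article
        "image": base64.b64encode(jpeg_bytes).decode("ascii"),
    })

class FrameThrottle:
    """Allow at most one frame per `interval` seconds."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval
        self._last = None  # timestamp of the last frame that was let through

    def allow(self, now: float) -> bool:
        if self._last is None or now - self._last >= self.interval:
            self._last = now
            return True
        return False
```

Passing the timestamp in explicitly (for example `throttle.allow(time.monotonic())`) keeps the class easy to test and independent of how frames are captured.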

Step 6 — Domain accuracy

Dynamic keyword configuration

For technical, medical, legal, or brand-specific vocabulary, you can inject a keyword glossary at session start. The model uses this list to significantly improve translation reliability for terms that standard training data may handle inconsistently.

PYTHON

# Add to your session.update payload
session_cfg = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "language": "zh"
        },
        "translation": {
            "target_language": "en"
        },
        # Inject domain keywords here
        "keywords": [
            {"source": "达芬奇机器人",  "target": "da Vinci Surgical System"},
            {"source": "腹腔镜",      "target": "laparoscope"},
            {"source": "实体瘤",      "target": "solid tumor"}
        ]
    }
}
ws.send(json.dumps(session_cfg))

✓ Works for brand names, drug names, legal statutes, and technical model numbers
✓ Keywords are scoped to the session and don’t persist across connections
◆ Keep the list focused: only terms where mistranslation would cause real errors
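Glossaries usually live in a spreadsheet or a plain mapping rather than in the payload shape shown above. The helper below is a sketch that converts a `{source: target}` dictionary into that list-of-pairs shape and attaches it to a session config. The `keywords` field name is read from the article's snippet, and `build_keywords` and `with_keywords` are hypothetical names.

```python
import copy

def build_keywords(glossary: dict) -> list:
    """Convert {source: target} into [{"source": ..., "target": ...}], skipping blank entries."""
    keywords = []
    for source, target in glossary.items():
        source, target = source.strip(), target.strip()
        if source and target:
            keywords.append({"source": source, "target": target})
    return keywords

def with_keywords(session_cfg: dict, glossary: dict) -> dict:
    """Return a copy of session_cfg with the glossary attached; the input is left untouched."""
    cfg = copy.deepcopy(session_cfg)
    cfg["session"]["keywords"] = build_keywords(glossary)  # field name as in the article
    return cfg

if __name__ == "__main__":
    base = {"type": "session.update", "session": {"translation": {"target_language": "en"}}}
    print(with_keywords(base, {"腹腔镜": "laparoscope", "  ": "ignored"}))
```

Returning a copy means one base config can be reused across sessions with different glossaries, which fits the note that keywords do not persist across connections.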

Reference

Supported languages

Qwen3.5-LiveTranslate-Flash understands 60 input languages and can produce speech output in 29 languages. The highlighted pills below are confirmed speech output languages. All pills represent supported input.

Chinese, English, French, German, Spanish, Japanese, Korean, Russian, Portuguese, Italian, Arabic, Hindi, Turkish, Indonesian, Thai, Vietnamese, Greek, Mandarin, Cantonese, Wu dialect, Sichuanese, Tianjin dialect, Beijing dialect, + 37 more

ⓘ Highlighted pills have confirmed speech (audio) output support. Plain pills are input-only or unconfirmed for voice output. Verify your specific target language pair in the Alibaba Cloud Model Studio documentation before building audio-output pipelines.

⚠ The model supports text output for all 60 input languages. Speech output is available for 29 languages only. If your pipeline requires audio delivery and your target language is not in the confirmed list, plan for a fallback TTS step.
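The fallback decision can be made once per session, before any audio is requested. The sketch below shows the shape of that check. The language set is a placeholder, because the article does not enumerate the 29 speech-output languages, and `plan_output` is a hypothetical helper.

```python
# PLACEHOLDER: fill this from the official documentation; the article
# does not list which 29 languages have speech output.
SPEECH_OUTPUT_LANGS = frozenset({"en", "zh"})

def plan_output(target_lang: str, speech_langs=SPEECH_OUTPUT_LANGS) -> dict:
    """Decide whether to use the model's own speech output or a separate TTS step."""
    native = target_lang in speech_langs
    return {
        "target_language": target_lang,
        "use_model_speech": native,      # model can speak this language directly
        "use_fallback_tts": not native,  # take translated text and synthesize it elsewhere
    }

if __name__ == "__main__":
    print(plan_output("en"))
    print(plan_output("xx"))  # "xx" is a made-up code standing in for an unsupported language
```

Text output is still available in the fallback case, so the pipeline only swaps the final synthesis stage.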

Key Takeaways

Qwen3.5-LiveTranslate-Flash delivers real-time multimodal interpretation across 60 input languages and 29 speech output languages at 2.8 seconds of latency.
The model uses vision-enhanced comprehension (reading lip movements, gestures, and on-screen text) to maintain accuracy in noisy or degraded audio environments.
Real-time voice cloning replicates the original speaker’s voice profile in the translated output using just a single spoken sentence.
Semantic unit prediction via “reading units” processing allows continuous streaming output without waiting for full sentences, reducing latency to 2.8 seconds.
Dynamic keyword configuration lets developers inject domain-specific glossaries at runtime, improving translation reliability for technical, medical, and legal terminology.


The post Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Interpretation Across 60 Languages at 2.8-Second Latency appeared first on MarkTechPost.
