
LongCat-Flash-Omni: A SOTA Open-Source Omni-Modal Model with 560B Parameters (27B Activated), Excelling at Real-Time Audio-Visual Interaction

How do you design a single model that can listen, see, read and respond in real time across text, image, video and audio without sacrificing efficiency? Meituan's LongCat team has released LongCat-Flash-Omni, an open-source omni-modal model with 560 billion parameters and about 27 billion active per token, built on the shortcut-connected Mixture of Experts design that LongCat-Flash introduced. The model extends the text backbone to vision, video and audio, and it retains a 128K context, so it can handle long conversations and document-level understanding in a single stack.

https://github.com/meituan-longcat/LongCat-Flash-Omni?tab=readme-ov-file

Architecture and Modal Attachments

LongCat-Flash-Omni keeps the language model unchanged and adds perception modules around it. A LongCat ViT encoder processes both images and video frames, so there is no separate video tower. An audio encoder, together with the LongCat Audio Codec, turns speech into discrete tokens, and the decoder can output speech from the same LLM stream, which enables real-time audio-visual interaction.
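
To make the attachment pattern concrete, here is a minimal sketch of how such perception modules can wrap an unchanged LLM backbone. The module names (OmniModalWrapper, the encoder interfaces, the `out_dim` attribute) are hypothetical and not taken from the released code; this only illustrates the design described above, where one ViT-style encoder serves both images and video frames and audio is projected into the same token stream.

```python
# Hedged sketch of the modal-attachment design: one shared vision encoder for
# images and video frames, one audio encoder paired with a discrete codec, both
# projected into the embedding space of an unchanged language backbone.
import torch
import torch.nn as nn

class OmniModalWrapper(nn.Module):
    def __init__(self, llm: nn.Module, vision_encoder: nn.Module,
                 audio_encoder: nn.Module, d_model: int = 4096):
        super().__init__()
        self.llm = llm                        # unchanged language backbone
        self.vision_encoder = vision_encoder  # shared for images and video frames
        self.audio_encoder = audio_encoder    # paired with a discrete audio codec
        # out_dim is a hypothetical attribute on the encoder modules
        self.vision_proj = nn.Linear(vision_encoder.out_dim, d_model)
        self.audio_proj = nn.Linear(audio_encoder.out_dim, d_model)

    def forward(self, text_emb, frames=None, audio=None):
        parts = [text_emb]  # (batch, seq, d_model) embeddings from the text tokenizer
        if frames is not None:
            # the same encoder handles single images and sampled video frames
            parts.append(self.vision_proj(self.vision_encoder(frames)))
        if audio is not None:
            # audio features are aligned to the LLM embedding space
            parts.append(self.audio_proj(self.audio_encoder(audio)))
        fused = torch.cat(parts, dim=1)
        # the LLM emits text tokens plus discrete audio-codec tokens, so the
        # speech decoder can synthesize audio from the same output stream
        return self.llm(inputs_embeds=fused)
```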

Streaming and Feature Interleaving

The research team describes chunk-wise audio-visual feature interleaving, where audio features, video features and timestamps are packed into 1-second segments. Video is sampled at 2 frames per second by default, and the rate is adjusted according to video length; the report does not tie the sampling rule to user or model speaking phases, so the accurate description is duration-conditioned sampling. This keeps latency low while still providing spatial context for GUI, OCR and video QA tasks.
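
A small sketch of these two ideas is shown below: a duration-conditioned frame-rate choice and the packing of audio features, video features and timestamps into 1-second chunks. The helper names and the specific rate thresholds are illustrative assumptions, not the released schedule.

```python
# Hedged sketch: duration-conditioned sampling plus 1-second chunk packing.
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    t_start: int       # segment start time in whole seconds
    video_feats: list  # features for frames that fall in this second
    audio_feats: list  # audio features for this second

def pick_fps(video_seconds: float, default_fps: float = 2.0) -> float:
    # Duration-conditioned sampling: longer videos are sampled more sparsely.
    # The thresholds below are placeholders for illustration only.
    if video_seconds <= 120:
        return default_fps
    if video_seconds <= 600:
        return 1.0
    return 0.5

def interleave(video_feats, video_times, audio_feats, audio_times) -> List[Chunk]:
    # Pack features into 1-second chunks keyed by integer second.
    # Assumes timestamps (in seconds) are sorted in ascending order.
    horizon = int(max(video_times[-1], audio_times[-1])) + 1
    chunks = [Chunk(t, [], []) for t in range(horizon)]
    for feat, t in zip(video_feats, video_times):
        chunks[int(t)].video_feats.append(feat)
    for feat, t in zip(audio_feats, audio_times):
        chunks[int(t)].audio_feats.append(feat)
    return chunks  # fed to the LLM in temporal order for streaming decoding
```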

Curriculum from Text to Omni

Training follows a staged curriculum. The research team first trains the LongCat-Flash text backbone, which activates 18.6B to 31.3B parameters per token (about 27B on average), then applies text-speech continued pretraining, then multimodal continued pretraining with image and video, then context extension to 128K, then audio encoder alignment.
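
One way to picture the curriculum is as an ordered configuration, paraphrasing the stages listed above; the field names and this representation are assumptions for illustration, not the team's training config.

```python
# Hedged sketch of the staged curriculum as an ordered config.
CURRICULUM = [
    {"stage": "text_backbone_pretrain",   "modalities": ["text"]},
    {"stage": "text_speech_pretrain",     "modalities": ["text", "speech"]},
    {"stage": "multimodal_pretrain",      "modalities": ["text", "image", "video"]},
    {"stage": "context_extension_128k",   "modalities": ["text", "image", "video"]},
    {"stage": "audio_encoder_alignment",  "modalities": ["speech"]},
]

for cfg in CURRICULUM:
    print(f"run stage '{cfg['stage']}' on modalities {cfg['modalities']}")
```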

Systems Design, Modality Decoupled Parallelism

Because the encoders and the LLM have different compute patterns, Meituan uses modality-decoupled parallelism. Vision and audio encoders run with hybrid sharding and activation recomputation, the LLM runs with pipeline, context and expert parallelism, and a ModalityBridge aligns embeddings and gradients between the two groups. The research team reports that multimodal supervised fine-tuning retains more than 90 percent of the throughput of text-only training, which is the main systems result in this release.
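
The sketch below expresses that decoupling as two separate parallelism plans joined by a bridge. ParallelPlan, the parallelism degrees, and the ModalityBridge method bodies are hypothetical illustrations of the idea, not the released systems code.

```python
# Hedged sketch of modality-decoupled parallelism as a scheduling config.
from dataclasses import dataclass, field

@dataclass
class ParallelPlan:
    strategy: str
    recompute_activations: bool = False
    degrees: dict = field(default_factory=dict)

# Vision/audio encoders: hybrid sharding plus activation recomputation.
ENCODER_PLAN = ParallelPlan(
    strategy="hybrid_sharding",
    recompute_activations=True,
)

# LLM: pipeline + context + expert parallelism (degrees are illustrative only).
LLM_PLAN = ParallelPlan(
    strategy="pipeline_context_expert",
    degrees={"pipeline": 8, "context": 2, "expert": 16},
)

class ModalityBridge:
    """Hypothetical boundary that re-shards encoder outputs to the LLM layout."""
    def forward(self, encoder_embeddings):
        # exchange embeddings so each LLM rank receives the slice that matches
        # its sequence/context partition
        ...
    def backward(self, llm_grads):
        # route gradients back to the encoder ranks that produced the embeddings
        ...
```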


Benchmarks and Positioning

LongCat-Flash-Omni reaches 61.4 on OmniBench, which is higher than Qwen3-Omni-Instruct at 58.5 and Qwen2.5-Omni at 55.0, but lower than Gemini 2.5 Pro at 66.8. On VideoMME it scores 78.2, close to GPT-4o and Gemini 2.5 Flash, and on VoiceBench it reaches 88.7, slightly higher than GPT-4o Audio in the same table.

Key Takeaways

  1. LongCat-Flash-Omni is an open-source omni-modal model built on Meituan's 560B MoE backbone; it activates about 27B parameters per token through shortcut-connected MoE with zero-computation experts, so it keeps large capacity with inference-friendly compute.
  2. The model attaches unified image-video encoding and a streaming audio path to the existing LongCat-Flash LLM, using 2 fps default video sampling with duration-conditioned adjustment, and packs audio-visual features into 1-second chunks for synchronized decoding, which is what enables real-time any-to-any interaction.
  3. LongCat-Flash-Omni scores 61.4 on OmniBench, above Qwen3-Omni-Instruct at 58.5 but below Gemini 2.5 Pro at 66.8.
  4. Meituan uses modality-decoupled parallelism: vision and audio encoders run with hybrid sharding, the LLM runs with pipeline, context and expert parallelism, and the team reports more than 90 percent of text-only throughput for multimodal SFT, which is the main systems contribution of the release.

Editorial Comments

This release shows that Meituan is trying to make omni-modal interaction practical, not experimental. It keeps the 560B shortcut-connected Mixture of Experts with 27B activated, so the language backbone stays compatible with earlier LongCat releases. It adds streaming audio-visual perception with 2 fps default video sampling and duration-conditioned adjustment, so latency stays low without losing spatial grounding. And it reports over 90 percent of text-only throughput in multimodal supervised fine-tuning through modality-decoupled parallelism.


Check out the Paper, Model Weights and GitHub Repo.

