
OpenMOSS Releases MOSS-Audio: An Open-Source Foundation Model for Speech, Sound, Music, and Time-Aware Audio Reasoning

Understanding what’s happening in an audio clip is a deceptively hard problem. Transcribing spoken words is the easy part. A truly capable system also needs to recognize who is speaking, detect their emotional state, interpret background sounds, analyze musical content, and answer time-grounded questions like ‘what did the speaker say at the 2-minute mark?’. Tackling all of that has traditionally required stitching together multiple specialized systems.

The OpenMOSS team, MOSI.AI, and Shanghai Innovation Institute have released MOSS-Audio: an open-source audio understanding model designed to unify all of these capabilities within a single foundation model.

What MOSS-Audio Actually Does

MOSS-Audio supports speech understanding, environmental sound understanding, music understanding, audio captioning, time-aware QA, and complex reasoning over real-world audio. Its capability set breaks down into several distinct areas:

  • Speech & Content Understanding: accurately recognizes and transcribes spoken content, supporting both word-level and sentence-level timestamp alignment.
  • Speaker, Emotion & Event Analysis: identifies speaker traits, analyzes emotional states based on tone, timbre, and context, and detects key acoustic events within the audio.
  • Scene & Sound Cue Extraction: pulls meaningful signals from background sounds, environmental noise, and non-speech cues to infer scene context and atmosphere.
  • Music Understanding: analyzes musical style, emotional progression, and instrumentation.
  • Audio Question Answering & Summarization: handles questions and summaries across speech, podcasts, meetings, and interviews.
  • Complex Reasoning: performs multi-hop reasoning over audio content, powered by both chain-of-thought training and reinforcement learning.

In practical terms, a single MOSS-Audio model can do all of the above without switching between different specialized systems.

Four Model Variants

The team released four variants at launch: MOSS-Audio-4B-Instruct, MOSS-Audio-4B-Thinking, MOSS-Audio-8B-Instruct, and MOSS-Audio-8B-Thinking. The naming convention is worth understanding if you’re deciding which to use. The Instruct variants are optimized for direct instruction following, making them well suited for production pipelines where you want predictable, structured outputs. The Thinking variants provide stronger chain-of-thought reasoning capabilities, better suited for tasks requiring multi-hop inference. The 4B models use Qwen3-4B as the LLM backbone, and the 8B models use Qwen3-8B, resulting in total model sizes of roughly 4.6B and 8.6B parameters respectively.

https://github.com/OpenMOSS/MOSS-Audio

The Architecture: Three Components Working Together

MOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by the MOSS-Audio-Encoder into continuous temporal representations at 12.5 Hz. Those representations are then projected into the language model’s embedding space through the adapter, and finally consumed by the LLM for autoregressive text generation.
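To make the data flow concrete, here is a minimal sketch of the three-stage pipeline using toy shapes. The sample rate, feature widths, and function names are illustrative assumptions, not the released implementation; only the 12.5 Hz frame rate comes from the article.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed input sample rate
FRAME_RATE = 12.5      # encoder output rate described above (Hz)
ENC_DIM = 1024         # assumed encoder feature width
LLM_DIM = 4096         # assumed LLM embedding width

def encode(audio: np.ndarray) -> np.ndarray:
    """Audio encoder: waveform -> continuous features at 12.5 Hz."""
    n_frames = int(len(audio) / SAMPLE_RATE * FRAME_RATE)
    return np.zeros((n_frames, ENC_DIM))  # placeholder features

def adapt(features: np.ndarray) -> np.ndarray:
    """Modality adapter: project encoder features into the LLM embedding space."""
    projection = np.zeros((ENC_DIM, LLM_DIM))  # learned in the real model
    return features @ projection

# Ten seconds of audio yields 125 frames at 12.5 Hz, each mapped to an
# LLM-dimensional embedding that the language model then consumes.
ten_seconds = np.zeros(10 * SAMPLE_RATE)
embeddings = adapt(encode(ten_seconds))
print(embeddings.shape)  # → (125, 4096)
```

The key intuition is the 80 ms granularity: every second of audio becomes just 12.5 embedding positions in the LLM’s context, which keeps long recordings tractable.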

The research team trained the encoder from scratch rather than relying on off-the-shelf audio frontends. Their reasoning: a dedicated encoder delivers more robust speech representations, tighter temporal alignment, and better extensibility across acoustic domains.

Two architectural innovations within MOSS-Audio are worth understanding in detail.

DeepStack Cross-Layer Feature Injection: A common weakness in audio models is that relying solely on the encoder’s top-layer features tends to lose low-level acoustic information: prosody, transient events, and local time-frequency structure. MOSS-Audio addresses this with a DeepStack-inspired cross-layer injection module between the encoder and the language model: in addition to the encoder’s final-layer output, features from earlier and intermediate layers are selected, independently projected, and injected into the language model’s early layers. This preserves multi-granularity information ranging from low-level acoustic details to high-level semantic abstractions, helping the model retain rhythm, timbre, transients, and background structure that a single high-level representation cannot fully capture.
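The mechanism can be sketched in a few lines. Which encoder layers are tapped, and how the injection combines with the LLM’s hidden states, are assumptions for illustration; the sketch only shows the “independently projected, additively injected” pattern the article describes.

```python
import numpy as np

rng = np.random.default_rng(0)
T, ENC_DIM, LLM_DIM = 125, 1024, 4096
TAP_LAYERS = [8, 16, 24]  # assumed intermediate encoder layers to tap

# Stand-ins for per-layer encoder outputs (time x feature). Layer 32
# represents the final encoder layer, which feeds the LLM as usual.
encoder_layers = {i: rng.normal(size=(T, ENC_DIM)) for i in TAP_LAYERS + [32]}

# One independent learned projection per tapped layer.
projections = {i: rng.normal(size=(ENC_DIM, LLM_DIM)) / ENC_DIM**0.5
               for i in TAP_LAYERS}

# Tapped features are projected and added into early LLM hidden states,
# so low-level acoustic detail survives alongside the top-layer semantics.
hidden = np.zeros((T, LLM_DIM))
for layer in TAP_LAYERS:
    hidden += encoder_layers[layer] @ projections[layer]

print(hidden.shape)  # → (125, 4096)
```

The design choice to give each tapped layer its own projection matters: prosodic detail from layer 8 and phonetic structure from layer 24 live in different feature spaces, so a shared projection would blur them together.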

Time-Aware Representation: Time is a critical dimension in audio that text models aren’t naturally equipped to handle. MOSS-Audio addresses this through a time-marker insertion strategy during pretraining: explicit time tokens are inserted between audio frame representations at fixed time intervals to indicate temporal positions. This lets the model learn ‘what happened when’ within a unified text generation framework, naturally supporting timestamp ASR, event localization, time-based QA, and long-audio retrospection, without requiring a separate localization head or post-processing pipeline.
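A toy version of the interleaving makes the idea tangible. The token format and the two-second marker interval below are invented for illustration; the source does not specify either.

```python
FRAME_RATE = 12.5          # frames per second from the encoder
MARKER_EVERY_S = 2.0       # assumed marker interval (not specified in the source)

def insert_time_markers(n_frames: int) -> list[str]:
    """Interleave explicit time tokens with audio-frame positions."""
    frames_per_marker = int(FRAME_RATE * MARKER_EVERY_S)  # 25 frames
    seq = []
    for i in range(n_frames):
        if i % frames_per_marker == 0:
            seq.append(f"<time:{i / FRAME_RATE:.1f}s>")  # hypothetical token format
        seq.append(f"<frame_{i}>")
    return seq

# Four seconds of audio (50 frames) gets markers at 0.0 s and 2.0 s.
tokens = insert_time_markers(50)
print(tokens[0], tokens[26])  # → <time:0.0s> <time:2.0s>
```

Because the markers sit in the same sequence the LLM already models, answering “what was said at the 2-minute mark?” reduces to ordinary next-token prediction conditioned on the nearest time token, with no separate localization head.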

Benchmark Performance

The numbers are strong. On general audio understanding, MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08 across four benchmarks: 77.33 on MMAU, 64.92 on MMAU-Pro, 66.53 on MMAR, and 75.52 on MMSU, outperforming the majority of open-source models. That includes larger models: Step-Audio-R1 at 33B scores 70.67, and Qwen3-Omni-30B-A3B-Instruct at 30B scores 67.91. For further context, Kimi-Audio (7B) scores 61.14 and MiMo-Audio-7B scores 62.97 on the same average. The 4B Thinking variant scores 68.37, meaning the smaller model with chain-of-thought training beats all larger open-source instruct-only competitors.

On speech captioning, evaluated with an LLM-as-a-Judge methodology across 13 fine-grained dimensions (gender, age, accent, pitch, volume, speed, texture, clarity, fluency, emotion, tone, character, and summary), the MOSS-Audio-Instruct variants lead on 11 of the 13 dimensions, with MOSS-Audio-8B-Instruct achieving the best overall average score of 3.7252.

On automatic speech recognition (ASR) spanning 12 evaluation dimensions, including health condition, code-switching, dialect, singing, and non-speech scenarios, MOSS-Audio-8B-Instruct achieves the lowest overall character error rate (CER) of 11.30 across all tested models.


Key Takeaways

  • Single Model, Full Audio Stack: MOSS-Audio unifies speech transcription, speaker and emotion analysis, environmental sound understanding, music analysis, audio captioning, time-aware QA, and complex reasoning into one open-source model, eliminating the need to chain multiple specialized systems together.
  • Two Architectural Innovations Drive Performance: DeepStack cross-layer feature injection preserves multi-granularity acoustic information by injecting features from intermediate encoder layers directly into the LLM’s early layers, while time-marker insertion during pretraining gives the model explicit temporal awareness for timestamp-grounded tasks.
  • Best-in-Class Benchmark Results at Efficient Scale: MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08 on general audio understanding benchmarks, outperforming all open-source models including 30B+ systems, while the 4B Thinking variant alone beats every larger open-source instruct-only competitor.
  • Dominant Timestamp ASR Accuracy: MOSS-Audio-8B-Instruct scores 35.77 AAS on AISHELL-1 and 131.61 AAS on LibriSpeech, dramatically outperforming both Qwen3-Omni-30B-A3B-Instruct (833.66) and the closed-source Gemini-3.1-Pro (708.24) on the same benchmark.



The post OpenMOSS Releases MOSS-Audio: An Open-Source Foundation Model for Speech, Sound, Music, and Time-Aware Audio Reasoning appeared first on MarkTechPost.
