NVIDIA and the University of Maryland Researchers Released Audio Flamingo Next (AF-Next): A Super Powerful and Open Large Audio-Language Model
Understanding audio has always been the multimodal frontier that lags behind vision. While image-language models have rapidly scaled toward real-world deployment, building open models that robustly reason over speech, environmental sounds, and music, especially at length, has remained hard. NVIDIA and University of Maryland researchers are now taking a direct swing at that gap.
The research team has released Audio Flamingo Next (AF-Next), the most capable model in the Audio Flamingo series and a fully open Large Audio-Language Model (LALM) trained on internet-scale audio data.
Audio Flamingo Next (AF-Next) comes in three specialized variants for different use cases. The release includes AF-Next-Instruct for general question answering, AF-Next-Think for advanced multi-step reasoning, and AF-Next-Captioner for detailed audio captioning.
What is a Large Audio-Language Model (LALM)?
A Large Audio-Language Model (LALM) pairs an audio encoder with a decoder-only language model to enable question answering, captioning, transcription, and reasoning directly over audio inputs. Think of it as the audio equivalent of a vision-language model like LLaVA or GPT-4V, but designed to handle speech, environmental sounds, and music simultaneously, within a single unified model.

The Architecture: Four Components Working in a Pipeline
AF-Next is built around four main components. First is the AF-Whisper audio encoder, a custom Whisper-based encoder further pre-trained on a larger and more diverse corpus, including multilingual speech and multi-talker ASR data. Given an audio input, the model resamples it to 16 kHz mono and converts the waveform into a 128-channel log mel-spectrogram using a 25 ms window and 10 ms hop size. The spectrogram is processed in non-overlapping 30-second chunks by AF-Whisper, which outputs features at 50 Hz, after which a stride-2 pooling layer is applied. The hidden dimension is 1280.
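The numbers above pin down exactly how many tokens the audio front-end emits. The following sketch is illustrative arithmetic only, not the released code, but it works through the stated hop size, feature rate, and pooling stride:

```python
# Token accounting for the AF-Whisper front-end (illustrative, not NVIDIA's code).
SAMPLE_RATE = 16_000   # audio resampled to 16 kHz mono
HOP_MS = 10            # 10 ms hop -> 100 mel frames per second
CHUNK_S = 30           # non-overlapping 30-second chunks
ENCODER_HZ = 50        # AF-Whisper emits features at 50 Hz
POOL_STRIDE = 2        # stride-2 pooling halves the token rate

frames_per_chunk = CHUNK_S * 1000 // HOP_MS      # 3000 mel frames per chunk
encoder_feats = CHUNK_S * ENCODER_HZ             # 1500 encoder features per chunk
audio_tokens = encoder_feats // POOL_STRIDE      # 750 audio tokens per 30 s
tokens_per_second = audio_tokens / CHUNK_S       # 25 tokens/s, i.e. one every 40 ms
print(frames_per_chunk, encoder_feats, audio_tokens, tokens_per_second)
```

Note that 25 tokens per second is exactly the fixed 40 ms stride the RoTE positional scheme below relies on.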
Second is the audio adaptor, a 2-layer MLP that maps AF-Whisper's audio representations into the language model's embedding space. Third is the LLM backbone: Qwen-2.5-7B, a decoder-only causal model with 7B parameters, 36 transformer layers, and 16 attention heads, with context length extended from 32k to 128k tokens via additional long-context training.
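A minimal sketch of what such a 2-layer MLP adaptor does, assuming a 3584-dimensional LLM embedding space and a ReLU-style nonlinearity (the hidden width, activation, and output dimension are my assumptions, not stated in the article):

```python
import numpy as np

rng = np.random.default_rng(0)
D_AUDIO, D_HIDDEN, D_LLM = 1280, 3584, 3584  # 1280 is AF-Whisper's stated hidden dim

# Randomly initialized stand-ins for the adaptor's two weight matrices.
W1 = rng.normal(0, 0.02, (D_AUDIO, D_HIDDEN))
W2 = rng.normal(0, 0.02, (D_HIDDEN, D_LLM))

def adapt(audio_feats):
    """Map (T, 1280) audio features to (T, 3584) LLM-space embeddings."""
    h = np.maximum(audio_feats @ W1, 0.0)  # nonlinearity between the two layers (assumed)
    return h @ W2

tokens = rng.normal(size=(750, D_AUDIO))   # ~750 audio tokens per 30 s chunk
out = adapt(tokens)
print(out.shape)
```

The adaptor's output rows are then interleaved with text embeddings and consumed by the LLM like ordinary tokens.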
A subtle but important architectural detail is Rotary Time Embeddings (RoTE). Standard positional encodings in transformers index a token by its discrete sequence position i. RoTE replaces this: instead of the usual RoPE rotation angle θ ← −i · 2π, RoTE uses θ ← −τi · 2π, where τi is each token's absolute timestamp. For audio tokens produced at a fixed 40 ms stride, discrete time positions are interpolated before being fed into the RoTE module. This yields positional representations grounded in actual time rather than sequence order, a core design choice enabling the model's temporal reasoning, particularly for long audio. Finally, a streaming TTS module enables voice-to-voice interaction.
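A minimal sketch of the RoTE idea: standard rotary embeddings, but with the rotation angle driven by each token's timestamp τ in seconds rather than its integer position. The per-pair frequency scaling and function names here are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def rote_rotate(x, timestamps, base=10000.0):
    """Rotate feature pairs of x (T, d) by timestamp-derived angles (RoTE sketch)."""
    T, d = x.shape
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)         # per-pair frequencies (RoPE-style)
    theta = -2 * np.pi * timestamps[:, None] * inv_freq  # theta = -tau * 2*pi * freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2D rotation applied to each (x1, x2) feature pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Audio tokens arrive at a fixed 40 ms stride, so timestamps are 0.00, 0.04, 0.08, ...
tokens = np.random.default_rng(0).normal(size=(5, 8))
taus = np.arange(5) * 0.040
out = rote_rotate(tokens, taus)
```

Because the transform is a pure rotation, it changes only the phase of each feature pair, not its magnitude, while anchoring that phase to wall-clock time.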
Temporal Audio Chain-of-Thought: The Key Reasoning Recipe
Chain-of-Thought (CoT) prompting has improved reasoning across text and vision models, but prior audio CoT work showed only small gains because training datasets were limited to short clips with simple questions. AF-Next addresses this with Temporal Audio Chain-of-Thought, where the model explicitly anchors each intermediate reasoning step to a timestamp in the audio before producing an answer, encouraging faithful evidence aggregation and reducing hallucination over long recordings.
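To make the idea concrete, a timestamp-anchored thinking chain might look like the following. This format and content are entirely hypothetical, invented for illustration; the paper's actual chain format may differ:

```python
# Hypothetical example of a Temporal Audio CoT thinking chain (format invented here).
thinking = (
    "[00:12] A door slams, followed by hurried footsteps.\n"
    "[01:45] Speaker A whispers 'they're here', suggesting fear.\n"
    "[03:20] Sirens rise in the background, indicating an emergency.\n"
    "Answer: the recording depicts someone fleeing as responders arrive."
)
print(thinking)
```

Each step cites a specific moment in the audio, so the final answer can be audited against the evidence it claims to rest on.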
To train this capability, the research team created AF-Think-Time, a dataset of question–answer–thinking-chain triplets curated from challenging audio sources including trailers, movie recaps, mystery stories, and long-form multi-party conversations. AF-Think-Time contains roughly 43K training samples, with an average of 446.3 words per thinking chain.
Training at Scale: 1 Million Hours, Four Stages
The final training dataset comprises roughly 108 million samples and roughly 1 million hours of audio, drawn from both existing publicly released datasets and raw audio collected from the open web and subsequently labeled synthetically. New data categories introduced include over 200K long videos spanning 5 to 30 minutes for long-form captioning and QA, multi-talker speech understanding data covering speaker identification, interruption identification, and target speaker ASR, roughly 1 million samples for multi-audio reasoning across multiple simultaneous audio inputs, and roughly 386K safety and instruction-following samples.
Training follows a four-stage curriculum, each stage with distinct data mixtures and context lengths. Pre-training has two sub-stages: Stage 1 trains only the audio adaptor while keeping both AF-Whisper and the LLM frozen (max audio 30 seconds, 8K token context); Stage 2 additionally fine-tunes the audio encoder while still keeping the LLM frozen (max audio 1 minute, 8K token context). Mid-training also has two sub-stages: Stage 1 performs full fine-tuning of the entire model, adding AudioSkills-XL and newly curated data (max audio 10 minutes, 24K token context); Stage 2 introduces long-audio captioning and QA, down-sampling the Stage 1 mixture to half its original mix weights while expanding context to 128K tokens and audio to 30 minutes. The model resulting from mid-training is specifically released as AF-Next-Captioner. Post-training applies GRPO-based reinforcement learning focusing on multi-turn chat, safety, instruction following, and selected skill-specific datasets, producing AF-Next-Instruct. Finally, CoT-training starts from AF-Next-Instruct, applies SFT on AF-Think-Time, then GRPO using the post-training data mixture, producing AF-Next-Think.
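The pre-training and mid-training schedule above can be summarized as a small config table. Field names are my own, and the exact token counts assume K = 1024 (the article only says 8K/24K/128K):

```python
# Summary of AF-Next's pre- and mid-training curriculum (field names invented here).
CURRICULUM = [
    {"stage": "pretrain-1", "trainable": ["adaptor"],                   "max_audio_s": 30,   "ctx": 8_192},
    {"stage": "pretrain-2", "trainable": ["adaptor", "encoder"],        "max_audio_s": 60,   "ctx": 8_192},
    {"stage": "midtrain-1", "trainable": ["adaptor", "encoder", "llm"], "max_audio_s": 600,  "ctx": 24_576},
    {"stage": "midtrain-2", "trainable": ["adaptor", "encoder", "llm"], "max_audio_s": 1800, "ctx": 131_072},
]

# Sanity check: both audio length and context only grow as training progresses.
assert all(a["ctx"] <= b["ctx"] for a, b in zip(CURRICULUM, CURRICULUM[1:]))
print([s["stage"] for s in CURRICULUM])
```

Post-training (GRPO) and CoT-training then branch off the mid-trained checkpoint to produce the Instruct and Think variants respectively.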
One notable contribution from the research team is hybrid sequence parallelism, which makes 128K-context training feasible on long audio. Without it, audio token expansion blows past standard context windows and the quadratic memory cost of self-attention becomes infeasible. The solution combines Ulysses attention, which uses all-to-all collectives to distribute sequence and head dimensions within nodes where high-bandwidth interconnects are available, with Ring attention, which circulates key-value blocks across nodes via point-to-point transfers. Ulysses handles intra-node communication efficiently; Ring scales across nodes.
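A quick back-of-envelope, using the 40 ms token stride stated earlier, shows why long audio forces both the extended context and the parallelism scheme (my arithmetic, not figures from the paper):

```python
# Why 30-minute audio needs a 128K context and sequence parallelism.
TOKENS_PER_S = 25                              # one audio token every 40 ms
audio_tokens_30min = 30 * 60 * TOKENS_PER_S    # 45,000 audio tokens before any text

# Self-attention memory grows with the square of sequence length, so moving
# from an 8K to a 128K context multiplies attention memory by (128/8)^2.
rel_cost = (128 / 8) ** 2                      # 256x
print(audio_tokens_30min, rel_cost)
```

At 45K audio tokens plus prompt and reasoning text, a single device cannot hold the attention state, which is exactly the regime where splitting the sequence across GPUs (Ulysses within a node, Ring across nodes) pays off.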

Benchmark Results: Strong Across the Board
On MMAU-v05.15.25, the most widely used audio reasoning benchmark, AF-Next-Instruct achieves an average accuracy of 74.20 vs. Audio Flamingo 3's 72.42, with AF-Next-Think reaching 75.01 and AF-Next-Captioner pushing to 75.76, with gains across all three subcategories: sound (79.87), music (75.3), and speech (72.13). On the harder MMAU-Pro benchmark, AF-Next-Think (58.7) surpasses the closed-source Gemini-2.5-Pro (57.4).
Music understanding sees particularly strong gains. On Medley-Solos-DB instrument recognition, AF-Next reaches 92.13 vs. Audio Flamingo 2's 85.80. On SongCaps music captioning, GPT5 coverage and correctness scores jump from 6.7 and 6.2 (AF3) to 8.8 and 8.9 respectively.
Long-audio understanding is where AF-Next most clearly separates itself. On LongAudioBench, AF-Next-Instruct achieves 73.9, outperforming both Audio Flamingo 3 (68.6) and the closed-source Gemini 2.5 Pro (60.4). On the speech-inclusive variant (+Speech), AF-Next reaches 81.2 vs. Gemini 2.5 Pro's 66.2. On ASR, AF-Next-Instruct sets new lows among LALMs with a Word Error Rate of 1.54 on LibriSpeech test-clean and 2.76 on test-other. On VoiceBench, AF-Next-Instruct achieves the highest scores on AlpacaEval (4.43), CommonEval (3.96), and OpenBookQA (80.9), surpassing Audio Flamingo 3 by over 14 points on OpenBookQA. On CoVoST2 speech translation, AF-Next shows a particularly notable 12-point improvement over Phi-4-mm on Arabic EN→X translation (21.9 vs. 9.9).

Key Takeaways
Here are the key takeaways:
- A Fully Open Audio-Language Model at Internet Scale: AF-Next is presented as the first LALM to scale audio understanding to internet-scale data, roughly 108 million samples and 1 million hours of audio.
- Temporal Audio Chain-of-Thought Solves Long-Audio Reasoning: Rather than reasoning free-form as prior CoT approaches do, AF-Next explicitly anchors each intermediate reasoning step to a timestamp in the audio before producing an answer. This makes the model significantly more faithful and interpretable on long recordings up to 30 minutes, a problem prior models largely sidestepped.
- Three Specialized Variants for Different Use Cases: The release includes AF-Next-Instruct for general question answering, AF-Next-Think for advanced multi-step reasoning, and AF-Next-Captioner for detailed audio captioning, allowing practitioners to pick the right model for their task rather than using a one-size-fits-all checkpoint.
- Beats Closed Models on Long Audio Despite Being Smaller: On LongAudioBench, AF-Next-Instruct scores 73.9, outperforming the closed-source Gemini 2.5 Pro (60.4) and Audio Flamingo 3 (68.6). On the harder speech-inclusive variant, the gap widens further, with AF-Next reaching 81.2 vs. Gemini 2.5 Pro's 66.2.
Check out the Paper, Project Page and Model Weights.
The post NVIDIA and the University of Maryland Researchers Released Audio Flamingo Next (AF-Next): A Super Powerful and Open Large Audio-Language Model appeared first on MarkTechPost.
