VibeThinker-3B: A 3B Dense Reasoning Model Built on Qwen2.5-Coder-3B With the Spectrum-to-Signal Post-Training Pipeline
While recent breakthroughs in AI reasoning have largely been driven by sheer scale, with billions of parameters poured in to cross complex cognitive thresholds, VibeThinker-3B is charting a very different path.
Created by researchers from Sina Weibo Inc (China), this 3-billion-parameter model shows that efficiency can punch far above its weight class. Released under an open-source MIT license, VibeThinker-3B matches the performance of models hundreds of times its size on verifiable tasks like mathematics, coding, and STEM disciplines.
What is VibeThinker-3B
VibeThinker-3B is a compact dense model built on the Qwen2.5-Coder-3B base. It is post-trained, not pretrained from scratch. The research team applies supervised fine-tuning, reinforcement learning, and self-distillation on top.
The training framework continues the Spectrum-to-Signal Principle (SSP) from the earlier VibeThinker-1.5B. SFT (Supervised Fine-Tuning) builds a broad space of valid reasoning paths, the ‘Spectrum.’ RL then amplifies the correct paths, the ‘Signal.’
The model targets one job: reasoning where a verifier can check the answer. The research team recommends larger general models for open-domain knowledge tasks. VibeThinker-3B is a specialist by design.
It runs on standard stacks. The model weights require transformers>=4.54.0. For faster inference, the team recommends vLLM==0.10.1 or SGLang>=0.4.9.post6. The BF16 weights are roughly 6 GB, small enough for a single GPU.
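The roughly 6 GB figure follows from simple arithmetic, since BF16 stores each parameter in 2 bytes. The sketch below assumes a round 3 billion parameters (the exact count is not stated here) and counts only the raw weights, not activations or KV cache. The function name is illustrative.

```python
def weight_footprint_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

# Assumption: exactly 3e9 parameters; BF16 uses 2 bytes per parameter.
bf16_gb = weight_footprint_gb(3e9)
print(f"BF16 weights: ~{bf16_gb:.1f} GB")
```

Actual memory use at inference time will be higher, because long reasoning traces add KV-cache overhead on top of the weights.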

Benchmark
On AIME26, VibeThinker-3B scores 94.3. According to the research paper, that is comparable to DeepSeek V3.2 (671B) and Kimi K2.5 (1T).
On LiveCodeBench v6, it reaches 80.2 Pass@1. On OJBench, another code benchmark, it scores 38.6, below the largest models. On HMMT25 it scores 89.3, and on BruMO25 it reaches 93.8. On IMO-AnswerBench, a 400-problem IMO-level set, it scores 76.4.
The table below compares it against much larger reasoning models. The ‘+CLR’ row uses test-time scaling; CLR stands for Claim-Level Reliability Assessment.
| Model | Params | AIME26 | HMMT25 | IMO-Ans | LCBv6 | GPQA-D |
|---|---|---|---|---|---|---|
| VibeThinker-3B | 3B | 94.3 | 89.3 | 76.4 | 80.2 | 70.2 |
| VibeThinker-3B +CLR | 3B | 97.1 | 95.4 | 80.6 | — | 72.9 |
| GPT-OSS (high) | 120B | 93.2 | 90.0 | 75.6 | 81.9 | 80.1 |
| DeepSeek V3.2 | 671B | 94.2 | 90.2 | 78.3 | 80.8 | 82.4 |
| GLM-5 | 744B | 95.8 | 97.9 | 82.5 | 85.5 | 86.0 |
| Kimi K2.5 | 1T | 93.3 | 95.4 | 81.8 | 85.0 | 87.6 |
The pattern is consistent. On verifiable math and code, the 3B model sits near the top cluster. On GPQA-Diamond, a knowledge-heavy benchmark, the gap to large models remains visible.
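To make that pattern concrete, the short sketch below recomputes size ratios and score gaps from the table. The numbers are copied from the table rows; treating Kimi K2.5's "1T" as 1000B is an assumption, and the helper name is illustrative.

```python
# Scores copied from the comparison table: (params in billions, AIME26, GPQA-D).
# Assumption: "1T" is treated as 1000B for the ratio.
rows = {
    "VibeThinker-3B": (3, 94.3, 70.2),
    "GPT-OSS": (120, 93.2, 80.1),
    "DeepSeek V3.2": (671, 94.2, 82.4),
    "GLM-5": (744, 95.8, 86.0),
    "Kimi K2.5": (1000, 93.3, 87.6),
}

def gaps_vs_small(rows, small="VibeThinker-3B"):
    """Return {model: (size_ratio, aime_gap, gpqa_gap)} relative to the small model."""
    p0, aime0, gpqa0 = rows[small]
    return {
        name: (round(p / p0, 1), round(aime - aime0, 1), round(gpqa - gpqa0, 1))
        for name, (p, aime, gpqa) in rows.items()
        if name != small
    }

for name, (ratio, aime_gap, gpqa_gap) in gaps_vs_small(rows).items():
    print(f"{name:15s} {ratio:6.1f}x params  AIME26 {aime_gap:+.1f}  GPQA-D {gpqa_gap:+.1f}")
```

On these figures the AIME26 differences stay within 1.5 points in either direction, while every larger model leads on GPQA-D by roughly 10 to 17 points.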
The research team also ran an out-of-distribution coding test. It used recent LeetCode weekly and biweekly contests, from Apr 25 to May 31, 2026. The model passed 123 of 128 first-attempt Python submissions. That is a 96.1% acceptance rate on unseen problems.
Inside the Spectrum-to-Signal Pipeline
The post-training pipeline runs in four stages. Each one targets a different weakness of small reasoning models.
First comes curriculum-based two-stage SFT. Stage 1 covers math, code, STEM, dialogue, and instruction following broadly. Stage 2 shifts to harder, longer-horizon samples filtered by reasoning length and difficulty. Diversity-Exploring Distillation preserves multiple valid solution paths through both stages.
Second comes multi-domain Reasoning RL. The research team reuses MaxEnt-Guided Policy Optimization (MGPO). MGPO weights prompts near the model’s current capability boundary, where correct and incorrect rollouts coexist. Training runs sequentially across Math, Code, and STEM.
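The idea of weighting prompts near the capability boundary can be sketched as follows. The article does not give MGPO's formula, so the form used here is an assumption: the weight is the exponential of the negative KL divergence between the observed pass rate and a 50% pass rate, which peaks when correct and incorrect rollouts are evenly mixed. The function names and the `lam` sharpness parameter are hypothetical.

```python
import math

def bernoulli_kl(p: float, p0: float = 0.5) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(p0), with 0*log(0) taken as 0."""
    total = 0.0
    for a, b in ((p, p0), (1.0 - p, 1.0 - p0)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total

def boundary_weight(num_correct: int, num_rollouts: int, lam: float = 1.0) -> float:
    """Assumed weighting: exp(-lam * KL(pass_rate || 0.5)); largest at a 50% pass rate."""
    p = num_correct / num_rollouts
    return math.exp(-lam * bernoulli_kl(p))

# Weight for a prompt as a function of how many of 8 rollouts were correct.
for correct in (0, 2, 4, 6, 8):
    print(correct, round(boundary_weight(correct, 8), 3))
```

Under this assumed form, prompts the model always solves or always fails get the lowest weight, and raising `lam` concentrates training even more on the mixed-outcome prompts.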
A notable detail: VibeThinker-3B drops progressive context expansion. The research team found that a high-truncation warm-up hurt long reasoning at this scale. So RL uses a single 64K long-context window throughout.
Math RL adds a Long2Short stage. It redistributes reward among correct trajectories by length. Shorter correct answers get higher reward, longer ones lower, with the group mean unchanged. The goal is fewer redundant tokens without losing accuracy.
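One way to picture a mean-preserving length redistribution is the sketch below. The exact Long2Short rule is not given here, so the linear shift, the `alpha` strength, and the function name are all assumptions chosen only to satisfy the stated properties: shorter correct rollouts gain, longer ones lose, and the average reward over correct rollouts stays the same.

```python
from statistics import mean

def long2short_rewards(lengths, correct, alpha=0.5):
    """Shift reward among correct rollouts by length while keeping their mean at 1.0.

    Assumed rule: reward = 1 + alpha * (mean_len - len) / max_abs_deviation.
    Incorrect rollouts keep a reward of 0.
    """
    ok_lengths = [n for n, c in zip(lengths, correct) if c]
    if not ok_lengths:
        return [0.0] * len(lengths)
    avg = mean(ok_lengths)
    spread = max(abs(n - avg) for n in ok_lengths) or 1.0  # avoid dividing by zero
    return [1.0 + alpha * (avg - n) / spread if c else 0.0
            for n, c in zip(lengths, correct)]

# Three correct rollouts of different lengths and one incorrect rollout.
rewards = long2short_rewards([1000, 2000, 3000, 4000], [True, True, True, False])
print(rewards)
```

Because the deviations from the mean length sum to zero, the shifts cancel out across the correct group, so the stage changes which correct answers are preferred without changing how much correctness is rewarded overall.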
Third, Offline Self-Distillation merges the RL checkpoints back into one student model. Fourth, Instruct RL improves instruction adherence. That stage explains the 93.4 IFEval and 74.5 IFBench scores. Both show that reasoning tuning did not break controllability.
CLR: Scaling at Test Time, Not Parameter Count
Claim-Level Reliability Assessment (CLR) is the report’s test-time scaling technique. It runs on answer-verifiable tasks and adds no parameters.
The procedure has two steps. The model first generates K = 32 trajectories per problem. From each, it extracts M = 5 decision-relevant claims plus a final answer.
The model then acts as its own verifier. It validates or falsifies each claim, producing binary verdicts. CLR maps these into a nonlinear trajectory reliability score, where one weak claim sharply lowers the weight.
Answers are clustered by equivalence, and the highest reliability-weighted answer wins. The full flow runs 8 times, and the averaged Pass@1 is reported. CLR lifts AIME26 to 97.1 and BruMO25 to 99.2.
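The aggregation step can be illustrated with a small sketch. The scoring function is not specified here, so the multiplicative penalty per failed claim is an assumption that merely reproduces the "one weak claim sharply lowers the weight" behavior. Whitespace stripping stands in for real answer-equivalence checking, and the sample uses 5 trajectories for brevity where the procedure above uses 32.

```python
from collections import defaultdict

def reliability(verdicts, penalty=0.1):
    """Assumed nonlinear score: each failed claim multiplies the weight by `penalty`."""
    score = 1.0
    for passed in verdicts:
        if not passed:
            score *= penalty
    return score

def clr_vote(trajectories, normalize=str.strip):
    """Pick the answer cluster with the largest summed reliability.

    `trajectories` is a list of (final_answer, claim_verdicts) pairs.
    `normalize` is a stand-in for real answer-equivalence checking.
    """
    totals = defaultdict(float)
    for answer, verdicts in trajectories:
        totals[normalize(answer)] += reliability(verdicts)
    return max(totals, key=totals.get)

# Three trajectories agree on "17" but each has one falsified claim;
# two trajectories agree on "42" with every claim validated.
sample = [
    ("42", [True] * 5),
    ("42 ", [True] * 5),
    ("17", [True, True, True, True, False]),
    ("17", [True, True, True, True, False]),
    ("17", [True, True, True, True, False]),
]
print(clr_vote(sample))
```

In this sample a plain majority vote would pick "17", while the reliability-weighted vote favors "42", which is the kind of correction the claim-level check is meant to provide.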
The interactive demo below lets you flip claims and watch the score collapse. It also lets you switch benchmarks and compare against larger models.
