JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines
JetBrains has released Mellum2 and open-sourced its weights under the Apache 2.0 license. The first version of Mellum was a completion-focused 4B dense model. Mellum2 is its successor: a general-purpose model specialized in software engineering. It covers code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance.
The JetBrains team positions Mellum2 as a “focal model”: a fast, specialized component inside larger AI systems, not a standalone replacement for frontier models.
Architecture
Mellum2 uses a Mixture-of-Experts (MoE) architecture with 12B total parameters and 2.5B active parameters per token. In MoE models, only a subset of parameters runs on each token. Here, the model has 64 experts and activates 8 per token. This keeps per-token compute comparable to that of a 2.5B dense model, while the total parameter count provides greater capacity for specialization.
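To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. Only the counts (64 experts, 8 active) come from the article. The random router scores, the softmax over the selected experts, and the name `route_token` are illustrative assumptions; the source does not describe how Mellum2's router is actually implemented.

```python
import math
import random

NUM_EXPERTS = 64   # total experts per MoE layer (from the article)
TOP_K = 8          # experts activated per token (from the article)


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and softmax-normalize their weights.

    Generic top-k gating, assumed for illustration; not Mellum2's documented router.
    """
    # Indices of the top_k largest logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the chosen logits only (subtract the max for numerical stability).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
selected = route_token(logits)

print(len(selected), "of", NUM_EXPERTS, "experts run for this token")
print("fraction of experts active:", TOP_K / NUM_EXPERTS)
```

Note that 8 of 64 experts is 12.5%, while 2.5B of 12B parameters is roughly 21%. The gap is presumably explained by weights that every token uses regardless of routing, such as attention and embeddings.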
Key architectural details:
- Layers: 28
- Hidden dimension: 2304
- MoE experts: 64 total, 8 activated per token
- Attention: Grouped-Query Attention (GQA) with 32 query heads and 4 KV heads
- Sliding Window Attention (SWA): Applied to three of every four layers, with a window size of 1,024. Full attention runs on the remaining layer.
- Context length: 131,072 tokens
- Multi-Token Prediction (MTP) head: Serves as an auxiliary pre-training objective and as a built-in draft model for speculative decoding
- Precision: bfloat16
- Vocabulary size: 98,304
The model handles natural language and code. It is not multimodal: there is no image or video input.
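The attention layout listed above has a direct effect on KV-cache size at long context. The arithmetic below is a rough sketch under two assumptions that the article does not state: that the three-in-four pattern splits the 28 layers evenly into 21 sliding-window and 7 full-attention layers, and that a sliding-window layer only needs to cache the last 1,024 positions.

```python
LAYERS = 28            # from the article
QUERY_HEADS = 32       # from the article
KV_HEADS = 4           # from the article
WINDOW = 1024          # sliding-window size, from the article
CONTEXT = 131_072      # maximum context length, from the article

# Assumption: the 3-in-4 pattern repeats evenly across all 28 layers.
swa_layers = LAYERS * 3 // 4
full_layers = LAYERS - swa_layers


def cached_positions(context_len):
    """Token positions held in the KV cache, summed over layers.

    Assumes a sliding-window layer caches at most WINDOW positions.
    """
    per_swa_layer = min(context_len, WINDOW)
    return full_layers * context_len + swa_layers * per_swa_layer


all_full = LAYERS * CONTEXT        # baseline: every layer uses full attention
mixed = cached_positions(CONTEXT)  # the SWA/full mix described above

print("GQA: K/V heads reduced by a factor of", QUERY_HEADS // KV_HEADS)
print("SWA layers:", swa_layers, "| full-attention layers:", full_layers)
print("cached positions (all-full vs. mixed):", all_full, mixed)
print("mixed / all-full:", round(mixed / all_full, 3))
```

Under these assumptions, the mixed layout caches about a quarter of the positions an all-full-attention stack would at the maximum context, on top of the 8x reduction in K/V heads from GQA.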
Pre-Training
Pre-training spans roughly 10.6 trillion tokens in a three-phase curriculum. The data mixture progressively shifts from diverse web content toward curated code and mathematical content across the three phases.
Training used the Muon optimizer under FP8 hybrid precision, with a Warmup-Hold-Decay learning rate schedule that decays linearly to zero.
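A Warmup-Hold-Decay schedule with linear decay to zero can be sketched in a few lines. The function below is a generic illustration of that schedule shape; the step counts and peak value are hypothetical placeholders, since the article does not give Mellum2's actual hyperparameters.

```python
def warmup_hold_decay_lr(step, peak_lr, warmup_steps, hold_steps, total_steps):
    """Learning rate at `step` for a warmup-hold-decay schedule.

    Linear warmup from 0 to peak_lr, a constant hold at peak_lr,
    then linear decay that reaches zero at total_steps.
    """
    decay_steps = total_steps - warmup_steps - hold_steps
    if decay_steps <= 0:
        raise ValueError("total_steps must exceed warmup_steps + hold_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + hold_steps:
        return peak_lr
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


# Hypothetical settings, chosen only to make the three phases visible.
for s in (0, 5, 10, 59, 80, 100):
    lr = warmup_hold_decay_lr(s, peak_lr=1.0, warmup_steps=10,
                              hold_steps=50, total_steps=100)
    print(s, lr)
```

With these placeholder settings, the rate climbs during the first 10 steps, stays flat for 50, and falls linearly to zero over the final 40.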
After pre-training, the base model’s context window was extended to 128K (131,072) tokens using a layer-selective YaRN technique before post-training began.
The Model Family
The JetBrains team released six checkpoints covering the full training pipeline:
| Checkpoint | Description |
|---|---|
| Mellum2-12B-A2.5B-Base-Pretrain | Base checkpoint before long-context extension |
| Mellum2-12B-A2.5B-Base | Final base model after context extension |
| Mellum2-12B-A2.5B-Instruct-SFT | Supervised fine-tuned instruction checkpoint |
| Mellum2-12B-A2.5B-Thinking-SFT | Supervised fine-tuned thinking checkpoint |
| Mellum2-12B-A2.5B-Instruct | RL-tuned instruction model |
| Mellum2-12B-A2.5B-Thinking | RL-tuned thinking model |
Post-training follows two phases: supervised fine-tuning (SFT), then reinforcement learning with verifiable rewards (RLVR) on math, executable coding, tool use, instruction following, reasoning, and knowledge tasks.
The Instruct variant answers directly, without an externalized chain of thought. Use it for low-latency tasks: direct answers, tool use, and instruction following.
The Thinking variant emits an explicit reasoning trace before its final answer. Use it for complex debugging, multi-step planning, or agentic flows where step-by-step reasoning matters.
Benchmark Results
All numbers below are self-reported by JetBrains. The comparison set consists of open-weight models in the 4B–14B range.
Coding:
| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) | Seed-Coder (8B) |
|---|---|---|---|---|---|---|
| LiveCodeBench v6 | 37.2 | 51.0 | 63.7 | 42.4 | 28.2 | 28.1 |
| EvalPlus | 78.4 | 69.4 | 71.8 | 74.1 | 67.3 | 73.8 |
| MultiPL-E | 67.1 | 51.0 | 67.1 | 71.5 | 36.1 | 77.0 |
Tool Use:
| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) |
|---|---|---|---|---|---|
| BFCL v3 | 66.3 | 64.1 | 70.5 | 52.7 | 41.9 |
| BFCL v4 | 44.2 | 52.0 | 60.6 | 38.8 | 19.8 |
Math:
| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) |
|---|---|---|---|---|---|
| AIME 2025+2026 | 41.7 | 38.3 | 58.3 | 33.3 | 40.0 |
| GSM-Plus | 80.5 | 85.2 | 87.9 | 86.6 | 85.8 |
Knowledge and Conversational:
| Benchmark | Mellum2 Instruct | Qwen3.5 (4B) | Qwen3.5 (9B) | Ministral 3 (14B) | OLMo-3 (7B) |
|---|---|---|---|---|---|
| MMLU-Redux | 78.1 | 87.5 | 91.1 | 85.9 | 71.8 |
| GPQA Diamond | 40.9 | 76.8 | 79.8 | 58.6 | 40.9 |
| IFEval | 75.8 | 82.1 | 83.9 | 67.3 | 83.2 |
| MixEval | 62.2 | 65.9 | 71.1 | 71.2 | 59.4 |
Benchmark notes:
- EvalPlus is the mean of HumanEval+ and MBPP+
- AIME is the mean of AIME 2025 and AIME 2026 (30 questions each)
- BFCL v4 is the macro-average of five subtasks: v1, v2, v3, web search, memory
- Seed-Coder (8B) does not support native tool calling, so BFCL scores are not listed for it

Use Cases
JetBrains identifies four production scenarios where Mellum2’s latency and efficiency profile is relevant:
- Routing and orchestration: In a multi-model system, a router analyzes incoming prompts and selects the appropriate model or tool for each task. Mellum2’s low per-token compute makes it suitable for this high-frequency classification step.
- Low-latency RAG pipelines: Retrieval-Augmented Generation (RAG) systems retrieve relevant context, summarize it, and generate a response. Mellum2 handles retrieval summarization at lower latency than larger dense models.
- Sub-agents in complex workflows: Agent pipelines break tasks into steps such as context gathering, planning, validation, and execution. Mellum2 can handle repetitive or latency-sensitive steps instead of routing every step through a single large frontier model.
- Private and local deployment: The Apache 2.0 license permits self-hosting. Engineers can run Mellum2 on their own infrastructure, keeping code and data under their control.
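The routing scenario in the list above can be sketched as a small classification call. The example builds an OpenAI-style chat payload of the kind accepted by an OpenAI-compatible server (such as the one vLLM exposes) and maps the model's reply to a route. The route labels, the system prompt, and both function names are hypothetical, and whether the model's chat template accepts a system message is an assumption; nothing here is taken from JetBrains' own routing setup.

```python
import json

ROUTES = ("code_edit", "web_search", "frontier_model")  # hypothetical route labels


def build_router_request(user_prompt, routes=ROUTES,
                         model="JetBrains/Mellum2-12B-A2.5B-Instruct"):
    """Build an OpenAI-style chat payload asking the model to pick one route."""
    system = ("You are a router. Reply with exactly one label from: "
              + ", ".join(routes) + ".")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": 8,      # a label needs only a few tokens
        "temperature": 0.0,   # keep the choice as deterministic as possible
    }


def parse_route(reply_text, routes=ROUTES, fallback="frontier_model"):
    """Map the model's reply to a known route, falling back when it is unclear."""
    cleaned = reply_text.strip().lower()
    for route in routes:
        if route in cleaned:
            return route
    return fallback


payload = build_router_request("Rename this variable across the file.")
print(json.dumps(payload, indent=2))
print(parse_route(" code_edit\n"))
```

Capping `max_tokens` keeps the routing step short, which matters when it runs on every incoming prompt. The fallback sends anything ambiguous to the larger model rather than guessing.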
Strengths and Limitations
Strengths:
- MoE design activates only 2.5B of 12B parameters per token, so per-token compute is comparable to that of a 2.5B dense model
- MTP head enables speculative decoding without a separate draft model
- 131,072-token context window
- Full checkpoint set released: base pretrain, base, SFT, and RL-tuned variants for both Instruct and Thinking
- Apache 2.0 license permits commercial use, self-hosting, and fine-tuning
- Strong EvalPlus (78.4) and BFCL v3 (66.3) scores relative to the 4B–14B comparison models
- vLLM support, including optional tool calling via --tool-call-parser hermes
Limitations:
- Text and code only: no image or other multimodal input
- LiveCodeBench v6 (37.2) trails Qwen3.5 4B (51.0), Qwen3.5 9B (63.7), and Ministral 3 14B (42.4)
- GPQA Diamond (40.9) and MMLU-Redux (78.1) are below most models in the comparison set
- GSM-Plus (80.5) is below all comparison models listed
- Not designed for frontier-level tasks: JetBrains explicitly positions Mellum2 as a component model
Getting Started
Serve Mellum2 with vLLM:

    pip install vllm
    vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct --max-model-len 131072

With tool calling enabled:

    vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \
      --max-model-len 131072 \
      --enable-auto-tool-choice \
      --tool-call-parser hermes

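Once a server is running with tool calling enabled, a client can pass tool definitions through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library. The `get_weather` tool and the helper names are hypothetical, and the base URL assumes vLLM's default port of 8000 on the local machine.

```python
import json
import urllib.request

# Hypothetical tool definition in the OpenAI function-calling format.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(question):
    """Chat payload that lets the model decide whether to call the tool."""
    return {
        "model": "JetBrains/Mellum2-12B-A2.5B-Instruct",
        "messages": [{"role": "user", "content": question}],
        "tools": [GET_WEATHER_TOOL],
        "tool_choice": "auto",
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to the server's /v1/chat/completions endpoint."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_tool_calls(response):
    """Return (name, arguments dict) pairs from a chat completion response."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"]))
            for c in calls]
```

With the server up, `extract_tool_calls(post_chat(build_tool_request("What is the weather in Prague?")))` should return whatever tool calls the model chose to make, which may be an empty list if it answers directly.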
Using the Hugging Face Transformers library:

    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "JetBrains/Mellum2-12B-A2.5B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    messages = [{"role": "user", "content": "Write a Python function to reverse a string."}]

    # Render the chat template and tokenize in one step.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=512)

    # Decode only the newly generated tokens, skipping the prompt.
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

The post JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines appeared first on MarkTechPost.
