JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines

JetBrains has released Mellum2, open-sourcing the weights under the Apache 2.0 license. The first version of Mellum was a completion-focused 4B dense model. Mellum2 is its successor: a general-purpose model specialized in software engineering. It covers code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance.

The JetBrains team positions Mellum2 as a “focal model”: a fast, specialized component inside larger AI systems, not a standalone replacement for frontier models.

Architecture

Mellum2 uses a Mixture-of-Experts (MoE) architecture with 12B total parameters and 2.5B active parameters per token. In MoE models, only a subset of parameters runs on each token. Here, the model has 64 experts and activates 8 per token. This keeps per-token compute equivalent to a 2.5B dense model, while the total parameter count provides greater capacity for specialization.
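The "8 of 64 experts per token" selection can be illustrated with a minimal top-k routing sketch. Only the expert counts come from the article; the random router scores, the softmax gating, and the renormalization over the selected experts are assumptions made for illustration, not a description of Mellum2's actual router.

```python
import math
import random

NUM_EXPERTS = 64   # total experts, from the article
TOP_K = 8          # experts activated per token, from the article


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and renormalize their weights.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    Renormalizing over the chosen experts is an assumption of this sketch.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


random.seed(0)
# Stand-in router scores for a single token (hypothetical values).
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits)

print(len(assignment), "of", NUM_EXPERTS, "experts run for this token")
print(f"fraction of experts active: {TOP_K / NUM_EXPERTS:.3f}")
```

Only the selected experts' feed-forward blocks run for that token, which is why per-token compute tracks the active parameter count rather than the total.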

Key architectural details:

Layers: 28
Hidden size: 2304
MoE experts: 64 total, 8 activated per token
Attention: Grouped-Query Attention (GQA) with 32 query heads and 4 KV heads
Sliding Window Attention (SWA): Applied to three of every four layers, with a window size of 1,024. Full attention runs on the remaining layer.
Context length: 131,072 tokens
Multi-Token Prediction (MTP) head: Serves as an auxiliary pre-training objective and as a built-in draft model for speculative decoding
Precision: bfloat16
Vocabulary size: 98,304
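The attention layout in the list above can be made concrete with a small sketch. The layer, head, and window counts come from the article. Which layer in each group of four uses full attention, and whether the window includes the current token, are not stated in the source, so both are labeled assumptions here.

```python
NUM_LAYERS = 28     # from the article
NUM_Q_HEADS = 32    # from the article
NUM_KV_HEADS = 4    # from the article
SWA_WINDOW = 1024   # from the article


def layer_uses_full_attention(layer_index):
    """Assumption: the last layer in each group of four is the full-attention one."""
    return layer_index % 4 == 3


def kv_head_for(query_head):
    """GQA: consecutive query heads share one KV head (32 / 4 = 8 per group)."""
    group_size = NUM_Q_HEADS // NUM_KV_HEADS
    return query_head // group_size


def visible_positions(position, full_attention, window=SWA_WINDOW):
    """Inclusive (start, end) range of key positions a query may attend to.

    Causal masking applies in both cases. The window is assumed to
    include the current token.
    """
    if full_attention:
        return (0, position)
    return (max(0, position - window + 1), position)


swa_layers = sum(1 for i in range(NUM_LAYERS) if not layer_uses_full_attention(i))
print("sliding-window layers:", swa_layers, "| full-attention layers:", NUM_LAYERS - swa_layers)
print("query head 31 reads KV head", kv_head_for(31))
print("SWA range at position 5000:", visible_positions(5000, full_attention=False))
```

With 28 layers, three of every four works out to 21 sliding-window layers and 7 full-attention layers, so only a quarter of the layers need to keep keys and values for the entire context.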

The model handles natural language and code. It is not multimodal: there is no image or video input.

Pre-Training

Pre-training spans roughly 10.6 trillion tokens through a three-phase curriculum. The data mixture progressively shifts from diverse web content toward curated code and mathematical content across the three phases.

Training used the Muon optimizer under FP8 hybrid precision, with a Warmup-Hold-Decay learning rate schedule that decays linearly to zero.
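A Warmup-Hold-Decay schedule with linear decay to zero can be sketched as a simple piecewise function. The step counts and peak learning rate below are hypothetical placeholders, since the article does not give the actual values, and the linear warmup shape is also an assumption.

```python
def whd_learning_rate(step, total_steps, warmup_steps, decay_steps, peak_lr):
    """Warmup-Hold-Decay: linear warmup, constant hold, linear decay to zero.

    The rate reaches peak_lr at step == warmup_steps and is exactly 0.0
    at step == total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # warmup phase
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr                                # hold phase
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / decay_steps          # decay phase


# Hypothetical settings, chosen only to show the shape of the curve.
TOTAL, WARMUP, DECAY, PEAK = 1000, 100, 200, 3e-4

for step in (0, 50, 100, 500, 800, 900, 1000):
    print(step, whd_learning_rate(step, TOTAL, WARMUP, DECAY, PEAK))
```

The hold phase keeps the rate at its peak for most of training, and the final decay phase brings it down to zero by the last step.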

After pre-training, the base model’s context window was extended to 128K tokens using a layer-selective YaRN method before post-training began.

The Model Family

The JetBrains team released six checkpoints covering the full training pipeline:

Checkpoint	Description
Mellum2-12B-A2.5B-Base-Pretrain	Base checkpoint before long-context extension
Mellum2-12B-A2.5B-Base	Final base model after context extension
Mellum2-12B-A2.5B-Instruct-SFT	Supervised fine-tuned instruction checkpoint
Mellum2-12B-A2.5B-Thinking-SFT	Supervised fine-tuned thinking checkpoint
Mellum2-12B-A2.5B-Instruct	RL-tuned instruction model
Mellum2-12B-A2.5B-Thinking	RL-tuned thinking model

Post-training follows two stages: supervised fine-tuning (SFT), then reinforcement learning with verifiable rewards (RLVR) on math, executable coding, tool use, instruction following, reasoning, and knowledge tasks.
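To illustrate what a "verifiable reward" means for an executable coding task, here is a minimal sketch: the reward is 1.0 only if the candidate code passes every test case. The function name, test-case format, and binary scoring are assumptions for illustration. The article does not describe JetBrains' actual reward functions, and a real pipeline would run candidates in a sandbox rather than with a bare `exec`.

```python
def code_reward(candidate_source, func_name, test_cases):
    """Return 1.0 if the candidate defines func_name and passes all tests, else 0.0.

    test_cases is a list of (args_tuple, expected_result) pairs.
    Any exception (syntax error, missing function, runtime error) scores 0.0.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)   # illustration only: not sandboxed
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        return 0.0
    return 1.0


# Hypothetical task: reverse a string.
tests = [(("abc",), "cba"), (("",), "")]
good = "def reverse(s):\n    return s[::-1]\n"
bad = "def reverse(s):\n    return s\n"

print(code_reward(good, "reverse", tests))
print(code_reward(bad, "reverse", tests))
```

Because the score comes from executing tests rather than from a learned judge, it can be checked mechanically, which is the defining property of this reward style.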

The Instruct variant answers directly, without an externalized chain of thought. Use it for low-latency tasks: direct answers, tool use, and instruction following.

The Thinking variant emits an explicit reasoning trace before its final answer. Use it for complex debugging, multi-step planning, or agentic flows where step-by-step reasoning matters.

Benchmark Results

All numbers below are self-reported by JetBrains. The comparison set is open-weight models in the 4B–14B range.

Coding:

Benchmark	Mellum2 Instruct	Qwen3.5 (4B)	Qwen3.5 (9B)	Ministral 3 (14B)	OLMo-3 (7B)	Seed-Coder (8B)
LiveCodeBench v6	37.2	51.0	63.7	42.4	28.2	28.1
EvalPlus	78.4	69.4	71.8	74.1	67.3	73.8
MultiPL-E	67.1	51.0	67.1	71.5	36.1	77.0

Tool Use:

Benchmark	Mellum2 Instruct	Qwen3.5 (4B)	Qwen3.5 (9B)	Ministral 3 (14B)	OLMo-3 (7B)
BFCL v3	66.3	64.1	70.5	52.7	41.9
BFCL v4	44.2	52.0	60.6	38.8	19.8

Math:

Benchmark	Mellum2 Instruct	Qwen3.5 (4B)	Qwen3.5 (9B)	Ministral 3 (14B)	OLMo-3 (7B)
AIME 2025+2026	41.7	38.3	58.3	33.3	40.0
GSM-Plus	80.5	85.2	87.9	86.6	85.8

Knowledge and Conversational:

Benchmark	Mellum2 Instruct	Qwen3.5 (4B)	Qwen3.5 (9B)	Ministral 3 (14B)	OLMo-3 (7B)
MMLU-Redux	78.1	87.5	91.1	85.9	71.8
GPQA Diamond	40.9	76.8	79.8	58.6	40.9
IFEval	75.8	82.1	83.9	67.3	83.2
MixEval	62.2	65.9	71.1	71.2	59.4

Benchmark notes:

EvalPlus is the mean of HumanEval+ and MBPP+
AIME is the mean of AIME 2025 and AIME 2026 (30 questions each)
BFCL v4 is the macro-average of 5 subtasks: v1, v2, v3, web search, memory
Seed-Coder (8B) does not support native tool calling; BFCL scores are not listed for it
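To read the tables above at a glance, the following sketch computes where Mellum2 Instruct ranks within the comparison set for a few benchmarks. The scores are copied from the tables; the ranking helper and its tie handling (ties share the better rank) are choices made for this illustration.

```python
# Scores copied from the self-reported tables above (higher is better).
SCORES = {
    "EvalPlus": {
        "Mellum2 Instruct": 78.4, "Qwen3.5 (4B)": 69.4, "Qwen3.5 (9B)": 71.8,
        "Ministral 3 (14B)": 74.1, "OLMo-3 (7B)": 67.3, "Seed-Coder (8B)": 73.8,
    },
    "BFCL v3": {
        "Mellum2 Instruct": 66.3, "Qwen3.5 (4B)": 64.1, "Qwen3.5 (9B)": 70.5,
        "Ministral 3 (14B)": 52.7, "OLMo-3 (7B)": 41.9,
    },
    "GSM-Plus": {
        "Mellum2 Instruct": 80.5, "Qwen3.5 (4B)": 85.2, "Qwen3.5 (9B)": 87.9,
        "Ministral 3 (14B)": 86.6, "OLMo-3 (7B)": 85.8,
    },
}


def rank_of(model, row):
    """1-based rank of model within row; tied models share the better rank."""
    return 1 + sum(1 for score in row.values() if score > row[model])


for benchmark, row in SCORES.items():
    rank = rank_of("Mellum2 Instruct", row)
    print(f"{benchmark}: Mellum2 Instruct ranks {rank} of {len(row)}")
```

On these three rows the model places first on EvalPlus, second on BFCL v3, and last on GSM-Plus, matching the mixed picture the tables describe.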

https://blog.jetbrains.com/ai/2026/06/mellum2-goes-open-source-a-fast-model-for-ai-workflows/

Use Cases

JetBrains identifies four production scenarios where Mellum2’s latency and efficiency profile is relevant:

Routing and orchestration: In a multi-model system, a router analyzes incoming prompts and selects the appropriate model or tool for each task. Mellum2’s low per-token compute makes it suitable for this high-frequency classification step.
Low-latency RAG pipelines: Retrieval-Augmented Generation (RAG) systems retrieve relevant context, summarize it, and generate a response. Mellum2 handles retrieval summarization at lower latency than larger dense models.
Sub-agents in complex workflows: Agent pipelines break tasks into steps: context gathering, planning, validation, and execution. Mellum2 can handle repetitive or latency-sensitive steps instead of routing every step through a single large frontier model.
Private and local deployment: The Apache 2.0 license permits self-hosting without restrictions. Engineers can run Mellum2 on their own infrastructure, keeping code and data under their control.
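The routing scenario above can be sketched as two small helpers: one builds the classification prompt sent to the small model, the other maps its reply to a route. The route labels, prompt wording, and fallback policy are all hypothetical. The article does not prescribe a routing format.

```python
# Hypothetical route labels for a multi-model system.
ROUTES = ("code_edit", "retrieval_summary", "frontier_escalation")


def build_router_messages(user_prompt, routes=ROUTES):
    """Build chat messages asking the router model for exactly one label."""
    system = (
        "You are a router. Reply with exactly one label from this list "
        "and nothing else: " + ", ".join(routes)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]


def parse_route(reply, routes=ROUTES, fallback="frontier_escalation"):
    """Map the model's reply to a known route; escalate when it is unclear."""
    cleaned = reply.strip().lower()
    for route in routes:
        if route in cleaned:
            return route
    return fallback


messages = build_router_messages("Rename the variable `tmp` to `buffer` in utils.py")
print(messages[0]["content"])
print(parse_route(" Code_Edit\n"))
print(parse_route("I am not sure"))
```

Falling back to the larger model when the reply is unrecognized keeps a misrouted request from silently landing on a component that cannot handle it.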

Strengths and Limitations

Strengths:

MoE design activates only 2.5B of 12B parameters per token, keeping per-token compute equivalent to a 2.5B dense model
MTP head enables speculative decoding without a separate draft model
131,072 token context window
Full checkpoint set released: base pretrain, base, SFT, and RL-tuned variants for both Instruct and Thinking
Apache 2.0 license: permits commercial use, self-hosting, and fine-tuning
Strong EvalPlus (78.4) and BFCL v3 (66.3) scores relative to 4B–14B comparisons
vLLM support, including optional tool calling via --tool-call-parser hermes
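The speculative decoding mentioned in the strengths list follows a draft-then-verify pattern, sketched below in its greedy form. The toy `draft_next` and `target_next` functions are stand-ins invented for this example. In Mellum2 the draft role is played by the MTP head, whose internals the article does not describe.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy speculative decoding step.

    The draft proposes k tokens. The target checks them in order, keeps the
    longest agreeing prefix, and contributes one token of its own: either a
    correction at the first mismatch or a bonus token if all k are accepted.
    """
    # Draft phase: cheap autoregressive proposals.
    proposal, context = [], list(prefix)
    for _ in range(k):
        token = draft_next(context)
        proposal.append(token)
        context.append(token)

    # Verify phase: a real system scores all positions in one batched pass.
    accepted, context = [], list(prefix)
    for token in proposal:
        expected = target_next(context)
        if expected != token:
            accepted.append(expected)   # target's correction ends the step
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # bonus token after full acceptance
    return accepted


# Toy models over digits: the target counts upward modulo 10,
# the draft agrees except right after a 4.
def target_next(context):
    return (context[-1] + 1) % 10


def draft_next(context):
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


print(speculative_step([1], draft_next, target_next, k=4))
```

The output always matches what the target alone would have produced greedily. The speedup comes from emitting several tokens per target pass whenever the draft guesses correctly.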

Limitations:

Text and code only: no image or multimodal input
LiveCodeBench v6 (37.2) trails Qwen3.5 4B (51.0), Qwen3.5 9B (63.7), and Ministral 3 14B (42.4)
GPQA Diamond (40.9) and MMLU-Redux (78.1) are below most models in the comparison set
GSM-Plus (80.5) is below all comparable models listed
Not designed for frontier-level tasks: JetBrains explicitly positions Mellum2 as a component model

Marktechpost’s Visual Explainer

Overview

JetBrains Open-Sources Mellum2

A 12B Mixture-of-Experts model released under Apache 2.0 on June 2, 2026. Trained from scratch on ~10.6 trillion tokens for software engineering tasks.

Total Params: 12B
Active / Token: 2.5B
License: Apache 2.0
Context: 131,072 tokens
Architecture: MoE
Pre-train Data: ~10.6T tokens

Architecture

How Mellum2 Is Built

MoE activates 8 of 64 experts per token, so per-token compute stays equivalent to a 2.5B dense model. An MTP head enables speculative decoding without a separate draft model.

Layers: 28
Hidden Size: 2304
Experts (total / active): 64 / 8
GQA Heads (Q / KV): 32 / 4
SWA Window: 1,024 (¾ of layers)
Vocabulary: 98,304
Precision: bfloat16
Modality: Text + Code

Pre-Training

Training Pipeline

Three-phase curriculum progressively shifts from diverse web data toward curated code and math. Context extended to 128K via layer-selective YaRN before post-training.

Data: ~10.6 trillion tokens across three curriculum phases
Optimizer: Muon under FP8 hybrid precision
LR Schedule: Warmup-Hold-Decay with linear decay to zero
Context Extension: Layer-selective YaRN to 128K tokens
Post-Training: SFT → RLVR on coding, math, tool use, reasoning, knowledge
Design Constraint: Inference efficiency on commodity GPUs validated by ablation

Model Family

Six Checkpoints Released

Full pipeline from base pretrain through RL-tuned variants. Use Instruct for direct low-latency answers. Use Thinking for explicit step-by-step reasoning traces.

Stage	Checkpoint	Description
BASE	Mellum2-12B-A2.5B-Base-Pretrain	Before context extension
BASE	Mellum2-12B-A2.5B-Base	After YaRN extension
SFT	Mellum2-12B-A2.5B-Instruct-SFT	Supervised instruction
SFT	Mellum2-12B-A2.5B-Thinking-SFT	Supervised thinking
RLVR	Mellum2-12B-A2.5B-Instruct	RL-tuned, no CoT
RLVR	Mellum2-12B-A2.5B-Thinking	RL-tuned, explicit CoT

Benchmarks

Evaluation Results (Instruct Variant)

All numbers self-reported by JetBrains. Comparison set: open-weight models in the 4B–14B range.

Benchmark	Mellum2	Qwen3.5 9B	Ministral 3 14B	OLMo-3 7B
LiveCodeBench v6	37.2	63.7	42.4	28.2
EvalPlus	78.4	71.8	74.1	67.3
MultiPL-E	67.1	67.1	71.5	36.1
BFCL v3	66.3	70.5	52.7	41.9
AIME 2025+2026	41.7	58.3	33.3	40.0
IFEval	75.8	83.9	67.3	83.2

Use Cases

Where Mellum2 Fits in Production

JetBrains positions Mellum2 as a “focal model” that handles high-frequency, latency-sensitive steps inside larger AI pipelines.

Routing & Orchestration: Analyze prompts and select the right model or tool per task
RAG Pipelines: Summarize retrieved context at low latency before response generation
Sub-Agents: Handle repetitive steps in agent pipelines (context gathering, validation, planning)
Private Deployment: Apache 2.0 permits full self-hosting with no external API calls required

Strengths & Limitations

What Works and What Doesn’t

Mellum2 is designed for efficiency in component roles, not frontier-level capability across all benchmarks.

✓ Strengths

2.5B active params, with the compute of a dense 2.5B model
MTP head enables built-in speculative decoding
131K token context window
Strong EvalPlus (78.4) and BFCL v3 (66.3)
Apache 2.0: commercial use, fine-tuning, self-hosting
vLLM support with tool calling

✗ Limitations

Text and code only: no multimodal input
LiveCodeBench v6 (37.2) below Qwen3.5 9B (63.7)
GPQA Diamond (40.9) below most comparisons
GSM-Plus (80.5) trails all models listed
Not a frontier replacement: component role only

Model weights: huggingface.co/JetBrains/mellum-2 · Technical report: arXiv:2605.31268

Getting Started

Serve Mellum2 with vLLM:

pip install vllm
vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct --max-model-len 131072

With tool calling enabled:

vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \
  --max-model-len 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Using the Hugging Face Transformers library:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the RL-tuned Instruct checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("JetBrains/Mellum2-12B-A2.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("JetBrains/Mellum2-12B-A2.5B-Instruct")

messages = [{"role": "user", "content": "Write a Python function to reverse a string."}]
# Render the chat template and move the resulting tensors to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly produced tokens (the prompt is sliced off).
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
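Once a `vllm serve` process is running, it exposes an OpenAI-compatible HTTP API. The sketch below builds a chat-completions request with an optional tool definition using only the standard library. The `localhost:8000` address assumes vLLM's default port on the same machine, and the `run_tests` tool is a hypothetical example that is not part of Mellum2 or vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumption: default vLLM port, local server
MODEL = "JetBrains/Mellum2-12B-A2.5B-Instruct"

# Hypothetical tool definition in the OpenAI function-calling schema.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's unit tests in a given directory.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_chat_request(prompt, tools=None, max_tokens=512):
    """Build the JSON payload for POST /v1/chat/completions."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    return payload


def send_chat_request(payload, base_url=BASE_URL):
    """Send the payload to the server and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_request("Run the tests under tests/ and summarize failures.",
                             tools=[RUN_TESTS_TOOL])
print(json.dumps(payload, indent=2))
# reply = send_chat_request(payload)  # uncomment when a server is running
```

Tool calls are only parsed into structured form when the server was started with the tool-calling flags shown earlier; without them, the same request returns plain text.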

The post JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines appeared first on MarkTechPost.
