Meet ‘North Mini Code’: Cohere’s 30B Open-Weight Mixture-of-Experts Model With 3B Active Parameters for Agentic Coding

This week, Cohere’s AI team shipped its first developer-facing coding model, ‘North Mini Code’. The model is open-weight and aimed at software engineers. It is a mixture-of-experts (MoE) model with 30B total parameters, of which only 3B activate per token.

The launch is positioned around “sovereign” AI. The idea is simple: run capable models on your own terms. Small, efficient coding models let teams self-host without large GPU clusters. North Mini Code targets that gap directly.

North Mini Code

North Mini Code is a 30B-A3B parameter model. The A3B stands for three billion active parameters per forward pass. Cohere optimized it for three jobs: code generation, agentic software engineering, and terminal tasks. The model is text-in, text-out. There is no image or video input.

The context window is 256K tokens. Maximum output length is 64K tokens. Cohere lists a minimum hardware bar of one H100 at FP8. Weights ship under Apache 2.0 on Hugging Face. You can also reach the model through the Cohere API, Model Vault, and OpenRouter.

Field	North-Mini-Code-1.0
License	Apache 2.0
Model size	30B total; 3B active
Context length	256K total; 64K max generation
Optimized for	Code generation, agentic software engineering, terminal tasks
Availability	Hugging Face, Cohere API, Cohere Model Vault, OpenRouter
Hardware (minimum)	1× H100 @ FP8
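
A rough way to see why a single H100 at FP8 is the stated floor is to estimate weight memory alone. The sketch below assumes 1 byte per parameter at FP8, 2 bytes at BF16, and an 80 GB H100. It ignores KV cache, activations, and runtime overhead, so treat the result as a lower bound rather than a sizing guide.

```python
# Back-of-envelope weight memory for a 30B-parameter model.
# Assumptions (not from Cohere): 1 byte/param at FP8, 2 bytes/param at BF16,
# an 80 GB H100, and no allowance for KV cache or activations.
TOTAL_PARAMS = 30e9
H100_MEMORY_GB = 80

BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}


def weight_memory_gb(total_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("fp8", "bf16"):
        size = weight_memory_gb(TOTAL_PARAMS, dtype)
        headroom = H100_MEMORY_GB - size
        print(f"{dtype}: ~{size:.0f} GB weights, ~{headroom:.0f} GB left on one H100")
```

Note that all 30B parameters count toward this estimate even though only 3B are active per token: sparsity reduces compute per token, not the size of the weights that must be loaded.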

The Architecture

North Mini Code is a decoder-only Transformer with sparse MoE layers. Its attention interleaves two types in a 3:1 ratio. Sliding-window attention uses RoPE for positions. Global attention uses no positional embeddings at all. The feed-forward block holds 128 experts. Eight experts activate per token. Each expert is an FFN with SwiGLU activation.
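
The interleaving can be pictured as a repeating layer schedule plus two causal masks. The sketch below assumes that three sliding-window layers precede each global layer (the article gives the 3:1 ratio but not the order). The layer count and window size are hypothetical values chosen to keep the printout small.

```python
from typing import List, Optional

# Illustrative layer schedule and attention masks for a 3:1 interleave.
# Assumptions: 3 sliding-window layers per 1 global layer, in that order;
# NUM_LAYERS and WINDOW are hypothetical, not published values.
NUM_LAYERS = 8
WINDOW = 3


def layer_schedule(num_layers: int) -> List[str]:
    """Every fourth layer is global; the other three use a sliding window."""
    return ["global" if (i + 1) % 4 == 0 else "sliding" for i in range(num_layers)]


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[int]]:
    """Return 1 where query position q may attend to key position k.

    window=None gives full causal (global) attention; otherwise each query
    sees only itself and the previous window-1 positions.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q and (window is None or q - k < window)
            row.append(1 if visible else 0)
        mask.append(row)
    return mask


if __name__ == "__main__":
    print(layer_schedule(NUM_LAYERS))
    print("sliding:", causal_mask(5, WINDOW)[-1])
    print("global: ", causal_mask(5)[-1])
```

In the sliding layers each token attends to a fixed number of recent positions, so their cost does not grow with the full sequence the way the global layers do, which is one plausible reason for the mix at long context lengths.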

The router applies a sigmoid before top-k selection. A single dense layer sits before the sparse layers. That mix keeps active compute small while widening total capacity. Cohere released the weights in BF16.
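
The routing step can be sketched in a few lines. This is a minimal illustration of sigmoid-then-top-k gating over 128 experts with 8 selected, not Cohere's implementation: the random logits stand in for a learned router projection, and renormalizing the selected gates to sum to 1 is an assumption.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route(logits, top_k=TOP_K):
    """Apply a sigmoid to every router logit, then keep the top_k experts.

    Returns (expert_index, weight) pairs. Renormalizing the kept gates so
    they sum to 1 is an assumption made for this sketch.
    """
    gates = [sigmoid(z) for z in logits]
    chosen = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    total = sum(gates[i] for i in chosen)
    return [(i, gates[i] / total) for i in chosen]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical router logits for one token.
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    picks = route(logits)
    print("experts:", sorted(i for i, _ in picks))
    print(f"fraction of experts active: {TOP_K / NUM_EXPERTS:.2%}")
```

Because a sigmoid scores each expert independently, an expert's gate does not depend on how the other 127 score, unlike a softmax over all experts.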

Post-training ran in two phases. First came two-stage cascaded supervised fine-tuning (SFT). Then came reinforcement learning with verifiable rewards (RLVR). The post-training focused on agentic coding. The model also supports interleaved thinking and native tool use.

Benchmarks

Cohere reports a score of 33.4 on the Artificial Analysis Coding Index. It describes this as a competitive position among similarly sized models. The company evaluated on SWE-Bench Verified, SWE-Bench Pro, and Terminal-Bench v2. It also used Terminal-Bench Hard, SciCode, and LiveCodeBench v6.

The methodology is specific. SWE-Bench used the SWE-agent harness v1.1.0. Terminal-Bench v2 used a simple ReAct harness with one terminal tool. Terminal-Bench Hard used the Terminus-2 harness. Each benchmark ran with three seeds, and the results were averaged. Sampling used temperature 1.0 and top_p 0.95.
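
For readers unfamiliar with these sampling settings: temperature 1.0 leaves the model's distribution unscaled, and top_p 0.95 restricts sampling to the smallest set of tokens whose cumulative probability reaches 0.95. A minimal sketch of that filter over a toy vocabulary follows. The logits are made up, and real inference libraries implement the same idea on tensors.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.95):
    """Keep the most likely tokens until their cumulative mass reaches top_p.

    Returns {token_index: renormalized_probability} for the kept tokens.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


if __name__ == "__main__":
    toy_logits = [4.0, 3.0, 2.0, 0.0, -2.0]  # hypothetical values
    probs = softmax(toy_logits, temperature=1.0)
    nucleus = top_p_filter(probs, top_p=0.95)
    print("kept token indices:", sorted(nucleus))
```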

The Speed

In Cohere’s internal tests, North Mini Code reached up to 2.8x higher output throughput than Devstral Small 2. That held at identical concurrency and hardware. It also showed a 30% edge in inter-token latency. Time-to-first-token was closer between the two, with Devstral Small 2 keeping a slight TTFT lead.

Metric	North Mini Code vs Devstral Small 2
Output throughput	Up to 2.8x higher (same concurrency and hardware)
Inter-token latency	30% better for North Mini Code
Time-to-first-token	Slightly behind Devstral Small 2
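
These metrics combine into single-request response time roughly as time-to-first-token plus one inter-token gap for each additional token. The sketch below uses entirely hypothetical timings (absolute numbers are not given in this article) to show why a lower inter-token latency can outweigh a slightly slower first token on long generations. It also reads "30% better" as 30% lower latency, which is an assumption.

```python
def response_time_s(ttft_s: float, itl_s: float, output_tokens: int) -> float:
    """Approximate single-request latency: first token, then one gap per extra token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + itl_s * (output_tokens - 1)


if __name__ == "__main__":
    # Hypothetical timings for illustration only; not measured values.
    baseline = {"ttft_s": 0.40, "itl_s": 0.020}
    candidate = {"ttft_s": 0.45, "itl_s": 0.020 * 0.70}  # slower TTFT, 30% lower ITL

    for tokens in (1, 50, 1000):
        b = response_time_s(output_tokens=tokens, **baseline)
        c = response_time_s(output_tokens=tokens, **candidate)
        print(f"{tokens:>5} tokens: baseline {b:.2f}s, candidate {c:.2f}s")
```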

Use Cases With Examples

Cohere built North Mini Code for agentic workflows.

Three patterns stand out in its own framing:

Sub-agent orchestration: A main agent delegates subtasks to helpers. Example: one agent writes unit tests while another fixes failing code.
Systems architecture mapping: The model reads a repository and sketches its structure. Example: tracing how services call one another before a large refactor.
Code reviews: The model scans a diff for problems. Example: flagging an unguarded null dereference before a merge.

Terminal tasks fit the model as well. Example: listing files, running a build, then parsing the output for errors.
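
The terminal pattern (run a command, then read its output for errors) can be sketched with the standard library. The command, the error regex, and the helper names here are illustrative assumptions, not part of any Cohere tooling. In an agent loop the model would choose the command and interpret the parsed lines.

```python
import re
import subprocess
import sys

# Assumed convention: a line is an error if it contains "error" as a word.
ERROR_PATTERN = re.compile(r"\berror\b", re.IGNORECASE)


def run_command(argv):
    """Run a command and return (exit_code, combined stdout and stderr)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


def find_errors(output: str):
    """Return (line_number, text) for each output line that looks like an error."""
    return [
        (n, line.strip())
        for n, line in enumerate(output.splitlines(), start=1)
        if ERROR_PATTERN.search(line)
    ]


if __name__ == "__main__":
    # Stand-in for a real build: a tiny Python script that prints a fake log.
    fake_build = [
        sys.executable,
        "-c",
        "print('compiling app.c'); print('app.c:12: error: missing semicolon')",
    ]
    code, output = run_command(fake_build)
    print("exit code:", code)
    for number, text in find_errors(output):
        print(f"line {number}: {text}")
```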

Getting Started

The fastest path is Hugging Face Transformers. Install Transformers from source for this model. Recommended sampling is temperature 1.0 and top_p 0.95.

# Install Transformers from source (required for this model):
# pip install "git+https://github.com/huggingface/transformers.git"
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereLabs/North-Mini-Code-1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a python program to check if a string is a palindrome or not."
messages = [{"role": "user", "content": prompt}]

# return_dict=True yields a dict (input_ids + attention_mask) so **inputs unpacks cleanly
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)

# Decode only the newly generated tokens, not the prompt
output = tokenizer.decode(gen_tokens[0][inputs["input_ids"].shape[-1]:])
print(output)

For serving, vLLM works. You need vLLM main plus Cohere’s melody library, because accurate response parsing depends on it.

uv pip install "git+https://github.com/vllm-project/vllm.git"
uv pip install "cohere_melody>=0.9.0"

vllm serve CohereLabs/North-Mini-Code-1.0 \
  -tp 2 \
  --max-model-len 320000 \
  --tool-call-parser cohere_command4 \
  --reasoning-parser cohere_command4 \
  --enable-auto-tool-choice
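
Once the server is up, vLLM exposes an OpenAI-compatible HTTP API, by default on port 8000. A minimal client sketch using only the standard library is shown below. The base URL assumes a local default deployment, and the helper names are illustrative.

```python
import json
import urllib.request

# Assumes a local vLLM server on its default port.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "CohereLabs/North-Mini-Code-1.0"


def build_chat_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Build a chat-completions payload using the article's recommended sampling."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "top_p": 0.95,
    }


def send_chat_request(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to /chat/completions and return the assistant's text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    payload = build_chat_request("Write a Python function that reverses a string.")
    print(json.dumps(payload, indent=2))
    # Uncomment once the vLLM server is running:
    # print(send_chat_request(payload))
```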

Quantized builds exist for Ollama, LM Studio, and llama.cpp. You can also try the model before downloading. Cohere offers free access through OpenCode and a hosted Hugging Face Space.

Key Takeaways

Cohere’s first coding model, North Mini Code, is a 30B mixture-of-experts that activates just 3B parameters per token.
It runs on a single H100 at FP8, with 256K context and 64K max output.
Weights ship under Apache 2.0, though the Hugging Face card adds a non-commercial note.
Cohere’s official launch reports 33.4 on the Artificial Analysis Coding Index and up to 2.8x the output throughput of Devstral Small 2.
Built for agentic coding: sub-agent orchestration, architecture mapping, and code reviews, with native tool use.

Marktechpost’s Interactive Explainer

Cohere · Open-Weight Coding Model

North Mini Code

Cohere’s first developer coding model: a 30B mixture-of-experts that activates just 3B parameters per token, built for agentic software engineering and terminal tasks.

30B total params
3B active / token
256K context
64K max output
1× H100 @ FP8

The model at a glance

Open weights, released June 9, 2026. Text in, text out.

Size	30B total / 3B active
Architecture	Sparse MoE (decoder-only)
Context	256K / 64K output
Min hardware	1× H100 @ FP8
Precision	BF16 weights
License	Apache 2.0 (see note)

Optimized for

Code generation
Agentic software engineering
Terminal tasks

Agentic use cases

Sub-agent orchestration
Systems architecture mapping
Code reviews

License note: Cohere’s blog states Apache 2.0. The Hugging Face card adds an acceptable-use addendum and a non-commercial note. Check both before deploying.

The forward pass

The MoE block is where sparsity happens.

Input tokens

Text is tokenized and fed to a decoder-only Transformer. The model is text in, text out.

The router

Each MoE block holds 128 experts. The router selects 8 per token, so 6.25% of experts run per token and compute stays small.

Reported performance

Figures are from Cohere. Independent runs on your own workload still matter.

Artificial Analysis Coding Index	33.4
Output throughput vs Devstral Small 2	Up to 2.8× (higher is better; Devstral Small 2 = 1.0× baseline)
Inter-token latency	30% better for North Mini Code

Time-to-first-token was closely matched, with Devstral Small 2 holding a slight edge.

Benchmarks: SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench v2, Terminal-Bench Hard, SciCode, LiveCodeBench v6. Harnesses: SWE-agent v1.1.0 (SWE-Bench), a ReAct harness with one terminal tool (Terminal-Bench v2), Terminus-2 (Terminal-Bench Hard). Each run used 3 seeds, averaged, at temperature 1.0 and top_p 0.95.

Quickstart

Hugging Face Transformers, installed from source. Recommended sampling: temperature 1.0, top_p 0.95.

# Install Transformers from source, then:
from transformers import AutoTokenizer, AutoModelForCausalLM

mid = "CohereLabs/North-Mini-Code-1.0"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(mid, device_map="auto")

msgs = [{"role": "user", "content": "Write a Python palindrome checker."}]
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024,
                     do_sample=True, temperature=1.0, top_p=0.95)
# Decode only the newly generated tokens, not the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:]))

Serve with vLLM (+ cohere_melody)
Trained for OpenCode
Native tool use + interleaved thinking

Quantized: Ollama, LM Studio, llama.cpp
Also on Cohere API, Model Vault, OpenRouter

Marktechpost: AI models, research & dev tools, decoded for builders.


