Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs

Cohere simply launched Command A+, as an open-source mannequin focusing on enterprise agentic workflows. Available beneath an Apache 2.0 license, Command A+ is a mixture-of-experts (MoE) mannequin constructed for high-performance agentic duties with minimal compute overhead. The mannequin is optimized for reasoning, agentic workflows, RAG, multilingual, and multimodal doc processing. It unifies capabilities from 4 prior fashions — Command A, Command A Reasoning, Command A Vision, and Command A Translate — right into a single scalable mannequin.

Architecture

Command A+ is a decoder-only Sparse Mixture-of-Experts Transformer with 218B complete parameters and 25B lively parameters. It has 128 consultants, of which 8 are lively per token, and a single shared professional is utilized to all tokens. In a MoE mannequin, every token is routed via solely a subset of professional sub-networks reasonably than the total parameter set, protecting lively compute at 25B-parameter scale at inference time.

The consideration layers interleave sliding-window consideration layers with Rotational Positional Embeddings and world consideration layers with out positional embeddings in a 3:1 ratio. The sparse MoE layer is skilled in a totally dropless method and makes use of a token-choice router, with a normalized sigmoid over the top-k professional logits per token.

Input modalities are textual content, picture, and power use. Output modalities are textual content, reasoning, and power use. The mannequin helps a 128K enter context size and a 64K max era size.

Hardware Requirements and Quantization

Three quantization variants can be found with minimal GPU necessities: BF16 (16-bit) requires 4× B200 or 8× H100 GPUs; FP8 (8-bit) requires 2× B200 or 4× H100 GPUs; W4A4 (4-bit) runs on a single B200 or 2× H100 GPUs. All three quantizations present negligible variations in benchmark high quality. Cohere recommends W4A4 for most deployments.

W4A4 Quantization Methodology

Cohere applies NVFP4 W4A4 quantization, 4-bit weights and activations with two-level scaling, to the MoE consultants solely. The consideration path, together with Q/Okay/V/O projections, the KV cache, and a focus compute, is saved at full precision.

To shut residual high quality gaps, Cohere makes use of Quantization-Aware Distillation (QAD) within the post-training section: the quantized scholar mannequin is skilled to match the full-precision instructor’s output distribution, utilizing pretend quantization operators within the ahead move and straight-through estimators on the backward move.

https://cohere.com/weblog/command-a-plus

Performance vs. Prior Command A Models

On τ²-Bench Telecom, scores improved from 37% to 85% over Command A Reasoning, and Terminal-Bench Hard agentic coding efficiency reached 25% from 3%.

On inside North platform evaluations, all scored utilizing LLM-as-a-judge methods, Agentic Question Answering accuracy improved by 20% over Command A Reasoning. Agentic QA measures how nicely the mannequin solutions enterprise questions utilizing MCP-connected cloud file methods. Spreadsheet evaluation high quality improved by 32%, and Memory Usage Quality — measuring how nicely an agent leverages data from a earlier session to reply questions in a subsequent session — scored 54% with Command A+ in comparison with 39% with Command A Reasoning.

Command A+ is Cohere’s first multimodal reasoning mannequin. It achieved 63% on MMMU Pro and 75.1% on MMMU, in contrast with 65.3% for Command A Vision on the latter. MathVista scores improved from 73.5% to 80.6%, and CharXiv reasoning improved from 46.9% to 52.7%.

Command A+ expands multilingual protection from 23 to 48 languages, with positive factors in machine translation and multilingual reasoning.

Command A+ scored 37 on the Artificial Analysis Intelligence Index, outperforming different main open fashions.

Speed and Latency

At the identical quantization and concurrency ranges, Command A+ delivers as much as 63% greater Output Tokens per Second (TOPS) and reduces Time To First Token (TTFT) by as much as 17% in contrast with Command A Reasoning. The W4A4 quantization contributes a further 47% improve in velocity and a 13% discount in latency. Speculative decoding, optimized particularly for the MoE structure, delivers a further 1.5–1.6× inference speedup for each textual content and multimodal inputs.

Tokenizer

Command A+ is the primary mannequin to make use of Cohere’s newest tokenizer, decreasing the variety of tokens required to generate the identical response. Tokenization effectivity improved by 20% for Arabic, 16% for Korean, and 18% for Japanese.

Getting Started

The mannequin is supported by vLLM and Transformers. Tool use is dealt with via chat templates in Transformers utilizing JSON schema for instrument descriptions. When reasoning is enabled, the mannequin generates pondering traces between <|START_THINKING|> and <|END_THINKING|> tags earlier than producing a remaining reply.

The W4A4 variant requires vLLM ≥0.21.0 and cohere_melody>=0.9.0 for correct response parsing. Cohere recommends the next sampling parameters: temperature=0.9, top_p=0.95, and repetition_penalty=1.04.

Key Takeaways

Command A+ has 218B complete / 25B lively parameters in a Sparse MoE structure, launched beneath Apache 2.0.
W4A4 applies NVFP4 quantization to MoE consultants solely with QAD post-training, working on 2× H100s.
τ²-Bench Telecom improved from 37% to 85%; Terminal-Bench Hard from 3% to 25% vs. Command A Reasoning.
TOPS elevated as much as 63% and TTFT diminished as much as 17% vs. Command A Reasoning at matching quantization.
Command A+ is Cohere’s first multimodal reasoning mannequin, increasing language help from 23 to 48 languages.

Check out the Model Weights and Technical details. Also, be happy to comply with us on Twitter and don’t overlook to hitch our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to accomplice with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so forth.? Connect with us

The submit Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs appeared first on MarkTechPost.

Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs

Architecture

Hardware Requirements and Quantization

W4A4 Quantization Methodology

Performance vs. Prior Command A Models

Speed and Latency

Tokenizer

Getting Started

Key Takeaways

LightSeek Foundation Releases TokenSpeed, an Open-Source LLM Inference Engine Targeting TensorRT-LLM-Level Performance for Agentic Workloads

Anthropic Introduces Natural Language Autoencoders That Convert Claude’s Internal Activations Directly into Human-Readable Text Explanations

How to Build an Agentic Voice AI Assistant that Understands, Reasons, Plans, and Responds through Autonomous Multi-Step Intelligence

Texas A&M Researchers Introduce a Two-Phase Machine Learning Method Named ‘ShockCast’ for High-Speed Flow Simulation with Neural Temporal Re-Meshing

Bringing AI Agents Into Any UI: The AG-UI Protocol for Real-Time, Structured Agent–Frontend Streams

Google AI Releases Auto-Diagnose: An Large Language Model LLM-Based System to Diagnose Integration Test Failures at Scale

Curated by experts. Filtered for relevance.

Resources

About

Subscribe & learn more every day!

Architecture

Hardware Requirements and Quantization

W4A4 Quantization Methodology

Performance vs. Prior Command A Models

Speed and Latency

Tokenizer

Getting Started

Key Takeaways

Similar Posts

Curated by experts. Filtered for relevance.

Resources

About

Subscribe & learn more every day!