
Zyphra Releases ZAYA1-8B: A Reasoning MoE Trained on AMD Hardware That Punches Far Above Its Weight Class

Zyphra AI has launched ZAYA1-8B, a small Mixture of Experts (MoE) language model with 760 million active parameters and 8.4 billion total parameters. Trained end-to-end on AMD hardware, the model outperforms open-weight models many times its size on math and coding benchmarks, and is now available under an Apache 2.0 license on Hugging Face and as a serverless endpoint on Zyphra Cloud.

With under 1 billion active parameters, ZAYA1-8B achieves scores competitive with first-generation frontier reasoning models like DeepSeek-R1-0528, Gemini-2.5-Pro, and Claude 4.5 Sonnet on challenging mathematical reasoning tasks. With its novel test-time compute method called Markovian RSA, it surpasses Claude 4.5 Sonnet and GPT-5-High on HMMT’25 (89.6 vs 88.3) and closes in on frontier open-weight models like DeepSeek-V3.2 on math benchmarks.

What is a Mixture of Experts Model and Why Does Active Parameter Count Matter?

The distinction between ‘active’ and ‘total’ parameters matters a great deal. In a standard dense model, every parameter activates for every input token. In a Mixture of Experts model, only a subset of the network’s parameters (the ‘experts’) is activated at inference time. ZAYA1-8B has 8.4B total parameters but only 760M are active per forward pass. This dramatically reduces inference compute and memory bandwidth requirements while retaining the representational capacity of a much larger model.
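To make the active-versus-total split concrete, here is some back-of-envelope arithmetic for a top-k MoE. The expert count, top-k, and per-expert sizes below are hypothetical values chosen only so the totals land near the published figures; Zyphra has not disclosed this breakdown.

```python
def moe_param_counts(n_experts, top_k, expert_params, shared_params):
    """Back-of-envelope active vs. total parameters for a top-k MoE."""
    total = shared_params + n_experts * expert_params
    # Only the routed top-k experts (plus shared layers) run per token.
    active = shared_params + top_k * expert_params
    return total, active

# Hypothetical configuration chosen only so the totals land near the
# published 8.4B total / 760M active split; not ZAYA1-8B's real layout.
total, active = moe_param_counts(n_experts=64, top_k=4,
                                 expert_params=127e6, shared_params=252e6)
print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e6:.0f}M")
```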

ZAYA1-8B can be deployed on-device for local LLM applications, run efficiently in test-time compute harnesses, and serve requests at lower latency than dense models with similar benchmark performance.

https://www.zyphra.com/post/zaya1-8b

Architecture: MoE++ and Three Key Innovations

ZAYA1-8B is built on Zyphra’s MoE++ architecture, which introduces three specific changes over standard MoE designs. Together, these form the basis of ZAYA1-8B’s ‘intelligence efficiency’, the design goal Zyphra frames as maximizing intelligence extracted per parameter and per FLOP.

  • Compressed Convolutional Attention (CCA), a sequence-mixing mechanism developed by Zyphra that operates in a compressed latent space and achieves 8× KV-cache compression versus standard attention. The KV-cache is the memory used during inference to store intermediate attention states; an 8× reduction directly lowers memory requirements at inference time and enables longer effective contexts within the same hardware envelope.
  • An MLP-based router with PID-controller bias balancing. Standard MoE routers typically use linear projections to decide which expert processes a given token. Zyphra replaces this with an MLP-based router and adds PID-controller-style bias balancing to improve routing stability, actively preventing load imbalance across experts, a known failure mode in MoE training (see the router sketch after this list).
  • Learned residual scaling, which controls residual-norm growth through depth at negligible parameter and FLOP cost. In deep networks, residual-stream norms can grow unstably from layer to layer; learned scaling addresses this without adding meaningful overhead (also sketched below).
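
The router is the most code-friendly of the three changes. Below is a minimal PyTorch sketch of the idea: an MLP scores experts instead of a single linear projection, and a PID-style controller nudges a per-expert selection bias toward uniform load. The MLP width, gains, and exact control law are assumptions for illustration, not Zyphra’s published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPRouterWithPIDBalancing(nn.Module):
    """Illustrative MoE router: an MLP scorer plus PID-style bias balancing.

    The MLP width, PID gains, and control law here are assumptions; the
    actual ZAYA1 design may differ.
    """

    def __init__(self, d_model, n_experts, top_k=2,
                 kp=1e-2, ki=1e-3, kd=1e-3):
        super().__init__()
        # MLP scorer in place of the usual single linear projection.
        self.scorer = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.SiLU(),
            nn.Linear(d_model // 2, n_experts),
        )
        self.top_k = top_k
        self.kp, self.ki, self.kd = kp, ki, kd
        # PID state: a per-expert bias applied only to expert *selection*.
        self.register_buffer("bias", torch.zeros(n_experts))
        self.register_buffer("err_sum", torch.zeros(n_experts))
        self.register_buffer("err_prev", torch.zeros(n_experts))

    @torch.no_grad()
    def _update_bias(self, load_frac):
        # Error: deviation of each expert's share of tokens from uniform.
        err = load_frac - 1.0 / load_frac.numel()
        self.err_sum += err
        d_err = err - self.err_prev
        self.err_prev.copy_(err)
        # Push down the bias of overloaded experts, raise underloaded ones.
        self.bias -= self.kp * err + self.ki * self.err_sum + self.kd * d_err

    def forward(self, x):                      # x: [tokens, d_model]
        logits = self.scorer(x)                # [tokens, n_experts]
        # Bias steers which experts are picked, not their mixture weights.
        _, topk_idx = (logits + self.bias).topk(self.top_k, dim=-1)
        weights = F.softmax(logits.gather(-1, topk_idx), dim=-1)
        if self.training:                      # track observed expert load
            load = torch.zeros_like(self.bias)
            ones = torch.ones(topk_idx.numel(), device=x.device)
            load.scatter_add_(0, topk_idx.flatten(), ones)
            self._update_bias(load / topk_idx.numel())
        return topk_idx, weights               # feed to the expert MLPs
```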
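
Learned residual scaling admits an even smaller sketch: a single learnable gain on the residual branch per layer, which is enough to damp norm growth with depth. This is the generic form of the technique; ZAYA1-8B’s exact parameterization may differ.

```python
import torch
import torch.nn as nn

class ScaledResidual(nn.Module):
    """Learned residual scaling in its generic form: y = x + alpha * f(x).

    One learnable scalar per layer, so the parameter and FLOP overhead is
    negligible, yet the network can damp residual-norm growth with depth.
    """

    def __init__(self, sublayer: nn.Module, init_scale: float = 1.0):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = nn.Parameter(torch.tensor(init_scale))

    def forward(self, x):
        return x + self.alpha * self.sublayer(x)
```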

Training Infrastructure: Fully Built on AMD

ZAYA1-8B is an MoE model pretrained, midtrained, and supervised fine-tuned on an AMD Instinct MI300-series stack. The full training pipeline ran on a cluster of 1,024 AMD Instinct MI300X GPUs linked via AMD Pensando Pollara interconnect, in a custom training cluster built with IBM.

Reasoning-First Pretraining and a Five-Stage Post-Training Pipeline

ZAYA1-8B’s performance reflects innovations across the full stack: Zyphra’s MoE++ architecture, reasoning-first pretraining, a reasoning RL cascade methodology, and the novel Markovian RSA test-time compute method.

Zyphra’s post-training pipeline consists of five sequential stages:

  • The first is a standard SFT stage covering basic chat, instruction following, code, math, and test-time compute (TTC) abilities.
  • The second is a reasoning warmup combining mathematical tasks, logic, and puzzle solving with TTC prompts to train the model to natively self-aggregate candidate solutions.
  • Third is a large RLVE-Gym phase with dynamically adjusted puzzle difficulty to train core reasoning circuits.
  • Fourth is a large-scale math and code RL phase to deepen performance in these two fundamental domains.
  • Finally, a relatively lightweight RLHF/RLAIF phase improves chat behavior, instruction following, and writing style.

Zyphra’s research team observed the most substantial capability boosts on mathematics and coding during RL, with smaller but meaningful gains in multiple-choice knowledge retrieval (MMLU and GPQA-Diamond) and non-verifiable tasks such as creative writing.

Markovian RSA: A Novel Test-Time Compute Method

The most technically significant contribution alongside the model is Markovian RSA, a test-time compute (TTC) scheme that combines two prior ideas in a new way.

The first is Recursive Self-Aggregation (RSA), which generates multiple reasoning traces in parallel and aggregates them recursively across iterations. The second is the Markovian Thinker idea, which performs reasoning in fixed-duration chunks: only the tail end of the previous chunk is passed to the next, keeping the context window bounded regardless of how long the model reasons.

Markovian RSA combines these: for each prompt, multiple traces are generated in parallel; fixed-length tail segments are extracted from each trace; new aggregation prompts are constructed by sub-sampling from the candidate pool; and these aggregated prompts seed the next round of parallel responses. The result has favorable inference properties. Rollout generation is parallelizable, and the Markovian chunking strategy ensures intermediate chain-of-thought lengths never exceed a fixed context window size.
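
In pseudocode, the harness loop is compact. The sketch below follows the description above (parallel rollouts, fixed-length tails, sub-sampled aggregation prompts, repeat); `generate` is a stand-in for any model call, and every hyperparameter value is an illustrative assumption rather than Zyphra’s setting.

```python
import random

def markovian_rsa(prompt, generate, n_traces=8, n_rounds=4,
                  chunk_tokens=4096, tail_tokens=512, agg_size=3):
    """Sketch of a Markovian RSA test-time-compute harness.

    `generate(prompt, max_tokens)` is a stand-in for any model call; all
    hyperparameter values here are illustrative, not Zyphra's settings.
    """
    # Round 0: independent reasoning chunks (parallelizable in practice).
    traces = [generate(prompt, max_tokens=chunk_tokens)
              for _ in range(n_traces)]

    for _ in range(n_rounds):
        # Markovian step: keep only a fixed-length tail of each trace, so
        # context stays bounded no matter how long total reasoning runs.
        # (A real harness slices tokens, not characters.)
        tails = [t[-tail_tokens:] for t in traces]

        new_traces = []
        for _ in range(n_traces):
            # RSA step: aggregation prompt from a sub-sample of candidates.
            picks = random.sample(tails, k=min(agg_size, len(tails)))
            agg_prompt = (prompt
                          + "\n\nCandidate partial solutions:\n"
                          + "\n---\n".join(picks)
                          + "\nAggregate these and continue reasoning.")
            new_traces.append(generate(agg_prompt, max_tokens=chunk_tokens))
        traces = new_traces

    return traces  # final candidates; extract/vote on an answer downstream
```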

A key finding is that co-design between the post-training method and the inference harness is essential. ZAYA1-8B was trained to understand and respond to Markovian RSA aggregation prompts and chunking starting in SFT and continuing through RL. When Zyphra applied the same method to Qwen3-4B-Thinking-2507 without this co-design, the performance uplift was significantly smaller, indicating that the harness and post-training must be developed together to realize the gains.

With Markovian RSA at an extra-high test-time compute budget of 5.5 million tokens per problem, ZAYA1-8B outperforms DeepSeek-V3.2 and GPT-OSS-High on the challenging APEX-shortlist mathematics benchmark.

Benchmark Results

In the in-class comparison against similarly sized models, ZAYA1-8B scores 89.1 on AIME’26, 71.6 on HMMT Feb.’26, 59.3 on IMO-AnswerBench, 32.2 on APEX-shortlist, 65.8 on LiveCodeBench-v6, and 71.0 on GPQA-Diamond, outperforming Qwen3-4B-Thinking-2507 and Gemma-4-E4B-it across all mathematics and coding categories.

Against larger open-weight models, ZAYA1-8B with 760M active parameters surpasses Mistral-Small-4-119B (6B active, 119B total) on math and coding benchmarks specifically, scoring 89.1 vs 86.4 on AIME’26, 71.6 vs 70.6 on HMMT Feb.’26, and 63.8 vs 57.9 on LiveCodeBench-v6. Mistral-Small-4-119B retains advantages on GPQA-Diamond (77.2 vs 71.0) and MMLU-Pro (81.6 vs 74.2), where knowledge breadth matters more than mathematical reasoning depth.


Key Takeaways

  • ZAYA1-8B delivers frontier-level math and coding performance with only 760M active parameters, outperforming open-weight models many times its size.
  • Its MoE++ architecture introduces three innovations (CCA with 8× KV-cache compression, an MLP-based router with PID-controller bias balancing, and learned residual scaling) to maximize intelligence per parameter.
  • A novel test-time compute method called Markovian RSA, combining Recursive Self-Aggregation with Markovian chunking, pushes ZAYA1-8B past DeepSeek-V3.2 and GPT-OSS-High on APEX-shortlist at 5.5M tokens per problem.
  • ZAYA1-8B is the first MoE model pretrained, midtrained, and SFT’d entirely on AMD Instinct MI300 hardware, on a 1,024-GPU MI300X cluster built with IBM.
  • Released under Apache 2.0, it is available on Hugging Face and Zyphra Cloud.

Check out the Paper, Model Weights, and Technical Details.


