ZAYA1: AI model using AMD GPUs for training hits milestone
Zyphra, AMD, and IBM spent a year testing whether AMD’s GPUs and platform can support large-scale AI model training, and the result is ZAYA1.
In partnership, the three companies trained ZAYA1 – described as the first major Mixture-of-Experts foundation model built entirely on AMD GPUs and networking – which they see as proof that the market doesn’t have to depend on NVIDIA to scale AI.
The model was trained on AMD’s Instinct MI300X chips, Pensando networking, and ROCm software, all running on IBM Cloud’s infrastructure. What’s notable is how conventional the setup looks. Instead of experimental hardware or obscure configurations, Zyphra built the system much like any enterprise cluster, just without NVIDIA’s components.
Zyphra says ZAYA1 performs on par with, and in some areas ahead of, well-established open models in reasoning, maths, and code. For companies frustrated by supply constraints or spiralling GPU pricing, it amounts to something rare: a second option that doesn’t require compromising on capability.
How Zyphra used AMD GPUs to cut costs without gutting AI training efficiency
Most organisations follow the same logic when planning training budgets: memory capacity, communication speed, and predictable iteration times matter more than raw theoretical throughput.
The MI300X’s 192GB of high-bandwidth memory per GPU gives engineers some breathing room, allowing early training runs without immediately resorting to heavy parallelism. That tends to simplify projects that are otherwise fragile and time-consuming to tune.
Zyphra built each node with eight MI300X GPUs connected over InfinityFabric and paired each with its own Pollara network card. A separate network handles dataset reads and checkpointing. It’s an unfussy design, but that appears to be the point; the simpler the wiring and network architecture, the lower the switch costs and the easier it is to keep iteration times steady.
ZAYA1: An AI model that punches above its weight
ZAYA1-base activates 760 million parameters out of a total 8.3 billion and was trained on 12 trillion tokens in three stages. The architecture leans on compressed attention, a refined routing system to steer tokens to the right experts, and lighter-touch residual scaling to keep deeper layers stable.
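The report doesn’t disclose the router’s internals, but the general mechanism of top-k Mixture-of-Experts routing – each token passing through only a few experts rather than the whole model – can be sketched in NumPy. All shapes, names, and the top_k=2 default below are illustrative, not ZAYA1’s actual configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_route(tokens, gate_w, expert_ws, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    tokens:    (n, d) token activations
    gate_w:    (d, n_experts) router weights
    expert_ws: list of (d, d) expert weight matrices
    """
    scores = softmax(tokens @ gate_w)             # (n, n_experts) routing probabilities
    top = np.argsort(-scores, axis=1)[:, :top_k]  # chosen experts per token
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        picked = scores[i, top[i]]
        weights = picked / picked.sum()           # renormalise over the chosen experts
        for w, e in zip(weights, top[i]):
            out[i] += w * (tok @ expert_ws[e])    # only top_k experts ever run per token
    return out, top
```

The point of the structure is visible in the loop: however many experts exist, each token touches only `top_k` of them, which is why active parameters (760M) can be a small fraction of total parameters (8.3B).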
The model uses a mixture of Muon and AdamW. To make Muon efficient on AMD hardware, Zyphra fused kernels and trimmed unnecessary memory traffic so the optimiser wouldn’t dominate each iteration. Batch sizes were increased over time, though that depends heavily on having storage pipelines that can deliver tokens quickly enough.
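Zyphra’s fused ROCm kernels aren’t public, but Muon’s core idea is replacing the raw momentum update with an approximately orthogonalised one. A simplified NumPy sketch using a cubic Newton–Schulz iteration (production Muon variants use a tuned quintic polynomial, and the learning rate and beta below are illustrative):

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalise a gradient matrix, as Muon does.

    Normalising by the Frobenius norm first keeps all singular values
    in a range where the cubic iteration pushes them towards 1.
    """
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update: accumulate momentum, orthogonalise, step."""
    momentum = beta * momentum + grad
    update = newton_schulz_orth(momentum)
    return param - lr * update, momentum
```

In practice Muon is applied only to the 2-D weight matrices, with AdamW handling embeddings, norms, and other parameters – which is consistent with the hybrid approach the article describes.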
All of this results in an AI model trained on AMD hardware that competes with larger peers such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. One advantage of the MoE structure is that only a sliver of the model runs at once, which helps manage inference memory and reduces serving cost.
A bank, for example, could train a domain-specific model for investigations without needing convoluted parallelism early on. The MI300X’s memory headroom gives engineers room to iterate, while ZAYA1’s compressed attention cuts prefill time during analysis.
Making ROCm behave with AMD GPUs
Zyphra didn’t hide the fact that moving a mature NVIDIA-based workflow onto ROCm took work. Instead of porting components blindly, the team spent time measuring how AMD hardware behaved and reshaping model dimensions, GEMM patterns, and microbatch sizes to suit the MI300X’s preferred compute ranges.
InfinityFabric operates best when all eight GPUs in a node participate in collectives, and Pollara tends to reach peak throughput with larger messages, so Zyphra sized fusion buffers accordingly. Long-context training, from 4k up to 32k tokens, relied on ring attention for sharded sequences and tree attention during decoding to avoid bottlenecks.
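The report’s exact buffer sizes aren’t given, but the idea behind sizing fusion buffers for a network that prefers large messages is straightforward: accumulate small gradient tensors into a bucket and launch one collective only when the bucket crosses a threshold. A minimal sketch (the 64 MB default is an assumption, not Zyphra’s figure):

```python
def fuse_into_buckets(tensor_sizes_bytes, bucket_bytes=64 * 1024 * 1024):
    """Group gradient tensors into fused buckets so that each collective
    operation sends one large message instead of many small ones."""
    buckets, current, current_size = [], [], 0
    for i, size in enumerate(tensor_sizes_bytes):
        current.append(i)
        current_size += size
        if current_size >= bucket_bytes:   # bucket full: flush as one collective
            buckets.append(current)
            current, current_size = [], 0
    if current:                            # flush any trailing partial bucket
        buckets.append(current)
    return buckets
```

The same trade-off appears in most data-parallel frameworks: larger buckets mean fewer, bigger messages (good for fabrics like Pollara), at the cost of slightly delayed communication overlap.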
Storage concerns were equally practical. Smaller models hammer IOPS; larger ones need sustained bandwidth. Zyphra bundled dataset shards to reduce scattered reads and enlarged per-node page caches to speed checkpoint recovery, which is vital during long runs where rewinds are inevitable.
Keeping clusters on their toes
Training jobs that run for weeks rarely behave perfectly. Zyphra’s Aegis service monitors logs and system metrics, identifies failures such as NIC glitches or ECC blips, and takes simple corrective actions automatically. The team also increased RCCL timeouts to keep short network interruptions from killing entire jobs.
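Aegis itself isn’t open source and its rules aren’t documented in detail, but a monitor of this kind typically maps log signatures to corrective actions. A hypothetical sketch (the patterns and action names below are invented for illustration):

```python
import re

# Hypothetical failure signatures -> corrective actions.
# Aegis's real rule set is not public; these are illustrative only.
RULES = [
    (re.compile(r"NIC .* link (flap|down)"), "restart_nic"),
    (re.compile(r"ECC (corrected|uncorrected) error"), "cordon_gpu"),
    (re.compile(r"RCCL .* timeout"), "retry_collective"),
]

def classify(log_line):
    """Return the corrective action for a log line, or None if it looks healthy."""
    for pattern, action in RULES:
        if pattern.search(log_line):
            return action
    return None
```

The value of such a layer is less in any single rule than in keeping a weeks-long run alive without paging an operator for every transient fault.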
Checkpointing is distributed across all GPUs rather than forced through a single chokepoint. Zyphra reports more than ten-fold faster saves compared with naïve approaches, which directly improves uptime and cuts operator workload.
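The write-up doesn’t include the checkpointing code, but the principle – every rank writing its own shard in parallel instead of funnelling all state through one writer – can be sketched like this (the file layout and naming are assumptions):

```python
import os
import pickle
from concurrent.futures import ThreadPoolExecutor

def save_sharded(state_per_rank, out_dir):
    """Write each 'rank's' state as its own shard file, in parallel,
    rather than gathering everything onto one writer (a sketch of
    distributed checkpointing; rank count and format are illustrative)."""
    def write_shard(rank_state):
        rank, state = rank_state
        path = os.path.join(out_dir, f"shard_rank{rank}.pkl")
        with open(path, "wb") as f:
            pickle.dump(state, f)
        return path

    with ThreadPoolExecutor() as pool:
        return list(pool.map(write_shard, enumerate(state_per_rank)))
```

With N writers working concurrently, save time scales with the largest shard rather than the whole model, which is the kind of effect behind the ten-fold figure Zyphra reports.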
What the ZAYA1 AMD training milestone means for AI procurement
The report draws a clear line between NVIDIA’s ecosystem and AMD’s equivalents: NVLink vs InfinityFabric, NCCL vs RCCL, cuBLASLt vs hipBLASLt, and so on. The authors argue the AMD stack is now mature enough for serious large-scale model development.
None of this means enterprises should tear out existing NVIDIA clusters. A more realistic path is to keep NVIDIA for production while using AMD for stages that benefit from the memory capacity of MI300X GPUs and ROCm’s openness. That spreads supplier risk and increases total training volume without major disruption.
This all leads to a set of recommendations: treat model shape as adjustable, not fixed; design networks around the collective operations your training will actually use; build fault tolerance that protects GPU hours rather than merely logging failures; and modernise checkpointing so it no longer derails training rhythm.
It’s not a manifesto, just our practical takeaway from what Zyphra, AMD, and IBM learned by training a large MoE AI model on AMD GPUs. For organisations looking to expand AI capacity without relying solely on one vendor, it’s a potentially useful blueprint.
See also: Google commits to 1000x more AI infrastructure in next 4-5 years

The post ZAYA1: AI model using AMD GPUs for training hits milestone appeared first on AI News.
