Z.AI Introduces GLM-5.1: An Open-Weight 754B Agentic Model That Achieves SOTA on SWE-Bench Pro and Sustains 8-Hour Autonomous Execution
Z.AI, the AI platform developed by the team behind the GLM model family, has released GLM-5.1, its next-generation flagship model built specifically for agentic engineering. Unlike models optimized for clean, single-turn benchmarks, GLM-5.1 is built for agentic tasks, with significantly stronger coding capabilities than its predecessor. It achieves state-of-the-art performance on SWE-Bench Pro while leading GLM-5 by a wide margin on NL2Repo (repo generation) and Terminal-Bench 2.0 (real-world terminal tasks).
Architecture: DSA, MoE, and Asynchronous RL
Before diving into what GLM-5.1 can do, it's worth understanding what it's built on, because the architecture is meaningfully different from a typical dense transformer.
GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. The model uses a glm_moe_dsa architecture, a Mixture of Experts (MoE) model combined with DSA. For AI developers evaluating whether to self-host, this matters: MoE models activate only a subset of their parameters per forward pass, which can make inference significantly more efficient than a comparably sized dense model, though they require specialized serving infrastructure.
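GLM-5.1's exact expert count and routing scheme are not disclosed, so the following is only a toy sketch of the general MoE mechanism the paragraph describes: a gate scores all experts, only the top-k of them actually run for a given token, and the remaining parameters stay idle. All numbers below are illustrative, not the model's real configuration.

```python
import math
import random

def top_k_gate(scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}

def moe_layer(x, experts, gate_scores, k=2):
    """Route input x through only the top-k experts; the rest do no work."""
    weights = top_k_gate(gate_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Toy demo: 8 tiny "experts" (just scalar multipliers), only 2 run per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
random.seed(0)
scores = [random.random() for _ in range(8)]
y = moe_layer(1.0, experts, scores, k=2)
print(f"active experts per token: 2/8 -> {2/8:.0%} of expert params used")
```

The efficiency argument for self-hosting follows directly: per-token compute scales with the active experts, not the total parameter count.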
On the training side, GLM-5 implements a new asynchronous reinforcement learning infrastructure that dramatically improves post-training efficiency by decoupling generation from training. Novel asynchronous agent RL algorithms further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. This is what allows the model to handle agentic tasks with the kind of sustained judgment that single-turn RL training struggles to produce.
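Z.AI has not published this infrastructure, so the sketch below only illustrates the decoupling idea with stdlib threads and a bounded queue: rollout workers generate trajectories at their own pace while the learner consumes whatever is ready, so neither side blocks waiting for the other to finish a full step. All names and numbers are hypothetical.

```python
import queue
import threading

rollouts = queue.Queue(maxsize=8)  # buffer that decouples actors from the learner

def generation_worker(worker_id, n_episodes):
    """Stand-in for a rollout worker: produces trajectories at its own pace."""
    for ep in range(n_episodes):
        trajectory = {"worker": worker_id, "episode": ep, "reward": ep * 0.1}
        rollouts.put(trajectory)  # queued; no waiting for a training step to end

def trainer(total_expected):
    """Stand-in for the learner: consumes trajectories as they become ready."""
    seen = 0
    while seen < total_expected:
        rollouts.get()  # a real learner would run a gradient step here
        seen += 1
    return seen

workers = [threading.Thread(target=generation_worker, args=(i, 5)) for i in range(4)]
for w in workers:
    w.start()
processed = trainer(total_expected=20)
for w in workers:
    w.join()
print(f"trainer consumed {processed} trajectories from 4 async workers")
```

In a synchronous design, the learner would wait for every worker to finish a batch before stepping; here the queue absorbs the variance in generation time, which is where the claimed efficiency gain comes from.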
The Plateau Problem GLM-5.1 Is Solving
To understand what makes GLM-5.1 different at inference time, it helps to understand a particular failure mode in LLMs used as agents. Previous models, including GLM-5, tend to exhaust their repertoire early: they apply familiar strategies for quick initial gains, then plateau. Giving them more time doesn't help.
This is a structural limitation for any developer trying to use an LLM as a coding agent. The model applies the same playbook it knows, hits a wall, and stops making progress no matter how long it runs. GLM-5.1, by contrast, is built to stay effective on agentic tasks over much longer horizons. It handles ambiguous problems with better judgment and remains productive over longer sessions. It breaks complex problems down, runs experiments, reads results, and identifies blockers with real precision. By revisiting its reasoning and revising its strategy through repeated iteration, GLM-5.1 sustains optimization over hundreds of rounds and thousands of tool calls.
Sustained performance of this kind requires more than a larger context window. It requires the model to maintain goal alignment over extended execution, reducing strategy drift, error accumulation, and unproductive trial and error, which is what enables truly autonomous execution of complex engineering tasks.
Benchmarks: Where GLM-5.1 Stands
On SWE-Bench Pro, GLM-5.1 achieves a score of 58.4, outperforming GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro and setting a new state-of-the-art result.
The broader benchmark profile shows a well-rounded model. GLM-5.1 scores 95.3 on AIME 2026, 94.0 on HMMT Nov. 2025, 82.6 on HMMT Feb. 2026, and 86.2 on GPQA-Diamond, a graduate-level science reasoning benchmark. On agentic and tool-use benchmarks, GLM-5.1 scores 68.7 on CyberGym (a substantial jump from GLM-5's 48.3), 68.0 on BrowseComp, 70.6 on τ³-Bench, and 71.8 on MCP-Atlas (Public Set), the last one particularly relevant given MCP's growing role in production agent systems. On Terminal-Bench 2.0, the model scores 63.5, rising to 66.5 when evaluated with Claude Code as the scaffolding.
Across 12 representative benchmarks covering reasoning, coding, agents, tool use, and browsing, GLM-5.1 demonstrates a broad and well-balanced capability profile. This shows that GLM-5.1 is not a single-metric improvement; it advances simultaneously across general intelligence, real-world coding, and complex task execution.
In terms of overall positioning, GLM-5.1's general capability and coding performance are broadly aligned with Claude Opus 4.6.
8-Hour Sustained Execution: What That Actually Means
The most important distinction in GLM-5.1 is its capacity for long-horizon task execution. GLM-5.1 can work autonomously on a single task for up to 8 hours, completing the full process from planning and execution to testing, fixing, and delivery.
For developers building autonomous agents, this changes the scope of what's possible. Rather than orchestrating a model over dozens of short-lived tool calls, you can hand GLM-5.1 a complex objective and let it run a complete 'experiment–analyze–optimize' loop autonomously.
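As an illustration of that loop shape (not Z.AI's implementation), here is a minimal 'experiment–analyze–optimize' cycle in plain Python: propose a change, run the experiment, keep the change only if the measured score improves, and repeat. The objective function and parameter names are purely illustrative.

```python
import random

def run_experiment(params):
    """Stand-in for 'run the benchmark': score peaks at x=3, y=-1."""
    return -((params["x"] - 3) ** 2) - ((params["y"] + 1) ** 2)

def agent_loop(initial, rounds=200, seed=0):
    """Minimal experiment-analyze-optimize loop: propose, measure, keep if better."""
    rng = random.Random(seed)
    best, best_score = dict(initial), run_experiment(initial)
    for _ in range(rounds):
        candidate = {k: v + rng.uniform(-0.5, 0.5) for k, v in best.items()}  # propose
        score = run_experiment(candidate)                                     # experiment
        if score > best_score:                                                # analyze
            best, best_score = candidate, score                               # optimize
    return best, best_score

best, score = agent_loop({"x": 0.0, "y": 0.0})
print(f"after 200 rounds: x={best['x']:.2f}, y={best['y']:.2f}, score={score:.4f}")
```

The claim about GLM-5.1 is that it can sustain this kind of measured, reversible iteration for hours without drifting off objective, which is exactly what the plateau-prone models in the previous section cannot do.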
The concrete engineering demonstrations make this tangible: GLM-5.1 can build a complete Linux desktop environment from scratch in 8 hours; perform 178 rounds of autonomous iteration on a vector database task, improving performance to 1.5× the initial version; and optimize a CUDA kernel, increasing its speedup from 2.6× to 35.7× through sustained tuning.
That CUDA kernel result is notable for ML engineers: improving a kernel from a 2.6× to a 35.7× speedup through autonomous iterative optimization is a level of depth that would take a skilled human engineer significant time to replicate manually.
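Since both headline numbers are speedups over the same reference kernel, the gain attributable to the tuning itself is their ratio:

```python
baseline_speedup = 2.6   # kernel speedup before autonomous tuning (vs. reference)
tuned_speedup = 35.7     # kernel speedup after sustained iterative optimization

# The tuning made the already-optimized kernel this many times faster again.
relative_gain = tuned_speedup / baseline_speedup
print(f"tuned kernel is {relative_gain:.1f}x faster than the 2.6x starting point")
```

That is, the autonomous run found roughly a further 13.7× of headroom beyond the initial optimized version.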
Model Specifications and Deployment
GLM-5.1 is a 754-billion-parameter MoE model released under the MIT license on Hugging Face. It operates with a 200K context window and supports up to 128K maximum output tokens, both critical for long-horizon tasks that need to hold large codebases or extended reasoning chains in memory.
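A practical consequence of those two limits is budget checking before a request. The sketch below assumes prompt and output tokens share the 200K window, which is a common convention but not confirmed by the article; verify how Z.AI accounts for them before relying on this.

```python
CONTEXT_WINDOW = 200_000   # GLM-5.1 context window (tokens)
MAX_OUTPUT = 128_000       # GLM-5.1 maximum output tokens

def fits(prompt_tokens, requested_output):
    """Check a request against the published limits before sending it."""
    if requested_output > MAX_OUTPUT:
        return False, "requested output exceeds the 128K output cap"
    if prompt_tokens + requested_output > CONTEXT_WINDOW:
        return False, "prompt + output would overflow the 200K context window"
    return True, "ok"

ok, why = fits(prompt_tokens=150_000, requested_output=64_000)
print(ok, why)  # a 150K-token codebase plus 64K of output overflows the window
```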
GLM-5.1 supports thinking mode (offering multiple thinking modes for different scenarios), streaming output, function calling, context caching, structured output, and MCP for integrating external tools and data sources.
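Function-calling APIs generally accept tool definitions as JSON Schema. Whether Z.AI's API uses exactly the OpenAI-style shape below is an assumption to check against their docs; the tool name and fields are purely illustrative.

```python
import json

# Hypothetical tool definition in the JSON Schema style most function-calling
# APIs accept. The exact envelope GLM-5.1 expects is an assumption.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

print(json.dumps(get_weather_tool, indent=2))
```

The model then emits a structured call (name plus JSON arguments) that your agent runtime executes and feeds back, which is the mechanism MCP standardizes across tool providers.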
For local deployment, the following open-source frameworks support GLM-5.1: SGLang (v0.5.10+), vLLM (v0.19.0+), xLLM (v0.8.0+), Transformers (v0.5.3+), and KTransformers (v0.5.3+).
For API access, the model is available through Z.AI's API platform. Getting started requires installing zai-sdk via pip and initializing a ZaiClient with your API key.
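The article does not show the call itself, so the sketch below only builds a plausible OpenAI-style request payload; the model slug, message shape, and field names are assumptions to verify against Z.AI's API reference before use.

```python
# Hedged sketch of a chat request body, assuming an OpenAI-compatible payload.
# A real call would pass this through the zai-sdk client rather than raw HTTP;
# the exact client method names are not shown in the article.
payload = {
    "model": "glm-5.1",  # assumed model slug
    "messages": [
        {"role": "system", "content": "You are a coding agent."},
        {"role": "user", "content": "Profile this repo and speed up the hot loop."},
    ],
    "stream": True,         # streaming output is listed as supported
    "max_tokens": 128_000,  # up to the published 128K output cap
}
print(payload["model"], len(payload["messages"]))
```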
Key Takeaways
- GLM-5.1 sets a new state-of-the-art on SWE-Bench Pro with a score of 58.4, outperforming GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro, making it one of the strongest publicly benchmarked models for real-world software engineering tasks at the time of release.
- The model is built for long-horizon autonomous execution, capable of working on a single complex task for up to 8 hours: running experiments, revising strategies, and iterating across hundreds of rounds and thousands of tool calls without human intervention.
- GLM-5.1 uses a MoE + DSA architecture trained with asynchronous reinforcement learning, which reduces training and inference costs compared to dense transformers while maintaining long-context fidelity, a significant consideration for teams evaluating self-hosting.
- It is open-weight under the MIT license (754B parameters, 200K context window, 128K max output tokens) and supports local deployment via SGLang, vLLM, xLLM, Transformers, and KTransformers, as well as API access through the Z.AI platform with OpenAI SDK compatibility.
- GLM-5.1 goes beyond coding; it also shows strong improvements in front-end prototyping, artifact generation, and office productivity tasks (Word, Excel, PowerPoint, PDF), positioning it as a general-purpose foundation for both agentic systems and high-quality content workflows.
Check out the Weights, API, and Technical details.
The submit Z.AI Introduces GLM-5.1: An Open-Weight 754B Agentic Model That Achieves SOTA on SWE-Bench Pro and Sustains 8-Hour Autonomous Execution appeared first on MarkTechPost.
