Qwen Introduces Qwen3.7-Max: A Reasoning Agent Model With a 1M-Token Context Window
Most AI models today are not designed for sustained, multi-step autonomous execution. Tasks like running hundreds of iterative code modifications, or chaining tool calls across hours without human intervention, require a different kind of model architecture and training focus.
Alibaba’s Qwen team officially announced Qwen3.7-Max at the 2026 Alibaba Cloud Summit on May 20. By then, however, two preview versions of the Qwen3.7 series had already quietly appeared on Arena AI’s leaderboard with no press release and no official API announcement.
Two Preview Models Released Simultaneously
Alibaba previewed two models simultaneously: Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview. They ranked 13th globally in text capabilities and 16th in vision capabilities, respectively, according to LM Arena.
In Text Arena, Qwen3.7-Max-Preview ranked #13 overall, placing Alibaba as the #6 lab in text. In Vision Arena, Qwen3.7-Plus-Preview ranked #16 overall, placing Alibaba as the #5 lab in vision. The model rank and the lab rank are separate figures.
Qwen3.7-Plus-Preview is described as a high-performance balanced model preview, focusing on reasoning and logical expression, with its toolchain to be gradually opened in the future. It handles vision and multimodal inputs. Qwen3.7-Max is the text-only reasoning flagship. This article covers Qwen3.7-Max, as it is the model Alibaba officially announced with API access.
What is Qwen3.7-Max Designed For
Alibaba’s Qwen team described Qwen3.7-Max as its most advanced and comprehensive agent model to date. The model is proprietary and closed-weight. It is capable of handling coding and debugging, office workflow automation, and long-horizon tasks spanning hundreds or even thousands of steps.
Extended-Thinking Mode
Qwen3.7-Max is a reasoning model. The model generates a chain of thought first: an internal sequence of steps in which it plans, checks its work, and corrects course before committing to a final answer. On interfaces like Qwen Chat, this shows up as a ‘Thinking’ mode you can switch on to see the model’s reasoning trace.
Reasoning models produce significantly more output tokens than standard completions. When Artificial Analysis ran its Intelligence Index evaluation, Qwen3.7-Max generated about 97 million tokens, roughly four times the 24 million average for models on that benchmark. For short or simple tasks, this overhead adds latency without improving output quality. For multi-step planning, code refactoring, or long agent chains, extended-thinking mode is where the model’s strength applies.
Context Window
The model features a 1M-token context window, up from 256K on Qwen3.6 Max Preview. It supports text input and output only. Pricing has not yet been announced. For reference, Qwen3.6 Max Preview was priced at $1.30/$7.80 per million input/output tokens on Alibaba Cloud.
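Until pricing is published, a budget can only be roughed out. The sketch below is a minimal cost estimator that uses the Qwen3.6 Max Preview rates quoted above as placeholders. Those rates are an assumption for illustration, not Qwen3.7-Max pricing, and the function name and the example token counts are hypothetical.

```python
# Placeholder rates: these are the Qwen3.6 Max Preview prices quoted in the
# article, NOT announced Qwen3.7-Max pricing. Swap them once pricing is public.
INPUT_USD_PER_MILLION = 1.30
OUTPUT_USD_PER_MILLION = 7.80


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate: float = INPUT_USD_PER_MILLION,
    output_rate: float = OUTPUT_USD_PER_MILLION,
) -> float:
    """Estimate the USD cost of a workload from its token counts."""
    input_cost = (input_tokens / 1_000_000) * input_rate
    output_cost = (output_tokens / 1_000_000) * output_rate
    return input_cost + output_cost


# Hypothetical request: a full 1M-token prompt with a 20K-token reply.
cost = estimate_cost_usd(input_tokens=1_000_000, output_tokens=20_000)
print(f"Estimated cost at placeholder rates: ${cost:.3f}")
```

Because reasoning models emit long chains of thought, the output rate tends to matter more than the input rate for agentic workloads, so it is worth measuring real output token counts before extrapolating.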
A million-token context window can hold a full mid-sized code repository or a large stack of documents in a single request. Models often reason less reliably as the context window fills. Independent long-context testing for Qwen3.7-Max is not yet available.
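Whether a given repository or document set actually fits can be checked roughly before sending anything. The sketch below assumes about four characters per token, a common rule of thumb for English text and code rather than the Qwen tokenizer's real behavior. The function names and the output reserve are hypothetical, so treat the result as a first-pass estimate.

```python
import math

CONTEXT_WINDOW_TOKENS = 1_000_000  # context size reported for Qwen3.7-Max
CHARS_PER_TOKEN = 4.0              # assumption: rough heuristic, not the real tokenizer


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Approximate the token count of a string from its character length."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    documents: dict[str, str],
    reserve_for_output: int = 50_000,
) -> tuple[int, bool]:
    """Return (estimated input tokens, whether they fit with room for output)."""
    total = sum(estimate_tokens(body) for body in documents.values())
    return total, total + reserve_for_output <= CONTEXT_WINDOW_TOKENS


# Hypothetical inputs standing in for files read from disk.
docs = {"main.py": "x" * 400, "README.md": "y" * 4_000}
total_tokens, ok = fits_in_context(docs)
print(total_tokens, ok)
```

Leaving headroom matters twice over here: the reply and reasoning trace need tokens too, and reliability tends to degrade as the window fills.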
Benchmark Results
Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, placing it fifth overall. That represents a 4.8-point gain over its predecessor Qwen3.6 Max Preview (51.8), and puts it ahead of Google’s Gemini 3.5 Flash (55.3). GPT-5.5 (60.2), Claude Opus 4.7 (57.3), and Gemini 3.1 Pro Preview (57.2) still lead the overall rankings.
The Intelligence Index v4.0 aggregates ten evaluations, including GDPval-AA, Terminal-Bench Hard, SciCode, AA-Omniscience, Humanity’s Last Exam, and GPQA Diamond.

The improvement over Qwen3.6 Max Preview is not uniform. Most of the Index gains are concentrated in scientific reasoning, agentic capability, and coding. CritPt rose 9.7 percentage points (from 3.7% to 13.4%), Humanity’s Last Exam jumped 9.2 points (from 28.9% to 38.1%), and Terminal-Bench Hard climbed 6.9 points (from 43.9% to 50.8%). GDPval-AA added 42 Elo points (from 1504 to 1546). Scores on other benchmarks are largely flat compared to Qwen3.6 Max Preview.
One result on the Index requires careful reading. On AA-Omniscience, Qwen3.7-Max’s raw accuracy actually dropped 7.6 percentage points (from 37.7% to 30.1%), while its hallucination rate fell 21.3 points (from 44.2% to 22.9%). The model is choosing to say “I don’t know” more often rather than recalling more facts. Its attempt rate fell from 67.3% to 48.0%, the lowest among frontier models in the comparison. The AA-Omniscience benchmark rewards correct answers and penalizes hallucinations but has no penalty for refusing to answer. For use cases that depend on broad factual recall, this is a significant limitation to test against your workload.
In Text Arena, Qwen3.7-Max-Preview ranked #13 overall with an Elo score of 1,475. Category rankings include #7 in Math, #9 in Expert Prompts, #9 in Software and IT, and #10 in Coding.
All benchmark numbers are preliminary. The model carries a ‘Preview’ label, indicating Alibaba considers it an early build.
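One way to read the AA-Omniscience figures together is to ask how often the model is right when it does answer. The sketch below divides raw accuracy by attempt rate. It assumes accuracy is measured over all questions and attempt rate is the share of questions answered, which the article does not define precisely, and it ignores any partially correct category. The function name is hypothetical, and this derived figure is not an official benchmark metric.

```python
def accuracy_when_attempting(accuracy_pct: float, attempt_rate_pct: float) -> float:
    """Share of attempted questions answered correctly, in percent.

    Assumes accuracy_pct is measured over all questions and
    attempt_rate_pct is the share of questions the model tried to answer.
    """
    if attempt_rate_pct <= 0:
        raise ValueError("attempt rate must be positive")
    return 100.0 * accuracy_pct / attempt_rate_pct


# Figures quoted in the article for AA-Omniscience.
qwen36 = accuracy_when_attempting(accuracy_pct=37.7, attempt_rate_pct=67.3)
qwen37 = accuracy_when_attempting(accuracy_pct=30.1, attempt_rate_pct=48.0)

print(f"Qwen3.6 Max Preview: {qwen36:.1f}% correct when it answers")
print(f"Qwen3.7-Max:         {qwen37:.1f}% correct when it answers")
```

Under these assumptions the newer model answers fewer questions but is right more often on the ones it attempts (about 62.7% versus 56.0%), which matches the article's reading that the change reflects more abstention rather than more recalled facts.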
Agentic Performance — Internal Test
In an internal Alibaba test on a new chip platform, the model autonomously performed more than 1,000 tool calls and iterative code modifications to optimize a key kernel. Alibaba claimed the process improved inference speed by roughly 10x compared with the previous version.
Key Takeaways:
- Alibaba released two Qwen3.7 preview models: Max (text/reasoning) and Plus (multimodal).
- Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, ranking #5 overall, a 4.8-point gain over Qwen3.6 Max Preview.
- The 1M-token context window is roughly four times the 256K limit of Qwen3.6 Max Preview; text only, no image input.
- On AA-Omniscience, raw accuracy dropped while abstention rose, which is worth testing for knowledge-recall use cases.
- The model sustained 1,000+ tool calls and 35-hour autonomous execution in Alibaba’s internal testing only; no independent verification yet.
The post Qwen Introduces Qwen3.7-Max: A Reasoning Agent Model With a 1M-Token Context Window appeared first on MarkTechPost.



