
Qwen Introduces Qwen3.7-Max: A Reasoning Agent Model With a 1M-Token Context Window

Most AI models today are not designed for sustained, multi-step autonomous execution. Tasks like running hundreds of iterative code modifications, or chaining tool calls across hours without human intervention, require a different kind of model architecture and training focus.

Alibaba’s Qwen team officially announced Qwen3.7-Max at the 2026 Alibaba Cloud Summit on May 20. Separately, two preview versions of the Qwen3.7 series quietly appeared on Arena AI’s leaderboard with no press release and no official API announcement.

Two Preview Models Released Simultaneously

Alibaba previewed two models simultaneously: Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview. They ranked 13th globally in text capabilities and 16th in vision capabilities, respectively, according to LM Arena.

In Text Arena, Qwen3.7-Max-Preview ranked #13 overall, placing Alibaba as the #6 lab in text. In Vision Arena, Qwen3.7-Plus-Preview ranked #16 overall, placing Alibaba as the #5 lab in vision. The model rank and the lab rank are separate figures.

Qwen3.7-Plus-Preview is described as a high-performance balanced model preview, focusing on reasoning and logical expression, with its toolchain to be gradually opened in the future. It handles vision and multimodal inputs. Qwen3.7-Max is the text-only reasoning flagship. This article covers Qwen3.7-Max, as it is the model Alibaba officially announced with API access.

What is Qwen3.7-Max Designed For

Alibaba’s Qwen team described Qwen3.7-Max as its most advanced and comprehensive agent model to date. The model is proprietary and closed-weight. It is capable of handling coding and debugging, office workflow automation, and long-horizon tasks spanning hundreds or even thousands of steps.

Extended-Thinking Mode

Qwen3.7-Max is a reasoning model. The model generates a chain of thought first: an internal sequence of steps where it plans, checks its work, and corrects course before committing to a final answer. On interfaces like Qwen Chat, this shows up as a ‘Thinking’ mode you can switch on to see the model’s reasoning trace.

Reasoning models produce significantly more output tokens than standard completions. When Artificial Analysis ran its Intelligence Index evaluation, Qwen3.7-Max generated about 97 million tokens, compared to an average of 24 million for models on that benchmark. For short or simple tasks, this overhead adds latency without improving output quality. For multi-step planning, code refactoring, or long agent chains, extended-thinking mode is where the model’s strength applies.

Context Window

The model features a 1M-token context window, up from 256K on Qwen3.6 Max Preview. It supports text input and output only. Pricing has not yet been announced. Qwen3.6 Max Preview was priced at $1.30/$7.80 per million input/output tokens on Alibaba Cloud.
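Until official pricing lands, the Qwen3.6 Max Preview rates can serve as a rough stand-in for budgeting. The sketch below is only an estimate under that assumption: the two price constants are the Qwen3.6 figures quoted above, not Qwen3.7-Max prices, and the function name is illustrative.

```python
# Assumption: Qwen3.6 Max Preview list prices, used only as a stand-in
# because Qwen3.7-Max pricing has not been announced.
INPUT_USD_PER_M = 1.30
OUTPUT_USD_PER_M = 7.80


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate cost in USD from raw token counts."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# One request that fills the whole 1M-token context window (input side only).
full_context_input = estimate_cost_usd(1_000_000, 0)

# Output volume reported for the Intelligence Index run vs. the benchmark average.
benchmark_output = estimate_cost_usd(0, 97_000_000)
average_output = estimate_cost_usd(0, 24_000_000)

print(round(full_context_input, 2), round(benchmark_output, 2), round(average_output, 2))
```

At these stand-in rates, the 97M output tokens would cost roughly four times the 24M average, which is why reasoning-token volume matters for budgeting.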

A million-token context window can hold a full mid-sized code repository or a large stack of documents in a single request. Models often reason less reliably as the context window fills. Independent long-context testing for Qwen3.7-Max is not yet available.

Benchmark Results

Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, placing it fifth overall. That represents a 4.8-point gain over its predecessor Qwen3.6 Max Preview (51.8), and puts it ahead of Google’s Gemini 3.5 Flash (55.3). GPT-5.5 (60.2), Claude Opus 4.7 (57.3), and Gemini 3.1 Pro Preview (57.2) still lead the overall rankings.

The Intelligence Index v4.0 aggregates ten evaluations, including GDPval-AA, Terminal-Bench Hard, SciCode, AA-Omniscience, Humanity’s Last Exam, and GPQA Diamond.

https://qwen.ai/blog?id=qwen3.7

The improvement over Qwen3.6 Max Preview is not uniform. Most of the Index gains are concentrated in scientific reasoning, agentic capability, and coding. CritPt rose 9.7 percentage points (from 3.7% to 13.4%), Humanity’s Last Exam jumped 9.2 points (from 28.9% to 38.1%), and Terminal-Bench Hard climbed 6.9 points (from 43.9% to 50.8%). GDPval-AA added 42 Elo points (from 1504 to 1546). Scores on other benchmarks are largely flat compared to Qwen3.6 Max Preview.
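As a quick sanity check, the reported gains can be recomputed from the before/after pairs. The snippet below only restates figures quoted in this article; the dictionary layout is an illustrative choice.

```python
# (before, after) pairs as reported in this article for
# Qwen3.6 Max Preview -> Qwen3.7-Max.
scores = {
    "Intelligence Index": (51.8, 56.6),
    "CritPt (%)": (3.7, 13.4),
    "Humanity's Last Exam (%)": (28.9, 38.1),
    "Terminal-Bench Hard (%)": (43.9, 50.8),
    "GDPval-AA (Elo)": (1504, 1546),
}

# Round to one decimal place to hide floating-point noise.
deltas = {name: round(after - before, 1) for name, (before, after) in scores.items()}

for name, delta in deltas.items():
    print(f"{name}: +{delta}")
```

The recomputed differences match the gains stated in the text (4.8, 9.7, 9.2, 6.9, and 42).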

One result on the Index requires careful reading. On AA-Omniscience, Qwen3.7-Max’s raw accuracy actually dropped 7.6 percentage points (from 37.7% to 30.1%), while its hallucination rate fell 21.3 points (from 44.2% to 22.9%). The model is choosing to say “I don’t know” more often rather than recalling more facts. Its attempt rate fell from 67.3% to 48.0%, the lowest among frontier models in the comparison. The AA-Omniscience benchmark rewards correct answers and penalizes hallucinations but has no penalty for refusing to answer. For use cases that depend on broad factual recall, this is a meaningful limitation to test against your workload.
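To see why abstaining can raise a score even as accuracy falls, consider a toy scoring rule with the same shape as the one described above. This is not Artificial Analysis’s published formula: the +1 / -1 / 0 weighting is an assumption, and the two example models use made-up counts rather than the benchmark’s data.

```python
def abstention_aware_score(correct, incorrect, abstained):
    """Toy score in [-100, 100]: +1 per correct answer, -1 per incorrect
    answer, 0 per abstention (assumed weights, for illustration only)."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions scored")
    return 100.0 * (correct - incorrect) / total


# Hypothetical model that answers every question: 40% accuracy.
guesser = abstention_aware_score(correct=40, incorrect=60, abstained=0)

# Hypothetical model that declines half the questions: 35% accuracy.
abstainer = abstention_aware_score(correct=35, incorrect=15, abstained=50)

print(guesser, abstainer)
```

Under this rule the second model scores higher despite answering fewer questions correctly, which is the trade-off to weigh for recall-heavy workloads.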

In Text Arena, Qwen3.7-Max-Preview ranked #13 overall with an Elo score of 1,475. Category rankings include #7 in Math, #9 in Expert Prompts, #9 in Software and IT, and #10 in Coding.

All benchmark numbers are preliminary. The model carries a ‘Preview’ label, indicating Alibaba considers it an early build.

Agentic Performance — Internal Test

In an internal Alibaba test on a new chip platform, the model autonomously performed more than 1,000 tool calls and iterative code modifications to optimize a key kernel. Alibaba claimed the process improved inference speed by roughly 10x compared with the previous version.

Marktechpost’s Visual Explainer

How to Use Qwen3.7-Max
A practical guide for developers & data scientists

May 2026






Slide 1 of 6
What is Qwen3.7-Max?
A proprietary reasoning model from Alibaba, designed for long-horizon agent tasks, code generation, and multi-step automation.
Context Window

1 million tokens, enough to fit a full mid-sized code repository in a single request.

Reasoning Model

Uses chain-of-thought (extended-thinking mode) before producing a final answer.

Input / Output

Text in, text out. No image input is supported on this model.

API String

Use qwen3.7-max when calling via Alibaba Cloud Model Studio.

API compatible with the OpenAI & Anthropic specs
Preview: no open weights yet

Slide 2 of 6
Quick Start: Chat Interface
The fastest way to test Qwen3.7-Max, with no API key or setup required.
  1. Go to Qwen Chat
     Navigate to chat.qwen.ai and create a free account.

  2. Select the model
     In the model selector dropdown, choose Qwen3.7-Max. It may appear as Qwen3.7-Max-Preview during the preview period.

  3. Enable Thinking Mode
     Toggle on Thinking Mode in the chat interface. This activates chain-of-thought reasoning and shows the model’s internal reasoning trace before the final answer.

  4. Send your prompt
     Type your query. For best results on complex tasks, be specific about steps, constraints, and expected output format.

💡 Use your hardest real-world prompts when testing. Multi-step math problems, complex refactoring requests, and ambiguous expert questions reveal more about model quality than simple prompts.

Slide 3 of 6
API Access
Qwen3.7-Max is compatible with both OpenAI and Anthropic API specifications. You can plug it into existing pipelines with minimal changes.
OpenAI-compatible Python call
from openai import OpenAI

# Point the OpenAI SDK at DashScope's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

response = client.chat.completions.create(
    model="qwen3.7-max",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "Explain chain-of-thought reasoning."}
    ]
)

# Print the final answer text from the first choice.
print(response.choices[0].message.content)

ℹ Get your API key from Alibaba Cloud Model Studio (DashScope). The base URL for international access is dashscope-intl.aliyuncs.com.
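Rather than hardcoding the key, it is usually safer to read it from the environment. The sketch below assumes the key is exported as DASHSCOPE_API_KEY; that variable name and the load_client_config helper are illustrative choices, not something this article specifies.

```python
import os

# Base URL for international access, as given in the slide above.
INTL_BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"


def load_client_config(env=None):
    """Return keyword arguments for OpenAI(...) built from the environment.

    Assumption: the key lives in DASHSCOPE_API_KEY (a naming convention,
    not a requirement stated in the article).
    """
    env = os.environ if env is None else env
    api_key = env.get("DASHSCOPE_API_KEY")
    if not api_key:
        raise RuntimeError("Set DASHSCOPE_API_KEY before creating the client.")
    return {"api_key": api_key, "base_url": INTL_BASE_URL}
```

With the OpenAI SDK installed, the result can be passed straight through as `OpenAI(**load_client_config())`.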

⚠ Pricing has not yet been announced for Qwen3.7-Max. For reference, Qwen3.6 Max Preview was priced at $1.30 / $7.80 per million input/output tokens.

Slide 4 of 6
Understanding Thinking Mode
Thinking Mode is the model’s chain-of-thought reasoning layer. It determines how the model approaches a problem before producing a response.
When to use it

Multi-step code refactoring, complex math proofs, long agent task chains, and ambiguous problems requiring step-by-step planning.

When to skip it

Short rewrites, simple classifications, quick lookups, or tasks where latency and token cost need to be minimised.


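API: Enable thinking via extra_body
# Reuses the client object created in the API Access example.
response = client.chat.completions.create(
    model="qwen3.7-max",
    messages=[{"role": "user", "content": "Your prompt here"}],
    # extra_body forwards non-standard fields such as enable_thinking to the endpoint.
    extra_body={"enable_thinking": True}
)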

💡 Qwen3.7-Max generated ~97M tokens on Artificial Analysis benchmarks, vs. a mean of 24M for comparable models. Each thinking token adds to latency and cost, so use thinking mode selectively.
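One way to apply thinking mode selectively is to decide per request, based on the kind of task. The sketch below builds the keyword arguments for a chat completion call; the task labels and the build_request helper are hypothetical, and only the enable_thinking flag comes from the example above.

```python
# Hypothetical task labels; adapt them to your own workload taxonomy.
THINKING_TASKS = {"refactor", "math_proof", "agent_chain", "planning"}


def build_request(prompt, task_type, model="qwen3.7-max"):
    """Build keyword arguments for client.chat.completions.create(...)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Turn thinking on only for task types that benefit from it.
        "extra_body": {"enable_thinking": task_type in THINKING_TASKS},
    }


heavy = build_request("Refactor this module into smaller functions.", "refactor")
light = build_request("Classify this ticket as bug or feature.", "classification")
print(heavy["extra_body"], light["extra_body"])
```

The returned dictionary can be unpacked into the call, for example `client.chat.completions.create(**heavy)`.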

Slide 5 of 6
Agentic and Long-Horizon Tasks
Qwen3.7-Max is designed to run long, autonomous task loops. In Alibaba’s internal testing, it executed 1,000+ tool calls and sustained autonomous execution for up to 35 hours.
  1. Define tools clearly
     Pass tool definitions in the standard OpenAI tools parameter. The model supports function calling and iterative tool invocation natively.

  2. Use the 1M context window deliberately
     Pass full task history, prior tool outputs, and code state into context. Trim aggressively when the full context is not needed; every token is billed.

  3. Target the final answer in assertions
     Reasoning output is longer and more variable than a standard completion. When writing tests, assert on the final answer, not the exact wording of the thinking trace.

  4. Good use cases
     Kernel optimisation, code debugging loops, office workflow automation, and multi-step data pipelines with iterative verification.
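For the first step, a tool definition and a local dispatcher might look like the sketch below. The run_benchmark tool, its schema, and its stubbed result are hypothetical; only the overall `tools` layout follows the OpenAI function-calling format mentioned above.

```python
import json

# Hypothetical tool: name, description, and schema are illustrative only.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "run_benchmark",
            "description": "Run a named kernel benchmark and return its latency in ms.",
            "parameters": {
                "type": "object",
                "properties": {"kernel": {"type": "string"}},
                "required": ["kernel"],
            },
        },
    }
]


def run_benchmark(kernel):
    # Stub standing in for a real measurement.
    return {"kernel": kernel, "latency_ms": 12.5}


HANDLERS = {"run_benchmark": run_benchmark}


def dispatch_tool_call(name, arguments_json):
    """Execute one tool call and return a JSON string for the tool message."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps(handler(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Report malformed arguments back to the model instead of crashing the loop.
        return json.dumps({"error": str(exc)})
```

In an agent loop, TOOLS would be passed as the `tools` parameter, and each tool call the model returns would go through dispatch_tool_call before its result is appended to the message history.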

⚠ The 35-hour and 1,000+ tool call figures come from Alibaba’s internal testing only. No independent verification exists for these specific claims.
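For the advice on trimming context, one simple policy is to keep system messages and then the newest turns that fit a token budget. The sketch below is a minimal version under stated assumptions: the 4-characters-per-token estimate is a crude heuristic, not the model’s tokenizer, and both helper names are illustrative.

```python
def estimate_tokens(text):
    # Crude heuristic (assumption): roughly 4 characters per token.
    return max(1, len(text) // 4)


def trim_history(messages, budget_tokens):
    """Keep system messages plus the newest other messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    # Restore chronological order for the messages that were kept.
    return system + list(reversed(kept))
```

A production loop would also need to keep each tool result together with the assistant turn that requested it, which this sketch does not attempt.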

Slide 6 of 6
Known Limitations
Understanding these limitations before integrating will save debugging time and help you set the right expectations.
No image input

Qwen3.7-Max is text-only. For multimodal tasks, use Qwen3.7-Plus-Preview instead, which supports vision input.

AA-Omniscience abstention

On the AA-Omniscience benchmark, the model’s attempt rate dropped from 67.3% to 48.0%. It abstains more and hallucinates less, but its raw factual recall also dropped. Test carefully for knowledge-recall tasks.

Preview status

The model currently carries a -Preview suffix. Benchmark scores, behaviour, and pricing can change before stable release. No open-weight version is available as of May 2026.

Long-context reliability

A 1M token context window is a ceiling, not a guarantee. Independent long-context testing for Qwen3.7-Max is not yet available. Validate retrieval quality on your specific workload.
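One common way to validate long-context retrieval is a "needle in a haystack" probe: plant a known fact at a chosen depth in filler text, ask the model for it, and check the answer. The helpers below are a minimal sketch of the prompt-building and checking steps only; the function names are illustrative and the call to the model is left out.

```python
def build_needle_prompt(filler_paragraphs, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_paragraphs))
    parts = filler_paragraphs[:position] + [needle] + filler_paragraphs[position:]
    return "\n\n".join(parts)


def needle_recalled(answer, expected):
    """Case-insensitive check that the expected fact appears in the answer."""
    return expected.lower() in answer.lower()
```

Running the probe at several depths and context lengths drawn from your own documents gives a rough picture of where retrieval starts to degrade.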

ℹ For the latest model updates, check the official Qwen blog at qwen.ai/blog and Alibaba Cloud Model Studio docs.

Key Takeaways:

  • Alibaba released two Qwen3.7 preview models: Max (text/reasoning) and Plus (multimodal).
  • Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, ranking #5 overall, a 4.8-point gain over Qwen3.6 Max Preview.
  • The 1M-token context window is roughly four times the 256K limit of Qwen3.6 Max Preview; text only, no image input.
  • On AA-Omniscience, raw accuracy dropped while abstention rose, which is worth testing for knowledge-recall use cases.
  • The model sustained 1,000+ tool calls and 35-hour autonomous execution in Alibaba’s internal testing only; there is no independent verification yet.


The post Qwen Introduces Qwen3.7-Max: A Reasoning Agent Model With a 1M-Token Context Window appeared first on MarkTechPost.
