Qwen’s Former Lead on What Hybrid Thinking Got Wrong — and Why He Now Backs Agents

Junyang Lin was the technical lead of Alibaba’s Qwen project. He announced he was stepping down on March 3, 2026. He now lists himself as an independent researcher on his personal website.

In a talk titled ‘Qwen: Towards a Generalist Model / Agent,’ he walks through the Qwen family. It ends on a single line: “Training models -> training agents.” He later expanded that line into a detailed post as an independent researcher. This article reads the talk and the detailed post together.

What Lin’s Talk Actually Covers

The talk is a tour of the Qwen model family, not a single release. It moves through QwQ-32B, Qwen2.5-Max, Qwen3, Qwen2.5-VL, and Qwen2.5-Omni. Each stop shows benchmark charts against contemporaries. The named baselines include DeepSeek-R1, Grok 3 Beta, Gemini 2.5 Pro, and OpenAI’s o-series.

The Qwen3 stop carries the most detail. Lin highlights hybrid thinking modes: a thinking mode for step-by-step reasoning, and a non-thinking mode for near-instant responses. He adds dynamic thinking budgets, so callers can cap how much the model reasons. Qwen3 expanded multilingual support from 29 to 119 languages and dialects.

The presentation lists many model types and sizes from 0.6B to 235B parameters. It also lists quantized formats including GGUF, GPTQ, AWQ, and MLX, all under Apache 2.0. Two demos follow: a Web Dev demo and a Deep Research demo. The closing “Future work” slide points at agents. It lists more pretraining, RL with environment feedback, longer context, and more modalities. The final item is the line “training models -> training agents.”

Qwen3 Architecture, As Shown in the Talk

The talk includes the Qwen3 architecture tables, reproduced below.

Model	Layers	Heads (Q/KV)	Tie Embedding / Experts (Total/Act.)	Context
Qwen3-0.6B	28	16 / 8	Tie: Yes	32K
Qwen3-1.7B	28	16 / 8	Tie: Yes	32K
Qwen3-4B	36	32 / 8	Tie: Yes	32K
Qwen3-8B	36	32 / 8	Tie: No	128K
Qwen3-14B	40	40 / 8	Tie: No	128K
Qwen3-32B	64	64 / 8	Tie: No	128K
Qwen3-30B-A3B	48	32 / 4	Experts: 128 / 8	128K
Qwen3-235B-A22B	94	64 / 4	Experts: 128 / 8	128K

The small dense models tie input and output embeddings and use a 32K context. The larger dense and MoE models drop tying and extend context to 128K. The two MoE models activate 8 of 128 experts per token.
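
To make "8 of 128 experts per token" concrete, here is a minimal sketch of top-k routing in plain Python. It is illustrative only: the router scores are random, and the `route_token` helper and its softmax-over-selected-experts normalization are assumptions made for this example, not Qwen3's actual implementation.

```python
import math
import random

NUM_EXPERTS = 128   # total experts per MoE layer, from the table above
TOP_K = 8           # experts activated per token, from the table above


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return (expert_id, weight) pairs.

    Hypothetical helper: weights are a softmax over the selected logits only,
    which is one common convention, not necessarily the one Qwen3 uses.
    """
    # Indices of the top_k largest router scores.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax restricted to the chosen experts (subtract the max for stability).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
routing = route_token(logits)

print(len(routing))            # 8 active experts
print(TOP_K / NUM_EXPERTS)     # 0.0625 of the experts run for each token
```

Only the selected experts run a forward pass for that token, which is why a model with many total parameters can have a much smaller active parameter count.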

Hybrid Thinking, and Why Merging is Hard

Lin presents hybrid thinking as a clean feature. The post explains why it was hard to build. Lin writes that thinking mode and instruct mode pull in opposite directions.

A strong instruct model is rewarded for directness, brevity, and low latency. A strong thinking model is rewarded for spending more tokens on hard problems. Merge the two carelessly, and both degrade. The thinking behavior gets bloated, and the instruct behavior gets less crisp.

Qwen3 attempted the merge with a four-stage post-training pipeline. That pipeline included a long-CoT cold start, reasoning RL, and a “thinking mode fusion” step. Later in 2025, the 2507 line shipped separate Instruct and Thinking variants instead. Lin frames the difficulty as a data problem more than a model problem.

Anthropic took the opposite route, and Lin calls it a useful corrective. Claude 3.7 Sonnet shipped as a hybrid model with a user-set thinking budget. Claude 4 let reasoning interleave with tool use, aimed at coding and long-running tasks. His point: a longer reasoning trace doesn’t make a model smarter. Thinking should be shaped by the target workload, not by the benchmark.

From ‘Reasoning’ Thinking to ‘Agentic’ Thinking

Lin draws a line between two eras. The first was reasoning thinking, defined by o1 and DeepSeek-R1. It taught the field that RL needs deterministic, verifiable rewards, so math, code, and logic became central. It also turned RL into a systems problem of large-scale rollouts and verification.
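
A deterministic, verifiable reward can be as simple as an exact-match check on a final answer. The sketch below is a hypothetical example for math-style outputs. The `Answer:` convention and the `math_reward` function are assumptions made for illustration and do not come from Lin's talk or post.

```python
import re
from fractions import Fraction


def math_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last 'Answer: <number>' equals the reference, else 0.0.

    Assumed convention: the model states its result as 'Answer: 3/4' or
    'Answer: 0.75'. Anything unparseable earns no reward.
    """
    matches = re.findall(r"Answer:\s*(-?\d+(?:\.\d+|/\d+)?)", completion)
    if not matches:
        return 0.0
    try:
        predicted = Fraction(matches[-1])   # use the last stated answer
        expected = Fraction(reference)
    except (ValueError, ZeroDivisionError):
        return 0.0
    return 1.0 if predicted == expected else 0.0


print(math_reward("Half of 1.5 is... Answer: 0.75", "3/4"))   # 1.0
print(math_reward("I think it is about one.", "3/4"))          # 0.0
```

The same input always yields the same score, and the check does not depend on how long or fluent the reasoning trace is, which is what makes rewards like this usable at rollout scale.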

The next era, in his framing, is agentic thinking: thinking in order to act. An agent formulates plans, decides when to act, uses tools, reads environment feedback, and revises. It is defined by closed-loop interaction with the world, not by a long internal monologue.

Lin lists what agentic thinking must handle that pure reasoning can avoid:

Deciding when to stop thinking and take an action
Choosing which tool to invoke, and in what order
Incorporating noisy or partial observations from the environment
Revising plans after failures
Maintaining coherence across many turns and many tool calls
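
The closed loop described above can be sketched in a few lines. Everything here is a toy under stated assumptions: `toy_policy` stands in for a model call, `ToyEnv` stands in for a real sandbox, and the tool names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str
    observation: str


@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    done: bool = False


def toy_policy(goal, history):
    """Stand-in for a model call: choose the next tool from past observations."""
    if not history:
        return "run_tests"
    last = history[-1].observation
    if "FAIL" in last:
        return "edit_file"       # revise the plan after a failure
    if "edited" in last:
        return "run_tests"       # re-check after acting
    return "finish"              # decide to stop and hand back the result


class ToyEnv:
    """Stand-in environment: tests fail until the file has been edited once."""

    def __init__(self):
        self.fixed = False

    def execute(self, action):
        if action == "edit_file":
            self.fixed = True
            return "edited 1 file"
        if action == "run_tests":
            return "PASS" if self.fixed else "FAIL: 1 test failed"
        return "unknown tool"


def run_agent(goal, env, policy, max_steps=8):
    traj = Trajectory()
    for _ in range(max_steps):              # hard cap keeps the loop bounded
        action = policy(goal, traj.steps)
        if action == "finish":
            traj.done = True
            break
        observation = env.execute(action)   # environment feedback
        traj.steps.append(Step(action, observation))
    return traj


result = run_agent("make the test suite pass", ToyEnv(), toy_policy)
print([s.action for s in result.steps])   # ['run_tests', 'edit_file', 'run_tests']
```

The policy never sees the answer in advance. It has to act, read the observation, and choose again, which is the property that separates this loop from a single long reasoning trace.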

The optimization target changes with the era. The table below summarizes the distinction Lin draws.

Dimension	Reasoning thinking	Agentic thinking
Judged by	Quality of internal deliberation before an answer	Whether progress is sustained while acting
Reward signal	Verifiable answers (math, code, logic)	Task success in an interactive environment
Core object of training	The model	The model plus its environment (the harness)
Infra bottleneck	Rollouts, verification, stable policy updates	Tool servers, sandboxes, train-serve decoupling
Main failure mode	Verbose, low-value reasoning traces	Reward hacking through tool access and env leaks

Use Cases, With Examples

The distinction changes how you build:

Coding agents: A reasoning model emits one patch from a stack trace. An agentic system runs the test harness, reads the real error, revises, and re-runs until the suite passes. Thinking here should help with codebase navigation, error recovery, and tool orchestration.
Deep research: A reasoning model writes a long answer from memory. An agentic system breaks the question into sub-queries, calls search, drops weak sources, and returns grounded citations. Qwen’s own Deep Research demo sits in this category.
Multi-agent orchestration: Lin expects ‘harness engineering’ to matter more. An orchestrator plans and routes work. Specialized sub-agents execute narrower tasks and help control context pollution.
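
The orchestration pattern in the last item can be sketched as a router that keeps only each sub-agent's summary. This is a hypothetical illustration: the sub-agent functions are stubs, and the `orchestrate` helper and its plan format are assumptions, not something shown in the talk.

```python
from typing import Callable, Dict, List, Tuple


def search_agent(task: str) -> str:
    # A real sub-agent would call a model and tools. This stub only simulates
    # a long private scratchpad that is discarded when the function returns.
    scratchpad = [f"query variant {i} for {task!r}" for i in range(50)]
    return f"summary: reviewed {len(scratchpad)} results for {task!r}"


def code_agent(task: str) -> str:
    scratchpad = [f"attempt {i}" for i in range(20)]
    return f"summary: patch ready after {len(scratchpad)} attempts for {task!r}"


SUB_AGENTS: Dict[str, Callable[[str], str]] = {
    "research": search_agent,
    "code": code_agent,
}


def orchestrate(plan: List[Tuple[str, str]]) -> List[str]:
    """Route each (kind, task) pair to a sub-agent and keep only its summary."""
    context: List[str] = []               # orchestrator context stays small
    for kind, task in plan:
        handler = SUB_AGENTS.get(kind)
        if handler is None:
            context.append(f"skipped: no sub-agent for {kind!r}")
            continue
        context.append(handler(task))     # the scratchpad never enters this list
    return context


plan = [("research", "GQA vs MHA"), ("code", "fix failing test"), ("design", "logo")]
for line in orchestrate(plan):
    print(line)
```

Each sub-agent can burn through dozens of intermediate steps, but the orchestrator's context grows by one line per task, which is one way to limit context pollution.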

A Concrete Hook: Qwen3 Thinking Toggle

Hybrid thinking is exposed directly in code. The enable_thinking flag switches modes in the chat template.

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Refactor this function and explain the change."}]

# enable_thinking=True  -> step-by-step thinking mode
# enable_thinking=False -> near-instant, non-thinking mode
text = tok.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True, enable_thinking=True,
)
inputs = tok(text, return_tensors="pt").to(model.device)

# Qwen's recommended sampling for thinking mode
out = model.generate(
    **inputs, max_new_tokens=2048,
    do_sample=True,  # sample instead of greedy decoding
    temperature=0.6, top_p=0.95, top_k=20,
)

# Decode only the newly generated tokens, skipping the prompt
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))

enable_thinking=True is the default, and the output wraps reasoning in a <think>...</think> block. Qwen3 also accepts soft switches. Appending /think or /no_think to a user turn flips the mode per message. That per-turn control is what dynamic thinking budgets build on.
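
Downstream code usually wants the reasoning and the final answer separately. The helpers below are a minimal string-based sketch. `split_thinking` and `with_soft_switch` are hypothetical names, and the approach assumes the decoded text contains at most one `<think>...</think>` block. It also tolerates output where only the closing tag is present.

```python
def split_thinking(decoded: str):
    """Split decoded output into (thinking, answer).

    Returns an empty thinking string when no </think> tag is found.
    """
    open_tag, close_tag = "<think>", "</think>"
    end = decoded.find(close_tag)
    if end == -1:
        return "", decoded.strip()
    start = decoded.find(open_tag)
    # If the opening tag is missing, treat everything before </think> as thinking.
    begin = start + len(open_tag) if start != -1 and start < end else 0
    thinking = decoded[begin:end].strip()
    answer = decoded[end + len(close_tag):].strip()
    return thinking, answer


def with_soft_switch(messages, think: bool):
    """Return a copy of messages with /think or /no_think appended to the last user turn."""
    switch = "/think" if think else "/no_think"
    updated = [dict(m) for m in messages]      # shallow copies, input stays untouched
    for m in reversed(updated):
        if m["role"] == "user":
            m["content"] = f"{m['content']} {switch}"
            break
    return updated


sample = "<think>\nCheck edge cases first.\n</think>\n\nUse a guard clause."
print(split_thinking(sample))   # ('Check edge cases first.', 'Use a guard clause.')

turns = [{"role": "user", "content": "Summarize this diff."}]
print(with_soft_switch(turns, think=False)[0]["content"])   # Summarize this diff. /no_think
```

The messages returned by `with_soft_switch` can be passed to `apply_chat_template` in place of the original list.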

Why Agentic RL Infrastructure is Harder

The presentation’s core engineering point is about infrastructure. In reasoning RL, rollouts are mostly self-contained trajectories with clean evaluators. In agentic RL, the policy lives inside a harness of tool servers, browsers, terminals, and sandboxes.

That harness forces a new requirement: training and inference must be cleanly decoupled. Without it, rollout throughput collapses. A coding agent waiting on live test execution stalls inference and starves training. GPU utilization drops well below what reasoning RL achieves.
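
One way to picture the decoupling is a buffer between rollout workers and the trainer. The sketch below uses threads and a queue as stand-ins. The function names, latency value, and batch size are assumptions for illustration, and a production system would use separate inference servers and weight synchronization rather than in-process threads.

```python
import queue
import threading
import time


def rollout_worker(worker_id, out_queue, episodes, env_latency=0.01):
    """Simulate an agent rollout that blocks on a slow environment call."""
    for episode in range(episodes):
        time.sleep(env_latency)                 # stand-in for live test execution
        out_queue.put({"worker": worker_id, "episode": episode, "reward": 1.0})


def trainer(in_queue, total, batch_size=4):
    """Consume finished trajectories in batches, independent of any single rollout."""
    batches, batch = [], []
    for _ in range(total):
        batch.append(in_queue.get())            # waits only until any worker finishes
        if len(batch) == batch_size:
            batches.append(batch)               # a real trainer would take a gradient step here
            batch = []
    if batch:
        batches.append(batch)
    return batches


def run(num_workers=4, episodes=3):
    buffer = queue.Queue()                      # the decoupling point between serving and training
    workers = [
        threading.Thread(target=rollout_worker, args=(w, buffer, episodes))
        for w in range(num_workers)
    ]
    for t in workers:
        t.start()
    batches = trainer(buffer, total=num_workers * episodes)
    for t in workers:
        t.join()
    return batches


batches = run()
print(len(batches), sum(len(b) for b in batches))   # 3 12
```

Because the trainer pulls from the buffer, one slow environment delays only its own trajectory instead of stalling the whole training step.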

Lin also reframes what to obsess over. In the SFT era, teams optimized data diversity. In the agent era, he argues teams should optimize environment quality: stability, realism, coverage, and exploit resistance. He names reward hacking as the hardest problem, because tool access enlarges the attack surface for spurious optimization.
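
As a narrow illustration of exploit resistance, a reward function can refuse credit when the agent modified the evaluation itself. This is a hypothetical sketch: the `Episode` fields, the protected path prefixes, and the `guarded_reward` function are assumptions, and a single check like this covers only one of many possible exploits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    tests_passed: bool
    files_written: List[str]


PROTECTED_PREFIXES = ("tests/", "ci/")   # assumed layout of the sandboxed repo


def guarded_reward(episode: Episode) -> float:
    """Give task reward only if the agent did not touch the evaluation itself."""
    tampered = [p for p in episode.files_written if p.startswith(PROTECTED_PREFIXES)]
    if tampered:
        return 0.0               # passing tests by editing them is not task success
    return 1.0 if episode.tests_passed else 0.0


honest = Episode(tests_passed=True, files_written=["src/parser.py"])
hacked = Episode(tests_passed=True, files_written=["tests/test_parser.py"])
print(guarded_reward(honest), guarded_reward(hacked))   # 1.0 0.0
```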

Key Takeaways

Junyang Lin announced his departure from Qwen on March 3, 2026, and now publishes as an independent researcher.
His talk ends on one thesis: the field is shifting from training models to training agents.
Agentic thinking is judged by sustained action in an environment, not by internal deliberation.
Agentic RL needs decoupled train-serve infra and high-quality environments, not just verifiable rewards.
Reward hacking is the central risk once models gain real tool access.

Sources:

Primary source: the talk

https://www.youtube.com/watch?v=b0xlsQ_6wUQ

Primary source: Junyang Lin’s blog

“From ‘Reasoning’ Thinking to ‘Agentic’ Thinking”: https://justinlin610.github.io/blog/from-reasoning-to-agentic-thinking/
His homepage (independent-researcher status): https://justinlin610.github.io/

Qwen3 technical details (architecture, 119 languages, hybrid thinking)

Qwen3 Technical Report (arXiv:2505.09388): https://arxiv.org/abs/2505.09388 · HTML: https://arxiv.org/html/2505.09388v1

Code verification (enable_thinking, /think, /no_think, sampling)

Qwen docs Quickstart: https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html
Qwen3-8B model card: https://huggingface.co/Qwen/Qwen3-8B
Qwen3-32B model card: https://huggingface.co/Qwen/Qwen3-32B

Departure details (cited in the article)

TechCrunch: https://techcrunch.com/2026/03/03/alibabas-qwen-tech-lead-steps-down-after-major-ai-push/
Bloomberg: https://www.bloomberg.com/news/articles/2026-03-04/alibaba-qwen-head-who-warned-of-openai-gap-steps-down
VentureBeat: https://venturebeat.com/technology/did-alibaba-just-kneecap-its-powerful-qwen-ai-team-key-figures-depart-in

Supporting departure/context coverage (used for cross-checking, not all cited inline)

RecodeChinaAI (LatePost translation): https://www.recodechinaai.com/p/alibabas-qwen-lead-just-stepped-down
Simon Willison: https://simonwillison.net/2026/Mar/4/qwen/
Geopolitechs: https://www.geopolitechs.org/p/inside-the-stepping-down-of-qwens
OfficeChai: https://officechai.com/ai/alibaba-qwens-tech-lead-junyang-lin-steps-down/
MLQ News: https://mlq.ai/news/key-researcher-steps-down-from-alibabas-qwen-ai-project/
GenAI Assembling (essay analysis, used to first locate the essay): https://genaiassembling.substack.com/p/what-junyang-lin-saw

Two X posts

https://x.com/h100envy/status/2068987470960623783
https://x.com/h100envy/status/2073433806254624930

This post first appeared on MarkTechPost.
