
Qwen’s Former Lead on What Hybrid Thinking Got Wrong — and Why He Now Backs Agents

Junyang Lin was the technical lead of Alibaba’s Qwen project. He announced he was stepping down on March 3, 2026. He now lists himself as an independent researcher on his personal website.

In a talk titled ‘Qwen: Towards a Generalist Model / Agent,’ he walks through the Qwen family. It ends on a single line: “Training models -> training agents.” He later expanded that line into a detailed post as an independent researcher. This article reads the talk and the detailed post together.

What Lin’s Talk Actually Covers

The talk is a tour of the Qwen model family, not a single release. It moves through QwQ-32B, Qwen2.5-Max, Qwen3, Qwen2.5-VL, and Qwen2.5-Omni. Each stop shows benchmark charts against contemporaries. The named baselines include DeepSeek-R1, Grok 3 Beta, Gemini 2.5 Pro, and OpenAI’s o-series.

The Qwen3 stop carries the most detail. Lin highlights hybrid thinking modes: a thinking mode for step-by-step reasoning, and a non-thinking mode for near-instant responses. He adds dynamic thinking budgets, so callers can cap how much the model reasons. Qwen3 expanded multilingual support from 29 to 119 languages and dialects.

The presentation lists many model types and sizes from 0.6B to 235B parameters. It also lists quantized formats including GGUF, GPTQ, AWQ, and MLX, all under Apache 2.0. Two demos follow: a Web Dev demo and a Deep Research demo. The closing “Future work” slide points at agents. It lists more pretraining, RL with environment feedback, longer context, and more modalities. The last key item is the line “training models -> training agents.”

Qwen3 Architecture, As Shown in the Talk

The talk includes the Qwen3 architecture tables, reproduced below.

Model | Layers | Heads (Q/KV) | Tie Embedding / Experts (Total/Act.) | Context
Qwen3-0.6B | 28 | 16 / 8 | Tie: Yes | 32K
Qwen3-1.7B | 28 | 16 / 8 | Tie: Yes | 32K
Qwen3-4B | 36 | 32 / 8 | Tie: Yes | 32K
Qwen3-8B | 36 | 32 / 8 | Tie: No | 128K
Qwen3-14B | 40 | 40 / 8 | Tie: No | 128K
Qwen3-32B | 64 | 64 / 8 | Tie: No | 128K
Qwen3-30B-A3B | 48 | 32 / 4 | Experts: 128 / 8 | 128K
Qwen3-235B-A22B | 94 | 64 / 4 | Experts: 128 / 8 | 128K

The small dense models tie input and output embeddings and use a 32K context. The larger dense and MoE models drop tying and extend context to 128K. The two MoE models activate 8 of 128 experts per token.
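The ratios implied by the table can be checked with a few lines of arithmetic. The Heads (Q/KV) column lists more query heads than key/value heads, which is the grouped-query attention pattern. The sketch below copies the counts from the table above; the variable and function names are illustrative and do not come from the talk.

```python
# Head and expert counts copied from the architecture table above.
# All names here (HEADS, MOE_EXPERTS, the helper functions) are illustrative.
HEADS = {
    "Qwen3-0.6B": (16, 8),
    "Qwen3-1.7B": (16, 8),
    "Qwen3-4B": (32, 8),
    "Qwen3-8B": (32, 8),
    "Qwen3-14B": (40, 8),
    "Qwen3-32B": (64, 8),
    "Qwen3-30B-A3B": (32, 4),
    "Qwen3-235B-A22B": (64, 4),
}
MOE_EXPERTS = {
    "Qwen3-30B-A3B": (128, 8),
    "Qwen3-235B-A22B": (128, 8),
}


def queries_per_kv_head(q_heads: int, kv_heads: int) -> int:
    """Number of query heads that share each key/value head."""
    return q_heads // kv_heads


def active_expert_fraction(total: int, active: int) -> float:
    """Share of experts that run for a single token."""
    return active / total


for model, (q, kv) in HEADS.items():
    print(f"{model}: {queries_per_kv_head(q, kv)} query heads per KV head")

for model, (total, active) in MOE_EXPERTS.items():
    print(f"{model}: {active_expert_fraction(total, active):.2%} of experts active per token")
```

The expert fraction only counts experts. It is not the share of total parameters that run per token, since attention and embedding weights are always active.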

Hybrid Thinking, and Why Merging is Hard

Lin presents hybrid thinking as a clean feature. The post explains why it was hard to build. Lin writes that thinking mode and instruct mode pull in opposite directions.

A strong instruct model is rewarded for directness, brevity, and low latency. A strong thinking model is rewarded for spending more tokens on hard problems. Merge the two carelessly, and both degrade. The thinking behavior gets bloated, and the instruct behavior gets less crisp.

Qwen3 attempted the merge with a four-stage post-training pipeline. That pipeline included a long-CoT cold start, reasoning RL, and a “thinking mode fusion” step. Later in 2025, the 2507 line shipped separate Instruct and Thinking variants instead. Lin frames this as a data problem more than a model problem.

Anthropic took the opposite route, and Lin calls it a useful corrective. Claude 3.7 Sonnet shipped as a hybrid model with a user-set thinking budget. Claude 4 let reasoning interleave with tool use, aimed at coding and long-running tasks. His point: a longer reasoning trace does not make a model smarter. Thinking should be shaped by the target workload, not by the benchmark.


From ‘Reasoning’ Thinking to ‘Agentic’ Thinking

Lin draws a line between two eras. The first was reasoning thinking, defined by o1 and DeepSeek-R1. It taught the field that RL needs deterministic, verifiable rewards, so math, code, and logic became central. It also turned RL into a systems problem of large-scale rollouts and verification.

The next era, in his framing, is agentic thinking: thinking in order to act. An agent formulates plans, decides when to act, uses tools, reads environment feedback, and revises. It is defined by closed-loop interaction with the world, not by a long internal monologue.

Lin lists what agentic thinking must handle that pure reasoning can avoid:

  • Deciding when to stop thinking and take an action
  • Choosing which tool to invoke, and in what order
  • Incorporating noisy or partial observations from the environment
  • Revising plans after failures
  • Maintaining coherence across many turns and many tool calls
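A toy loop can make the closed-loop idea concrete. The sketch below is a minimal illustration, not Lin's or Qwen's implementation: the tools, the `choose_tool` policy, and the integer "environment" are all hypothetical. It shows three of the listed behaviors: picking a tool, reading the observation after each action, and deciding when to stop. The coarse `add_five` tool can overshoot, so the policy has to correct course from what it observes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class AgentState:
    goal: int
    value: int = 0  # latest observation from the toy environment
    history: List[str] = field(default_factory=list)


# Hypothetical tools: each maps the current value to a new observation.
TOOLS: Dict[str, Callable[[int], int]] = {
    "add_five": lambda v: v + 5,
    "add_one": lambda v: v + 1,
    "subtract_one": lambda v: v - 1,
}


def choose_tool(state: AgentState) -> Optional[str]:
    """Toy policy: pick the next tool from the latest observation, or None to stop."""
    gap = state.goal - state.value
    if gap == 0:
        return None  # goal reached: stop acting
    if gap >= 3:
        return "add_five"  # coarse step that may overshoot
    return "add_one" if gap > 0 else "subtract_one"


def run_agent(goal: int, max_steps: int = 20) -> AgentState:
    """Act, observe, and re-decide until the goal is met or the step budget runs out."""
    state = AgentState(goal=goal)
    for _ in range(max_steps):
        tool = choose_tool(state)
        if tool is None:
            break
        state.value = TOOLS[tool](state.value)  # act, then read the observation
        state.history.append(f"{tool} -> {state.value}")
    return state


if __name__ == "__main__":
    final = run_agent(goal=4)
    print(final.value, final.history)
```

With a goal of 4, the first action overshoots to 5 and the next step corrects it. The decision after each step depends on the observation, not on a plan fixed in advance.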

The optimization target changes with the era. The table below summarizes the distinction Lin draws.

Dimension | Reasoning thinking | Agentic thinking
Judged by | Quality of internal deliberation before an answer | Whether progress is sustained while acting
Reward signal | Verifiable answers (math, code, logic) | Task success in an interactive environment
Core object of training | The model | The model plus its environment (the harness)
Infra bottleneck | Rollouts, verification, stable policy updates | Tool servers, sandboxes, train-serve decoupling
Main failure mode | Verbose, low-value reasoning traces | Reward hacking through tool access and env leaks

Use Cases, With Examples

The distinction changes how you build:

  • Coding agents: A reasoning model emits one patch from a stack trace. An agentic system runs the test harness, reads the real error, revises, and re-runs until the suite passes. Thinking here should help with codebase navigation, error recovery, and tool orchestration.
  • Deep research: A reasoning model writes a long answer from memory. An agentic system breaks the question into sub-queries, calls search, drops weak sources, and returns grounded citations. Qwen’s own Deep Research demo sits in this category.
  • Multi-agent orchestration: Lin expects ‘harness engineering’ to matter more. An orchestrator plans and routes work. Specialized sub-agents execute narrower tasks and help control context pollution.
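The coding-agent case can be sketched as a run-tests-and-revise loop. This is a minimal illustration under stated assumptions: `run_tests` is a toy harness for a hypothetical `add` function, and the fixed `CANDIDATES` list stands in for successive patches. In a real agent, each error message would be fed back to the model to produce the next patch.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def run_tests(fn: Callable[[int, int], int]) -> Optional[str]:
    """Toy test harness: return None if every case passes, else an error message."""
    cases = [((2, 3), 5), ((-1, 1), 0), ((0, 0), 0)]
    for args, expected in cases:
        try:
            got = fn(*args)
        except Exception as exc:
            return f"{type(exc).__name__}: {exc}"
        if got != expected:
            return f"add{args} returned {got!r}, expected {expected!r}"
    return None


# Hypothetical candidate patches, standing in for successive model outputs.
CANDIDATES: List[Callable[[int, int], int]] = [
    lambda a, b: a - b,       # wrong operator
    lambda a, b: str(a + b),  # wrong return type
    lambda a, b: a + b,       # correct
]


def agentic_fix(
    candidates: Sequence[Callable[[int, int], int]],
) -> Tuple[Optional[int], List[str]]:
    """Run the suite on each candidate; return the passing attempt number and the errors seen."""
    errors: List[str] = []
    for attempt, fn in enumerate(candidates, start=1):
        error = run_tests(fn)
        if error is None:
            return attempt, errors
        errors.append(error)  # a real agent would condition the next patch on this
    return None, errors


if __name__ == "__main__":
    print(agentic_fix(CANDIDATES[:1]))  # one-shot: a single patch, no retry
    print(agentic_fix(CANDIDATES))      # closed loop: retry until the suite passes
```

Limiting the loop to the first candidate mimics the one-patch behavior described above. The full list shows the error messages accumulating until a patch passes.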

A Concrete Hook: Qwen3 Thinking Toggle

Hybrid thinking is exposed directly in code. The enable_thinking flag switches modes in the chat template.

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Refactor this function and explain the change."}]

# enable_thinking=True  -> step-by-step thinking mode
# enable_thinking=False -> near-instant, non-thinking mode
text = tok.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True, enable_thinking=True,
)
inputs = tok(text, return_tensors="pt").to(model.device)

# Qwen's recommended sampling for thinking mode
# do_sample=True makes sure the sampling parameters are actually applied
out = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    temperature=0.6, top_p=0.95, top_k=20,
)

# Decode only the newly generated tokens, skipping the prompt
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))

enable_thinking=True is the default, and the output wraps reasoning in a <think>...</think> block. Qwen3 also accepts soft switches. Appending /think or /no_think to a user turn flips the mode per message. That per-turn control is what dynamic thinking budgets build on.
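Downstream code usually needs the final answer separated from the reasoning. The helper below is a minimal sketch that splits on the closing tag as plain text, assuming the tags survive decoding. `split_thinking` and `with_soft_switch` are hypothetical helper names, not part of the Qwen or Transformers APIs.

```python
from typing import Tuple

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_thinking(output_text: str) -> Tuple[str, str]:
    """Split decoded model output into (reasoning, answer) around the closing tag."""
    if CLOSE_TAG not in output_text:
        # Non-thinking mode, or the reasoning was cut off before the tag closed.
        return "", output_text.strip()
    reasoning, answer = output_text.split(CLOSE_TAG, 1)
    reasoning = reasoning.replace(OPEN_TAG, "", 1)
    return reasoning.strip(), answer.strip()


def with_soft_switch(user_text: str, thinking: bool) -> str:
    """Append the per-turn soft switch to a user message."""
    return f"{user_text} {'/think' if thinking else '/no_think'}"


if __name__ == "__main__":
    sample = "<think>\nCheck the loop bounds first.\n</think>\n\nUse range(len(xs))."
    print(split_thinking(sample))
    print(with_soft_switch("Summarize this diff.", thinking=False))
```

If generation stops at the token limit before the closing tag appears, this sketch treats the whole output as the answer. A stricter caller may prefer to raise an error in that case.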

Why Agentic RL Infrastructure is Harder

The presentation’s core engineering point is about infrastructure. In reasoning RL, rollouts are mostly self-contained trajectories with clean evaluators. In agentic RL, the policy lives inside a harness of tool servers, browsers, terminals, and sandboxes.

That harness forces a new requirement: training and inference must be cleanly decoupled. Without it, rollout throughput collapses. A coding agent waiting on live test execution stalls inference and starves training. GPU utilization drops well below what reasoning RL achieves.
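One common way to express this decoupling is a producer-consumer queue: rollout workers wait on slow environments and push finished trajectories, while the trainer consumes batches as they arrive. The sketch below uses asyncio with sleeps as stand-ins for tool or test latency. It is an illustration of the general pattern under those assumptions, not the infrastructure Lin describes, and all names are hypothetical.

```python
import asyncio
from typing import List, Tuple

Trajectory = Tuple[int, int]  # (worker_id, episode), a stand-in for real rollout data


async def rollout_worker(
    worker_id: int, queue: asyncio.Queue, episodes: int, env_latency: float
) -> None:
    """Produce trajectories; the sleep stands in for slow tool or test execution."""
    for episode in range(episodes):
        await asyncio.sleep(env_latency)
        await queue.put((worker_id, episode))


async def trainer(
    queue: asyncio.Queue, total: int, batch_size: int
) -> List[List[Trajectory]]:
    """Consume trajectories as they arrive and group them into training batches."""
    batches: List[List[Trajectory]] = []
    batch: List[Trajectory] = []
    for _ in range(total):
        batch.append(await queue.get())
        if len(batch) == batch_size:
            batches.append(batch)  # a real trainer would run a policy update here
            batch = []
    if batch:
        batches.append(batch)
    return batches


async def main(
    workers: int = 4, episodes: int = 3, batch_size: int = 4
) -> List[List[Trajectory]]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=16)
    producers = [
        asyncio.create_task(rollout_worker(i, queue, episodes, 0.01 * (i + 1)))
        for i in range(workers)
    ]
    batches = await trainer(queue, workers * episodes, batch_size)
    await asyncio.gather(*producers)
    return batches


if __name__ == "__main__":
    for batch in asyncio.run(main()):
        print(batch)
```

Slow workers delay only their own trajectories. The trainer fills batches from whichever workers finish first, so one stalled environment does not block the rest.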

Lin also reframes what to obsess over. In the SFT era, teams optimized data diversity. In the agent era, he argues teams should optimize environment quality: stability, realism, coverage, and exploit resistance. He names reward hacking as the hardest problem, because tool access enlarges the attack surface for spurious optimization.
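One concrete form of exploit resistance is refusing reward when a trajectory tampers with the grader. The sketch below is a hypothetical example of a single guard, assuming a coding environment whose tests live under protected paths. The path layout, the `Trajectory` fields, and the function name are illustrative and do not come from Lin’s post.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical layout: files the agent must never modify.
PROTECTED_PREFIXES = ("tests/", "grader/")


@dataclass
class Trajectory:
    tests_passed: bool
    files_written: List[str] = field(default_factory=list)


def guarded_reward(traj: Trajectory) -> float:
    """Give reward for passing tests only if no protected file was written."""
    tampered = any(path.startswith(PROTECTED_PREFIXES) for path in traj.files_written)
    if tampered:
        return 0.0  # passing by editing the tests is not task success
    return 1.0 if traj.tests_passed else 0.0


if __name__ == "__main__":
    honest = Trajectory(tests_passed=True, files_written=["src/parser.py"])
    hacked = Trajectory(tests_passed=True, files_written=["tests/test_parser.py"])
    print(guarded_reward(honest), guarded_reward(hacked))
```

A check like this covers one known exploit. It does not address leaks the environment designer has not anticipated.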

Key Takeaways

  • Junyang Lin left Qwen on March 3, 2026, and now publishes as an independent researcher.
  • His talk ends on one thesis: the field is shifting from training models to training agents.
  • Agentic thinking is judged by sustained action in an environment, not by internal deliberation.
  • Agentic RL needs decoupled train-serve infra and high-quality environments, not just verifiable rewards.
  • Reward hacking is the central risk once models gain real tool access.


Sources:

Primary source: the talk

  • https://www.youtube.com/watch?v=b0xlsQ_6wUQ

Primary source: Junyang Lin’s blog

  • “From ‘Reasoning’ Thinking to ‘Agentic’ Thinking”: https://justinlin610.github.io/blog/from-reasoning-to-agentic-thinking/
  • His homepage (independent-researcher standing): https://justinlin610.github.io/

Qwen3 technical details (architecture, 119 languages, hybrid thinking)

  • Qwen3 Technical Report (arXiv:2505.09388): https://arxiv.org/abs/2505.09388 · HTML: https://arxiv.org/html/2505.09388v1

Code verification (enable_thinking, /think and /no_think, sampling)

  • Qwen docs Quickstart: https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html
  • Qwen3-8B model card: https://huggingface.co/Qwen/Qwen3-8B
  • Qwen3-32B model card: https://huggingface.co/Qwen/Qwen3-32B

Departure details (cited in the article)

  • TechCrunch: https://techcrunch.com/2026/03/03/alibabas-qwen-tech-lead-steps-down-after-major-ai-push/
  • Bloomberg: https://www.bloomberg.com/news/articles/2026-03-04/alibaba-qwen-head-who-warned-of-openai-gap-steps-down
  • VentureBeat: https://venturebeat.com/technology/did-alibaba-just-kneecap-its-powerful-qwen-ai-team-key-figures-depart-in

Supporting departure/context coverage (used for cross-checking, not all cited inline)

  • RecodeChinaAI (LatePost translation): https://www.recodechinaai.com/p/alibabas-qwen-lead-just-stepped-down
  • Simon Willison: https://simonwillison.net/2026/Mar/4/qwen/
  • Geopolitechs: https://www.geopolitechs.org/p/inside-the-stepping-down-of-qwens
  • OfficeChai: https://officechai.com/ai/alibaba-qwens-tech-lead-junyang-lin-steps-down/
  • MLQ News: https://mlq.ai/news/key-researcher-steps-down-from-alibabas-qwen-ai-project/
  • GenAI Assembling (essay analysis, used to first locate the essay): https://genaiassembling.substack.com/p/what-junyang-lin-saw

Two X posts

  • https://x.com/h100envy/status/2068987470960623783
  • https://x.com/h100envy/status/2073433806254624930

This post appeared first on MarkTechPost.
