
A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence

In this tutorial, we build an end-to-end implementation around Qwen 3.6-35B-A3B and explore how a contemporary multimodal MoE model can be used in practical workflows. We begin by setting up the environment, loading the model adaptively based on available GPU memory, and creating a reusable chat framework that supports both standard responses and explicit thinking traces. From there, we work through essential capabilities such as thinking-budget control, streamed generation with separated reasoning and answers, vision input handling, tool calling, structured JSON generation, MoE routing inspection, benchmarking, retrieval-augmented generation, and session persistence. Along the way, we run the model for inference and also learn how to design a robust tool layer on top of Qwen 3.6 for real experimentation and advanced prototyping.

import subprocess, sys
def _pip(*a): subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *a])
_pip("--upgrade", "pip")
_pip("--upgrade",
    "transformers>=4.48.0", "accelerate>=1.2.0", "bitsandbytes>=0.44.0",
    "pillow", "requests", "sentencepiece",
    "qwen-vl-utils[decord]", "sentence-transformers", "jsonschema")


import torch, os, json, time, re, gc, io, threading, textwrap, warnings
from collections import Counter
from typing import Any, Optional
warnings.filterwarnings("ignore")


assert torch.cuda.is_available(), "GPU required. Switch runtime to A100 / L4."
p = torch.cuda.get_device_properties(0)
VRAM_GB = p.total_memory / 1e9
print(f"GPU: {p.name} | VRAM: {VRAM_GB:.1f} GB | CUDA {torch.version.cuda} | torch {torch.__version__}")


if VRAM_GB >= 75:   LOAD_MODE = "bf16"
elif VRAM_GB >= 40: LOAD_MODE = "int8"
else:               LOAD_MODE = "int4"


try:
    import flash_attn
    ATTN_IMPL = "flash_attention_2"
except Exception:
    ATTN_IMPL = "sdpa"
print(f"-> mode={LOAD_MODE}  attn={ATTN_IMPL}")


from transformers import (
    AutoModelForImageTextToText, AutoProcessor,
    BitsAndBytesConfig, TextIteratorStreamer,
    StoppingCriteria, StoppingCriteriaList,
)


MODEL_ID = "Qwen/Qwen3.6-35B-A3B"
kwargs = dict(device_map="auto", trust_remote_code=True,
             low_cpu_mem_usage=True, attn_implementation=ATTN_IMPL,
             torch_dtype=torch.bfloat16)
if LOAD_MODE == "int8":
   kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
elif LOAD_MODE == "int4":
   kwargs["quantization_config"] = BitsAndBytesConfig(
       load_in_4bit=True, bnb_4bit_quant_type="nf4",
       bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True)


print("Loading processor...")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
print(f"Loading model in {LOAD_MODE} (first run downloads ~70GB) ...")
t0 = time.time()
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, **kwargs); model.eval()
print(f"Loaded in {time.time()-t0:.0f}s  |  VRAM used: {torch.cuda.memory_allocated()/1e9:.1f} GB")


SAMPLING = {
   "thinking_general": dict(temperature=1.0, top_p=0.95, top_k=20, presence_penalty=1.5),
   "thinking_coding":  dict(temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0),
   "instruct_general": dict(temperature=0.7, top_p=0.80, top_k=20, presence_penalty=1.5),
   "instruct_reason":  dict(temperature=1.0, top_p=1.00, top_k=40, presence_penalty=2.0),
}
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


def split_thinking(text: str):
    if THINK_OPEN in text and THINK_CLOSE in text:
        a = text.index(THINK_OPEN) + len(THINK_OPEN); b = text.index(THINK_CLOSE)
        return text[a:b].strip(), text[b + len(THINK_CLOSE):].strip()
    if THINK_CLOSE in text:
        b = text.index(THINK_CLOSE)
        return text[:b].strip(), text[b + len(THINK_CLOSE):].strip()
    return "", text.strip()
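As a quick sanity check of the splitting convention (a standalone copy of the tags and splitter, so this snippet runs in isolation): a reply with a closed think block splits into a (reasoning, answer) pair, and a reply without one yields an empty reasoning string.

```python
# Standalone copy of the thinking-split convention, runnable on its own.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def split_thinking(text: str):
    # Full <think>...</think> block: return (reasoning, answer).
    if THINK_OPEN in text and THINK_CLOSE in text:
        a = text.index(THINK_OPEN) + len(THINK_OPEN); b = text.index(THINK_CLOSE)
        return text[a:b].strip(), text[b + len(THINK_CLOSE):].strip()
    # Only a closing tag (opening tag was part of the prompt template).
    if THINK_CLOSE in text:
        b = text.index(THINK_CLOSE)
        return text[:b].strip(), text[b + len(THINK_CLOSE):].strip()
    # No thinking trace at all.
    return "", text.strip()

print(split_thinking("<think>net gain is 1m/day</think>Day 28."))
print(split_thinking("Just an answer."))
```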

We set up the full environment required to run Qwen 3.6-35B-A3B in Google Colab and install all supporting libraries for quantization, multimodal processing, retrieval, and schema validation. We then probe the available GPU, dynamically choose the loading mode based on VRAM, and configure the attention backend so the model runs as efficiently as possible on the given hardware. After that, we load the processor and model from Hugging Face and define the core sampling presets and the thinking-splitting utility, which lay the foundation for all later interactions.

class QwenChat:
    def __init__(self, model, processor, system=None, tools=None):
        self.model, self.processor = model, processor
        self.tokenizer = processor.tokenizer
        self.history: list[dict] = []
        if system: self.history.append({"role": "system", "content": system})
        self.tools = tools


    def user(self, content):      self.history.append({"role":"user","content":content}); return self
    def assistant(self, content, reasoning=""):
        m = {"role":"assistant","content":content}
        if reasoning: m["reasoning_content"] = reasoning
        self.history.append(m); return self
    def tool_result(self, name, result):
        self.history.append({"role":"tool","name":name,
            "content": result if isinstance(result, str) else json.dumps(result)})
        return self


    def _inputs(self, enable_thinking, preserve_thinking):
        return self.processor.apply_chat_template(
            self.history, tools=self.tools, tokenize=True,
            add_generation_prompt=True, return_dict=True, return_tensors="pt",
            enable_thinking=enable_thinking, preserve_thinking=preserve_thinking,
        ).to(self.model.device)


    def generate(self, *, enable_thinking=True, preserve_thinking=False,
                 max_new_tokens=2048, preset="thinking_general",
                 stopping_criteria=None, append_to_history=True):
        inp = self._inputs(enable_thinking, preserve_thinking)
        cfg = SAMPLING[preset]
        gk = dict(**inp, max_new_tokens=max_new_tokens, do_sample=True,
                  temperature=cfg["temperature"], top_p=cfg["top_p"], top_k=cfg["top_k"],
                  repetition_penalty=1.0,
                  pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id)
        if stopping_criteria is not None: gk["stopping_criteria"] = stopping_criteria
        with torch.inference_mode(): out = self.model.generate(**gk)
        raw = self.tokenizer.decode(out[0, inp["input_ids"].shape[-1]:], skip_special_tokens=True)
        think, ans = split_thinking(raw)
        if append_to_history: self.assistant(ans, reasoning=think)
        return think, ans


    def stream(self, *, enable_thinking=True, preserve_thinking=False,
               max_new_tokens=2048, preset="thinking_general",
               on_thinking=None, on_answer=None):
        inp = self._inputs(enable_thinking, preserve_thinking)
        cfg = SAMPLING[preset]
        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True, skip_special_tokens=True)
        gk = dict(**inp, streamer=streamer, max_new_tokens=max_new_tokens, do_sample=True,
                  temperature=cfg["temperature"], top_p=cfg["top_p"], top_k=cfg["top_k"],
                  pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id)
        t = threading.Thread(target=self.model.generate, kwargs=gk); t.start()
        buf, in_think = "", enable_thinking
        think_text, answer_text = "", ""
        for piece in streamer:
            buf += piece
            if in_think:
                if THINK_CLOSE in buf:
                    close_at = buf.index(THINK_CLOSE)
                    resid = buf[:close_at]
                    if on_thinking: on_thinking(resid[len(think_text):])
                    think_text = resid
                    buf = buf[close_at + len(THINK_CLOSE):]
                    in_think = False
                    if buf and on_answer: on_answer(buf)
                    answer_text = buf; buf = ""
                else:
                    if on_thinking: on_thinking(piece)
                    think_text += piece
            else:
                if on_answer: on_answer(piece)
                answer_text += piece
        t.join()
        self.assistant(answer_text.strip(), reasoning=think_text.strip())
        return think_text.strip(), answer_text.strip()


    def save(self, path):
        with open(path, "w") as f:
            json.dump({"history": self.history, "tools": self.tools}, f, indent=2)
    @classmethod
    def load(cls, model, processor, path):
        with open(path) as f: data = json.load(f)
        c = cls(model, processor, tools=data.get("tools"))
        c.history = data["history"]; return c


class ThinkingBudget(StoppingCriteria):
    def __init__(self, tokenizer, budget: int):
        self.budget = budget
        self.open_ids  = tokenizer.encode(THINK_OPEN,  add_special_tokens=False)
        self.close_ids = tokenizer.encode(THINK_CLOSE, add_special_tokens=False)
        self.start = None
    def _find(self, seq, needle):
        n = len(needle)
        for i in range(len(seq)-n+1):
            if seq[i:i+n] == needle: return i
        return None
    def __call__(self, input_ids, scores, **kwargs):
        seq = input_ids[0].tolist()
        if self.start is None:
            idx = self._find(seq, self.open_ids)
            if idx is not None: self.start = idx + len(self.open_ids)
            return False
        if self._find(seq[self.start:], self.close_ids) is not None: return False
        return (len(seq) - self.start) >= self.budget
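The budget criterion reduces to a plain subsequence search over token ids; a standalone illustration with made-up ids standing in for the tokenized `<think>`/`</think>` tags (the real ids depend on the tokenizer):

```python
# Standalone copy of the subsequence search used by the budget criterion.
def find_sub(seq, needle):
    n = len(needle)
    for i in range(len(seq) - n + 1):
        if seq[i:i+n] == needle:
            return i
    return None

OPEN_IDS, CLOSE_IDS = [151667], [151668]   # hypothetical tag token ids
seq = [1, 2, 151667, 5, 6, 7, 151668, 9]

start = find_sub(seq, OPEN_IDS) + len(OPEN_IDS)   # first token of the thinking span
print(start, find_sub(seq[start:], CLOSE_IDS))    # -> 3 3
```

Once the closing tag appears, the criterion permanently stops enforcing the budget, so only the reasoning span is capped, never the answer.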


TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.S)


def run_calculate(expr: str) -> str:
    if any(c not in "0123456789+-*/().% " for c in expr):
        return json.dumps({"error":"illegal chars"})
    try:    return json.dumps({"result": eval(expr, {"__builtins__": {}}, {})})
    except Exception as e: return json.dumps({"error": str(e)})


_DOCS = {
    "qwen3.6":  "Qwen3.6-35B-A3B is a 35B MoE with 3B active params and 262k native context.",
    "deltanet": "Gated DeltaNet is a linear-attention variant used in Qwen3.6's hybrid layers.",
    "moe":      "Qwen3.6 uses 256 experts with 8 routed + 1 shared per token.",
}
def run_search_docs(q):
    hits = [v for k,v in _DOCS.items() if k in q.lower()]
    return json.dumps({"results": hits or ["no hits"]})
def run_get_time():
    import datetime as dt
    return json.dumps({"iso": dt.datetime.utcnow().isoformat()+"Z"})


TOOL_FNS = {
    "calculate":   lambda a: run_calculate(a["expression"]),
    "search_docs": lambda a: run_search_docs(a["query"]),
    "get_time":    lambda a: run_get_time(),
}
TOOLS_SCHEMA = [
    {"type":"function","function":{"name":"calculate","description":"Evaluate arithmetic.",
      "parameters":{"type":"object","properties":{"expression":{"type":"string"}},"required":["expression"]}}},
    {"type":"function","function":{"name":"search_docs","description":"Search internal docs.",
      "parameters":{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}}},
    {"type":"function","function":{"name":"get_time","description":"Get current UTC time.",
      "parameters":{"type":"object","properties":{}}}},
]
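To make the contract concrete, here is a minimal standalone sketch of the payload format the agent loop expects: the model emits each call as a JSON object wrapped in `<tool_call>` tags, which gets extracted and dispatched by name. The `raw` string below is a made-up example, not real model output.

```python
import json, re

# Same extraction regex as used by the agent loop.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.S)

# Hypothetical model output containing one tool call.
raw = ('Let me compute that. <tool_call>'
       '{"name": "calculate", "arguments": {"expression": "0.15*842"}}'
       '</tool_call>')

for payload in TOOL_CALL_RE.findall(raw):
    call = json.loads(payload)          # {"name": ..., "arguments": {...}}
    print(call["name"], "->", call["arguments"])
```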

We build the main QwenChat conversation manager, which handles message history, tool messages, chat template formatting, standard generation, streaming generation, and session persistence. We also define the ThinkingBudget stopping criterion to control how much reasoning the model is allowed to produce before continuing or stopping generation. In addition, we create the tool-calling support layer, including arithmetic, lightweight doc search, time lookup, and the tool schema that allows the model to interact with external functions in an agent-style loop.

def run_agent(user_msg, *, max_steps=5, verbose=True):
    chat = QwenChat(model, processor,
        system="You are a helpful assistant. Call tools when useful, then answer.",
        tools=TOOLS_SCHEMA)
    chat.user(user_msg)
    for step in range(max_steps):
        think, raw = chat.generate(enable_thinking=True, preserve_thinking=True,
                                   preset="thinking_general", max_new_tokens=1024,
                                   append_to_history=False)
        calls = TOOL_CALL_RE.findall(raw)
        if verbose:
            print(f"\n=== step {step+1} ===")
            print("reasoning:", textwrap.shorten(think, 200))
            print("raw      :", textwrap.shorten(raw, 300))
        if not calls:
            chat.assistant(raw, reasoning=think); return chat, raw
        chat.assistant(raw, reasoning=think)
        for payload in calls:
            try: parsed = json.loads(payload)
            except json.JSONDecodeError:
                chat.tool_result("error", {"error":"bad json"}); continue
            fn = TOOL_FNS.get(parsed.get("name"))
            res = fn(parsed.get("arguments", {})) if fn else json.dumps({"error":"unknown"})
            if verbose: print(f" -> {parsed.get('name')}({parsed.get('arguments',{})}) = {res}")
            chat.tool_result(parsed.get("name"), res)
    return chat, "(max_steps reached)"


import jsonschema


MOVIE_SCHEMA = {
    "type":"object",
    "required":["title","year","rating","genres","runtime_minutes"],
    "additionalProperties": False,
    "properties":{
        "title":{"type":"string"},
        "year":{"type":"integer","minimum":1900,"maximum":2030},
        "rating":{"type":"number","minimum":0,"maximum":10},
        "genres":{"type":"array","items":{"type":"string"},"minItems":1},
        "runtime_minutes":{"type":"integer","minimum":1,"maximum":500},
    },
}
def extract_json(text):
    text = re.sub(r"^```(?:json)?", "", text.strip())
    text = re.sub(r"```$", "", text.strip())
    s = text.find("{")
    if s < 0: raise ValueError("no object")
    d, e = 0, -1
    for i in range(s, len(text)):
        if text[i] == "{": d += 1
        elif text[i] == "}":
            d -= 1
            if d == 0: e = i; break
    if e < 0: raise ValueError("unbalanced braces")
    return json.loads(text[s:e+1])


def json_with_retry(prompt, schema, *, max_tries=3):
    sys_m = ("You reply with ONLY a single JSON object matching the user's schema. "
             "No markdown fences. No commentary. No <think> blocks.")
    chat = QwenChat(model, processor, system=sys_m)
    chat.user(f"{prompt}\n\nRespond as JSON matching this schema:\n{json.dumps(schema, indent=2)}")
    last = None
    for i in range(max_tries):
        _, raw = chat.generate(enable_thinking=False, preset="instruct_general",
                               max_new_tokens=512, append_to_history=False)
        try:
            obj = extract_json(raw); jsonschema.validate(obj, schema)
            return obj, i+1
        except Exception as e:
            last = str(e); chat.assistant(raw)
            chat.user(f"That failed validation: {last}. Produce ONLY valid JSON.")
    raise RuntimeError(f"gave up after {max_tries}: {last}")
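The retry loop hinges on `jsonschema.validate`, which returns silently on success and raises `ValidationError` on any mismatch; a standalone illustration against a trimmed-down schema in the same style as MOVIE_SCHEMA:

```python
import jsonschema

# Trimmed-down schema (illustration only, not the full MOVIE_SCHEMA).
schema = {
    "type": "object",
    "required": ["title", "year"],
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer", "minimum": 1900, "maximum": 2030},
    },
}

jsonschema.validate({"title": "Inception", "year": 2010}, schema)  # passes silently
try:
    jsonschema.validate({"title": "Inception", "year": "2010"}, schema)  # wrong type
except jsonschema.ValidationError as e:
    print("rejected:", e.message)
```

The exception message is exactly what gets fed back to the model as a correction prompt in the retry loop.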


def benchmark(prompt, *, batch_sizes=(1,2,4), max_new_tokens=64):
    print(f"{'batch':>6} {'tok/s':>10} {'total_s':>10} {'VRAM_GB':>10}")
    print("-"*40)
    for bs in batch_sizes:
        gc.collect(); torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
        msgs = [[{"role":"user","content":prompt}] for _ in range(bs)]
        texts = [processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True,
                                               enable_thinking=False) for m in msgs]
        processor.tokenizer.padding_side = "left"
        inp = processor.tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
        torch.cuda.synchronize(); t0 = time.time()
        with torch.inference_mode():
            out = model.generate(**inp, max_new_tokens=max_new_tokens, do_sample=False,
                pad_token_id=processor.tokenizer.pad_token_id or processor.tokenizer.eos_token_id)
        torch.cuda.synchronize(); dt = time.time()-t0
        new_toks = (out.shape[1] - inp["input_ids"].shape[1]) * bs
        vram = torch.cuda.max_memory_allocated()/1e9
        print(f"{bs:>6d} {new_toks/dt:>10.1f} {dt:>10.2f} {vram:>10.1f}")


def build_rag():
    from sentence_transformers import SentenceTransformer
    import numpy as np
    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    KB = [
        "Qwen3.6-35B-A3B has 35B total params and 3B activated via MoE.",
        "Context length is 262,144 tokens natively, up to ~1M with YaRN.",
        "The MoE layer uses 256 experts with 8 routed and 1 shared per token.",
        "Thinking mode wraps internal reasoning in <think>...</think> blocks.",
        "preserve_thinking=True keeps prior reasoning across turns for agents.",
        "Gated DeltaNet is a linear-attention variant in the hybrid layers.",
        "The model accepts image, video, and text input natively.",
        "Sampling for coding tasks uses temperature=0.6 rather than 1.0.",
    ]
    KB_EMB = embedder.encode(KB, normalize_embeddings=True)
    def retrieve(q, k=3):
        qv = embedder.encode([q], normalize_embeddings=True)[0]
        return [KB[i] for i in np.argsort(-(KB_EMB @ qv))[:k]]
    return retrieve
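The retrieval step reduces to a dot product over L2-normalized embeddings followed by an argsort; a toy 2-D illustration of the same ranking logic (the vectors below are made up, not real MiniLM embeddings):

```python
import numpy as np

# Three fake unit-ish "document" embeddings and one "query" embedding.
KB_EMB = np.array([[1.0, 0.0],
                   [0.8, 0.6],
                   [0.0, 1.0]])
qv = np.array([1.0, 0.0])

# Higher dot product = more similar; argsort on the negated scores ranks
# documents from best to worst, exactly as retrieve() does above.
order = np.argsort(-(KB_EMB @ qv))
print(order.tolist())  # -> [0, 1, 2]
```

Because the embeddings are normalized, the dot product equals cosine similarity, which is why no explicit norm division is needed.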


def rag_answer(query, retrieve, k=3):
    ctx = retrieve(query, k)
    sys_m = "Answer using ONLY the provided context. If insufficient, say so."
    user = "Context:\n" + "\n".join(f"- {c}" for c in ctx) + f"\n\nQuestion: {query}"
    chat = QwenChat(model, processor, system=sys_m); chat.user(user)
    _, ans = chat.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300)
    return ans, ctx

We define higher-level utility functions that turn the model into a more complete tool framework for agentic, structured workflows. We implement the agent loop for iterative tool use, add JSON extraction and validation with retry logic, create a benchmarking function to measure generation throughput, and build a lightweight semantic retrieval pipeline for mini-RAG. Together, these functions help us move from basic prompting to more robust workflows in which the model can reason, validate outputs, retrieve supporting context, and be systematically tested.

print("\n" + "="*20, "§4 thinking-budget", "="*20)
c = QwenChat(model, processor)
c.user("A frog is at the bottom of a 30m well. It climbs 3m/day, slips 2m/night. "
       "How many days until it escapes? Explain.")
budget = ThinkingBudget(processor.tokenizer, budget=150)
think, ans = c.generate(enable_thinking=True, max_new_tokens=1200,
                        stopping_criteria=StoppingCriteriaList([budget]))
print(f"Thinking ~{len(processor.tokenizer.encode(think))} tok | Answer:\n{ans or '(truncated)'}")


print("\n" + "="*20, "§5 streaming split", "="*20)
c = QwenChat(model, processor)
c.user("Explain why transformers scale better than RNNs, in two short paragraphs.")
print("[THINKING >>] ", end="", flush=True)
first = [True]
def _ot(x): print(x, end="", flush=True)
def _oa(x):
    if first[0]: print("\n\n[ANSWER >>] ", end="", flush=True); first[0] = False
    print(x, end="", flush=True)
c.stream(enable_thinking=True, preset="thinking_general", max_new_tokens=700,
         on_thinking=_ot, on_answer=_oa); print()


print("\n" + "="*20, "§6 vision", "="*20)
IMG = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
c = QwenChat(model, processor)
c.history.append({"role":"user","content":[
    {"type":"image","image":IMG},
    {"type":"text","text":"Describe this figure in one sentence, then state what it's asking."}]})
_, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300)
print("Describe:", ans)


GRD = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.6/demo/RealWorld/RealWorld-04.png"
c = QwenChat(model, processor)
c.history.append({"role":"user","content":[
    {"type":"image","image":GRD},
    {"type":"text","text": "Locate every distinct object. Reply ONLY with JSON "
     '[{"label":...,"bbox_2d":[x1,y1,x2,y2]}, ...] in pixel coords.'}]})
_, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=800)
print("Grounding:", ans[:600])


print("\n" + "="*20, "§7 YaRN override", "="*20)
YARN = {"text_config": {"rope_parameters": {
    "mrope_interleaved": True, "mrope_section": [11,11,10],
    "rope_type": "yarn", "rope_theta": 10_000_000,
    "partial_rotary_factor": 0.25, "factor": 4.0,
    "original_max_position_embeddings": 262_144}}}
print(json.dumps(YARN, indent=2))
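A minimal sketch of how such an override could be merged into a config dict before reloading; the actual reload (passing the merged config back to `from_pretrained`) is omitted to avoid the ~70GB download, and `deep_update` is a hypothetical helper whose nested keys simply mirror the dict printed above:

```python
# Recursively merge an override dict into a base config dict (hypothetical
# helper; a plain-dict merge makes the nesting of the YaRN override explicit).
def deep_update(base: dict, override: dict) -> dict:
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(base.get(k), dict):
            deep_update(base[k], v)   # recurse into nested sections
        else:
            base[k] = v               # overwrite or add leaf values
    return base

base_cfg = {"text_config": {"rope_parameters": {"rope_type": "default"},
                            "hidden_size": 4096}}
yarn = {"text_config": {"rope_parameters": {"rope_type": "yarn", "factor": 4.0}}}
merged = deep_update(base_cfg, yarn)
print(merged["text_config"]["rope_parameters"])
```

Note the merge only touches the keys the override names, leaving siblings like `hidden_size` intact.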

We begin running the advanced demonstrations by testing thinking-budget control, split streaming, multimodal vision prompting, and a YaRN configuration example for extended-context handling. We first observe how the model reasons under a limited thinking budget, then stream its thinking and answer separately so that we can inspect both parts of the response flow. We also send image-based prompts for description and grounding tasks, and finally print a YaRN rope-configuration override that shows how long-context settings can be prepared for model reloading.

print("\n" + "="*20, "§8 agent loop", "="*20)
chat, final = run_agent(
    "What's 15% of 842 to 2 decimals? Also briefly explain gated DeltaNet per the docs.",
    max_steps=4)
print("\nFINAL:", final)


print("\n" + "="*20, "§9 structured JSON", "="*20)
obj, tries = json_with_retry("Summarize the movie Inception as structured metadata.",
                             MOVIE_SCHEMA)
print(f"({tries} tries)", json.dumps(obj, indent=2))


print("\n" + "="*20, "§10 MoE routing", "="*20)
routers = []
for name, m in model.named_modules():
    low = name.lower()
    if (("gate" in low and ("moe" in low or "expert" in low)) or
        low.endswith(".router") or low.endswith(".gate")) and hasattr(m, "weight"):
        routers.append((name, m))
print(f"found {len(routers)} router-like modules")


TOP_K = 8
counts = [Counter() for _ in routers]
handles = []
def _mkhook(i):
    def h(_m, _i, out):
        lg = out[0] if isinstance(out, tuple) else out
        if lg.dim() != 2: return
        try:
            for eid in lg.topk(TOP_K, dim=-1).indices.flatten().tolist():
                counts[i][eid] += 1
        except Exception: pass
    return h
for i,(_,m) in enumerate(routers): handles.append(m.register_forward_hook(_mkhook(i)))
try:
    c = QwenChat(model, processor); c.user("Write one short sentence about sunset.")
    c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=40)
finally:
    for h in handles: h.remove()
total = Counter()
for c_ in counts: total.update(c_)
print(f"distinct experts activated: {len(total)}")
for eid, n in total.most_common(10): print(f"  expert #{eid:>3}  {n} fires")


print("\n" + "="*20, "§11 benchmark", "="*20)
benchmark("In one sentence, what is entropy?", batch_sizes=(1,2,4), max_new_tokens=48)


print("\n" + "="*20, "§12 mini-RAG", "="*20)
retrieve = build_rag()
ans, ctx = rag_answer("How many experts are active per token, and why does that matter?", retrieve)
print("retrieved:"); [print(" -", c) for c in ctx]
print("answer:", ans)


print("\n" + "="*20, "§13 save/resume", "="*20)
c = QwenChat(model, processor); c.user("Give me a unique 5-letter codeword. Just the word.")
_, a1 = c.generate(enable_thinking=True, max_new_tokens=256); print("T1:", a1)
c.save("/content/session.json")
del c; gc.collect()
r = QwenChat.load(model, processor, "/content/session.json")
r.user("Reverse the letters of that codeword.")
_, a2 = r.generate(enable_thinking=True, preserve_thinking=True, max_new_tokens=256)
print("T2:", a2)


print("\n✓ tutorial complete")

We continue with the remaining demonstrations that showcase tool-augmented reasoning, schema-constrained JSON generation, MoE routing introspection, throughput benchmarking, retrieval-augmented answering, and save-resume session handling. We let the model solve a tool-using task, generate structured movie metadata with validation, inspect which expert-like router modules activate during inference, and measure tokens per second across different batch sizes. Finally, we test mini-RAG for context-grounded answering and verify conversational persistence by saving a session, reloading it, and continuing the interaction from the saved history.

In conclusion, we created a practical and detailed workflow for using Qwen 3.6-35B-A3B beyond simple text generation. We showed how to combine adaptive loading, multimodal prompting, controlled reasoning, tool-augmented interaction, schema-constrained outputs, lightweight RAG, and session save-resume patterns into one integrated system. We also inspected expert routing behavior and measured throughput to understand the model's usability and performance. Along the way, we turned Qwen 3.6 into a working experimental playground where we can study its capabilities, test advanced interaction patterns, and build a strong foundation for more serious research or product-oriented applications.


Check out the Full Codes with Notebook here.

The post A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence appeared first on MarkTechPost.
