
How to Build Production-Ready Agentic Systems with Z.AI GLM-5 Using Thinking Mode, Tool Calling, Streaming, and Multi-Turn Workflows


In this tutorial, we explore the complete capabilities of Z.AI's GLM-5 model and build a full understanding of how to use it for real-world, agentic applications. We start from the basics by setting up the environment using the Z.AI SDK and its OpenAI-compatible interface, then progressively move on to advanced features such as streaming responses, thinking mode for deeper reasoning, and multi-turn conversations. As we proceed, we integrate function calling, structured outputs, and finally assemble a fully functional multi-tool agent powered by GLM-5. Along the way, we examine each capability in isolation and see how Z.AI's ecosystem enables us to build scalable, production-ready AI systems.

!pip install -q zai-sdk openai rich


import os
import json
import time
from datetime import datetime
from typing import Optional
import getpass


API_KEY = os.environ.get("ZAI_API_KEY")


if not API_KEY:
    API_KEY = getpass.getpass("🔑 Enter your Z.AI API key (hidden input): ").strip()


if not API_KEY:
    raise ValueError(
        "❌ No API key provided! Get one free at: https://z.ai/manage-apikey/apikey-list"
    )


os.environ["ZAI_API_KEY"] = API_KEY
print(f"✅ API key configured (ends with ...{API_KEY[-4:]})")


from zai import ZaiClient


client = ZaiClient(api_key=API_KEY)
print("✅ ZaiClient initialized — ready to use GLM-5!")




print("\n" + "=" * 70)
print("📝 SECTION 2: Basic Chat Completion")
print("=" * 70)


response = client.chat.completions.create(
    model="glm-5",
    messages=[
        {"role": "system", "content": "You are a concise, expert software architect."},
        {"role": "user", "content": "Explain the Mixture-of-Experts architecture in 3 sentences."},
    ],
    max_tokens=256,
    temperature=0.7,
)


print("\n🤖 GLM-5 Response:")
print(response.choices[0].message.content)
print(f"\n📊 Usage: {response.usage.prompt_tokens} prompt + {response.usage.completion_tokens} completion tokens")




print("\n" + "=" * 70)
print("🌊 SECTION 3: Streaming Responses")
print("=" * 70)


print("\n🤖 GLM-5 (streaming): ", end="", flush=True)


stream = client.chat.completions.create(
    model="glm-5",
    messages=[
        {"role": "user", "content": "Write a Python one-liner that checks if a number is prime."},
    ],
    stream=True,
    max_tokens=512,
    temperature=0.6,
)


full_response = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
        full_response += delta.content


print(f"\n\n📊 Streamed {len(full_response)} characters")

We begin by installing the Z.AI and OpenAI SDKs, then securely capture our API key via hidden terminal input using getpass. We initialize the ZaiClient and fire off our first basic chat completion to GLM-5, asking it to explain the Mixture-of-Experts architecture. We then explore streaming responses, watching tokens arrive in real time as GLM-5 generates a Python one-liner for prime checking.

print("\n" + "=" * 70)
print("🧠 SECTION 4: Thinking Mode (Chain-of-Thought)")
print("=" * 70)
print("GLM-5 can expose its internal reasoning before giving a final answer.")
print("This is especially powerful for math, logic, and complex coding tasks.\n")


print("─── Thinking Mode + Streaming ───\n")


stream = client.chat.completions.create(
    model="glm-5",
    messages=[
        {
            "role": "user",
            "content": (
                "A farmer has 17 sheep. All but 9 run away. "
                "How many sheep does the farmer have left? "
                "Think carefully before answering."
            ),
        },
    ],
    thinking={"type": "enabled"},
    stream=True,
    max_tokens=2048,
    temperature=0.6,
)


reasoning_text = ""
answer_text = ""


for chunk in stream:
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        if not reasoning_text:
            print("💭 Reasoning:")
        print(delta.reasoning_content, end="", flush=True)
        reasoning_text += delta.reasoning_content
    if delta.content:
        if not answer_text and reasoning_text:
            print("\n\n✅ Final Answer:")
        print(delta.content, end="", flush=True)
        answer_text += delta.content


print(f"\n\n📊 Reasoning: {len(reasoning_text)} chars | Answer: {len(answer_text)} chars")




print("\n" + "=" * 70)
print("💬 SECTION 5: Multi-Turn Conversation")
print("=" * 70)


messages = [
   {"role": "system", "content": "You are a senior Python developer. Be concise."},
   {"role": "user", "content": "What's the difference between a list and a tuple in Python?"},
]


r1 = client.chat.completions.create(model="glm-5", messages=messages, max_tokens=512, temperature=0.7)
assistant_reply_1 = r1.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_reply_1})
print(f"\n🧑 User: {messages[1]['content']}")
print(f"🤖 GLM-5: {assistant_reply_1[:200]}...")


messages.append({"role": "user", "content": "When should I use a NamedTuple instead?"})
r2 = client.chat.completions.create(model="glm-5", messages=messages, max_tokens=512, temperature=0.7)
assistant_reply_2 = r2.choices[0].message.content
print(f"\n🧑 User: {messages[-1]['content']}")
print(f"🤖 GLM-5: {assistant_reply_2[:200]}...")


messages.append({"role": "assistant", "content": assistant_reply_2})
messages.append({"role": "user", "content": "Show me a practical example with type hints."})
r3 = client.chat.completions.create(model="glm-5", messages=messages, max_tokens=1024, temperature=0.7)
assistant_reply_3 = r3.choices[0].message.content
print(f"\n🧑 User: {messages[-1]['content']}")
print(f"🤖 GLM-5: {assistant_reply_3[:300]}...")


print(f"\n📊 Conversation: {len(messages)+1} messages, {r3.usage.total_tokens} total tokens in last call")

We turn on GLM-5's thinking mode to watch its internal chain-of-thought reasoning streamed live through the reasoning_content field before the final answer appears. We then build a multi-turn conversation where we ask about Python lists vs tuples, follow up on NamedTuples, and request a practical example with type hints, all while GLM-5 maintains full context across turns. We track how the conversation grows in message count and token usage with each successive exchange.
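Since every turn resends the full messages list, long sessions grow prompt-token cost without bound. One common mitigation, sketched here as our own helper (not a Z.AI SDK feature), is to keep the system prompt and only the most recent exchanges:

```python
def trim_history(messages: list, max_turns: int = 4) -> list:
    """Keep any system messages plus the last max_turns
    user/assistant exchanges (2 messages per exchange)."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-(max_turns * 2):]


# Build a 21-message history: 1 system prompt + 10 exchanges.
history = [{"role": "system", "content": "Be concise."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=4)
print(len(trimmed))  # 9: the system message + the 8 most recent messages
```

You would then pass `trimmed` instead of the full list to chat.completions.create before each call; the trade-off is losing older context the model can no longer see.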

print("\n" + "=" * 70)
print("🔧 SECTION 6: Function Calling (Tool Use)")
print("=" * 70)
print("GLM-5 can decide WHEN and HOW to call external functions you define.\n")


tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'San Francisco', 'Tokyo'",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit (default: celsius)",
                    },
                },
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a mathematical expression safely",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Math expression, e.g. '2**10 + 3*7'",
                    }
                },
                "required": ["expression"],
            },
        },
    },
]




def get_weather(city: str, unit: str = "celsius") -> dict:
    weather_db = {
        "san francisco": {"temp": 18, "condition": "Foggy", "humidity": 78},
        "tokyo": {"temp": 28, "condition": "Sunny", "humidity": 55},
        "london": {"temp": 14, "condition": "Rainy", "humidity": 85},
        "new york": {"temp": 22, "condition": "Partly Cloudy", "humidity": 60},
    }
    data = weather_db.get(city.lower(), {"temp": 20, "condition": "Clear", "humidity": 50})
    if unit == "fahrenheit":
        data["temp"] = round(data["temp"] * 9 / 5 + 32)
    return {"city": city, "unit": unit or "celsius", **data}




def calculate(expression: str) -> dict:
    allowed = set("0123456789+-*/.()% ")
    if not all(c in allowed for c in expression):
        return {"error": "Invalid characters in expression"}
    try:
        result = eval(expression)
        return {"expression": expression, "result": result}
    except Exception as e:
        return {"error": str(e)}
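The character allowlist keeps eval away from names and attribute access, but a stricter approach is to parse the expression with the ast module and evaluate only arithmetic nodes. A possible alternative, sketched by us rather than taken from the tutorial:

```python
import ast
import operator

# Map AST operator node types to the corresponding arithmetic functions.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.Mod: operator.mod,
    ast.USub: operator.neg, ast.UAdd: operator.pos,
}


def safe_calculate(expression: str) -> dict:
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("Disallowed syntax in expression")

    try:
        tree = ast.parse(expression, mode="eval")
        return {"expression": expression, "result": _eval(tree.body)}
    except Exception as e:
        return {"error": str(e)}


print(safe_calculate("2**10 + 3*7"))       # result: 1045
print(safe_calculate("__import__('os')"))  # rejected with an error
```

Because only constants and arithmetic operators are ever evaluated, function calls, names, and attribute lookups are rejected by construction rather than by character filtering.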




TOOL_REGISTRY = {"get_weather": get_weather, "calculate": calculate}




def run_tool_call(user_message: str):
    print(f"\n🧑 User: {user_message}")
    messages = [{"role": "user", "content": user_message}]


    response = client.chat.completions.create(
        model="glm-5",
        messages=messages,
        tools=tools,
        tool_choice="auto",
        max_tokens=1024,
    )


    assistant_msg = response.choices[0].message
    messages.append(assistant_msg.model_dump())


    if assistant_msg.tool_calls:
        for tc in assistant_msg.tool_calls:
            fn_name = tc.function.name
            fn_args = json.loads(tc.function.arguments)
            print(f"   🔧 Tool call: {fn_name}({fn_args})")


            result = TOOL_REGISTRY[fn_name](**fn_args)
            print(f"   📦 Result: {result}")


            messages.append({
                "role": "tool",
                "content": json.dumps(result, ensure_ascii=False),
                "tool_call_id": tc.id,
            })


        final = client.chat.completions.create(
            model="glm-5",
            messages=messages,
            tools=tools,
            max_tokens=1024,
        )
        print(f"🤖 GLM-5: {final.choices[0].message.content}")
    else:
        print(f"🤖 GLM-5: {assistant_msg.content}")




run_tool_call("What's the weather like in Tokyo right now?")
run_tool_call("What is 2^20 + 3^10 - 1024?")
run_tool_call("Compare the weather in San Francisco and London, and calculate the temperature difference.")




print("\n" + "=" * 70)
print("📋 SECTION 7: Structured JSON Output")
print("=" * 70)
print("Force GLM-5 to return well-structured JSON for downstream processing.\n")


response = client.chat.completions.create(
    model="glm-5",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a data extraction assistant. "
                "Always respond with valid JSON only — no markdown, no explanation."
            ),
        },
        {
            "role": "user",
            "content": (
                "Extract structured data from this text:\n\n"
                '"Acme Corp reported Q3 2025 revenue of $4.2B, up 18% YoY. '
                "Net income was $890M. The company announced 3 new products "
                "and plans to expand into 5 new markets by 2026. CEO Jane Smith "
                'said she expects 25% growth next year."\n\n'
                "Return JSON with keys: company, quarter, revenue, revenue_growth, "
                "net_income, new_products, new_markets, ceo, growth_forecast"
            ),
        },
    ],
    max_tokens=512,
    temperature=0.1,
)


raw_output = response.choices[0].message.content
print("📄 Raw output:")
print(raw_output)


try:
    clean = raw_output.strip()
    if clean.startswith("```"):
        clean = clean.split("\n", 1)[1].rsplit("```", 1)[0]
    parsed = json.loads(clean)
    print("\n✅ Parsed JSON:")
    print(json.dumps(parsed, indent=2))
except json.JSONDecodeError as e:
    print(f"\n⚠ JSON parsing failed: {e}")
    print("Tip: You can add response_format={'type': 'json_object'} for stricter enforcement")

We define two tools, a weather lookup and a math calculator, then let GLM-5 autonomously decide when to invoke them based on the user's natural-language query. We run a complete tool-calling round trip: the model selects the function, we execute it locally, feed the result back, and GLM-5 synthesizes a final human-readable answer. We then switch to structured output, prompting GLM-5 to extract financial data from raw text into clean, parseable JSON.
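The fence-stripping logic in Section 7 is worth factoring into a reusable helper, since model output sometimes arrives wrapped in markdown fences even when we ask for raw JSON. A small sketch (the parse_model_json name is our own):

```python
import json


def parse_model_json(raw: str) -> dict:
    """Parse model output as JSON, tolerating markdown code fences."""
    clean = raw.strip()
    if clean.startswith("```"):
        # Drop the opening fence line (``` or ```json) and the closing fence.
        clean = clean.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(clean)


fenced = '```json\n{"company": "Acme Corp", "quarter": "Q3 2025"}\n```'
print(parse_model_json(fenced))           # parsed dict, fences removed
print(parse_model_json('{"revenue": "4.2B"}'))  # plain JSON passes through
```

Calling json.loads still raises json.JSONDecodeError on genuinely malformed output, so the caller keeps the same try/except handling as in Section 7.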

print("\n" + "=" * 70)
print("🤖 SECTION 8: Multi-Tool Agentic Loop")
print("=" * 70)
print("Build a complete agent that can use multiple tools across turns.\n")




class GLM5Agent:


    def __init__(self, system_prompt: str, tools: list, tool_registry: dict):
        self.client = ZaiClient(api_key=API_KEY)
        self.messages = [{"role": "system", "content": system_prompt}]
        self.tools = tools
        self.registry = tool_registry
        self.max_iterations = 5


    def chat(self, user_input: str) -> str:
        self.messages.append({"role": "user", "content": user_input})


        for iteration in range(self.max_iterations):
            response = self.client.chat.completions.create(
                model="glm-5",
                messages=self.messages,
                tools=self.tools,
                tool_choice="auto",
                max_tokens=2048,
                temperature=0.6,
            )


            msg = response.choices[0].message
            self.messages.append(msg.model_dump())


            if not msg.tool_calls:
                return msg.content


            for tc in msg.tool_calls:
                fn_name = tc.function.name
                fn_args = json.loads(tc.function.arguments)
                print(f"   🔧 [{iteration+1}] {fn_name}({fn_args})")


                if fn_name in self.registry:
                    result = self.registry[fn_name](**fn_args)
                else:
                    result = {"error": f"Unknown function: {fn_name}"}


                self.messages.append({
                    "role": "tool",
                    "content": json.dumps(result, ensure_ascii=False),
                    "tool_call_id": tc.id,
                })


        return "⚠ Agent reached maximum iterations without a final answer."




extended_tools = tools + [
    {
        "type": "function",
        "function": {
            "name": "get_current_time",
            "description": "Get the current date and time in ISO format",
            "parameters": {
                "type": "object",
                "properties": {},
                "required": [],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "unit_converter",
            "description": "Convert between units (length, weight, temperature)",
            "parameters": {
                "type": "object",
                "properties": {
                    "value": {"type": "number", "description": "Numeric value to convert"},
                    "from_unit": {"type": "string", "description": "Source unit (e.g., 'km', 'miles', 'kg', 'lbs', 'celsius', 'fahrenheit')"},
                    "to_unit": {"type": "string", "description": "Target unit"},
                },
                "required": ["value", "from_unit", "to_unit"],
            },
        },
    },
]




def get_current_time() -> dict:
   return {"datetime": datetime.now().isoformat(), "timezone": "UTC"}




def unit_converter(value: float, from_unit: str, to_unit: str) -> dict:
    conversions = {
        ("km", "miles"): lambda v: v * 0.621371,
        ("miles", "km"): lambda v: v * 1.60934,
        ("kg", "lbs"): lambda v: v * 2.20462,
        ("lbs", "kg"): lambda v: v * 0.453592,
        ("celsius", "fahrenheit"): lambda v: v * 9 / 5 + 32,
        ("fahrenheit", "celsius"): lambda v: (v - 32) * 5 / 9,
        ("meters", "ft"): lambda v: v * 3.28084,
        ("ft", "meters"): lambda v: v * 0.3048,
    }
    key = (from_unit.lower(), to_unit.lower())
    if key in conversions:
        result = round(conversions[key](value), 4)
        return {"value": value, "from": from_unit, "to": to_unit, "result": result}
    return {"error": f"Conversion {from_unit} → {to_unit} not supported"}




extended_registry = {
   **TOOL_REGISTRY,
   "get_current_time": get_current_time,
   "unit_converter": unit_converter,
}


agent = GLM5Agent(
    system_prompt=(
        "You are a helpful assistant with access to weather, math, time, and "
        "unit conversion tools. Use them whenever they can help answer the user's "
        "question accurately. Always show your work."
    ),
    tools=extended_tools,
    tool_registry=extended_registry,
)


print("🧑 User: What time is it? Also, if it's 28°C in Tokyo, what's that in Fahrenheit?")
print("   And what's 2^16?")
result = agent.chat(
    "What time is it? Also, if it's 28°C in Tokyo, what's that in Fahrenheit? "
    "And what's 2^16?"
)
print(f"\n🤖 Agent: {result}")




print("\n" + "=" * 70)
print("⚖  SECTION 9: Thinking Mode ON vs OFF Comparison")
print("=" * 70)
print("See how thinking mode improves accuracy on a tricky logic problem.\n")


tricky_question = (
    "I have 12 coins. One of them is counterfeit and weighs differently than the rest. "
    "Using a balance scale at most 3 times, how can I identify the counterfeit coin?"
)


print("─── WITHOUT Thinking Mode ───")
t0 = time.time()
r_no_think = client.chat.completions.create(
    model="glm-5",
    messages=[{"role": "user", "content": tricky_question}],
    thinking={"type": "disabled"},
    max_tokens=2048,
    temperature=0.6,
)
t1 = time.time()
print(f"⏱  Time: {t1-t0:.1f}s | Tokens: {r_no_think.usage.completion_tokens}")
print(f"📝 Answer (first 300 chars): {r_no_think.choices[0].message.content[:300]}...")


print("\n─── WITH Thinking Mode ───")
t0 = time.time()
r_think = client.chat.completions.create(
    model="glm-5",
    messages=[{"role": "user", "content": tricky_question}],
    thinking={"type": "enabled"},
    max_tokens=4096,
    temperature=0.6,
)
t1 = time.time()
print(f"⏱  Time: {t1-t0:.1f}s | Tokens: {r_think.usage.completion_tokens}")
print(f"📝 Answer (first 300 chars): {r_think.choices[0].message.content[:300]}...")

We build a reusable GLM5Agent class that runs a full agentic loop, automatically dispatching to weather, math, time, and unit conversion tools across multiple iterations until it reaches a final answer. We test it with a complex multi-part query that requires calling three different tools in a single turn. We then run a side-by-side comparison of the same tricky 12-coin logic puzzle with thinking mode disabled versus enabled, measuring both response time and answer quality.

print("\n" + "=" * 70)
print("🔄 SECTION 10: OpenAI SDK Compatibility")
print("=" * 70)
print("GLM-5 is fully compatible with the OpenAI Python SDK.")
print("Just change the base_url and your existing OpenAI code works as-is!\n")


from openai import OpenAI


openai_client = OpenAI(
   api_key=API_KEY,
   base_url="https://api.z.ai/api/paas/v4/",
)


completion = openai_client.chat.completions.create(
    model="glm-5",
    messages=[
        {"role": "system", "content": "You are a writing assistant."},
        {
            "role": "user",
            "content": "Write a 4-line poem about artificial intelligence discovering nature.",
        },
    ],
    max_tokens=256,
    temperature=0.9,
)


print("🤖 GLM-5 (via OpenAI SDK):")
print(completion.choices[0].message.content)


print("\n🌊 Streaming (via OpenAI SDK):")
stream = openai_client.chat.completions.create(
    model="glm-5",
    messages=[
        {
            "role": "user",
            "content": "List 3 creative use cases for a 744B parameter MoE model. Be brief.",
        }
    ],
    stream=True,
    max_tokens=512,
)


for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()




print("\n" + "=" * 70)
print("🎉 Tutorial Complete!")
print("=" * 70)
print("""
You've learned how to use GLM-5 for:


 ✅ Basic chat completions
 ✅ Real-time streaming responses
 ✅ Thinking mode (chain-of-thought reasoning)
 ✅ Multi-turn conversations with context
 ✅ Function calling / tool use
 ✅ Structured JSON output extraction
 ✅ Building a multi-tool agentic loop
 ✅ Comparing thinking mode ON vs OFF
 ✅ Drop-in OpenAI SDK compatibility


📚 Next steps:
 • GLM-5 Docs:       https://docs.z.ai/guides/llm/glm-5
 • Function Calling:  https://docs.z.ai/guides/capabilities/function-calling
 • Structured Output: https://docs.z.ai/guides/capabilities/struct-output
 • Context Caching:   https://docs.z.ai/guides/capabilities/cache
 • Web Search Tool:   https://docs.z.ai/guides/tools/web-search
 • GitHub:            https://github.com/zai-org/GLM-5
 • API Keys:          https://z.ai/manage-apikey/apikey-list


💡 Pro tip: GLM-5 also supports web search and context caching
  via the API for even more powerful applications!
""")

We demonstrate that GLM-5 works as a drop-in replacement with the standard OpenAI Python SDK; we simply change the base_url, and everything works identically. We test both a standard completion for creative writing and a streaming call that lists use cases for a 744B MoE model. We wrap up with a full summary of all ten capabilities covered and links to the official docs for deeper exploration.


Check out the Full Codes Notebook here.

The post How to Build Production-Ready Agentic Systems with Z.AI GLM-5 Using Thinking Mode, Tool Calling, Streaming, and Multi-Turn Workflows appeared first on MarkTechPost.
