GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval

In this tutorial, we work with GLM-5.2 through its hosted, OpenAI-compatible API instead of running the full model locally. We start by setting up several provider options, securely loading the API key, and creating a reusable chat wrapper that supports normal chat, thinking mode, streaming, tool calling, and token tracking. Then we move beyond a simple chatbot example and test the model in more practical situations, including reasoning-effort control, streamed reasoning and answers, function calling, a small tool-using agent, structured JSON output, long-context retrieval, and cost estimation.

Setting Up the GLM-5.2 OpenAI-Compatible Client and Reusable Chat Wrapper

import sys, subprocess
# Install or upgrade the OpenAI SDK quietly; check=False means a pip failure does not abort the notebook.
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U", "openai"], check=False)
import os, re, json, time, getpass
from openai import OpenAI
# OpenAI-compatible endpoints serving GLM-5.2: base URL, model ID, and the variable holding the API key.
PROVIDERS = {
    "zai":         {"base_url": "https://api.z.ai/api/paas/v4/",   "model": "glm-5.2",        "env": "ZAI_API_KEY"},
    "openrouter":  {"base_url": "https://openrouter.ai/api/v1",    "model": "z-ai/glm-5.2",   "env": "OPENROUTER_API_KEY"},
    "together":    {"base_url": "https://api.together.xyz/v1",     "model": "zai-org/GLM-5.2","env": "TOGETHER_API_KEY"},
    "requesty":    {"base_url": "https://router.requesty.ai/v1",   "model": "zai/glm-5.2",    "env": "REQUESTY_API_KEY"},
    "huggingface": {"base_url": "https://router.huggingface.co/v1","model": "zai-org/GLM-5.2","env": "HF_TOKEN"},
}
PROVIDER = "zai"
CFG   = PROVIDERS[PROVIDER]
MODEL = CFG["model"]
def load_api_key(env_name):
    # Prefer Colab secrets, then the environment, then an interactive prompt.
    try:
        from google.colab import userdata
        v = userdata.get(env_name)
        if v: return v
    except Exception:
        pass
    if os.environ.get(env_name):
        return os.environ[env_name]
    return getpass.getpass(f"Enter your {env_name}: ")
client = OpenAI(api_key=load_api_key(CFG["env"]), base_url=CFG["base_url"])
# Prices in USD per 1M tokens, used only for the cost estimate at the end.
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40
_USAGE = {"in": 0, "out": 0, "calls": 0}
def _track(usage):
    # Accumulate token counts from a response's usage object (ignored when None).
    if usage:
        _USAGE["in"]    += getattr(usage, "prompt_tokens", 0) or 0
        _USAGE["out"]   += getattr(usage, "completion_tokens", 0) or 0
        _USAGE["calls"] += 1
def get_reasoning(obj):
    """Pull GLM's hidden reasoning trace from a message/delta (a provider-extra field)."""
    val = getattr(obj, "reasoning_content", None)
    if val: return val
    extra = getattr(obj, "model_extra", None) or {}
    if extra.get("reasoning_content"): return extra["reasoning_content"]
    try:
        return obj.to_dict().get("reasoning_content")
    except Exception:
        return None
def chat(messages, effort=None, thinking=True, tools=None, tool_choice="auto",
         stream=False, max_tokens=2048, temperature=1.0, tool_stream=False):
    """
    effort:   None | "high" | "max"   (GLM-5.2 thinking-effort level; max is the model default)
    thinking: True -> deep thinking on; False -> off (fast, cheap, low-latency)
    GLM-specific params go through extra_body so any OpenAI client works.
    """
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking: extra["reasoning_effort"] = effort
    if tool_stream:         extra["tool_stream"] = True
    kwargs = dict(model=MODEL, messages=messages, max_tokens=max_tokens,
                  temperature=temperature, stream=stream, extra_body=extra)
    if tools:
        kwargs.update(tools=tools, tool_choice=tool_choice)
    if stream:
        # Ask for a final chunk carrying token usage so streamed calls can be tracked too.
        kwargs["stream_options"] = {"include_usage": True}
    return client.chat.completions.create(**kwargs)

We set up the complete foundation for using GLM-5.2 through an OpenAI-compatible API. We define several provider options, load the API key securely, create the OpenAI client, and set up token-cost tracking for the entire notebook. We also build a reusable chat wrapper so that every subsequent demo can use thinking mode, reasoning effort, streaming, tool calling, and provider-specific parameters cleanly.
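To see what the wrapper actually sends, it can help to rebuild just the request arguments offline, without contacting any provider. The snippet below is a minimal sketch: the helper name `build_request` is hypothetical, and the `thinking` and `reasoning_effort` fields simply mirror what the tutorial's `chat()` wrapper places in `extra_body`. Whether a given provider honors those fields is an assumption to check against that provider's documentation.

```python
from typing import Any, Dict, List, Optional


def build_request(model: str, messages: List[dict], effort: Optional[str] = None,
                  thinking: bool = True, stream: bool = False,
                  max_tokens: int = 2048) -> Dict[str, Any]:
    """Assemble chat-completion keyword arguments without making a network call.

    Hypothetical helper: it mirrors the argument-building part of the
    tutorial's chat() wrapper so the payload can be inspected offline.
    """
    # Provider-specific switches travel in extra_body (field names taken from the tutorial).
    extra: Dict[str, Any] = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking:
        # An effort level is only attached when thinking is enabled.
        extra["reasoning_effort"] = effort
    kwargs: Dict[str, Any] = dict(model=model, messages=messages,
                                  max_tokens=max_tokens, stream=stream,
                                  extra_body=extra)
    if stream:
        # Request a final usage chunk so streamed calls can still be counted.
        kwargs["stream_options"] = {"include_usage": True}
    return kwargs


req = build_request("glm-5.2", [{"role": "user", "content": "hi"}],
                    effort="high", thinking=False)
print(req["extra_body"])  # {'thinking': {'type': 'disabled'}}
```

Note that the effort level is dropped when thinking is disabled, which is why the printed payload contains only the `thinking` switch.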

Basic Chat, Thinking-Effort Control, and Streamed Reasoning with GLM-5.2

def demo_basic():
    print("\n=== 1. BASIC CHAT / SANITY CHECK =========================")
    resp = chat([{"role": "system", "content": "You are a concise technical assistant."},
                 {"role": "user",   "content": "In one sentence, what is GLM-5.2 best at?"}],
                thinking=False, max_tokens=200)
    _track(resp.usage)
    print((resp.choices[0].message.content or "").strip())
def demo_effort():
    print("\n=== 2. THINKING-EFFORT CONTROL (off / high / max) ========")
    problem = ("Train A leaves city A at 9:00 going 60 km/h toward city B. "
               "Train B leaves B (420 km away) at 9:30 going 90 km/h toward A. "
               "At what clock time do they meet? Show the key steps briefly.")
    # Run the same problem under three settings and compare latency and output tokens.
    for label, kw in [("thinking OFF", dict(thinking=False)),
                      ("effort=high",  dict(thinking=True, effort="high")),
                      ("effort=max",   dict(thinking=True, effort="max"))]:
        t0 = time.time()
        resp = chat([{"role": "user", "content": problem}], max_tokens=2000, **kw)
        dt = time.time() - t0
        _track(resp.usage)
        msg, u = resp.choices[0].message, resp.usage
        print(f"\n--- {label} | {dt:0.1f}s | out_tokens={getattr(u,'completion_tokens',0)} ---")
        r = get_reasoning(msg)
        if r:
            print("  [reasoning, first 220 chars]: " + " ".join(r.split())[:220] + " ...")
        print("  [answer]: " + " ".join((msg.content or '').split())[:350])
def demo_streaming():
    print("\n=== 3. STREAMING: reasoning channel vs answer channel ====")
    stream = chat([{"role": "user", "content":
                    "Explain why the sky is blue, then give a one-line TL;DR."}],
                  thinking=True, effort="high", stream=True, max_tokens=1200)
    saw_r = saw_a = False
    usage = None
    for chunk in stream:
        # The usage-only chunk arrives with no choices, so record it before skipping.
        if getattr(chunk, "usage", None): usage = chunk.usage
        if not chunk.choices: continue
        delta = chunk.choices[0].delta
        r = get_reasoning(delta)
        if r:
            if not saw_r: print("\n[thinking] ", end="", flush=True); saw_r = True
            print(r, end="", flush=True)
        if getattr(delta, "content", None):
            if not saw_a: print("\n\n[answer] ", end="", flush=True); saw_a = True
            print(delta.content, end="", flush=True)
    print()
    _track(usage)

We begin testing GLM-5.2 with basic chat, reasoning-effort control, and streaming output. We first run a simple sanity check, then compare the same problem across thinking-off, high-effort, and max-effort modes to observe changes in latency and output tokens. We also stream the model response so we can view the reasoning channel and the final answer separately as the response is being generated.
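When comparing the three effort settings, it is useful to know the correct answer to the train problem in advance. The sketch below works it out directly: by 9:30 Train A has covered 30 km, leaving a 390 km gap that closes at 60 + 90 = 150 km/h, so the trains meet 2.6 hours later. The function name `meeting_time` and the placeholder calendar date are illustrative only, and the helper assumes the second train departs no earlier than the first.

```python
from datetime import datetime, timedelta


def meeting_time(distance_km: float, speed_a: float, start_a: datetime,
                 speed_b: float, start_b: datetime) -> datetime:
    """Clock time at which two trains heading toward each other meet.

    Assumes train B departs no earlier than train A and that A has not
    already covered the full distance by then.
    """
    head_start_h = (start_b - start_a).total_seconds() / 3600
    # Distance still separating the trains when B departs.
    gap_km = distance_km - speed_a * head_start_h
    # Both trains now close the gap together.
    hours_to_meet = gap_km / (speed_a + speed_b)
    return start_b + timedelta(hours=hours_to_meet)


# The date is an arbitrary placeholder; only the clock time matters.
t = meeting_time(420, 60, datetime(2024, 1, 1, 9, 0),
                 90, datetime(2024, 1, 1, 9, 30))
print(t.strftime("%H:%M"))  # 12:06
```

Any of the three modes should arrive at 12:06; what differs between them is how many output tokens and how much time the model spends getting there.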

Function Calling and a Multi-Step Tool-Using GLM-5.2 Agent

def tool_calculator(expression: str):
    # Allow only digits, arithmetic operators, parentheses, dots, spaces, and %.
    if not re.fullmatch(r"[0-9+\-*/(). %]+", expression or ""):
        return {"error": "unsupported characters"}
    try:
        # Builtins are stripped so the expression cannot reach Python functions.
        return {"result": eval(expression, {"__builtins__": {}}, {})}
    except Exception as e:
        return {"error": str(e)}
_CITY_POP = {"tokyo": 37_400_068, "delhi": 32_900_000, "shanghai": 28_500_000,
             "sao paulo": 22_400_000, "mexico city": 21_800_000}
def tool_city_population(city: str):
    return {"city": city, "population": _CITY_POP.get((city or "").strip().lower())}
TOOLS = [
    {"type": "function", "function": {
        "name": "calculator", "description": "Evaluate basic arithmetic like '37400068/21800000'.",
        "parameters": {"type": "object", "properties": {"expression": {"type": "string"}},
                       "required": ["expression"]}}},
    {"type": "function", "function": {
        "name": "city_population", "description": "Look up the metro population of a city.",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                       "required": ["city"]}}},
]
TOOL_IMPLS = {"calculator": tool_calculator, "city_population": tool_city_population}
def run_tool_loop(messages, max_rounds=6, effort="max"):
    """Full loop: model -> tool_calls -> execute -> feed results back -> repeat."""
    for _ in range(max_rounds):
        resp = chat(messages, tools=TOOLS, thinking=True, effort=effort,
                    max_tokens=1500, temperature=0.3)
        _track(resp.usage)
        m = resp.choices[0].message
        # No tool calls means the model has produced its final answer.
        if not getattr(m, "tool_calls", None):
            return m.content
        # Echo the assistant turn (with its tool calls) back into the history.
        messages.append({
            "role": "assistant", "content": m.content or "",
            "tool_calls": [{"id": tc.id, "type": "function",
                            "function": {"name": tc.function.name,
                                         "arguments": tc.function.arguments}}
                           for tc in m.tool_calls]})
        for tc in m.tool_calls:
            try:
                args = json.loads(tc.function.arguments or "{}")
            except json.JSONDecodeError:
                args = {}
            result = TOOL_IMPLS.get(tc.function.name, lambda **k: {"error": "unknown"})(**args)
            print(f"   ↳ {tc.function.name}({args}) -> {result}")
            # Each result is returned as a "tool" message tied to the originating call ID.
            messages.append({"role": "tool", "tool_call_id": tc.id,
                             "content": json.dumps(result)})
    return "(stopped: max tool rounds reached)"
def demo_tools():
    print("\n=== 4. FUNCTION / TOOL CALLING ===========================")
    q = ("How many times larger is Tokyo's metro population than Mexico City's? "
         "Use the tools, then answer with the ratio to one decimal place.")
    print("Final:", " ".join((run_tool_loop([{"role": "user", "content": q}]) or "").split()))
def demo_agent():
    print("\n=== 5. MINI MULTI-STEP AGENT (tools + max effort) ========")
    task = ("Rank Tokyo, Delhi, and Shanghai by metro population (largest first), "
            "then compute the combined population of the top two and report it. "
            "Use the tools for every lookup and sum; never guess numbers.")
    ans = run_tool_loop([{"role": "system", "content": "You are a careful analyst."},
                         {"role": "user",   "content": task}])
    print("Final:", " ".join((ans or "").split()))

We connect GLM-5.2 to external tools and build a small tool-using workflow. We define a calculator and a city-population lookup tool, register them in an OpenAI-style tool schema, and create a loop in which the model requests tool calls and receives tool results. We then use this setup for a direct function-calling task and a small multi-step agent that looks up populations, ranks cities, and performs calculations without guessing.
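The calculator tool above relies on a character whitelist plus `eval` with builtins removed. One possible hardening, shown here as a sketch rather than as the tutorial's own implementation, is to parse the expression with Python's `ast` module and evaluate only an explicit set of arithmetic nodes. The names `safe_eval` and `tool_calculator_ast` are hypothetical; exponentiation is deliberately left out because a character whitelist still admits `**`, which can produce very large intermediate values.

```python
import ast
import operator

# Only these operators are evaluated; anything else is rejected.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv,
            ast.Mod: operator.mod}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str):
    """Evaluate +, -, *, /, % and parentheses over numeric literals only."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Exact type check so that booleans and strings are rejected.
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")
    return _eval(ast.parse(expression, mode="eval"))


def tool_calculator_ast(expression: str) -> dict:
    """Variant of the calculator tool returning the same result/error dict shape."""
    try:
        return {"result": safe_eval(expression)}
    except (ValueError, SyntaxError, ZeroDivisionError) as e:
        return {"error": str(e)}


print(tool_calculator_ast("(1 + 2) * 3"))       # {'result': 9}
print(tool_calculator_ast("__import__('os')"))  # an error dict, since calls are rejected
```

Because it returns the same dictionary shape, a function like this could be registered in `TOOL_IMPLS` under the `calculator` name without changing the tool schema.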

Structured JSON Output and Long-Context Retrieval with GLM-5.2

We focus on reliable, structured output and long-context retrieval. We create a JSON extraction helper, ask the model to return a strict JSON object, and retry once if the first response is not valid JSON. We also build a synthetic long document with a hidden "needle" and send it to GLM-5.2 to check whether the model retrieves the exact launch code from the provided context.
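The pattern described here can be sketched offline as follows. This is an illustrative outline under stated assumptions, not the tutorial's own code: `extract_json`, `ask_json`, and `build_haystack` are hypothetical names, `ask` stands for any callable that takes a message list and returns the model's reply text (for example, a thin adapter around the `chat()` wrapper), and the launch code `ZX-4471` is a made-up placeholder.

```python
import json
from typing import Callable, List, Optional


def extract_json(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in text, or None if nothing parses."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, i)  # parse starting at this brace
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


def ask_json(ask: Callable[[List[dict]], str], prompt: str,
             retries: int = 1) -> Optional[dict]:
    """Ask for a JSON object and retry with a corrective message if parsing fails."""
    messages = [
        {"role": "system", "content": "Reply with a single JSON object and nothing else."},
        {"role": "user", "content": prompt},
    ]
    for _ in range(retries + 1):
        reply = ask(messages)
        parsed = extract_json(reply)
        if parsed is not None:
            return parsed
        # Show the model its invalid reply and ask again.
        messages += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": "That was not valid JSON. Return only the JSON object."},
        ]
    return None


def build_haystack(needle: str, n_paragraphs: int = 200, position: float = 0.5) -> str:
    """Build a long filler document with one needle sentence at a relative position."""
    filler = [f"Section {i}: routine maintenance log entry with no notable events."
              for i in range(n_paragraphs)]
    filler.insert(int(n_paragraphs * position), needle)
    return "\n".join(filler)


# Offline stand-in for the model: the first reply is invalid, the second is valid JSON.
_replies = iter(["Sure, here you go: not json", '{"code": "ZX-4471"}'])


def fake_ask(messages: List[dict]) -> str:
    return next(_replies)


doc = build_haystack("The launch code is ZX-4471.")
print(ask_json(fake_ask, "Find the launch code in this document:\n" + doc))
# {'code': 'ZX-4471'}
```

Against a live model, the returned code would then be compared with the needle that was planted, which turns the long-context test into a simple pass or fail check.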

Running All Demos with GLM-5.2 Token and Cost Accounting

def cost_summary():
    print("\n=== 8. TOKEN + COST ACCOUNTING ===========================")
    cost = _USAGE["in"]/1e6*PRICE_IN_PER_M + _USAGE["out"]/1e6*PRICE_OUT_PER_M
    print(f"  calls: {_USAGE['calls']} | input: {_USAGE['in']:,} tok | output: {_USAGE['out']:,} tok")
    print(f"  estimated spend @ ${PRICE_IN_PER_M}/{PRICE_OUT_PER_M} per 1M: ${cost:0.4f}")
# Demos are looked up by name so that one not defined in this session is skipped
# instead of raising NameError before the loop starts.
DEMO_NAMES = ["demo_basic", "demo_effort", "demo_streaming", "demo_tools",
              "demo_agent", "demo_structured", "demo_long_context"]
print(f"Provider={PROVIDER}   model={MODEL}")
for name in DEMO_NAMES:
    fn = globals().get(name)
    if fn is None:
        print(f"  [skipped {name}: not defined]")
        continue
    try:
        fn()
    except Exception as e:
        print(f"  [skipped {name}: {type(e).__name__}: {e}]")
cost_summary()
print("\nDone. Tweak PROVIDER / effort / max_tokens and re-run any demo function.")

We finish the tutorial by collecting usage information and running all demos from top to bottom. We calculate the estimated cost from total input and output tokens, then print a compact summary of calls, token counts, and spend. We also use a driver loop so that a single failed demo does not halt the entire notebook, making the tutorial easier to run, debug, and reuse.
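The cost estimate is simple arithmetic: input tokens divided by one million times the input price, plus the same for output tokens. The sketch below repeats that calculation with made-up token counts so the numbers can be checked by hand. The class name `UsageTracker` is hypothetical, and the default prices are the constants used in this tutorial, which may not match what a given provider currently charges.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageTracker:
    """Accumulates token counts and estimates spend (prices in USD per 1M tokens)."""
    price_in_per_m: float = 1.40   # tutorial constant; verify against current pricing
    price_out_per_m: float = 4.40  # tutorial constant; verify against current pricing
    tokens_in: int = 0
    tokens_out: int = 0
    calls: int = 0

    def add(self, prompt_tokens: Optional[int], completion_tokens: Optional[int]) -> None:
        # Treat missing counts as zero, as the tutorial's _track() helper does.
        self.tokens_in += prompt_tokens or 0
        self.tokens_out += completion_tokens or 0
        self.calls += 1

    @property
    def cost(self) -> float:
        return (self.tokens_in / 1e6 * self.price_in_per_m
                + self.tokens_out / 1e6 * self.price_out_per_m)


tracker = UsageTracker()
tracker.add(8_000, 1_500)   # illustrative token counts, not measured values
tracker.add(4_000, 2_000)
# 12,000 in  -> 0.012  * 1.40 = 0.0168
#  3,500 out -> 0.0035 * 4.40 = 0.0154
print(f"calls={tracker.calls} in={tracker.tokens_in:,} "
      f"out={tracker.tokens_out:,} est=${tracker.cost:.4f}")
# calls=2 in=12,000 out=3,500 est=$0.0322
```

Since output tokens cost roughly three times as much as input tokens at these rates, higher reasoning effort tends to dominate the bill when it produces long reasoning traces.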

Conclusion

In conclusion, we now have a practical and reusable workflow for using GLM-5.2 in Python applications. We learned how to control its reasoning behavior, compare different thinking modes, connect it with tools, validate structured outputs, test long-context inputs, and track token usage with estimated cost. This gives us a strong starting point for building more advanced systems such as research assistants, document analysis tools, coding agents, long-context retrieval workflows, or API-based reasoning pipelines. We finished with a setup that is lightweight enough for Colab but still close to how we would build with GLM-5.2 in a real project.

The post GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval appeared first on MarkTechPost.
