GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval
In this tutorial, we work with GLM-5.2 through its hosted, OpenAI-compatible API instead of running the full model locally. We start by setting up several provider options, securely loading the API key, and building a reusable chat wrapper that supports normal chat, thinking mode, streaming, tool calling, and token tracking. We then move beyond a simple chatbot example and test the model in more practical scenarios: reasoning-effort control, streamed reasoning and answers, function calling, a small tool-using agent, structured JSON output, long-context retrieval, and cost estimation.
Setting Up the GLM-5.2 OpenAI-Compatible Client and Reusable Chat Wrapper
import sys, subprocess

# Install or upgrade the OpenAI SDK quietly; check=False keeps going if pip fails.
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U", "openai"], check=False)

import os, re, json, time, getpass
from openai import OpenAI

# Each provider exposes GLM-5.2 behind an OpenAI-compatible endpoint.
PROVIDERS = {
    "zai":         {"base_url": "https://api.z.ai/api/paas/v4/",    "model": "glm-5.2",         "env": "ZAI_API_KEY"},
    "openrouter":  {"base_url": "https://openrouter.ai/api/v1",     "model": "z-ai/glm-5.2",    "env": "OPENROUTER_API_KEY"},
    "together":    {"base_url": "https://api.together.xyz/v1",      "model": "zai-org/GLM-5.2", "env": "TOGETHER_API_KEY"},
    "requesty":    {"base_url": "https://router.requesty.ai/v1",    "model": "zai/glm-5.2",     "env": "REQUESTY_API_KEY"},
    "huggingface": {"base_url": "https://router.huggingface.co/v1", "model": "zai-org/GLM-5.2", "env": "HF_TOKEN"},
}
PROVIDER = "zai"
CFG = PROVIDERS[PROVIDER]
MODEL = CFG["model"]

def load_api_key(env_name):
    # Lookup order: Colab secrets, then the environment, then an interactive prompt.
    try:
        from google.colab import userdata
        v = userdata.get(env_name)
        if v: return v
    except Exception:
        pass
    if os.environ.get(env_name):
        return os.environ[env_name]
    return getpass.getpass(f"Enter your {env_name}: ")

client = OpenAI(api_key=load_api_key(CFG["env"]), base_url=CFG["base_url"])

# USD per 1M input / output tokens; used only for the spend estimate at the end.
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40
_USAGE = {"in": 0, "out": 0, "calls": 0}

def _track(usage):
    # Accumulate token counts from a response's usage object (may be None).
    if usage:
        _USAGE["in"] += getattr(usage, "prompt_tokens", 0) or 0
        _USAGE["out"] += getattr(usage, "completion_tokens", 0) or 0
        _USAGE["calls"] += 1

def get_reasoning(obj):
    """Pull GLM's hidden reasoning trace from a message/delta (a provider-extra field)."""
    val = getattr(obj, "reasoning_content", None)
    if val: return val
    extra = getattr(obj, "model_extra", None) or {}
    if extra.get("reasoning_content"): return extra["reasoning_content"]
    try: return obj.to_dict().get("reasoning_content")
    except Exception: return None

def chat(messages, effort=None, thinking=True, tools=None, tool_choice="auto",
         stream=False, max_tokens=2048, temperature=1.0, tool_stream=False):
    """
    effort: None | "high" | "max" (GLM-5.2 thinking-effort level; max is the model default)
    thinking: True -> deep thinking on; False -> off (fast, cheap, low-latency)
    GLM-specific params go through extra_body so any OpenAI client works.
    """
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking: extra["reasoning_effort"] = effort
    if tool_stream: extra["tool_stream"] = True
    kwargs = dict(model=MODEL, messages=messages, max_tokens=max_tokens,
                  temperature=temperature, stream=stream, extra_body=extra)
    if tools:
        kwargs.update(tools=tools, tool_choice=tool_choice)
    if stream:
        # Ask for a final chunk that carries token usage.
        kwargs["stream_options"] = {"include_usage": True}
    return client.chat.completions.create(**kwargs)
This sets up the complete foundation for using GLM-5.2 through an OpenAI-compatible API. We define several provider options, load the API key securely, create the OpenAI client, and set up token and cost tracking for the whole notebook. We also build a reusable chat wrapper so that every later demo can use thinking mode, reasoning effort, streaming, tool calling, and provider-specific parameters cleanly.
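To see what the wrapper actually sends, it can help to assemble the request arguments without making a network call. The sketch below is a minimal illustration under that assumption: `build_chat_kwargs` is a hypothetical helper, not part of the tutorial's code, and it only mirrors how the GLM-specific fields are placed in `extra_body`.

```python
def build_chat_kwargs(model, messages, thinking=True, effort=None,
                      tools=None, stream=False, max_tokens=2048):
    """Hypothetical helper: build the request kwargs without sending them."""
    # GLM-specific switches travel in extra_body, outside the standard fields.
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking:
        # An effort level is only attached when thinking is enabled.
        extra["reasoning_effort"] = effort
    kwargs = dict(model=model, messages=messages, max_tokens=max_tokens,
                  stream=stream, extra_body=extra)
    if tools:
        kwargs.update(tools=tools, tool_choice="auto")
    if stream:
        kwargs["stream_options"] = {"include_usage": True}
    return kwargs


msgs = [{"role": "user", "content": "ping"}]
fast = build_chat_kwargs("glm-5.2", msgs, thinking=False, effort="max")
deep = build_chat_kwargs("glm-5.2", msgs, thinking=True, effort="high", stream=True)
print(fast["extra_body"])
print(deep["extra_body"], deep["stream_options"])
```

Note that the effort value is dropped when thinking is off, which matches the guard in the wrapper above.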
Basic Chat, Thinking-Effort Control, and Streamed Reasoning with GLM-5.2
def demo_basic():
    print("\n=== 1. BASIC CHAT / SANITY CHECK =========================")
    resp = chat([{"role": "system", "content": "You are a concise technical assistant."},
                 {"role": "user", "content": "In one sentence, what is GLM-5.2 best at?"}],
                thinking=False, max_tokens=200)
    _track(resp.usage)
    print(resp.choices[0].message.content.strip())

def demo_effort():
    print("\n=== 2. THINKING-EFFORT CONTROL (off / high / max) ========")
    problem = ("Train A leaves city A at 9:00 going 60 km/h toward city B. "
               "Train B leaves B (420 km away) at 9:30 going 90 km/h toward A. "
               "At what clock time do they meet? Show the key steps briefly.")
    # Same prompt, three settings, so latency and output tokens are comparable.
    for label, kw in [("thinking OFF", dict(thinking=False)),
                      ("effort=high", dict(thinking=True, effort="high")),
                      ("effort=max", dict(thinking=True, effort="max"))]:
        t0 = time.time()
        resp = chat([{"role": "user", "content": problem}], max_tokens=2000, **kw)
        dt = time.time() - t0
        _track(resp.usage)
        msg, u = resp.choices[0].message, resp.usage
        print(f"\n--- {label} | {dt:0.1f}s | out_tokens={getattr(u,'completion_tokens',0)} ---")
        r = get_reasoning(msg)
        if r:
            print(" [reasoning, first 220 chars]: " + " ".join(r.split())[:220] + " ...")
        print(" [answer]: " + " ".join((msg.content or '').split())[:350])

def demo_streaming():
    print("\n=== 3. STREAMING: reasoning channel vs answer channel ====")
    stream = chat([{"role": "user", "content":
                    "Explain why the sky is blue, then give a one-line TL;DR."}],
                  thinking=True, effort="high", stream=True, max_tokens=1200)
    saw_r = saw_a = False
    usage = None
    for chunk in stream:
        # The usage chunk arrives last and may carry no choices.
        if getattr(chunk, "usage", None): usage = chunk.usage
        if not chunk.choices: continue
        delta = chunk.choices[0].delta
        r = get_reasoning(delta)
        if r:
            if not saw_r: print("\n[thinking] ", end="", flush=True); saw_r = True
            print(r, end="", flush=True)
        if getattr(delta, "content", None):
            if not saw_a: print("\n\n[answer] ", end="", flush=True); saw_a = True
            print(delta.content, end="", flush=True)
    print()
    _track(usage)
We begin testing GLM-5.2 with basic chat, reasoning-effort control, and streaming output. We first run a simple sanity check, then compare the same problem across thinking-off, high-effort, and max-effort modes to observe changes in latency and output tokens. We also stream the model response so we can view the reasoning channel and the final answer separately while the response is being generated.
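When comparing the three effort settings, it is useful to know the correct answer to the train problem in advance, so that a faster mode which answers wrongly is not mistaken for a better one. The following sketch computes the reference answer directly; `meeting_time` is a hypothetical helper and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta


def meeting_time(distance_km, start_a, speed_a, start_b, speed_b):
    """Clock time at which two trains heading toward each other meet.

    Assumes train A departs no later than train B.
    """
    # Distance A covers alone before B departs.
    head_start_h = (start_b - start_a).total_seconds() / 3600
    remaining_km = distance_km - speed_a * head_start_h
    # After B departs, the gap closes at the sum of both speeds.
    hours_after_b = remaining_km / (speed_a + speed_b)
    return start_b + timedelta(hours=hours_after_b)


day = datetime(2025, 1, 1)  # arbitrary date; only the clock time matters
t = meeting_time(420, day.replace(hour=9), 60,
                 day.replace(hour=9, minute=30), 90)
print(t.strftime("%H:%M"))  # 12:06
```

Train A covers 30 km in its half-hour head start, leaving 390 km to close at 150 km/h, which takes 2.6 hours (2 h 36 min) after 9:30.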
Function Calling and a Multi-Step Tool-Using GLM-5.2 Agent
def tool_calculator(expression: str):
    # Allow only digits, arithmetic operators, parentheses, dots, spaces, and %.
    if not re.fullmatch(r"[0-9+\-*/(). %]+", expression or ""):
        return {"error": "unsupported characters"}
    try: return {"result": eval(expression, {"__builtins__": {}}, {})}
    except Exception as e: return {"error": str(e)}

_CITY_POP = {"tokyo": 37_400_068, "delhi": 32_900_000, "shanghai": 28_500_000,
             "sao paulo": 22_400_000, "mexico city": 21_800_000}

def tool_city_population(city: str):
    return {"city": city, "population": _CITY_POP.get((city or "").strip().lower())}

TOOLS = [
    {"type": "function", "function": {
        "name": "calculator", "description": "Evaluate basic arithmetic like '37400068/21800000'.",
        "parameters": {"type": "object", "properties": {"expression": {"type": "string"}},
                       "required": ["expression"]}}},
    {"type": "function", "function": {
        "name": "city_population", "description": "Look up the metro population of a city.",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                       "required": ["city"]}}},
]
TOOL_IMPLS = {"calculator": tool_calculator, "city_population": tool_city_population}

def run_tool_loop(messages, max_rounds=6, effort="max"):
    """Full loop: model -> tool_calls -> execute -> feed results back -> repeat."""
    for _ in range(max_rounds):
        resp = chat(messages, tools=TOOLS, thinking=True, effort=effort,
                    max_tokens=1500, temperature=0.3)
        _track(resp.usage)
        m = resp.choices[0].message
        # No tool calls means the model has produced its final answer.
        if not getattr(m, "tool_calls", None):
            return m.content
        # Echo the assistant turn (with its tool calls) back into the history.
        messages.append({
            "role": "assistant", "content": m.content or "",
            "tool_calls": [{"id": tc.id, "type": "function",
                            "function": {"name": tc.function.name,
                                         "arguments": tc.function.arguments}}
                           for tc in m.tool_calls]})
        for tc in m.tool_calls:
            try: args = json.loads(tc.function.arguments or "{}")
            except json.JSONDecodeError: args = {}
            result = TOOL_IMPLS.get(tc.function.name, lambda **k: {"error": "unknown"})(**args)
            print(f" ↳ {tc.function.name}({args}) -> {result}")
            messages.append({"role": "tool", "tool_call_id": tc.id,
                             "content": json.dumps(result)})
    return "(stopped: max tool rounds reached)"

def demo_tools():
    print("\n=== 4. FUNCTION / TOOL CALLING ===========================")
    q = ("How many times larger is Tokyo's metro population than Mexico City's? "
         "Use the tools, then answer with the ratio to one decimal place.")
    print("Final:", " ".join((run_tool_loop([{"role": "user", "content": q}]) or "").split()))

def demo_agent():
    print("\n=== 5. MINI MULTI-STEP AGENT (tools + max effort) ========")
    task = ("Rank Tokyo, Delhi, and Shanghai by metro population (largest first), "
            "then compute the combined population of the top two and report it. "
            "Use the tools for every lookup and sum; never guess numbers.")
    ans = run_tool_loop([{"role": "system", "content": "You are a careful analyst."},
                         {"role": "user", "content": task}])
    print("Final:", " ".join((ans or "").split()))
We connect GLM-5.2 to external tools and build a small tool-using workflow. We define a calculator and a city-population lookup tool, register them in an OpenAI-style tool schema, and create a loop in which the model requests tool calls and receives tool results. We then use this setup for a direct function-calling task and a small multi-step agent that looks up populations, ranks cities, and performs calculations without guessing.
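The calculator tool in this section relies on a character whitelist plus `eval`. As an alternative, and not what the tutorial itself does, the same tool can be written by walking Python's `ast` tree and allowing only a fixed set of operators. The sketch below is a minimal version of that swapped-in technique; `safe_arithmetic` is a hypothetical name, and exponentiation is deliberately rejected.

```python
import ast
import operator

# Whitelisted operators; anything else in the parsed tree is rejected.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv,
            ast.Mod: operator.mod}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_arithmetic(expression):
    """Evaluate +, -, *, /, % on numeric literals without calling eval."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    try:
        return {"result": walk(ast.parse(expression, mode="eval"))}
    except (SyntaxError, ValueError, ZeroDivisionError) as e:
        return {"error": str(e)}


print(safe_arithmetic("37400068 + 32900000"))  # Tokyo + Delhi from _CITY_POP
print(safe_arithmetic("37400068/21800000"))    # Tokyo / Mexico City ratio
print(safe_arithmetic("__import__('os')"))     # function calls are rejected
```

The two numeric expressions are the ones the demos in this section should lead the model to request, so their results (70,300,068 and roughly 1.7) can serve as a check on the agent's final answers.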
Structured JSON Output and Long-Context Retrieval with GLM-5.2
We focus on reliable structured output and long-context retrieval. We create a JSON extraction helper, ask the model to return a strict JSON object, and retry once if the first response is not valid JSON. We also build a synthetic long document with a hidden "needle" and send it to GLM-5.2 to check whether the model retrieves the exact launch code from the provided context.
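The approach described above can be sketched offline as follows. This is an illustration only, not the tutorial's own implementation: `extract_json`, `ask_for_json`, and `build_haystack` are hypothetical names, the model is replaced by a stub callable, and the launch code `ZX-4471-KILO` is made up for the example.

```python
import json


def extract_json(text):
    """Return the first JSON object found in text, or None."""
    text = (text or "").strip()
    candidates = [text]
    # Fall back to the outermost {...} span if the reply has extra prose.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            value = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return None


def ask_for_json(ask, prompt, retries=1):
    """ask is any callable that maps a prompt string to a reply string."""
    for _ in range(retries + 1):
        parsed = extract_json(ask(prompt))
        if parsed is not None:
            return parsed
        # Tighten the instruction before trying again.
        prompt += "\nReturn only one valid JSON object, with no extra text."
    return None


def build_haystack(needle, filler_lines=2000, position=0.6):
    """Synthetic long document with one needle line buried inside."""
    lines = [f"Log entry {i}: routine status nominal." for i in range(filler_lines)]
    lines.insert(int(filler_lines * position), needle)
    return "\n".join(lines)


# Stub model: the first reply is not JSON, the second wraps JSON in prose.
replies = iter(["Sure, here you go.",
                'Here is the object: {"city": "Tokyo", "rank": 1} Hope it helps.'])
print(ask_for_json(lambda p: next(replies), "Describe Tokyo as JSON."))

doc = build_haystack("The launch code is ZX-4471-KILO.")
print(len(doc.splitlines()), "ZX-4471-KILO" in doc)
```

In a real run, `ask` would wrap the `chat` helper from the setup section, and the retrieval check would compare the model's reply against the needle string.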
Running All Demos with GLM-5.2 Token and Cost Accounting
def cost_summary():
    print("\n=== 8. TOKEN + COST ACCOUNTING ===========================")
    cost = _USAGE["in"]/1e6*PRICE_IN_PER_M + _USAGE["out"]/1e6*PRICE_OUT_PER_M
    print(f" calls: {_USAGE['calls']} | input: {_USAGE['in']:,} tok | output: {_USAGE['out']:,} tok")
    print(f" estimated spend @ ${PRICE_IN_PER_M}/{PRICE_OUT_PER_M} per 1M: ${cost:0.4f}")

# demo_structured and demo_long_context belong to the structured-output and
# long-context section and must be defined before this cell runs.
DEMOS = [demo_basic, demo_effort, demo_streaming, demo_tools,
         demo_agent, demo_structured, demo_long_context]

print(f"Provider={PROVIDER} model={MODEL}")
for fn in DEMOS:
    # A failing demo is reported and skipped so the rest still run.
    try: fn()
    except Exception as e:
        print(f" [skipped {fn.__name__}: {type(e).__name__}: {e}]")
cost_summary()
print("\nDone. Tweak PROVIDER / effort / max_tokens and re-run any demo function.")
We finish the tutorial by collecting usage information and running all demos from top to bottom. We calculate the estimated cost from total input and output tokens, then print a compact summary of calls, token counts, and spend. We also use a driver loop so that a single failed demo does not halt the whole notebook, making the tutorial easier to run, debug, and reuse.
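As a worked example of the cost formula and the fault-tolerant driver, the sketch below runs entirely offline. The per-million prices are the ones used in this tutorial; the token counts, `estimate_cost`, `run_all`, and the two placeholder demos are hypothetical and exist only for illustration.

```python
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40  # USD per 1M tokens, as in the setup


def estimate_cost(input_tokens, output_tokens,
                  price_in=PRICE_IN_PER_M, price_out=PRICE_OUT_PER_M):
    """Estimated spend in USD for the given token totals."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out


def run_all(demos):
    """Run each callable; record a failure instead of stopping the run."""
    results = {}
    for fn in demos:
        try:
            fn()
            results[fn.__name__] = "ok"
        except Exception as e:
            results[fn.__name__] = f"skipped: {type(e).__name__}"
    return results


def demo_ok():
    pass


def demo_broken():
    raise RuntimeError("provider unavailable")


# Made-up totals: 120,000 input tokens and 30,000 output tokens.
print(f"${estimate_cost(120_000, 30_000):.4f}")  # $0.3000
print(run_all([demo_ok, demo_broken]))
```

With these totals, input costs 0.12 × 1.40 = $0.168 and output costs 0.03 × 4.40 = $0.132, for $0.30 overall. Output tokens dominate quickly at higher effort settings because reasoning tokens are billed as output.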
Conclusion
In conclusion, we now have a practical and reusable workflow for using GLM-5.2 in Python applications. We learned how to control its reasoning behavior, compare different thinking modes, connect it with tools, validate structured outputs, test long-context inputs, and track token usage with estimated cost. This gives us a strong starting point for building more advanced systems such as research assistants, document analysis tools, coding agents, long-context retrieval workflows, or API-based reasoning pipelines. We end with a setup that is lightweight enough for Colab but still close to how we would build with GLM-5.2 in a real project.
The post GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval appeared first on MarkTechPost.
