Zhipu AI Releases GLM-4.6V: A 128K Context Vision Language Model with Native Tool Calling
Zhipu AI has open sourced the GLM-4.6V series, a pair of vision language models that treat images, video and tools as first-class inputs for agents, not as afterthoughts bolted on top of text.
Model lineup and context length
The series has two models. GLM-4.6V is a 106B parameter foundation model for cloud and high performance cluster workloads. GLM-4.6V-Flash is a 9B parameter variant tuned for local deployment and low latency use.
GLM-4.6V extends the training context window to 128K tokens. In practice this supports roughly 150 pages of dense documents, 200 slides, or one hour of video in a single pass, because pages are encoded as images and consumed by the visual encoder.
Native multimodal tool use
The main technical change is native multimodal Function Calling. Traditional tool use in LLM systems routes everything through text: images or pages are first turned into descriptions, the model calls tools using text arguments, and then it reads textual responses. This loses information and increases latency.
GLM-4.6V instead passes images, screenshots and document pages directly as tool parameters. Tools can return search result grids, charts, rendered web pages or product images. The model consumes these visual outputs and fuses them with text in the same reasoning chain. This closes the loop from perception to understanding to execution and is explicitly positioned as the bridge between visual perception and executable action for multimodal agents.
To support this, Zhipu AI extends the Model Context Protocol (MCP) with URL based multimodal handling. Tools receive and return URLs that identify specific images or frames, which avoids file size limits and allows precise selection within multi image contexts.
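The pattern is easy to sketch against an OpenAI-compatible chat endpoint. The snippet below is a minimal illustration, not Zhipu AI's documented API: the base URL, the `glm-4.6v` model id, the `image_search` tool, and the exact shape of the image-bearing tool result are all assumptions.

```python
# Minimal sketch of URL-based multimodal tool calling via an OpenAI-compatible
# client. Endpoint, model id, tool name and tool-result format are assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="https://<your-glm-endpoint>/v1", api_key="YOUR_KEY")

# A tool that accepts a text query and returns image URLs, so the model can
# request visual evidence without round-tripping pixels through descriptions.
tools = [{
    "type": "function",
    "function": {
        "name": "image_search",  # hypothetical tool
        "description": "Search the web and return image result URLs.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Find comparable products and compare them to this one."},
        {"type": "image_url", "image_url": {"url": "https://example.com/product_a.jpg"}},
    ],
}]

resp = client.chat.completions.create(model="glm-4.6v", messages=messages, tools=tools)
msg = resp.choices[0].message
call = msg.tool_calls[0]  # assumes the model chose to call the tool

# Execute the tool yourself, then hand the resulting image URLs back so the
# model can reason over the returned visuals in the same chain.
result_urls = ["https://example.com/result_1.jpg"]  # placeholder tool output
messages.append(msg)
messages.append({"role": "tool", "tool_call_id": call.id,
                 "content": json.dumps({"image_urls": result_urls})})
final = client.chat.completions.create(model="glm-4.6v", messages=messages, tools=tools)
print(final.choices[0].message.content)
```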
Rich text content, web search and frontend replication
The Zhipu AI research team describes four canonical scenarios:
First, rich text content understanding and creation. GLM-4.6V reads mixed inputs such as papers, reports or slide decks and produces structured image-text interleaved outputs. It understands text, charts, figures, tables and formulas in the same document. During generation it can crop relevant visuals or retrieve external images through tools, then run a visual audit step that filters low quality images and composes the final article with inline figures.
Second, visual web search. The model can detect user intent, plan which search tools to call and combine text-to-image and image-to-text search. It then aligns retrieved images and text, selects the relevant evidence and outputs a structured answer, for example a visual comparison of products or places.
Third, frontend replication and visual interaction. GLM-4.6V is tuned for design-to-code workflows. From a UI screenshot, it reconstructs pixel accurate HTML, CSS and JavaScript. Developers can then mark a region on the screenshot and issue natural language instructions, for example move this button left or change this card background. The model maps these instructions back to the code and returns an updated snippet.
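As a rough illustration of that workflow, the sketch below sends a screenshot to an OpenAI-compatible endpoint, asks for a full HTML reconstruction, then requests a targeted edit by describing the marked region as pixel coordinates in the prompt. The endpoint, model id and region convention are assumptions, not documented behavior.

```python
# Minimal sketch of a design-to-code round trip with a follow-up region edit.
from openai import OpenAI

client = OpenAI(base_url="https://<your-glm-endpoint>/v1", api_key="YOUR_KEY")
screenshot = {"type": "image_url", "image_url": {"url": "https://example.com/ui.png"}}

# Step 1: reconstruct the page from the screenshot.
first = client.chat.completions.create(
    model="glm-4.6v",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Reproduce this UI as a single HTML file with inline CSS and JavaScript."},
        screenshot,
    ]}],
)
html = first.choices[0].message.content

# Step 2: point at a region and ask for a targeted, natural language edit.
edit_prompt = (
    "In the region around (x=640, y=120, w=180, h=48), move the button 24px "
    "to the left. Return only the changed snippet.\n\n" + html
)
edit = client.chat.completions.create(
    model="glm-4.6v",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": edit_prompt},
        screenshot,
    ]}],
)
print(edit.choices[0].message.content)
```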
Fourth, multimodal document understanding at long context. GLM-4.6V can read multi document inputs up to the 128K token context limit by treating pages as images. The research team reports a case where the model processes financial reports from four public companies, extracts core metrics and builds a comparison table, and a case where it summarizes a full football match while retaining the ability to answer questions about specific goals and timestamps.
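A minimal sketch of that long-context pattern, assuming the report pages have already been rendered to images at placeholder URLs and the model is served behind an OpenAI-compatible endpoint (page URLs, model id and endpoint are illustrative):

```python
# Minimal sketch of long-context document QA by sending report pages as images.
from openai import OpenAI

client = OpenAI(base_url="https://<your-glm-endpoint>/v1", api_key="YOUR_KEY")

# Four companies, ~30 rendered pages each; in practice these would be real
# image URLs produced by a PDF-to-image step.
page_urls = [f"https://example.com/report_{c}/page_{p}.png"
             for c in ("acme", "globex", "initech", "umbrella")
             for p in range(1, 31)]

content = [{"type": "text",
            "text": "Extract revenue, net income and year-over-year growth for "
                    "each company and return a single comparison table in Markdown."}]
content += [{"type": "image_url", "image_url": {"url": u}} for u in page_urls]

resp = client.chat.completions.create(
    model="glm-4.6v",
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)
```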
Architecture, data and reinforcement learning
The GLM-4.6V models belong to the GLM-V family and build on the technical reports for GLM-4.5V and GLM-4.1V-Thinking. The research team highlights three main technical components.
First, long sequence modeling. GLM-4.6V extends the training context window to 128K tokens and runs continual pre-training on large long context image-text corpora. It uses compression alignment ideas from Glyph so that visual tokens can carry dense information that is aligned with language tokens.
Second, world knowledge enhancement. The Zhipu AI team adds a billion scale multimodal perception and world knowledge dataset at pre-training time. This covers layered encyclopedic concepts and everyday visual entities. The stated goal is to improve both basic perception and cross modal question answering completeness, not only benchmarks.
Third, agentic data synthesis and extended MCP. The research team generates large synthetic traces where the model calls tools, processes visual outputs and iterates on plans. They extend MCP with URL based multimodal handling and an interleaved output mechanism. The generation stack follows a Draft, Image Selection, Final Polish sequence. The model can autonomously call cropping or search tools between these stages to place images at the right positions in the output.
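The article does not specify how the Draft, Image Selection, Final Polish stages are exposed, so the sketch below simply drives them as three successive turns and lets the model call a hypothetical `crop_image` tool in between. Stage prompts, tool names and dispatch logic are illustrative assumptions, not Zhipu AI's pipeline.

```python
# Minimal sketch of a Draft -> Image Selection -> Final Polish loop with an
# optional cropping tool between stages. All names here are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="https://<your-glm-endpoint>/v1", api_key="YOUR_KEY")

def crop_image(url: str, box: list) -> dict:
    # Placeholder: a real tool would crop the source image and return a new URL.
    return {"image_url": f"{url}#crop={','.join(map(str, box))}"}

TOOLS = [{"type": "function", "function": {
    "name": "crop_image",
    "description": "Crop a source image and return a URL to the cropped region.",
    "parameters": {"type": "object", "properties": {
        "url": {"type": "string"},
        "box": {"type": "array", "items": {"type": "integer"}}},
        "required": ["url", "box"]}}}]

messages = [{"role": "user", "content":
             "Write an illustrated summary of the attached slide deck in three "
             "stages: draft, image selection, final polish."}]

for stage in ("Draft the article.",
              "Select and place the figures.",
              "Polish the final interleaved output."):
    messages.append({"role": "user", "content": stage})
    resp = client.chat.completions.create(model="glm-4.6v",
                                          messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    messages.append(msg)
    for call in msg.tool_calls or []:  # e.g. cropping between stages
        args = json.loads(call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(crop_image(**args))})

print(msg.content)
```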
Tool invocation is part of the reinforcement learning objective. GLM-4.6V uses RL to align planning, instruction following and format adherence in complex tool chains.
Key Takeaways
- GLM-4.6V is a 106B multimodal foundation model with a 128K token training context, and GLM-4.6V-Flash is a 9B variant optimized for local and low latency use.
- Both models support native multimodal Function Calling, so tools can consume and return images, video frames and document pages directly, which links visual perception to executable actions for agents.
- GLM-4.6V is trained for long context multimodal understanding and interleaved generation, so it can read large mixed document sets and emit structured text with inline figures and tool selected images in a single pass.
- The series achieves state-of-the-art performance on major multimodal benchmarks at comparable parameter scales and is released as open source weights under the MIT license on Hugging Face and ModelScope.
