Zhipu AI has open sourced the GLM-4.6V series, a pair of vision-language models that treat images, video, and tools as first-class inputs for agents, not as afterthoughts bolted on top of text.
Model lineup and context length
The series has two models. GLM-4.6V is a 106B-parameter foundation model for cloud and high-performance cluster workloads. GLM-4.6V-Flash is a 9B-parameter variant tuned for local deployment and low-latency use.
GLM-4.6V extends the training context window to 128K tokens. In practice this supports roughly 150 pages of dense documents, 200 slide pages, or one hour of video in a single pass, because pages are encoded as images and consumed by the visual encoder.
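As a rough back-of-envelope check of that budget, 128K tokens spread across ~150 pages leaves on the order of a few hundred visual tokens per page. The per-page reserve below is an illustrative assumption, not a published figure:

```python
# Rough budget for long-document inputs: 128K tokens shared across pages
# encoded as images. The text reserve is an illustrative assumption.
CONTEXT_TOKENS = 128 * 1024

def tokens_per_page(num_pages: int, reserve_for_text: int = 8_000) -> int:
    """Average visual-token budget per page after reserving room for text."""
    return (CONTEXT_TOKENS - reserve_for_text) // num_pages

print(tokens_per_page(150))  # dense document, ~150 pages
print(tokens_per_page(200))  # slide deck, ~200 pages
```

Numbers at this scale make it plausible that whole pages can be carried as compressed visual tokens rather than OCR text.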
Native multimodal tool use
The main technical change is native multimodal function calling. Traditional tool use in LLM systems routes everything through text. Images or pages are first turned into descriptions, the model calls tools using text arguments, and then reads textual responses. This discards information and increases latency.
GLM-4.6V introduces native multimodal function calling. Images, screenshots, and document pages pass directly as tool parameters. Tools can return search result grids, charts, rendered web pages, or product images. The model consumes these visual outputs and fuses them with text in the same reasoning chain. This closes the loop from perception to understanding to execution and is explicitly positioned as the bridge between visual perception and executable action for multimodal agents.
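A minimal sketch of what such a loop can look like in an OpenAI-style chat format. The tool name, URLs, and message schema here are hypothetical illustrations, not the official GLM-4.6V API:

```python
# Hypothetical multimodal function-calling exchange: the tool returns an
# image URL part, which the model receives as a visual input, not a caption.
def tool_image_search(query: str) -> list[dict]:
    # A real tool would hit a search backend; this stub returns image parts.
    return [{"type": "image_url",
             "image_url": {"url": f"https://example.com/search/{query}/0.png"}}]

messages = [
    {"role": "user", "content": "Compare these two phone listings visually."},
    # The model decides to call the (hypothetical) image_search tool.
    {"role": "assistant", "tool_calls": [
        {"id": "call_0", "type": "function",
         "function": {"name": "image_search",
                      "arguments": '{"query": "phone-a"}'}}]},
    # The tool result carries the images themselves, not a text description.
    {"role": "tool", "tool_call_id": "call_0",
     "content": tool_image_search("phone-a")},
]

assert messages[-1]["content"][0]["type"] == "image_url"
```

The key difference from text-only tool use is in the final message: the tool result is a list of image parts the model perceives directly, so no information is lost to an intermediate caption.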
To support this, Zhipu AI extends the Model Context Protocol with URL-based multimodal handling. Tools receive and return URLs that identify specific images or frames, which avoids file size limits and enables precise selection within multi-image contexts.
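The announcement does not spell out the exact URL scheme. As one illustration of the idea, a fragment-style convention could address individual video frames by reference instead of shipping raw bytes:

```python
# Illustrative URL-based addressing for multi-image/video contexts.
# The fragment convention (#frame=N) is an assumption for this sketch,
# not the documented format of Zhipu AI's MCP extension.
def frame_url(video_url: str, frame_index: int) -> str:
    """Address a single frame of a video by URL instead of uploading bytes."""
    return f"{video_url}#frame={frame_index}"

# A tool can hand back a short list of frame references for the model to inspect.
refs = [frame_url("https://example.com/match.mp4", i) for i in (120, 451)]
print(refs)
```

Passing references like these keeps tool payloads small and lets the model pick out exactly one frame from a long video or one image from a large grid.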
Rich text content, web search, and frontend replication
The Zhipu AI research team describes four canonical scenarios:
First, rich text content understanding and creation. GLM-4.6V reads mixed inputs such as papers, reports, or slide decks and produces structured, image-text interleaved outputs. It understands text, charts, figures, tables, and formulas in the same document. During generation it can crop relevant visuals or retrieve external images through tools, then run a visual audit step that filters low-quality images and composes the final article with inline figures.
Second, visual web search. The model can detect user intent, plan which search tools to call, and combine text-to-image and image-to-text search. It then aligns retrieved images and text, selects the relevant evidence, and outputs a structured answer, for example a visual comparison of products or places.
Third, frontend replication and visual interaction. GLM-4.6V is tuned for design-to-code workflows. From a UI screenshot, it reconstructs pixel-accurate HTML, CSS, and JavaScript. Developers can then mark a region on the screenshot and issue natural language instructions, for example move this button left or change this card background. The model maps these instructions back to the code and returns an updated snippet.
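A hedged sketch of how a region-plus-instruction edit could be packaged for the model. The field names, coordinates, and prompt format below are assumptions for illustration, not a documented request schema:

```python
# Hypothetical payload for a visual edit: a screenshot, a marked region,
# and a natural-language instruction the model maps back to the code.
edit_request = {
    "screenshot_url": "https://example.com/ui.png",
    "region": {"x": 320, "y": 140, "width": 180, "height": 48},  # marked button
    "instruction": "Move this button 24px to the left.",
    "current_code": "<button style='margin-left: 48px'>Buy</button>",
}

def build_prompt(req: dict) -> str:
    """Flatten the edit request into a single instruction string for the model."""
    r = req["region"]
    return (f"In region ({r['x']},{r['y']},{r['width']}x{r['height']}) "
            f"of the screenshot: {req['instruction']}\n"
            f"Code: {req['current_code']}")

print(build_prompt(edit_request))
```

The essential point is that the region coordinates ground the instruction in pixels, so the model can associate "this button" with a specific element in the returned code.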
Fourth, multimodal document understanding at long context. GLM-4.6V can read multi-document inputs up to the 128K token context limit by treating pages as images. The research team reports a case where the model processes financial reports from four public companies, extracts core metrics, and builds a comparison table, and a case where it summarizes a full soccer match while retaining the ability to answer questions about specific goals and timestamps.
Architecture, data, and reinforcement learning
The GLM-4.6V models belong to the GLM-V family and build on the technical report for GLM-4.5V and GLM-4.1V-Thinking. The research team highlights three main technical elements.
First, long-sequence modeling. GLM-4.6V extends the training context window to 128K tokens and runs continual pre-training on large long-context image-text corpora. It uses compression-alignment ideas from Glyph so that visual tokens can carry dense information that is aligned with language tokens.
Second, world knowledge enhancement. The Zhipu AI team adds a billion-scale multimodal perception and world knowledge dataset at pre-training time. This covers layered encyclopedic concepts and everyday visual entities. The stated goal is to improve both basic perception and cross-modal question answering completeness, not only benchmarks.
Third, agentic data synthesis and extended MCP. The research team generates large synthetic traces in which the model calls tools, processes visual outputs, and iterates on plans. They extend MCP with URL-based multimodal handling and an interleaved output mechanism. The generation stack follows a Draft, Image Selection, Final Polish sequence. The model can autonomously call cropping or search tools between these stages to place images at the right positions in the output.
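The Draft, Image Selection, Final Polish sequence can be sketched as a simple staged pipeline. All stage bodies below are placeholder stubs; the real system interleaves model calls with autonomous tool calls between stages:

```python
# Staged interleaved-generation pipeline: draft text first, select or crop
# images next, then polish the article with figures placed inline.
def draft(topic: str) -> str:
    """Stage 1: produce a text draft with figure placeholders."""
    return f"Draft article about {topic} with [FIGURE] placeholders."

def select_images(draft_text: str) -> list[str]:
    """Stage 2: stands in for autonomous cropping/search tool calls."""
    return ["https://example.com/fig1.png"]

def final_polish(draft_text: str, images: list[str]) -> str:
    """Stage 3: place each selected image at its placeholder position."""
    out = draft_text
    for url in images:
        out = out.replace("[FIGURE]", f"<img src='{url}'>", 1)
    return out

d = draft("GLM-4.6V")
article = final_polish(d, select_images(d))
print(article)
```

Separating drafting from image placement is what lets the model audit and swap visuals between stages instead of committing to them during the first decoding pass.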
Tool invocation is part of the reinforcement learning objective. GLM-4.6V uses RL to align planning, instruction following, and format adherence in complex tool chains.
Key Takeaways
- GLM-4.6V is a 106B multimodal foundation model with a 128K token training context, and GLM-4.6V-Flash is a 9B variant optimized for local and low-latency use.
- Both models support native multimodal function calling, so tools can consume and return images, video frames, and document pages directly, which links visual perception to executable actions for agents.
- GLM-4.6V is trained for long-context multimodal understanding and interleaved generation, so it can read large mixed document sets and emit structured text with inline figures and tool-selected images in a single pass.
- The series achieves state-of-the-art performance on major multimodal benchmarks at comparable parameter scales and is released as open-source weights under the MIT license on Hugging Face and ModelScope.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.

