GLM-4.1V-Thinking: Advancing General-Purpose Multimodal Understanding and Reasoning

By NextTech, July 18, 2025

Vision-language models (VLMs) play a crucial role in today's intelligent systems by enabling a detailed understanding of visual content. Multimodal intelligence tasks have grown more complex, ranging from scientific problem-solving to the development of autonomous agents. Demands on VLMs now go far beyond simple visual perception, with increasing attention on advanced reasoning. Recent work shows that long-form reasoning and scalable reinforcement learning (RL) significantly enhance the problem-solving abilities of LLMs, but efforts to improve VLM reasoning have mainly focused on specific domains. The open-source community currently lacks a multimodal reasoning model that outperforms traditional non-thinking models of comparable parameter scale across diverse tasks.

Researchers from Zhipu AI and Tsinghua University have proposed GLM-4.1V-Thinking, a VLM designed to advance general-purpose multimodal understanding and reasoning. The approach introduces Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the model's full potential, enabling improvements across STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. The researchers open-sourced GLM-4.1V-9B-Thinking, which sets a new benchmark among similarly sized models. It also delivers competitive, and in some cases superior, performance compared to proprietary models like GPT-4o on challenging tasks such as long document understanding and STEM reasoning.
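
To make the idea of curriculum sampling concrete, here is a minimal sketch. It is not the authors' RLCS implementation. It assumes each training problem has an estimated pass rate under the current policy and weights problems by p(1 - p), so problems the model always or never solves are skipped. The function names, the weighting rule, and the problem IDs are illustrative assumptions.

```python
import random


def curriculum_weights(pass_rates):
    """Weight each problem by p * (1 - p), where p is its estimated pass rate.

    Assumption for illustration: the weight peaks at p = 0.5 and drops to
    zero for problems that are always solved (p = 1) or never solved (p = 0).
    """
    return {pid: p * (1.0 - p) for pid, p in pass_rates.items()}


def sample_batch(pass_rates, batch_size, rng):
    """Draw a training batch, favoring problems of intermediate difficulty."""
    weights = curriculum_weights(pass_rates)
    candidates = [pid for pid, w in weights.items() if w > 0.0]
    if not candidates:
        return []  # nothing informative left to train on
    return rng.choices(
        candidates,
        weights=[weights[pid] for pid in candidates],
        k=batch_size,
    )


# Hypothetical pass rates measured on rollouts from the current policy.
rates = {"easy": 1.0, "medium": 0.5, "hard": 0.0}
batch = sample_batch(rates, batch_size=4, rng=random.Random(0))
```

With these example rates only the "medium" problem carries non-zero weight, so the whole batch is drawn from it. In practice the pass rates would be re-estimated as training progresses.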


GLM-4.1V-Thinking contains three core components: a vision encoder, an MLP adapter, and an LLM decoder. It uses AIMv2-Huge as the vision encoder and GLM as the LLM, replacing the original 2D convolutions with 3D convolutions for temporal downsampling. The model integrates 2D-RoPE to support arbitrary image resolutions and aspect ratios, allowing it to process extreme aspect ratios over 200:1 and high resolutions beyond 4K. The researchers extend RoPE to 3D-RoPE in the LLM to improve spatial understanding in multimodal contexts. For temporal modeling in videos, time index tokens are added after each frame token, with timestamps encoded as strings to help the model understand real-world temporal gaps between frames.
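
The time index scheme described above can be pictured with a small sketch. It assumes each frame has already been turned into a list of placeholder tokens and that timestamps are given in seconds. The function name and the one-decimal string format are hypothetical, since the article does not specify the exact encoding.

```python
def interleave_time_tokens(frame_tokens, timestamps_s):
    """Append a timestamp string after each frame's tokens.

    frame_tokens: list of per-frame token lists (placeholders here).
    timestamps_s: capture time of each frame in seconds.
    The "{t:.1f}" formatting is an assumption made for illustration.
    """
    if len(frame_tokens) != len(timestamps_s):
        raise ValueError("need exactly one timestamp per frame")
    sequence = []
    for tokens, t in zip(frame_tokens, timestamps_s):
        sequence.extend(tokens)        # visual tokens for this frame
        sequence.append(f"{t:.1f}")    # time index token as a string
    return sequence


# Two frames sampled 2.5 seconds apart.
frames = [["f0_a", "f0_b"], ["f1_a", "f1_b"]]
sequence = interleave_time_tokens(frames, [0.0, 2.5])
```

Because the timestamps are real clock values and not frame counters, unevenly sampled frames still expose the true gap between them to the model.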

During pre-training, the researchers use a variety of datasets, combining large academic corpora with knowledge-rich interleaved image-text data. Including pure text data preserves the model's core language capabilities, resulting in better pass@k performance than other state-of-the-art pre-trained base models of similar size. The supervised fine-tuning stage transforms the base VLM into one capable of long chain-of-thought (CoT) inference, using a curated long-CoT corpus that spans verifiable tasks, such as STEM problems, and non-verifiable tasks, such as instruction following. Finally, the RL phase employs a combination of RLVR and RLHF to conduct large-scale training across all multimodal domains, including STEM problem solving, grounding, optical character recognition, GUI agents, and many more.
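
The pass@k metric mentioned above is commonly computed with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below assumes that standard estimator. The article does not say which variant the authors used, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples generated for the problem
    c: number of those samples that are correct
    k: number of samples the metric allows (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-subsets that contain no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 correct: chance that a single draw is correct.
score = pass_at_k(n=10, c=3, k=1)
```

For a benchmark, the per-problem estimates are typically averaged over all problems.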

GLM-4.1V-9B-Thinking outperforms all competing open-source models under 10B parameters on General VQA tasks covering both single-image and multi-image settings. It achieves the highest performance on challenging STEM benchmarks, including MMMU_Val, MMMU_Pro, VideoMMMU, and AI2D. In the OCR and Chart domains, the model sets new state-of-the-art scores on ChartQAPro and ChartMuseum. For Long Document Understanding, GLM-4.1V-9B-Thinking outperforms all other models on MMLongBench, and it establishes new state-of-the-art results on GUI Agent and multimodal Coding tasks. Lastly, the model shows robust Video Understanding performance, outperforming comparable models on the VideoMME, MMVU, and MotionBench benchmarks.


In conclusion, the researchers introduced GLM-4.1V-Thinking, which represents a step toward general-purpose multimodal reasoning. Its 9B-parameter model outperforms some larger models, including one with more than 70B parameters. However, several limitations remain, such as inconsistent improvements in reasoning quality through RL, instability during training, and difficulties with complex cases. Future work should focus on improving the supervision and evaluation of model reasoning, with reward models that evaluate intermediate reasoning steps while detecting hallucinations and logical inconsistencies. Moreover, exploring strategies to prevent reward hacking in subjective evaluation tasks is crucial to achieving general-purpose intelligence.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


Sajjad Ansari is a final-year undergraduate from IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
