How can we teach AI agents to reliably find and click on the exact on-screen element we mean when we give them a simple instruction? A team of researchers from ML Foundations has released Gelato-30B-A3B, a state-of-the-art grounding model for graphical user interfaces that is designed to plug into computer use agents and convert natural language instructions into reliable click locations. The model is trained on the Click 100k dataset and reaches 63.88% accuracy on ScreenSpot-Pro and 69.15% on OS-World-G, with 74.65% on OS-World-G Refined. It surpasses GTA1-32B and larger vision language models such as Qwen3-VL-235B-A22B-Instruct.

What Gelato-30B-A3B Does In An Agent Stack
Gelato-30B-A3B is a 31B parameter mixture of experts model fine-tuned from Qwen3-VL-30B-A3B-Instruct. It takes a screenshot and a textual instruction as input and produces a single click coordinate as output.
The model is positioned as a modular grounding component. A planner model, for example GPT-5 in the Gelato experiments, decides the next high-level action and calls Gelato to resolve that step into a concrete click on the screen. This separation between planning and grounding matters when an agent must operate across many operating systems and applications with different layouts.


Click 100k, A Targeted Dataset For GUI Grounding
Click 100k is the dataset that underlies Gelato. It pairs computer screen images with natural language instructions, bounding boxes for the target element, image dimensions, and normalized bounding boxes. Each sample is set up as a low-level command, for example 'tap on the element between Background and Notifications options', paired with a precise target region.
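To make the record layout concrete, here is a minimal sketch of one sample. The class name, field names, file path, and box values are all hypothetical; the article only states that each sample carries an image, an instruction, a bounding box, the image dimensions, and a normalized bounding box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundingSample:
    """One record in a Click 100k style schema (field names are hypothetical)."""

    image_path: str
    instruction: str
    bbox: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    image_width: int
    image_height: int

    @property
    def normalized_bbox(self) -> tuple[float, float, float, float]:
        """Scale the pixel box into the [0, 1] range using the image dimensions."""
        x0, y0, x1, y1 = self.bbox
        w, h = self.image_width, self.image_height
        return (x0 / w, y0 / h, x1 / w, y1 / h)


sample = GroundingSample(
    image_path="screens/settings.png",  # hypothetical path
    instruction="tap on the element between Background and Notifications options",
    bbox=(480.0, 270.0, 960.0, 324.0),  # made-up box for illustration
    image_width=1920,
    image_height=1080,
)
print(sample.normalized_bbox)  # (0.25, 0.25, 0.5, 0.3)
```

Storing normalized boxes next to the pixel boxes lets samples from sources with different screen resolutions be compared and mixed without rescaling images first.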
The dataset is built by filtering and unifying multiple public sources. The list includes ShowUI, AutoGUI, PC Agent E, WaveUI, OS Atlas, UGround, PixMo Points, SeeClick, UI VISION, a JEDI subset that focuses on spreadsheet and text cell manipulation, and videos from 85 professional application tutorials annotated with Claude-4-Sonnet. Each source contributes at most 50k samples, and all sources are mapped into a shared schema with images, instructions, bounding boxes, and normalized coordinates.
The research team then runs an aggressive filtering pipeline. OmniParser discards clicks that do not land on detected interface elements. Qwen2.5-7B-VL and SE-GUI-3B remove trivial examples, such as easy hyperlink clicks. GTA1-7B-2507 and UI-Venus-7B remove samples where the instruction and click region do not match. A Qwen2.5-7B-VL baseline trained on a balanced 10k subset shows that this combination gives a +9 pp accuracy gain on ScreenSpot-Pro compared with training on unfiltered data.
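The three filtering stages can be pictured as a chain of keep-or-drop checks. The sketch below is an illustration only: the record keys and the three check functions are hypothetical stand-ins that read precomputed flags, whereas the real pipeline calls OmniParser and the grounding models named above, and the article does not spell out their exact decision rules.

```python
from typing import Callable, Iterable

Sample = dict  # hypothetical record with "click", "detected_boxes", and model flags
Check = Callable[[Sample], bool]  # returns True when the sample should be kept


def run_filters(samples: Iterable[Sample], checks: list[tuple[str, Check]]):
    """Apply each check in order; return kept samples and per-stage drop counts."""
    kept = list(samples)
    dropped = {}
    for name, check in checks:
        survivors = [s for s in kept if check(s)]
        dropped[name] = len(kept) - len(survivors)
        kept = survivors
    return kept, dropped


def lands_on_element(s: Sample) -> bool:
    """Stand-in for the detector stage: the click must fall inside a detected box."""
    x, y = s["click"]
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in s["detected_boxes"])


def is_not_trivial(s: Sample) -> bool:
    """Stand-in for the difficulty stage: drop samples that weak grounders already solve."""
    return not s["solved_by_weak_models"]


def is_aligned(s: Sample) -> bool:
    """Stand-in for the alignment stage: strong grounders must agree with the label."""
    return s["strong_models_agree"]


box = [(0, 0, 20, 20)]
samples = [
    {"id": "a", "click": (10, 10), "detected_boxes": box, "solved_by_weak_models": False, "strong_models_agree": True},
    {"id": "b", "click": (50, 50), "detected_boxes": box, "solved_by_weak_models": False, "strong_models_agree": True},
    {"id": "c", "click": (5, 5), "detected_boxes": box, "solved_by_weak_models": True, "strong_models_agree": True},
    {"id": "d", "click": (15, 15), "detected_boxes": box, "solved_by_weak_models": False, "strong_models_agree": False},
]
kept, dropped = run_filters(
    samples,
    [("on_element", lands_on_element), ("non_trivial", is_not_trivial), ("aligned", is_aligned)],
)
print([s["id"] for s in kept], dropped)
```

Recording how many samples each stage drops makes it easier to see which filter is responsible for most of the reduction.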
Professional application coverage is a specific focus. Click 100k adds data from UI VISION and the JEDI subset, and then augments this with 80+ tutorial videos for real desktop tools. Claude 4 Sonnet generates bounding boxes and low-level instructions for these videos, followed by manual inspection and corrections.


GRPO Training On Top Of Qwen3 VL
On the training side, Gelato-30B-A3B uses GRPO, a reinforcement learning algorithm that derives from work on DeepSeekMath and similar systems. The research team follows the DAPO setup. They remove the KL divergence term from the objective, set the clip-higher threshold to 0.28, and skip rollouts with zero advantage. Rewards are sparse and are only given when the predicted click falls inside the target bounding box, similar to the GTA1 recipe.
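The pieces of this recipe can be sketched in plain Python. This is not the team's training code: the function names are hypothetical, the advantage is the standard group-standardized form, and the lower clip bound of 0.2 is an assumption (the article only gives the upper value of 0.28).

```python
import math


def click_reward(click, bbox):
    """Sparse reward: 1.0 if the predicted click is inside the target box, else 0.0."""
    x, y = click
    x0, y0, x1, y1 = bbox
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0


def group_advantages(rewards):
    """Group-relative advantages: standardize rewards within one prompt's rollouts.

    Returns None when all rewards are equal, because every advantage would be
    zero and the group carries no learning signal (such groups are skipped).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    if var == 0.0:
        return None
    std = math.sqrt(var)
    return [(r - mean) / std for r in rewards]


def clipped_objective(ratio, advantage, clip_low=0.2, clip_high=0.28):
    """PPO-style clipped surrogate with an asymmetric range and no KL term.

    clip_high=0.28 comes from the article; clip_low=0.2 is an assumption.
    """
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)


bbox = (100, 100, 200, 150)  # made-up target box
clicks = [(150, 120), (90, 120), (199, 149), (300, 300)]  # four rollouts for one prompt
rewards = [click_reward(c, bbox) for c in clicks]
advantages = group_advantages(rewards)
print(rewards)     # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # [1.0, -1.0, 1.0, -1.0]
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # None, so this group is skipped
```

With a binary reward, a group where every rollout hits or every rollout misses yields identical rewards, which is why skipping zero-advantage groups avoids spending updates on prompts that are already solved or currently out of reach.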


They initialize from Qwen3-VL-30B-A3B-Instruct and run 100 RL steps on 32 A100 GPUs with 40 GB of memory each. The best checkpoint appears at step 84 (marked with a green cross in the accompanying figure), chosen by mean performance across ScreenSpot-Pro, OS-World-G, and OS-World-G Refined. At this point the model reaches 63.88% on ScreenSpot-Pro, 67.19% on OS-World-G, and 73.40% on OS-World-G Refined. A simple refusal prompting strategy, which appends an instruction to answer with a refusal when the element cannot be found, raises the OS-World-G and OS-World-G Refined scores to 69.15% and 74.65%.
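Selecting a checkpoint by mean benchmark score is simple to express. In the sketch below, the function name and dictionary keys are hypothetical, the step 84 row uses the scores reported above, and the other two rows are placeholder values included only so the selection has something to compare against; they are not reported results.

```python
from statistics import mean


def select_checkpoint(scores_by_step):
    """Return (step, mean_score) for the step with the highest mean benchmark score."""
    best_step = max(scores_by_step, key=lambda step: mean(scores_by_step[step].values()))
    return best_step, mean(scores_by_step[best_step].values())


scores_by_step = {
    # Placeholder rows, NOT reported results.
    60: {"screenspot_pro": 60.0, "osworld_g": 65.0, "osworld_g_refined": 70.0},
    100: {"screenspot_pro": 61.0, "osworld_g": 66.0, "osworld_g_refined": 71.0},
    # Scores reported in the article for step 84.
    84: {"screenspot_pro": 63.88, "osworld_g": 67.19, "osworld_g_refined": 73.40},
}
best_step, best_mean = select_checkpoint(scores_by_step)
print(best_step, round(best_mean, 2))  # 84 68.16
```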
End To End Agent Results On OS World
To test Gelato beyond static grounding benchmarks, the research team plugs it into the GTA1.5 agent framework and runs full computer use agents on the OS World environment. In this setup GPT-5 acts as the planner and Gelato-30B-A3B provides grounding. The agent has at most 50 steps and waits 3 seconds between actions.
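The planner and grounder split, together with the step budget and wait time, can be sketched as a small loop. This is not the GTA1.5 framework: `run_episode` and its callable parameters are hypothetical, and the loop handles only clicks, while a real computer use agent also types, scrolls, and so on.

```python
import time
from typing import Callable, Optional

Point = tuple[int, int]


def run_episode(
    plan: Callable[[bytes], Optional[str]],  # planner: screenshot -> instruction, None when done
    ground: Callable[[bytes, str], Point],   # grounder: screenshot + instruction -> click point
    screenshot: Callable[[], bytes],
    click: Callable[[Point], None],
    max_steps: int = 50,
    wait_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run a plan-then-ground loop and return the number of clicks executed."""
    for step in range(max_steps):
        image = screenshot()
        instruction = plan(image)
        if instruction is None:  # planner signals that the task is finished
            return step
        click(ground(image, instruction))
        sleep(wait_seconds)  # let the UI settle before the next screenshot
    return max_steps


# Stub components so the sketch runs without a real planner, grounder, or desktop.
pending = ["open Settings", "click Display", "toggle Night Light"]
clicks: list[Point] = []
steps = run_episode(
    plan=lambda image: pending.pop(0) if pending else None,
    ground=lambda image, instruction: (len(instruction), 0),
    screenshot=lambda: b"",
    click=clicks.append,
    sleep=lambda seconds: None,  # skip the real 3 second wait in this demo
)
print(steps, clicks)  # 3 [(13, 0), (13, 0), (18, 0)]
```

Because the grounder is just a callable here, swapping one grounding model for another leaves the planner and the loop unchanged, which is the kind of like-for-like comparison the shared harness allows.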
The research team reports three runs per model on a fixed OS World snapshot. Gelato-30B-A3B reaches a 58.71% automated success rate with a small standard deviation, compared with 56.97% for GTA1-32B in the same harness. Because the automated OS World evaluation misses some valid solutions, they also run human evaluation on 20 problematic tasks. Under human scoring, Gelato reaches 61.85% success, while GTA1-32B reaches 59.47%.
Key Takeaways
- Gelato-30B-A3B is a mixture of experts model based on Qwen3-VL-30B-A3B-Instruct that achieves state-of-the-art GUI grounding on the ScreenSpot-Pro and OS-World-G benchmarks, surpassing GTA1-32B and larger VLMs such as Qwen3-VL-235B-A22B-Instruct.
- The model is trained on Click 100k, a curated grounding dataset that merges and filters multiple public GUI datasets and professional application traces, pairing real screens with low-level natural language commands and precise click coordinates.
- Gelato-30B-A3B uses a GRPO reinforcement learning recipe on top of Qwen3-VL, with sparse rewards that only trigger when the predicted click lies inside the ground truth bounding box, which significantly boosts grounding accuracy over supervised baselines.
- When integrated into an agent framework with GPT-5 acting as the planner, Gelato-30B-A3B improves success rates on OS World computer use tasks compared with GTA1-32B, demonstrating that better grounding directly translates into stronger end-to-end agent performance.
Gelato-30B-A3B is an important step for grounded computer use because it shows that a Qwen3-VL based MoE model, trained on a carefully filtered Click 100k dataset, can beat both GTA1-32B and much larger VLMs like Qwen3-VL-235B-A22B-Instruct on ScreenSpot-Pro and OS-World-G while remaining accessible through Hugging Face. Overall, Gelato-30B-A3B establishes a clear new baseline for open computer grounding models.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, illustrating its popularity among audiences.

