TL;DR: AgentFlow is a trainable agent framework with 4 modules—Planner, Executor, Verifier, Generator—coordinated by an explicit memory and toolset. The planner is optimized in the loop with a new on-policy method, Flow-GRPO, which broadcasts a trajectory-level outcome reward to every turn and applies token-level PPO-style updates with KL regularization and group-normalized advantages. On ten benchmarks, a 7B backbone tuned with Flow-GRPO reports gains of +14.9% (search), +14.0% (agentic), +14.5% (math), and +4.1% (science) over strong baselines.
What’s AgentFlow?
AgentFlow formalizes multi-turn, tool-integrated reasoning as a Markov Decision Process (MDP). At each turn, the Planner proposes a sub-goal and selects a tool plus context; the Executor calls the tool; the Verifier signals whether to continue; the Generator emits the final answer on termination. A structured, evolving memory records states, tool calls, and verification signals, constraining context growth and making trajectories auditable. Only the planner is trained; the other modules can be fixed engines.
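As a rough illustration of the turn loop described above, the sketch below wires the four modules around a shared memory. The class and method names (Memory, plan, execute, verify, generate) are illustrative assumptions rather than the repository's actual API.

```python
# Minimal sketch of an AgentFlow-style episode loop.
# All names here (Memory, planner.plan, executor.execute, verifier.verify,
# generator.generate) are illustrative assumptions, not the repo's real API.

from dataclasses import dataclass, field


@dataclass
class Memory:
    """Structured, evolving record of the trajectory (states, tool calls, verdicts)."""
    records: list = field(default_factory=list)

    def add(self, turn, sub_goal, tool, result, verdict):
        self.records.append(
            {"turn": turn, "sub_goal": sub_goal, "tool": tool,
             "result": result, "verdict": verdict}
        )


def run_episode(query, planner, executor, verifier, generator, max_turns=10):
    memory = Memory()
    for turn in range(max_turns):
        # Planner (the only trained module): propose a sub-goal, pick a tool and context.
        sub_goal, tool, context = planner.plan(query, memory)
        # Executor: call the selected tool.
        result = executor.execute(tool, sub_goal, context)
        # Verifier: signal whether the trajectory should continue or stop.
        verdict = verifier.verify(query, memory, result)
        memory.add(turn, sub_goal, tool, result, verdict)
        if verdict == "stop":
            break
    # Generator: emit the final answer from the accumulated memory.
    return generator.generate(query, memory)
```

Because every turn is appended to the same structured memory, the resulting trajectory stays auditable and the planner's context grows in a controlled way.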
The public implementation provides a modular toolkit (e.g., base_generator, python_coder, google_search, wikipedia_search, web_search) and ships quick-start scripts for inference, training, and benchmarking. The repository is MIT-licensed.

Training method: Flow-GRPO
Flow-GRPO (Flow-based Group Refined Policy Optimization) converts long-horizon, sparse-reward optimization into tractable single-turn updates (a minimal loss sketch follows this list):
- Final-outcome reward broadcast: a single, verifiable trajectory-level signal (LLM-as-judge correctness) is assigned to every turn, aligning local planning steps with global success.
- Token-level clipped objective: importance-weighted ratios are computed per token, with PPO-style clipping and a KL penalty toward a reference policy to prevent drift.
- Group-normalized advantages: variance reduction across groups of on-policy rollouts stabilizes updates.
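To make the objective concrete, here is a compact sketch of how a Flow-GRPO-style loss could combine the broadcast outcome reward, token-level clipping, and group normalization. The tensor shapes, hyperparameter values, and the simple KL estimator are assumptions for illustration; this is not the authors' implementation.

```python
import torch


def flow_grpo_loss(logp_new, logp_old, logp_ref, outcome_rewards,
                   clip_eps=0.2, kl_beta=0.01):
    """Illustrative Flow-GRPO-style objective (not the authors' code).

    logp_new, logp_old, logp_ref: [G, T] per-token log-probs for a group of G
        on-policy rollouts of the same query under the current, behavior, and
        reference policies.
    outcome_rewards: [G] one verifiable trajectory-level reward per rollout
        (e.g., LLM-as-judge correctness), broadcast to every turn/token.
    """
    # Group-normalized advantages: normalize the final-outcome reward across the
    # group of rollouts, then broadcast the scalar to every token of its rollout.
    adv = (outcome_rewards - outcome_rewards.mean()) / (outcome_rewards.std() + 1e-8)
    adv = adv.unsqueeze(-1).expand_as(logp_new)          # [G, T]

    # Token-level importance ratios with PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()

    # Simple KL estimate against the reference policy to prevent drift.
    kl = (logp_new - logp_ref).mean()

    return policy_loss + kl_beta * kl
```

The key move is that the single outcome reward is shared by every turn of the trajectory, so each turn can be updated as if it were a single-turn RL problem while still being credited against the final result.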


Understanding the results and benchmarks
Benchmarks. The research team evaluates four task types: knowledge-intensive search (Bamboogle, 2Wiki, HotpotQA, Musique), agentic reasoning (GAIA textual split), math (AIME-24, AMC-23, Game of 24), and science (GPQA, MedQA). GAIA is a tooling-oriented benchmark for general assistants; the textual split excludes multimodal requirements.
Main numbers (7B backbone after Flow-GRPO). Average gains over strong baselines: +14.9% (search), +14.0% (agentic), +14.5% (math), +4.1% (science). The research team states that their 7B system surpasses GPT-4o on the reported suite. The project page also reports training effects such as improved planning quality, reduced tool-calling errors (by up to 28.4% on GAIA), and positive trends with larger turn budgets and model scale.
Ablations. Online Flow-GRPO improves performance by +17.2% over a frozen-planner baseline, while offline supervised fine-tuning of the planner degrades performance by −19.0% on their composite metric.


Key Takeaways
- Modular agent, planner-only training. AgentFlow structures an agent into Planner–Executor–Verifier–Generator with an explicit memory; only the Planner is trained in-loop.
- Flow-GRPO converts long-horizon RL into single-turn updates. A trajectory-level outcome reward is broadcast to every turn; updates use token-level PPO-style clipping with KL regularization and group-normalized advantages.
- Research-team-reported gains on 10 benchmarks. With a 7B backbone, AgentFlow reports average improvements of +14.9% (search), +14.0% (agentic/GAIA textual), +14.5% (math), and +4.1% (science) over strong baselines, and states that it surpasses GPT-4o on the same suite.
- Tool-use reliability improves. The research team reports reduced tool-calling errors (e.g., on GAIA) and better planning quality under larger turn budgets and model scale.
AgentFlow formalizes tool-using agents into four modules (planner, executor, verifier, generator) and trains only the planner in-loop via Flow-GRPO, which broadcasts a single trajectory-level reward to every turn with token-level PPO-style updates and KL control. Reported results on ten benchmarks show average gains of +14.9% (search), +14.0% (agentic/GAIA textual split), +14.5% (math), and +4.1% (science); the research team additionally states that the 7B system surpasses GPT-4o on this suite. The implementation, tools, and quick-start scripts are MIT-licensed in the GitHub repo.
Check out the Technical Paper, GitHub Page, and Project Page.

