StepFun has released Step-DeepResearch, a 32B-parameter end-to-end deep research agent that aims to turn web search into genuine research workflows with long-horizon reasoning, tool use, and structured reporting. The model is built on Qwen2.5-32B-Base and is trained to act as a single agent that plans, explores sources, verifies evidence, and writes reports with citations, while keeping inference cost low.
From Search to Deep Research
Most current web agents are tuned for multi-hop question-answering benchmarks. They try to match ground-truth answers for short questions, which is closer to targeted retrieval than to real research. Deep research tasks are different: they involve latent intent recognition, long-horizon decision making, multi-turn tool use, structured reasoning, and cross-source verification under uncertainty.
Step-DeepResearch reframes this as sequential decision making over a compact set of atomic capabilities. The research team defines four atomic capabilities: planning and task decomposition, deep information seeking, reflection and verification, and professional report generation. Instead of orchestrating many external agents, the system internalizes this loop into a single model that decides the next action at each step.
Data Synthesis around Atomic Capabilities
To teach these atomic capabilities, the research team builds separate data pipelines for each skill. For planning, they start from high-quality technical reports, survey papers, and financial analysis documents. They reverse-engineer realistic research plans and task trees from titles, abstracts, and document structure, then generate trajectories that follow these plans. This exposes the model to long-horizon project structures, not only short question templates.
For deep information seeking, they construct graph-based queries over knowledge graphs such as Wikidata5m and CN-DBpedia. They sample subgraphs, expand them using search, and synthesize questions that require multi-hop reasoning across entities and documents. A separate pipeline uses a Wiki-style link index to force cross-document retrieval and aggregation of evidence. Easy questions that a strong model can already solve with a simple ReAct-style strategy are filtered out, so training focuses on hard search problems.
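A minimal sketch of that difficulty filter is below. The helper names, the step budget, and the retry count are illustrative assumptions rather than the paper's implementation; the idea is simply that a synthesized multi-hop question is kept for training only if a baseline ReAct-style agent fails to answer it.

```python
# Hypothetical difficulty filter for synthesized multi-hop questions.
# `baseline_react_solve` stands in for an existing strong model running a
# simple ReAct loop over a search tool; it is assumed, not provided here.

def baseline_react_solve(question: str, gold_answer: str, max_steps: int = 6) -> bool:
    """Assumed wrapper: run a plain ReAct loop with a search tool and
    return True if its final answer matches the gold answer."""
    ...

def filter_hard_questions(candidates, n_attempts: int = 3):
    """Keep only questions the baseline agent never solves across a few attempts."""
    hard = []
    for question, gold in candidates:
        if not any(baseline_react_solve(question, gold) for _ in range(n_attempts)):
            hard.append((question, gold))
    return hard
```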
Reflection and verification data is generated through self-correction loops and multi-agent teacher traces. Teacher agents extract claims, plan checks, verify facts, replan if inconsistencies appear, and only then write reports. The resulting trajectories are cleaned and used as supervision for a single student agent. Report generation is trained in two phases: mid-training for domain style and depth using query-report pairs, then supervised fine-tuning with strict formatting and plan-consistency constraints.
Progressive Training on Qwen2.5-32B-Base
The training pipeline has three stages: agentic mid-training, supervised fine-tuning, and reinforcement learning. In mid-training stage 1, the team injects atomic capabilities without tools, using context lengths up to 32k tokens. The data covers active learning, synthetic reasoning traces, summarization, and reflection. The research team reports steady gains on SimpleQA, TriviaQA, and FRAMES as training scales up to about 150B tokens, with the largest gains on FRAMES, which stresses structured reasoning.
In stage 2, the context extends to 128k tokens and explicit tool calls are introduced. The model learns tasks such as URL-based question answering, deep web search, long-document summarization, and long-dialogue reasoning. This stage aligns the model with real research scenarios where search, browsing, and analysis must be mixed in a single trajectory.
During supervised fine-tuning, the four atomic capabilities are composed into full deep search and deep research traces. Data cleaning keeps trajectories that are correct and short in terms of steps and tool calls. The pipeline injects controlled tool errors followed by corrections to improve robustness, and enforces citation formats so that reports stay grounded in the retrieved sources.
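One plausible way to realize the controlled tool-error injection is sketched below. The trajectory format and error message are assumptions for illustration: a tool call is duplicated, the first copy is paired with a synthetic error observation, and the original successful call follows as the correction the model should learn to produce.

```python
# Illustrative (not the paper's code) injection of a controlled tool error
# into an SFT trajectory represented as a list of step dicts.

import copy
import random

def inject_tool_error(trajectory, error_msg="ToolError: request timed out"):
    """trajectory: list of steps like {"role": "tool_call" | "observation" | ..., "content": ...}"""
    call_indices = [i for i, step in enumerate(trajectory) if step["role"] == "tool_call"]
    if not call_indices:
        return trajectory
    i = random.choice(call_indices)
    failed_call = copy.deepcopy(trajectory[i])
    failed_obs = {"role": "observation", "content": error_msg}
    # Place the failed attempt and its error observation right before the
    # real (successful) call, which now acts as the correction.
    return trajectory[:i] + [failed_call, failed_obs] + trajectory[i:]
```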
Reinforcement learning then optimizes the agent in a real tool environment. The research team builds tasks and checklists through reverse synthesis, and trains a checklist-style Rubrics Judge to score reports along fine-grained dimensions. The reward design converts ternary rubric labels into asymmetric binary rewards that capture both positive goals and violations. The policy is trained with PPO and a learned critic, using generalized advantage estimation with near-zero discounting so that long trajectories are not effectively truncated.
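One possible reading of that reward mapping is sketched below, under stated assumptions: each rubric item is judged as satisfied, not addressed, or violated, each ternary label is split into two binary signals (goal achieved, constraint violated), and the two are combined asymmetrically so violations cost more than omissions. The label names and weights are illustrative, not StepFun's published values.

```python
# Hypothetical mapping from ternary rubric labels to an asymmetric reward.

from enum import Enum

class RubricLabel(Enum):
    SATISFIED = 1      # positive goal met
    NOT_ADDRESSED = 0  # item missing from the report
    VIOLATED = -1      # report contradicts the rubric item

def to_binary_signals(label):
    """Split a ternary label into two binary signals: achieved, violated."""
    achieved = 1.0 if label is RubricLabel.SATISFIED else 0.0
    violated = 1.0 if label is RubricLabel.VIOLATED else 0.0
    return achieved, violated

def rubric_reward(labels, w_pos=1.0, w_neg=2.0):
    """Asymmetric combination: violations weigh more than achievements (assumed weights)."""
    if not labels:
        return 0.0
    pairs = [to_binary_signals(label) for label in labels]
    pos = sum(a for a, _ in pairs) / len(labels)
    neg = sum(v for _, v in pairs) / len(labels)
    return w_pos * pos - w_neg * neg
```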
Single-Agent ReAct Architecture and Search Stack
At inference time, Step-DeepResearch runs as a single ReAct-style agent that alternates thinking, tool calls, and observations until it decides to output a report. The tool set includes batch web search, a todo manager, shell commands, and file operations. Execution runs in a sandbox with terminal persistence through tmux. A perception-oriented browser reduces redundant page captures by using perceptual hash distance. Tools for document parsing, audio transcription, and image analysis support multimodal inputs.
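For intuition, here is a minimal sketch of perceptual-hash deduplication of page captures, using the third-party `Pillow` and `imagehash` packages; the Hamming-distance threshold is an assumption, not a value from the paper.

```python
# Skip screenshots that are perceptually near-duplicates of ones already captured.

from PIL import Image
import imagehash

_seen_hashes = []
HAMMING_THRESHOLD = 5  # assumed: small distance => near-duplicate page capture

def is_redundant_capture(screenshot_path: str) -> bool:
    """Return True if this screenshot is perceptually close to a previous capture."""
    h = imagehash.phash(Image.open(screenshot_path))
    for prev in _seen_hashes:
        if h - prev <= HAMMING_THRESHOLD:  # imagehash defines '-' as Hamming distance
            return True
    _seen_hashes.append(h)
    return False
```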
Data acquisition uses two related resources. StepFun states that its Search API is grounded in more than 20M high-quality papers and 600 premium indices. The research team also describes a curated authority-indexing strategy that isolates more than 600 trusted domains, including government, academic, and institutional sites. Retrieval operates at the paragraph level and uses authority-aware ranking so that high-trust domains are preferred when relevance is comparable.
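A minimal sketch of authority-aware ranking under stated assumptions follows: relevance scores come from an existing retriever, each result's domain is checked against a trusted list, and the authority bonus only breaks ties between results whose relevance is comparable. The example domains, margin, and bonus are illustrative, not StepFun's.

```python
# Hypothetical authority-aware re-ranking of paragraph-level retrieval results.

TRUSTED_DOMAINS = {"nature.com", "arxiv.org", "who.int"}  # stand-in for the 600+ domain list

def rank_paragraphs(results, relevance_margin=0.05, authority_bonus=0.1):
    """results: list of dicts with 'text', 'domain', and 'relevance' in [0, 1]."""
    def key(r):
        bonus = authority_bonus if r["domain"] in TRUSTED_DOMAINS else 0.0
        # Quantize relevance so the bonus only matters within a comparable-relevance bucket.
        bucket = round(r["relevance"] / relevance_margin)
        return (bucket, bonus, r["relevance"])
    return sorted(results, key=key, reverse=True)
```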
The file tools support patch-based editing, so the agent can update only the changed sections of a report. A summary-aware storage scheme writes full tool outputs to local files and injects only compact summaries into the context. This acts as external memory and avoids context overflow for long projects.
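The summary-aware storage idea can be sketched as below; the directory layout and naive truncation-based summarizer are assumptions for illustration, whereas the real system presumably generates model-written summaries.

```python
# Illustrative external-memory helper: persist the full tool output to disk,
# return only a compact message for the agent's context.

import hashlib
from pathlib import Path

STORE = Path("tool_outputs")
STORE.mkdir(exist_ok=True)

def store_tool_output(tool_name: str, output: str, max_summary_chars: int = 500) -> str:
    """Write the full output to a local file and return a short context message."""
    digest = hashlib.sha1(output.encode("utf-8")).hexdigest()[:12]
    path = STORE / f"{tool_name}_{digest}.txt"
    path.write_text(output, encoding="utf-8")
    summary = output[:max_summary_chars]
    return f"[{tool_name} output saved to {path}; showing first {len(summary)} chars]\n{summary}"
```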
Evaluation, Cost and Access
To measure deep research behavior, the team introduces ADR-Bench, a Chinese-language benchmark with 110 open-ended tasks across 9 domains. 70 tasks cover general domains such as education, science and engineering, and social life, evaluated by expert side-by-side comparison. 40 tasks in finance and law are scored with explicit rubrics that follow atomicity and verifiability constraints.
On Scale AI Research Rubrics, Step-DeepResearch reaches 61.42 percent rubric compliance, which is comparable to OpenAI-DeepResearch and Gemini-DeepResearch, and clearly ahead of several open and proprietary baselines. On ADR-Bench, expert-based Elo rankings show that the 32B model outperforms larger open models such as MiniMax-M2, GLM-4.6, and DeepSeek-V3.2, and is competitive with systems like Kimi-Researcher and MiniMax-Agent-Pro.
Key Takeaways
- Single agent, atomic capability design: Step-DeepResearch is a 32B-parameter single agent built on Qwen2.5-32B-Base. It internalizes four atomic capabilities, planning, deep information seeking, reflection and verification, and professional report generation, instead of relying on many external agents.
- Targeted data synthesis for each skill: The research team builds separate data pipelines for planning, deep information seeking, reflection, and report writing, using reverse-engineered plans from real reports, graph-based queries over Wikidata5m and CN-DBpedia, multi-agent teacher traces, and strict report-formatting data.
- Three-stage training with long context and RL: Training uses mid-training, supervised fine-tuning, and reinforcement learning. Mid-training scales up to about 150B tokens at 32k and then 128k context, SFT composes full deep research trajectories, and PPO-based RL with a Rubrics Judge optimizes reports against fine-grained checklists.
- ReAct architecture with curated search and external memory: At inference time the model runs a ReAct loop that calls tools for batch web search, todo, shell, and file operations, uses a Search API grounded in more than 20M papers and 600 premium indices along with 600+ trusted domains, and relies on patch editing and summary-aware storage as external memory.
- Competitive quality at lower cost: On Scale AI Research Rubrics the model reaches 61.42 percent rubric compliance and is competitive with OpenAI-DeepResearch and Gemini-DeepResearch; on ADR-Bench it achieves a 67.1 percent win-or-tie rate against strong baselines.
Check out the Paper and Repo for more details.

