What does an end-to-end stack for terminal agents look like when you combine structured toolkits, synthetic RL environments, and benchmark-aligned evaluation? A team of researchers from CAMEL AI, Eigent AI, and other collaborators have released SETA, a toolkit and environment stack that focuses on reinforcement learning for terminal agents. The project targets agents that operate inside a Unix-style shell and must complete verifiable tasks under a benchmark harness such as Terminal Bench.
Three main contributions:
- A state-of-the-art terminal agent on Terminal Bench: They achieve state-of-the-art performance with a Claude Sonnet 4.5-based agent on Terminal Bench 2.0 and with a GPT-4.1-based agent on Terminal Bench 1.0. The comparison is restricted to agents that use the same base model.
- Scalable RL training with synthetic terminal environments: The research team releases an initial synthetic dataset of 400 terminal tasks that cover a wide range of difficulty levels. Of these, 260 tasks are used for RLVR finetuning of a Qwen3-8B model.
- A clean agent design that generalizes across training and evaluation frameworks: The same agent implementation is used for both local task runs and the official Terminal Bench evaluation harness.
Terminal Toolkit and log structure
The SETA code repository showcases a Terminal Toolkit that turns a language model into an executable terminal agent. For each task run, the framework creates a structured log directory under evaluation/terminal_bench_run. The README shows a concrete layout for a task called play-zork.
Key files include:
- `chatagent.log`, which records the full history of agent messages and tool calls along with test results.
- A `sessions` directory with `session_logs` that capture terminal interactions from the toolkit.
- Inside `session_logs`, files such as `blocking_commands.log`, `session_run_zork_1_correct_path.log`, `session_zork-1.log`, and `session_zork_start.log` store command output for different sessions and modes.
- `tests.log` and `tests.log.strip`, which record the test run output, with the latter removing terminal control characters.
This structure gives a concrete way to debug an agent. You can trace from high-level chat decisions in chatagent.log down to individual shell commands in the session logs, and confirm success or failure from the test logs.
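The stripped test-log variant (`tests.log.strip`) hints at a simple post-processing step: removing ANSI control sequences so logs read and diff cleanly. A minimal sketch of such a stripper, using a common escape-sequence regex rather than SETA's actual code:

```python
import re

# Matches common ANSI escape sequences (colors, cursor movement): ESC [ ... letter
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[a-zA-Z]")

def strip_control_chars(raw: str) -> str:
    """Return log text with ANSI terminal control sequences removed."""
    return ANSI_ESCAPE.sub("", raw)
```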
For official Terminal Bench evaluation, the GitHub repository provides a separate entry point under evaluation/terminal_bench_eval. A developer moves into that directory and runs run_eval.sh for Terminal Bench 1.0 or run_tb2.sh for Terminal Bench 2.0.
Results are written to evaluation/terminal_bench_eval/run/{run_id}/results.json. Task-specific session logs are placed under evaluation/terminal_bench_eval/logs/camel_logs/{task_id}. The agent class that binds the CAMEL agent to the benchmark is implemented in tbench_camel_agent.py.
Note Taking Toolkit as persistent memory
The research team also introduces a Note Taking Toolkit described as persistent memory for long-horizon tasks. They show example note-taking tool calls where the agent writes and reads notes in a structured way while solving terminal tasks. The current public material focuses on the existence of this toolkit and examples of its use; it does not yet describe a full training objective for note usage.
The important point is that the agent has an explicit channel where it can externalize intermediate results and hints, separate from the raw terminal buffer.
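Conceptually, such a toolkit only needs a pair of operations the model can call as tools. This sketch is an assumption about what the interface might look like, not CAMEL's actual Note Taking Toolkit:

```python
class NoteTakingToolkit:
    """Illustrative persistent key-value notes for a long-horizon agent."""

    def __init__(self):
        self._notes: dict[str, str] = {}

    def write_note(self, title: str, content: str) -> str:
        """Store (or overwrite) a note the agent can retrieve later."""
        self._notes[title] = content
        return f"Saved note '{title}'."

    def read_note(self, title: str) -> str:
        """Return a previously saved note, or a miss message."""
        return self._notes.get(title, f"No note titled '{title}'.")
```

Because notes live outside the terminal buffer, the agent can recover a plan or partial result even after long command outputs push earlier context out of the window.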
Understanding the performance
SETA's agent harness achieves leading results on Terminal Bench. With Claude Sonnet 4.5 as the backbone, the CAMEL terminal agent reaches 46.5% accuracy on Terminal Bench 2.0 across 89 real-world tasks, ranking first and outperforming the second system by 3 percentage points, with especially strong results on git workflows, DevOps automation, and code-security tasks. On Terminal Bench 1.0, a GPT-4.1-based agent attains 35% accuracy, 4.7 percentage points above the next entry, again within the same model family. For comparison, a supervised Qwen3-8B baseline attains 3.4% on Terminal Bench 2.0, and the Qwen3-8B terminal agent trained with the SETA RL pipeline improves over this baseline on the curated synthetic environments.
Key Takeaways
- SETA is a joint community project that provides both agent toolkits and synthetic RL environments specifically for terminal agents, aligned with the Terminal Bench evaluation format.
- The framework reports state-of-the-art performance for CAMEL terminal agents on Terminal Bench 1.0 and 2.0 when using Claude Sonnet 4.5 and GPT-4.1 as the base models, evaluated against agents built on the same model families.
- The SETA RL dataset on Hugging Face contains 400 synthetic terminal tasks, each packaged as `task.yaml`, `Dockerfile`, and `run-tests.sh`, with 260 tasks used for RLVR finetuning of a Qwen3-8B-based agent.
- The open-source SETA codebase exposes a Terminal Toolkit with structured logging and a Note Taking Toolkit for long-horizon memory, and integrates directly with Terminal Bench evaluation scripts and logging paths in the `seta` GitHub repository.
- The overall design demonstrates a clear path from synthetic RL environments to benchmark-verified agents, giving developers a reproducible stack to train, debug, and evaluate terminal agents rather than relying on ad hoc tool-calling examples.
Check out the blog, technical details, GitHub repo, and model weights.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

