In this tutorial, we implement an agentic chain-of-thought pruning framework that generates multiple reasoning paths in parallel and dynamically prunes them using consensus signals and early stopping. We focus on improving reasoning efficiency by reducing unnecessary token usage while preserving answer correctness, demonstrating that self-consistency and lightweight graph-based agreement can serve as effective proxies for reasoning quality. We design the entire pipeline around a compact instruction-tuned model and progressive sampling to simulate how an agent can decide when it has reasoned "enough."
!pip -q install -U transformers accelerate bitsandbytes networkx scikit-learn

import re, time, random, math
import numpy as np
import torch
import networkx as nx
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SEED = 7
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=torch.float16,
    load_in_4bit=True
)
model.eval()

SYSTEM = "You are a careful problem solver. Keep reasoning brief and output a final numeric answer."
FINAL_RE = re.compile(r"Final:\s*([-\d]+(?:\.\d+)?)")
We set up the Colab environment and load all required libraries for efficient agentic reasoning. We initialize a lightweight instruction-tuned language model with 4-bit quantization to ensure stable execution on limited GPU resources. We also define the global configuration, randomness control, and the core prompting pattern used throughout the tutorial.
def make_prompt(q):
    return (
        f"{SYSTEM}\n\n"
        f"Problem: {q}\n"
        f"Reasoning: (brief)\n"
        f"Final: "
    )

def parse_final_number(text):
    # Prefer the explicit "Final:" marker; fall back to the last number found.
    m = FINAL_RE.search(text)
    if m:
        return m.group(1).strip()
    nums = re.findall(r"[-]?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

def is_correct(pred, gold):
    if pred is None:
        return 0
    try:
        return int(abs(float(pred) - float(gold)) < 1e-9)
    except ValueError:
        return int(str(pred).strip() == str(gold).strip())

def tok_len(text):
    return len(tokenizer.encode(text))
We define helper functions that structure prompts, extract final numeric answers, and evaluate correctness against ground truth. We standardize how answers are parsed so that different reasoning paths can be compared consistently. We also introduce a token-counting utility that lets us measure reasoning efficiency later.
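As a quick sanity check, the parsing logic can be exercised on its own. The regex and function are restated here so the snippet runs standalone, without loading the model:

```python
import re

# Same patterns as in the tutorial, restated so this snippet is self-contained.
FINAL_RE = re.compile(r"Final:\s*([-\d]+(?:\.\d+)?)")

def parse_final_number(text):
    # Prefer the explicit "Final:" marker; fall back to the last number found.
    m = FINAL_RE.search(text)
    if m:
        return m.group(1).strip()
    nums = re.findall(r"[-]?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

# A completion that follows the prompt format parses via the anchored pattern:
print(parse_final_number("Reasoning: 3 * 4 = 12\nFinal: 12"))   # 12
# A free-form completion falls back to the last number in the text:
print(parse_final_number("the total is 12 dollars"))            # 12
# No number at all yields None:
print(parse_final_number("no answer given"))                    # None
```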
@torch.no_grad()
def generate_paths(question, n, max_new_tokens=64, temperature=0.7, top_p=0.9):
    prompt = make_prompt(question)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    gen_cfg = GenerationConfig(
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        num_return_sequences=n
    )
    out = model.generate(**inputs, generation_config=gen_cfg)
    prompt_tok = inputs["input_ids"].shape[1]
    paths = []
    for i in range(out.shape[0]):
        seq = out[i]
        gen_ids = seq[prompt_tok:]
        completion = tokenizer.decode(gen_ids, skip_special_tokens=True)
        paths.append({
            "prompt_tokens": int(prompt_tok),
            "gen_tokens": int(gen_ids.shape[0]),
            "completion": completion
        })
    return paths
We implement fast multi-sample generation that produces several reasoning paths in a single model call. We extract only the generated continuation to isolate the reasoning output for each path. We store token usage and completions in a structured format to support downstream pruning decisions.
def consensus_strength(completions, sim_threshold=0.22):
    if len(completions) <= 1:
        return [0.0] * len(completions)
    vec = TfidfVectorizer(ngram_range=(1, 2), max_features=2500)
    X = vec.fit_transform(completions)
    S = cosine_similarity(X)
    G = nx.Graph()
    n = len(completions)
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            w = float(S[i, j])
            if w >= sim_threshold:
                G.add_edge(i, j, weight=w)
    strength = [0.0] * n
    for u, v, d in G.edges(data=True):
        w = float(d.get("weight", 0.0))
        strength[u] += w
        strength[v] += w
    return strength
We construct a lightweight consensus mechanism using a similarity graph over generated reasoning paths. We compute pairwise similarity scores and convert them into a graph-based strength signal for each path. This lets us approximate agreement between reasoning trajectories without expensive model calls.
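To make the consensus signal concrete, here is a dependency-free sketch of the same idea. It swaps TF-IDF cosine similarity for simple Jaccard word overlap (an illustrative substitution, not the tutorial's exact metric), but keeps the thresholded pairwise-strength accumulation:

```python
# Dependency-free sketch: each path's strength is the sum of its pairwise
# similarities that clear the threshold, mirroring the weighted-degree signal
# computed over the TF-IDF similarity graph above.
def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def consensus_strength_sketch(completions, sim_threshold=0.5):
    n = len(completions)
    strength = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            w = jaccard(completions[i], completions[j])
            if w >= sim_threshold:
                strength[i] += w
                strength[j] += w
    return strength

paths = [
    "12 / 3 = 4 so each notebook costs 4. Final: 4",
    "each of the 3 notebooks costs 12 / 3 = 4. Final: 4",
    "maybe 12 - 3 = 9. Final: 9",
]
# The two agreeing paths get equal positive strength; the outlier gets 0.0.
print(consensus_strength_sketch(paths))
```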
def pick_final_answer(paths):
    answers = [parse_final_number(p["completion"]) for p in paths]
    strengths = consensus_strength([p["completion"] for p in paths])
    groups = {}
    for i, a in enumerate(answers):
        if a is None:
            continue
        groups.setdefault(a, {"idx": [], "strength": 0.0, "tokens": 0})
        groups[a]["idx"].append(i)
        groups[a]["strength"] += strengths[i]
        groups[a]["tokens"] += paths[i]["gen_tokens"]
    if not groups:
        return None, {"answers": answers, "strengths": strengths}
    # Rank answers by vote count, then consensus strength, then fewest tokens.
    ranked = sorted(
        groups.items(),
        key=lambda kv: (len(kv[1]["idx"]), kv[1]["strength"], -kv[1]["tokens"]),
        reverse=True
    )
    best_answer = ranked[0][0]
    best_indices = ranked[0][1]["idx"]
    best_i = sorted(best_indices, key=lambda i: (paths[i]["gen_tokens"], -strengths[i]))[0]
    return best_answer, {"answers": answers, "strengths": strengths, "best_i": best_i}

def pruned_agent_answer(
    question,
    batch_size=2,
    k_max=10,
    max_new_tokens=64,
    temperature=0.7,
    top_p=0.9,
    stop_min_samples=4,
    stop_ratio=0.67,
    stop_margin=2
):
    paths = []
    prompt_tokens_once = tok_len(make_prompt(question))
    total_gen_tokens = 0
    while len(paths) < k_max:
        n = min(batch_size, k_max - len(paths))
        new_paths = generate_paths(
            question,
            n=n,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p
        )
        paths.extend(new_paths)
        total_gen_tokens += sum(p["gen_tokens"] for p in new_paths)
        if len(paths) >= stop_min_samples:
            answers = [parse_final_number(p["completion"]) for p in paths]
            counts = {}
            for a in answers:
                if a is None:
                    continue
                counts[a] = counts.get(a, 0) + 1
            if counts:
                sorted_counts = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
                top_a, top_c = sorted_counts[0]
                second_c = sorted_counts[1][1] if len(sorted_counts) > 1 else 0
                # Early stop: a dominant answer with a clear margin over the runner-up.
                if top_c >= math.ceil(stop_ratio * len(paths)) and (top_c - second_c) >= stop_margin:
                    final, dbg = pick_final_answer(paths)
                    return {
                        "final": final,
                        "paths": paths,
                        "early_stopped_at": len(paths),
                        "tokens_total": int(prompt_tokens_once * len(paths) + total_gen_tokens),
                        "debug": dbg
                    }
    final, dbg = pick_final_answer(paths)
    return {
        "final": final,
        "paths": paths,
        "early_stopped_at": None,
        "tokens_total": int(prompt_tokens_once * len(paths) + total_gen_tokens),
        "debug": dbg
    }
We implement the core agentic pruning logic that groups reasoning paths by final answer and ranks them using consensus and efficiency signals. We introduce progressive sampling with early stopping to terminate generation once sufficient confidence emerges. We then select a final answer that balances agreement strength and minimal token usage.
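The early-stop condition can be checked in isolation. The helper below is a hypothetical standalone restatement of the rule inside `pruned_agent_answer`: stop once the leading answer covers at least `stop_ratio` of all sampled paths and beats the runner-up by `stop_margin` votes:

```python
import math

# Standalone restatement of the early-stop rule for inspection (hypothetical
# helper, not part of the tutorial pipeline).
def should_stop(answers, stop_ratio=0.67, stop_margin=2, stop_min_samples=4):
    if len(answers) < stop_min_samples:
        return False
    counts = {}
    for a in answers:
        if a is not None:  # unparseable paths cast no vote
            counts[a] = counts.get(a, 0) + 1
    if not counts:
        return False
    ranked = sorted(counts.values(), reverse=True)
    top, second = ranked[0], ranked[1] if len(ranked) > 1 else 0
    return top >= math.ceil(stop_ratio * len(answers)) and (top - second) >= stop_margin

print(should_stop(["4", "4", "4", "9"]))  # True: 3 >= ceil(0.67*4)=3, margin 2 >= 2
print(should_stop(["4", "4", "9", "9"]))  # False: tie, no dominant answer
print(should_stop(["4", "4", "4"]))       # False: below stop_min_samples
```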
def baseline_answer(question, k=10, max_new_tokens=64):
    paths = generate_paths(question, n=k, max_new_tokens=max_new_tokens)
    prompt_tokens_once = tok_len(make_prompt(question))
    total_gen_tokens = sum(p["gen_tokens"] for p in paths)
    answers = [parse_final_number(p["completion"]) for p in paths]
    counts = {}
    for a in answers:
        if a is None:
            continue
        counts[a] = counts.get(a, 0) + 1
    final = max(counts.items(), key=lambda kv: kv[1])[0] if counts else None
    return {
        "final": final,
        "paths": paths,
        "tokens_total": int(prompt_tokens_once * k + total_gen_tokens)
    }
DATA = [
{"q": "If a store sells 3 notebooks for $12, how much does 1 notebook cost?", "a": "4"},
{"q": "What is 17*6?", "a": "102"},
{"q": "A rectangle has length 9 and width 4. What is its area?", "a": "36"},
{"q": "If you buy 5 apples at $2 each, how much do you pay?", "a": "10"},
{"q": "What is 144 divided by 12?", "a": "12"},
{"q": "If x=8, what is 3x+5?", "a": "29"},
{"q": "A jar has 30 candies. You eat 7. How many remain?", "a": "23"},
{"q": "If a train travels 60 km in 1.5 hours, what is its average speed (km/h)?", "a": "40"},
{"q": "Compute: (25 - 9) * 3", "a": "48"},
{"q": "What is the next number in the pattern: 2, 4, 8, 16, ?", "a": "32"},
]
base_acc, base_tok = [], []
prun_acc, prun_tok = [], []
for item in DATA:
    b = baseline_answer(item["q"], k=8, max_new_tokens=56)
    base_acc.append(is_correct(b["final"], item["a"]))
    base_tok.append(b["tokens_total"])
    p = pruned_agent_answer(item["q"], max_new_tokens=56)
    prun_acc.append(is_correct(p["final"], item["a"]))
    prun_tok.append(p["tokens_total"])
print("Baseline accuracy:", float(np.mean(base_acc)))
print("Baseline avg tokens:", float(np.mean(base_tok)))
print("Pruned accuracy:", float(np.mean(prun_acc)))
print("Pruned avg tokens:", float(np.mean(prun_tok)))
We compare the pruned agentic approach against a fixed self-consistency baseline. We evaluate both methods on accuracy and token consumption to quantify the efficiency gains from pruning. We conclude by reporting aggregate metrics that show how dynamic pruning preserves correctness while reducing reasoning cost.
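To show how a headline savings figure follows from these aggregates, here is a small sketch with hypothetical per-question numbers (illustrative only, not measured results from the run above):

```python
import numpy as np

# Hypothetical per-question results: 1 = correct, 0 = incorrect;
# token counts are totals per question (prompt + generated).
base_acc = [1, 1, 0, 1]; base_tok = [540, 510, 560, 530]
prun_acc = [1, 1, 0, 1]; prun_tok = [310, 290, 520, 300]

# Accuracy is preserved while average tokens drop; the relative reduction
# in mean token usage is the efficiency gain reported by the comparison.
savings = 1.0 - float(np.mean(prun_tok)) / float(np.mean(base_tok))
print("Accuracy delta:", float(np.mean(prun_acc)) - float(np.mean(base_acc)))
print(f"Token savings: {savings:.1%}")
```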
In conclusion, we demonstrated that agentic pruning can significantly reduce effective token consumption without sacrificing accuracy by stopping reasoning once sufficient consensus emerges. We showed that combining self-consistency, similarity-based consensus graphs, and early-stop heuristics provides a practical and scalable approach to reasoning efficiency in agentic systems. This framework serves as a foundation for more advanced agentic behaviors, such as mid-generation pruning, budget-aware reasoning, and adaptive control over reasoning depth in real-world AI agents.