AI & Machine Learning

How We Learn Step-Level Rewards from Preferences to Solve Sparse-Reward Environments Using Online Process Reward Learning

By NextTech · December 3, 2025 · 8 Mins Read


In this tutorial, we explore Online Process Reward Learning (OPRL) and demonstrate how we can learn dense, step-level reward signals from trajectory preferences to solve sparse-reward reinforcement learning tasks. We walk through every component, from the maze environment and reward-model network to preference generation, training loops, and evaluation, while observing how the agent gradually improves its behaviour through online preference-driven shaping. By running this end-to-end implementation, we gain a practical understanding of how OPRL enables better credit assignment, faster learning, and more stable policy optimization in challenging environments where the agent would otherwise struggle to discover meaningful rewards. Check out the FULL CODE NOTEBOOK.
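Before the implementation, it helps to see the core quantity OPRL estimates: under the Bradley–Terry model, the probability that trajectory 1 is preferred over trajectory 2 is sigmoid(r1 − r2), where r1 and r2 are scalar trajectory scores. A minimal standalone sketch (the function name is ours, not from the tutorial code):

```python
import math

def preference_prob(r1, r2):
    """Bradley-Terry probability that trajectory 1 is preferred over trajectory 2."""
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

# Equal scores -> the model is indifferent (probability 0.5)
assert abs(preference_prob(5.0, 5.0) - 0.5) < 1e-9
# A clearly higher score -> near-certain preference
assert preference_prob(8.0, 2.0) > 0.99
```

This is the same sigmoid-of-score-difference that the reward-model training loop later optimizes with a cross-entropy loss.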

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
import matplotlib.pyplot as plt
from collections import deque
import random


torch.manual_seed(42)
np.random.seed(42)
random.seed(42)


class MazeEnv:
    def __init__(self, size=8):
        self.size = size
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        self.obstacles = set((i, size // 2) for i in range(1, size - 2))
        self.reset()

    def reset(self):
        self.pos = self.start
        self.steps = 0
        return self._get_state()

    def _get_state(self):
        # One-hot encoding of the agent's grid position
        state = np.zeros(self.size * self.size)
        state[self.pos[0] * self.size + self.pos[1]] = 1
        return state

    def step(self, action):
        moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # up, right, down, left
        new_pos = (self.pos[0] + moves[action][0],
                   self.pos[1] + moves[action][1])
        if (0 <= new_pos[0] < self.size and
            0 <= new_pos[1] < self.size and
            new_pos not in self.obstacles):
            self.pos = new_pos
        self.steps += 1
        done = self.pos == self.goal or self.steps >= 60
        reward = 10.0 if self.pos == self.goal else 0.0  # sparse terminal reward
        return self._get_state(), reward, done

    def render(self):
        grid = [['.' for _ in range(self.size)] for _ in range(self.size)]
        for obs in self.obstacles:
            grid[obs[0]][obs[1]] = '█'
        grid[self.goal[0]][self.goal[1]] = 'G'
        grid[self.pos[0]][self.pos[1]] = 'A'
        return '\n'.join(''.join(row) for row in grid)


class ProcessRewardModel(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Tanh()
        )

    def forward(self, states):
        return self.net(states)

    def trajectory_reward(self, states):
        # Trajectory score = sum of per-step process rewards
        return self.forward(states).sum()


class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU()
        )
        self.actor = nn.Linear(hidden, action_dim)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, state):
        features = self.backbone(state)
        return self.actor(features), self.critic(features)

We set up the entire foundation of our OPRL system by importing libraries, defining the maze environment, and building the reward and policy networks. We establish how states are represented, how obstacles block movement, and how the sparse reward structure works. We also design the core neural models that will later learn process rewards and drive the policy's decisions. Check out the FULL CODE NOTEBOOK.
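To make the state representation concrete, here is a standalone sketch of the one-hot encoding described above; the helper name and the example position are illustrative, not part of the tutorial code:

```python
# An 8x8 grid flattens to a 64-dim vector with a single 1 at the agent's cell.
def one_hot_state(pos, size=8):
    state = [0.0] * (size * size)
    state[pos[0] * size + pos[1]] = 1.0
    return state

state = one_hot_state((2, 5))  # hypothetical agent position (row, col)
assert sum(state) == 1.0       # exactly one active cell
assert state[2 * 8 + 5] == 1.0 # at the flattened index row*size + col
```

The index arithmetic `row * size + col` is the same row-major flattening the environment's `_get_state` performs with NumPy.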

class OPRLAgent:
    def __init__(self, state_dim, action_dim, lr=3e-4):
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.reward_model = ProcessRewardModel(state_dim)
        self.policy_opt = Adam(self.policy.parameters(), lr=lr)
        self.reward_opt = Adam(self.reward_model.parameters(), lr=lr)
        self.trajectories = deque(maxlen=200)
        self.preferences = deque(maxlen=500)
        self.action_dim = action_dim

    def select_action(self, state, epsilon=0.1):
        if random.random() < epsilon:
            return random.randint(0, self.action_dim - 1)
        state_t = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():
            logits, _ = self.policy(state_t)
            probs = F.softmax(logits, dim=-1)
            return torch.multinomial(probs, 1).item()

    def collect_trajectory(self, env, epsilon=0.1):
        states, actions, rewards = [], [], []
        state = env.reset()
        done = False
        while not done:
            action = self.select_action(state, epsilon)
            next_state, reward, done = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            state = next_state
        traj = {
            'states': torch.FloatTensor(np.array(states)),
            'actions': torch.LongTensor(actions),
            'rewards': torch.FloatTensor(rewards),
            'return': float(sum(rewards))
        }
        self.trajectories.append(traj)
        return traj

We begin constructing the OPRL agent by implementing action selection and trajectory collection. We use an ε-greedy strategy to ensure exploration and gather sequences of states, actions, and returns. As we run the agent through the maze, we store entire trajectories that will later serve as preference data for shaping the reward model. Check out the FULL CODE NOTEBOOK.
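The ε-greedy rule above can be isolated in a few lines; this standalone sketch (names are ours) makes the two branches explicit:

```python
import random

def epsilon_greedy(greedy_action, n_actions, epsilon, rng=random):
    """With probability epsilon pick a uniform random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return greedy_action

random.seed(0)
# epsilon=0 always exploits the greedy action
assert all(epsilon_greedy(2, 4, 0.0) == 2 for _ in range(100))
# epsilon=1 explores uniformly over all four actions
samples = {epsilon_greedy(2, 4, 1.0) for _ in range(1000)}
assert samples == {0, 1, 2, 3}
```

In the agent, the "greedy" branch is itself stochastic (a sample from the softmax policy), which keeps some exploration even as ε decays toward 0.05.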

    def generate_preference(self):
        if len(self.trajectories) < 2:
            return
        t1, t2 = random.sample(list(self.trajectories), 2)
        label = 1.0 if t1['return'] > t2['return'] else 0.0
        self.preferences.append({'t1': t1, 't2': t2, 'label': label})

    def train_reward_model(self, n_updates=5):
        if len(self.preferences) < 32:
            return 0.0
        total_loss = 0.0
        for _ in range(n_updates):
            batch = random.sample(list(self.preferences), 32)
            loss = 0.0
            for item in batch:
                r1 = self.reward_model.trajectory_reward(item['t1']['states'])
                r2 = self.reward_model.trajectory_reward(item['t2']['states'])
                logit = r1 - r2
                pred_prob = torch.sigmoid(logit)
                label = item['label']
                loss += -(label * torch.log(pred_prob + 1e-8) +
                          (1 - label) * torch.log(1 - pred_prob + 1e-8))
            loss = loss / len(batch)
            self.reward_opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.reward_model.parameters(), 1.0)
            self.reward_opt.step()
            total_loss += loss.item()
        return total_loss / n_updates

We generate preference pairs from collected trajectories and train the process reward model using the Bradley–Terry formulation. We compare trajectory-level scores, compute probabilities, and update the reward model to reflect which behaviours appear better. This allows us to learn dense, differentiable, step-level rewards that guide the agent even when the environment itself is sparse. Check out the FULL CODE NOTEBOOK.
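The loss above can be sanity-checked on toy scores. This standalone sketch (plain Python mirroring the torch version, function name ours) confirms that the Bradley–Terry loss penalizes rankings that contradict the preference label:

```python
import math

def bt_loss(r1, r2, label, eps=1e-8):
    """Bradley-Terry preference loss: binary cross-entropy on sigmoid(r1 - r2)."""
    pred = 1.0 / (1.0 + math.exp(-(r1 - r2)))
    return -(label * math.log(pred + eps) + (1 - label) * math.log(1 - pred + eps))

# Small loss when the preferred trajectory (label=1) is scored higher...
assert bt_loss(3.0, 0.0, 1.0) < 0.1
# ...large loss when the reward model's ranking is inverted.
assert bt_loss(0.0, 3.0, 1.0) > 2.0
```

Gradient descent on this loss pushes the summed per-step scores of preferred trajectories up and those of dispreferred ones down, which is how trajectory-level labels turn into step-level rewards.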

    def train_policy(self, n_updates=3, gamma=0.98):
        if len(self.trajectories) < 5:
            return 0.0
        total_loss = 0.0
        for _ in range(n_updates):
            traj = random.choice(list(self.trajectories))
            with torch.no_grad():
                process_rewards = self.reward_model(traj['states']).squeeze()
            shaped_rewards = traj['rewards'] + 0.1 * process_rewards
            returns = []
            G = 0
            for r in reversed(shaped_rewards.tolist()):
                G = r + gamma * G
                returns.insert(0, G)
            returns = torch.FloatTensor(returns)
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
            logits, values = self.policy(traj['states'])
            log_probs = F.log_softmax(logits, dim=-1)
            action_log_probs = log_probs.gather(1, traj['actions'].unsqueeze(1))
            advantages = returns - values.squeeze().detach()
            policy_loss = -(action_log_probs.squeeze() * advantages).mean()
            value_loss = F.mse_loss(values.squeeze(), returns)
            entropy = -(F.softmax(logits, dim=-1) * log_probs).sum(-1).mean()
            loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
            self.policy_opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 1.0)
            self.policy_opt.step()
            total_loss += loss.item()
        return total_loss / n_updates


def train_oprl(episodes=500, render_interval=100):
    env = MazeEnv(size=8)
    agent = OPRLAgent(state_dim=64, action_dim=4, lr=3e-4)
    returns, reward_losses, policy_losses = [], [], []
    success_rate = []
    for ep in range(episodes):
        epsilon = max(0.05, 0.5 - ep / 1000)
        traj = agent.collect_trajectory(env, epsilon)
        returns.append(traj['return'])
        if ep % 2 == 0 and ep > 10:
            agent.generate_preference()
        if ep > 20 and ep % 2 == 0:
            rew_loss = agent.train_reward_model(n_updates=3)
            reward_losses.append(rew_loss)
        if ep > 10:
            pol_loss = agent.train_policy(n_updates=2)
            policy_losses.append(pol_loss)
        success = 1 if traj['return'] > 5 else 0
        success_rate.append(success)
        if ep % render_interval == 0 and ep > 0:
            test_env = MazeEnv(size=8)
            agent.collect_trajectory(test_env, epsilon=0)
            print(test_env.render())
    return returns, reward_losses, policy_losses, success_rate

We train the policy using shaped rewards produced by the learned process reward model. We compute returns, advantages, value estimates, and entropy bonuses, enabling the agent to improve its strategy over time. We then build a full training loop in which exploration decays, preferences accumulate, and both the reward model and the policy are updated repeatedly. Check out the FULL CODE NOTEBOOK.
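The return computation inside `train_policy` can be checked in isolation. This standalone sketch (reward values are illustrative, and gamma is set to 1 only to make the arithmetic easy to verify) shows sparse environment rewards being combined with scaled process rewards and then discounted backwards in time:

```python
def discounted_returns(rewards, gamma=0.98):
    """Compute G_t = r_t + gamma * G_{t+1} by iterating backwards."""
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return returns

env_rewards = [0.0, 0.0, 10.0]        # sparse: terminal reward only
process_rewards = [0.5, 0.5, 0.5]     # hypothetical dense process rewards
shaped = [e + 0.1 * p for e, p in zip(env_rewards, process_rewards)]

rets = discounted_returns(shaped, gamma=1.0)
# shaped = [0.05, 0.05, 10.05], so undiscounted returns are [10.15, 10.10, 10.05]
assert abs(rets[0] - 10.15) < 1e-9
assert abs(rets[-1] - 10.05) < 1e-9
```

Note how shaping gives every step a nonzero return even before the goal is reached, which is exactly the dense credit-assignment signal the sparse environment lacks.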

print("Training OPRL Agent on Sparse Reward Maze...\n")
returns, rew_losses, pol_losses, success = train_oprl(episodes=500, render_interval=250)


fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0,0].plot(returns, alpha=0.3)
axes[0,0].plot(np.convolve(returns, np.ones(20)/20, mode="valid"), linewidth=2)
axes[0,0].set_xlabel('Episode')
axes[0,0].set_ylabel('Return')
axes[0,0].set_title('Agent Performance')
axes[0,0].grid(alpha=0.3)

success_smooth = np.convolve(success, np.ones(20)/20, mode="valid")
axes[0,1].plot(success_smooth, linewidth=2, color="green")
axes[0,1].set_xlabel('Episode')
axes[0,1].set_ylabel('Success Rate')
axes[0,1].set_title('Goal Success Rate')
axes[0,1].grid(alpha=0.3)

axes[1,0].plot(rew_losses, linewidth=2, color="orange")
axes[1,0].set_xlabel('Update Step')
axes[1,0].set_ylabel('Loss')
axes[1,0].set_title('Reward Model Loss')
axes[1,0].grid(alpha=0.3)

axes[1,1].plot(pol_losses, linewidth=2, color="red")
axes[1,1].set_xlabel('Update Step')
axes[1,1].set_ylabel('Loss')
axes[1,1].set_title('Policy Loss')
axes[1,1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("OPRL Training Complete!")
print("Process rewards, preference learning, reward shaping, and online updates demonstrated.")

We visualize the learning dynamics by plotting returns, success rates, reward-model loss, and policy loss. We track how the agent's performance evolves as OPRL shapes the reward landscape. By the end of the visualization, we clearly see the influence of process rewards on solving a challenging, sparse-reward maze.

In conclusion, we see how OPRL transforms sparse terminal outcomes into rich online feedback that continuously guides the agent's behaviour. We watch the process reward model learn preferences, shape the return signal, and accelerate the policy's ability to reach the goal. With larger mazes, varying shaping strengths, or even real human preference feedback, we appreciate how OPRL provides a flexible and powerful framework for credit assignment in complex decision-making tasks. We finish with a clear, hands-on understanding of how OPRL operates and how we can extend it to more advanced agentic RL settings.


Check out the FULL CODE NOTEBOOK and Paper. Feel free to check out our GitHub Page for Tutorials, Codes, and Notebooks. Also, follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter. You can also join us on Telegram.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
