How to Align Large Language Models with Human Preferences Using Direct Preference Optimization, QLoRA, and UltraFeedback

By NextTech · February 13, 2026 · 6 Mins Read


In this tutorial, we implement an end-to-end Direct Preference Optimization (DPO) workflow to align a large language model with human preferences without using a reward model. We combine TRL's DPOTrainer with QLoRA and PEFT to make preference-based alignment feasible on a single Colab GPU. We train directly on the UltraFeedback binarized dataset, where every prompt has a chosen and a rejected response, allowing us to shape model behavior and style rather than just factual recall.
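For orientation, the objective that DPOTrainer optimizes under the hood is simple enough to write out by hand. The sketch below is illustrative only (the helper name dpo_loss and the toy log-probabilities are ours, not part of the tutorial code): it computes the pairwise loss -log σ(β · [(log πθ(chosen) − log πref(chosen)) − (log πθ(rejected) − log πref(rejected))]), which pushes the policy to prefer the chosen response more strongly than the frozen reference model does.

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps, ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "rewards" are the policy's log-prob shifts relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy sequence log-probabilities, purely for illustration.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]), torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # below log(2) ≈ 0.693, since the policy already favors the chosen answer here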

import os
import math
import random
import torch


!pip -q install -U "transformers>=4.45.0" "datasets>=2.19.0" "accelerate>=0.33.0" "trl>=0.27.0" "peft>=0.12.0" "bitsandbytes>=0.43.0" "sentencepiece" "evaluate"


SEED = 42
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)


MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen2-0.5B-Instruct")
DATASET_NAME = "HuggingFaceH4/ultrafeedback_binarized"
OUTPUT_DIR = "dpo_ultrafeedback_qlora"


MAX_TRAIN_SAMPLES = 8000
MAX_EVAL_SAMPLES  = 200
MAX_PROMPT_LEN = 512
MAX_COMPLETION_LEN = 256


BETA = 0.1
LR = 2e-4
EPOCHS = 1
PER_DEVICE_BS = 2
GRAD_ACCUM = 8


LOGGING_STEPS = 10
SAVE_STEPS = 200


gadget = "cuda" if torch.cuda.is_available() else "cpu"
print("Gadget:", gadget, "GPU:", torch.cuda.get_device_name(0) if gadget == "cuda" else "None")

We set up the execution environment and install all required libraries for DPO, PEFT, and quantized training. We define all global hyperparameters, dataset limits, and optimization settings in a single place. We also seed the random number generators and confirm GPU availability to ensure reproducible runs.

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


bnb_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16,
)


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
if tokenizer.pad_token is None:
   tokenizer.pad_token = tokenizer.eos_token


model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16,
    device_map="auto",
)
model.config.use_cache = False

We load the tokenizer and the base language model using 4-bit quantization to minimize memory usage. We configure bitsandbytes to enable efficient QLoRA-style computation on Colab GPUs. We prepare the model for training by disabling cache usage to avoid incompatibilities during backpropagation.

from peft import LoraConfig, get_peft_model


lora_config = LoraConfig(
   r=16,
   lora_alpha=32,
   lora_dropout=0.05,
   bias="none",
   task_type="CAUSAL_LM",
   target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"],
)


model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


model.gradient_checkpointing_enable()

We attach LoRA adapters to the model's attention and feed-forward projection layers. We restrict training to a small set of parameters to make fine-tuning efficient and stable. We enable gradient checkpointing to further reduce GPU memory consumption during training.
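One optional step that many QLoRA recipes add right before get_peft_model, and that we skip here for brevity, is PEFT's prepare_model_for_kbit_training helper, which upcasts a few small layers for numerical stability and wires up gradient checkpointing for the quantized base model. A minimal sketch (it would take the place of the plain gradient_checkpointing_enable() call above):

from peft import prepare_model_for_kbit_training


# Optional: stabilize 4-bit training before attaching the LoRA adapters.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)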

from datasets import load_dataset


ds = load_dataset(DATASET_NAME)


train_split = "train_prefs" if "train_prefs" in ds else ("practice" if "practice" in ds else checklist(ds.keys())[0])
test_split  = "test_prefs" if "test_prefs" in ds else ("take a look at" if "take a look at" in ds else None)


train_raw = ds[train_split]
test_raw = ds[test_split] if test_split will not be None else None


print("Splits:", ds.keys())
print("Utilizing practice cut up:", train_split, "dimension:", len(train_raw))
if test_raw will not be None:
   print("Utilizing take a look at cut up:", test_split, "dimension:", len(test_raw))


def _extract_last_user_and_assistant(messages):
    last_user_idx = None
    last_asst_idx = None
    for i, m in enumerate(messages):
        if m.get("role") == "user":
            last_user_idx = i
        if m.get("role") == "assistant":
            last_asst_idx = i

    if last_user_idx is None or last_asst_idx is None:
        return None, None

    prompt_messages = messages[: last_user_idx + 1]
    assistant_text = messages[last_asst_idx].get("content", "")
    return prompt_messages, assistant_text


def format_example(ex):
    chosen_msgs = ex["chosen"]
    rejected_msgs = ex["rejected"]

    prompt_msgs_c, chosen_text = _extract_last_user_and_assistant(chosen_msgs)
    prompt_msgs_r, rejected_text = _extract_last_user_and_assistant(rejected_msgs)

    if prompt_msgs_c is None or prompt_msgs_r is None:
        return {"prompt": None, "chosen": None, "rejected": None}

    prompt_text = tokenizer.apply_chat_template(
        prompt_msgs_c, tokenize=False, add_generation_prompt=True
    )

    return {
        "prompt": prompt_text,
        "chosen": chosen_text.strip(),
        "rejected": rejected_text.strip(),
    }


train_raw = train_raw.shuffle(seed=SEED)
train_raw = train_raw.select(range(min(MAX_TRAIN_SAMPLES, len(train_raw))))


train_ds = train_raw.map(format_example, remove_columns=train_raw.column_names)
train_ds = train_ds.filter(lambda x: x["prompt"] is not None and len(x["chosen"]) > 0 and len(x["rejected"]) > 0)


if test_raw is not None:
    test_raw = test_raw.shuffle(seed=SEED)
    test_raw = test_raw.select(range(min(MAX_EVAL_SAMPLES, len(test_raw))))
    eval_ds = test_raw.map(format_example, remove_columns=test_raw.column_names)
    eval_ds = eval_ds.filter(lambda x: x["prompt"] is not None and len(x["chosen"]) > 0 and len(x["rejected"]) > 0)
else:
    eval_ds = None


print("Train examples:", len(train_ds), "Eval examples:", len(eval_ds) if eval_ds is not None else 0)
print(train_ds[0])

We load the UltraFeedback binarized dataset and dynamically select the appropriate train and test splits. We extract prompt, chosen, and rejected responses from the multi-turn conversations and format them using the model's chat template. We shuffle, filter, and subsample the data to create clean and efficient training and evaluation datasets.
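To see what apply_chat_template actually produces for the prompt side, here is a quick self-contained check (the toy message is ours; the exact special tokens depend on the model's chat template):

# Illustrative only: how the chat template wraps a single user turn.
toy_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain DPO in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(toy_prompt)  # the user turn wrapped in chat markers, ending with the assistant header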

from trl import DPOTrainer, DPOConfig


use_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
use_fp16 = torch.cuda.is_available() and not use_bf16


training_args = DPOConfig(
   output_dir=OUTPUT_DIR,
   beta=BETA,
   per_device_train_batch_size=PER_DEVICE_BS,
   gradient_accumulation_steps=GRAD_ACCUM,
   num_train_epochs=EPOCHS,
   learning_rate=LR,
   lr_scheduler_type="cosine",
   warmup_ratio=0.05,
   logging_steps=LOGGING_STEPS,
   save_steps=SAVE_STEPS,
   save_total_limit=2,
   bf16=use_bf16,
   fp16=use_fp16,
   optim="paged_adamw_8bit",
   max_length=MAX_PROMPT_LEN + MAX_COMPLETION_LEN,
   max_prompt_length=MAX_PROMPT_LEN,
   report_to="none",
)


trainer = DPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)


trainer.train()


trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)


print("Saved to:", OUTPUT_DIR)

We configure the DPO training objective with carefully chosen optimization and scheduling parameters. We initialize the DPOTrainer to directly optimize preference pairs without a reward model. We train the LoRA adapters and save the aligned model artifacts for later inference.
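Because we passed eval_dataset to the trainer, we can also pull the standard evaluation metrics after training (evaluation loss plus, depending on the TRL version, reward margins and accuracies on the held-out preference pairs). An optional check:

# Optional: evaluate on the held-out preference pairs.
if eval_ds is not None:
    metrics = trainer.evaluate()
    print({k: round(v, 4) for k, v in metrics.items() if isinstance(v, float)})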

from peft import PeftModel
from transformers import pipeline


def generate_text(model_for_gen, prompt, max_new_tokens=180):
    model_for_gen.eval()
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_PROMPT_LEN).to(model_for_gen.device)
    with torch.no_grad():
        out = model_for_gen.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)


base_model = AutoModelForCausalLM.from_pretrained(
   MODEL_NAME,
   quantization_config=bnb_config,
   torch_dtype=torch.bfloat16 if use_bf16 else torch.float16,
   device_map="auto",
)
base_model.config.use_cache = True


dpo_model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
dpo_model.config.use_cache = True


sample_pool = eval_ds if eval_ds is not None and len(eval_ds) > 0 else train_ds
samples = [sample_pool[i] for i in random.sample(range(len(sample_pool)), k=min(3, len(sample_pool)))]


for i, ex in enumerate(samples, 1):
    prompt = ex["prompt"]
    print("\n" + "=" * 90)
    print(f"Sample #{i}")
    print("- Prompt:\n", prompt)

    base_out = generate_text(base_model, prompt)
    dpo_out = generate_text(dpo_model, prompt)

    print("\n- Base model output:\n", base_out)
    print("\n- DPO (LoRA) output:\n", dpo_out)


print("\nDone.")

We reload the base model and attach the trained DPO LoRA adapters for inference. We generate responses from both the original and aligned models using the same prompts for comparison. We qualitatively evaluate how preference optimization changes model behavior by inspecting the outputs side by side.

In conclusion, we demonstrated how DPO provides a stable and efficient alternative to RLHF by directly optimizing preference pairs with a simple, well-defined objective. We showed that parameter-efficient fine-tuning with LoRA and 4-bit quantization enables practical experimentation even under tight compute constraints. We qualitatively validated alignment by comparing generations before and after DPO training, confirming that the model learns to prefer higher-quality responses while remaining lightweight and deployable.
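As an optional follow-up (not part of the notebook above), if you prefer a single standalone checkpoint over base model plus adapters, a common pattern is to reload the base model without quantization and fold the LoRA weights into it with PEFT's merge_and_unload; a hedged sketch:

from transformers import AutoModelForCausalLM
from peft import PeftModel


# Reload the base model in half precision so the LoRA deltas can be merged into full weights.
merge_base = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")
merged = PeftModel.from_pretrained(merge_base, OUTPUT_DIR).merge_and_unload()
merged.save_pretrained(OUTPUT_DIR + "-merged")
tokenizer.save_pretrained(OUTPUT_DIR + "-merged")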


Check out the FULL CODES here. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter. Are you on Telegram? You can join us there as well.


