AI & Machine Learning

A Coding Implementation for End-to-End Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization

By NextTech | September 24, 2025 | 6 Mins Read


In this tutorial, we walk through how we use Hugging Face Optimum to optimize Transformer models and make them faster while maintaining accuracy. We begin by setting up DistilBERT on the SST-2 dataset, and then we compare different execution engines, including plain PyTorch, torch.compile, ONNX Runtime, and quantized ONNX. By proceeding step by step, we get hands-on experience with model export, optimization, quantization, and benchmarking, all within a Google Colab environment. Check out the FULL CODES here.

!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate


from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import QuantizationConfig


os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")


MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR  = Path("onnx-distilbert")
Q_DIR    = Path("onnx-distilbert-quant")
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"
BATCH    = 16
MAXLEN   = 128
N_WARM   = 3
N_ITERS  = 8


print(f"Device: {DEVICE} | torch={torch.__version__}")

We begin by installing the required libraries and setting up the environment for Hugging Face Optimum with ONNX Runtime. We configure paths, batch size, and iteration settings, and we confirm whether we run on CPU or GPU. Check out the FULL CODES here.

ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = ds["sentence"], ds["label"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)


def make_batches(texts, max_len=MAXLEN, batch=BATCH):
   for i in range(0, len(texts), batch):
       yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                       max_length=max_len, return_tensors="pt")


def run_eval(predict_fn, texts, labels):
   preds = []
   for toks in make_batches(texts):
       preds.extend(predict_fn(toks))
   return metric.compute(predictions=preds, references=labels)["accuracy"]


def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
   for _ in range(n_warm):
       for toks in make_batches(texts[:BATCH*2]):
           predict_fn(toks)
   times = []
   for _ in range(n_iters):
       t0 = time.time()
       for toks in make_batches(texts):
           predict_fn(toks)
       times.append((time.time() - t0) * 1000)
   return float(np.mean(times)), float(np.std(times))

We load an SST-2 validation slice and prepare tokenization, an accuracy metric, and batching. We define run_eval to compute accuracy from any predictor and bench to warm up and time end-to-end inference. With these helpers, we compare different engines fairly, using identical data and batching. Check out the FULL CODES here.

torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()


@torch.no_grad()
def pt_predict(toks):
   toks = {k: v.to(DEVICE) for k, v in toks.items()}
   logits = torch_model(**toks).logits
   return logits.argmax(-1).detach().cpu().tolist()


pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager]   {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")


compiled_model = torch_model
compile_ok = False
try:
   compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
   compile_ok = True
except Exception as e:
   print("torch.compile unavailable or failed -> skipping:", repr(e))


@torch.no_grad()
def ptc_predict(toks):
   toks = {k: v.to(DEVICE) for k, v in toks.items()}
   logits = compiled_model(**toks).logits
   return logits.argmax(-1).detach().cpu().tolist()


if compile_ok:
   ptc_ms, ptc_sd = bench(ptc_predict, texts)
   ptc_acc = run_eval(ptc_predict, texts, labels)
   print(f"[torch.compile]   {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")

We load the baseline PyTorch classifier, define a pt_predict helper, and benchmark and score it on SST-2. We then attempt torch.compile for just-in-time graph optimizations and, if it succeeds, run the same benchmarks to compare speed and accuracy under an identical setup. Check out the FULL CODES here.

provider = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
ort_model = ORTModelForSequenceClassification.from_pretrained(
   MODEL_ID, export=True, provider=provider, cache_dir=ORT_DIR
)
# Save the exported ONNX model so ORTQuantizer can load it from ORT_DIR below.
ort_model.save_pretrained(ORT_DIR)


@torch.no_grad()
def ort_predict(toks):
   logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits
   return logits.argmax(-1).cpu().tolist()


ort_ms, ort_sd = bench(ort_predict, texts)
ort_acc = run_eval(ort_predict, texts, labels)
print(f"[ONNX Runtime]    {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")


Q_DIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ORT_DIR)
qconfig = QuantizationConfig(strategy="dynamic", per_channel=False, reduce_range=True)
quantizer.quantize(quantization_config=qconfig, save_dir=Q_DIR)


ort_quant = ORTModelForSequenceClassification.from_pretrained(Q_DIR, provider=provider)


@torch.no_grad()
def ortq_predict(toks):
   logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits
   return logits.argmax(-1).cpu().tolist()


oq_ms, oq_sd = bench(ortq_predict, texts)
oq_acc = run_eval(ortq_predict, texts, labels)
print(f"[ORT Quantized]   {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")

We export the model to ONNX, run it with ONNX Runtime, then apply dynamic quantization with Optimum's ORTQuantizer and benchmark both variants to see how latency improves while accuracy stays comparable. Check out the FULL CODES here.

pt_pipe  = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                   device=0 if DEVICE=="cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)
samples = [
   "What a fantastic movie—performed brilliantly!",
   "This was a complete waste of time.",
   "I’m not sure how I feel about this one."
]
print("\nSample predictions (PT | ORT):")
for s in samples:
   a = pt_pipe(s)[0]["label"]
   b = ort_pipe(s)[0]["label"]
   print(f"- {s}\n  PT={a} | ORT={b}")


import pandas as pd
rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
       ["ONNX Runtime",  ort_ms, ort_sd, ort_acc],
       ["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok: rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])
df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)


print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For larger gains on GPU, also try FlashAttention2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, use QuantizationConfig(strategy="static") with a calibration set.
""")

We sanity-check predictions with quick sentiment pipelines and print the PyTorch and ONNX labels side by side. We then assemble a summary table to compare latency and accuracy across engines, inserting the torch.compile results when available. We conclude with practical notes that let us extend the workflow to other backends and quantization modes; a hedged sketch of the static-quantization note follows below.
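
As a sketch of the static (calibrated) quantization note above, the snippet below combines the article's QuantizationConfig(strategy="static") form with Optimum's documented calibration helpers (get_calibration_dataset, AutoCalibrationConfig, and ORTQuantizer.fit). Exact signatures can vary across Optimum versions, and QS_DIR is a hypothetical output directory, so treat this as a starting point rather than a drop-in step.

from optimum.onnxruntime.configuration import AutoCalibrationConfig

# Sketch only: static quantization with a calibration set; signatures may
# differ across Optimum versions, and QS_DIR is a hypothetical directory.
QS_DIR = Path("onnx-distilbert-quant-static")

static_quantizer = ORTQuantizer.from_pretrained(ORT_DIR)
static_qconfig = QuantizationConfig(strategy="static", per_channel=False)

# Draw a small calibration set from the SST-2 training split.
calib_ds = static_quantizer.get_calibration_dataset(
    "glue", dataset_config_name="sst2", dataset_split="train", num_samples=100,
    preprocess_function=lambda ex: tokenizer(ex["sentence"], padding="max_length",
                                             truncation=True, max_length=MAXLEN),
)

# Collect activation ranges with a min-max calibrator, then quantize with them.
calib_config = AutoCalibrationConfig.minmax(calib_ds)
ranges = static_quantizer.fit(dataset=calib_ds, calibration_config=calib_config)
static_quantizer.quantize(quantization_config=static_qconfig, save_dir=QS_DIR,
                          calibration_tensors_range=ranges)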

In conclusion, we can clearly see how Optimum helps us bridge the gap between standard PyTorch models and production-ready, optimized deployments. We achieve speedups with ONNX Runtime and quantization while retaining accuracy, and we also see how torch.compile delivers gains directly within PyTorch. This workflow demonstrates a practical approach to balancing performance and efficiency for Transformer models, and it provides a foundation that can be extended further with advanced backends such as OpenVINO or TensorRT.
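
As a pointer toward that extension, here is a minimal sketch of swapping in the OpenVINO backend via optimum-intel. It assumes optimum-intel is installed (for example, pip install "optimum[openvino]") and reuses the bench and run_eval helpers from above; OVModelForSequenceClassification mirrors the ORT model class, and ov_predict is a helper named here only for illustration.

# Sketch only: OpenVINO backend via optimum-intel, reusing the helpers above.
from optimum.intel import OVModelForSequenceClassification

ov_model = OVModelForSequenceClassification.from_pretrained(MODEL_ID, export=True)

@torch.no_grad()
def ov_predict(toks):
    # OpenVINO inference runs on CPU here, so keep the input tensors on CPU.
    logits = ov_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()

ov_ms, ov_sd = bench(ov_predict, texts)
ov_acc = run_eval(ov_predict, texts, labels)
print(f"[OpenVINO]        {ov_ms:.1f}±{ov_sd:.1f} ms | acc={ov_acc:.4f}")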


Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.



Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.

