In this tutorial, we walk through how we use Hugging Face Optimum to optimize Transformer models and make them faster while maintaining accuracy. We begin by setting up DistilBERT on the SST-2 dataset, and then we compare different execution engines, including plain PyTorch and torch.compile, ONNX Runtime, and quantized ONNX. By doing this step by step, we get hands-on experience with model export, optimization, quantization, and benchmarking, all within a Google Colab environment. Check out the FULL CODES here.
!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate
from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR = Path("onnx-distilbert")
Q_DIR = Path("onnx-distilbert-quant")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH = 16
MAXLEN = 128
N_WARM = 3
N_ITERS = 8
print(f"Machine: {DEVICE} | torch={torch.__version__}")
We begin by installing the required libraries and setting up the environment for Hugging Face Optimum with ONNX Runtime. We configure paths, batch size, and iteration settings, and we confirm whether we run on CPU or GPU. Check out the FULL CODES here.
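As an optional sanity check (not part of the original script), we can also confirm which execution providers ONNX Runtime actually exposes in this runtime; the CUDA provider only appears with the onnxruntime-gpu build, otherwise inference falls back to CPU. A minimal sketch using standard onnxruntime and torch calls:
import onnxruntime as rt

# Typically prints ['CPUExecutionProvider'] on a CPU runtime, and adds
# 'CUDAExecutionProvider' when the onnxruntime-gpu build is installed.
print("ONNX Runtime providers:", rt.get_available_providers())
print("CUDA visible to PyTorch:", torch.cuda.is_available())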
ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = ds["sentence"], ds["label"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
def make_batches(texts, max_len=MAXLEN, batch=BATCH):
    for i in range(0, len(texts), batch):
        yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                        max_length=max_len, return_tensors="pt")
def run_eval(predict_fn, texts, labels):
    preds = []
    for toks in make_batches(texts):
        preds.extend(predict_fn(toks))
    return metric.compute(predictions=preds, references=labels)["accuracy"]
def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
    for _ in range(n_warm):
        for toks in make_batches(texts[:BATCH*2]):
            predict_fn(toks)
    times = []
    for _ in range(n_iters):
        t0 = time.time()
        for toks in make_batches(texts):
            predict_fn(toks)
        times.append((time.time() - t0) * 1000)
    return float(np.mean(times)), float(np.std(times))
We load an SST-2 validation slice and prepare tokenization, an accuracy metric, and batching. We define run_eval to compute accuracy from any predictor and bench to warm up and time end-to-end inference. With these helpers, we fairly compare different engines on identical data and batching. Check out the FULL CODES here.
torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()
@torch.no_grad()
def pt_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = torch_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()
pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager] {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")
compiled_model = torch_model
compile_ok = False
try:
    compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
    compile_ok = True
except Exception as e:
    print("torch.compile unavailable or failed -> skipping:", repr(e))
@torch.no_grad()
def ptc_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = compiled_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()
if compile_ok:
    ptc_ms, ptc_sd = bench(ptc_predict, texts)
    ptc_acc = run_eval(ptc_predict, texts, labels)
    print(f"[torch.compile] {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")
We load the baseline PyTorch classifier, define a pt_predict helper, and benchmark and score it on SST-2. We then attempt torch.compile for just-in-time graph optimizations and, if successful, run the same benchmarks to compare speed and accuracy under an identical setup. Check out the FULL CODES here.
provider = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
ort_model = ORTModelForSequenceClassification.from_pretrained(
    MODEL_ID, export=True, provider=provider, cache_dir=ORT_DIR
)
@torch.no_grad()
def ort_predict(toks):
    logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()
ort_ms, ort_sd = bench(ort_predict, texts)
ort_acc = run_eval(ort_predict, texts, labels)
print(f"[ONNX Runtime] {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")
Q_DIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ort_model)
# Dynamic INT8 quantization: weights are quantized ahead of time, activations at runtime.
qconfig = AutoQuantizationConfig.avx512(is_static=False, per_channel=False, reduce_range=True)
quantizer.quantize(quantization_config=qconfig, save_dir=Q_DIR)
ort_quant = ORTModelForSequenceClassification.from_pretrained(Q_DIR, provider=provider)
@torch.no_grad()
def ortq_predict(toks):
    logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()
oq_ms, oq_sd = bench(ortq_predict, texts)
oq_acc = run_eval(ortq_predict, texts, labels)
print(f"[ORT Quantized] {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")
We export the model to ONNX, run it with ONNX Runtime, then apply dynamic quantization with Optimum's ORTQuantizer and benchmark both to see how latency improves while accuracy stays comparable. Check out the FULL CODES here.
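A quick, optional way to see what quantization buys on disk is to persist the FP32 export and compare file sizes with the INT8 output. This is a small sketch of ours, not part of the original script: the directory name onnx-distilbert-fp32 is our choice, and the .onnx filenames are globbed rather than hard-coded because they can vary across Optimum versions.
# Save the FP32 ONNX export alongside the quantized one and compare on-disk sizes.
ort_model.save_pretrained("onnx-distilbert-fp32")

def onnx_size_mb(folder):
    # Sum all .onnx files in the folder (FP32 export vs. INT8 quantized export).
    return sum(p.stat().st_size for p in Path(folder).glob("*.onnx")) / 1e6

print(f"FP32 ONNX : {onnx_size_mb('onnx-distilbert-fp32'):.1f} MB")
print(f"INT8 ONNX : {onnx_size_mb(Q_DIR):.1f} MB")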
pt_pipe = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                   device=0 if DEVICE == "cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)
samples = [
"What a fantastic movie—performed brilliantly!",
"This was a complete waste of time.",
"I’m not sure how I feel about this one."
]
print("nSample predictions (PT | ORT):")
for s in samples:
a = pt_pipe(s)[0]["label"]
b = ort_pipe(s)[0]["label"]
print(f"- {s}n PT={a} | ORT={b}")
import pandas as pd
rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
["ONNX Runtime", ort_ms, ort_sd, ort_acc],
["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok: rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])
df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)
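Optionally, we can add a relative-speedup column so the table reads at a glance, using eager PyTorch as the 1.0x baseline; this is a small addition of ours on top of the DataFrame built above.
# Express each engine's mean latency relative to eager PyTorch (larger means faster).
df["Speedup vs eager"] = (pt_ms / df["Mean ms (↓)"]).round(2)
display(df)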
print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For larger gains on GPU, also try FlashAttention-2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, use a quantization config with is_static=True plus a calibration set (see the sketch below).
""")
We sanity-check predictions with quick sentiment pipelines and print PyTorch vs ONNX labels side by side. We then assemble a summary table to compare latency and accuracy across engines, inserting the torch.compile results when available. We conclude with practical notes that let us extend the workflow to other backends and quantization modes.
In conclusion, we can clearly see how Optimum helps us bridge the gap between standard PyTorch models and production-ready, optimized deployments. We achieve speedups with ONNX Runtime and quantization while retaining accuracy, and we also explore how torch.compile provides gains directly within PyTorch. This workflow demonstrates a practical approach to balancing performance and efficiency for Transformer models, providing a foundation that can be further extended with advanced backends such as OpenVINO or TensorRT.
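As one concrete example of such an extension (assuming the extra optimum[openvino] dependency, which this tutorial does not install), the same checkpoint can be exported to OpenVINO through optimum-intel and dropped into the same pipeline-style check; this is a sketch, not part of the benchmarked code above.
# Requires: pip install "optimum[openvino]"
from optimum.intel import OVModelForSequenceClassification

# Export the same checkpoint to OpenVINO IR and run it through a transformers pipeline.
ov_model = OVModelForSequenceClassification.from_pretrained(MODEL_ID, export=True)
ov_pipe = pipeline("sentiment-analysis", model=ov_model, tokenizer=tokenizer)
print(ov_pipe("A crisp, well-paced thriller.")[0])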
Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.