A Coding Implementation of End-to-End Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization
In this tutorial, we walk through how we use Hugging Face Optimum to optimize Transformer models and make them faster while maintaining accuracy. We begin by setting up DistilBERT on the SST-2 dataset, and then we compare different execution engines, including plain PyTorch, torch.compile, ONNX Runtime, and quantized ONNX. Working step by step, we get hands-on experience with model export, optimization, quantization, and benchmarking, all within a Google Colab environment.
!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate
from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR = Path("onnx-distilbert")
Q_DIR = Path("onnx-distilbert-quant")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH = 16
MAXLEN = 128
N_WARM = 3
N_ITERS = 8
print(f"Device: {DEVICE} | torch={torch.__version__}")
We begin by installing the required libraries and setting up our environment for Hugging Face Optimum with ONNX Runtime. We configure paths, batch size, and iteration settings, and we confirm whether we are running on CPU or GPU.
ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = ds["sentence"], ds["label"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
def make_batches(texts, max_len=MAXLEN, batch=BATCH):
    for i in range(0, len(texts), batch):
        yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                        max_length=max_len, return_tensors="pt")
def run_eval(predict_fn, texts, labels):
    preds = []
    for toks in make_batches(texts):
        preds.extend(predict_fn(toks))
    return metric.compute(predictions=preds, references=labels)["accuracy"]
def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
    for _ in range(n_warm):
        for toks in make_batches(texts[:BATCH*2]):
            predict_fn(toks)
    times = []
    for _ in range(n_iters):
        t0 = time.time()
        for toks in make_batches(texts):
            predict_fn(toks)
        times.append((time.time() - t0) * 1000)
    return float(np.mean(times)), float(np.std(times))
We load an SST-2 validation slice and prepare tokenization, an accuracy metric, and batching. We define run_eval to compute accuracy from any predictor and bench to warm up and time end-to-end inference. With these helpers, we can fairly compare different engines on identical data and batching.
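As a quick, optional sanity check (not part of the original notebook), we can run a trivial predictor through the same helpers to confirm the harness works end to end; the hypothetical dummy_predict below simply returns the positive class for every example, so the accuracy it reports is only the majority-class floor.
def dummy_predict(toks):
    # Hypothetical baseline: predict label 1 (positive) for every example in the batch.
    return [1] * toks["input_ids"].shape[0]

dummy_ms, dummy_sd = bench(dummy_predict, texts)
dummy_acc = run_eval(dummy_predict, texts, labels)
print(f"[Dummy baseline] {dummy_ms:.1f}±{dummy_sd:.1f} ms | acc={dummy_acc:.4f}")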
torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()
@torch.no_grad()
def pt_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = torch_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()
pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager] {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")
compiled_model = torch_model
compile_ok = False
try:
    compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
    compile_ok = True
except Exception as e:
    print("torch.compile unavailable or failed -> skipping:", repr(e))
@torch.no_grad()
def ptc_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = compiled_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()
if compile_ok:
    ptc_ms, ptc_sd = bench(ptc_predict, texts)
    ptc_acc = run_eval(ptc_predict, texts, labels)
    print(f"[torch.compile] {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")
We load the baseline PyTorch classifier, define a pt_predict helper, and benchmark and score it on SST-2. We then attempt torch.compile for just-in-time graph optimization and, if it succeeds, run the same benchmark to compare speed and accuracy under an identical setup.
supplier = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
ort_model = ORTModelForSequenceClassification.from_pretrained(
    MODEL_ID, export=True, provider=provider, cache_dir=ORT_DIR
)
@torch.no_grad()
def ort_predict(toks):
    logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()
ort_ms, ort_sd = bench(ort_predict, texts)
ort_acc = run_eval(ort_predict, texts, labels)
print(f"[ONNX Runtime] {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")
Q_DIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ort_model)
qconfig = AutoQuantizationConfig.avx512(is_static=False, per_channel=False, reduce_range=True)
quantizer.quantize(quantization_config=qconfig, save_dir=Q_DIR)
ort_quant = ORTModelForSequenceClassification.from_pretrained(Q_DIR, provider=provider)
@torch.no_grad()
def ortq_predict(toks):
    logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()
oq_ms, oq_sd = bench(ortq_predict, texts)
oq_acc = run_eval(ortq_predict, texts, labels)
print(f"[ORT Quantized] {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")
We export the model to ONNX, run it with ONNX Runtime, then apply dynamic quantization with Optimum's ORTQuantizer and benchmark both variants to see how latency improves while accuracy stays comparable.
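As a small optional aside (not in the original code), we can also persist the FP32 export and compare its on-disk size with the quantized copy; dynamic INT8 quantization typically shrinks the weight file to roughly a quarter of its FP32 size. The dir_onnx_mb helper is hypothetical and added only for illustration.
ort_model.save_pretrained(ORT_DIR)  # write the FP32 ONNX files into ORT_DIR for easy inspection

def dir_onnx_mb(path):
    # Sum the size of all .onnx files under a directory, in megabytes.
    return sum(f.stat().st_size for f in Path(path).rglob("*.onnx")) / 1e6

print(f"FP32 ONNX: {dir_onnx_mb(ORT_DIR):.1f} MB | INT8 ONNX: {dir_onnx_mb(Q_DIR):.1f} MB")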
pt_pipe = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                   device=0 if DEVICE == "cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)
samples = [
"What a fantastic movie—performed brilliantly!",
"This was a complete waste of time.",
"I’m not sure how I feel about this one."
]
print("nSample predictions (PT | ORT):")
for s in samples:
a = pt_pipe(s)[0]["label"]
b = ort_pipe(s)[0]["label"]
print(f"- {s}n PT={a} | ORT={b}")
import pandas as pd
rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
["ONNX Runtime", ort_ms, ort_sd, ort_acc],
["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok: rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])
df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)
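As an optional extra (not in the original notebook), we can append a relative speedup column computed against the PyTorch eager baseline before re-displaying the table:
# Optional: express each engine's mean latency as a speedup over PyTorch eager.
df["Speedup vs eager"] = (pt_ms / df["Mean ms (↓)"]).round(2)
display(df)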
print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For larger gains on GPU, also try FlashAttention2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, build a quantization config with is_static=True and supply a calibration set.
""")
We sanity-check predictions with quick sentiment-analysis pipelines and print PyTorch vs. ONNX labels side by side. We then assemble a summary table to compare latency and accuracy across engines, inserting the torch.compile results when available. We close with practical notes that let us extend the workflow to other backends and quantization modes.
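To make the static-quantization note concrete, here is a minimal sketch of how that flow typically looks with Optimum's ORTQuantizer. The exact method names, signatures, and the output directory used here are assumptions that may differ across Optimum versions, so treat this as an outline to verify rather than drop-in code.
from optimum.onnxruntime.configuration import AutoCalibrationConfig

# Assumed flow for static (calibrated) INT8 quantization; verify against your Optimum version.
sq_config = AutoQuantizationConfig.avx512(is_static=True, per_channel=False)
calib_ds = quantizer.get_calibration_dataset(
    "glue", dataset_config_name="sst2",
    preprocess_function=lambda ex: tokenizer(ex["sentence"], padding="max_length",
                                             truncation=True, max_length=MAXLEN),
    num_samples=100, dataset_split="train",
)
calib_cfg = AutoCalibrationConfig.minmax(calib_ds)
ranges = quantizer.fit(dataset=calib_ds, calibration_config=calib_cfg,
                       operators_to_quantize=sq_config.operators_to_quantize)
quantizer.quantize(quantization_config=sq_config, save_dir="onnx-distilbert-static",
                   calibration_tensors_range=ranges)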
In conclusion, we can clearly see how Optimum helps us bridge the gap between standard PyTorch models and production-ready, optimized deployments. We achieve speedups with ONNX Runtime and quantization while retaining accuracy, and we also explore how torch.compile provides gains directly within PyTorch. This workflow demonstrates a practical approach to balancing performance and efficiency for Transformer models, providing a foundation that can be further extended with advanced backends, such as OpenVINO or TensorRT.
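As one concrete example of such an extension, and purely as a hedged sketch, the same checkpoint can be exported to OpenVINO through the separate optimum-intel package and dropped into the existing predict/bench helpers; the install extra and class name below are assumptions to verify against your installed versions (roughly, pip install "optimum[openvino]").
from optimum.intel import OVModelForSequenceClassification  # assumed import from optimum-intel

ov_model = OVModelForSequenceClassification.from_pretrained(MODEL_ID, export=True)

def ov_predict(toks):
    # np.asarray handles logits returned as either torch tensors or NumPy arrays.
    logits = ov_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return np.asarray(logits).argmax(-1).tolist()

ov_ms, ov_sd = bench(ov_predict, texts)
ov_acc = run_eval(ov_predict, texts, labels)
print(f"[OpenVINO] {ov_ms:.1f}±{ov_sd:.1f} ms | acc={ov_acc:.4f}")
Because this reuses the same helpers, its latency and accuracy can be appended to the summary table alongside the other engines.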