A Coding Implementation of End-to-End Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization
In this tutorial, we walk through how we use Hugging Face Optimum to optimize Transformer models and make them faster while maintaining accuracy. We begin by setting up DistilBERT on the SST-2 dataset, and then we compare different execution engines, including plain PyTorch, torch.compile, ONNX Runtime, and quantized ONNX. Working step by step, we get hands-on experience with model export, optimization, quantization, and benchmarking, all within a Google Colab environment.
!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate
from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR = Path("onnx-distilbert")
Q_DIR = Path("onnx-distilbert-quant")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH = 16
MAXLEN = 128
N_WARM = 3
N_ITERS = 8
print(f"Device: {DEVICE} | torch={torch.__version__}")
We begin by installing the required libraries and setting up our environment for Hugging Face Optimum with ONNX Runtime. We configure paths, batch size, and iteration settings, and we confirm whether we are running on CPU or GPU.
ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = ds["sentence"], ds["label"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
def make_batches(texts, max_len=MAXLEN, batch=BATCH):
    for i in range(0, len(texts), batch):
        yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                        max_length=max_len, return_tensors="pt")
def run_eval(predict_fn, texts, labels):
    preds = []
    for toks in make_batches(texts):
        preds.extend(predict_fn(toks))
    return metric.compute(predictions=preds, references=labels)["accuracy"]
def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
    for _ in range(n_warm):
        for toks in make_batches(texts[:BATCH*2]):
            predict_fn(toks)
    times = []
    for _ in range(n_iters):
        t0 = time.time()
        for toks in make_batches(texts):
            predict_fn(toks)
        times.append((time.time() - t0) * 1000)
    return float(np.mean(times)), float(np.std(times))
We load an SST-2 validation slice and prepare tokenization, an accuracy metric, and batching. We define run_eval to compute accuracy from any predictor and bench to warm up and time end-to-end inference. With these helpers, we can fairly compare different engines on identical data and batching.
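As a quick, optional sanity check (not part of the original notebook), we can run a trivial predictor through the same helpers to confirm the harness works end to end; the hypothetical dummy_predict below simply returns the positive class for every example, so the accuracy it reports is only the majority-class floor.
def dummy_predict(toks):
    # Hypothetical baseline: predict label 1 (positive) for every example in the batch.
    return [1] * toks["input_ids"].shape[0]

dummy_ms, dummy_sd = bench(dummy_predict, texts)
dummy_acc = run_eval(dummy_predict, texts, labels)
print(f"[Dummy baseline] {dummy_ms:.1f}±{dummy_sd:.1f} ms | acc={dummy_acc:.4f}")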
torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()
@torch.no_grad()
def pt_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = torch_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()
pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager] {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")
compiled_model = torch_model
compile_ok = False
try:
    compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
    compile_ok = True
except Exception as e:
    print("torch.compile unavailable or failed -> skipping:", repr(e))
@torch.no_grad()
def ptc_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = compiled_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()
if compile_ok:
    ptc_ms, ptc_sd = bench(ptc_predict, texts)
    ptc_acc = run_eval(ptc_predict, texts, labels)
    print(f"[torch.compile] {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")
We load the baseline PyTorch classifier, define a pt_predict helper, and benchmark and score it on SST-2. We then attempt torch.compile for just-in-time graph optimization and, if it succeeds, run the same benchmark to compare speed and accuracy under an identical setup.
supplier = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
ort_model = ORTModelForSequenceClassification.from_pretrained(
    MODEL_ID, export=True, provider=provider, cache_dir=ORT_DIR
)
@torch.no_grad()
def ort_predict(toks):
    logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()
ort_ms, ort_sd = bench(ort_predict, texts)
ort_acc = run_eval(ort_predict, texts, labels)
print(f"[ONNX Runtime] {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")
Q_DIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ort_model)
qconfig = AutoQuantizationConfig.avx512(is_static=False, per_channel=False, reduce_range=True)
quantizer.quantize(quantization_config=qconfig, save_dir=Q_DIR)
ort_quant = ORTModelForSequenceClassification.from_pretrained(Q_DIR, provider=provider)
@torch.no_grad()
def ortq_predict(toks):
    logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()
oq_ms, oq_sd = bench(ortq_predict, texts)
oq_acc = run_eval(ortq_predict, texts, labels)
print(f"[ORT Quantized] {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")
We export the model to ONNX, run it with ONNX Runtime, then apply dynamic quantization with Optimum's ORTQuantizer and benchmark both variants to see how latency improves while accuracy stays comparable.
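As a small optional aside (not in the original code), we can also persist the FP32 export and compare its on-disk size with the quantized copy; dynamic INT8 quantization typically shrinks the weight file to roughly a quarter of its FP32 size. The dir_onnx_mb helper is hypothetical and added only for illustration.
ort_model.save_pretrained(ORT_DIR)  # write the FP32 ONNX files into ORT_DIR for easy inspection

def dir_onnx_mb(path):
    # Sum the size of all .onnx files under a directory, in megabytes.
    return sum(f.stat().st_size for f in Path(path).rglob("*.onnx")) / 1e6

print(f"FP32 ONNX: {dir_onnx_mb(ORT_DIR):.1f} MB | INT8 ONNX: {dir_onnx_mb(Q_DIR):.1f} MB")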
pt_pipe = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                   device=0 if DEVICE == "cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)
samples = [
"What a fantastic movie—performed brilliantly!",
"This was a complete waste of time.",
"I’m not sure how I feel about this one."
]
print("nSample predictions (PT | ORT):")
for s in samples:
a = pt_pipe(s)[0]["label"]
b = ort_pipe(s)[0]["label"]
print(f"- {s}n PT={a} | ORT={b}")
import pandas as pd
rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
["ONNX Runtime", ort_ms, ort_sd, ort_acc],
["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok: rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])
df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)
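As an optional extra (not in the original notebook), we can append a relative speedup column computed against the PyTorch eager baseline before re-displaying the table:
# Optional: express each engine's mean latency as a speedup over PyTorch eager.
df["Speedup vs eager"] = (pt_ms / df["Mean ms (↓)"]).round(2)
display(df)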
print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For larger gains on GPU, also try FlashAttention2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, build a quantization config with is_static=True and supply a calibration set.
""")
We sanity-check predictions with quick sentiment-analysis pipelines and print PyTorch vs. ONNX labels side by side. We then assemble a summary table to compare latency and accuracy across engines, inserting the torch.compile results when available. We close with practical notes that let us extend the workflow to other backends and quantization modes.
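To make the static-quantization note concrete, here is a minimal sketch of how that flow typically looks with Optimum's ORTQuantizer. The exact method names, signatures, and the output directory used here are assumptions that may differ across Optimum versions, so treat this as an outline to verify rather than drop-in code.
from optimum.onnxruntime.configuration import AutoCalibrationConfig

# Assumed flow for static (calibrated) INT8 quantization; verify against your Optimum version.
sq_config = AutoQuantizationConfig.avx512(is_static=True, per_channel=False)
calib_ds = quantizer.get_calibration_dataset(
    "glue", dataset_config_name="sst2",
    preprocess_function=lambda ex: tokenizer(ex["sentence"], padding="max_length",
                                             truncation=True, max_length=MAXLEN),
    num_samples=100, dataset_split="train",
)
calib_cfg = AutoCalibrationConfig.minmax(calib_ds)
ranges = quantizer.fit(dataset=calib_ds, calibration_config=calib_cfg,
                       operators_to_quantize=sq_config.operators_to_quantize)
quantizer.quantize(quantization_config=sq_config, save_dir="onnx-distilbert-static",
                   calibration_tensors_range=ranges)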
In conclusion, we can clearly see how Optimum helps us bridge the gap between standard PyTorch models and production-ready, optimized deployments. We achieve speedups with ONNX Runtime and quantization while retaining accuracy, and we also explore how torch.compile provides gains directly within PyTorch. This workflow demonstrates a practical approach to balancing performance and efficiency for Transformer models, providing a foundation that can be further extended with advanced backends, such as OpenVINO or TensorRT.
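As one concrete example of such an extension, and purely as a hedged sketch, the same checkpoint can be exported to OpenVINO through the separate optimum-intel package and dropped into the existing predict/bench helpers; the install extra and class name below are assumptions to verify against your installed versions (roughly, pip install "optimum[openvino]").
from optimum.intel import OVModelForSequenceClassification  # assumed import from optimum-intel

ov_model = OVModelForSequenceClassification.from_pretrained(MODEL_ID, export=True)

def ov_predict(toks):
    # np.asarray handles logits returned as either torch tensors or NumPy arrays.
    logits = ov_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return np.asarray(logits).argmax(-1).tolist()

ov_ms, ov_sd = bench(ov_predict, texts)
ov_acc = run_eval(ov_predict, texts, labels)
print(f"[OpenVINO] {ov_ms:.1f}±{ov_sd:.1f} ms | acc={ov_acc:.4f}")
Because this reuses the same helpers, its latency and accuracy can be appended to the summary table alongside the other engines.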