A Coding Implementation to Compress and Benchmark Instruction-Tuned LLMs with FP8, GPTQ, and SmoothQuant Quantization using llmcompressor

In this tutorial, we explore how to apply post-training quantization to an instruction-tuned language model using llmcompressor. We begin with an FP16 baseline and then compare several compression strategies, including FP8 dynamic quantization, GPTQ W4A16, and SmoothQuant with GPTQ W8A8. Along the way, we benchmark each model variant for disk size, generation latency, throughput, perplexity, and output quality. We also prepare a reusable calibration dataset, save compressed model artifacts, and examine how each recipe changes practical inference behavior. By the end, we get a practical understanding of how different quantization techniques affect model efficiency, deployment readiness, and performance trade-offs.


import subprocess, sys
def pip(*pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])
pip("llmcompressor", "compressed-tensors",
    "transformers>=4.45", "accelerate", "datasets")
import os, gc, time, json, math
from pathlib import Path
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
assert torch.cuda.is_available(), \
    "Enable a GPU: Runtime > Change runtime type > T4 GPU"
print("GPU:", torch.cuda.get_device_name(0),
      "| CUDA:", torch.version.cuda,
      "| torch:", torch.__version__)
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
WORKDIR = Path("/content/quant_lab"); WORKDIR.mkdir(exist_ok=True)
os.chdir(WORKDIR)
def free_mem():
    # Release Python garbage and cached CUDA blocks between model loads
    gc.collect(); torch.cuda.empty_cache()
def dir_size_gb(path):
    # Sum the size of every file under `path`, in decimal gigabytes
    total = 0
    for root, _, files in os.walk(path):
        for f in files:
            total += os.path.getsize(os.path.join(root, f))
    return total / 1e9
def time_generation(model, tok, prompt, max_new_tokens=64):
    """Greedy decode; reports latency & tokens/sec after a short warmup."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    _ = model.generate(**inputs, max_new_tokens=4, do_sample=False)
    torch.cuda.synchronize()
    t0 = time.time()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         do_sample=False, pad_token_id=tok.eos_token_id)
    torch.cuda.synchronize()
    dt = time.time() - t0
    new_ids = out[0][inputs["input_ids"].shape[1]:]
    # Count the tokens actually generated: decoding may stop early at EOS
    return tok.decode(new_ids, skip_special_tokens=True), dt, len(new_ids) / dt
@torch.no_grad()
def wikitext_ppl(model, tok, seq_len=512, max_chunks=20, stride=512):
    """Light WikiText-2 perplexity probe (quick, indicative)."""
    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    text = "\n\n".join(t for t in ds["text"][:400] if t.strip())
    enc = tok(text, return_tensors="pt").input_ids.to(model.device)
    nll_sum, tok_count = 0.0, 0
    for begin in range(0, enc.size(1) - seq_len, stride):
        chunk = enc[:, begin:begin+seq_len]
        out = model(chunk, labels=chunk)
        # out.loss is the mean NLL of the chunk; weight it by the chunk length
        nll_sum += out.loss.float().item() * seq_len
        tok_count += seq_len
        if tok_count // seq_len >= max_chunks: break
    return math.exp(nll_sum / tok_count)
results = {}
PROMPT = ("<|im_start|>user\nIn two sentences, explain why post-training "
          "quantization works for large language models.<|im_end|>\n"
          "<|im_start|>assistant\n")
def benchmark(label, model_path_or_id):
    free_mem()
    print(f"\n──── benchmarking: {label} ────")
    tok = AutoTokenizer.from_pretrained(model_path_or_id)
    m = AutoModelForCausalLM.from_pretrained(
            model_path_or_id, torch_dtype="auto", device_map="cuda").eval()
    sample, dt, tps = time_generation(m, tok, PROMPT)
    ppl = wikitext_ppl(m, tok)
    # Disk size is only known for local directories, not for Hub model IDs
    size = dir_size_gb(model_path_or_id) if os.path.isdir(str(model_path_or_id)) else None
    results[label] = {"size_gb": size, "ppl": round(ppl, 3),
                      "latency_s": round(dt, 3), "tok_per_s": round(tps, 1),
                      "sample": sample.strip().replace("\n", " ")[:180]}
    print(json.dumps(results[label], indent=2))
    del m; free_mem()

We install all required libraries, import the core packages, and verify that a CUDA-enabled GPU is available in Colab. We define the base Qwen2.5 instruction model, create a working directory, and prepare helper functions for memory cleanup, model size calculation, generation timing, and perplexity evaluation. We also create a reusable benchmark function that loads any model variant, checks its generation speed, calculates perplexity, and stores the results for final comparison.
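To make the perplexity helper easier to reason about, here is a minimal standalone sketch of the arithmetic it performs: a token-weighted average of per-chunk mean negative log-likelihoods, followed by an exponential. The function name `perplexity` and the toy numbers are illustrative assumptions, not part of the tutorial code, and no model is loaded.

```python
import math

def perplexity(chunk_mean_nlls, chunk_token_counts):
    """Token-weighted perplexity from per-chunk mean negative log-likelihoods."""
    # Convert each chunk's mean NLL back into a summed NLL, then pool across chunks
    total_nll = sum(nll * n for nll, n in zip(chunk_mean_nlls, chunk_token_counts))
    total_tokens = sum(chunk_token_counts)
    return math.exp(total_nll / total_tokens)

# Hypothetical case: a model that gives probability 1/4 to every correct token
# has a mean NLL of ln(4) per chunk, so its perplexity should come out near 4.
uniform_over_4 = perplexity([math.log(4)] * 3, [512, 512, 512])
print(round(uniform_over_4, 6))
```

Read this way, perplexity is the effective number of equally likely choices per token, which is why a small increase after quantization indicates that the compressed model has become only slightly less certain about the reference text.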


print("\n════════════ Baseline (FP16) ════════════")
benchmark("00_fp16_baseline", MODEL_ID)
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
print("\n════════════ Recipe 1: FP8_DYNAMIC ════════════")
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tok = AutoTokenizer.from_pretrained(MODEL_ID)
# FP8_DYNAMIC needs no calibration data, so oneshot runs without a dataset
recipe_fp8 = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)
oneshot(model=model, recipe=recipe_fp8)
FP8_DIR = "Qwen2.5-0.5B-FP8-Dynamic"
model.save_pretrained(FP8_DIR, save_compressed=True)
tok.save_pretrained(FP8_DIR)
del model; free_mem()
benchmark("01_fp8_dynamic", FP8_DIR)

We first benchmark the original FP16 model to establish a reliable baseline for subsequent comparisons. We then apply FP8 dynamic quantization using llmcompressor, where linear layers are compressed while the language modeling head stays in higher precision. We save the compressed FP8 model and run the same benchmark again to compare its size, latency, throughput, and perplexity against the baseline.
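The "dynamic" part of FP8 dynamic quantization can be illustrated without a GPU. The sketch below simulates, in plain Python, how an activation vector for one token might be scaled at run time so that its largest value lands on the FP8 E4M3 maximum (448) and then rounded to a 3-bit mantissa grid. The helper names and the sample values are assumptions for illustration; the actual llmcompressor and compressed-tensors kernels are not reproduced here.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format

def round_to_e4m3(v):
    """Round a float to the nearest value on a simulated FP8 E4M3 grid."""
    if v == 0.0:
        return 0.0
    a = min(abs(v), FP8_E4M3_MAX)      # saturate instead of overflowing
    _, exp = math.frexp(a)             # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)               # -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)              # 3 mantissa bits: 8 steps per power of two
    return math.copysign(min(round(a / step) * step, FP8_E4M3_MAX), v)

def fp8_dynamic_quantize(token_activations):
    """Quantize one token's activations with a scale computed on the fly."""
    amax = max(abs(x) for x in token_activations)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(x / scale) for x in token_activations], scale

def dequantize(q, scale):
    return [v * scale for v in q]

acts = [0.02, -1.5, 3.0, 12.0]         # hypothetical activations for one token
q, scale = fp8_dynamic_quantize(acts)
recovered = dequantize(q, scale)
print(scale, q, recovered)
```

Because the scale is derived from each incoming token rather than from a calibration pass, this style of recipe needs no dataset, which is consistent with the `oneshot` call above being made without one.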


NUM_CALIB_SAMPLES = 256
MAX_SEQ_LEN       = 1024
tok = AutoTokenizer.from_pretrained(MODEL_ID)
raw = load_dataset("HuggingFaceH4/ultrachat_200k",
                   split=f"train_sft[:{NUM_CALIB_SAMPLES}]")
def to_text(ex):
    # Render each conversation with the model's own chat template
    return {"text": tok.apply_chat_template(ex["messages"], tokenize=False)}
def tokenize(ex):
    # The template already inserts special tokens, so do not add them again
    return tok(ex["text"], padding=False, truncation=True,
               max_length=MAX_SEQ_LEN, add_special_tokens=False)
calib_ds = (raw.shuffle(seed=42)
               .map(to_text)
               .map(tokenize, remove_columns=raw.column_names))
print("Calibration set:", len(calib_ds), "samples, max_seq_len =", MAX_SEQ_LEN)

We build a small calibration dataset using UltraChat samples so that the calibrated quantization recipes can observe realistic instruction-style inputs. We convert each chat example into model-compatible text through the tokenizer's chat template. We then tokenize the samples with a fixed maximum sequence length, creating a reusable dataset for GPTQ and SmoothQuant-based compression.
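As a rough picture of what the chat template produces, the sketch below renders a conversation in the same ChatML-style markup that the benchmark prompt uses (`<|im_start|>role ... <|im_end|>`). The function `render_chatml` is a hypothetical, simplified stand-in: the real template shipped with the tokenizer may add a default system message and other details, so treat this only as an illustration of the format.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Simplified ChatML-style rendering of a list of {'role', 'content'} dicts."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

# Hypothetical two-turn conversation in the shape UltraChat rows use
example = [
    {"role": "user", "content": "What is quantization?"},
    {"role": "assistant", "content": "Storing numbers with fewer bits."},
]
text = render_chatml(example)
print(text)
```

Since the markers are already present in the rendered text, tokenizing with `add_special_tokens=False`, as the tutorial does, avoids inserting a second set of special tokens.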


from llmcompressor.modifiers.quantization import GPTQModifier
print("\n════════════ Recipe 2: GPTQ W4A16 ════════════")
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
recipe_w4a16 = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)
# GPTQ is calibration-based, so oneshot receives the prepared dataset
oneshot(
    model=model,
    dataset=calib_ds,
    recipe=recipe_w4a16,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_CALIB_SAMPLES,
)
W4A16_DIR = "Qwen2.5-0.5B-W4A16-G128"
model.save_pretrained(W4A16_DIR, save_compressed=True)
tok.save_pretrained(W4A16_DIR)
del model; free_mem()
benchmark("02_gptq_w4a16", W4A16_DIR)

We apply GPTQ W4A16 quantization to compress the model's linear weights into 4-bit precision while keeping activations in higher precision. We use the calibration dataset to allow GPTQ to reduce reconstruction error and preserve model quality during compression. We save the W4A16 compressed model and benchmark it to see how aggressive 4-bit weight compression affects speed, size, and perplexity.
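To give a feel for what 4-bit weight storage means, the following sketch quantizes one small group of weights with plain round-to-nearest and a shared scale. This is deliberately not GPTQ: GPTQ additionally uses calibration activations to compensate rounding error across columns, which is omitted here. The symmetric integer range, the group size of 128 (suggested by the `G128` directory name), and the 16-bit scale are assumptions used only for the size arithmetic.

```python
def quantize_group_int4(weights):
    """Round-to-nearest symmetric 4-bit quantization of one weight group."""
    amax = max(abs(w) for w in weights)
    scale = amax / 7.0 if amax > 0 else 1.0      # map the largest weight to +/-7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    return [v * scale for v in q]

def bits_per_weight(group_size=128, scale_bits=16, weight_bits=4):
    """Effective storage cost once the per-group scale is amortized."""
    return weight_bits + scale_bits / group_size

group = [0.12, -0.5, 0.33, 0.07]                 # hypothetical weights
q, scale = quantize_group_int4(group)
restored = dequantize_group(q, scale)
print(q, [round(r, 4) for r in restored], bits_per_weight())
```

Under these assumptions each weight costs a little over 4 bits instead of 16, which is where most of the disk-size reduction in the W4A16 variant would come from; layers left unquantized, such as `lm_head`, keep the real saving below a full 4x.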


from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
print("\n════════════ Recipe 3: SmoothQuant + GPTQ W8A8 ════════════")
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
# SmoothQuant runs first to tame activation outliers, then GPTQ quantizes to W8A8
recipe_w8a8 = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
oneshot(
    model=model,
    dataset=calib_ds,
    recipe=recipe_w8a8,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_CALIB_SAMPLES,
)
W8A8_DIR = "Qwen2.5-0.5B-W8A8-SmoothQuant"
model.save_pretrained(W8A8_DIR, save_compressed=True)
tok.save_pretrained(W8A8_DIR)
del model; free_mem()
benchmark("03_smoothquant_w8a8", W8A8_DIR)
print("\n══════════════════════ FINAL SUMMARY ══════════════════════")
print(f"{'Variant':<26}{'Size GB':>9}{'PPL':>10}{'tok/s':>9}{'Latency':>11}")
print("-" * 65)
for k, v in results.items():
    # The baseline is loaded from the Hub, so it has no local directory size
    size = f"{v['size_gb']:.3f}" if v['size_gb'] else "  (hub) "
    print(f"{k:<26}{size:>9}{v['ppl']:>10.2f}{v['tok_per_s']:>9.1f}"
          f"{v['latency_s']:>10.2f}s")
print("\nSample completions (greedy, up to 64 new tokens):")
for k, v in results.items():
    print(f"\n[{k}]\n  → {v['sample']}")

We combine SmoothQuant with GPTQ W8A8 to create an advanced quantization pipeline that handles activation outliers before applying 8-bit compression. We save and benchmark this SmoothQuant-based model using the same evaluation setup as the earlier variants. Finally, we print a summary table and sample completions to compare all quantized models against the FP16 baseline in one place.
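The outlier handling that SmoothQuant performs can be sketched with a tiny matrix example. Each input channel j gets a scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); activations are divided by it and the matching weight rows are multiplied by it, so the layer output stays the same while the activation range shrinks. The helper names and toy matrices below are assumptions, alpha = 0.8 mirrors the `smoothing_strength` used in the recipe, and the sketch assumes no channel is entirely zero.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def smooth(X, W, alpha=0.8):
    """Migrate activation outliers into the weights, channel by channel."""
    n_in = len(W)
    scales = []
    for j in range(n_in):
        act_max = max(abs(row[j]) for row in X)   # activation range of channel j
        w_max = max(abs(w) for w in W[j])         # weight range of channel j
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / scales[j] for j in range(n_in)] for row in X]
    W_s = [[w * scales[j] for w in W[j]] for j in range(n_in)]
    return X_s, W_s, scales

# Hypothetical layer: 2 tokens x 2 input channels, channel 1 carries outliers
X = [[1.0, 100.0], [2.0, -80.0]]
W = [[0.5, -0.3], [0.02, 0.01]]                   # 2 input channels x 2 outputs
X_s, W_s, scales = smooth(X, W)
print(matmul(X, W), matmul(X_s, W_s))
print(max(abs(r[1]) for r in X), max(abs(r[1]) for r in X_s))
```

In this toy case the outlier channel's activation range drops from 100 to roughly 1, at the cost of larger weights in that row. That trade is usually acceptable because weights are quantized offline with calibration help, whereas activations must be quantized on the fly.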

In conclusion, we built a complete quantization workflow that compresses and evaluates a small instruction-tuned LLM using modern PTQ techniques. We saw that FP8 dynamic quantization offers a fast, data-free option, while GPTQ-based methods use calibration data to achieve stronger compression and improved accuracy recovery. We also compared all variants through consistent benchmarks, which helps us understand the trade-offs between size, speed, latency, and perplexity. By saving each quantized model and testing generation quality, we brought the workflow closer to a real deployment pipeline. This gives us a reusable Colab-ready framework for testing LLM compression techniques before deploying efficient models in real-world inference systems.


The post A Coding Implementation to Compress and Benchmark Instruction-Tuned LLMs with FP8, GPTQ, and SmoothQuant Quantization using llmcompressor appeared first on MarkTechPost.