A Coding Implementation to Compress and Benchmark Instruction-Tuned LLMs with FP8, GPTQ, and SmoothQuant Quantization using llmcompressor
In this tutorial, we explore how to apply post-training quantization to an instruction-tuned language model using llmcompressor. We begin with an FP16 baseline and then compare several compression methods, including FP8 dynamic quantization, GPTQ W4A16, and SmoothQuant with GPTQ W8A8. Along the way, we benchmark each model variant for disk…
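
Before running any compression, it can help to decode what the scheme names imply for weight storage. In the WxAy naming, x is the bit width of the stored weights and y the bit width of the activations, so W4A16 stores 4-bit weights and W8A8 stores 8-bit weights, while FP8 uses 8 bits per weight. The sketch below is back-of-the-envelope arithmetic only and does not call llmcompressor. The helper `nominal_weight_gib` and the 1B parameter count are hypothetical names and values chosen for illustration, and the estimate ignores quantization scales, zero points, and layers that are typically left unquantized, so real checkpoints will likely come out somewhat larger.

```python
# Nominal bits per stored weight for each variant named above.
# Activation precision (the "A" part) does not affect checkpoint size.
WEIGHT_BITS = {
    "FP16 baseline": 16,
    "FP8 dynamic": 8,
    "GPTQ W4A16": 4,
    "SmoothQuant + GPTQ W8A8": 8,
}


def nominal_weight_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB.

    Hypothetical helper: ignores scales, zero points, and any
    layers kept in higher precision.
    """
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    num_params = 1_000_000_000  # assumption: a hypothetical 1B-parameter model
    baseline = nominal_weight_gib(num_params, WEIGHT_BITS["FP16 baseline"])
    for name, bits in WEIGHT_BITS.items():
        size = nominal_weight_gib(num_params, bits)
        print(f"{name:<26} {size:5.2f} GiB  ({size / baseline:.0%} of FP16)")
```

These numbers are only a rough lower bound to sanity-check measured results against; the measured size of each saved variant is what actually matters.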
