
Comparing the Top 6 Inference Runtimes for LLM Serving in 2025

Large language models are now limited less by training and more by how fast and cheaply we can serve tokens under real traffic. That comes down to a few implementation details: how the runtime batches requests, how it overlaps prefill and decode, and how it stores and reuses the KV cache. Different engines make different tradeoffs on these axes, which show up directly as differences in tokens per second, P50/P99 latency, and GPU memory utilization.

This article compares six runtimes that show up repeatedly in production stacks:

  • vLLM
  • TensorRT-LLM
  • Hugging Face Text Generation Inference (TGI v3)
  • LMDeploy
  • SGLang
  • DeepSpeed Inference / ZeRO Inference

1. vLLM

Design

vLLM is built around PagedAttention. Instead of storing each sequence’s KV cache in one large contiguous buffer, it partitions KV into fixed-size blocks and uses an indirection layer so each sequence points to a list of blocks.

This gives:

  • Very low KV fragmentation (reported <4% waste vs 60–80% in naïve allocators)
  • High GPU utilization with continuous batching
  • Native support for prefix sharing and KV reuse at the block level

Recent versions add KV quantization (FP8) and integrate FlashAttention-style kernels.
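
To make this concrete, here is a minimal offline-serving sketch using vLLM’s Python API. The model name is a placeholder, and `kv_cache_dtype="fp8"` assumes a build and GPU with FP8 KV support.

```python
# Minimal vLLM sketch: PagedAttention + continuous batching, with optional FP8 KV cache.
# Model name is a placeholder; kv_cache_dtype="fp8" assumes FP8 KV support in this build.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",                      # quantize KV blocks to shrink cache memory
    gpu_memory_utilization=0.90,               # fraction of VRAM for weights + KV blocks
)

params = SamplingParams(temperature=0.7, max_tokens=128)
prompts = ["Explain PagedAttention in one sentence."] * 8  # a small batch of requests

# The engine batches these continuously and allocates KV in fixed-size blocks.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```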

Performance

From the vLLM evaluation:

  • vLLM achieves 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs.

KV and memory behavior

  • PagedAttention provides a KV layout that is both GPU friendly and fragmentation resistant.
  • FP8 KV quantization reduces KV size and improves decode throughput when compute is not the bottleneck.

Where it fits

  • Default high-performance engine when you need a general LLM serving backend with good throughput, good TTFT, and hardware flexibility.

2. TensorRT-LLM

Design

TensorRT-LLM is a compilation-based engine on top of NVIDIA TensorRT. It generates fused kernels per model and shape, and exposes an executor API used by frameworks such as Triton.

Its KV subsystem is explicit and feature rich:

  • Paged KV cache
  • Quantized KV cache (INT8, FP8, with some combinations still evolving)
  • Circular buffer KV cache
  • KV cache reuse, including offloading KV to CPU and reusing it across prompts to reduce TTFT

NVIDIA reports that CPU-based KV reuse can cut time to first token by up to 14× on H100, and even more on GH200, in specific scenarios.
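
As a rough illustration, the sketch below uses TensorRT-LLM’s high-level LLM API to enable KV block reuse. It assumes a recent release; the exact module path and option names may differ between versions.

```python
# Hedged sketch of TensorRT-LLM's high-level LLM API with KV block reuse enabled.
# Assumes a recent tensorrt_llm release; option names may differ across versions.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

kv_config = KvCacheConfig(
    enable_block_reuse=True,        # reuse KV blocks across prompts that share a prefix
    free_gpu_memory_fraction=0.85,  # budget reserved for paged KV blocks
)

# Placeholder model; the checkpoint is compiled into a TensorRT engine on first load.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_config=kv_config)

outputs = llm.generate(
    ["Summarize KV cache reuse in one line."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```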

Performance

TensorRT-LLM is highly tunable, so results vary. Common patterns from public comparisons and vendor benchmarks:

  • Very low single-request latency on NVIDIA GPUs when engines are compiled for the exact model and configuration.
  • At moderate concurrency, it can be tuned either for low TTFT or for high throughput; at very high concurrency, throughput-optimized profiles push P99 up because of aggressive batching.

KV and memory behavior

  • Paged KV plus quantized KV gives strong control over memory use and bandwidth.
  • Executor and memory APIs let you design cache-aware routing policies at the application layer.

Where it fits

  • Latency-critical workloads and NVIDIA-only environments, where teams can invest in engine builds and per-model tuning.

3. Hugging Face TGI v3

Design

Text Generation Inference (TGI) is a server-focused stack with:

  • Rust-based HTTP and gRPC server
  • Continuous batching, streaming, safety hooks
  • Backends for PyTorch and TensorRT and tight Hugging Face Hub integration

TGI v3 adds a new long-context pipeline:

  • Chunked prefill for long inputs
  • Prefix KV caching so long conversation histories are not recomputed on each request
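
As a small illustration, the sketch below calls a TGI v3 server over its `/generate` HTTP API from Python. The host, port and prompt are placeholders, and the server is assumed to be started separately (for example via the official Docker image).

```python
# Hedged sketch: querying a TGI v3 server over its /generate HTTP API.
# Assumes the server is already running on localhost:8080 (e.g. via the official Docker image).
import requests

# A long, repeated chat history (illustrative) that benefits from prefix KV caching.
history = "System: You are a helpful assistant.\nUser: ...\nAssistant: ...\n" * 50
payload = {
    "inputs": history + "User: What did we decide earlier?\nAssistant:",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}

# Repeated calls sharing this history skip most of the prefill thanks to prefix caching.
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=120)
print(resp.json()["generated_text"])
```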

Performance

For typical prompts, recent third-party work shows:

  • vLLM generally edges out TGI on raw tokens per second at high concurrency thanks to PagedAttention, but the difference is not huge on many setups.
  • TGI v3 processes around 3× more tokens and is up to 13× faster than vLLM on long prompts, under a setup with very long histories and prefix caching enabled.

Latency profile:

  • P50 for short and mid-length prompts is similar to vLLM when both are tuned with continuous batching.
  • For long chat histories, prefill dominates in naive pipelines; TGI v3’s reuse of earlier tokens gives a big win in TTFT and P50.

KV and memory behavior

  • TGI uses KV caching with paged-attention-style kernels and reduces memory footprint through chunked prefill and other runtime changes.
  • It integrates quantization via bitsandbytes and GPTQ and runs across multiple hardware backends.

Where it fits

  • Production stacks already on Hugging Face, especially for chat-style workloads with long histories where prefix caching gives large real-world gains.

4. LMDeploy

Design

LMDeploy is a toolkit for compressing and deploying LLMs from the InternLM ecosystem. It exposes two engines:

  • TurboMind: high-performance CUDA kernels for NVIDIA GPUs
  • PyTorch engine: flexible fallback

Key runtime features:

  • Persistent, continuous batching
  • Blocked KV cache with a manager for allocation and reuse
  • Dynamic split and fuse for attention blocks
  • Tensor parallelism
  • Weight-only and KV quantization (including AWQ and online INT8 / INT4 KV quant)

LMDeploy reports up to 1.8× higher request throughput than vLLM, attributing this to persistent batching, blocked KV and optimized kernels.
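
For illustration, here is a hedged sketch of LMDeploy’s `pipeline` API with the TurboMind engine and online KV quantization. The model name and option values are assumptions; check the documentation for your release.

```python
# Hedged sketch: LMDeploy pipeline with the TurboMind engine and online KV quantization.
# Model name and option values are assumptions; check the docs for your LMDeploy version.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    quant_policy=8,             # online INT8 KV quantization (4 would request INT4 KV)
    cache_max_entry_count=0.8,  # share of free GPU memory reserved for the blocked KV cache
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
responses = pipe(["Give one reason blocked KV caches reduce fragmentation."])
print(responses[0].text)
```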

Performance

Evaluations show:

  • For 4-bit Llama-style models on A100, LMDeploy can reach higher tokens per second than vLLM under comparable latency constraints, especially at high concurrency.
  • It also reports that 4-bit inference is about 2.4× faster than FP16 for supported models.

Latency:

  • Single-request TTFT is in the same ballpark as other optimized GPU engines when configured without extreme batch limits.
  • Under heavy concurrency, persistent batching plus blocked KV let LMDeploy sustain high throughput without TTFT collapse.

KV and memory behavior

  • The blocked KV cache trades contiguous per-sequence buffers for a grid of KV chunks managed by the runtime, similar in spirit to vLLM’s PagedAttention but with a different internal layout.
  • Support for weight and KV quantization targets large models on constrained GPUs.

Where it fits

  • NVIDIA-centric deployments that want maximum throughput and are comfortable using TurboMind and LMDeploy-specific tooling.

5. SGLang

Design

SGLang is both:

  • A DSL for building structured LLM programs such as agents, RAG workflows and tool pipelines
  • A runtime that implements RadixAttention, a KV reuse mechanism that shares prefixes using a radix tree structure rather than simple block hashes.

RadixAttention:

  • Stores KV for many requests in a prefix tree keyed by tokens
  • Enables high KV hit rates when many calls share prefixes, such as few-shot prompts, multi-turn chat, or tool chains
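
The sketch below shows roughly what the DSL looks like when many calls share a system-prompt prefix. It assumes an SGLang server is already running locally; the endpoint and prompts are placeholders.

```python
# Hedged sketch of SGLang's frontend DSL; assumes a server is already running at
# localhost:30000 (e.g. launched separately with `python -m sglang.launch_server`).
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def qa(s, question):
    # The shared system prompt becomes a common radix-tree prefix, so its KV is
    # computed once and reused across every call below.
    s += sgl.system("You are a concise assistant for an internal RAG tool.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

# Many calls sharing the same prefix -> high RadixAttention hit rate.
states = qa.run_batch([{"question": f"Question {i}?"} for i in range(16)])
print(states[0]["answer"])
```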

Performance

Key Insights:

  • SGLang achieves up to 6.4× higher throughput and up to 3.7× lower latency than baseline systems such as vLLM, LMQL and others on structured workloads.
  • Improvements are largest when there is heavy prefix reuse, for example multi-turn chat or evaluation workloads with repeated context.

Reported KV cache hit rates range from roughly 50% to 99%, and cache-aware schedulers get close to the optimal hit rate on the measured benchmarks.

KV and memory behavior

  • RadixAttention sits on top of paged-attention-style kernels and focuses on reuse rather than just allocation.
  • SGLang integrates well with hierarchical context caching systems that move KV between GPU and CPU when sequences are long, although these systems are often implemented as separate projects.

Where it fits

  • Agentic systems, tool pipelines, and heavy RAG applications where many calls share large prompt prefixes and KV reuse matters at the application level.

6. DeepSpeed Inference / ZeRO Inference

Design

DeepSpeed provides two pieces relevant to inference:

  • DeepSpeed Inference: optimized transformer kernels plus tensor and pipeline parallelism
  • ZeRO Inference / ZeRO Offload: techniques that offload model weights, and in some setups the KV cache, to CPU or NVMe to run very large models on limited GPU memory

ZeRO Inference focuses on:

  • Keeping few or no model weights resident in GPU memory
  • Streaming tensors from CPU or NVMe as needed
  • Targeting throughput and model size rather than low latency
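
As a hedged sketch, a ZeRO-Inference-style setup looks roughly like the following: a ZeRO stage 3 config with parameter offload to NVMe (or CPU), wrapped around a Hugging Face model. The model, paths and config values are placeholders, and exact keys may vary by DeepSpeed version.

```python
# Hedged sketch of a ZeRO-Inference style setup: ZeRO stage 3 with parameter offload
# to NVMe (or CPU), wrapped around a Hugging Face model. Paths, model and values are
# placeholders; typically launched with the `deepspeed` launcher. Keys may vary by version.
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "facebook/opt-30b"  # placeholder large model that does not fit in 32 GB natively

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # required even for inference-only runs
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                        # partition and offload parameters
        "offload_param": {
            "device": "nvme",              # or "cpu" for CPU offload
            "nvme_path": "/local_nvme",    # placeholder NVMe mount point
            "pin_memory": True,
        },
    },
}

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)

# DeepSpeed streams offloaded parameters to the GPU layer by layer during generation.
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

inputs = tok("ZeRO-Inference trades latency for", return_tensors="pt").to("cuda")
print(tok.decode(engine.module.generate(**inputs, max_new_tokens=32)[0]))
```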

Performance

In the ZeRO Inference OPT-30B example on a single V100 32GB:

  • Full CPU offload reaches about 43 tokens per second
  • Full NVMe offload reaches about 30 tokens per second
  • Both are 1.3–2.4× faster than partial offload configurations, because full offload allows larger batch sizes

These numbers are small compared to GPU-resident LLM runtimes on A100 or H100, but they apply to a model that does not fit natively in 32 GB.

A recent I/O characterization of DeepSpeed and FlexGen confirms that offload-based systems are dominated by small 128 KiB reads and that I/O behavior becomes the main bottleneck.

KV and memory behavior

  • Model weights and sometimes KV blocks are offloaded to CPU or SSD to fit models beyond GPU capacity.
  • TTFT and P99 are high compared to pure GPU engines, but the tradeoff is the ability to run very large models that otherwise would not fit.

Where it fits

  • Offline or batch inference, or low-QPS services where model size matters more than latency and GPU count.

Comparison Table

This table summarizes the main tradeoffs qualitatively:

| Runtime | Main design idea | Relative strength | KV strategy | Typical use case |
| --- | --- | --- | --- | --- |
| vLLM | PagedAttention, continuous batching | High tokens per second at a given TTFT | Paged KV blocks, FP8 KV support | General-purpose GPU serving, multiple hardware backends |
| TensorRT-LLM | Compiled kernels on NVIDIA + KV reuse | Very low latency and high throughput on NVIDIA | Paged, quantized KV, reuse and offload | NVIDIA only, latency sensitive |
| TGI v3 | HF serving layer with long-prompt path | Strong long-prompt performance, integrated stack | Paged KV, chunked prefill, prefix caching | HF-centric APIs, long chat histories |
| LMDeploy | TurboMind kernels, blocked KV, quant | Up to 1.8× vLLM throughput in vendor tests | Blocked KV cache, weight and KV quant | NVIDIA deployments focused on raw throughput |
| SGLang | RadixAttention and structured programs | Up to 6.4× throughput and 3.7× lower latency on structured workloads | Radix-tree KV reuse over prefixes | Agents, RAG, high prefix reuse |
| DeepSpeed | GPU/CPU/NVMe offload for huge models | Enables large models on small GPUs; throughput oriented | Offloaded weights and sometimes KV | Very large models, offline or low QPS |

Choosing a runtime in practice

For a production system, the choice tends to come down to a few simple patterns:

  • You want a solid default engine with minimal custom work: start with vLLM. It gives you good throughput, reasonable TTFT, and robust KV handling on common hardware.
  • You are committed to NVIDIA and want fine-grained control over latency and KV: use TensorRT-LLM, likely behind Triton or TGI. Plan for model-specific engine builds and tuning.
  • Your stack is already on Hugging Face and you care about long chats: use TGI v3. Its long-prompt pipeline and prefix caching are very effective for conversation-style traffic.
  • You want maximum throughput per GPU with quantized models: use LMDeploy with TurboMind and blocked KV, especially for 4-bit Llama-family models.
  • You are building agents, tool chains or heavy RAG systems: use SGLang and design prompts so that KV reuse via RadixAttention is high.
  • You need to run very large models on limited GPUs: use DeepSpeed Inference / ZeRO Inference, accept higher latency, and treat the GPU as a throughput engine with SSD in the loop.

Overall, all these engines are converging on the same idea: the KV cache is the real bottleneck resource. The winners are the runtimes that treat KV as a first-class data structure to be paged, quantized, reused and offloaded, not just a big tensor slapped into GPU memory.

