vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy, A Deep Technical Comparison for Production LLM Inference
Production LLM serving is now a systems problem, not a generate() loop. For real workloads, the choice of inference stack drives your tokens per second, tail latency, and ultimately cost per million tokens on a given GPU fleet.
This comparison focuses on four widely used stacks:
- vLLM
- NVIDIA TensorRT-LLM
- Hugging Face Text Generation Inference (TGI v3)
- LMDeploy

1. vLLM, PagedAttention as the open baseline
Core idea
vLLM is built around PagedAttention, an attention implementation that treats the KV cache like paged virtual memory rather than a single contiguous buffer per sequence.
Instead of allocating one large KV region per request, vLLM:
- Divides the KV cache into fixed-size blocks
- Maintains a block table that maps logical tokens to physical blocks
- Shares blocks between sequences wherever prefixes overlap
This reduces external fragmentation and lets the scheduler pack many more concurrent sequences into the same VRAM.
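As a toy illustration of that bookkeeping (not vLLM's actual code; the block size and class names are invented for the example), the sketch below maps logical token positions to fixed-size physical blocks through a block table and shares a reference-counted prefix between two sequences:

```python
# Toy sketch of paged KV bookkeeping, not vLLM's implementation.
BLOCK_SIZE = 16  # tokens per KV block (a typical block size)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self) -> int:
        block = self.free.pop()          # grab any free physical block
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1        # another sequence now references this block
        return block

class Sequence:
    def __init__(self, allocator: BlockAllocator, shared_prefix=None):
        # block_table[i] holds logical tokens [i * BLOCK_SIZE, (i + 1) * BLOCK_SIZE)
        self.allocator = allocator
        self.block_table = [allocator.share(b) for b in (shared_prefix or [])]
        self.num_tokens = len(self.block_table) * BLOCK_SIZE

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:   # current block is full, grab a new one
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

# Two requests sharing a cached two-block prefix (e.g. a common system prompt):
alloc = BlockAllocator(num_blocks=1024)
prefix = [alloc.alloc(), alloc.alloc()]
seq_a, seq_b = Sequence(alloc, prefix), Sequence(alloc, prefix)
seq_a.append_token()
print(seq_a.block_table, seq_b.block_table)  # prefix blocks shared, new block unique to seq_a
```

Only generated tokens allocate fresh blocks, which is why overlapping prefixes cost VRAM once rather than per request.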
Throughput and latency
vLLM improves throughput by 2–4× over systems like FasterTransformer and Orca at similar latency, with larger gains for longer sequences.
Key properties for operators:
- Continuous batching (also called inflight batching) merges incoming requests into in-progress GPU batches instead of waiting for fixed batch windows.
- On typical chat workloads, throughput scales near linearly with concurrency until KV memory or compute saturates.
- P50 latency stays low at moderate concurrency, but P99 can degrade once queues grow long or KV memory is tight, especially for prefill-heavy queries.
vLLM exposes an OpenAI-compatible HTTP API and integrates well with Ray Serve and other orchestrators, which is why it is widely used as an open baseline.
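A minimal client-side example, assuming a vLLM server is already running locally (the model name, port, and launch command are assumptions to adapt to your deployment):

```python
# Calling a vLLM OpenAI-compatible server, assumed to be started with e.g.
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# and listening on localhost:8000 (model and port are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

Because the API surface matches OpenAI's, existing client code usually only needs a different base URL.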
KV and multi tenant
- PagedAttention provides near zero KV waste and flexible prefix sharing within and across requests.
- Each vLLM process serves one model; multi tenant and multi model setups are usually built with an external router or API gateway that fans out to multiple vLLM instances, as sketched below.
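A minimal sketch of that router pattern, assuming two vLLM instances already running on ports 8001 and 8002 (the model names, ports, and the FastAPI/httpx choice are all illustrative, not a production gateway):

```python
# Thin fan-out proxy: one vLLM instance per model, routed by the "model" field.
from fastapi import FastAPI, Request
import httpx

# Assumed upstream vLLM instances, one model each (names and ports are illustrative).
UPSTREAMS = {
    "llama-3.1-8b": "http://localhost:8001/v1",
    "qwen2.5-7b": "http://localhost:8002/v1",
}

app = FastAPI()

@app.post("/v1/chat/completions")
async def route(request: Request):
    body = await request.json()
    upstream = UPSTREAMS[body["model"]]        # pick the instance serving this model
    async with httpx.AsyncClient(timeout=120) as client:
        r = await client.post(f"{upstream}/chat/completions", json=body)
        return r.json()
```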
2. TensorRT-LLM, maximum performance on NVIDIA GPUs
Core idea
TensorRT-LLM is NVIDIA's optimized inference library for its GPUs. The library provides custom attention kernels, inflight batching, paged KV caching, quantization down to FP4 and INT4, and speculative decoding.
It is tightly coupled to NVIDIA hardware, including FP8 tensor cores on Hopper and Blackwell.
Measured performance
NVIDIA's H100 vs A100 comparison is the most concrete public reference:
- On H100 with FP8, TensorRT-LLM reaches over 10,000 output tokens/s at peak throughput for 64 concurrent requests, with ~100 ms time to first token.
- H100 FP8 achieves up to 4.6× higher max throughput and 4.4× faster first token latency than A100 on the same models.
For latency sensitive modes:
- TensorRT-LLM on H100 can drive TTFT below 10 ms in batch 1 configurations, at the cost of lower overall throughput.
These numbers are model and shape specific, but they give a practical sense of scale.
Prefill vs decode
TensorRT-LLM optimizes both phases:
- Prefill benefits from high throughput FP8 attention kernels and tensor parallelism
- Decode benefits from CUDA graphs, speculative decoding, quantized weights and KV, and kernel fusion
The result is very high tokens/s across a wide range of input and output lengths, especially when the engine is tuned for that model and batch profile.
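For orientation, here is a minimal sketch using TensorRT-LLM's high-level LLM API (the model name is an assumption, engine build options such as FP8 quantization and batching profiles are configured separately and omitted here, and result fields can vary between versions):

```python
# Minimal TensorRT-LLM LLM API sketch; builds or loads an engine for the local GPU.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # assumed model; engine tuning omitted

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain speculative decoding in two sentences."], params)
print(outputs[0].outputs[0].text)
```

In production the same engine is typically fronted by Triton or an orchestrator rather than called in-process like this.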
KV and multi tenant
TensorRT-LLM provides:
- Paged KV cache with configurable layout
- Support for long sequences, KV reuse, and offloading
- Inflight batching and priority-aware scheduling primitives
NVIDIA pairs this with Ray-based or Triton-based orchestration patterns for multi tenant clusters. Multi model support is handled at the orchestrator level, not inside a single TensorRT-LLM engine instance.
3. Hugging Face TGI v3, long prompt specialist and multi backend gateway
Core idea
Text Generation Inference (TGI) is a Rust and Python based serving stack that offers:
- HTTP and gRPC APIs
- Continuous batching scheduler
- Observability and autoscaling hooks
- Pluggable backends, including vLLM style engines, TensorRT-LLM, and other runtimes
Version 3 focuses on long prompt processing via chunking and prefix caching.
Long prompt benchmark vs vLLM
The TGI v3 docs give a clear benchmark:
- On long prompts with more than 200,000 tokens, a conversation reply that takes 27.5 s in vLLM can be served in about 2 s in TGI v3.
- This is reported as a 13× speedup on that workload.
- TGI v3 can process about 3× more tokens in the same GPU memory by reducing its memory footprint and exploiting chunking and caching.
The mechanism is:
- TGI keeps the original conversation context in a prefix cache, so subsequent turns only pay for incremental tokens
- Cache lookup overhead is on the order of microseconds, negligible relative to prefill compute
This is a targeted optimization for workloads where prompts are extremely long and reused across turns, for example RAG pipelines and analytic summarization.
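A sketch of the reuse pattern this targets, assuming a TGI server on localhost:8080 and a local document file (both are assumptions): the long document prefix should hit the prefix cache after the first question, so later turns mostly pay for the incremental tokens.

```python
# Repeated questions over one very long context, served by a running TGI instance.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")   # assumed TGI endpoint

long_document = open("report.txt").read()           # e.g. a very long RAG context

questions = [
    "What are the main findings?",
    "List the risks mentioned in section 3.",
]

for q in questions:
    # The shared document prefix is cached after the first request, so subsequent
    # turns avoid re-running prefill over the whole context.
    prompt = f"{long_document}\n\nQuestion: {q}\nAnswer:"
    print(client.text_generation(prompt, max_new_tokens=200))
```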
Architecture and latency behavior
Key components:
- Chunking, very long prompts are split into manageable segments for KV and scheduling
- Prefix caching, a data structure that shares long context across turns
- Continuous batching, incoming requests join batches of already running sequences
- PagedAttention and fused kernels in the GPU backends
For short chat style workloads, throughput and latency are in the same ballpark as vLLM. For long, cacheable contexts, both P50 and P99 latency improve by an order of magnitude because the engine avoids repeated prefill.
Multi backend and multi model
TGI is designed as a router plus model server architecture. It can:
- Route requests across many models and replicas
- Target different backends, for example TensorRT-LLM on H100 plus CPU or smaller GPUs for low priority traffic
This makes it suitable as a central serving tier in multi tenant environments.
4. LMDeploy, TurboMind with blocked KV and aggressive quantization
Core idea
LMDeploy, from the InternLM ecosystem, is a toolkit for compressing and serving LLMs, centered around the TurboMind engine. It focuses on:
- High throughput request serving
- Blocked KV cache
- Persistent batching (continuous batching)
- Quantization of weights and KV cache
Relative throughput vs vLLM
The project states:
- "LMDeploy delivers up to 1.8× higher request throughput than vLLM", with support from persistent batching, blocked KV cache, dynamic split and fuse, tensor parallelism, and optimized CUDA kernels.
KV, quantization and latency
LMDeploy includes:
- Blocked KV cache, similar to paged KV, that helps pack many sequences into VRAM
- Support for KV cache quantization, typically int8 or int4, to cut KV memory and bandwidth
- Weight-only quantization paths such as 4 bit AWQ
- A benchmarking harness that reports token throughput, request throughput, and first token latency
This makes LMDeploy attractive when you want to run larger open models like InternLM or Qwen on mid range GPUs with aggressive compression while still maintaining good tokens/s.
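A minimal serving sketch with LMDeploy's TurboMind backend, assuming an AWQ-quantized Qwen checkpoint (the model name, quantization settings, and memory fraction are assumptions to adapt to your GPU):

```python
# LMDeploy pipeline with a 4-bit AWQ checkpoint and int8 KV cache quantization.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    model_format="awq",          # weight-only 4-bit AWQ checkpoint (assumed)
    quant_policy=8,              # int8 KV cache quantization (4 would select int4)
    cache_max_entry_count=0.8,   # fraction of free VRAM reserved for the KV cache
)

pipe = pipeline("Qwen/Qwen2.5-7B-Instruct-AWQ", backend_config=engine_cfg)
print(pipe(["Give one use case for KV cache quantization."]))
```

The same engine configuration is reused when exposing the model through LMDeploy's API server instead of an in-process pipeline.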
Multi model deployments
LMDeploy provides a proxy server able to handle:
- Multi model deployments
- Multi machine, multi GPU setups
- Routing logic to select models based on request metadata
So architecturally it sits closer to TGI than to a single engine.
What to use when?
- If you want maximum throughput and very low TTFT on NVIDIA GPUs
- TensorRT-LLM is the first choice
- It uses FP8 and lower precision, custom kernels, and speculative decoding to push tokens/s and keep TTFT below 100 ms at high concurrency and below 10 ms at low concurrency
- If you are dominated by long prompts with reuse, such as RAG over large contexts
- TGI v3 is a strong default
- Its prefix cache and chunking give up to 3× token capacity and 13× lower latency than vLLM in published long prompt benchmarks, without extra configuration
- If you want an open, simple engine with strong baseline performance and an OpenAI style API
- vLLM remains the standard baseline
- PagedAttention and continuous batching make it 2–4× faster than older stacks at similar latency, and it integrates cleanly with Ray and K8s
- If you target open models such as InternLM or Qwen and value aggressive quantization with multi model serving
- LMDeploy is a good fit
- Blocked KV cache, persistent batching, and int8 or int4 KV quantization give up to 1.8× higher request throughput than vLLM on supported models, with a router layer included
In practice, many dev teams mix these systems, for example using TensorRT-LLM for high volume proprietary chat, TGI v3 for long context analytics, and vLLM or LMDeploy for experimental and open model workloads. The key is to align throughput, latency tails, and KV behavior with the actual token distributions in your traffic, then compute cost per million tokens from measured tokens/s on your own hardware, as in the sketch below.
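A back-of-the-envelope version of that last step (the GPU price and throughput below are placeholders, not measurements):

```python
# Cost per million output tokens from measured sustained throughput.
gpu_cost_per_hour = 4.00          # USD per GPU-hour (placeholder)
measured_tokens_per_s = 2500      # sustained output tokens/s from your own benchmark (placeholder)

tokens_per_hour = measured_tokens_per_s * 3600
cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million_tokens:.2f} per million output tokens")
# With these placeholders: 4.00 / 9,000,000 * 1e6 ≈ $0.44 per million tokens
```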
References
- vLLM / PagedAttention
- TensorRT-LLM performance and overview
- H100 vs A100 performance (10k tok/s @ 100 ms TTFT): https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html
- Performance overview tables: https://nvidia.github.io/TensorRT-LLM/performance/perf-overview.html
- HF Text Generation Inference (TGI v3) long-prompt behavior
- Chunking / conceptual docs (13× faster on long prompts): https://huggingface.co/docs/text-generation-inference/en/conceptual/chunking
- Release coverage with 13× vs vLLM on long prompts: https://www.marktechpost.com/2024/12/10/hugging-face-releases-text-generation-inference-tgi-v3-0-13x-faster-than-vllm-on-long-prompts/
- HF post summarizing the 27.5 s → 2 s example: https://huggingface.co/posts/Narsil/601808386353996
- LMDeploy / TurboMind
- Repo (core features, 1.8× vLLM throughput, blocked KV, persistent batching): https://github.com/InternLM/lmdeploy
- Official docs (1.8× request throughput, KV + weight quantization details): https://lmdeploy.readthedocs.io/
