LightSeek Foundation Releases TokenSpeed, an Open-Source LLM Inference Engine Targeting TensorRT-LLM-Level Performance for Agentic Workloads
Inference efficiency has quietly become one of the most consequential bottlenecks in AI deployment. As agentic coding systems such as Claude Code, Codex, and Cursor scale from developer tools to infrastructure powering software development at large, the underlying inference engines serving these requests are under growing strain. Researchers at the LightSeek Foundation have released TokenSpeed, an open-source LLM inference engine published under the MIT license and designed specifically for the demands of agentic workloads. TokenSpeed is currently in preview status.
Why Agentic Inference is a Different Problem
To understand what makes TokenSpeed's design choices meaningful, it helps to understand what makes agentic inference hard. Coding agents don't behave like a typical chatbot turn. Contexts routinely exceed 50K tokens, and conversations often span dozens of turns. This creates simultaneous pressure on two metrics: per-GPU TPM (tokens per minute), which determines how many users a single GPU can serve, and per-user TPS (tokens per second), which determines whether an individual user perceives the system as responsive. Most public benchmarks don't fully capture this behavior.
TokenSpeed is designed to maximize both. The goal is to maximize per-GPU TPM while sustaining a per-user TPS floor, typically 70 TPS, and in some cases 200 TPS or higher.
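To make the two metrics concrete, here is a minimal sketch of how per-GPU TPM and per-user TPS could be computed from completed-request records. It is illustrative only, not TokenSpeed code, and the `Request` fields are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    user_id: str
    output_tokens: int      # tokens generated for this request
    decode_seconds: float   # wall-clock time spent generating them

def per_gpu_tpm(requests: list[Request], window_seconds: float) -> float:
    """Aggregate generation throughput of one GPU over a measurement window."""
    total_tokens = sum(r.output_tokens for r in requests)
    return total_tokens / (window_seconds / 60.0)

def per_user_tps(requests: list[Request], user_id: str) -> float:
    """Generation speed as perceived by a single user."""
    mine = [r for r in requests if r.user_id == user_id]
    tokens = sum(r.output_tokens for r in mine)
    seconds = sum(r.decode_seconds for r in mine)
    return tokens / seconds if seconds > 0 else 0.0

# An engine tuned for agentic traffic tries to maximize per_gpu_tpm while
# keeping per_user_tps above a floor (e.g. 70 TPS) for every active user.
```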
Architecture: Five Interlocking Subsystems
TokenSpeed's architecture is built around five design pillars: a compiler-backed modeling mechanism for parallelism, a high-performance scheduler, a safe KV resource reuse constraint, a pluggable layered kernel system that supports heterogeneous accelerators, and SMG integration for a low-overhead CPU-side request entrypoint.
The modeling layer uses a local SPMD (Single Program, Multiple Data) approach. SPMD is a parallel execution model in which all processes run the same program but on different subsets of data, a common pattern in distributed deep learning. Rather than requiring developers to manually implement the communication logic between processes, TokenSpeed lets developers specify I/O placement annotations at module boundaries; a lightweight static compiler then automatically generates the required collective operations during model construction.
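TokenSpeed's actual annotation syntax is not shown in the announcement, so the sketch below is hypothetical: a module declares how its inputs and outputs are placed across ranks, and the collective that the static compiler would generate automatically is written out by hand here for clarity.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

# Hypothetical placement markers; TokenSpeed's real annotation API may differ.
SHARDED, REPLICATED = "sharded", "replicated"

class RowParallelLinear(nn.Module):
    """Linear layer whose input is sharded on the hidden dimension across ranks.
    Declaring input=SHARDED and output=REPLICATED is the kind of I/O placement
    annotation from which a compiler can derive the required all-reduce."""
    io_placement = {"input": SHARDED, "output": REPLICATED}

    def __init__(self, in_features_per_rank: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features_per_rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        partial = x @ self.weight.t()      # each rank produces a partial sum
        if dist.is_initialized():
            dist.all_reduce(partial)       # the collective a compiler would insert
        return partial
```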
The scheduler makes a structural split between the control plane and the execution plane. The control plane is implemented in C++ as a finite-state machine that works with the type system to enforce safe resource management, including KV cache state transfer and usage, at compile time rather than at runtime. Request lifecycle, KV cache resources, and overlap timing are represented through explicit FSM transitions and ownership semantics, so correctness is enforced by a verifiable control system rather than by convention. Because these constraints are encoded in the type system instead of being left to runtime convention, errors in KV cache management, one of the most error-prone areas in LLM serving, are caught earlier. The execution plane is implemented in Python to maintain development efficiency, enabling faster feature iteration and lower cognitive load for developers.
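The explicit-FSM idea can be pictured with the small Python sketch below. It uses runtime checks, whereas TokenSpeed's control plane is C++ and pushes these checks to compile time via the type system; the state names and transition table are assumptions, not TokenSpeed's actual states.

```python
from enum import Enum, auto

class RequestState(Enum):
    WAITING = auto()      # admitted, no KV cache blocks allocated yet
    PREFILLING = auto()   # KV blocks owned, prefill in flight
    DECODING = auto()     # KV blocks owned, decode steps being scheduled
    PREEMPTED = auto()    # KV blocks released or swapped out
    FINISHED = auto()     # KV blocks must have been returned to the pool

# Only these transitions are legal; anything else is a scheduling bug.
LEGAL_TRANSITIONS = {
    (RequestState.WAITING, RequestState.PREFILLING),
    (RequestState.PREFILLING, RequestState.DECODING),
    (RequestState.DECODING, RequestState.PREEMPTED),
    (RequestState.PREEMPTED, RequestState.PREFILLING),
    (RequestState.DECODING, RequestState.FINISHED),
}

def transition(current: RequestState, nxt: RequestState) -> RequestState:
    if (current, nxt) not in LEGAL_TRANSITIONS:
        raise RuntimeError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```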
The kernel layer treats GPU kernels as a first-class modular subsystem rather than baking them into the engine core. It provides a portable public API, a centralized registry and selection model, and an extensible plugin mechanism to support heterogeneous accelerators, meaning it is not locked to NVIDIA hardware. The team has also developed one of the fastest MLA (Multi-head Latent Attention) kernels for agentic workloads on NVIDIA Blackwell. In the decode kernel, q_seqlen and num_heads are grouped to fully utilize Tensor Cores, since num_heads is small in some of these use cases. The binary prefill kernel includes a fine-tuned softmax implementation. Notably, TokenSpeed MLA has been adopted by vLLM.
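As a minimal sketch of what a centralized registry with pluggable backends can look like, the snippet below registers a portable reference kernel and a Blackwell-tuned kernel under the same operation name. The function names and backend strings are hypothetical, not TokenSpeed's public API.

```python
from typing import Callable, Dict, Tuple

# (operation name, backend name) -> kernel implementation
_KERNEL_REGISTRY: Dict[Tuple[str, str], Callable] = {}

def register_kernel(op: str, backend: str):
    """Decorator used by in-tree kernels and out-of-tree plugins alike."""
    def wrap(fn: Callable) -> Callable:
        _KERNEL_REGISTRY[(op, backend)] = fn
        return fn
    return wrap

def select_kernel(op: str, backend: str) -> Callable:
    """Centralized selection: prefer a backend-specific kernel, else fall back."""
    return _KERNEL_REGISTRY.get((op, backend), _KERNEL_REGISTRY[(op, "reference")])

@register_kernel("mla_decode", "reference")
def mla_decode_reference(*args, **kwargs):
    ...  # portable fallback implementation

@register_kernel("mla_decode", "cuda_blackwell")
def mla_decode_blackwell(*args, **kwargs):
    ...  # kernel tuned for NVIDIA Blackwell
```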

Finally, TokenSpeed integrates SMG, a PyTorch-native component, as a low-overhead CPU-side request entrypoint, reducing the handoff cost between CPU orchestration and GPU execution.
Benchmark Results Against TensorRT-LLM on NVIDIA B200
It is worth noting upfront that these benchmarks cover single (non-disaggregated) deployment only. PD disaggregation support is still undergoing cleanup and may be covered in a dedicated follow-up from the TokenSpeed team.
Together with the EvalScope team, TokenSpeed was evaluated on SWE-smith traces, which closely mirror production coding-agent traffic, and benchmarked against TensorRT-LLM, the current state of the art on NVIDIA Blackwell. The test model was Kimi K2.5.
For coding agents operating above 70 TPS/User, the best configuration is Attention TP4 + MoE TP4, where TokenSpeed dominates TensorRT-LLM across the entire Pareto frontier: roughly 9% faster in the min-latency case (batch size 1), and roughly 11% higher throughput around 100 TPS/User. TP4 here refers to tensor parallelism across four GPUs, a technique that shards model weights across multiple devices to reduce per-device memory pressure and latency.
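As a toy illustration of why TP4 reduces per-device memory pressure (not TokenSpeed code; the matrix dimensions are arbitrary), sharding one weight matrix across the four GPUs of a tensor-parallel group leaves each device holding a quarter of it:

```python
import torch

hidden, ffn = 7168, 18432                        # arbitrary example dimensions
full_weight = torch.empty(ffn, hidden)           # what a single GPU would hold
shards = torch.chunk(full_weight, 4, dim=0)      # one shard per GPU in the TP4 group
print(full_weight.numel() // shards[0].numel())  # -> 4x less weight memory per device
```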
On the MLA kernel, the gains are more pronounced at the decode stage. The decode kernel folds the query-sequence axis into the head axis to better fill the BMM1 M tile, improving Tensor Core utilization. The binary-version prefill kernel uses NVIDIA-internal knobs to fine-tune the softmax implementation, outperforming TensorRT-LLM's MLA across all five typical prefill workloads for coding agents with a long prefix KV cache. Combined with other optimizations, this nearly halves latency relative to TensorRT-LLM on typical decode workloads with speculative decoding at batch sizes 4, 8, and 16 with a long prefix KV cache.
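The axis folding in the decode kernel can be pictured as a reshape: with speculative decoding each request carries a handful of query tokens, and merging the query-sequence axis into the head axis yields a larger M dimension for the first batched matmul (BMM1), which fills Tensor Core tiles better. The shapes below are assumptions for illustration; the real kernel works on tiles inside the attention computation rather than a literal tensor reshape.

```python
import torch

# Assumed shapes: a decode batch where each request carries a few speculative
# query tokens and a modest number of latent-attention heads.
batch, q_seqlen, num_heads, head_dim = 16, 4, 16, 576
q = torch.randn(batch, q_seqlen, num_heads, head_dim)

# Folding q_seqlen into the head axis turns an M of 16 into an M of 64 per
# request for BMM1 (q @ k^T), improving Tensor Core utilization.
q_folded = q.reshape(batch, q_seqlen * num_heads, head_dim)
print(q_folded.shape)  # torch.Size([16, 64, 576])
```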
Key Takeaways
- TokenSpeed is a new MIT-licensed, open-source LLM inference engine from the LightSeek Foundation, built specifically for agentic workloads. (Available in preview.)
- Its scheduler uses a C++ finite-state machine to enforce KV cache safety at compile time, while keeping the execution plane in Python for usability.
- On NVIDIA B200, TokenSpeed outperforms TensorRT-LLM by ~9% in min-latency and ~11% in throughput at 100 TPS/User on Kimi K2.5.
- The TokenSpeed MLA kernel nearly halves decode latency vs. TensorRT-LLM on speculative decoding workloads and has already been adopted by vLLM.
Check out the Technical details and GitHub Repo.
