Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate

Perplexity AI’s research team reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code in pplx-garden, their inference technology repository.

At production input lengths, the new encoder cuts p50 latency by roughly 5x versus the Hugging Face tokenizers crate, ~2x versus SentencePiece (C++), and ~1.5x versus IREE’s tokenizer (C), with zero steady-state heap allocations. In production, it reduced CPU utilization in Perplexity’s inference stack by 5-6x and shaved double-digit milliseconds off reranker latency.

Why Tokenization Became a Bottleneck

LLM inference cost is usually framed around GPU work: KV caches, attention kernels, expert routing. But smaller models, such as embedding models, classifiers, and rerankers, tell a different story. These models are two to three orders of magnitude smaller than frontier transformers.

A reranker scoring hundreds of candidate documents per request is a clear example. With a small model, GPU compute often finishes in single-digit milliseconds. Every input still passes through CPU-side tokenization first. When batch sizes are large, tokenization becomes a meaningful fraction of total request latency.

Perplexity’s work targets XLM-RoBERTa, a model with a 250K-token Unigram vocabulary trained with SentencePiece. Fine-tuned RoBERTa-family encoders are a common production choice for ranking, retrieval, and similarity tasks.

What is Unigram Tokenization?

Unigram tokenization was introduced by Kudo in 2018 and is implemented in SentencePiece. It frames segmentation as a most-probable-path problem. Each vocabulary token has a learned log-probability. The tokenizer picks the segmentation whose token scores sum to the highest value.

The algorithm used to find that best path is the Viterbi algorithm, a dynamic programming technique from 1967. Byte positions form graph layers and vocabulary tokens are edges spanning a contiguous byte range. The DP recurrence iterates over byte positions and updates the best-scoring path at each position.

The outer loop runs in linear time relative to input length. The inner loop walks a vocabulary trie (a prefix tree structure) at each byte position. On a 16K-token input, this inner walk executes hundreds of thousands of trie transitions. It is the hot path.
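
The recurrence described above can be sketched in a few lines of Rust. This is a minimal illustration, not Perplexity’s code: the function name viterbi_encode, the HashMap-backed vocabulary, and the max_token_len bound are assumptions made for the example, and the hash lookup per candidate span stands in for the trie walk that a production encoder performs.

```rust
use std::collections::HashMap;

/// Toy Unigram encoder (illustrative only). Returns the token IDs of the
/// highest-scoring segmentation of `input`, or None if no sequence of
/// vocabulary tokens covers the whole input.
/// `vocab` maps token bytes to (token ID, log-probability).
pub fn viterbi_encode(
    input: &[u8],
    vocab: &HashMap<&[u8], (u32, f32)>,
    max_token_len: usize,
) -> Option<Vec<u32>> {
    let n = input.len();
    // best[i] = best total score of any segmentation of input[..i].
    let mut best = vec![f32::NEG_INFINITY; n + 1];
    // back[i] = (start byte, token ID) of the last token on that best path.
    let mut back: Vec<(usize, u32)> = vec![(0, 0); n + 1];
    best[0] = 0.0;

    // Outer loop: one step per byte position.
    for start in 0..n {
        if best[start] == f32::NEG_INFINITY {
            continue; // no segmentation reaches this position
        }
        // Inner loop: every vocabulary token that begins at `start`.
        let limit = (start + max_token_len).min(n);
        for end in (start + 1)..=limit {
            if let Some(&(id, score)) = vocab.get(&input[start..end]) {
                let candidate = best[start] + score;
                if candidate > best[end] {
                    best[end] = candidate;
                    back[end] = (start, id);
                }
            }
        }
    }

    if best[n] == f32::NEG_INFINITY {
        return None;
    }
    // Follow the back-pointers from the end to recover the token sequence.
    let mut ids = Vec::new();
    let mut pos = n;
    while pos > 0 {
        let (start, id) = back[pos];
        ids.push(id);
        pos = start;
    }
    ids.reverse();
    Some(ids)
}
```

With a vocabulary holding "a", "b", and "c" at log-probability -1.0 each and "ab" at -1.5, the input "abc" segments as "ab" + "c" (total -2.5) rather than "a" + "b" + "c" (total -3.0).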

What was Slow in the Hugging Face Implementation

The Hugging Face tokenizers crate is the default Rust tokenizer most teams reach for. Perplexity used it as the benchmark reference. At 514 tokens (512 + BOS/EOS injection), the reference implementation had three costly patterns:

Bottleneck	Mechanism	Measured impact
Allocation per match	`String::from_utf8` + `AHashMap` lookup per trie match	7,295 allocations at 514 tokens; 299,171 at 16K
Pointer chase per byte	`AHashMap` at every trie node; 4 dependent loads per byte step	Dependent-load latency dominates the hot path
L2 thrashing on long inputs	DP table and output buffers freshly allocated on each call	L2 miss rate climbs from 8% at 128 tokens to 50% at 16K

Per-token allocation is constant: roughly 2 KB and ~18 allocations per token, regardless of input size. The latency problem becomes severe at longer inputs, when cumulative allocations overflow the per-core L2 cache.

Establishing a Baseline Before Changing the Trie

Before switching the trie structure, Perplexity first isolated how much cost came from unnecessary work alone. They made a zero-allocation port of the reference: the same HashMap trie, but with a caller-owned scratch struct reused across calls and token IDs stored directly in trie nodes (removing the per-match string allocation and secondary hash-map lookup).

This baseline already cut p50 latency to 155 µs at 514 tokens, down from 326 µs in the reference. Instructions retired dropped 2.4x. The remaining cost was the HashMap pointer chase itself, which the next step addressed.
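
The two changes in this baseline can be sketched as follows. The type and function names here (Node, Trie, Scratch, encode_into) are hypothetical, chosen for the example rather than taken from pplx-garden. The sketch keeps a HashMap at every trie node, stores the token ID and score in the node itself, and writes all per-call state into buffers owned by the caller.

```rust
use std::collections::HashMap;

/// Trie node that keeps HashMap children, as in the reference design, but
/// stores the token ID and score inline so a match needs no string
/// allocation and no second lookup.
#[derive(Default)]
pub struct Node {
    children: HashMap<u8, usize>, // byte -> index into Trie::nodes
    token: Option<(u32, f32)>,    // (token ID, log-probability) if terminal
}

pub struct Trie {
    nodes: Vec<Node>, // nodes[0] is the root
}

impl Trie {
    pub fn new() -> Self {
        Trie { nodes: vec![Node::default()] }
    }

    pub fn insert(&mut self, token: &[u8], id: u32, score: f32) {
        let mut cur = 0;
        for &b in token {
            let found = self.nodes[cur].children.get(&b).copied();
            cur = match found {
                Some(next) => next,
                None => {
                    self.nodes.push(Node::default());
                    let next = self.nodes.len() - 1;
                    self.nodes[cur].children.insert(b, next);
                    next
                }
            };
        }
        self.nodes[cur].token = Some((id, score));
    }
}

/// Caller-owned buffers: created once, then passed to every encode call.
#[derive(Default)]
pub struct Scratch {
    best: Vec<f32>,
    back: Vec<(usize, u32)>,
    pub ids: Vec<u32>,
}

/// Viterbi encode into `s.ids`. Returns false if the input cannot be covered.
pub fn encode_into(trie: &Trie, input: &[u8], s: &mut Scratch) -> bool {
    let n = input.len();
    // clear + resize reuses existing capacity, so once the buffers have
    // grown to the longest input seen, later calls do not allocate.
    s.best.clear();
    s.best.resize(n + 1, f32::NEG_INFINITY);
    s.back.clear();
    s.back.resize(n + 1, (0, 0));
    s.ids.clear();
    s.best[0] = 0.0;

    for start in 0..n {
        if s.best[start] == f32::NEG_INFINITY {
            continue;
        }
        let mut node = 0;
        for end in start..n {
            match trie.nodes[node].children.get(&input[end]) {
                Some(&next) => node = next,
                None => break, // no longer token starts here
            }
            if let Some((id, score)) = trie.nodes[node].token {
                let cand = s.best[start] + score;
                if cand > s.best[end + 1] {
                    s.best[end + 1] = cand;
                    s.back[end + 1] = (start, id);
                }
            }
        }
    }
    if s.best[n] == f32::NEG_INFINITY {
        return false;
    }
    let mut pos = n;
    while pos > 0 {
        let (start, id) = s.back[pos];
        s.ids.push(id);
        pos = start;
    }
    s.ids.reverse();
    true
}
```

A caller would create one Scratch per worker thread and pass it to every encode_into call. The per-byte HashMap lookup is still present in this sketch, which is the cost the double-array trie removes.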

The Three Optimizations

Optimization 1: Double-Array Trie

The Hugging Face trie stores children in a HashMap at every node. Each byte step requires a hash computation, two pointer dereferences, and a heap access. Perplexity replaced this with a double-array trie, the same structure used by SentencePiece and IREE, originally introduced by Aoe in 1989.

A double-array trie encodes the entire trie in two flat integer arrays, base and check. A child lookup is: next = base[node] + byte, then verify check[next] == node. That is 2 array reads, one integer add, and one comparison, with no hashing and no pointer chasing. For XLM-RoBERTa’s 250K vocab, the whole trie fits in ~9 MB of contiguous memory. The hot working set per encode is on the order of 100 KB, which fits in L2 cache.
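
A small, unoptimized double-array trie makes the lookup concrete. The sketch below is an assumption-laden illustration rather than the structure shipped in pplx-garden: the names DoubleArray, step, lookup, and build are hypothetical, the builder uses a naive first-fit search for each base value, and a separate token array holds the token IDs.

```rust
use std::collections::BTreeMap;

/// Double-array trie: `base` and `check` are the two flat arrays.
/// Slot 0 is the root. check[s] is the parent slot of s, or -1 if s is free.
pub struct DoubleArray {
    base: Vec<i32>,
    check: Vec<i32>,
    token: Vec<Option<u32>>, // token ID for slots that end a vocabulary entry
}

impl DoubleArray {
    /// One transition: 2 array reads, 1 add, 1 compare.
    #[inline]
    pub fn step(&self, node: usize, byte: u8) -> Option<usize> {
        let next = self.base[node] as usize + byte as usize;
        if next < self.check.len() && self.check[next] == node as i32 {
            Some(next)
        } else {
            None
        }
    }

    /// Exact-match lookup of a whole key.
    pub fn lookup(&self, key: &[u8]) -> Option<u32> {
        let mut node = 0;
        for &b in key {
            node = self.step(node, b)?;
        }
        self.token[node]
    }
}

#[derive(Default)]
struct TmpNode {
    children: BTreeMap<u8, usize>,
    token: Option<u32>,
}

/// Build from (token bytes, token ID) pairs. Naive first-fit placement.
pub fn build(vocab: &[(&[u8], u32)]) -> DoubleArray {
    // 1. Build an ordinary pointer-style trie.
    let mut tmp = vec![TmpNode::default()];
    for &(key, id) in vocab {
        let mut cur = 0;
        for &b in key {
            let found = tmp[cur].children.get(&b).copied();
            cur = match found {
                Some(next) => next,
                None => {
                    tmp.push(TmpNode::default());
                    let next = tmp.len() - 1;
                    tmp[cur].children.insert(b, next);
                    next
                }
            };
        }
        tmp[cur].token = Some(id);
    }

    // 2. Flatten it: place each node's children at base + byte.
    let mut da = DoubleArray { base: vec![0], check: vec![-1], token: vec![None] };
    da.token[0] = tmp[0].token;
    let mut stack = vec![(0usize, 0usize)]; // (tmp index, double-array slot)
    while let Some((t, slot)) = stack.pop() {
        if tmp[t].children.is_empty() {
            continue;
        }
        // Smallest base >= 1 whose child slots are all free. Starting at 1
        // keeps slot 0 reserved for the root.
        let mut base = 1usize;
        loop {
            let free = tmp[t].children.keys().all(|&b| {
                let s = base + b as usize;
                s >= da.check.len() || da.check[s] == -1
            });
            if free {
                break;
            }
            base += 1;
        }
        da.base[slot] = base as i32;
        for (&b, &child) in &tmp[t].children {
            let s = base + b as usize;
            if s >= da.check.len() {
                da.base.resize(s + 1, 0);
                da.check.resize(s + 1, -1);
                da.token.resize(s + 1, None);
            }
            da.check[s] = slot as i32;
            da.token[s] = tmp[child].token;
            stack.push((child, s));
        }
    }
    da
}
```

The step function is the part that runs in the hot loop. Everything in build happens once, at load time, and a production builder would use a far denser placement strategy than first-fit.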

SentencePiece and IREE are general-purpose libraries with lattice bookkeeping and multi-stage pipelines. Perplexity instead inlined the trie walk directly in the Viterbi loop and dropped that overhead entirely.

Result at 514 tokens: p50 dropped from 155 µs (zero-allocation baseline) to 68 µs. Wall-clock fell 4.8x from the original reference.

Optimization 2: Bitmap and Inline Packing

The double-array trie still requires two dependent array loads per byte step: first the parent’s base offset, then the check array to confirm the transition is valid. Perplexity replaced the check array with a per-node bitmap (four 64-bit words, 32 bytes) that records which of the 256 possible bytes have valid child transitions.

A bitmap lookup compiles to a single bit test against one 64-bit word. The check array is used only during trie construction and dropped from the runtime layout entirely.

They also packed all four per-node fields (bitmap, base, token ID, and score) into a single 64-byte cache line, matching the CPU cache line width exactly. One trie step now loads a single cache line covering the bitmap for the next-byte check, the base offset for the child slot, and the token ID and score at terminal nodes.

Trade-off: trie size grows from ~9 MB to ~50 MB (780K nodes x 64 bytes). The hot working set per encode stays ~100 KB.
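
One way to express such a node in Rust is shown below. The article names the four fields but not their order, widths, or sentinel values, so the layout, the PackedNode name, the NO_TOKEN sentinel, and the base + byte child addressing are all assumptions made for this sketch.

```rust
/// Sentinel for "this node does not end a token" (assumed for the sketch).
pub const NO_TOKEN: u32 = u32::MAX;

/// One trie node occupying exactly one 64-byte cache line.
/// The four fields use 44 bytes; align(64) pads the struct to 64.
#[repr(C, align(64))]
#[derive(Clone, Copy)]
pub struct PackedNode {
    pub bitmap: [u64; 4], // bit b set <=> a child exists for byte b
    pub base: u32,        // child for byte b lives at index base + b
    pub token_id: u32,    // NO_TOKEN unless a vocabulary entry ends here
    pub score: f32,       // log-probability, meaningful when token_id is set
}

impl PackedNode {
    pub const EMPTY: PackedNode = PackedNode {
        bitmap: [0; 4],
        base: 0,
        token_id: NO_TOKEN,
        score: 0.0,
    };

    /// Record that a child exists for `byte` (used while building).
    pub fn set_child(&mut self, byte: u8) {
        self.bitmap[(byte >> 6) as usize] |= 1u64 << (byte & 63);
    }

    /// Single bit test against one 64-bit word; this takes the place of
    /// the second dependent load from the check array.
    #[inline]
    pub fn has_child(&self, byte: u8) -> bool {
        ((self.bitmap[(byte >> 6) as usize] >> (byte & 63)) & 1) == 1
    }

    /// Index of the child node for `byte`, if the transition is valid.
    #[inline]
    pub fn child(&self, byte: u8) -> Option<usize> {
        if self.has_child(byte) {
            Some(self.base as usize + byte as usize)
        } else {
            None
        }
    }
}

/// Total trie size in bytes for a given node count.
pub fn trie_bytes(nodes: usize) -> usize {
    nodes * std::mem::size_of::<PackedNode>()
}
```

With this layout, trie_bytes(780_000) is 49,920,000 bytes, which lines up with the ~50 MB figure quoted above.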

Result at 514 tokens: an additional 4.5% wall-clock reduction. L2 accesses dropped from 4.6K to 1.8K per encode.

Optimization 3: Huge Pages for the Trie

At 50 MB, the trie spans roughly 12,000 virtual pages on a default Linux system using 4 KB pages. The first-level data TLB on Intel Sapphire Rapids holds 96 entries. Each Viterbi step touches a different trie node, so TLB misses accumulate. Over a 512-token encode, Perplexity estimated roughly 9,000 cycles spent in page-table walks, about 3% of the per-encode budget.

Perplexity backed the trie with 2 MB huge pages via mmap with the MAP_HUGETLB flag. The same 50 MB now spans about 25 pages, well within the TLB. This requires vm.nr_hugepages to be configured at boot. In production, 10,561 huge pages are reserved; the trie uses 24.
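
The page counts follow from simple rounding-up division, worked through below. The sketch assumes the trie is exactly 780,000 nodes of 64 bytes (the article gives the node count only as roughly 780K) and uses binary page sizes of 4 KiB and 2 MiB. It does not show the mmap call itself, which needs platform bindings outside the Rust standard library.

```rust
/// Number of pages needed to map `bytes`, rounding up.
pub fn pages_needed(bytes: u64, page_size: u64) -> u64 {
    (bytes + page_size - 1) / page_size
}

pub const KIB: u64 = 1024;
pub const MIB: u64 = 1024 * KIB;

/// Assumed trie size: ~780K nodes x 64 bytes (node count is approximate).
pub const TRIE_BYTES: u64 = 780_000 * 64;

/// Pages needed with the default 4 KiB page size.
pub fn small_pages() -> u64 {
    pages_needed(TRIE_BYTES, 4 * KIB)
}

/// Pages needed with 2 MiB huge pages.
pub fn huge_pages() -> u64 {
    pages_needed(TRIE_BYTES, 2 * MIB)
}
```

Under these assumptions small_pages() is 12,188 and huge_pages() is 24. The first agrees with the "roughly 12,000" figure and the second with the 24 huge pages reported in production; dividing the rounded 50 MB by 2 MB gives the 25 quoted in the text.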

Result: 3-12% wall-clock reduction depending on input length. The largest gain is at 4,098 tokens (-12.0%), where page-table traffic was actively competing with trie data for L2 bandwidth. Beyond 4K tokens the gain shrinks because L3 misses dominate.

Final Benchmark Results

All measurements are single-threaded, pinned to one core on an Intel Xeon Platinum 8488C, with 10,000 iterations after 1,000 warmup rounds. At 514 tokens:

Engine	p50 Latency	Instructions	Allocations
Hugging Face (`tokenizers` crate)	349 µs	3.60M	7,295
SentencePiece (C++)	128 µs	1.83M	1,559
IREE tokenizer (C)	112 µs	2.28M	1
Perplexity (final, all 3 optimizations)	~63 µs	1.04M	0
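
The measurement procedure described above (untimed warmup rounds, then timed iterations, reporting the median) can be approximated with the standard library alone. The helper names p50 and bench_p50 are hypothetical, and the sketch does not pin the thread to a core or count instructions, both of which the article’s numbers rely on.

```rust
use std::time::{Duration, Instant};

/// Median (p50) of the samples. Sorts in place; for an even number of
/// samples it returns the upper of the two middle values.
pub fn p50(samples: &mut [Duration]) -> Duration {
    assert!(!samples.is_empty());
    samples.sort_unstable();
    samples[samples.len() / 2]
}

/// Run `f` for `warmup` untimed rounds, then time `iters` rounds
/// individually and return the median latency.
pub fn bench_p50<F: FnMut()>(mut f: F, warmup: usize, iters: usize) -> Duration {
    for _ in 0..warmup {
        f();
    }
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        let t0 = Instant::now();
        f();
        samples.push(t0.elapsed());
    }
    p50(&mut samples)
}
```

To mirror the article’s setup, a caller would pass a closure that encodes one fixed 514-token input and call bench_p50 with warmup set to 1,000 and iters set to 10,000.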

Across the full optimization sequence, instructions per encode fell from 3.66M to 1.04M, a 3.5x reduction. Wall-clock matches that ratio at short inputs and widens at long inputs, where the reference’s per-token allocations overflow L2 and L3.

One additional finding: off-the-shelf Rust wrapper crates around SentencePiece and IREE add 1.6-1.9x latency overhead compared to the native C/C++ binaries. The sentencepiece crate allocates a fresh list of token pieces on each call. The overhead is measurable but amortizes at long inputs.

The final Perplexity encoder produces token-exact output against the reference. In production, it uses rayon to parallelize across cores.
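
The shape of batch-level parallelism can be illustrated without external crates. The sketch below deliberately swaps rayon for std::thread::scope so that it stays self-contained, and toy_encode is a placeholder that maps bytes to IDs rather than a real tokenizer. Both the function names and the chunking strategy are assumptions.

```rust
use std::thread;

/// Placeholder for a real encoder: maps each byte to an ID.
/// It writes into a caller-provided buffer, as a zero-allocation encoder would.
fn toy_encode(text: &str, out: &mut Vec<u32>) {
    out.clear();
    out.extend(text.bytes().map(u32::from));
}

/// Encode a batch on up to `workers` threads. Each thread gets a contiguous
/// chunk of inputs and the matching chunk of output slots.
pub fn encode_batch(batch: &[&str], workers: usize) -> Vec<Vec<u32>> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 so chunks() never receives 0.
    let chunk = ((batch.len() + workers - 1) / workers).max(1);
    let mut results: Vec<Vec<u32>> = vec![Vec::new(); batch.len()];

    thread::scope(|s| {
        for (inputs, outputs) in batch.chunks(chunk).zip(results.chunks_mut(chunk)) {
            s.spawn(move || {
                for (text, out) in inputs.iter().zip(outputs.iter_mut()) {
                    toy_encode(text, out);
                }
            });
        }
    });
    results
}
```

Because tokenizing one document is independent of every other document, the work splits cleanly by input, and the read-only trie can be shared by all threads without locking.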

Marktechpost’s Visual Explainer

Open Source Release

Perplexity AI Rewrites Its Unigram Tokenizer, Cuts CPU Utilization 5-6x

Perplexity reimplemented their Unigram tokenizer from scratch in Rust and open-sourced it in pplx-garden. Three targeted optimizations removed wasted work from the hot path.

5x lower p50 vs Hugging Face tokenizers crate

5-6x CPU utilization reduction in production

0 heap allocations on the hot path

Source: research.perplexity.ai

The Problem

Why CPU Tokenization Became a Bottleneck

LLM inference cost is usually framed around GPU work: KV caches, attention kernels, expert routing. But small models tell a different story.

Rerankers and embedders are small

Two to three orders of magnitude smaller than frontier transformers. GPU compute finishes in single-digit milliseconds.

Tokenization runs on CPU before each call

Every input passes through CPU-side tokenization first, turning text into vocabulary IDs.

Batch size amplifies the cost

A reranker scoring hundreds of documents per request means tokenization runs hundreds of times per query.

Background

What Is Unigram Tokenization?

Introduced by Kudo (2018), implemented in SentencePiece. Perplexity targets XLM-RoBERTa with a 250K-token Unigram vocabulary.

Most-probable-path problem

Each vocabulary token carries a learned log-probability. The tokenizer picks the segmentation whose token scores sum highest.

Viterbi algorithm (1967)

A dynamic programming method that finds the best path. Byte positions are graph layers; vocabulary tokens are edges.

The hot path is the inner trie walk at each byte position. On a 16K-token input, this executes hundreds of thousands of trie transitions and retires tens of millions of instructions per encode.

Root Cause

Three Bottlenecks in the Hugging Face Reference

Measured at 514 tokens (512 + BOS/EOS) on Intel Xeon Platinum 8488C:

Bottleneck	Mechanism	Impact
Allocation per match	`String::from_utf8` + `AHashMap` lookup per trie match	7,295 allocs at 514 tokens; 299,171 at 16K
Pointer chase per byte	`AHashMap` at every trie node; 4 dependent loads per step	Dependent-load latency dominates
L2 thrashing	DP table and output buffers freshly allocated on each call	L2 miss rate: 8% at 128 tokens, 50% at 16K

Per-token allocation is constant: ~2 KB and ~18 allocations per token regardless of input size.

Step 0: Baseline

Zero-Allocation Port Before Changing the Trie

Before touching the trie structure, Perplexity isolated how much cost came from unnecessary allocations alone. They kept the same HashMap trie but made two changes:

Caller-owned scratch struct reused across calls, removing per-encode DP table allocation
Token IDs stored directly in trie nodes, removing per-match String allocation and secondary hash-map lookup

Reference p50

326 µs

Baseline p50

155 µs (2.1x lower)

Allocations alone were the dominant cost. Instructions retired dropped 2.4x. The HashMap pointer chase was now the remaining bottleneck.

Optimization 1

Double-Array Trie

The HashMap trie costs 4 dependent loads per byte step. The double-array trie (Aoe, 1989) replaces it with flat integer arrays base and check.

HashMap trie (reference)

Hash byte, load bucket, follow pointer to child, follow pointer to child’s HashMap. 4 dependent loads per step.

Double-array trie

next = base[node] + byte
Verify check[next] == node
2 array reads, 1 add, 1 compare. No hashing.

250K vocab fits in ~9 MB contiguous memory. Hot working set per encode is ~100 KB, fitting in L2 cache. Result: p50 drops from 155 µs to 68 µs, wall-clock 4.8x faster than the original reference.

Optimization 2

Bitmap + 64-Byte Cache-Line Packing

The double-array trie still needs two dependent array loads per step. Perplexity replaced the check array with a per-node bitmap.

Per-node bitmap: four 64-bit words (32 bytes), one bit per possible byte value. A single bit test replaces the second array load.
All four per-node fields (bitmap, base, token ID, score) packed into one 64-byte cache line.
One trie step now loads a single cache line covering validity, child offset, and terminal data.

L2 accesses at 514 tokens

4,600 (Darts) vs 1,800 (Bitmap)

Trie size trade-off

~9 MB (Darts) grows to ~50 MB (780K nodes x 64 bytes)

Optimization 3

2 MB Huge Pages for the Trie

At 50 MB with 4 KB pages, the trie spans ~12,000 virtual pages. Intel Sapphire Rapids holds only 96 entries in the first-level data TLB. TLB misses trigger page-table walks.

~9,000 cycles spent in page-table walks per 512-token encode, about 3% of the per-encode budget.

Fix: back the trie with 2 MB huge pages via mmap with MAP_HUGETLB. The same 50 MB spans about 25 pages, well within TLB capacity. In production, 10,561 huge pages are reserved; the trie uses 24.

At 514 tokens

65.4 µs without huge pages vs 63.1 µs with (-3.4%)

At 4,098 tokens

773 µs without huge pages vs 679 µs with (-12.0%)

Results

Final Benchmark at 514 Tokens

Single-threaded, pinned core, Intel Xeon Platinum 8488C. 10,000 iterations after 1,000 warmup rounds.

Engine	p50 Latency	Instructions	Allocations
Hugging Face (Rust)	349 µs	3.60M	7,295
SentencePiece (C++)	128 µs	1.83M	1,559
IREE tokenizer (C)	112 µs	2.28M	1
Perplexity (final)	~63 µs	1.04M	0

Instructions per encode fell from 3.66M to 1.04M, a 3.5x reduction. Note: off-the-shelf Rust wrapper crates around SentencePiece and IREE add 1.6-1.9x overhead vs native binaries due to per-call allocations.

Key Takeaways

What Engineers Should Know

CPU tokenization is invisible in GPU profiling traces but real in end-to-end latency for small models.
Removing per-encode heap allocations (zero-allocation baseline) cut p50 from 326 µs to 155 µs before any trie change.
Double-array trie brought p50 to 68 µs. Bitmap packing and huge pages brought it to ~63 µs.
The Rust wrapper crates around SentencePiece and IREE add 1.6-1.9x latency overhead vs native binaries.
Source code is available at github.com/perplexityai/pplx-garden under MIT license.

Production impact

5-6x CPU utilization reduction + double-digit ms off reranker latency

Target model

XLM-RoBERTa, 250K-token SentencePiece Unigram vocabulary

Key Takeaways

Perplexity rebuilt their Unigram tokenizer targeting XLM-RoBERTa's 250K-token SentencePiece vocabulary
The new encoder achieves zero steady-state heap allocations and ~63 µs p50 at 514 tokens
Three optimizations: double-array trie, bitmap + 64-byte cache-line packing, and 2 MB huge pages for the trie
Intermediate result: a zero-allocation HashMap port alone cut p50 from 326 µs to 155 µs before the trie was changed
Production impact: 5-6x CPU utilization reduction and double-digit ms reduction in reranker latency


The post Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate appeared first on MarkTechPost.
