
Zyphra Introduces Tensor and Sequence Parallelism (TSP): A Hardware-Aware Training and Inference Strategy That Delivers 2.6x Throughput Over Matched TP+SP Baselines

Training and serving massive transformer models at scale is fundamentally a memory management problem. Every GPU in a cluster has a fixed amount of VRAM, and as model sizes and context lengths grow, engineers constantly have to make trade-offs about how to distribute work across hardware. A new approach from Zyphra, called Tensor and Sequence Parallelism (TSP), offers a way to rethink that trade-off. In benchmark tests on up to 1,024 AMD MI300X GPUs, it consistently delivers lower per-GPU peak memory than any of the standard parallelism schemes used today, for both training and inference workloads.

https://www.zyphra.com/post/tsp

The Problem TSP Is Solving

To understand why TSP matters, you first need to understand the two parallelism strategies it folds together.

Tensor Parallelism (TP) splits model weights across GPUs. If you have a weight matrix in an attention or MLP layer, each GPU in the TP group holds only a fraction of that matrix. This directly reduces the per-GPU memory occupied by parameters, gradients, and optimizer states, the ‘model state’ memory. The trade-off is that TP requires collective communication operations (typically all-reduce or reduce-scatter/all-gather pairs) every time a layer is computed. This communication is proportional to activation size, so it becomes increasingly expensive as sequence length grows.
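A minimal single-process sketch of this pattern, simulating the TP ranks with a Python loop and NumPy rather than real GPUs and collectives (sizes and variable names are illustrative, not taken from Zyphra's implementation):

```python
import numpy as np

D, h, F = 4, 256, 4              # TP degree, hidden size, FFN expansion (toy values)
x = np.random.randn(8, h)        # activations (tokens x hidden), replicated on every rank
W1 = np.random.randn(h, F * h)   # up projection, sharded by columns
W2 = np.random.randn(F * h, h)   # down projection, sharded by rows

# Each rank holds a 1/D column shard of W1 and a 1/D row shard of W2.
W1_shards = np.split(W1, D, axis=1)
W2_shards = np.split(W2, D, axis=0)

# Forward pass: each rank computes a partial output with only its shards...
partials = [(x @ W1_shards[r]) @ W2_shards[r] for r in range(D)]

# ...then an all-reduce (a sum across ranks) gives every rank the full output.
y = sum(partials)
np.testing.assert_allclose(y, (x @ W1) @ W2, rtol=1e-6, atol=1e-8)
```

The sum at the end stands in for the per-layer all-reduce; its size scales with the activations, which is why TP communication grows with sequence length.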

Sequence Parallelism (SP) takes a different approach. Instead of splitting weights, it splits the input token sequence across GPUs. Each GPU processes only a fraction of the tokens, which reduces activation memory and the quadratic cost of attention computation. However, SP leaves model weights fully replicated on every GPU, which means model-state memory stays exactly the same no matter how many GPUs you add to the SP group.

In standard multi-dimensional parallelism, engineers combine TP and SP by placing them on orthogonal axes of a device mesh. If you want a TP degree of T and an SP degree of Σ, your model replica consumes T·Σ GPUs. This is costly in two ways. First, it uses more GPUs for the model-parallel group, leaving fewer available for data-parallel replicas. Second, if T·Σ is large enough to span multiple nodes, some collective communication has to travel over slower inter-node interconnects like InfiniBand or Ethernet instead of the high-bandwidth intra-node fabric, such as AMD Infinity Fabric or NVIDIA NVLink. Data Parallelism (DP), the other common baseline, avoids these model-parallel costs entirely but replicates all model state on every device, making it impractical for large models or long contexts on its own.
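As a quick illustrative calculation (assuming 8-GPU nodes; the degrees are made up for the example):

```python
# Device-mesh cost of placing TP and SP on orthogonal axes.
gpus_per_node = 8
T, sigma = 8, 4                          # TP degree and SP degree
replica_gpus = T * sigma                 # 32 GPUs consumed per model replica
nodes_spanned = replica_gpus // gpus_per_node
print(replica_gpus, nodes_spanned)       # 32 GPUs across 4 nodes: some collectives
                                         # must cross the slower inter-node fabric
```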

What Folding Actually Means

TSP’s core idea is parallelism folding: instead of placing TP and SP on separate, orthogonal mesh dimensions, it collapses both onto a single device-mesh axis of size D. Every GPU in the TSP group simultaneously holds 1/D of the model weights and 1/D of the token sequence. Because both are sharded across the same D GPUs, the per-device memory footprint decreases by 1/D for both parameter memory and activation memory, something no single standard parallelism scheme achieves on its own. TSP is therefore the only scheme that simultaneously reduces weight-proportional memory (parameters, gradients, optimizer states) and activation memory by the same 1/D factor on a single axis, without requiring a two-dimensional T·Σ device layout.
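A back-of-the-envelope sketch of that scaling, under heavy simplifying assumptions (bf16 weights plus a single activation buffer per layer; optimizer states, KV caches, and framework overheads are ignored, so the printed numbers are illustrative trends, not Zyphra's measurements):

```python
GB = 1024**3
params, bytes_el = 7e9, 2          # 7B reference model, bf16
h, layers, D = 4096, 32, 8         # hidden size, depth, parallel degree

def per_gpu_gb(seq_len, shard_weights, shard_tokens):
    weight_mem = params * bytes_el / (D if shard_weights else 1)
    act_mem = seq_len * h * layers * bytes_el / (D if shard_tokens else 1)
    return round((weight_mem + act_mem) / GB, 1)

for S in (16_384, 131_072):
    print(f"S={S}:",
          "TP", per_gpu_gb(S, True, False),    # weights sharded, activations replicated
          "SP", per_gpu_gb(S, False, True),    # activations sharded, weights replicated
          "TSP", per_gpu_gb(S, True, True))    # both sharded by the same factor D
```

Only the TSP column shrinks both terms, which is the whole point of folding.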

The challenge is that if each GPU only has part of the weights and part of the sequence, it needs to coordinate with other GPUs to complete each layer’s forward pass. TSP uses two different communication schedules to handle this, one for attention and one for the gated MLP.

For attention, TSP iterates over weight shards. At each step, one GPU broadcasts its packed attention weight shards (WQ, WK, WV, and WO) to all other GPUs in the group. Each GPU then applies those weights to its local sequence tokens to compute local Q, K, and V projections. Since causal attention requires access to the full key/value context, the local K and V tensors are all-gathered across the TSP group and reordered using a zigzag partition scheme before FlashAttention is applied. The zigzag partition ensures that the causal attention workload is balanced across ranks, since later tokens attend to larger prefixes and would otherwise cause load imbalance.
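The loop below is a minimal single-process sketch of that schedule, simulating the D ranks with Python lists and NumPy. The zigzag reordering, causal masking, FlashAttention kernel, and the output projection WO are omitted, and all names and sizes are illustrative rather than Zyphra's code:

```python
import numpy as np

D, S, h = 4, 32, 16                                  # TSP degree, tokens, hidden size (toy values)
x = np.split(np.random.randn(S, h), D)               # each rank holds S/D local tokens
W_shards = [np.random.randn(h, 3 * (h // D)) for _ in range(D)]  # packed Q/K/V shard per rank

out = [[] for _ in range(D)]
for src in range(D):
    W = W_shards[src]                                 # step 1: rank `src` broadcasts its weight shard
    qkv = [x[r] @ W for r in range(D)]                # step 2: every rank projects its local tokens
    q, k, v = zip(*[np.split(t, 3, axis=1) for t in qkv])
    k_full = np.concatenate(k)                        # step 3: all-gather K and V across the group
    v_full = np.concatenate(v)                        # (the real schedule zigzag-reorders them here)
    for r in range(D):                                # step 4: attention over the full K/V context
        scores = q[r] @ k_full.T / np.sqrt(q[r].shape[1])
        probs = np.exp(scores - scores.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        out[r].append(probs @ v_full)                 # this rank's output for src's head group
```

Note that the weights move once per step while the tokens never leave their rank; only K and V are gathered, which is the same activation-proportional cost SP already pays.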

For the gated MLP, TSP uses a ring schedule. Each GPU starts with local shards of the gate, up, and down projections. These weight shards circulate around the TSP group via point-to-point send/recv operations, and each GPU accumulates partial outputs locally as the shards arrive. Critically, this eliminates the all-reduce that standard TP requires for the MLP output: the sequence stays local, and only the weights move. The ring is designed to overlap weight transfers with GEMM computation, so communication happens in the background while the GPU is computing.
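Here is a minimal single-process sketch of that ring, simulating the ranks with a list and modeling the point-to-point send/recv by rotating which shard each rank holds. The activation function, the overlap with GEMMs, and real communication are omitted, and the names are illustrative:

```python
import numpy as np

D, S, h, F = 4, 32, 16, 4                              # toy sizes
x = np.split(np.random.randn(S, h), D)                 # each rank keeps its S/D local tokens
Wg = np.split(np.random.randn(h, F * h), D, axis=1)    # gate projection, column-sharded
Wu = np.split(np.random.randn(h, F * h), D, axis=1)    # up projection, column-sharded
Wd = np.split(np.random.randn(F * h, h), D, axis=0)    # down projection, row-sharded

acc = [np.zeros((S // D, h)) for _ in range(D)]        # partial outputs accumulate locally
holding = list(range(D))                               # which weight shard each rank holds right now
for _ in range(D):
    for r in range(D):
        s = holding[r]                                 # compute with the shard currently in hand
        acc[r] += ((x[r] @ Wg[s]) * (x[r] @ Wu[s])) @ Wd[s]   # gating nonlinearity omitted
    holding = holding[1:] + holding[:1]                # ring step: pass each shard to the next rank

# No all-reduce is needed at the end: each rank already holds the complete MLP
# output for its own tokens (the sum over intermediate-dimension blocks is exact).
X = np.concatenate(x)
ref = ((X @ np.concatenate(Wg, axis=1)) * (X @ np.concatenate(Wu, axis=1))) @ np.concatenate(Wd)
np.testing.assert_allclose(np.concatenate(acc), ref, rtol=1e-6, atol=1e-8)
```

The final check confirms that summing per-shard contributions locally reproduces the full gated MLP output, which is why the schedule can drop TP's output all-reduce entirely.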

Memory and Throughput Results

Tested on a single 8-GPU MI300X node across sequence lengths from 16K to 128K tokens, TSP achieves the lowest peak memory at every point. At 16K tokens, TSP and TP are nearly equal, 31.0 GB versus 31.5 GB per GPU, because model-state memory dominates at short context. At 128K tokens, the picture changes dramatically: TSP uses 38.8 GB per GPU, compared to 70.0 GB for TP and 85.0 GB and 140.0 GB for two different TP+SP factorizations on the same node. The theoretical figures throughout this analysis are based on a 7B dense decoder-only transformer reference model (hidden dimension h=4096, 32 layers, 32 query heads, 32 KV heads, FFN expansion factor F=4, bf16 precision), providing a reproducible baseline for comparing the schemes.

Throughput results on 128 full nodes (1,024 MI300X GPUs) show TSP consistently outperforming matched TP+SP baselines. At a folded degree of D=8 and a sequence length of 128K tokens, TSP achieves 173 million tokens per second compared to 66.30 million tokens per second for the matched TP+SP baseline (roughly a 2.6x speedup). The advantage grows with higher parallelism degree and longer sequence length.

https://www.zyphra.com/post/tsp

Practical Trade-offs to Understand

TSP does increase total communication volume compared to TP alone. It adds a weight-movement term per layer on top of the same activation-proportional K/V all-gather that SP uses. However, the research team shows that when batch size B and sequence length S satisfy BS > 8h (where h is the model’s embedding dimension), TSP’s forward communication volume is competitive with TP’s. This condition is met in most long-context training and inference scenarios.
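As a quick worked example of that condition, using the reference model’s embedding dimension (h = 4096): a single 128K-token sequence at batch size 1 already clears the threshold comfortably.

```python
h = 4096                   # embedding dimension of the 7B reference model
B, S = 1, 131_072          # a single 128K-token sequence
print(B * S, ">", 8 * h, "->", B * S > 8 * h)   # 131072 > 32768 -> True
```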

The key insight the Zyphra team emphasizes is that communication volume and communication cost are not the same thing. Whether extra communication volume translates into wall-clock slowdown depends on whether the collectives are latency-bound or bandwidth-bound, and on how much of that traffic can be overlapped with matrix multiplication. Their implementation pipelines weight transfers behind the dominant GEMM operations so that weight communication consumes bandwidth without adding to critical-path time.

TSP is not designed to replace TP, SP, or TP+SP in all settings. It is intended as an additional axis in the multi-dimensional parallelism design space. It composes orthogonally with pipeline parallelism, expert parallelism, and data parallelism. This means teams can slot TSP into an existing parallelism configuration wherever the standard layout would otherwise force model-parallel groups across slower inter-node links.

Key Takeaways

  • Zyphra’s Tensor and Sequence Parallelism (TSP) folds tensor parallelism and sequence parallelism onto a single device-mesh axis, so each GPU simultaneously holds 1/D of the model weights and 1/D of the token sequence, reducing memory overhead for both training and inference.
  • TSP is the only parallelism scheme that reduces both weight-proportional memory (parameters, gradients, optimizer states) and activation memory by the same 1/D factor on a single axis, without requiring a two-dimensional T·Σ device mesh.
  • Empirical results on a single 8-GPU MI300X node show TSP uses 38.8 GB per GPU at 128K sequence length, compared to 70.0 GB for TP and 85.0–140.0 GB for TP+SP configurations.
  • At large scale (1,024 MI300X GPUs, 128K context, D=8), TSP achieves 173 million tokens per second versus 66.30 million tokens per second for a matched TP+SP baseline (roughly a 2.6x throughput advantage).
  • TSP composes orthogonally with pipeline, expert, and data parallelism and is best suited to long-context, memory-constrained training and inference workloads where eliminating weight and activation replication outweighs the added communication volume.

Check out the Paper and technical details. Also, feel free to follow us on Twitter and don’t forget to join our 130k+ ML SubReddit and subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.


