
StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows

Why treat LLM inference as batched kernels that round-trip through DRAM when a dataflow compiler can pipe tiles through on-chip FIFOs and stream converters? StreamTensor is a compiler that lowers PyTorch LLM graphs (GPT-2, Llama, Qwen, Gemma) into stream-scheduled dataflow accelerators on AMD's Alveo U55C FPGA. The system introduces an iterative tensor ("itensor") type to encode the tiling and ordering of streams, enabling provably correct inter-kernel streaming and automatic insertion and sizing of DMA engines, FIFOs, and format converters. On LLM decoding workloads, the research team reports latency as low as 0.64× that of GPUs and energy efficiency up to 1.99× higher.

https://arxiv.org/pdf/2509.13694

What StreamTensor does

StreamTensor compiles PyTorch graphs into a stream-oriented dataflow design in which intermediate tiles largely avoid off-chip DRAM round-trips: results are forwarded through on-chip FIFOs to downstream kernels, fusion keeps producers and consumers on chip, and DMAs are inserted only when required. The compiler's central abstraction, iterative tensors (itensors), records iteration order, tiling, and format, which makes inter-kernel stream compatibility explicit and drives converter generation only where needed. The framework also searches hierarchically over tiling, fusion, and resource allocation, and uses a linear program to size FIFOs so as to avoid stalls and deadlock while minimizing on-chip memory.
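As a rough illustration of the entry point, here is a minimal sketch of lowering a PyTorch module to the Linalg dialect via torch-mlir. The `torch_mlir.compile` API shown here has changed across torch-mlir releases, and everything after this lowering (the dataflow IR, itensor typing, and FIFO sizing) is StreamTensor's own and not shown:

```python
import torch
import torch_mlir  # pip install torch-mlir; exact API varies by release

class TinyMLP(torch.nn.Module):
    """Stand-in for an LLM block; StreamTensor targets full LLM graphs."""
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(768, 3072)
        self.fc2 = torch.nn.Linear(3072, 768)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

# Lower the PyTorch graph to MLIR's Linalg-on-tensors form, the level at
# which StreamTensor's own transformation and exploration passes begin.
module = torch_mlir.compile(
    TinyMLP().eval(),
    torch.randn(1, 768),
    output_type="linalg-on-tensors",
)
print(module)
```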


What's actually new?

  • Hierarchical DSE. The compiler explores three design spaces: (i) tiling, unrolling, vectorization, and permutation at the Linalg level; (ii) fusion under memory/resource constraints; and (iii) resource allocation and stream widths, optimizing for sustained throughput under bandwidth limits.
  • End-to-end PyTorch → system flow. Models enter through Torch-MLIR, are lowered to MLIR Linalg, and then to a dataflow IR whose nodes become hardware kernels with explicit streams and host/runtime glue; no manual RTL assembly is required.
  • Iterative tensor (itensor) typing system. A first-class tensor type expresses iteration order, tiling, and affine maps. This makes stream order explicit, enables safe kernel fusion, and lets the compiler synthesize minimal buffer/format converters when producers and consumers disagree (a toy version of this check is sketched after this list).
  • Formal FIFO sizing. Inter-kernel buffering is solved with a linear-programming formulation to avoid stalls and deadlocks while minimizing on-chip memory usage (BRAM/URAM); see the LP sketch below.
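To make the itensor idea concrete, here is a minimal Python sketch of the compatibility check such a type enables. The class and its fields are invented for illustration and are not the paper's actual IR:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ITensor:
    """Toy stand-in for StreamTensor's iterative tensor type."""
    shape: tuple        # full logical tensor shape, e.g. (1024, 768)
    tile: tuple         # tile shape streamed per step, e.g. (64, 256)
    loop_order: tuple   # iteration order over tiles, e.g. ("i", "j")

def can_stream_directly(producer: ITensor, consumer: ITensor) -> bool:
    # Two kernels can be connected by a plain FIFO only if they agree on
    # the logical shape, the tile granularity, and the tile ordering.
    return (producer.shape == consumer.shape
            and producer.tile == consumer.tile
            and producer.loop_order == consumer.loop_order)

a = ITensor(shape=(1024, 768), tile=(64, 256), loop_order=("i", "j"))
b = ITensor(shape=(1024, 768), tile=(64, 256), loop_order=("j", "i"))

if not can_stream_directly(a, b):
    # The compiler would synthesize a minimal buffer/format converter here
    # instead of spilling the whole intermediate tensor to DRAM.
    print("insert reorder converter between kernels")
```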
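The FIFO-sizing step can likewise be illustrated with a toy linear program (via scipy.optimize.linprog). The graph, latency skew, and constraints below are invented placeholders that mimic the shape of a stall/deadlock-freedom formulation; the paper's actual constraint system is derived from its dataflow analysis:

```python
import numpy as np
from scipy.optimize import linprog

# Toy dataflow: kernel A feeds kernels B and C, which reconverge at D.
# Decision variables: FIFO depths d0 (A->B), d1 (A->C), d2 (B->D), d3 (C->D).
# Objective: minimize total on-chip buffer memory (sum of depths).
c = np.ones(4)

# Deadlock freedom on the reconvergent paths: the buffering along one
# branch must absorb the latency skew of the other. With an assumed skew
# of 8 tiles between the two branches:
#   d0 + d2 >= d1 + 8   and   d1 + d3 >= d0 + 8
A_ub = np.array([
    [-1.0,  1.0, -1.0,  0.0],   # d1 + 8 - d0 - d2 <= 0
    [ 1.0, -1.0,  0.0, -1.0],   # d0 + 8 - d1 - d3 <= 0
])
b_ub = np.array([-8.0, -8.0])

# Every FIFO needs at least depth 2 (double buffering).
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(2, None)] * 4,
              method="highs")
print("FIFO depths:", res.x)  # e.g. [2., 2., 8., 8.]
```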

Results

Latency: as low as 0.76× vs. prior FPGA LLM accelerators and 0.64× vs. a GPU baseline on GPT-2. Energy efficiency: up to 1.99× vs. an A100 on emerging LLMs (model-dependent). Platform context: Alveo U55C (16 GB HBM2 at 460 GB/s, PCIe Gen3×16 or dual Gen4×8, 2× QSFP28).


Our Comments

The useful contribution here is a PyTorch→Torch-MLIR→dataflow compiler that emits stream-scheduled kernels and a host/runtime for AMD's Alveo U55C; the iterative tensor type plus linear-programming-based FIFO sizing enables safe inter-kernel streaming rather than DRAM round-trips. On reported LLM decoding benchmarks across GPT-2, Llama, Qwen, and Gemma, the research team shows geometric-mean latency as low as 0.64× vs. a GPU baseline and energy efficiency up to 1.99×, with scope limited to decoding workloads. The hardware context is clear: the Alveo U55C provides 16 GB HBM2 at 460 GB/s with dual QSFP28 and PCIe Gen3×16 or dual Gen4×8, which aligns with the streaming dataflow design.


Check out the Paper. Feel free to look at our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows appeared first on MarkTechPost.
