
Google LiteRT NeuroPilot Stack Turns MediaTek Dimensity NPUs into First-Class Targets for On-Device LLMs

The new LiteRT NeuroPilot Accelerator from Google and MediaTek is a concrete step towards running real generative models on phones, laptops, and IoT hardware without shipping every request to a data center. It takes the existing LiteRT runtime and wires it directly into MediaTek's NeuroPilot NPU stack, so developers can deploy LLMs and embedding models with a single API surface instead of per-chip custom code.

What is LiteRT NeuroPilot Accelerator?

LiteRT is the successor to TensorFlow Lite. It is a high-performance runtime that sits on device, runs models in the .tflite FlatBuffer format, and can target CPU, GPU, and now NPU backends through a unified hardware acceleration layer.

LiteRT NeuroPilot Accelerator is the new NPU path for MediaTek hardware. It replaces the older TFLite NeuroPilot delegate with a direct integration into the NeuroPilot compiler and runtime. Instead of treating the NPU as a thin delegate, LiteRT now uses a Compiled Model API that understands Ahead-of-Time (AOT) compilation and on-device compilation, and exposes both through the same C++ and Kotlin APIs.

On the hardware side, the integration currently targets MediaTek Dimensity 7300, 8300, 9000, 9200, 9300 and 9400 SoCs, which together cover a large part of the Android mid-range and flagship device space.

Why Developers Care, A Unified Workflow For Fragmented NPUs

Historically, on-device ML stacks have been CPU- and GPU-first. NPU SDKs shipped as vendor-specific toolchains that required separate compilation flows per SoC, custom delegates, and manual runtime packaging. The result was a combinatorial explosion of binaries and a lot of device-specific debugging.

LiteRT NeuroPilot Accelerator replaces that with a three-step workflow that is the same regardless of which MediaTek NPU is present:

  • Convert or load a .tflite model as usual.
  • Optionally use the LiteRT Python tools to run AOT compilation and produce an AI Pack that is tied to one or more target SoCs.
  • Ship the AI Pack through Play for On-device AI (PODAI), then select Accelerator.NPU at runtime. LiteRT handles device targeting, runtime loading, and falls back to GPU or CPU if the NPU is not available (see the sketch after this list).
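
To make that fallback concrete, here is a minimal C++ sketch of the runtime selection step. It reuses the Options and CompiledModel calls from the snippet later in this article; the GPU and CPU accelerator constants are named by analogy with kLiteRtHwAcceleratorNpu, and the preference loop, the success check and the RunInference hook are illustrative assumptions rather than a documented LiteRT pattern.

// Sketch: prefer the NeuroPilot NPU, then fall back to GPU and CPU.
// The preference loop, the success check and RunInference are assumptions;
// the Options/CompiledModel calls mirror the snippet later in this article.
auto model = Model::CreateFromFile("model.tflite");
for (auto accelerator : {kLiteRtHwAcceleratorNpu,    // first choice
                         kLiteRtHwAcceleratorGpu,    // fallback 1 (assumed constant)
                         kLiteRtHwAcceleratorCpu}) { // fallback 2 (assumed constant)
  auto options = Options::Create();
  options->SetHardwareAccelerators(accelerator);
  auto compiled = CompiledModel::Create(*env, *model, *options);
  if (compiled) {                        // assumed Expected-style success check
    RunInference(std::move(*compiled));  // hypothetical application hook
    break;
  }
}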

For you as an engineer, the main change is that device-targeting logic moves into a structured configuration file and Play delivery, while the app code mostly interacts with CompiledModel and Accelerator.NPU.

AOT and on-device compilation are both supported. AOT compiles for a known SoC ahead of time and is recommended for larger models because it removes the cost of compiling on the user's device. On-device compilation is better for small models and generic .tflite distribution, at the cost of higher first-run latency. The blog shows that for a model such as Gemma-3-270M, pure on-device compilation can take more than 1 minute, which makes AOT the realistic choice for production LLM use.

Gemma, Qwen, And Embedding Models On MediaTek NPU

The stack is built around open-weight models rather than a single proprietary NLU path. Google and MediaTek list explicit, production-oriented support for:

  • Qwen3 0.6B, for text generation in markets such as mainland China.
  • Gemma-3-270M, a compact base model that is easy to fine-tune for tasks like sentiment analysis and entity extraction.
  • Gemma-3-1B, a multilingual text-only model for summarization and general reasoning.
  • Gemma-3n E2B, a multimodal model that handles text, audio and vision for things like real-time translation and visual question answering.
  • EmbeddingGemma 300M, a text embedding model for retrieval-augmented generation, semantic search and classification.

On the latest Dimensity 9500, running on a Vivo X300 Pro, the Gemma 3n E2B variant reaches more than 1600 tokens per second in prefill and 28 tokens per second in decode at a 4K context length when executed on the NPU.

For text generation use cases, LiteRT-LM sits on top of LiteRT and exposes a stateful engine with a text-in, text-out API. A typical C++ flow is to create a ModelProperty, build an Engine with litert::lm::Backend::NPU, then create a Session and call GenerateContent per conversation. For embedding workloads, EmbeddingGemma uses the lower-level LiteRT CompiledModel API in a tensor-in, tensor-out configuration, again with the NPU selected through the hardware accelerator options.
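
A minimal sketch of that LiteRT-LM path might look like the following. Only ModelProperty, litert::lm::Backend::NPU, Engine, Session and GenerateContent are taken from the description above; the model file name, the factory method names and the exact signatures are assumptions for illustration.

// Minimal text-in, text-out sketch of the LiteRT-LM flow described above.
// Only ModelProperty, Backend::NPU, Engine, Session and GenerateContent are
// named in the text; the file name, factory names and signatures are assumed.
litert::lm::ModelProperty model_property("gemma-3-1b.litertlm");  // assumed constructor and file
auto engine = litert::lm::Engine::Create(model_property, litert::lm::Backend::NPU);

// One Session per conversation; the session keeps the dialog state.
auto session = engine->CreateSession();
auto reply = session->GenerateContent("Summarize the last three messages in one sentence.");

Keeping one Engine per model and one Session per conversation follows the stateful-engine framing in the paragraph above, so conversation state stays isolated from the shared model weights.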

Developer Experience, C++ Pipeline And Zero-Copy Buffers

LiteRT introduces a new C++ API that replaces the older C entry points and is designed around explicit Environment, Model, CompiledModel and TensorBuffer objects.

For MediaTek NPUs, this API integrates tightly with Android's AHardwareBuffer and GPU buffers. You can construct input TensorBuffer instances directly from OpenGL or OpenCL buffers with TensorBuffer::CreateFromGlBuffer, which lets image-processing code feed NPU inputs without an intermediate copy through CPU memory. This is important for real-time camera and video processing, where multiple copies per frame quickly saturate memory bandwidth.
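
To illustrate that zero-copy path, the sketch below wraps an OpenGL buffer that an image-processing pass has already filled and feeds it to a compiled model built as in the basic snippet shown after it. Only the TensorBuffer::CreateFromGlBuffer name comes from the description above; the parameter list, the RankedTensorType construction, the buffer size and the GetCameraFrameBuffer helper are assumptions.

// Sketch: wrap a GL shader-storage buffer holding one preprocessed frame as a
// model input, avoiding a copy through CPU memory. The CreateFromGlBuffer
// parameters and the tensor type construction are assumptions; only the
// function name is taken from the text above.
GLuint frame_ssbo = GetCameraFrameBuffer();              // hypothetical app helper
size_t frame_bytes = 1 * 224 * 224 * 3 * sizeof(float);  // one RGB float frame

auto input_type = RankedTensorType(ElementType::Float32, Layout({1, 224, 224, 3}));  // assumed
auto gl_input = TensorBuffer::CreateFromGlBuffer(
    *env, input_type, GL_SHADER_STORAGE_BUFFER, frame_ssbo, frame_bytes,
    /*buffer_offset=*/0);

// Use the wrapped buffer in place of a CPU-allocated input buffer.
std::vector<TensorBuffer> input_buffers;
input_buffers.push_back(std::move(*gl_input));
auto output_buffers = compiled->CreateOutputBuffers();
compiled->Run(input_buffers, output_buffers);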

For the basic path without GL interop, a typical high-level C++ flow on device looks like this, omitting error handling for readability:

// Load a model compiled for the NPU
auto model = Model::CreateFromFile("model.tflite");
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);

// Create the compiled model
auto compiled = CompiledModel::Create(*env, *model, *options);

// Allocate buffers and run
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
input_buffers[0].Write<float>(input_span);
compiled->Run(input_buffers, output_buffers);
output_buffers[0].Read<float>(output_span);

The same Compiled Model API is used whether you are targeting the CPU, GPU or the MediaTek NPU, which reduces the amount of conditional logic in application code.

Key Takeaways

  1. LiteRT NeuroPilot Accelerator is the new, first-class NPU integration between LiteRT and MediaTek NeuroPilot, replacing the old TFLite delegate and exposing a unified Compiled Model API with AOT and on-device compilation on supported Dimensity SoCs.
  2. The stack targets concrete open-weight models, including Qwen3-0.6B, Gemma-3-270M, Gemma-3-1B, Gemma-3n-E2B and EmbeddingGemma-300M, and runs them through LiteRT and LiteRT-LM on MediaTek NPUs with a single accelerator abstraction.
  3. AOT compilation is strongly recommended for LLMs, for example Gemma-3-270M can take more than 1 minute to compile on device, so production deployments should compile once in the pipeline and ship AI Packs via Play for On-device AI.
  4. On a Dimensity 9500 class NPU, Gemma-3n-E2B can reach more than 1600 tokens per second in prefill and 28 tokens per second in decode at 4K context, with measured throughput up to 12 times CPU and 10 times GPU for LLM workloads.
  5. For developers, the C++ and Kotlin LiteRT APIs provide a common path to select Accelerator.NPU, manage compiled models and use zero-copy tensor buffers, so CPU, GPU and MediaTek NPU targets can share one code path and one deployment workflow.
