Google DeepMind Releases Gemma 4 QAT Checkpoints: Q4_0 and a New Mobile Format Cut On-Device Memory

Google DeepMind released Quantization-Aware Training (QAT) checkpoints for the Gemma 4 family. The release targets local deployment on edge devices and consumer GPUs. It follows the Gemma 4 launch in April and a 12B model two days earlier.

We compared the available Gemma 4 edge-model formats using only published numbers. The goal was simple: show what each precision level costs in memory, then show what QAT actually changes.

What QAT actually does

Quantization shrinks a model by reducing weight precision. Standard Post-Training Quantization (PTQ) compresses a finished model. That often degrades quality. QAT instead simulates quantization during training, so the model learns to compensate for the precision loss.
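To make "simulates quantization" concrete, here is a minimal sketch of the fake-quantization round trip that QAT setups typically place in the forward pass. It is an illustration under stated assumptions, not Google's training recipe, which the announcement does not describe: the symmetric per-tensor grid, the `fake_quantize` name, and the random toy weights are all assumptions made for this example.

```python
import random

def fake_quantize(weights, bits=4):
    """Snap floats to a symmetric integer grid, then map them back to floats."""
    qmax = 2 ** (bits - 1) - 1               # 7 for symmetric 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax                   # one scale for the whole tensor (assumption)
    out = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code in [-qmax, qmax]
        out.append(q * scale)                # dequantized value seen by the forward pass
    return out

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(32)]   # toy weights, not model data
simulated = fake_quantize(weights, bits=4)
mse = sum((w - s) ** 2 for w, s in zip(weights, simulated)) / len(weights)
print(f"distinct levels: {len(set(simulated))}, mse: {mse:.5f}")
```

In a training loop, the loss would be computed on the simulated weights while updates are applied to the full-precision copy, which is how the model can adapt to the rounding error. PTQ applies the same rounding once, after training, with no chance to adapt.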

Google’s AI team states that its QAT results yield higher overall quality than standard PTQ baselines. Google did not publish Gemma 4 QAT benchmark scores in the announcement. For context, Gemma 3 QAT cut the Q4_0 perplexity drop by 54% using llama.cpp evaluation. We cite that only as prior-generation precedent.
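One way to read the 54% figure is as the share of the PTQ perplexity degradation that QAT removes. The sketch below shows that arithmetic. The function names are hypothetical and the perplexity values are made-up placeholders chosen only to illustrate the calculation; they are not Gemma 3 or Gemma 4 measurements.

```python
import math

def perplexity(token_logprobs):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def degradation_cut(ppl_base, ppl_ptq, ppl_qat):
    """Fraction of the PTQ perplexity increase that QAT recovers."""
    ptq_gap = ppl_ptq - ppl_base
    if ptq_gap == 0:
        raise ValueError("PTQ shows no degradation, so there is nothing to recover")
    qat_gap = ppl_qat - ppl_base
    return 1.0 - qat_gap / ptq_gap

# Placeholder perplexities (hypothetical, for illustration only).
cut = degradation_cut(ppl_base=10.0, ppl_ptq=11.0, ppl_qat=10.46)
print(f"QAT removes {cut:.0%} of the PTQ perplexity gap")
```

Under this reading, a 54% cut means the quantized model still trails the full-precision baseline, just by less than half of the PTQ gap.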

The comparison task

Compare Gemma 4 E2B and E4B across three formats. The formats are BF16, Q4_0 QAT, and the new mobile QAT schema. Rank them on memory footprint, quality preservation, and on-device accessibility. Use published figures only.

Memory results

| Format | E2B | E4B | Basis |
| --- | --- | --- | --- |
| BF16 (16-bit) | 9.6 GB | 15 GB | Official Gemma 4 docs |
| Q4_0 (4-bit, QAT) | 3.2 GB | 5 GB | Official Gemma 4 docs |
| Mobile (QAT, E2B) | ~1 GB | | QAT announcement |

The Q4_0 figures match the footprint of PTQ Q4_0. QAT does not change the size at a given format. It improves quality at that size. The new mobile schema delivers the additional reduction.

Using that mobile schema, Google reduced Gemma 4 E2B to about 1 GB. Developers can go lower still. The text-only model without Per-Layer Embeddings needs under 1 GB, dropping the audio and vision encoders.
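The reduction factors implied by the published footprints can be checked with a few lines of arithmetic. This sketch uses only the figures from the memory table; the dictionary layout and function name are illustrative, and the mobile entry is entered as 1.0 GB even though the source gives it as "about 1 GB".

```python
# Published footprints in GB, copied from the memory table.
FOOTPRINT_GB = {
    "E2B": {"BF16": 9.6, "Q4_0 QAT": 3.2, "Mobile QAT": 1.0},  # mobile is "about 1 GB"
    "E4B": {"BF16": 15.0, "Q4_0 QAT": 5.0},                    # no mobile figure in the table
}

def reduction_vs_bf16(model, fmt):
    """How many times smaller a format is than the BF16 checkpoint."""
    sizes = FOOTPRINT_GB[model]
    return sizes["BF16"] / sizes[fmt]

for model, sizes in FOOTPRINT_GB.items():
    for fmt in sizes:
        if fmt != "BF16":
            ratio = reduction_vs_bf16(model, fmt)
            print(f"{model} {fmt}: {ratio:.1f}x smaller than BF16")
```

Both models shrink by the same factor of about 3 at Q4_0, while the mobile schema takes E2B to roughly a tenth of its BF16 size.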

Per-format breakdown

BF16 is the quality baseline. E2B needs 9.6 GB and E4B needs 15 GB. It is the reference point, not a phone deployment target.

Q4_0 QAT is the general-purpose local format. E2B drops to 3.2 GB and E4B to 5 GB. QAT preserves more quality here than PTQ at the same size. This format fits consumer GPUs. Earlier E2B testing also ran on a Raspberry Pi 5 at INT4.
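For readers wondering why a "4-bit" format yields a 3x rather than 4x reduction, the storage layout of Q4_0 in ggml/llama.cpp gives part of the answer: weights are grouped into blocks of 32, and each block carries one 16-bit scale alongside its 4-bit codes. The sketch below works out the effective bits per weight. The comparison with the published ratio is our own arithmetic; the source does not break down where the remaining gap comes from.

```python
BLOCK_WEIGHTS = 32                   # weights per Q4_0 block in ggml/llama.cpp
SCALE_BYTES = 2                      # one float16 scale per block
NIBBLE_BYTES = BLOCK_WEIGHTS // 2    # two 4-bit codes packed per byte

def q4_0_bits_per_weight():
    """Effective storage cost of one weight, including the shared scale."""
    block_bytes = SCALE_BYTES + NIBBLE_BYTES      # 18 bytes per block
    return block_bytes * 8 / BLOCK_WEIGHTS

bpw = q4_0_bits_per_weight()
ideal_ratio = 16 / bpw               # BF16 stores 16 bits per weight
published_ratio = 9.6 / 3.2          # E2B figures from the memory table
print(f"{bpw} bits/weight, ideal {ideal_ratio:.2f}x, published {published_ratio:.2f}x")
```

The published ratio sits below the ideal one. A plausible explanation is that some tensors stay at higher precision or that the documented figures include overhead beyond the weights, but that is an assumption rather than something the source states.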

The mobile format is the edge-specialized schema. It brings E2B to about 1 GB. It uses static activations, channel-wise quantization, and targeted 2-bit compression.

How the mobile schema works

Google’s AI team engineered four techniques for mobile hardware. Static activations pre-calculate scaling during training, reducing on-device work. Channel-wise quantization matches the design of mobile accelerators. Targeted 2-bit quantization compresses only the token-generation layers. Embedding and KV cache optimization shrinks the active memory footprint.
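As a rough illustration of what channel-wise quantization buys, the sketch below compares one shared scale for a whole weight matrix against one scale per row. It is a toy example, not the mobile schema itself: treating each row as an output channel, the symmetric 4-bit grid, and the function names are all assumptions made here.

```python
def quantize_row(row, scale, qmax):
    """Round each value to the integer grid and map it back to a float."""
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in row]

def per_tensor(matrix, bits=4):
    """One scale shared by every channel."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for row in matrix for x in row) / qmax or 1.0
    return [quantize_row(row, scale, qmax) for row in matrix]

def per_channel(matrix, bits=4):
    """One scale per row; each row is treated as an output channel (assumption)."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in matrix:
        scale = max(abs(x) for x in row) / qmax or 1.0
        out.append(quantize_row(row, scale, qmax))
    return out

def mse(a, b):
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Toy weights: the second channel is 100x smaller than the first.
weights = [[8.0, -6.0, 4.0, 2.0], [0.08, -0.06, 0.04, 0.02]]
print("per-tensor small channel: ", per_tensor(weights)[1])
print("per-channel small channel:", per_channel(weights)[1])
print("mse:", mse(weights, per_tensor(weights)), mse(weights, per_channel(weights)))
```

With a shared scale, the large channel sets the grid and the small channel rounds to zero. Per-channel scales keep both usable at the cost of storing one extra scale per channel.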

Core reasoning layers stay at higher precision. That protects capability while cutting storage. Developers can also deploy text-only and drop the audio and vision encoders. That trims memory further for use cases that need no multimodality.

Dimension breakdown

Scores are a qualitative ranking of the formats for on-device use. Memory is the only hard-measured axis. Quality reflects Google’s disclosed design, not measured Gemma 4 numbers. Each score has a one-line basis.

| Dimension | BF16 | Q4_0 QAT | Mobile QAT |
| --- | --- | --- | --- |
| Memory footprint | 1 (heaviest, 9.6 GB E2B) | 4 (3.2 GB E2B) | 5 (~1 GB E2B text-only) |
| Quality preservation | 5 (full-precision baseline) | 4 (QAT-preserved, near baseline) | 3 (2-bit token layers, core kept higher) |
| Decode speed | 2 (no quantization speedup) | 4 (4-bit accelerates decode) | 5 (mobile-optimized static activations) |
| Deployment breadth | 4 (loadable but heavy) | 5 (llama.cpp, Ollama, LM Studio, vLLM, MLX) | 3 (LiteRT-LM, Transformers.js, edge-focused) |
| On-device accessibility | 1 (needs large GPU) | 4 (consumer GPU, Raspberry Pi 5) | 5 (runs on phones) |
| Total (/25) | 13 | 21 | 21 |
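The totals can be re-derived from the per-dimension scores. The short sketch below sums them with equal weight per dimension, which is what the table implies, and lists which format leads each dimension. The data structure and function names are illustrative.

```python
# Per-dimension scores copied from the table (1 = worst, 5 = best).
SCORES = {
    "Memory footprint":        {"BF16": 1, "Q4_0 QAT": 4, "Mobile QAT": 5},
    "Quality preservation":    {"BF16": 5, "Q4_0 QAT": 4, "Mobile QAT": 3},
    "Decode speed":            {"BF16": 2, "Q4_0 QAT": 4, "Mobile QAT": 5},
    "Deployment breadth":      {"BF16": 4, "Q4_0 QAT": 5, "Mobile QAT": 3},
    "On-device accessibility": {"BF16": 1, "Q4_0 QAT": 4, "Mobile QAT": 5},
}

def totals(scores):
    """Sum each format's scores across all dimensions, equally weighted."""
    result = {}
    for per_format in scores.values():
        for fmt, score in per_format.items():
            result[fmt] = result.get(fmt, 0) + score
    return result

def leaders(scores):
    """Formats holding the top score on each dimension."""
    out = {}
    for dim, per_format in scores.items():
        best = max(per_format.values())
        out[dim] = [fmt for fmt, s in per_format.items() if s == best]
    return out

print(totals(SCORES))
for dim, fmts in leaders(SCORES).items():
    print(f"{dim}: {', '.join(fmts)}")
```

The per-dimension leaders show why the tie is not a toss-up: the mobile format wins on memory, speed, and accessibility, while Q4_0 QAT wins on deployment breadth and sits one point ahead on quality.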

Winner

The result is a tie by design. Q4_0 QAT and mobile QAT both score 21, but for different hardware. For phones, the mobile format leads. It reaches about 1 GB on E2B and targets mobile accelerators directly. For laptops and consumer GPUs, Q4_0 QAT is the practical default. BF16 remains the quality reference, not a local deployment option.

Methodology and limits

Memory figures come from Google’s Gemma 4 documentation. The ~1 GB E2B figure comes from the QAT announcement. Quality is Google’s stated claim. No independent Gemma 4 QAT quality numbers were published at launch. We did not run the models locally for this comparison. Developers should test with their own quantization and workload before building.

Key Takeaways

  • Q4_0 QAT cuts Gemma 4 E2B to 3.2 GB and E4B to 5 GB, from 9.6 GB and 15 GB at BF16.
  • A new mobile QAT schema brings E2B to about 1 GB; text-only without PLE goes under 1 GB.
  • QAT changes quality at a given size, not the size itself; the mobile format drives the extra memory cut.
  • Google claims higher quality than PTQ but published no Gemma 4 QAT benchmark numbers at launch.
  • Weights ship today on Hugging Face with llama.cpp, Ollama, LM Studio, vLLM, MLX, and LiteRT-LM support.

Marktechpost’s Visual Explainer

Marktechpost · Benchmark

Gemma 4 QAT: Comparing Q4_0 and the New Mobile Format

Google DeepMind released Quantization-Aware Training checkpoints for Gemma 4. We compared three edge-model formats on published numbers.

Formats compared

BF16 (16-bit)  ·  Q4_0 QAT (4-bit)  ·  Mobile QAT

June 5, 2026

The Comparison Task

What we ranked

$ compare gemma-4 --models E2B,E4B \
    --formats BF16,Q4_0-QAT,MOBILE-QAT \
    --rank memory,quality,accessibility \
    --source published-only --no-self-run

Memory from official Gemma 4 docs. Quality from Google’s stated claim. No models run locally.

Format 1 of 3 · Reference

BF16 (16-bit)

13 / 25

The full-precision quality baseline. E2B needs 9.6 GB and E4B needs 15 GB.

Top observation: a reference point, not a phone or laptop deployment target.

Format 2 of 3 · Laptop / GPU

Q4_0 QAT (4-bit)

21 / 25

The general-purpose local format. E2B drops to 3.2 GB and E4B to 5 GB.

Top observation: QAT preserves more quality than PTQ at the same 4-bit size.

Format 3 of 3 · Mobile

Mobile QAT

21 / 25

The edge-specialized schema. Brings E2B to about 1 GB.

Top observation: 2-bit on token layers, reasoning layers kept at higher precision.

Leaderboard

Full scoring

| Dimension | BF16 | Q4_0 QAT | Mobile QAT |
| --- | --- | --- | --- |
| Memory footprint | 1 | 4 | 5 |
| Quality preservation | 5 | 4 | 3 |
| Decode speed | 2 | 4 | 5 |
| Deployment breadth | 4 | 5 | 3 |
| On-device accessibility | 1 | 4 | 5 |
| Total | 13 | 21 | 21 |

Tie by design: Q4_0 wins laptops and GPUs; mobile wins phones.

Key Takeaways

What developers should know

  • Q4_0 QAT cuts E2B to 3.2 GB and E4B to 5 GB, from 9.6 GB and 15 GB at BF16.
  • A new mobile QAT schema brings E2B to about 1 GB; text-only without PLE goes under 1 GB.
  • QAT changes quality at a given size; the mobile format drives the extra memory cut.
  • Google claims higher quality than PTQ but published no Gemma 4 QAT numbers.
  • Weights ship today on Hugging Face with llama.cpp, Ollama, vLLM, and MLX support.

Memory: official Gemma 4 documentation. ~1 GB E2B: QAT announcement (mobile format); text-only without PLE is under 1 GB. Quality: Google’s stated claim; no independent Gemma 4 QAT scores at launch.

