Google DeepMind Releases Gemma 4 QAT Checkpoints: Q4_0 and a New Mobile Format Cut On-Device Memory
Google DeepMind released Quantization-Aware Training (QAT) checkpoints for the Gemma 4 family. The release targets local deployment on edge devices and consumer GPUs. It follows the Gemma 4 launch in April and a 12B model released two days earlier.
We compared the available Gemma 4 edge-model formats using only published numbers. The goal was simple: show what each precision level costs in memory, then show what QAT actually changes.
What QAT actually does
Quantization shrinks a model by reducing weight precision. Standard Post-Training Quantization (PTQ) compresses a finished model, which often degrades quality. QAT instead simulates quantization during training, so the model learns to compensate for the precision loss.
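To make "simulates quantization during training" concrete, here is a minimal sketch of the quantize-then-dequantize step that QAT-style training typically places in the forward pass. It assumes a symmetric 4-bit grid with a single scale; the function name `fake_quantize` and the sample weights are illustrative, and the announcement does not describe Google’s actual training recipe.

```python
def fake_quantize(weights, bits=4):
    """Round weights onto a symmetric signed integer grid, then map back to floats.

    Illustrative only: a single scale for the whole list is an assumption,
    not a detail taken from the Gemma 4 announcement.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized], scale


weights = [0.70, -0.33, 0.12, -0.04]                # made-up example values
dequantized, scale = fake_quantize(weights)
errors = [abs(w - d) for w, d in zip(weights, dequantized)]

for w, d, e in zip(weights, dequantized, errors):
    print(f"original={w:+.2f}  dequantized={d:+.2f}  error={e:.2f}")
```

Each reconstructed weight lands within half a quantization step of the original. PTQ applies this rounding once after training; QAT exposes the network to the same rounding error while it is still learning, so the remaining weights can adjust.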
Google’s AI team states that its QAT results yield higher overall quality than standard PTQ baselines. Google did not publish Gemma 4 QAT benchmark scores in the announcement. For context, Gemma 3 QAT cut the Q4_0 perplexity drop by 54% under llama.cpp evaluation. We cite that only as prior-generation precedent.
The comparison task
Compare Gemma 4 E2B and E4B across three formats: BF16, Q4_0 QAT, and the new mobile QAT schema. Rank them on memory footprint, quality preservation, and on-device accessibility. Use published figures only.
Memory results
| Format | E2B | E4B | Basis |
|---|---|---|---|
| BF16 (16-bit) | 9.6 GB | 15 GB | Official Gemma 4 docs |
| Q4_0 (4-bit, QAT) | 3.2 GB | 5 GB | Official Gemma 4 docs |
| Mobile (QAT, E2B) | ~1 GB | — | QAT announcement |
The Q4_0 figures match the footprint of PTQ Q4_0. QAT does not change the size at a given format. It improves quality at that size. The new mobile schema delivers the additional reduction.
Using that mobile schema, Google reduced Gemma 4 E2B to about 1 GB. Developers can go lower still. The text-only model without Per-Layer Embeddings needs under 1 GB, dropping the audio and vision encoders.
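The reductions implied by those published figures are simple ratios. A short sketch that computes them; the dictionary layout and function name are our own, and the mobile entry uses 1.0 GB as a stand-in for "about 1 GB":

```python
# Published footprints in GB, copied from the memory table above.
FOOTPRINT_GB = {
    "E2B": {"BF16": 9.6, "Q4_0 QAT": 3.2, "Mobile QAT": 1.0},  # mobile is "about 1 GB"
    "E4B": {"BF16": 15.0, "Q4_0 QAT": 5.0},                    # no mobile figure published
}


def reduction_vs_bf16(model, fmt):
    """Return how many times smaller `fmt` is than BF16 for the given model."""
    sizes = FOOTPRINT_GB[model]
    return sizes["BF16"] / sizes[fmt]


for model, sizes in FOOTPRINT_GB.items():
    for fmt in sizes:
        if fmt != "BF16":
            ratio = reduction_vs_bf16(model, fmt)
            print(f"{model} {fmt}: {ratio:.1f}x smaller than BF16")
```

On these numbers, Q4_0 is roughly a 3x reduction for both models, and the mobile schema is roughly a further 3x on top of that for E2B.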
Per-format breakdown
BF16 is the quality baseline. E2B needs 9.6 GB and E4B needs 15 GB. It is the reference point, not a phone deployment target.
Q4_0 QAT is the general-purpose local format. E2B drops to 3.2 GB and E4B to 5 GB. QAT preserves more quality here than PTQ at the same size. This format fits consumer GPUs. Earlier E2B testing also ran on a Raspberry Pi 5 at INT4.
The mobile format is the edge-specialized schema. It brings E2B to about 1 GB. It uses static activations, channel-wise quantization, and targeted 2-bit compression.
How the mobile schema works
Google’s AI team engineered four techniques for mobile hardware. Static activations pre-calculate scaling factors during training, reducing on-device work. Channel-wise quantization matches the design of mobile accelerators. Targeted 2-bit quantization compresses only the token-generation layers. Embedding and KV cache optimization shrinks the active memory footprint.
Core reasoning layers stay at higher precision. That protects capability while cutting storage. Developers can also deploy text-only and drop the audio and vision encoders. That trims memory further for use cases that need no multimodality.
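Of the four techniques, channel-wise quantization is the easiest to illustrate in isolation. The sketch below contrasts one scale for a whole weight matrix with one scale per row. It assumes rows are output channels and a symmetric 4-bit grid; neither detail is confirmed by the announcement, and all names and values are illustrative.

```python
def scale_for(values, qmax):
    """Symmetric scale mapping the largest magnitude onto qmax (1.0 if all zero)."""
    largest = max(abs(v) for v in values)
    return largest / qmax if largest > 0 else 1.0


def quantize_dequantize(values, scale, qmax):
    """Round to the integer grid defined by `scale`, clamp, and map back to floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_tensor(matrix, bits=4):
    """One shared scale for every element of the matrix."""
    qmax = 2 ** (bits - 1) - 1
    scale = scale_for([v for row in matrix for v in row], qmax)
    return [quantize_dequantize(row, scale, qmax) for row in matrix]


def per_channel(matrix, bits=4):
    """A separate scale for each row (assumed here to be one output channel)."""
    qmax = 2 ** (bits - 1) - 1
    return [quantize_dequantize(row, scale_for(row, qmax), qmax) for row in matrix]


def total_error(original, approx):
    """Sum of absolute differences between two matrices of the same shape."""
    return sum(abs(x - y) for ro, ra in zip(original, approx) for x, y in zip(ro, ra))


matrix = [
    [7.0, -3.0, 1.0],     # large-magnitude channel
    [0.07, -0.03, 0.01],  # small-magnitude channel
]

print("per-tensor error: ", total_error(matrix, per_tensor(matrix)))
print("per-channel error:", total_error(matrix, per_channel(matrix)))
```

With a single shared scale, the small-magnitude row rounds entirely to zero. Giving each channel its own scale keeps both rows representable at the same bit width, at the cost of storing one scale per channel.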
Dimension breakdown
Scores are a qualitative ranking of the formats for on-device use. Memory is the only hard-measured axis. Quality reflects Google’s disclosed design, not measured Gemma 4 numbers. Each score has a one-line basis.
| Dimension | BF16 | Q4_0 QAT | Mobile QAT |
|---|---|---|---|
| Memory footprint | 1: heaviest, 9.6 GB E2B | 4: 3.2 GB E2B | 5: ~1 GB E2B, under 1 GB text-only |
| Quality preservation | 5: full-precision baseline | 4: QAT-preserved, near baseline | 3: 2-bit token layers, core kept higher |
| Decode speed | 2: no quantization speedup | 4: 4-bit speeds up decode | 5: mobile-optimized static activations |
| Deployment breadth | 4: loadable but heavy | 5: llama.cpp, Ollama, LM Studio, vLLM, MLX | 3: LiteRT-LM, Transformers.js, edge-focused |
| On-device accessibility | 1: needs large GPU | 4: consumer GPU, Raspberry Pi 5 | 5: runs on phones |
| Total (/25) | 13 | 21 | 21 |
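The totals row is plain addition over the five dimensions. A quick check, with the scores copied from the table; the short dictionary keys are our own shorthand:

```python
# Scores per format, in table order: memory footprint, quality preservation,
# decode speed, deployment breadth, on-device accessibility.
SCORES = {
    "BF16":       {"memory": 1, "quality": 5, "decode": 2, "breadth": 4, "access": 1},
    "Q4_0 QAT":   {"memory": 4, "quality": 4, "decode": 4, "breadth": 5, "access": 4},
    "Mobile QAT": {"memory": 5, "quality": 3, "decode": 5, "breadth": 3, "access": 5},
}

# Sum each format's five dimension scores (maximum possible is 25).
totals = {fmt: sum(dims.values()) for fmt, dims in SCORES.items()}

for fmt, total in totals.items():
    print(f"{fmt}: {total}/25")
```

The two QAT formats reach the same total by different routes: Q4_0 QAT scores higher on quality and deployment breadth, while the mobile format scores higher on memory, decode speed, and accessibility.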
Winner
The result is a tie by design. Q4_0 QAT and mobile QAT both score 21, but for different hardware. For phones, the mobile format leads. It reaches about 1 GB on E2B and targets mobile accelerators directly. For laptops and consumer GPUs, Q4_0 QAT is the practical default. BF16 remains the quality reference, not a local deployment option.
Methodology and limits
Memory figures come from Google’s Gemma 4 documentation. The ~1 GB E2B figure comes from the QAT announcement. Quality is Google’s stated claim. No independent Gemma 4 QAT quality numbers had been published at launch. We did not run the models locally for this comparison. Developers should test with their own quantization settings and workloads before building.
Key Takeaways
- Q4_0 QAT cuts Gemma 4 E2B to 3.2 GB and E4B to 5 GB, from 9.6 GB and 15 GB at BF16.
- A new mobile QAT schema brings E2B to about 1 GB; text-only without PLE goes under 1 GB.
- QAT changes quality at a given size, not the size itself; the mobile format drives the additional memory cut.
- Google claims higher quality than PTQ but published no Gemma 4 QAT benchmark numbers at launch.
- Weights ship today on Hugging Face with llama.cpp, Ollama, LM Studio, vLLM, MLX, and LiteRT-LM support.
Check out the model weights (Q4_0 QAT collection, Mobile QAT collection).
The post Google DeepMind Releases Gemma 4 QAT Checkpoints: Q4_0 and a New Mobile Format Cut On-Device Memory appeared first on MarkTechPost.
