2026-06-11

Gemma 4's QAT weights: on-device inference just swapped its real bottleneck

Google shipped quantization-aware training weights for Gemma 4, squeezing E2B down to 1GB so it runs on phones and consumer GPUs. The turn that matters isn't 'it fits now'. It's that the hard problem moved to power draw, the privacy boundary, and exactly how much quality you lose.

open-models quantization local-ai on-device

Summary

On June 5, Google released a batch of weights for its open Gemma 4 family, produced with quantization-aware training (QAT). What it does, plainly: the same models were retrained with a method that “trains with quantization in the loop,” and the result is a sharp drop in memory use, enough to run on everyday hardware like phones and consumer GPUs. The smallest edge model, Gemma 4 E2B, comes down to a 1GB memory footprint.

But the line worth keeping is not the marketing one, “Gemma 4 runs on phones now.” Barely loading a model onto a phone is something every lab has demoed for two years. What’s worth keeping is that this release changes the central tension of on-device inference. It moves the problem off the binary question of “does it fit” and onto a set of continuous, engineering-tradeoff questions: how much power does it burn while running, how large a privacy boundary does local processing actually buy you, and how much quality did compression cost. Once “it runs” stops being the gate, those three things are what decide whether an on-device feature graduates from demo to product. That’s the line this piece follows.

What happened

First, the facts. This release is a set of new weight checkpoints, not new models. Gemma 4 itself shipped two months ago. In between, Google added Multi-Token Prediction (MTP, predicting several tokens at once to speed up inference) and slotted in a 12B model to bridge the gap between the E4B and 26B MoE models. The subject here is quantization: re-compressing the existing models the QAT way.

There are two quantization formats. One is Q4_0, the format the community uses most widely (roughly, weights down to 4-bit); the QAT recipe was applied to it across all the models. The other is a quantization format newly designed for mobile use and applied only to the two edge models, E2B and E4B. With that mobile format, Gemma 4 E2B’s footprint drops to 1GB. Google adds that a text-only deployment, with the audio and vision encoders dropped and Per-Layer Embeddings removed, brings E2B in under 1GB.
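To make “roughly 4-bit” concrete, here is a minimal sketch of block-wise 4-bit quantization in the spirit of Q4_0. It assumes the layout llama.cpp uses for that format (blocks of 32 weights sharing one 16-bit scale) and simplifies the rounding rule; the function names are illustrative, not llama.cpp API.

```python
BLOCK = 32  # weights per block (assumption: llama.cpp's Q4_0 block size)


def quantize_block(weights):
    """Map one block of floats to 4-bit codes in [0, 15] plus one shared scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / 7.0 if amax > 0 else 1.0  # simplified symmetric scale
    codes = [max(0, min(15, round(w / scale) + 8)) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the 4-bit codes."""
    return [(c - 8) * scale for c in codes]


def bits_per_weight(block=BLOCK, code_bits=4, scale_bits=16):
    """Effective storage cost once the per-block scale is counted."""
    return (block * code_bits + scale_bits) / block
```

Under those assumptions `bits_per_weight()` comes to 4.5, not 4: the per-block scale is the overhead hiding behind “roughly.” The round-trip error on any weight is bounded by half the block's scale, which is the precision loss QAT trains the model to tolerate.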

QAT isn’t a new idea, but the judgment to make here is that Google treated it as the default deliverable, not an option for tinkerers. The mechanism: during training it simulates the precision loss that quantization will cause, so the model “knows” while training that it will end up compressed to low bit-width, and adjusts its weights toward something more robust to that compression. That’s a different road from standard post-training quantization (PTQ, where you train normally and compress afterward). Google’s claim is measured: PTQ already preserves quality fairly well, but their QAT results yield higher overall quality than standard PTQ baselines. Note what’s missing. No specific numbers. It does not say “quality loss dropped from X% to Y%,” only the direction. More on that below.
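A toy sketch of what “quantization in the loop” means mechanically, assuming the common fake-quantization approach with a straight-through gradient. Google's actual recipe isn't described in the source, so treat every name and hyperparameter here as illustrative.

```python
def fake_quant(w, scale, qmin=-8, qmax=7):
    """Round a weight to a 4-bit grid and map it back to float (simulated precision loss)."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train_qat(data, scale=0.25, lr=0.05, steps=200):
    """Fit y = w * x while the forward pass only ever sees the quantized weight.

    The gradient is applied to the latent full-precision weight as if rounding
    were the identity (the straight-through estimator).
    """
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for x, y in data:
            pred = fake_quant(w, scale) * x   # forward pass uses the quantized weight
            grad += 2 * (pred - y) * x        # loss gradient, passed straight through
        w -= lr * grad / len(data)
    return w, fake_quant(w, scale)


# Hypothetical data drawn from y = 0.6 * x; 0.6 is not on the 0.25-step grid.
latent, deployed = train_qat([(1.0, 0.6), (2.0, 1.2)])
```

This shows only the mechanism. A one-weight toy can't demonstrate the quality gain, which comes from many weights adjusting jointly to absorb each other's rounding error.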

The shipping support came at once: weights on Hugging Face, GGUF for llama.cpp, compressed tensors for vLLM, desktop via Ollama and LM Studio, Google’s lightweight LiteRT-LM runtime for edge deployment, Transformers.js in the browser, MLX for Apple Silicon, and fine-tuning through Hugging Face Transformers and Unsloth. That list is itself a signal: Google didn’t build a runtime to fence developers in. It pushed the weights into every tool developers already use.

On Hacker News the story drew roughly 405 points and 25 comments. For contrast, the April 2025 “Gemma 3 QAT” post pulled 600-plus points and nearly 280 comments. The cooling-off isn’t surprising, and it makes a point: QAT has gone from a surprise to an expected move, which is exactly how on-device quantization settles from “news” into “infrastructure.”

Why it matters

Squeezing models onto consumer hardware has, until now, been a one-dimensional story about memory: is there enough VRAM, does it fit. This release deserves attention because it largely closes out that dimension. E2B at 1GB means a decent phone can keep a usable language model resident. And once “it fits” stops being the question, the real bottlenecks surface, each harder than the last.

The first is power. A phone is not a GPU; every decode step burns battery and generates heat. The benefit of quantization isn’t only memory. It also speeds up decode, which Google states directly: quantization shrinks the footprint while accelerating decode speed. But faster decode is not the same as lower power, and in an on-device product “it runs” and “it runs continuously without cooking the chip or draining the battery” are two different things. Two moves in the mobile format aim straight at this: static activations, where the scaling parameters are pre-computed during training to cut the runtime overhead of recalculating them; and channel-wise quantization, which structures the compressed data to match the design of mobile accelerators so the phone runs the math natively instead of through a slow workaround. Neither is about saving more memory. Both are meant to make the chip do less work and run cooler. The bottleneck moved from capacity to energy.
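A rough sketch of both ideas, under loud assumptions: the 4-bit weight grid, the int8 activation range, and `STATIC_SCALE` are all hypothetical stand-ins, since the source doesn't give the mobile format's actual bit-widths or scales.

```python
def quantize_row(row, scale, qmax=7):
    """Round each value to a signed 4-bit-style grid and map it back to float."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]


def row_errors(rows, per_channel, qmax=7):
    """Worst round-trip error per row, with one scale per row or one shared scale."""
    shared = (max(abs(v) for row in rows for v in row) / qmax) or 1.0
    errors = []
    for row in rows:
        scale = (max(abs(v) for v in row) / qmax) or 1.0 if per_channel else shared
        restored = quantize_row(row, scale, qmax)
        errors.append(max(abs(a - b) for a, b in zip(row, restored)))
    return errors


STATIC_SCALE = 0.05  # hypothetical: a value fixed at training time


def quantize_acts_dynamic(acts, qmax=127):
    """Dynamic scaling: an extra pass over the activations on every call."""
    scale = (max(abs(a) for a in acts) / qmax) or 1.0
    return scale, [round(a / scale) for a in acts]


def quantize_acts_static(acts, scale=STATIC_SCALE, qmax=127):
    """Static scaling: no runtime search for the maximum, just clamp and round."""
    return [max(-qmax, min(qmax, round(a / scale))) for a in acts]


# One small-magnitude channel next to a large one.
rows = [[0.01, -0.02, 0.03], [1.0, -2.0, 3.0]]
shared_err = row_errors(rows, per_channel=False)
channel_err = row_errors(rows, per_channel=True)
```

With a single shared scale, the small channel rounds entirely to zero; with one scale per channel it survives. The static path skips the max-finding pass, which is the runtime work the format is described as removing.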

The second is the privacy boundary, which is where on-device actually earns its keep. The model runs locally and the data never leaves the device, a property a cloud API cannot give you. But “data never leaves the device” is a commitment that has to be defined precisely, not a marketing line: which computation is local, which features still call home, how model updates are delivered. Each is part of the boundary. Getting the model to fit in 1GB only makes that boundary possible. Making it real and defensible is product and engineering work, not something quantization hands you. This release pushes the door open; you still have to walk the path behind it.

The third, and the one to watch hardest, is how much quality you actually lose. Google says QAT beats standard PTQ on quality, and the direction is fully credible. It’s the entire reason QAT exists. But “higher” is a relative word. It’s relative to PTQ, not relative to the original full-precision model. In other words, QAT minimizes the loss from compression on the premise that you’re compressing anyway; it doesn’t eliminate the loss. The mobile format includes one aggressive move in particular: the parameters that generate tokens are compressed to 2-bit, while the core reasoning layers are kept at higher precision. That’s an explicit bet that a coarse generation layer plus an accurate reasoning core still leaves the model smart enough overall. It probably holds for most everyday tasks, but which tasks crack first, Google doesn’t say. You’ll have to measure that yourself.
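The storage side of that bet is simple arithmetic. The parameter counts below are hypothetical round numbers chosen for illustration, not Gemma 4's real split between token-generating parameters and core layers.

```python
def footprint_bytes(groups):
    """groups: (parameter_count, bits_per_weight) pairs; returns total bytes."""
    return sum(count * bits for count, bits in groups) / 8


# Hypothetical split: 1B token-generating parameters, 1B core parameters.
mixed = footprint_bytes([(1_000_000_000, 2), (1_000_000_000, 4)])
uniform = footprint_bytes([(2_000_000_000, 4)])
```

In this made-up split the mixed scheme stores a quarter less than uniform 4-bit. The arithmetic says nothing about what the 2-bit group costs in quality, which is the part you have to measure.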

Builder impact

If you’re building on-device or local-first features, these weights are worth pulling down today, with a condition attached to each reason.

The barrier really did drop. E2B in the 1GB range, with the ready-made llama.cpp / Ollama / LM Studio / LiteRT-LM chain, means you no longer write a pile of infrastructure just to get inference running. The practical value: on-device goes from a hard problem that needs a dedicated team to an option a normal backend engineer can validate in an afternoon. But “the demo runs” and “the product ships” are separated by the power and quality hurdles above, so don’t mistake the first for the second.

The QAT weights should be your default starting point, not full-precision models you PTQ yourself. The logic is simple: Google already used QAT to get loss below PTQ and handed you Q4_0 GGUFs and compressed tensors for vLLM, so re-doing PTQ yourself mostly spends effort to land somewhere worse. There are QAT checkpoints for MTP too, so inference speedup and quantization stack together; they aren’t an either/or.

Trimming modalities on demand is an underrated lever. Gemma 4 is multimodal, but the audio and vision encoders are dead weight in many use cases. Google states you can deploy only the modalities you need to cut memory further, and text-only E2B (without Per-Layer Embeddings) fits in under 1GB. If your feature is text-only, dropping those encoders is near-zero-cost memory savings, with no reason to skip it.

The place to actually spend effort is building your own quality evaluation. Google gave only the directional result, “QAT beats PTQ,” with no numbers you can copy. Your product’s quality gate can’t rest on someone else’s relative claim. You have to measure how much the QAT version drops against full precision on your real tasks and real inputs, watching especially for side effects from that 2-bit generation layer in your scenario. There’s no skipping this step.
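A minimal sketch of such a harness. It assumes you can wrap each model as a plain function from prompt to output; `baseline`, `candidate`, and `score` are placeholders for your own inference wrappers and task metric, not any library's API.

```python
from typing import Callable, Iterable, Tuple


def quality_drop(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
    score: Callable[[str, str], float],
) -> dict:
    """Score both models on the same (prompt, reference) pairs and report the gap."""
    base_scores, cand_scores, regressions = [], [], []
    for prompt, reference in cases:
        b = score(baseline(prompt), reference)
        c = score(candidate(prompt), reference)
        base_scores.append(b)
        cand_scores.append(c)
        if c < b:
            regressions.append(prompt)  # keep the failing prompts, not just the mean
    if not base_scores:
        raise ValueError("no evaluation cases supplied")
    n = len(base_scores)
    return {
        "baseline": sum(base_scores) / n,
        "candidate": sum(cand_scores) / n,
        "regressions": regressions,
    }


def exact_match(output: str, reference: str) -> float:
    """Simplest possible metric; swap in whatever your task actually needs."""
    return 1.0 if output == reference else 0.0
```

Returning the list of regressed prompts matters more than the averages: a small mean drop can hide a whole category of inputs where the compressed model breaks, and that list is where any side effects of the 2-bit layer would show up.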

What to ignore

Ignore the framing that treats “you can run a big model on a phone now” as the milestone. Running a model on a phone isn’t news; running it with low power, a clear privacy boundary, and good-enough quality is. Put your attention on the three continuous questions and don’t stop at the binary “it runs.”

Ignore any precise “quality loss is only X%” or “near-lossless” claim unless it traces back to your own testing. The source gives no specific quality-loss figure, and no exact per-model memory numbers either (Google has a VRAM table but calls the values “approximate,” and the specific numbers couldn’t be verified here). Any precise percentage floating around is either measured on someone else’s task and may not transfer to yours, or simply made up.

And don’t read this as Google building a closed on-device ecosystem to lock anyone in. It pushed the weights into a long list of other people’s tools (llama.cpp, Ollama, LM Studio, vLLM, MLX, Transformers.js) with LiteRT-LM as just one option. This is a “weights everywhere” open play, not platform lock-in. The point of the story is that these weights drop into your stack, not that Google shipped another runtime.

Technical takeaway

Four moves in the mobile format are worth noting on their own, because they explain why this isn’t just another round of PTQ. Static activations: scaling parameters pre-computed during training, removing the runtime overhead of recalculating them on mobile chips, for faster responses. Channel-wise quantization: compressed data structured to match mobile accelerators so the phone computes natively rather than through a slow workaround. Targeted 2-bit quantization: only the token-generating parameters squeezed to 2-bit while the core reasoning layers stay at higher precision, with the aim of saving storage without dropping the model’s intelligence. Embedding and KV cache optimization: compression focused on the model’s vocabulary and its short-term memory (the KV cache), sharply cutting the active memory footprint so long chats don’t run out of space. Put together, it comes down to one sentence: QAT doesn’t train and then compress; it shapes the model during training into the form a mobile chip prefers. That’s the real engineering line between it and post-training quantization.
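On the KV cache point, the standard size estimate shows why long chats are the pressure point. The formula is the usual one for transformer caches; the layer count, head count, and head dimension below are hypothetical, not Gemma 4's real configuration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Keys plus values, for every layer, for every token held in context."""
    elements = 2 * layers * kv_heads * head_dim * seq_len
    return elements * bits // 8


# Hypothetical small model holding an 8,192-token conversation.
fp16_cache = kv_cache_bytes(30, 4, 256, 8192, bits=16)
int4_cache = kv_cache_bytes(30, 4, 256, 8192, bits=4)
```

The cache grows linearly with conversation length, so at these made-up dimensions a 16-bit cache alone is about as large as the 1GB weight footprint, and storing it at 4 bits cuts that by a factor of four.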