quantization analysis

2026-06-11 google

Gemma 4's QAT weights: on-device inference just swapped its real bottleneck

Google shipped quantization-aware training (QAT) weights for Gemma 4, squeezing E2B down to 1 GB so it runs on phones and consumer GPUs. What matters isn't that "it fits now". It's that the hard problem has moved to power draw, the privacy boundary, and exactly how much quality you lose.

open-models quantization local-ai
