When a quantized model reports lower perplexity than full precision, something seems wrong. We injected random Gaussian noise into the weights to find out what’s really happening.
The Anomaly
On Qwen3-30B-A3B at the 30 GB budget point, MINT achieves mean PPL 8.657—apparently below BF16’s 8.733. A quantized model appearing to beat the original. This has been observed by other researchers and sometimes attributed to a “regularization effect” where quantization noise prevents overfitting to evaluation data. We designed a controlled experiment to test this claim.
The Experiment
We compared five conditions on Qwen3-30B-A3B, all evaluated on WikiText-2 (test split, seq_len=2048, seed=42):
| Condition | Size (GB) | Mean PPL | Median PPL | Median Δ vs BF16 |
|---|---|---|---|---|
| BF16 (baseline) | 56.87 | 8.733 | 8.789 | — |
| Gaussian noise (1x) | 56.87 | 8.742 | 8.755 | -0.38% |
| Uniform 8-bit g64 | 30.21 | 8.750 | 8.765 | -0.27% |
| Uniform 8-bit g128 | 29.32 | 8.769 | 8.776 | -0.15% |
| Uniform 6-bit g64 | 23.11 | 8.765 | 8.798 | +0.10% |
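The uniform conditions above use standard symmetric group-wise quantization (g64 = groups of 64 weights sharing one scale). The paper's exact kernel isn't reproduced here; a minimal sketch under that standard scheme:

```python
import numpy as np

def quantize_groupwise(w, bits=8, group=64):
    """Symmetric per-group uniform quantization (illustrative sketch).

    Assumes w.size is divisible by `group`; real implementations pad
    or fall back to per-channel scales for ragged shapes.
    """
    flat = w.reshape(-1, group)
    # One scale per group, mapping the group max to the largest int level.
    scale = np.abs(flat).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero groups
    q = np.round(flat / scale).clip(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(42)
w = rng.normal(size=(128, 128)).astype(np.float32)
wq = quantize_groupwise(w, bits=8, group=64)
err = wq - w
snr_db = 10 * np.log10((w ** 2).mean() / (err ** 2).mean())
print(f"8-bit g64 reconstruction SNR: {snr_db:.1f} dB")
```

On Gaussian-distributed weights this lands in the mid-40s of dB, consistent with the per-tensor SNR reported for the noise control below.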
The Gaussian noise condition is the key control: we injected unstructured random noise into 385 tensors, calibrated to match the exact magnitude of 8-bit quantization error (average SNR 44.7 dB per tensor). This preserves the model size at BF16 while introducing noise at the same scale as quantization.
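The paper's exact injection procedure isn't shown; the calibration itself is straightforward. Given a target SNR in dB, the noise standard deviation follows from the tensor's RMS, since SNR_dB = 20·log10(rms/σ). A sketch (function name is ours):

```python
import numpy as np

def add_noise_at_snr(w, snr_db=44.7, rng=None):
    """Inject Gaussian noise whose power matches a target per-tensor SNR."""
    rng = rng or np.random.default_rng(0)
    rms = np.sqrt((w.astype(np.float64) ** 2).mean())
    noise_std = rms * 10 ** (-snr_db / 20)  # invert SNR_dB = 20*log10(rms/std)
    return w + rng.normal(0.0, noise_std, size=w.shape)

rng = np.random.default_rng(42)
w = rng.normal(size=(256, 256))
wn = add_noise_at_snr(w, snr_db=44.7, rng=rng)
err = wn - w
achieved = 10 * np.log10((w ** 2).mean() / (err ** 2).mean())
print(f"achieved SNR: {achieved:.1f} dB")
```

Applying this per tensor, with the per-tensor SNR measured from the corresponding quantized tensor, yields noise matched in magnitude but with none of quantization's structured rounding.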
Finding 1: The Mean PPL “Improvement” Is a Distributional Artifact
All quantized conditions produce higher mean PPL than BF16 (+0.11% to +0.42%). The earlier observation of mean PPL below BF16 at the 30 GB MINT budget point was likely due to the specific allocation pattern combined with evaluation variance. There is no genuine mean PPL improvement from quantization.
Finding 2: The Median PPL Improvement Is Real But Not Quantization-Specific
All 8-bit conditions show lower median PPL than BF16 (-0.15% to -0.27%). This looks like regularization—until you check the control. Gaussian noise produces the same effect at -0.38%, ruling out anything specific to quantization’s structured rounding. The mechanism: BF16 has a left tail of sequences with unusually low PPL, and any noise—structured or not—partially disrupts these, pulling the median down while pushing the mean up.
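That mean and median can move in opposite directions is easy to see with a toy example. The numbers below are invented purely to illustrate the shape of the effect (one low-PPL left-tail sequence disrupted sharply, the bulk of the distribution nudged slightly), not the paper's measurements:

```python
from statistics import mean, median

# Synthetic per-sequence PPLs: one easy left-tail sequence, nine near the centre.
base = [3.0, 8.6, 8.7, 8.8, 8.9, 9.0, 9.1, 9.2, 9.3, 9.4]
# Hypothetical noisy run: the tail sequence jumps, the rest drift slightly down.
noisy = [6.0] + [p - 0.05 for p in base[1:]]

print(mean(noisy) - mean(base))      # positive: mean pushed up by the tail
print(median(noisy) - median(base))  # negative: median pulled down
```

A single disrupted tail sequence dominates the mean, while the median only sees the small shifts near the centre, so the two summary statistics diverge.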
Finding 3: No Selective Improvement
If quantization acted as genuine regularization, you would expect it to selectively improve the hardest sequences (those with highest PPL) while leaving easy sequences unchanged. We compared BF16 to uniform 8-bit across all 145 WikiText-2 sequences. The result: 68% of sequences degrade and only 32% improve, with the effect uniformly distributed across PPL quartiles (approximately +0.20% per quartile). There is no selective stabilization of hard sequences on this model.
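The quartile breakdown is simple to compute from two arrays of per-sequence PPLs. A sketch (helper name is ours; the data below is synthetic, stand-in for the 145 WikiText-2 sequences):

```python
import numpy as np

def quartile_deltas(ppl_base, ppl_quant):
    """Fraction of sequences degraded, plus mean %-change per baseline-PPL quartile."""
    ppl_base = np.asarray(ppl_base, float)
    ppl_quant = np.asarray(ppl_quant, float)
    delta_pct = 100 * (ppl_quant - ppl_base) / ppl_base
    order = np.argsort(ppl_base)              # easy -> hard by baseline PPL
    quartiles = np.array_split(order, 4)
    frac_degraded = float((delta_pct > 0).mean())
    return frac_degraded, [float(delta_pct[q].mean()) for q in quartiles]

# Illustrative data only, not the paper's measurements.
rng = np.random.default_rng(42)
base = np.exp(rng.normal(2.17, 0.3, size=145))
quant = base * (1 + rng.normal(0.002, 0.005, size=145))
frac, per_q = quartile_deltas(base, quant)
```

Genuine regularization would show up as a strongly negative delta in the top (hardest) quartile; a flat profile across quartiles, as observed here, argues against it.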
Note: This contrasts with GLM-4.7-Flash, where quantization does genuinely stabilize 5 catastrophic outlier sequences (PPL 25,000–81,000 at BF16) to PPL 100–360. That is a real and useful effect—but it is model-specific pathology, not a general regularization mechanism.
The Evaluation Pitfall
When mean and median perplexity diverge in opposite directions, apparent “improvements” from quantization may reflect distributional artifacts rather than genuine quality gains. This is not a theoretical concern—it affected the interpretation of our own results during development.
The practical takeaway: if your quantized model appears to beat full precision on mean perplexity, check the median. If the median also improves, check whether random noise at the same scale produces the same effect. If it does, you are observing a noise phenomenon, not regularization.
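The first step of that checklist can be mechanized from per-sequence PPLs. A sketch (function name and wording are ours, not from the paper):

```python
import numpy as np

def diagnose_improvement(ppl_base, ppl_quant):
    """Classify an apparent quantization 'win' from per-sequence PPL arrays."""
    b = np.asarray(ppl_base, float)
    q = np.asarray(ppl_quant, float)
    d_mean = 100 * (q.mean() - b.mean()) / b.mean()
    d_median = 100 * (np.median(q) - np.median(b)) / np.median(b)
    if d_mean < 0 and d_median < 0:
        verdict = "both improve: run a matched-SNR random-noise control next"
    elif d_mean * d_median < 0:
        verdict = "mean/median diverge: likely a distributional artifact"
    else:
        verdict = "no improvement over baseline"
    return d_mean, d_median, verdict

# Toy arrays where the mean degrades while the median improves.
d_mean, d_median, verdict = diagnose_improvement(
    [1.0, 2.0, 3.0, 100.0], [1.5, 1.9, 2.9, 110.0]
)
```

The noise-control step still requires a second evaluation run and cannot be read off the PPL arrays alone.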
Implications for Published Results
Any quantization paper that reports only mean perplexity and claims an “improvement” over BF16 should be treated with caution. The improvement may be real (as in GLM-4.7-Flash’s outlier stabilization) or it may be an artifact (as in Qwen3-30B-A3B’s distributional shift). Without median PPL and tail statistics, it is impossible to distinguish the two.
Data from the MINT paper: “MINT: Budget-Aware Data-Free Mixed-Precision Quantization for Large Language Models via Rate-Distortion Optimization” (baa.ai, 2026). Controlled noise experiments conducted on Qwen3-30B-A3B using WikiText-2 test split. Full paper at baa.ai/articles/24-mint-paper.html. Code at github.com/baa-ai/MINT.