
When Quantization Beats Full Precision: Anatomy of a Perplexity Anomaly

March 2026 · Black Sheep AI Research

We quantized GLM-4.7-Flash to 4.4 average bits and measured 12.5% lower perplexity than the BF16 baseline. A quantized model, beating full precision? We spent a week chasing this result. Here’s what we found — and why anyone evaluating quantized models needs to hear it.

The Numbers That Don’t Add Up

We were running standard perplexity evaluation on GLM-4.7-Flash — a 31-billion-parameter model from the GLM-4 family — as part of our quantization validation work. The protocol was routine: WikiText-2, sequence length 2048, 256 samples, seed 42. Three conditions: BF16 baseline, FP16 baseline, and mixed-precision quantized (4.4 average bits).

The results stopped us cold:

Condition                        Perplexity   Model Size   Compression
BF16 baseline                    11.344       58.2 GB      1.0×
FP16 baseline                    11.208       55.8 GB      1.04×
4-bit quantized (4.4 avg bits)   9.930        15.9 GB      3.66×

Read that again. The model compressed to one quarter of its original size has lower perplexity than the full-precision version. Not slightly lower — 12.5% lower. This should not happen. Quantization destroys information. You cannot add noise to a model and get better predictions.

Unless the metric is lying to you.

Chasing the Anomaly

Our first instinct was to suspect a bug. We re-ran the evaluation three times, verified the model weights, and checked the tokenizer. Identical results every time. The numbers were real.

So we did what any good engineer does with a suspicious average: we looked at the distribution.

We computed per-sequence perplexity for all 256 WikiText-2 test sequences across all three conditions. What we found was stark: 5 sequences in the BF16 model produce catastrophic perplexity — individual sequence PPL values in the range of 25,000 to 81,000.

For context, the median sequence perplexity is around 8.7. These 5 outliers are three to four orders of magnitude worse than the typical sequence. And because perplexity is computed as exp(mean(log-loss)), a handful of extreme values can dominate the entire average.
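How easily a handful of extreme values dominates exp(mean(log-loss)) can be demonstrated with illustrative numbers shaped like ours: 251 typical sequences near the median, plus 5 catastrophic ones. (This is a minimal sketch; the specific loss values are made up for illustration, not taken from our evaluation.)

```python
import math

# 251 typical sequences near the median (PPL ~ 8.7), plus 5 catastrophic
# ones (PPL in the 25,000-81,000 range). Per-sequence log-losses in nats.
typical = [math.log(8.7)] * 251
outliers = [math.log(25_000), math.log(40_000), math.log(55_000),
            math.log(70_000), math.log(81_000)]
losses = typical + outliers

# Standard perplexity: exp of the mean log-loss. Just 5 of 256 values
# pull it well above 10, even though 98% of sequences sit near 8.7.
mean_ppl = math.exp(sum(losses) / len(losses))

# Median-based perplexity ignores the outliers entirely.
median_ppl = math.exp(sorted(losses)[len(losses) // 2])

print(f"mean-based PPL:   {mean_ppl:.3f}")
print(f"median-based PPL: {median_ppl:.3f}")
```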

Both precision formats — BF16 and FP16 — produce exactly the same 5 catastrophic sequences. These aren’t random failures. They’re deterministic: specific token patterns in WikiText-2 that this model consistently handles catastrophically.

What’s Actually Happening

Quantization noise acts as an accidental regularizer on these pathological sequences.

When the full-precision model encounters one of these degenerate sequences, some internal dynamic — likely involving extreme attention patterns or saturated activations — causes the predicted probability for the correct next token to collapse to near zero. The log-loss explodes, and perplexity follows.
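The explosion is a direct property of the log: as the predicted probability of the correct token collapses, its log-loss grows without bound, and a single such token carries an enormous per-token perplexity. A quick illustration:

```python
import math

# Log-loss (nats) and equivalent per-token perplexity as the predicted
# probability of the correct token collapses toward zero.
for p in (0.5, 0.1, 1e-2, 1e-4):
    loss = -math.log(p)          # cross-entropy contribution of one token
    print(f"p={p:<8g} loss={loss:6.2f} nats  token-level PPL={1 / p:>10,.0f}")
```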

The 4-bit quantization grid introduces small perturbations throughout the weight matrices. These perturbations are enough to disrupt whatever fragile pattern causes the model to spiral on those specific sequences. One of the 5 catastrophic sequences drops below the 100 PPL threshold entirely. The remaining outliers still produce high loss, but the peaks are substantially tamed.
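To make "small perturbations" concrete, here is a minimal sketch of round-to-nearest uniform 4-bit quantization of a single weight group. The values are illustrative, and the actual mixed-precision scheme uses per-group scales and varying bit widths that we do not reproduce here; the point is only that round-to-nearest bounds each perturbation by half a grid step.

```python
import random

# Hypothetical weight group; real weights come from the model.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]

lo, hi = min(weights), max(weights)
step = (hi - lo) / (2**4 - 1)        # 16 levels on a uniform 4-bit grid

# Round each weight to the nearest grid point.
quantized = [lo + round((w - lo) / step) * step for w in weights]
max_err = max(abs(w - q) for w, q in zip(weights, quantized))

# Round-to-nearest guarantees every perturbation is at most step / 2.
assert max_err <= step / 2 + 1e-12
print(f"grid step: {step:.5f}, max perturbation: {max_err:.5f}")
```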

This is not quantization improving the model. It's quantization noise accidentally preventing a rare, deterministic failure mode. The model is still worse on the other 251 sequences — as you'd expect from lossy compression. But the arithmetic mean doesn't care about 251 modestly worse sequences when 5 catastrophic ones improve by orders of magnitude.

The Honest Numbers

Once you use robust statistics that aren’t dominated by 2% of the data, the real picture emerges:

Condition         Standard PPL   Median PPL   Trimmed Mean PPL   Outliers (PPL > 100)
BF16              11.344         8.706        8.689              5
FP16              11.208         8.609        8.647              5
4-bit quantized   9.930          9.084        9.177              4

Median perplexity tells the true story: the quantized model is 4.3% worse than BF16 (9.084 vs 8.706). That’s exactly what you’d expect from 3.66× compression with mixed-precision bit allocation. No anomaly. No magic. Just honest lossy compression doing its job.

The gap between standard and median PPL — 2.6 points for BF16, but only 0.8 points for the quantized model — is itself informative. It quantifies exactly how much of the mean is being distorted by outlier sequences.
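These robust statistics are straightforward to compute from per-sequence perplexities. A minimal sketch: the PPL > 100 outlier threshold matches the one used above, while the 2% trim fraction and the input values are illustrative assumptions, not our exact evaluation parameters.

```python
import statistics

def robust_ppl_summary(seq_ppls, trim_frac=0.02, outlier_thresh=100.0):
    """Summarize per-sequence perplexities with outlier-resistant statistics.

    trim_frac (symmetric trim from each tail) is an illustrative choice;
    outlier_thresh matches the PPL > 100 cutoff used in the tables above.
    """
    xs = sorted(seq_ppls)
    k = int(len(xs) * trim_frac)
    trimmed = xs[k:len(xs) - k] if k else xs
    return {
        "median": statistics.median(xs),
        "trimmed_mean": statistics.fmean(trimmed),
        "outliers": sum(x > outlier_thresh for x in xs),
    }

# 251 typical sequences plus 5 catastrophic ones (illustrative values).
ppls = [8.7] * 251 + [25_000, 40_000, 55_000, 70_000, 81_000]
print(robust_ppl_summary(ppls))
```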

Why This Matters for the Community

Perplexity is the lingua franca of quantization evaluation. It's in every paper, every model card, every community benchmark. And it is fragile.

Standard mean perplexity can be dominated by a tiny fraction of evaluation sequences. If your model has even a few pathological failure modes on the test set — and many models do — the mean becomes unreliable as a quality signal. Worse, quantization can accidentally improve the metric by perturbing these failure modes, creating the illusion that compression is free or even beneficial.

This isn’t hypothetical. It happened to us with a production evaluation pipeline on real models. If we hadn’t investigated, we might have published a headline claiming quantization improves model quality at 3.66× compression. It would have been wrong.

Practical Recommendations

For anyone evaluating quantized models, three concrete steps:

1. Compute per-sequence perplexity, not just the pooled corpus mean, so that individual failure modes are visible.
2. Report robust statistics (median and trimmed-mean PPL) alongside the standard mean.
3. Count outlier sequences (e.g., PPL > 100) under every condition, and check whether the same sequences fail across precisions.
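The core of these checks is computing perplexity per sequence rather than only over the pooled token stream. A minimal sketch of the distinction, assuming next-token log-probabilities have already been extracted from the model (function names and values are our own, for illustration):

```python
import math

def per_sequence_ppl(seq_logprobs):
    """Per-sequence perplexity from next-token log-probabilities (nats).

    seq_logprobs: one list of token log-probs per evaluation sequence.
    """
    return [math.exp(-sum(lp) / len(lp)) for lp in seq_logprobs]

def corpus_ppl(seq_logprobs):
    """Standard corpus perplexity: exp of the mean loss over all tokens."""
    flat = [lp for seq in seq_logprobs for lp in seq]
    return math.exp(-sum(flat) / len(flat))

# Two well-behaved sequences and one catastrophic one (illustrative).
seqs = [[math.log(0.5)] * 100,
        [math.log(0.4)] * 100,
        [math.log(1e-4)] * 100]

print(sorted(per_sequence_ppl(seqs)))  # the per-sequence view exposes the outlier
print(corpus_ppl(seqs))                # the pooled mean is inflated by it
```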

The Bottom Line

Quantization does not improve models. If your numbers say otherwise, the numbers are wrong — or more precisely, the statistic you're using is not measuring what you think it's measuring.

Our 4-bit quantization of GLM-4.7-Flash achieves 3.66× compression with 4.3% perplexity degradation. That’s a legitimate, useful result. The 12.5% “improvement” was a mirage created by outlier sequences and a fragile metric.

In quantization, as in everything else: if it looks too good to be true, measure it differently.

Hardware: Apple M2 Ultra, 192 GB unified memory. Software: MLX 0.30.3, Python 3.12.

