
When Quantization Beats Full Precision: Anatomy of a Perplexity Anomaly

March 2026 · Black Sheep AI Research

We quantized GLM-4.7-Flash to 4.4 average bits and measured 12.5% lower perplexity than the BF16 baseline. A quantized model, beating full precision? We spent a week chasing this result. Here’s what we found — and why anyone evaluating quantized models needs to hear it.

The Numbers That Don’t Add Up

We were running standard perplexity evaluation on GLM-4.7-Flash — a 31-billion-parameter model from the GLM-4 family — as part of our quantization validation work. The protocol was routine: WikiText-2, sequence length 2048, 256 samples, seed 42. Three conditions: BF16 baseline, FP16 baseline, and mixed-precision quantized (4.4 average bits).

The results stopped us cold:

Condition                        Perplexity   Model Size   Compression
BF16 baseline                    11.344       58.2 GB      1.0×
FP16 baseline                    11.208       55.8 GB      1.04×
4-bit quantized (4.4 avg bits)   9.930        15.9 GB      3.66×

Read that again. The model compressed to one quarter of its original size has lower perplexity than the full-precision version. Not slightly lower — 12.5% lower. This should not happen. Quantization destroys information. You cannot add noise to a model and get better predictions.

Unless the metric is lying to you.

Chasing the Anomaly

Our first instinct was to suspect a bug. We re-ran the evaluation three times, verified the model weights, and checked the tokenizer. Identical results every time. The numbers were real.

So we did what any good engineer does with a suspicious average: we looked at the distribution.

We computed per-sequence perplexity for all 256 WikiText-2 test sequences across all three conditions. What we found was stark: 5 sequences in the BF16 model produce catastrophic perplexity — individual sequence PPL values in the range of 25,000 to 81,000.

For context, the median sequence perplexity is around 8.7. These 5 outliers are three to four orders of magnitude worse than the typical sequence. And because perplexity is computed as exp(mean(log-loss)), a handful of extreme values can dominate the entire average.
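How easily a handful of extreme values dominates exp(mean(log-loss)) can be demonstrated with illustrative numbers shaped like ours: 251 typical sequences near the median, plus 5 catastrophic ones. (This is a minimal sketch; the specific loss values are made up for illustration, not taken from our evaluation.)

```python
import math

# 251 typical sequences near the median (PPL ~ 8.7), plus 5 catastrophic
# ones (PPL in the 25,000-81,000 range). Per-sequence log-losses in nats.
typical = [math.log(8.7)] * 251
outliers = [math.log(25_000), math.log(40_000), math.log(55_000),
            math.log(70_000), math.log(81_000)]
losses = typical + outliers

# Standard perplexity: exp of the mean log-loss. Just 5 of 256 values
# pull it well above 10, even though 98% of sequences sit near 8.7.
mean_ppl = math.exp(sum(losses) / len(losses))

# Median-based perplexity ignores the outliers entirely.
median_ppl = math.exp(sorted(losses)[len(losses) // 2])

print(f"mean-based PPL:   {mean_ppl:.3f}")
print(f"median-based PPL: {median_ppl:.3f}")
```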

Both precision formats — BF16 and FP16 — produce exactly the same 5 catastrophic sequences. These aren’t random failures. They’re deterministic: specific token patterns in WikiText-2 that this model consistently handles catastrophically.

What’s Actually Happening

Quantization noise acts as an accidental regularizer on these pathological sequences.

When the full-precision model encounters one of these degenerate sequences, some internal dynamic — likely involving extreme attention patterns or saturated activations — causes the predicted probability for the correct next token to collapse to near zero. The log-loss explodes, and perplexity follows.
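The explosion is a direct property of the log: as the predicted probability of the correct token collapses, its log-loss grows without bound, and a single such token carries an enormous per-token perplexity. A quick illustration:

```python
import math

# Log-loss (nats) and equivalent per-token perplexity as the predicted
# probability of the correct token collapses toward zero.
for p in (0.5, 0.1, 1e-2, 1e-4):
    loss = -math.log(p)          # cross-entropy contribution of one token
    print(f"p={p:<8g} loss={loss:6.2f} nats  token-level PPL={1 / p:>10,.0f}")
```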

The 4-bit quantization grid introduces small perturbations throughout the weight matrices. These perturbations are enough to disrupt whatever fragile pattern causes the model to spiral on those specific sequences. One of the 5 catastrophic sequences drops below the 100 PPL threshold entirely. The remaining outliers still produce high loss, but the peaks are substantially tamed.
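To make "small perturbations" concrete, here is a minimal sketch of round-to-nearest uniform 4-bit quantization of a single weight group. The values are illustrative, and the actual mixed-precision scheme uses per-group scales and varying bit widths that we do not reproduce here; the point is only that round-to-nearest bounds each perturbation by half a grid step.

```python
import random

# Hypothetical weight group; real weights come from the model.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]

lo, hi = min(weights), max(weights)
step = (hi - lo) / (2**4 - 1)        # 16 levels on a uniform 4-bit grid

# Round each weight to the nearest grid point.
quantized = [lo + round((w - lo) / step) * step for w in weights]
max_err = max(abs(w - q) for w, q in zip(weights, quantized))

# Round-to-nearest guarantees every perturbation is at most step / 2.
assert max_err <= step / 2 + 1e-12
print(f"grid step: {step:.5f}, max perturbation: {max_err:.5f}")
```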

This is not quantization improving the model. It's quantization noise accidentally preventing a rare, deterministic failure mode. The model is still worse on the other 251 sequences — as you'd expect from lossy compression. But the arithmetic mean doesn't care about 251 modestly worse sequences when 5 catastrophic ones improve by orders of magnitude.

The Honest Numbers

Once you use robust statistics that aren’t dominated by 2% of the data, the real picture emerges:

Condition         Standard PPL   Median PPL   Trimmed Mean PPL   Outliers (PPL > 100)
BF16              11.344         8.706        8.689              5
FP16              11.208         8.609        8.647              5
4-bit quantized   9.930          9.084        9.177              4

Median perplexity tells the true story: the quantized model is 4.3% worse than BF16 (9.084 vs 8.706). That’s exactly what you’d expect from 3.66× compression with mixed-precision bit allocation. No anomaly. No magic. Just honest lossy compression doing its job.

The gap between standard and median PPL — 2.6 points for BF16, but only 0.8 points for the quantized model — is itself informative. It quantifies exactly how much of the mean is being distorted by outlier sequences.
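These robust statistics are straightforward to compute from per-sequence perplexities. A minimal sketch: the PPL > 100 outlier threshold matches the one used above, while the 2% trim fraction and the input values are illustrative assumptions, not our exact evaluation parameters.

```python
import statistics

def robust_ppl_summary(seq_ppls, trim_frac=0.02, outlier_thresh=100.0):
    """Summarize per-sequence perplexities with outlier-resistant statistics.

    trim_frac (symmetric trim from each tail) is an illustrative choice;
    outlier_thresh matches the PPL > 100 cutoff used in the tables above.
    """
    xs = sorted(seq_ppls)
    k = int(len(xs) * trim_frac)
    trimmed = xs[k:len(xs) - k] if k else xs
    return {
        "median": statistics.median(xs),
        "trimmed_mean": statistics.fmean(trimmed),
        "outliers": sum(x > outlier_thresh for x in xs),
    }

# 251 typical sequences plus 5 catastrophic ones (illustrative values).
ppls = [8.7] * 251 + [25_000, 40_000, 55_000, 70_000, 81_000]
print(robust_ppl_summary(ppls))
```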

Why This Matters for the Community

Perplexity is the lingua franca of quantization evaluation. It's in every paper, every model card, every community benchmark. And it is fragile.

Standard mean perplexity can be dominated by a tiny fraction of evaluation sequences. If your model has even a few pathological failure modes on the test set — and many models do — the mean becomes unreliable as a quality signal. Worse, quantization can accidentally improve the metric by perturbing these failure modes, creating the illusion that compression is free or even beneficial.

This isn’t hypothetical. It happened to us with a production evaluation pipeline on real models. If we hadn’t investigated, we might have published a headline claiming quantization improves model quality at 3.66× compression. It would have been wrong.

Practical Recommendations

For anyone evaluating quantized models, three concrete steps:

1. Compute per-sequence perplexity, not just the pooled corpus mean, so that individual failure modes are visible.
2. Report robust statistics (median and trimmed-mean PPL) alongside the standard mean.
3. Count outlier sequences (e.g., PPL > 100) under every condition, and check whether the same sequences fail across precisions.
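The core of these checks is computing perplexity per sequence rather than only over the pooled token stream. A minimal sketch of the distinction, assuming next-token log-probabilities have already been extracted from the model (function names and values are our own, for illustration):

```python
import math

def per_sequence_ppl(seq_logprobs):
    """Per-sequence perplexity from next-token log-probabilities (nats).

    seq_logprobs: one list of token log-probs per evaluation sequence.
    """
    return [math.exp(-sum(lp) / len(lp)) for lp in seq_logprobs]

def corpus_ppl(seq_logprobs):
    """Standard corpus perplexity: exp of the mean loss over all tokens."""
    flat = [lp for seq in seq_logprobs for lp in seq]
    return math.exp(-sum(flat) / len(flat))

# Two well-behaved sequences and one catastrophic one (illustrative).
seqs = [[math.log(0.5)] * 100,
        [math.log(0.4)] * 100,
        [math.log(1e-4)] * 100]

print(sorted(per_sequence_ppl(seqs)))  # the per-sequence view exposes the outlier
print(corpus_ppl(seqs))                # the pooled mean is inflated by it
```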

The Bottom Line

Quantization does not improve models. If your numbers say otherwise, the numbers are wrong — or more precisely, the statistic you're using is not measuring what you think it's measuring.

Our 4-bit quantization of GLM-4.7-Flash achieves 3.66× compression with 4.3% perplexity degradation. That’s a legitimate, useful result. The 12.5% “improvement” was a mirage created by outlier sequences and a fragile metric.

In quantization, as in everything else: if it looks too good to be true, measure it differently.

Hardware: Apple M2 Ultra, 192 GB unified memory. Software: MLX 0.30.3, Python 3.12.

