
When Quantization Beats Full Precision: Anatomy of a Perplexity Anomaly

March 2026 · Black Sheep AI Research

We quantised GLM-4.7-Flash to 4.4 average bits and measured 12.5% lower perplexity than the BF16 baseline. A quantised model, beating full precision? We spent a week chasing this result. Here’s what we found — and why anyone evaluating quantised models needs to hear it.

The Numbers That Don’t Add Up

We were running standard perplexity evaluation on GLM-4.7-Flash — a 31-billion parameter model from the GLM-4 family — as part of our SWAN v3 validation. The protocol was routine: WikiText-2, sequence length 2048, 256 samples, seed 42. Three conditions: BF16 baseline, FP16 baseline, and SWAN v3 mixed-precision (4.4 average bits).

The results stopped us cold:

| Condition | Perplexity | Model Size | Compression |
| --- | --- | --- | --- |
| BF16 baseline | 11.344 | 58.2 GB | 1.0× |
| FP16 baseline | 11.208 | 55.8 GB | 1.04× |
| SWAN v3 (4.4 avg bits) | 9.930 | 15.9 GB | 3.66× |

Read that again. The model compressed to one quarter of its original size has lower perplexity than the full-precision version. Not slightly lower — 12.5% lower. This should not happen. Quantization destroys information. You cannot add noise to a model and get better predictions.

Unless the metric is lying to you.

Chasing the Anomaly

Our first instinct was to suspect a bug. We re-ran the evaluation three times, verified the model weights, and checked the tokeniser. Identical results every time. The numbers were real.

So we did what any good engineer does with a suspicious average: we looked at the distribution.

We computed per-sequence perplexity for all 256 WikiText-2 test sequences across all three conditions. What we found was stark: under the BF16 model, 5 of the 256 sequences produce catastrophic perplexity — individual per-sequence PPL values in the range of 25,000 to 81,000.

For context, the median sequence perplexity is around 8.7. These 5 outliers are three to four orders of magnitude worse than the typical sequence. And because perplexity is computed as exp(mean(log-loss)), a handful of extreme values can dominate the entire average.
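To see how strongly a handful of extreme sequences can drag the mean, here is a toy sketch. The loss values are synthetic, chosen only to mirror the shape of the distribution described above, not our measured data:

```python
import math

# Illustrative per-sequence log-losses (nats): 251 typical sequences near the
# median PPL of ~8.7, plus 5 catastrophic outliers near PPL 50,000.
log_losses = [math.log(8.7)] * 251 + [math.log(50_000)] * 5

# Standard corpus perplexity: exp of the mean log-loss.
standard_ppl = math.exp(sum(log_losses) / len(log_losses))

# Robust alternative: perplexity of the median sequence.
median_ppl = math.exp(sorted(log_losses)[len(log_losses) // 2])

print(f"standard PPL: {standard_ppl:.2f}")  # ~10.30: lifted ~18% by 2% of sequences
print(f"median PPL:   {median_ppl:.2f}")    # 8.70: unaffected by the outliers
```

Just 5 outliers out of 256 sequences shift the mean-based perplexity by roughly 18% while the median does not move at all.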

Both precision formats — BF16 and FP16 — produce exactly the same 5 catastrophic sequences. These aren’t random failures. They’re deterministic: specific token patterns in WikiText-2 that this model consistently handles catastrophically.

What’s Actually Happening

Quantisation noise acts as an accidental regulariser on these pathological sequences.

When the full-precision model encounters one of these degenerate sequences, some internal dynamic — likely involving extreme attention patterns or saturated activations — causes the predicted probability for the correct next token to collapse to near zero. The log-loss explodes, and perplexity follows.

The 4-bit quantisation grid introduces small perturbations throughout the weight matrices. These perturbations are enough to disrupt whatever fragile pattern causes the model to spiral on those specific sequences. One of the 5 catastrophic sequences drops below the 100 PPL threshold entirely. The remaining outliers still produce high loss, but the peaks are substantially tamed.
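To get a feel for the size of these perturbations, here is a minimal symmetric 4-bit quantise-dequantise round trip. This is a toy per-tensor uniform grid, not SWAN's actual mixed-precision scheme, and the weight tensor is a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(42)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # stand-in weight tensor

def quant_dequant(x, bits=4):
    """Snap values to a symmetric uniform grid with 2**(bits-1) - 1 positive levels."""
    levels = 2 ** (bits - 1) - 1        # 7 for 4-bit signed
    scale = np.abs(x).max() / levels    # per-tensor scale
    return np.round(x / scale) * scale

noise = quant_dequant(w) - w
print(f"max |perturbation|: {np.abs(noise).max():.5f}")  # bounded by scale / 2
print(f"relative RMS error: {np.linalg.norm(noise) / np.linalg.norm(w):.3f}")
```

Each weight moves by at most half a grid step. Perturbations that small are normally pure loss; here they also happen to knock the model off the narrow trajectory that produces the catastrophic sequences.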

This is not quantisation improving the model. It’s quantisation noise accidentally preventing a rare, deterministic failure mode. The model is still worse on the other 251 sequences — as you’d expect from lossy compression. But the arithmetic mean doesn’t care about 251 modestly worse sequences when 5 catastrophic ones improve by orders of magnitude.

The Honest Numbers

Once you use robust statistics that aren’t dominated by 2% of the data, the real picture emerges:

| Condition | Standard PPL | Median PPL | Trimmed Mean PPL | Outliers (PPL > 100) |
| --- | --- | --- | --- | --- |
| BF16 | 11.344 | 8.706 | 8.689 | 5 |
| FP16 | 11.208 | 8.609 | 8.647 | 5 |
| SWAN v3 | 9.930 | 9.084 | 9.177 | 4 |

Median perplexity tells the true story: SWAN v3 is 4.3% worse than BF16 (9.084 vs 8.706). That’s exactly what you’d expect from 3.66× compression with sensitivity-aware bit allocation. No anomaly. No magic. Just honest lossy compression doing its job.

The gap between standard and median PPL — 2.6 points for BF16, but only 0.8 points for SWAN — is itself informative. It quantifies exactly how much of the mean is being distorted by outlier sequences.
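A small helper like the following computes all four of these statistics from per-sequence perplexities. This is an illustrative sketch, not our pipeline code; the trim fraction and outlier threshold are assumptions chosen to match the ~2% outlier rate discussed in the text:

```python
import numpy as np

def ppl_report(seq_ppls, trim_frac=0.02, outlier_threshold=100.0):
    """Fragile and robust perplexity summaries for one evaluation run."""
    ppl = np.asarray(seq_ppls, dtype=np.float64)
    log_loss = np.log(ppl)

    k = int(len(ppl) * trim_frac)  # sequences to drop from each tail
    trimmed = np.sort(ppl)[k:len(ppl) - k] if k else ppl

    return {
        "standard": float(np.exp(log_loss.mean())),  # exp(mean log-loss)
        "median": float(np.median(ppl)),
        "trimmed_mean": float(trimmed.mean()),
        "outliers": int((ppl > outlier_threshold).sum()),
    }

# Synthetic distribution echoing the anomaly: 251 ordinary sequences, 5 extreme.
report = ppl_report([8.7] * 251 + [25_000, 40_000, 55_000, 70_000, 81_000])
print(report)
```

On this synthetic run the standard PPL lands above 10 while the median and trimmed mean both sit at 8.7, reproducing the gap in miniature.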

Why This Matters for the Community

Perplexity is the lingua franca of quantisation evaluation. It’s in every paper, every model card, every community benchmark. And it is fragile.

Standard mean perplexity can be dominated by a tiny fraction of evaluation sequences. If your model has even a few pathological failure modes on the test set — and many models do — the mean becomes unreliable as a quality signal. Worse, quantisation can accidentally improve the metric by perturbing these failure modes, creating the illusion that compression is free or even beneficial.

This isn’t hypothetical. It happened to us with a production evaluation pipeline on real models. If we hadn’t investigated, we might have published a headline claiming SWAN improves model quality at 3.66× compression. It would have been wrong.

Practical Recommendations

For anyone evaluating quantised models, three concrete steps:

1. Look at the per-sequence perplexity distribution, not just the corpus mean. A histogram, or the median reported alongside the mean, is enough to expose outlier domination.
2. Count outlier sequences (e.g. per-sequence PPL > 100) in every condition. If the counts differ between baseline and quantised runs, the mean comparison is suspect.
3. Compare conditions on robust statistics such as the median or a trimmed mean before drawing conclusions from the standard mean.

The Deeper Pattern: Outliers All the Way Down

There’s an irony in this finding that connects directly to SWAN’s core approach.

SWAN’s strongest single predictor of quantisation sensitivity is excess kurtosis — the degree to which a weight tensor’s distribution has heavy tails and outliers. Across 2,347 tensors in a 400B parameter model, kurtosis correlates with actual reconstruction error at ρ = 0.80. Tensors with outlier-heavy distributions are genuinely harder to quantise because the quantisation grid must expand to accommodate extreme values, reducing precision for everything else.
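Excess kurtosis is cheap to compute per tensor for anyone who wants to probe this on their own checkpoints. A sketch, using synthetic tensors in place of real weights (SWAN's actual feature pipeline may differ):

```python
import numpy as np

def excess_kurtosis(tensor):
    """Fisher excess kurtosis: 0 for a Gaussian, large for heavy-tailed tensors."""
    x = np.asarray(tensor, dtype=np.float64).ravel()
    z = (x - x.mean()) / x.std()
    return float((z ** 4).mean() - 3.0)

rng = np.random.default_rng(0)
gaussian_like = rng.normal(size=100_000)           # well-behaved: easy to quantise
heavy_tailed = rng.standard_t(df=3, size=100_000)  # outlier-heavy: needs more bits

print(f"Gaussian-like: {excess_kurtosis(gaussian_like):+.2f}")  # near 0
print(f"heavy-tailed:  {excess_kurtosis(heavy_tailed):+.2f}")   # large and positive
```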

The perplexity anomaly is the same phenomenon, one level up. Just as outlier weights distort quantisation grids, outlier sequences distort evaluation metrics. The same statistical fragility that makes quantisation hard also makes measuring quantisation quality unreliable.

SWAN addresses the weight-level problem by allocating more bits to high-kurtosis tensors. The evaluation-level problem requires a similar solution: using robust statistics (median, trimmed mean) instead of metrics that let 2% of the data dominate the other 98%.

What We Changed

After discovering this anomaly, we updated our evaluation pipeline to report three metrics for every perplexity measurement:

1. Standard mean perplexity, for comparability with published results.
2. Median per-sequence perplexity, which is robust to outlier sequences.
3. Trimmed mean perplexity, which discards the most extreme sequences before averaging.

We also log the count of outlier sequences (PPL > 100) for each run.

We recommend any quantisation benchmark adopt the same practice. It costs nothing to compute and protects against exactly this class of evaluation artifact.

The Bottom Line

Quantisation does not improve models. If your numbers say otherwise, the numbers are wrong — or more precisely, the statistic you’re using is not measuring what you think it’s measuring.

SWAN v3 on GLM-4.7-Flash achieves 3.66× compression with 4.3% perplexity degradation. That’s a legitimate, useful result. The 12.5% “improvement” was a mirage created by outlier sequences and a fragile metric.

In quantisation, as in everything else: if it looks too good to be true, measure it differently.

All evaluation data, per-sequence analysis, and the corrected comparison are available in our open-source repository. Hardware: Apple M2 Ultra, 192 GB unified memory. Software: MLX 0.30.3, Python 3.12.

Need deep AI expertise to get your models into production?

Black Sheep AI brings deep expertise in model quantisation, mixed-precision optimisation, and production AI systems. We help teams extract maximum intelligence from minimum hardware — using techniques like SWAN that go beyond one-size-fits-all compression.
