Mean Perplexity Is Lying to You
MINT Research

March 2026 · Black Sheep AI Research

The metric the entire field uses to evaluate quantized models can rank the best model as the worst. We found five sequences that flip a whole leaderboard.

The Metric Everyone Trusts

Standard perplexity evaluation computes token-weighted cross-entropy across the evaluation sequences and reports its exponential. It is the de facto metric for comparing quantized models. Every paper reports it. Every leaderboard sorts by it. And on at least one model family, it produces a completely inverted quality ranking.
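In code, the standard metric looks roughly like this. This is a minimal sketch, not the MINT evaluation code: `corpus_perplexity` and its inputs (per-sequence mean negative log-likelihoods plus token counts) are illustrative names.

```python
import math

def corpus_perplexity(seq_mean_nlls, seq_token_counts):
    """Token-weighted corpus perplexity: exponentiate the average
    per-token negative log-likelihood over the whole eval set."""
    total_nll = sum(nll * n for nll, n in zip(seq_mean_nlls, seq_token_counts))
    total_tokens = sum(seq_token_counts)
    return math.exp(total_nll / total_tokens)
```

With a uniform per-token loss of log(8), the perplexity comes back as 8.0; the interesting behaviour only appears once a handful of sequences carry extreme losses.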

The Inversion

On GLM-4.7-Flash (31.2B dense), mean perplexity gives a ranking that is exactly backwards:

Condition            Mean PPL   Median PPL   Outliers (>100)
BF16 (unquantized)     11.344        8.706                 5
SWAN v1 (best)          9.930        9.084                 4
MINT (15.8 GB)          9.427        9.210                 0

By mean PPL, BF16 appears worst (11.344) and MINT appears best (9.427). By median PPL, the correct ordering emerges: BF16 is best (8.706), then v1 (9.084), then MINT (9.210). The mean and median give opposite rankings.

Five Sequences That Flip the Leaderboard

The cause: five catastrophic outlier sequences where BF16 produces per-sequence perplexity values of 25,000–81,000. These are sequences that the unquantized model handles pathologically—likely degenerate or adversarial text patterns in the WikiText-2 test split. Quantization noise acts as implicit regularization that stabilizes these pathological sequences to PPL 100–360. The five outliers inflate BF16’s mean by 30% while barely moving the median.
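The arithmetic is easy to reproduce. Here is a sketch with synthetic per-sequence perplexities chosen only to mirror the reported pattern; these are not the actual WikiText-2 numbers.

```python
import math
import statistics

# Synthetic per-sequence perplexities: 123 well-behaved sequences
# near PPL 8.7, plus 5 catastrophic outliers in the 25,000-81,000 range.
typical = [8.7] * 123
outliers = [25_000, 40_000, 55_000, 70_000, 81_000]
per_seq_ppl = typical + outliers

# Corpus-style mean: exponentiate the average log-perplexity
# (equal sequence lengths assumed, so token weighting is uniform).
mean_ppl = math.exp(statistics.fmean([math.log(p) for p in per_seq_ppl]))
median_ppl = statistics.median(per_seq_ppl)

print(f"mean {mean_ppl:.2f}  median {median_ppl:.2f}")
```

Five sequences out of 128 are enough to push the mean well above 12 while the median stays pinned at 8.7, which is exactly the shape of the BF16 column above.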

It’s Not Just GLM

The effect appears across models, though less dramatically:

Model           Condition       Mean PPL   Median PPL
Qwen3-30B-A3B   MINT (16 GB)       8.930        8.971
Qwen3-30B-A3B   MINT (19 GB)       8.782        8.798
Qwen3-30B-A3B   v1 (best)          8.924        8.974
Llama-4-Scout   MINT (64 GB)       7.703        8.070
Llama-4-Scout   MINT (192 GB)      7.359        7.691

On Qwen3-30B-A3B, the gap is subtle but consequential: MINT at 16 GB has mean 8.930 (appears worse than v1’s 8.924) but median 8.971 (actually better than v1’s 8.974)—at 2.6% smaller size. The conclusion flips depending on which metric you read.

On Llama-4-Scout, mean and median diverge by up to 0.33 PPL points (7.359 vs 7.691 at 192 GB). This divergence makes quality comparisons unreliable when only mean is reported.

What to Report Instead

The MINT paper recommends reporting:

  1. Standard corpus perplexity (token-weighted cross-entropy) as the primary metric
  2. Median per-sequence PPL as a robustness check
  3. Tail percentiles (P95, P99) to flag heavy-tailed distributions
  4. Outlier counts (sequences with PPL > 100) as a diagnostic

When mean and median diverge, the per-sequence loss distribution is heavy-tailed. In this regime, mean PPL can produce misleading orderings. Reporting both gives a complete picture.
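All four statistics can be computed from the per-sequence perplexities in a few lines. This sketch uses only the standard library; the `ppl_report` helper is hypothetical, and it assumes equal-length sequences so the corpus mean reduces to the exponential of the mean log-perplexity.

```python
import math
import statistics

def ppl_report(per_seq_ppl, outlier_threshold=100.0):
    """Summarize per-sequence perplexities with the four recommended
    statistics: corpus mean, median, tail percentiles, outlier count."""
    logs = [math.log(p) for p in per_seq_ppl]
    cuts = statistics.quantiles(per_seq_ppl, n=100)  # 99 percentile cut points
    return {
        "mean_ppl": math.exp(statistics.fmean(logs)),
        "median_ppl": statistics.median(per_seq_ppl),
        "p95": cuts[94],
        "p99": cuts[98],
        "outliers": sum(p > outlier_threshold for p in per_seq_ppl),
    }
```

A large gap between `mean_ppl` and `median_ppl`, or a nonzero `outliers` count, is the signal that the distribution is heavy-tailed and the mean alone should not be trusted.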

Why This Matters Beyond Benchmarks

If you are selecting a quantization method based on reported perplexity, and the evaluation only reports mean PPL, you may be choosing the wrong method. If you are publishing a quantization paper and only reporting mean PPL, your quality ordering may not reflect reality. If you are deploying a quantized model and the “improvement” in perplexity is driven by outlier stabilization rather than genuine quality gains, your production system may not perform as expected.

The fix is simple: report both metrics. It costs one extra line in your evaluation script and prevents an entire class of misleading conclusions.


Data from the MINT paper: “MINT: Budget-Aware Data-Free Mixed-Precision Quantization for Large Language Models via Rate-Distortion Optimization” (baa.ai, 2026). All evaluations use WikiText-2 test split, sequence length 2048, 128 sequences, seed 42. Full paper available at baa.ai/articles/24-mint-paper.html. Code available at github.com/baa-ai/MINT.
