Mean Perplexity Is Lying to You

The metric the entire field uses to evaluate quantized models can rank the best model as the worst. We found five sequences that flip the entire leaderboard.

The Metric Everyone Trusts

Standard perplexity evaluation exponentiates the token-weighted cross-entropy averaged across evaluation sequences; that is the mean PPL in the tables below. It's the de facto metric for comparing quantized models. Every paper reports it. Every leaderboard sorts by it. And on at least one model family, it produces a completely backwards quality ranking.

The Inversion

On GLM-4.7-Flash (31.2B dense), mean perplexity gives a ranking that's exactly wrong:

Condition	Mean PPL	Median PPL	Outliers (>100)
BF16 (unquantized)	11.344	8.706	5
RAM v1 (best)	9.930	9.084	4
RAM (15.8 GB)	9.427	9.210	0

By mean PPL, BF16 looks worst (11.344) and RAM looks best (9.427). By median PPL, the correct ordering shows up: BF16 is best (8.706), then v1 (9.084), then RAM (9.210). Mean and median give opposite rankings.
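The inversion can be checked directly from the table. The short Python sketch below only re-sorts the three reported rows; the dictionary layout and the helper name rank_by are illustrative choices, not part of the RAM evaluation code.

```python
# (mean PPL, median PPL) copied from the GLM-4.7-Flash table above.
results = {
    "BF16 (unquantized)": (11.344, 8.706),
    "RAM v1 (best)": (9.930, 9.084),
    "RAM (15.8 GB)": (9.427, 9.210),
}


def rank_by(column):
    """Return condition names ordered from best (lowest PPL) to worst."""
    return sorted(results, key=lambda name: results[name][column])


by_mean = rank_by(0)    # column 0 = mean PPL
by_median = rank_by(1)  # column 1 = median PPL

print("ranked by mean:  ", by_mean)
print("ranked by median:", by_median)
# True when one ordering is the exact reverse of the other.
print("fully inverted:  ", by_mean == by_median[::-1])
```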

Five Sequences That Flip the Leaderboard

The culprit: five catastrophic outlier sequences where BF16 produces per-sequence perplexity values of 25,000 to 81,000. These are sequences the unquantized model handles pathologically, likely degenerate or adversarial text patterns in the WikiText-2 test split. Quantization noise acts as accidental regularization that calms these pathological sequences down to PPL 100 to 360. Those five outliers inflate BF16's mean by 30% while barely nudging the median.
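A rough sketch of the mechanism, using synthetic per-sequence values rather than the paper's data. It assumes the reported mean is the exponential of the average per-token loss, which for equal-length sequences is the exponential of the mean log per-sequence PPL. The name corpus_ppl and every number in the lists are made up for illustration.

```python
import math
import statistics


def corpus_ppl(seq_ppls):
    """exp(mean per-sequence log-PPL); matches token-weighted PPL
    when every sequence has the same number of tokens (assumed here)."""
    return math.exp(sum(math.log(p) for p in seq_ppls) / len(seq_ppls))


# Synthetic "unquantized" run: 123 ordinary sequences plus five
# catastrophic ones in the 25,000 to 81,000 range.
baseline = [8.7] * 123 + [25_000, 40_000, 50_000, 60_000, 81_000]

# Synthetic "quantized" run: the bulk is slightly worse (9.2), but the
# same five sequences are tamed to the 100 to 360 range.
quantized = [9.2] * 123 + [100, 150, 200, 300, 360]

for name, ppls in (("baseline", baseline), ("quantized", quantized)):
    print(f"{name:9s} mean-style PPL = {corpus_ppl(ppls):.3f}  "
          f"median PPL = {statistics.median(ppls):.3f}")
```

In this toy setup the baseline has the better median but the worse mean-style PPL, which is the same pattern as the GLM-4.7-Flash table: five sequences out of 128 are enough to reverse the ordering.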

It's Not Just GLM

The effect shows up across models, though less dramatically:

Model	Condition	Mean PPL	Median PPL
Qwen3-30B-A3B	RAM (16 GB)	8.930	8.971
Qwen3-30B-A3B	RAM (19 GB)	8.782	8.798
Qwen3-30B-A3B	v1 (best)	8.924	8.974
Llama-4-Scout	RAM (64 GB)	7.703	8.070
Llama-4-Scout	RAM (192 GB)	7.359	7.691

On Qwen3-30B-A3B, the gap is subtle but it matters. RAM at 16 GB has mean 8.930 (looks worse than v1's 8.924) but median 8.971 (actually better than v1's 8.974) at 2.6% smaller size. Your conclusion flips depending on which metric you read.

On Llama-4-Scout, mean and median diverge by up to 0.37 PPL points (7.703 vs 8.070 at 64 GB; the gap at 192 GB is 0.33, 7.359 vs 7.691). A divergence of that size makes quality comparisons unreliable when only the mean is reported.

What to Report Instead

The RAM paper recommends reporting:

Standard corpus perplexity (token-weighted cross-entropy) as the primary metric
Median per-sequence PPL as a robustness check
Tail percentiles (P95, P99) to flag heavy-tailed distributions
Outlier counts (sequences with PPL > 100) as a diagnostic

When mean and median diverge, the per-sequence loss distribution is heavy-tailed. In that regime, mean PPL can produce misleading orderings. Reporting both gives the complete picture.
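The four recommended numbers can be produced from the same per-sequence losses an evaluation script already has. The following is a minimal sketch under stated assumptions, not the RAM paper's code: the function name ppl_report, its argument names, and the synthetic inputs are hypothetical, and it assumes you have the summed negative log-likelihood (in nats) and token count for each sequence.

```python
import math
import statistics


def ppl_report(seq_nll_sums, seq_token_counts, outlier_threshold=100.0):
    """Summarize perplexity from per-sequence summed NLL and token counts."""
    # Per-sequence PPL: exp of that sequence's mean per-token NLL.
    per_seq = [math.exp(nll / n)
               for nll, n in zip(seq_nll_sums, seq_token_counts)]
    # 99 cut points; index 94 is the 95th percentile, index 98 the 99th.
    cuts = statistics.quantiles(per_seq, n=100, method="inclusive")
    return {
        # Token-weighted corpus perplexity (the primary metric).
        "corpus_ppl": math.exp(sum(seq_nll_sums) / sum(seq_token_counts)),
        "median_ppl": statistics.median(per_seq),
        "p95_ppl": cuts[94],
        "p99_ppl": cuts[98],
        "outliers": sum(p > outlier_threshold for p in per_seq),
    }


# Synthetic example: 126 ordinary sequences and 2 pathological ones,
# each 2048 tokens long.
counts = [2048] * 128
nll_sums = [2.2 * 2048] * 126 + [10.5 * 2048, 11.0 * 2048]
report = ppl_report(nll_sums, counts)
for key, value in report.items():
    print(f"{key:11s} {value}")
```

A gap between corpus_ppl and median_ppl, together with a nonzero outlier count, is the signal that the ranking should not be read from the mean alone.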

Why This Matters Beyond Benchmarks

If you're picking a quantization method based on reported perplexity and the evaluation only reports mean PPL, you may be choosing the wrong method. If you're publishing a quantization paper and only reporting mean PPL, your quality ordering may not reflect reality. And if you're deploying a quantized model where the "improvement" in perplexity is driven by outlier stabilization rather than genuine quality gains, your production system may not perform as expected.

The fix is simple. Report both metrics. It costs one extra line in your evaluation script and prevents an entire class of misleading conclusions.

Data from the RAM paper: "RAM: Budget-Aware Proprietary Compression for Large Language Models via Rate-Distortion Optimization" (baa.ai, 2026). All evaluations use WikiText-2 test split, sequence length 2048, 128 sequences, seed 42. Full paper available at huggingface.co/spaces/baa-ai/RAM. Code available at github.com/baa-ai/RAM.

Read the Full Paper

The full RAM paper covers formal derivations, benchmark results across 7 model families and 40,000+ questions, and the optimal allocation framework. It's on our HuggingFace:

RAM: Compute-Optimal Proprietary Compression for LLMs, Full Paper

huggingface.co/spaces/baa-ai/RAM

Licensed under CC BY-NC-ND 4.0
