Mean Perplexity Is Lying to You
RAM Research

March 2026 · Black Sheep AI Research

The metric the entire field uses to evaluate quantized models can rank the best model as the worst. We found five sequences that flip the entire leaderboard.

The Metric Everyone Trusts

Standard perplexity evaluation computes token-weighted cross-entropy averaged across evaluation sequences. It is the de facto metric for comparing quantized models. Every paper reports it. Every leaderboard sorts by it. And on at least one model family, it produces a completely inverted quality ranking.
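Concretely, the two quantities compared throughout this post can be sketched as follows. This is a minimal illustration assuming per-sequence mean negative log-likelihoods in nats per token; the function names and input format are ours, not the RAM evaluation harness's:

```python
import math
import statistics

def corpus_ppl(seq_nlls, seq_lens):
    # Standard "mean perplexity": exponentiated token-weighted mean
    # negative log-likelihood across all evaluation sequences.
    total_nll = sum(nll * n for nll, n in zip(seq_nlls, seq_lens))
    return math.exp(total_nll / sum(seq_lens))

def median_seq_ppl(seq_nlls):
    # Robust alternative: median of the per-sequence perplexities.
    return statistics.median(math.exp(nll) for nll in seq_nlls)
```

When the per-sequence loss distribution is well behaved, the two agree closely; the rest of this post is about what happens when it is not.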

The Inversion

On GLM-4.7-Flash (31.2B dense), mean perplexity gives a ranking that is exactly backwards:

| Condition | Mean PPL | Median PPL | Outliers (PPL > 100) |
|---|---|---|---|
| BF16 (unquantized) | 11.344 | 8.706 | 5 |
| RAM v1 (best) | 9.930 | 9.084 | 4 |
| RAM (15.8 GB) | 9.427 | 9.210 | 0 |

By mean PPL, BF16 appears worst (11.344) and RAM appears best (9.427). By median PPL, the correct ordering emerges: BF16 is best (8.706), then v1 (9.084), then RAM (9.210). The mean and median give opposite rankings.

Five Sequences That Flip the Leaderboard

The cause: five catastrophic outlier sequences where BF16 produces per-sequence perplexity values of 25,000–81,000. These are sequences that the unquantized model handles pathologically, likely degenerate or adversarial text patterns in the WikiText-2 test split. Quantization noise acts as implicit regularization that stabilizes these pathological sequences to PPL 100–360. The five outliers inflate BF16’s mean by 30% while barely moving the median.
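To see how a handful of tail sequences can flip the ordering, here is a toy reconstruction. The per-sequence losses below are invented to mirror the shape of the data (128 sequences, a few catastrophic outliers), not the paper's actual values:

```python
import math
import statistics

def mean_ppl(nlls):
    # Token-weighted mean NLL, exponentiated; with equal-length
    # sequences this reduces to a simple average of the losses.
    return math.exp(sum(nlls) / len(nlls))

def median_ppl(nlls):
    return statistics.median(math.exp(x) for x in nlls)

# Illustrative losses (nats/token): 123 "normal" sequences near PPL 8.7
# plus 5 catastrophic outliers near PPL 50,000 for the BF16 model.
bf16 = [math.log(8.7)] * 123 + [math.log(50_000)] * 5
# Quantization noise stabilizes the outliers (PPL ~200) while slightly
# degrading the typical sequence (PPL ~9.2).
quant = [math.log(9.2)] * 123 + [math.log(200)] * 5

print(mean_ppl(bf16), mean_ppl(quant))      # mean ranks BF16 worse
print(median_ppl(bf16), median_ppl(quant))  # median ranks BF16 better
```

Even though the quantized model is worse on 123 of 128 sequences, its mean perplexity comes out lower: the five outliers dominate the average but leave the median untouched.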

It’s Not Just GLM

The effect appears across models, though less dramatically:

| Model | Condition | Mean PPL | Median PPL |
|---|---|---|---|
| Qwen3-30B-A3B | RAM (16 GB) | 8.930 | 8.971 |
| Qwen3-30B-A3B | RAM (19 GB) | 8.782 | 8.798 |
| Qwen3-30B-A3B | v1 (best) | 8.924 | 8.974 |
| Llama-4-Scout | RAM (64 GB) | 7.703 | 8.070 |
| Llama-4-Scout | RAM (192 GB) | 7.359 | 7.691 |

On Qwen3-30B-A3B, the gap is subtle but consequential: RAM at 16 GB has a mean PPL of 8.930 (apparently worse than v1’s 8.924) but a median of 8.971 (actually better than v1’s 8.974), while being 2.6% smaller. The conclusion flips depending on which metric you read.

On Llama-4-Scout, mean and median diverge by up to 0.33 PPL points (7.359 vs 7.691 at 192 GB). This divergence makes quality comparisons unreliable when only mean is reported.

What to Report Instead

The RAM paper recommends reporting:

  1. Standard corpus perplexity (token-weighted cross-entropy) as the primary metric
  2. Median per-sequence PPL as a robustness check
  3. Tail percentiles (P95, P99) to flag heavy-tailed distributions
  4. Outlier counts (sequences with PPL > 100) as a diagnostic
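The four recommendations above can be computed together in a few lines. This sketch again assumes per-sequence mean NLLs in nats per token; the function name and return format are illustrative, not taken from the paper:

```python
import math
import statistics

def ppl_report(seq_nlls, seq_lens, outlier_threshold=100.0):
    """Summarize an eval run along the four recommended axes."""
    ppls = sorted(math.exp(nll) for nll in seq_nlls)
    cuts = statistics.quantiles(ppls, n=100)  # 99 percentile cut points
    total_nll = sum(nll * n for nll, n in zip(seq_nlls, seq_lens))
    return {
        "corpus_ppl": math.exp(total_nll / sum(seq_lens)),  # primary metric
        "median_ppl": statistics.median(ppls),              # robustness check
        "p95": cuts[94],                                    # tail percentiles
        "p99": cuts[98],
        "outlier_count": sum(p > outlier_threshold for p in ppls),
    }
```

A large gap between `corpus_ppl` and `median_ppl`, or a nonzero `outlier_count`, is exactly the heavy-tail warning sign discussed next.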

When mean and median diverge, the per-sequence loss distribution is heavy-tailed. In this regime, mean PPL can produce misleading orderings. Reporting both gives a complete picture.

Why This Matters Beyond Benchmarks

If you are selecting a quantization method based on reported perplexity, and the evaluation only reports mean PPL, you may be choosing the wrong method. If you are publishing a quantization paper and only reporting mean PPL, your quality ordering may not reflect reality. If you are deploying a quantized model and the “improvement” in perplexity is driven by outlier stabilization rather than genuine quality gains, your production system may not perform as expected.

The fix is simple: report both metrics. It costs one extra line in your evaluation script and prevents an entire class of misleading conclusions.


Data from the RAM paper: “RAM: Budget-Aware Proprietary Compression for Large Language Models via Rate-Distortion Optimization” (baa.ai, 2026). All evaluations use WikiText-2 test split, sequence length 2048, 128 sequences, seed 42. Full paper available at huggingface.co/spaces/baa-ai/RAM. Code available at github.com/baa-ai/RAM.

Read the Full Paper

The complete RAM paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full optimal allocation framework, is available on our HuggingFace:

RAM: Compute-Optimal Proprietary Compression for LLMs, Full Paper

huggingface.co/spaces/baa-ai/RAM

Licensed under CC BY-NC-ND 4.0
