Mean Perplexity Is Lying to You
RAM Research

March 2026 · Black Sheep AI Research

The metric the field relies on to evaluate quantized models can rank the best model as the worst. On one model, five sequences are enough to flip the leaderboard.

The Metric Everyone Trusts

Standard perplexity evaluation averages token-level cross-entropy over the evaluation sequences and exponentiates the result. It's the de facto metric for comparing quantized models. Every paper reports it. Every leaderboard sorts by it. And on at least one model family, it produces a completely backwards quality ranking.
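For reference, the quantities compared in this post can be written out as follows. This is a sketch that assumes "mean PPL" refers to standard corpus perplexity. The symbols S (number of sequences) and N_i (tokens in sequence i) are notation introduced here, not taken from the paper.

```latex
\begin{align*}
\mathrm{CE}_i &= -\frac{1}{N_i} \sum_{t=1}^{N_i} \log p\left(x_{i,t} \mid x_{i,<t}\right),
\qquad \mathrm{PPL}_i = \exp(\mathrm{CE}_i) \\
\mathrm{PPL}_{\text{corpus}} &= \exp\!\left( \frac{\sum_{i=1}^{S} N_i \, \mathrm{CE}_i}{\sum_{i=1}^{S} N_i} \right) \\
\mathrm{PPL}_{\text{median}} &= \operatorname{median}_{i} \, \mathrm{PPL}_i
\end{align*}
```

When every sequence has the same length, corpus perplexity is the geometric mean of the per-sequence values, so a few very large per-sequence losses still pull it upward. The median depends only on rank order and ignores how extreme the tail is.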

The Inversion

On GLM-4.7-Flash (31.2B dense), mean perplexity gives a ranking that's exactly wrong:

Condition            Mean PPL   Median PPL   Outliers (PPL > 100)
BF16 (unquantized)   11.344     8.706        5
RAM v1 (best)        9.930      9.084        4
RAM (15.8 GB)        9.427      9.210        0

By mean PPL, BF16 looks worst (11.344) and RAM looks best (9.427). By median PPL, the correct ordering shows up: BF16 is best (8.706), then v1 (9.084), then RAM (9.210). Mean and median give opposite rankings.
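The two orderings can be checked mechanically. The sketch below copies the three rows from the GLM-4.7-Flash table and sorts them by each metric. The dictionary layout and the helper name rank_by are illustrative choices, not part of the RAM codebase.

```python
# Values copied from the GLM-4.7-Flash table in this post.
results = {
    "BF16 (unquantized)": {"mean_ppl": 11.344, "median_ppl": 8.706},
    "RAM v1 (best)": {"mean_ppl": 9.930, "median_ppl": 9.084},
    "RAM (15.8 GB)": {"mean_ppl": 9.427, "median_ppl": 9.210},
}


def rank_by(metric):
    """Return condition names sorted from best (lowest PPL) to worst."""
    return sorted(results, key=lambda name: results[name][metric])


by_mean = rank_by("mean_ppl")
by_median = rank_by("median_ppl")

print("ranked by mean:  ", by_mean)
print("ranked by median:", by_median)
print("exactly reversed:", by_mean == by_median[::-1])
```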

Five Sequences That Flip the Leaderboard

The culprit: five catastrophic outlier sequences where BF16 produces per-sequence perplexity values of 25,000 to 81,000. These are sequences the unquantized model handles pathologically, likely degenerate or adversarial text patterns in the WikiText-2 test split. Quantization noise acts as accidental regularization that calms these pathological sequences down to PPL 100 to 360. Those five outliers inflate BF16's mean by roughly 30% while barely nudging the median, because the median depends only on rank order and not on how extreme the tail is.
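To see how five sequences out of 128 can move the headline number this much, here is a minimal simulation. It uses synthetic per-sequence losses, not the paper's data. It assumes equal-length sequences and that "mean PPL" is the exponentiated mean cross-entropy. The distribution parameters and the five outlier values are invented for illustration, chosen only to fall in the 25,000 to 81,000 range quoted above.

```python
import math
import random
import statistics

random.seed(0)
N_SEQ = 128  # same number of sequences as the evaluation setup

# Synthetic per-sequence cross-entropy in nats/token, centered near ln(8.7).
# Made-up numbers for illustration only.
clean_ce = [random.gauss(math.log(8.7), 0.15) for _ in range(N_SEQ)]


def corpus_ppl(ce):
    # With equal-length sequences, the token-weighted mean CE is the plain mean.
    return math.exp(sum(ce) / len(ce))


def median_ppl(ce):
    return statistics.median(math.exp(c) for c in ce)


# Overwrite five sequences with catastrophic losses (hypothetical values).
outlier_ppls = [25_000, 35_000, 50_000, 65_000, 81_000]
with_outliers = list(clean_ce)
for i, ppl in enumerate(outlier_ppls):
    with_outliers[i] = math.log(ppl)

for label, ce in [("clean", clean_ce), ("with 5 outliers", with_outliers)]:
    print(f"{label:>16}: corpus PPL {corpus_ppl(ce):7.3f}, "
          f"median PPL {median_ppl(ce):6.3f}")
```

In this toy setup the corpus figure rises by tens of percent while the median shifts only slightly, which is the same qualitative pattern as in the GLM table.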

It's Not Just GLM

The effect shows up across models, though less dramatically:

Model           Condition      Mean PPL   Median PPL
Qwen3-30B-A3B   RAM (16 GB)    8.930      8.971
Qwen3-30B-A3B   RAM (19 GB)    8.782      8.798
Qwen3-30B-A3B   v1 (best)      8.924      8.974
Llama-4-Scout   RAM (64 GB)    7.703      8.070
Llama-4-Scout   RAM (192 GB)   7.359      7.691

On Qwen3-30B-A3B, the gap is subtle but it matters. RAM at 16 GB has mean 8.930 (looks worse than v1's 8.924) but median 8.971 (actually better than v1's 8.974) at 2.6% smaller size. Your conclusion flips depending on which metric you read.

On Llama-4-Scout, mean and median diverge by 0.33 to 0.37 PPL points (7.359 vs 7.691 at 192 GB, 7.703 vs 8.070 at 64 GB). That kind of divergence makes quality comparisons unreliable when only mean gets reported.
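The mean-to-median gaps can be recomputed directly from the table. The snippet below is a small sketch that copies the five rows and prints the difference for each. The variable names are illustrative.

```python
# Rows copied from the cross-model table: (model, condition, mean PPL, median PPL).
rows = [
    ("Qwen3-30B-A3B", "RAM (16 GB)", 8.930, 8.971),
    ("Qwen3-30B-A3B", "RAM (19 GB)", 8.782, 8.798),
    ("Qwen3-30B-A3B", "v1 (best)", 8.924, 8.974),
    ("Llama-4-Scout", "RAM (64 GB)", 7.703, 8.070),
    ("Llama-4-Scout", "RAM (192 GB)", 7.359, 7.691),
]

# Gap = median minus mean, rounded to the table's three decimals.
gaps = {(model, cond): round(median - mean, 3) for model, cond, mean, median in rows}

for (model, cond), gap in gaps.items():
    print(f"{model:<14} {cond:<13} median - mean = {gap:+.3f}")
```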

What to Report Instead

The RAM paper recommends reporting:

  1. Standard corpus perplexity (token-weighted cross-entropy) as the primary metric
  2. Median per-sequence PPL as a robustness check
  3. Tail percentiles (P95, P99) to flag heavy-tailed distributions
  4. Outlier counts (sequences with PPL > 100) as a diagnostic

When mean and median diverge, the per-sequence loss distribution is heavy-tailed. In that regime, mean PPL can produce misleading orderings. Reporting both gives the complete picture.
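The four recommended numbers can come out of one small helper. The sketch below is not the RAM paper's evaluation code. It assumes you already have, for each sequence, the summed negative log-likelihood in nats and the number of scored tokens. The function name ppl_report, its arguments, and the nearest-rank percentile rule are all choices made for this example.

```python
import math
import statistics


def ppl_report(seq_nll_sums, seq_token_counts, outlier_threshold=100.0):
    """Summarize perplexity from per-sequence summed NLL (nats) and token counts."""
    # Per-sequence perplexity, sorted so percentiles can be read off by rank.
    per_seq_ppl = sorted(
        math.exp(nll / n) for nll, n in zip(seq_nll_sums, seq_token_counts)
    )

    def percentile(q):
        # Nearest-rank percentile: smallest value with at least q% of data at or below it.
        k = max(0, math.ceil(q / 100 * len(per_seq_ppl)) - 1)
        return per_seq_ppl[k]

    return {
        # 1. Token-weighted corpus perplexity.
        "corpus_ppl": math.exp(sum(seq_nll_sums) / sum(seq_token_counts)),
        # 2. Median per-sequence perplexity.
        "median_ppl": statistics.median(per_seq_ppl),
        # 3. Tail percentiles.
        "p95_ppl": percentile(95),
        "p99_ppl": percentile(99),
        # 4. Count of sequences above the outlier threshold.
        "outliers": sum(p > outlier_threshold for p in per_seq_ppl),
    }


# Toy usage: four 10-token sequences, one of them pathological.
toy_counts = [10, 10, 10, 10]
toy_nll = [10 * math.log(p) for p in (8, 9, 10, 1000)]
report = ppl_report(toy_nll, toy_counts)
for key, value in report.items():
    print(f"{key:>11}: {value}")
```

A large gap between corpus_ppl and median_ppl, or a nonzero outlier count, is the signal to look at the per-sequence distribution before trusting a ranking.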

Why This Matters Beyond Benchmarks

If you're picking a quantization method based on reported perplexity and the evaluation only reports mean PPL, you may be choosing the wrong method. If you're publishing a quantization paper and only reporting mean PPL, your quality ordering may not reflect reality. And if you're deploying a quantized model where the "improvement" in perplexity is driven by outlier stabilization rather than genuine quality gains, your production system may not perform as expected.

The fix is simple. Report both metrics. It costs one extra line in your evaluation script and prevents an entire class of misleading conclusions.


Data from the RAM paper: "RAM: Budget-Aware Proprietary Compression for Large Language Models via Rate-Distortion Optimization" (baa.ai, 2026). All evaluations use WikiText-2 test split, sequence length 2048, 128 sequences, seed 42. Full paper available at huggingface.co/spaces/baa-ai/RAM. Code available at github.com/baa-ai/RAM.

Read the Full Paper

The full RAM paper covers formal derivations, benchmark results across 7 model families and 40,000+ questions, and the optimal allocation framework. It's on our HuggingFace:

RAM: Compute-Optimal Proprietary Compression for LLMs, Full Paper

huggingface.co/spaces/baa-ai/RAM

Licensed under CC BY-NC-ND 4.0
