The metric the entire field uses to evaluate quantized models can rank the best model as the worst. We found five sequences that flip the entire leaderboard.
The Metric Everyone Trusts
Standard perplexity evaluation averages token-weighted cross-entropy across the evaluation sequences and exponentiates the result. It's the de facto metric for comparing quantized models. Every paper reports it. Every leaderboard sorts by it. And on at least one model family, it produces a completely backwards quality ranking.
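To make the definition concrete, here is a minimal sketch of that computation. It assumes you already have, for each evaluation sequence, the summed negative log-likelihood in nats and the number of scored tokens. The function name and the toy numbers are illustrative, not taken from the RAM codebase.

```python
import math

def corpus_perplexity(seq_nll_sums, seq_token_counts):
    """exp of the token-weighted mean cross-entropy (nats) over all sequences."""
    total_nll = sum(seq_nll_sums)
    total_tokens = sum(seq_token_counts)
    return math.exp(total_nll / total_tokens)

# Hypothetical input: two sequences with 4 and 6 scored tokens.
nll_sums = [8.0, 15.0]   # summed negative log-likelihood per sequence, in nats
token_counts = [4, 6]
ppl = corpus_perplexity(nll_sums, token_counts)  # exp(23 / 10)
print(f"corpus PPL: {ppl:.3f}")
```

Because every token contributes to a single average, a handful of sequences with very high loss can move the final number even when most sequences are unaffected.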
The Inversion
On GLM-4.7-Flash (31.2B dense), mean perplexity gives a ranking that's exactly wrong:
| Condition | Mean PPL | Median PPL | Outliers (>100) |
|---|---|---|---|
| BF16 (unquantized) | 11.344 | 8.706 | 5 |
| RAM v1 (best) | 9.930 | 9.084 | 4 |
| RAM (15.8 GB) | 9.427 | 9.210 | 0 |
By mean PPL, BF16 looks worst (11.344) and RAM looks best (9.427). By median PPL, the correct ordering shows up: BF16 is best (8.706), then v1 (9.084), then RAM (9.210). Mean and median give opposite rankings.
Five Sequences That Flip the Leaderboard
The culprit: five catastrophic outlier sequences where BF16 produces per-sequence perplexity values of 25,000 to 81,000. These are sequences the unquantized model handles pathologically, likely degenerate or adversarial text patterns in the WikiText-2 test split. Quantization noise acts as accidental regularization that calms these pathological sequences down to PPL 100 to 360. Those five outliers inflate BF16's mean by 30% while barely nudging the median.
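A small synthetic experiment shows the mechanism. It does not reproduce the paper's data: the per-sequence values are made up (123 sequences at PPL 8.7 and 5 at PPL 50,000), and it assumes equal-length sequences and that "mean PPL" is the exponentiated mean cross-entropy described earlier.

```python
import math
import statistics

N_SEQUENCES = 128        # matches the evaluation setup cited in this post
N_OUTLIERS = 5
TYPICAL_PPL = 8.7        # synthetic, roughly the BF16 median
OUTLIER_PPL = 50_000.0   # synthetic, inside the 25,000-81,000 range

seq_ppl = [TYPICAL_PPL] * (N_SEQUENCES - N_OUTLIERS) + [OUTLIER_PPL] * N_OUTLIERS

# With equal-length sequences, token weighting reduces to a plain mean of log-PPL.
mean_ce = sum(math.log(p) for p in seq_ppl) / len(seq_ppl)
aggregate_ppl = math.exp(mean_ce)
median_ppl = statistics.median(seq_ppl)
arithmetic_mean_ppl = statistics.fmean(seq_ppl)

print(f"aggregate PPL (exp of mean CE): {aggregate_ppl:.2f}")
print(f"median per-sequence PPL:        {median_ppl:.2f}")
print(f"arithmetic mean of seq PPL:     {arithmetic_mean_ppl:.1f}")
```

In this toy setup the aggregate lands near 12.2 while the median stays at 8.7, the same kind of gap the GLM table shows. A plain arithmetic mean of per-sequence PPL would be in the thousands, which is why the exponentiated-mean reading is the one assumed here.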
It's Not Just GLM
The effect shows up across models, though less dramatically:
| Model | Condition | Mean PPL | Median PPL |
|---|---|---|---|
| Qwen3-30B-A3B | RAM (16 GB) | 8.930 | 8.971 |
| Qwen3-30B-A3B | RAM (19 GB) | 8.782 | 8.798 |
| Qwen3-30B-A3B | v1 (best) | 8.924 | 8.974 |
| Llama-4-Scout | RAM (64 GB) | 7.703 | 8.070 |
| Llama-4-Scout | RAM (192 GB) | 7.359 | 7.691 |
On Qwen3-30B-A3B, the gap is subtle but it matters. RAM at 16 GB has mean 8.930 (looks worse than v1's 8.924) but median 8.971 (actually better than v1's 8.974), while being 2.6% smaller. Your conclusion flips depending on which metric you read.
On Llama-4-Scout, mean and median diverge by 0.33 to 0.37 PPL points (7.359 vs 7.691 at 192 GB, 7.703 vs 8.070 at 64 GB). That kind of divergence makes quality comparisons unreliable when only mean gets reported.
What to Report Instead
The RAM paper recommends reporting:
- Standard corpus perplexity (token-weighted cross-entropy) as the primary metric
- Median per-sequence PPL as a robustness check
- Tail percentiles (P95, P99) to flag heavy-tailed distributions
- Outlier counts (sequences with PPL > 100) as a diagnostic
When mean and median diverge, the per-sequence loss distribution is heavy-tailed. In that regime, mean PPL can produce misleading orderings. Reporting both gives the complete picture.
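The four recommended numbers can be produced from the same per-sequence statistics. The following is a sketch, not the paper's evaluation script: the function names are hypothetical, the inputs are assumed to be per-sequence summed negative log-likelihoods (nats) and token counts, and percentiles use the simple nearest-rank definition.

```python
import math
import statistics

def nearest_rank_percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; pct is an integer in 1..100."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def perplexity_report(seq_nll_sums, seq_token_counts, outlier_threshold=100.0):
    """Corpus PPL plus median, tail percentiles, and outlier count of per-sequence PPL."""
    seq_ppl = sorted(
        math.exp(nll / n) for nll, n in zip(seq_nll_sums, seq_token_counts)
    )
    return {
        "corpus_ppl": math.exp(sum(seq_nll_sums) / sum(seq_token_counts)),
        "median_seq_ppl": statistics.median(seq_ppl),
        "p95_seq_ppl": nearest_rank_percentile(seq_ppl, 95),
        "p99_seq_ppl": nearest_rank_percentile(seq_ppl, 99),
        "outliers": sum(p > outlier_threshold for p in seq_ppl),
    }

# Synthetic input: 98 ordinary sequences and 2 pathological ones, 10 tokens each.
nll_sums = [20.0] * 98 + [60.0] * 2
token_counts = [10] * 100
report = perplexity_report(nll_sums, token_counts)
for name, value in report.items():
    print(f"{name}: {value}")
```

A large gap between `corpus_ppl` and `median_seq_ppl`, or a P99 far above P95, is the signal that the distribution is heavy-tailed and the headline number needs a second look.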
Why This Matters Beyond Benchmarks
If you're picking a quantization method based on reported perplexity and the evaluation only reports mean PPL, you may be choosing the wrong method. If you're publishing a quantization paper and only reporting mean PPL, your quality ordering may not reflect reality. And if you're deploying a quantized model where the "improvement" in perplexity is driven by outlier stabilization rather than genuine quality gains, your production system may not perform as expected.
The fix is simple. Report both metrics. It costs one extra line in your evaluation script and prevents an entire class of misleading conclusions.
Data from the RAM paper: "RAM: Budget-Aware Proprietary Compression for Large Language Models via Rate-Distortion Optimization" (baa.ai, 2026). All evaluations use WikiText-2 test split, sequence length 2048, 128 sequences, seed 42. Full paper available at huggingface.co/spaces/baa-ai/RAM. Code available at github.com/baa-ai/RAM.
Read the Full Paper
The full RAM paper covers formal derivations, benchmark results across 7 model families and 40,000+ questions, and the optimal allocation framework. It's on our HuggingFace:
RAM: Compute-Optimal Proprietary Compression for LLMs, Full Paper
huggingface.co/spaces/baa-ai/RAM
Licensed under CC BY-NC-ND 4.0