If you measure KL divergence between two language models token-by-token over a corpus and report the mean, you are mostly measuring the handful of tokens where the candidate fails catastrophically, not how good the typical prediction is. Switch to the median and the picture can change completely. Here's why, and what to do.
The setup
KL divergence between a reference model (BF16) and a candidate model (quantized, fine-tuned, distilled, anything) is a per-token quantity:
KLD_t = sum_v p_ref(v|context_t) × log( p_ref(v|context_t) / p_cand(v|context_t) )
You can compute it for every token in WikiText-2 (about 66,000 tokens in the standard split), then aggregate across tokens to a single number. The aggregation choice is where things go wrong.
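As a minimal sketch of the per-token quantity, assuming you already have both models' logits as NumPy arrays of shape (num_tokens, vocab_size): the function names and the random toy logits below are illustrative, not part of any particular evaluation harness.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def per_token_kld(ref_logits: np.ndarray, cand_logits: np.ndarray) -> np.ndarray:
    """KL(p_ref || p_cand) in nats, one value per token position."""
    log_p = log_softmax(ref_logits.astype(np.float64))
    log_q = log_softmax(cand_logits.astype(np.float64))
    # sum_v p_ref(v) * (log p_ref(v) - log p_cand(v))
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Toy usage: random logits stand in for real model outputs (assumption).
rng = np.random.default_rng(0)
ref = rng.normal(size=(8, 100))
cand = ref + 0.1 * rng.normal(size=(8, 100))
kld = per_token_kld(ref, cand)  # shape (8,), ready to aggregate
```

Keeping the full per-token array, rather than a running sum, is what makes every aggregation discussed below available afterwards.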
What the distribution actually looks like
Most teams report mean KLD. We measured the per-token KLD distribution between several quantized variants of a 27B model and the BF16 source on WikiText-2. The shape:
| Quantile | KLD (typical) |
|---|---|
| 50th percentile (median) | ~0.001 |
| 75th percentile | ~0.005 |
| 90th percentile | ~0.02 |
| 95th percentile | ~0.06 |
| 99th percentile | ~0.5 |
| 99.9th percentile | ~5.0 |
The distribution is heavy-tailed by 3-4 orders of magnitude. The median is at one-thousandth; the 99.9th percentile is 5,000× higher.
The mean is dominated by the tail. Specifically: the top 1% of tokens (by KLD) contributes about 80% of the mean. If you compare two models by mean KLD, you are comparing how often each one has a catastrophic prediction, not how the model behaves on the typical token.
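To check the tail's share of the mean on your own data, something like the following sketch works. The helper name `top_share_of_mean` is hypothetical, and the lognormal sample is only a synthetic stand-in for a heavy-tailed distribution, not the measured one.

```python
import numpy as np

def top_share_of_mean(kld: np.ndarray, top_fraction: float = 0.01) -> float:
    """Fraction of the total (and so of the mean) contributed by the largest tokens."""
    k = max(1, int(round(top_fraction * kld.size)))
    # After partitioning, the last k entries are the k largest values.
    largest = np.partition(kld, kld.size - k)[kld.size - k:]
    return float(largest.sum() / kld.sum())

# Synthetic heavy-tailed stand-in (assumption): lognormal with median 0.001.
rng = np.random.default_rng(0)
kld = rng.lognormal(mean=np.log(0.001), sigma=2.5, size=66_000)

print("quantiles:", np.quantile(kld, [0.5, 0.9, 0.99, 0.999]))
print("top 1% share of mean:", top_share_of_mean(kld))
```

If the share printed for your real per-token array is far above the 1% that a uniform distribution would give, the mean is telling you about the tail.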
Why this matters for compression comparisons
Imagine two quantized variants of the same source model:
- Variant A: 99% of tokens have KLD < 0.005 (averaging about 0.0025), but on 50 catastrophic tokens out of ~66,000 it spikes to KLD ≈ 5.0 each.
- Variant B: 99% of tokens have KLD ≈ 0.01 (at least twice as bad on the typical token), but no catastrophic spikes: the remaining 1% averages about 0.05, with the worst token at KLD ≈ 0.1.
The mean KLD ranking:
- Variant A: 0.0025 + (50 / 66,000) × 5.0 ≈ 0.0025 + 0.0038 ≈ 0.0063
- Variant B: 0.99 × 0.01 + 0.01 × 0.05 ≈ 0.0104
Mean says A is better. But A is much worse for any user who hits one of those tokens: each of the 50 spikes is a position where A's predicted distribution is badly wrong, which in generation shows up as gibberish or hallucinated content. B is uniformly slightly worse but never catastrophic.
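The median KLD ranking does not catch this either:
- Variant A: median ≈ 0.001 (the 50 spikes don't move the median)
- Variant B: median ≈ 0.01
Median also says A is better, by about 10×, because it ignores the tail entirely. That is an honest statement about the typical token, but it says nothing about the spikes. The mean, meanwhile, is hiding the spike-failure mode by averaging it into the typical-token differences. Only an explicit tail statistic (a high percentile such as p99.9, or a spike rate) shows that A is the variant with catastrophic failures.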
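To see how each statistic reacts to spikes like Variant A's, here is a small synthetic experiment. The uniform baseline is an assumption chosen only for simplicity. It takes one clean array, replaces 50 of its 66,000 values with KLD = 5.0, and compares the aggregates.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 66_000

# Hypothetical baseline: every token between 0 and 0.005, no spikes.
clean = rng.uniform(0.0, 0.005, size=N)

# Same tokens, but 50 of them replaced by catastrophic spikes at KLD = 5.0.
spiky = clean.copy()
spiky[rng.choice(N, size=50, replace=False)] = 5.0

for name, kld in [("clean", clean), ("spiky", spiky)]:
    print(
        f"{name}: mean={kld.mean():.4f} "
        f"median={np.median(kld):.4f} "
        f"spike_rate={(kld > 0.5).mean():.5f}"
    )
```

The spikes more than double the mean while leaving the median essentially where it was; the spike rate is the only one of the three that names the failure directly.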
When this happens in practice
We see the spike-failure mode reliably in:
- Aggressive quantization (≤3-bit average) where some tensors get pushed below their precision floor and the model occasionally outputs out-of-distribution logits.
- Distillation runs where the student approximates the teacher well on common tokens but breaks on rare ones (named entities, technical jargon, code).
- Bad LoRA targeting where the adapter doesn't cover the tensors actually used for some token classes.
In all three cases, the mean KLD looks fine, sometimes better than competitors, while the model is actually broken on a small but visible fraction of generations.
What to report instead
The cleanest comparison reports a small set of statistics together (the numbers here are illustrative):
Variant A: median=0.001 p90=0.05 p99=2.0 spike_rate=0.012
Variant B: median=0.010 p90=0.02 p99=0.05 spike_rate=0.000
Where spike_rate = fraction of tokens with KLD > 0.5 (an arbitrary "this is a catastrophe" threshold; pick yours).
For a single number, report the median, with the caveat that it describes the typical token and says nothing about the tail. For a model-vs-model comparison, the variant with lower median AND lower p99 is unambiguously better. If they trade off, you have a real engineering decision.
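The decision rule above can be sketched in a few lines. The function name `compare` and its return strings are hypothetical, and ties are deliberately treated as a trade-off rather than a win.

```python
import numpy as np

def compare(kld_a: np.ndarray, kld_b: np.ndarray) -> str:
    """Apply the 'lower median AND lower p99' rule to two per-token KLD arrays."""
    med_a, med_b = np.median(kld_a), np.median(kld_b)
    p99_a, p99_b = np.quantile(kld_a, 0.99), np.quantile(kld_b, 0.99)
    if med_a < med_b and p99_a < p99_b:
        return "A better on both"
    if med_b < med_a and p99_b < p99_a:
        return "B better on both"
    # The statistics disagree (or tie): this needs a human decision.
    return "trade-off"

# Toy arrays (assumption): A is better on the typical token but has a heavy tail.
a = np.array([0.001] * 98 + [2.0] * 2)
b = np.full(100, 0.01)
print(compare(a, b))
```

Returning an explicit "trade-off" label instead of forcing a winner keeps the engineering decision visible in whatever report consumes this.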
The same effect in perplexity
Perplexity has the same problem: it's exp(mean(log_loss)), and the log-loss distribution is even more heavy-tailed than KLD. We've covered this previously in Mean Perplexity Is Lying to You; the short version is that the same tail-domination logic applies, and the same fix (report median log-loss along with the mean) helps.
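A toy illustration of that definition, assuming per-token log-loss in nats; the array below is made up to show the effect, and the helper names are hypothetical.

```python
import numpy as np

def perplexity(log_loss: np.ndarray) -> float:
    """Standard perplexity: exp of the mean per-token log-loss."""
    return float(np.exp(np.mean(log_loss)))

def median_log_loss(log_loss: np.ndarray) -> float:
    """Tail-resistant companion statistic."""
    return float(np.median(log_loss))

# 99 well-predicted tokens and one badly missed token (illustrative values).
nll = np.array([0.1] * 99 + [10.0])
print("perplexity:", perplexity(nll))
print("exp(median log-loss):", float(np.exp(median_log_loss(nll))))
```

In this toy array a single token out of 100 roughly doubles the mean log-loss, while the median stays at the typical value.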
What it looks like in practice
In our own work, the median-not-mean rule is the single biggest source of "wait, did we just rank these wrong?" moments when reading new comparison studies. Whenever a paper reports "our quantization beats X on mean KLD by 0.001", check whether the median still moves in the same direction. About a third of the time it doesn't: the win is purely in the tail and reflects "our spike rate is lower" rather than "our typical-token quality is higher."
Both can be valuable findings. They're also very different findings, and the paper should say which it is.
The minimum reporting standard we recommend
For any LLM-vs-LLM comparison on a corpus-aggregated quality metric (KLD, log-loss, perplexity, cross-entropy), report:
- Median: the tail-resistant primary number.
- Mean: for backward compatibility, since many older papers report only the mean.
- p90 and p99: characterize the tail explicitly.
- Spike rate at a fixed threshold: an explicit catastrophe count.
It's five numbers instead of one. It costs zero extra compute (you computed the per-token values anyway). And it makes the comparison interpretable for someone reading the paper later.
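The statistics listed above fit in one small helper. This is a sketch, not a standard API: the name `summarize` and the default threshold of 0.5 (the arbitrary catastrophe threshold used earlier) are assumptions you should adapt.

```python
import numpy as np

def summarize(values, spike_threshold: float = 0.5) -> dict:
    """Median, mean, p90, p99 and spike rate for a per-token metric array."""
    v = np.asarray(values, dtype=np.float64)
    p90, p99 = np.quantile(v, [0.90, 0.99])
    return {
        "median": float(np.median(v)),
        "mean": float(v.mean()),
        "p90": float(p90),
        "p99": float(p99),
        # Fraction of tokens strictly above the threshold.
        "spike_rate": float((v > spike_threshold).mean()),
    }

# Toy usage on a tiny made-up array.
stats = summarize([0.0, 0.0, 0.0, 1.0])
print(" ".join(f"{k}={val:.3f}" for k, val in stats.items()))
```

The same function applies unchanged to per-token log-loss, provided the spike threshold is chosen for that metric's scale.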
Source: measured on Qwen3.6-27B and Gemma-4-31B with multiple quantized variants. Pattern holds across all models we've tested.
Read more: Mean Perplexity Is Lying to You, GPQA-Diamond's 4 pp Noise Floor.