KLD Tail Effects: Why Quality Comparisons Need Median, Not Mean

May 2026 · Black Sheep AI Research

If you measure KL divergence between two language models token-by-token over a corpus and report the mean, you are measuring how often the worse model catastrophically fails, not how good the typical prediction is. Switch to the median and the ranking can invert. Here's why, and what to report instead.

The setup

KL divergence between a reference model (BF16) and a candidate model (quantized, fine-tuned, distilled, anything) is a per-token quantity:

KLD_t = sum_v p_ref(v|context_t) × log( p_ref(v|context_t) / p_cand(v|context_t) )

You can compute it for every token in WikiText-2 (about 66,000 tokens in the standard split), then aggregate across tokens to a single number. The aggregation choice is where things go wrong.
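As a minimal sketch of the per-token quantity defined above, the function below computes KL(p_ref || p_cand) in nats for a single position. It assumes both distributions are already available as probability vectors over the same vocabulary; the `eps` floor and the toy three-word vocabulary are illustrative assumptions, and a real pipeline would more likely work from log-probabilities.

```python
import math

def token_kld(p_ref, p_cand, eps=1e-12):
    """KL(p_ref || p_cand) in nats for one token position.

    p_ref, p_cand: probability vectors over the same vocabulary.
    eps: hypothetical floor on candidate probabilities to avoid log(0).
    """
    if len(p_ref) != len(p_cand):
        raise ValueError("distributions must cover the same vocabulary")
    total = 0.0
    for p, q in zip(p_ref, p_cand):
        if p > 0.0:  # terms with p == 0 contribute nothing to the sum
            total += p * math.log(p / max(q, eps))
    return total

# Toy three-word vocabulary (made-up numbers, for illustration only).
ref = [0.7, 0.2, 0.1]
close = [0.68, 0.21, 0.11]   # candidate that nearly matches the reference
far = [0.1, 0.2, 0.7]        # candidate that puts the mass on the wrong word

kld_close = token_kld(ref, close)
kld_far = token_kld(ref, far)
print(f"close: {kld_close:.4f}  far: {kld_far:.4f}")
```

Running this over every position in the corpus yields the list of per-token values that the rest of the discussion aggregates.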

What the distribution actually looks like

Most teams report mean KLD. We measured the per-token KLD distribution between several quantized variants of a 27B model and the BF16 source on WikiText-2. The shape:

Quantile                    KLD (typical)
50th percentile (median)    ~0.001
75th percentile             ~0.005
90th percentile             ~0.02
95th percentile             ~0.06
99th percentile             ~0.5
99.9th percentile           ~5.0

The distribution is heavy-tailed, spanning 3-4 orders of magnitude between the median and the extreme tail. The median sits near 0.001; the 99.9th percentile is about 5,000× higher.

The mean is dominated by the tail. Specifically: the top 1% of tokens (by KLD) contributes about 80% of the mean. If you compare two models by mean KLD, you are comparing how often each one has a catastrophic prediction, not how the model behaves on the typical token.
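One way to check how much of a mean comes from the tail is to sort the per-token values and compare the sum of the largest 1% against the total. The sketch below does that; the function name is hypothetical, and the sample is a synthetic two-level toy chosen to land near the 80% figure, not the measured data.

```python
def tail_share_of_mean(klds, top_percent=1):
    """Fraction of the sum (and hence of the mean) contributed by the
    largest `top_percent` percent of values."""
    ordered = sorted(klds, reverse=True)
    k = max(1, len(ordered) * top_percent // 100)  # number of tail tokens
    return sum(ordered[:k]) / sum(ordered)

# Synthetic toy sample (an assumption): 9,900 typical tokens and 100 spikes.
sample = [0.001] * 9_900 + [0.4] * 100
share = tail_share_of_mean(sample)
print(f"top 1% of tokens contributes {share:.0%} of the mean")
```

On real per-token KLD values the same function reports how tail-dominated a given mean is before any ranking is drawn from it.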

Why this matters for compression comparisons

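Imagine two quantized variants of the same source model, each described by its KLD on typical tokens and on its worst 1% of tokens:
- Variant A: KLD ≈ 0.0025 on 99% of tokens, 0.5 on the remaining 1%.
- Variant B: KLD ≈ 0.01 on 99% of tokens, 0.05 on the remaining 1%.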

The mean KLD ranking:
- Variant A: 0.99 × 0.0025 + 0.01 × 0.5 ≈ 0.0075
- Variant B: 0.99 × 0.01 + 0.01 × 0.05 ≈ 0.0104

The mean says A is better, yet A is the worse model to use. A spikes on 1% of tokens, so a 5,000-token generation has about 50 places where A can produce gibberish or hallucinated content. B is uniformly slightly worse but never catastrophic.

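The median does not rescue this comparison on its own:
- Variant A: median ≈ 0.001 (the 50 spikes don't move the median)
- Variant B: median ≈ 0.01

The median ranks A first by an even wider margin, because it ignores the spikes entirely; what it reports is typical-token quality, where A really is better. The spike-failure mode only shows up in the tail: A's worst 1% of tokens sit near 0.5, ten times B's 0.05. The mean hides that failure by averaging it into typical-token noise, so the two variants have to be compared on the median and on the tail separately.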
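The arithmetic in this example can be reproduced with a short sketch. It assumes a 5,000-token sample in which each group of tokens is collapsed to a single KLD value, so the computed median for A equals its bulk value (0.0025) rather than the ≈0.001 quoted above. The helper names `mixture` and `worst_percent_mean` are hypothetical.

```python
import statistics

def mixture(bulk_value, spike_value, n_tokens=5000, spike_percent=1):
    """Two-level toy distribution: most tokens at bulk_value, a few spikes."""
    n_spikes = n_tokens * spike_percent // 100
    return [bulk_value] * (n_tokens - n_spikes) + [spike_value] * n_spikes

def worst_percent_mean(klds, percent=1):
    """Average KLD over the worst `percent` percent of tokens."""
    ordered = sorted(klds, reverse=True)
    k = max(1, len(ordered) * percent // 100)
    return sum(ordered[:k]) / k

variant_a = mixture(0.0025, 0.5)   # good typical tokens, large spikes
variant_b = mixture(0.01, 0.05)    # worse typical tokens, mild tail

for name, klds in (("A", variant_a), ("B", variant_b)):
    print(
        f"Variant {name}: mean={statistics.fmean(klds):.4f}  "
        f"median={statistics.median(klds):.4f}  "
        f"worst-1% mean={worst_percent_mean(klds):.3f}"
    )
```

In this toy setup both the mean and the median come out lower for A, while the worst-1% average is ten times higher for A, which is the spike behaviour that only a tail statistic exposes.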

When this happens in practice

We see the spike-failure mode reliably in:

In all three cases, the mean KLD looks fine, sometimes better than competitors, while the model is actually broken on a small but visible fraction of generations.

What to report instead

The cleanest comparison reports a small set of statistics together:

Variant A: median=0.001  p90=0.05  p99=2.0  spike_rate=0.012
Variant B: median=0.010  p90=0.02  p99=0.05 spike_rate=0.000

Where spike_rate = fraction of tokens with KLD > 0.5 (an arbitrary "this is a catastrophe" threshold; pick yours).
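A report line in this format takes only a few lines of standard-library Python to produce. The sketch below assumes a nearest-rank percentile and the 0.5 spike threshold; `percentile`, `kld_summary`, and `format_summary` are hypothetical names, and the token list is a synthetic toy built to reproduce the Variant A line, not measured data.

```python
import statistics

def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]

def kld_summary(klds, spike_threshold=0.5):
    """Median, p90, p99 and spike rate for a list of per-token KLD values."""
    spikes = sum(1 for k in klds if k > spike_threshold)
    return {
        "median": statistics.median(klds),
        "p90": percentile(klds, 90),
        "p99": percentile(klds, 99),
        "spike_rate": spikes / len(klds),
    }

def format_summary(name, s):
    return (f"{name}: median={s['median']:.3f}  p90={s['p90']:.2f}  "
            f"p99={s['p99']:.2f}  spike_rate={s['spike_rate']:.3f}")

# Synthetic 1,000-token sample (an assumption, for illustration only).
toy = [0.001] * 880 + [0.05] * 100 + [0.3] * 8 + [2.0] * 12
summary = kld_summary(toy)
print(format_summary("Variant A", summary))
```

Interpolating percentile definitions, such as the ones most numerical libraries default to, will give slightly different p90 and p99 values on small samples; whichever definition is used should be stated alongside the numbers.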

For a single number, report median. For a model-vs-model comparison, the variant with lower median AND lower p99 is unambiguously better. If they trade off, you have a real engineering decision.
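The decision rule above can be written down directly. This is a sketch under the assumption that each variant is summarized as a dict with `median` and `p99` keys; `compare_variants` is a hypothetical name, and the numbers are the illustrative Variant A and Variant B figures from the report lines.

```python
def compare_variants(a, b):
    """Return 'a' or 'b' when one variant is at least as good on both
    median and p99 and strictly better on one; otherwise 'trade-off'."""
    a_no_worse = a["median"] <= b["median"] and a["p99"] <= b["p99"]
    b_no_worse = b["median"] <= a["median"] and b["p99"] <= a["p99"]
    if a_no_worse and not b_no_worse:
        return "a"
    if b_no_worse and not a_no_worse:
        return "b"
    return "trade-off"  # includes exact ties on both statistics

variant_a = {"median": 0.001, "p99": 2.0}
variant_b = {"median": 0.010, "p99": 0.05}
verdict = compare_variants(variant_a, variant_b)
print(verdict)
```

For these two variants the rule returns a trade-off: A wins on the median and B wins on the tail, so the choice depends on how costly a spike is for the application.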

The same effect in perplexity

Perplexity has the same problem: it is exp(mean(log_loss)), and the log-loss distribution is even more heavy-tailed than KLD. We covered this in Mean Perplexity Is Lying to You; the short version is that the same tail-domination logic applies, and the same fix (report median log-loss alongside the mean) helps.
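To make the exp(mean(log_loss)) point concrete, the sketch below computes perplexity and median log-loss on a toy sequence where a single token has a very large loss. The loss values are made up for illustration and are in nats.

```python
import math
import statistics

def perplexity(log_losses):
    """Perplexity as exp of the mean per-token log-loss (in nats)."""
    return math.exp(statistics.fmean(log_losses))

# Toy sequence (an assumption): 99 ordinary tokens plus one outlier.
typical = [1.0] * 99
with_outlier = typical + [20.0]

ppl_typical = perplexity(typical)
ppl_outlier = perplexity(with_outlier)
median_loss = statistics.median(with_outlier)

print(f"perplexity without outlier: {ppl_typical:.3f}")
print(f"perplexity with outlier:    {ppl_outlier:.3f}")
print(f"median log-loss:            {median_loss:.3f}")
```

One token out of a hundred is enough to move the perplexity noticeably, while the median log-loss stays where the typical tokens are.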

What it looks like in practice

In our own work, the median-not-mean rule is the single biggest source of "wait, did we just rank these wrong?" moments when reading new comparison studies. Whenever a paper reports "our quantization beats X on mean KLD by 0.001", check whether the median moves in the same direction. About a third of the time it doesn't: the win is purely in the tail and reflects "our spike rate is lower" rather than "our typical-token quality is higher."
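The direction check described here is easy to automate when per-token values are available for both methods. The sketch below is one possible form; `ranking_check` and the "ours"/"theirs" labels are hypothetical, and the two token lists are synthetic toys, not results from any paper.

```python
import statistics

def _winner(ours, theirs):
    """Lower is better; returns 'ours', 'theirs', or 'tie'."""
    if ours < theirs:
        return "ours"
    if theirs < ours:
        return "theirs"
    return "tie"

def ranking_check(klds_ours, klds_theirs):
    """Report which method wins on mean and on median, and whether they agree."""
    by_mean = _winner(statistics.fmean(klds_ours), statistics.fmean(klds_theirs))
    by_median = _winner(statistics.median(klds_ours), statistics.median(klds_theirs))
    return {"mean": by_mean, "median": by_median, "agree": by_mean == by_median}

# Synthetic toys (assumptions): "ours" has fewer and smaller spikes,
# "theirs" is slightly better on typical tokens.
ours = [0.004] * 990 + [0.1] * 10
theirs = [0.003] * 990 + [0.5] * 10

result = ranking_check(ours, theirs)
print(result)
```

When `agree` is false, as in this toy case, a mean-KLD win is coming from the tail rather than from typical-token quality, and the write-up should say so.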

Both can be valuable findings. They're also very different findings, and the paper should say which it is.

The minimum reporting standard we recommend

For any LLM-vs-LLM comparison on a corpus-aggregated quality metric (KLD, log-loss, perplexity, cross-entropy), report:

  1. Median: the tail-resistant primary number.
  2. Mean: for backward compatibility, since many older papers report only the mean.
  3. p90 and p99: characterize the tail explicitly.
  4. Spike rate at a fixed threshold: an explicit catastrophe count.

That is five numbers instead of one. It costs essentially no extra compute (you computed the per-token values anyway). And it makes the comparison interpretable for someone reading the paper later.


Source: measured on Qwen3.6-27B and Gemma-4-31B with multiple quantized variants. Pattern holds across all models we've tested.

Read more: Mean Perplexity Is Lying to You, GPQA-Diamond's 4 pp Noise Floor.