
The Death of Uniform Quantization: Why Treating All Parameters Equally Is a Fundamental Mistake

February 2026 · Black Sheep AI Research

We analysed 2,347 tensors across a 400-billion parameter model and measured how much each one changes under 4-bit quantization. The variation is staggering — some tensors lose 50× more information than others at the same bit-width. Uniform quantization ignores this entirely. It shouldn't.

The Comfortable Lie of Uniform Quantization

The standard approach to model compression is seductively simple: take every weight in the model, reduce it from 16-bit to 4-bit, and hope for the best. This is uniform quantization, and it has become the default for deploying large language models on consumer hardware.

Tools like llama.cpp and mlx_lm quantize entire models to formats such as GGUF with a single command. Pick your bit-width (Q4, Q5, Q8), press enter, and you get a smaller model. It's fast, it's easy, and it mostly works.

But "mostly works" hides a profound inefficiency. Uniform quantization makes an implicit assumption that is demonstrably false: that all parameters are equally important.
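For concreteness, here is a minimal NumPy sketch of what group-wise uniform quantization does — an illustrative round-trip, not the exact kernel llama.cpp or MLX uses. Every group of 128 weights gets one scale and offset, with no regard for which tensor the weights came from:

```python
import numpy as np

def quantize_uniform(w, bits=4, group_size=128):
    """Asymmetric group-wise uniform quantization: each group of
    `group_size` weights shares a single scale and zero-point."""
    levels = 2 ** bits - 1
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((g - lo) / scale)   # integers in [0, 15] for 4-bit
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return codes * scale + lo

rng = np.random.default_rng(0)
w = rng.normal(size=4096 * 128).astype(np.float32)
codes, scale, lo = quantize_uniform(w)
w_hat = dequantize(codes, scale, lo).ravel()
nrmse = float(np.sqrt(np.mean((w - w_hat) ** 2)) / w.std())
print(f"normalised RMSE at 4-bit: {nrmse:.3f}")
```

For a well-behaved Gaussian tensor this round-trip error is small — the trouble, as the next section shows, is that real tensors are not all well-behaved.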

The Evidence: Not All Parameters Are Created Equal

When we built SWAN, we computed the actual 4-bit reconstruction error for every weight tensor in Qwen3.5-397B — 2,347 tensors, 403.4 billion parameters. The results destroy the uniform assumption.

Sensitivity Varies by 50×

The normalised RMSE under 4-bit quantization ranges from 0.001 to over 0.05 across tensors in the same model. Some tensors barely notice quantization; others are devastated by it. Treating them identically is like performing surgery with a sledgehammer because it works on nails.
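The spread is easy to reproduce in miniature. The sketch below uses synthetic weights (not SWAN's actual tensor dump): it quantizes a Gaussian tensor and an otherwise-identical tensor with a sprinkling of outliers at the same 4-bit group setting. The toy gap is smaller than the 50× spread seen across real tensors, but the direction is the same:

```python
import numpy as np

def nrmse_4bit(w, group_size=128, bits=4):
    """Normalised RMSE of a group-wise uniform 4-bit round-trip."""
    levels = 2 ** bits - 1
    g = w.reshape(-1, group_size)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    g_hat = np.round((g - lo) / scale) * scale + lo
    return float(np.sqrt(np.mean((g - g_hat) ** 2)) / w.std())

rng = np.random.default_rng(0)
n = 2048 * 128
well_behaved = rng.normal(size=n)

# Same Gaussian bulk, but 0.5% of entries scaled into outliers.
# Outliers stretch each group's min/max range, coarsening the step
# size for every other weight in the group.
outlier_heavy = well_behaved.copy()
mask = rng.random(n) < 0.005
outlier_heavy[mask] *= 20.0

e_easy = nrmse_4bit(well_behaved)
e_hard = nrmse_4bit(outlier_heavy)
print(f"gaussian: {e_easy:.3f}, outlier-heavy: {e_hard:.3f}")
```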

Sensitivity Is Predictable

The critical finding from our research is that this sensitivity is not random — it's highly predictable from the mathematical properties of the weights themselves:

| Property | Correlation with Error | What It Means |
| --- | --- | --- |
| Excess Kurtosis | ρ = 0.80 | Outlier-heavy distributions are hard to quantize |
| Output Noise Amplification | ρ = 0.69 | Some layers amplify quantization noise dramatically |
| SVD Spectral Concentration | ρ = 0.40 | Information concentrated in few directions is fragile |
| Reconstruction Error Proxy | ρ = 0.37 | Direct simulation confirms the above signals |

And these four signals are largely independent of each other (max inter-metric |ρ| = 0.38), meaning each captures a genuinely different aspect of why a tensor is hard to quantize. This is the key insight: quantization difficulty is multi-dimensional, and a single metric cannot capture it.
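Two of these signals are cheap to compute directly from the weights, with no calibration data at all. The definitions below are illustrative — SWAN's exact formulations may differ:

```python
import numpy as np

def excess_kurtosis(w):
    """Fourth standardised moment minus 3: ~0 for Gaussian weights,
    large and positive for outlier-heavy distributions."""
    x = w.ravel() - w.mean()
    return float(np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0)

def spectral_concentration(w, k=8):
    """Fraction of squared singular-value mass in the top-k directions;
    close to 1 when a tensor's information lives in a few directions."""
    s = np.linalg.svd(w, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)
gaussian = rng.normal(size=(256, 256))

heavy = gaussian.copy()
heavy[rng.random(heavy.shape) < 0.005] *= 20.0   # inject outliers

# Rank-4 tensor: all its information sits in 4 directions.
low_rank = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 256))

print(excess_kurtosis(gaussian), excess_kurtosis(heavy))
print(spectral_concentration(gaussian), spectral_concentration(low_rank))
```

The Gaussian tensor scores near zero on kurtosis and spreads its spectral mass broadly; the outlier-injected and low-rank tensors light up one metric each — which is exactly why a single score can't rank quantization difficulty on its own.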

Sensitivity Is Structurally Consistent

Across three radically different model architectures — a dense 8B, a 128-expert 400B MoE, and a 512-expert 400B MoE — SWAN finds the same sensitivity patterns.

These aren't random fluctuations. They're structural properties of how transformer models encode information. Uniform quantization ignores all of this structure.

The Cost of Uniformity

Uniform quantization fails in two directions simultaneously:

Over-compresses sensitive tensors

The 4–5% of parameters that are genuinely sensitive get forced to 4-bit alongside everything else. These are the attention projections, expert gates, and early-layer weights that carry disproportionate importance for model quality. Compressing them costs quality with no practical size benefit (they're a tiny fraction of total parameters).

Under-compresses insensitive tensors

The vast majority of parameters — particularly middle-layer MLP weights in MoE models — could be compressed to 2-bit with negligible quality loss. At uniform 4-bit, we're spending double the necessary bits on parameters that don't need them, wasting memory and bandwidth.

The result: uniform quantization simultaneously degrades quality (by damaging sensitive tensors) and wastes space (by over-preserving insensitive ones). It's the worst of both worlds.
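The fix is straightforward in principle: rank tensors by sensitivity and spend bits where they matter. A toy allocation policy makes the idea concrete — the thresholds and the stand-in sensitivity scores below are illustrative, not SWAN's actual manifest logic:

```python
import numpy as np

def assign_bits(sensitivity, frac_sensitive=0.05, frac_insensitive=0.60):
    """Toy mixed-precision policy: the most sensitive tensors get 8-bit,
    the least sensitive get 2-bit, and everything else stays at 4-bit."""
    sensitivity = np.asarray(sensitivity)
    n = len(sensitivity)
    order = np.argsort(sensitivity)              # ascending sensitivity
    bits = np.full(n, 4)
    bits[order[: int(n * frac_insensitive)]] = 2  # insensitive -> 2-bit
    n_hi = max(1, int(n * frac_sensitive))
    bits[order[n - n_hi:]] = 8                    # sensitive -> 8-bit
    return bits

rng = np.random.default_rng(0)
scores = rng.lognormal(size=1000)    # stand-in per-tensor sensitivity scores
bits = assign_bits(scores)
print({b: int((bits == b).sum()) for b in (2, 4, 8)})
```

Because the 8-bit bucket is tiny and the 2-bit bucket is huge, the average bits per parameter under such a policy can land below uniform 4-bit while protecting exactly the tensors that uniform quantization damages.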

The SWAN Proof

Our controlled experiments on Qwen3.5-397B demonstrate this directly. At matched group size (128):

| Method | Perplexity | Size |
| --- | --- | --- |
| Uniform 4-bit | 4.298 | 196 GB |
| SWAN Mixed-Precision | 4.283 | 199 GB |

SWAN achieves lower perplexity than uniform 4-bit while using only 3 GB more space. It does this by giving 8-bit precision to just 4.3% of tensors and keeping 95.2% at 4-bit. The extra 3 GB spent on sensitive tensors buys back more quality than uniform 4-bit loses across the entire model.

Put differently: SWAN proves that a few hundred tensors at 8-bit are worth more than a few hundred billion parameters at uniform 4-bit.

Why Mixed-Precision Has Been Hard

If mixed-precision is so clearly better, why has uniform quantization dominated? Three reasons:

  1. Calibration cost. Previous mixed-precision methods (AWQ, GPTQ, SqueezeLLM, LLM-MQ) required expensive calibration runs to determine sensitivity. For a 400B parameter model, this could take hours on a multi-GPU cluster. SWAN eliminates this entirely — 13 minutes, no calibration data, single machine.
  2. Single-metric limitations. Prior data-free approaches (like MXQ) used a single sensitivity metric — typically Frobenius norm of quantization error. Our analysis shows this captures only one dimension of a multi-dimensional problem. Kurtosis alone (ρ = 0.80) predicts more than reconstruction error alone (ρ = 0.37), and combining four metrics captures non-redundant information that no single metric can.
  3. Tooling. Uniform quantization is a single command. Mixed-precision requires per-tensor decisions, manifest files, and framework support. SWAN automates this entirely — the output is a JSON manifest that plugs directly into MLX's quantization pipeline.
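The post doesn't show the manifest schema, but the idea is just a per-tensor map from name to quantization parameters. A hypothetical example — the tensor names and field names here are invented for illustration, not SWAN's actual format:

```python
import json

# Hypothetical per-tensor manifest; names and schema are illustrative.
manifest = {
    "model.layers.0.self_attn.q_proj.weight":  {"bits": 8, "group_size": 128},
    "model.layers.0.self_attn.k_proj.weight":  {"bits": 8, "group_size": 128},
    "model.layers.20.mlp.up_proj.weight":      {"bits": 4, "group_size": 128},
    "model.layers.21.mlp.down_proj.weight":    {"bits": 2, "group_size": 128},
}

# Serialise, then read it back the way a quantization pipeline would,
# dispatching a per-tensor bit-width at load time.
text = json.dumps(manifest, indent=2)
loaded = json.loads(text)
print(loaded["model.layers.0.self_attn.q_proj.weight"]["bits"])
```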

The Future Is Non-Uniform

The trajectory is clear. As models continue to scale — we're already at 400B+ parameters with MoE architectures reaching 512 experts — the variation in parameter importance only increases. MoE models are inherently non-uniform: some experts activate frequently, others rarely; router weights are critical, feed-forward weights are redundant.

Treating this rich internal structure uniformly is not just suboptimal — it's leaving intelligence on the table. Every bit spent on an insensitive parameter is a bit stolen from a sensitive one. Every model compressed uniformly is less capable than it needs to be at its size.

SWAN represents the beginning of a shift: from "how small can we make this model?" to "how intelligent can we make this model at this size?" The answer, it turns out, depends entirely on where you spend your bits.

Uniform quantization had a good run. The data says it's time to move on.

Code and data at github.com/baa-ai/swan-quantization.
