Why Your 4-bit Model is Leaving Intelligence on the Table


February 2026 · Black Sheep AI Research

The default quantization strategy for most of the AI industry is uniform 4-bit: take every weight tensor, compress it to 4 bits per parameter, ship it. It's simple, it's fast, and it's fundamentally wrong. The weights in your model are not equally important — not even close — and treating them as if they are is costing you intelligence you've already paid for.

The 50x Problem

When we ran SWAN's four sensitivity metrics across Qwen3.5-397B — a 397-billion parameter Mixture-of-Experts model with 512 experts — we found something striking: the composite sensitivity score across 2,347 weight tensors varies by a factor of fifty.

That means the most sensitive tensor in the model is 50 times more important to preserve at high precision than the least sensitive one. And uniform quantization treats them identically.

Sensitivity Distribution (Qwen3.5-397B)

Bottom 5% of tensors (least sensitive) · score < 0.10 → safe at 2-bit
Middle 80% of tensors · score 0.10–0.65 → fine at 4-bit
Next 12% of tensors · score 0.65–0.90 → need 8-bit
Top 3% of tensors (most sensitive) · score ≥ 0.90 → must stay 16-bit

When you apply uniform 4-bit quantization, you're simultaneously over-compressing the 15% of tensors that need higher precision (destroying information the model needs) and under-compressing the 5% that could safely be at 2-bit (wasting space that could be reclaimed). It's a dual failure.
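The thresholds in the distribution above translate directly into a bit-width policy. A minimal sketch (the function name and exact boundary handling are ours, not SWAN's published API):

```python
def assign_bits(score: float) -> int:
    """Map a composite sensitivity score in [0, 1] to a bit-width,
    using the thresholds from the distribution above."""
    if score >= 0.90:
        return 16   # top ~3%: must stay at full precision
    if score >= 0.65:
        return 8    # next ~12%: needs 8-bit
    if score >= 0.10:
        return 4    # middle ~80%: fine at 4-bit
    return 2        # bottom ~5%: safe at 2-bit
```

Uniform 4-bit is the degenerate version of this function that ignores its argument entirely.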

Which Tensors Matter Most?

SWAN's analysis reveals clear patterns in which parts of a model are sensitive and why:

Attention layers: The precision bottleneck

Query and key projection matrices in attention layers consistently score among the highest sensitivity across all models tested. This makes intuitive sense: attention is how the model decides what information to focus on. Corrupt the attention weights, and the model doesn't just produce worse outputs — it looks at the wrong information entirely.

Specifically, SWAN finds that Q/K projections have high SVD spectral concentration (information is packed into a few critical singular vectors) and high excess kurtosis (outlier weights that carry disproportionate signal). These are exactly the properties that make tensors vulnerable to quantization noise.
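SWAN's exact formulations aren't reproduced here, but both properties are cheap to estimate from the weights alone. A sketch, assuming "spectral concentration" means the energy fraction carried by the top-k singular values and kurtosis is the usual excess kurtosis:

```python
import numpy as np

def spectral_concentration(w: np.ndarray, k: int = 8) -> float:
    """Fraction of total singular-value energy in the top-k singular values.
    Near 1.0 means a few directions carry almost all of the signal."""
    s = np.linalg.svd(w, compute_uv=False)
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

def excess_kurtosis(w: np.ndarray) -> float:
    """Fourth standardized moment minus 3 (zero for a Gaussian).
    Large positive values indicate heavy tails, i.e. outlier weights."""
    x = w.ravel()
    x = x - x.mean()
    var = (x ** 2).mean()
    return float((x ** 4).mean() / var ** 2 - 3.0)
```

A rank-1 matrix scores near 1.0 on concentration; an i.i.d. Gaussian tensor scores near zero on excess kurtosis, which is why outlier-heavy projections stand out.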

Embedding and output layers: The input/output interface

The token embedding matrix and the final language model head consistently score high on noise amplification. Small perturbations in these layers propagate to every downstream (or upstream) computation. Quantizing them aggressively is like adding static to a microphone — every subsequent processing step amplifies the noise.
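One way to estimate this property from weights alone (the article doesn't specify SWAN's formulation, so this is an illustrative proxy) is to inject a small relative perturbation into the tensor and measure how much the output of a matrix-vector product moves, averaged over random probe inputs:

```python
import numpy as np

def noise_amplification(w: np.ndarray, n_probes: int = 16,
                        eps: float = 1e-3, seed: int = 0) -> float:
    """Ratio of relative output change to relative weight change,
    averaged over random probe vectors. Values well above 1 mark
    tensors that amplify quantization noise."""
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(n_probes):
        x = rng.normal(size=w.shape[1])
        dw = rng.normal(size=w.shape)
        dw *= eps * np.linalg.norm(w) / np.linalg.norm(dw)  # relative size eps
        y, dy = w @ x, dw @ x
        ratios.append((np.linalg.norm(dy) / np.linalg.norm(y)) / eps)
    return float(np.mean(ratios))
```

For a well-conditioned tensor like the identity the ratio sits near 1; tensors whose outputs are small relative to their Frobenius norm score higher, which is the "static in the microphone" behaviour described above.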

FFN down-projections: The hidden bottleneck

In MoE models, the down-projection matrices of the feed-forward network inside each expert are often the most sensitive tensors. These are where the expert's "knowledge" is concentrated — the mapping from the expanded hidden dimension back to the model dimension. SWAN's reconstruction error metric catches this: simulated 4-bit quantization of these tensors produces measurably larger reconstruction error than other FFN components.

Router weights: Small but critical

In MoE architectures, the router (or gate) matrices that decide which experts handle each token are tiny relative to the model size but have outsized impact. A router error doesn't produce a slightly wrong answer — it sends the token to the wrong expert entirely. SWAN consistently flags these for 16-bit protection.

The Proof: Perplexity

Theory is nice. Numbers are better. Here's what happens when you let sensitivity drive bit-width allocation versus forcing uniform 4-bit:

Model | Method | Size | Avg bits | Perplexity
Qwen3.5-397B | SWAN mixed-precision | 199 GB | 4.31 | 4.283
Qwen3.5-397B | Uniform 4-bit RTN | 196 GB | 4.25 | 4.298
Qwen3.5-397B | BF16 (baseline) | 807 GB | 16.00 | ~4.27

Read that carefully. SWAN uses slightly more total bits on average (4.31 vs 4.25) but achieves lower perplexity. The extra bits aren't wasted — they're allocated to the tensors that need them most. And the tensors that don't need 4-bit are compressed further to 2-bit, freeing up the bit budget for protection where it matters.

The 0.015 perplexity improvement might look small, but at this scale it represents a measurable quality difference across every token the model generates. Over a long conversation or document, this compounds into noticeably more coherent, accurate, and nuanced output.

The Four Lenses of Sensitivity

Part of what makes SWAN's approach robust is that it doesn't rely on a single metric. Different tensors are sensitive for different reasons, and a single metric would miss important patterns. SWAN uses four complementary lenses:

SVD Spectral Concentration (w=0.20)

Measures how much of the tensor's "information" is packed into a few top singular values. High concentration means a few dimensions carry most of the signal — quantizing them introduces significant error.

Excess Kurtosis (w=0.45)

Quantifies how "heavy-tailed" the weight distribution is. High kurtosis means outlier weights that carry disproportionate signal. Quantization clips or distorts these outliers, destroying information. This is the most predictive single metric, hence the highest weight.

Output Noise Amplification (w=0.15)

Estimates how much a small perturbation to the weight tensor gets amplified in the output. Some tensors are "noise amplifiers" — even tiny quantization errors cascade through subsequent computation.

Reconstruction Error Proxy (w=0.20)

Directly measures the Frobenius norm difference between the original tensor and its simulated 4-bit version. The most empirical metric — it literally quantizes the tensor and measures the damage.
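This one is simple enough to write down directly. A sketch, assuming round-to-nearest 4-bit quantization with a per-row absmax scale (the grouping scheme is our assumption; the article doesn't specify it):

```python
import numpy as np

def quant_error_4bit(w: np.ndarray) -> float:
    """Relative Frobenius error after simulated 4-bit round-to-nearest
    quantization with one absmax scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    scale = np.where(scale == 0, 1.0, scale)            # guard all-zero rows
    q = np.clip(np.round(w / scale), -8, 7)
    w_hat = q * scale                                   # dequantize
    return float(np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

A tensor whose values already sit on the quantization grid scores zero; heavy-tailed tensors, whose outliers stretch the scale and coarsen the grid for everything else, score high. That is the mechanism linking this metric back to kurtosis.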

The weighted combination (with kurtosis at 0.45 weight) was validated through ablation studies: removing any single metric degrades the correlation with actual perplexity impact. The four metrics are partially correlated but capture different failure modes, giving the composite score robustness that no single metric could achieve.
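Given the four per-tensor metrics, the composite is a plain weighted sum. A sketch, assuming each metric has already been normalized to [0, 1] across the model's tensors (the normalization scheme is not described in the article):

```python
def composite_score(svd_conc: float, kurtosis: float,
                    noise_amp: float, recon_err: float) -> float:
    """Weighted combination of the four normalized sensitivity metrics,
    using the weights quoted above (0.20 / 0.45 / 0.15 / 0.20)."""
    return (0.20 * svd_conc + 0.45 * kurtosis
            + 0.15 * noise_amp + 0.20 * recon_err)
```

Because the weights sum to 1, the composite stays in [0, 1] and can be compared directly against the bit-width thresholds given earlier.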

What This Means in Practice

If you're currently deploying uniform 4-bit quantized models, you're likely experiencing subtle quality degradation without knowing the cause: the model's most sensitive tensors are being quietly over-compressed.

SWAN doesn't fix these problems by using more bits overall. It fixes them by redistributing the same bit budget intelligently. Protect what matters, compress what doesn't.

The Analogy

Imagine compressing a photograph. Uniform quantization is like reducing every pixel to the same colour depth — the sky, the face, the background, all get the same number of colours. Any photographer would tell you this is absurd: the face needs high colour precision, the out-of-focus background can survive with far less.

JPEG understood this decades ago. It applies stronger compression to areas with less detail and preserves areas with high-frequency information. The result: dramatically smaller files that look indistinguishable from the original.

SWAN is doing for neural network weights what JPEG did for images: allocating precision where it matters and reclaiming it where it doesn't. The surprise isn't that this works better than uniform quantization. The surprise is that the AI industry has been deploying the equivalent of uniform, content-blind compression in 2026.

The Path Forward

Mixed-precision quantization isn't new as a concept. The reason it hasn't been widely adopted is that previous approaches required extensive calibration, GPU compute, and model-specific tuning to determine optimal bit-width assignments. The cost of computing "what goes where" was high enough that teams defaulted to "just use 4-bit everywhere."

SWAN eliminates that cost. Four metrics, computed from weights alone, 13 minutes on a CPU, deterministic results. The barrier to intelligent bit-width allocation has dropped to essentially zero.
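The whole pipeline then reduces to one deterministic pass over the model's tensors. A hypothetical driver tying the pieces together (names and signatures are illustrative, not SWAN's actual API):

```python
def plan_bit_widths(tensors: dict, score_fn,
                    thresholds=((0.90, 16), (0.65, 8), (0.10, 4))) -> dict:
    """Score every weight tensor and return a {name: bits} plan.
    Sorted iteration makes the output deterministic."""
    plan = {}
    for name, w in sorted(tensors.items()):
        s = score_fn(w)  # composite sensitivity in [0, 1]
        plan[name] = next((bits for t, bits in thresholds if s >= t), 2)
    return plan
```

No calibration data, no GPU, no gradients: the score function reads weights only, so the plan can be computed once on a CPU and cached.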

If you're quantizing models for production deployment, the question is no longer whether to use mixed precision. It's why you're still using uniform quantization when a better alternative takes less time, costs less compute, and produces measurably better results.

Code and data at github.com/baa-ai/swan-quantization.

Need deep AI expertise to get your models into production?

Black Sheep AI helps organisations move beyond uniform quantization to intelligent, mixed-precision deployments that preserve model quality where it matters. Deep expertise, no vendor lock-in.
