SWAN: Data-Free Mixed-Precision Quantization via Multi-Metric Sensitivity Analysis

We present SWAN (Statistical Weight Analysis for N-bit allocation) — a data-free, per-tensor mixed-precision quantization method that analyses 400B+ parameter models in under 13 minutes on commodity hardware, with no calibration data, no gradients, and no fine-tuning.

The Calibration Data Problem

Post-training quantization (PTQ) has become the primary means of deploying large language models on consumer hardware. Methods like GPTQ, AWQ, and SqueezeLLM achieve remarkable compression, but they share a common dependency: a representative calibration dataset.

This dependency creates three practical problems that compound as models scale:

Data availability. Calibration data may not exist for proprietary or domain-specific models. Licensing restrictions may prevent redistribution of suitable calibration sets.
Distribution mismatch. The calibration distribution may not generalise to deployment domains. A model calibrated on English Wikipedia performs differently on medical text.
Compute cost. Running calibration on a 400B+ parameter Mixture-of-Experts (MoE) model with 512 experts requires hundreds of gigabytes of memory and hours of compute — prohibitive for most practitioners.

SWAN eliminates all three problems. Instead of measuring sensitivity through forward passes, it computes four lightweight, data-free metrics directly on each weight tensor.

Four Complementary Sensitivity Metrics

Given a weight tensor W ∈ ℝ^m×n, SWAN computes four scores, each normalised to [0, 1] and each capturing a different aspect of quantization sensitivity.

1. SVD Spectral Concentration

Measures how concentrated the tensor's information is in its top singular values. Using randomised SVD (rank k=256 for efficiency), SWAN computes the fraction of total spectral energy in the top 10% of singular values:

s_svd = clip((c − 0.1) / 0.8, 0, 1) where c = Σ σ_i² (top 10%) / Σ σ_i² (all)

High concentration means a few directions carry most of the information. Quantization is more likely to corrupt these critical components.

2. Excess Kurtosis

Quantifies the "tailedness" of the weight distribution. High kurtosis means outliers force the quantization grid to accommodate a wider range, degrading precision for the majority of values:

s_kurt = clip(κ / 10, 0, 1) where κ = (1/N) Σ ((w_i − w̄) / σ)⁴ − 3

This proved to be the single strongest predictor of quantization difficulty, with a Spearman correlation of ρ = 0.80 against actual reconstruction error across 2,347 tensors.

3. Output Noise Amplification

Estimates how quantization noise propagates through the linear transformation — without any real input data. Using 32 random unit-norm probe vectors, SWAN measures how much simulated quantization noise gets amplified:

s_out = clip((log₁₀(δ) + 2) / 3, 0, 1)

The log-scale normalisation (a v2 improvement over linear normalisation) prevents saturation on smaller models and maps the wide range of amplification values into [0, 1] without loss of discriminating power.

4. Reconstruction Error Proxy

The most direct measure: simulate 4-bit group-wise round-to-nearest quantization (group size 128), then measure the normalised RMSE between original and dequantized weights:

s_err = clip((NRMSE − 0.005) / 0.045, 0, 1)

Unlike the other metrics, this simulates the actual quantization operation rather than using a proxy. It was introduced in v2 to replace a cross-layer positional heuristic that, as our correlation analysis revealed, was actually counterproductive (ρ = −0.47).

Composite Score and Bit Allocation

The four scores are combined into a weighted composite:

S = 0.20 · s_svd + 0.45 · s_kurt + 0.15 · s_out + 0.20 · s_err

Kurtosis receives the highest weight based on its empirically strongest correlation with actual quantization error. Bit-width is then allocated by threshold:

Composite Score	Bit-Width	Rationale
S ≥ 0.90 or protected	16-bit	Highly sensitive — embeddings, norms, routers
S ≥ 0.65	8-bit	Moderately sensitive — attention projections
S ≤ 0.10	2-bit	Extremely insensitive — maximum compression
Otherwise	4-bit	Standard precision for the majority of tensors

Empirical Validation

We validated SWAN across three model architectures spanning 8B to 400B+ parameters — including dense transformers and MoE models with up to 512 experts.

Metric Correlation with Quantization Error

On Qwen3.5-397B (2,347 tensors), all four metrics show highly significant correlation with actual 4-bit reconstruction error (p < 0.001):

Metric	Spearman ρ	p-value
Kurtosis	0.796	< 0.001
Output Sensitivity	0.694	< 0.001
SVD Concentration	0.399	< 0.001
Reconstruction Error	0.374	< 0.001

Critically, the maximum inter-metric correlation is |ρ| = 0.38, confirming that the four metrics capture genuinely distinct aspects of tensor sensitivity. This non-redundancy justifies the multi-metric composite approach — each metric adds unique information.

Perplexity Results

In controlled perplexity evaluation at matched group sizes on Qwen3.5-397B:

Method	Group Size	Size (GB)	Avg Bits	Perplexity ↓
SWAN v2	128	199.1	4.31	4.283
Uniform 4-bit RTN	128	196.0	4.25	4.298

SWAN outperforms uniform 4-bit quantization (4.283 vs 4.298 PPL) by selectively allocating 8-bit precision to the 4.3% of tensors identified as most sensitive — primarily shared expert gates, MTP layers, and linear attention projections — while keeping 95.2% of parameters at 4-bit.

Academic Benchmarks

At 4.31 average bits (199 GB), SWAN-quantized Qwen3.5-397B achieves:

Benchmark	Score	Notes
MMLU-Pro (0-shot)	77.1%	With native thinking enabled
ARC-Challenge (0-shot)	96.0%	Near-perfect science reasoning
GSM8K (0-shot CoT)	88.7%	Mathematical reasoning
HumanEval (pass@1)	78.7%	Code generation

The 96.0% ARC-Challenge score is particularly noteworthy — near-perfect science reasoning despite 95.2% of parameters being at 4-bit precision.

Consistent Patterns Across Architectures

Despite dramatic differences in model size (8B–400B+) and architecture (dense vs MoE), SWAN discovers remarkably consistent patterns across all three models tested:

Attention receives more precision. Attention tensors receive 1.6–2.5 more bits than MLP/FFN tensors on average, reflecting their higher sensitivity to quantization noise.
MoE experts compress aggressively. Expert feed-forward weights are overwhelmingly assigned 4-bit (74–82% of parameters), confirming that the sparse, redundant nature of MoE layers tolerates aggressive quantization.
U-shaped layer pattern. Early (first 25%) and late (last 25%) layers consistently receive higher precision than middle layers.
Smaller models are more sensitive. The dense 8B model retains 15.2% of parameters at 16-bit, while MoE models keep only 0.5–0.7% — reflecting the inherent redundancy of MoE architectures.

Evolution from v1 to v2

SWAN's development was driven by empirical validation. The initial v1 metrics used linear output sensitivity normalisation and a U-shaped cross-layer positional heuristic. Rigorous correlation analysis on 2,347 tensors revealed three problems:

Output sensitivity saturated at 1.0 on smaller models using linear normalisation.
Cross-layer position showed negative correlation with actual reconstruction error (ρ = −0.47), counterproductively assigning more bits to less sensitive tensors.
Kurtosis, the strongest predictor (ρ = 0.80), had the lowest weight.

These data-driven insights produced v2: log-scale output sensitivity, reconstruction error proxy replacing position, kurtosis-dominant weighting, and tighter thresholds. The result: v2 achieves the same perplexity as v1 while using 16% fewer bits (4.31 vs 5.06 avg), confirming that v1 was wasting ~0.8 bits/param on tensors that did not benefit from extra precision.

Key Insight: Group Size Dominance

Our controlled experiments reveal an important finding for the field: quantization group size is the dominant factor for perplexity, more impactful than per-tensor bit allocation. Halving the group size from 128 to 64 reduces perplexity by ~0.23 for both SWAN and uniform quantization, while SWAN's selective 8-bit allocation provides a ~0.015 improvement at matched group sizes. Future mixed-precision methods should consider group size optimisation alongside bit-width allocation.

Future Directions

Adaptive normalisation. Model-scale-adaptive normalisation ranges to prevent metric saturation across different model sizes.
Joint optimisation. Jointly optimising group size and bit-width allocation rather than treating them independently.
Coarse-to-fine pipelines. Using SWAN's sensitivity map as a starting point for calibration-based fine-tuning — the best of both worlds.
Activation quantization. Extending the sensitivity analysis framework to activation tensors.

Code and data are available at github.com/baa-ai/swan-quantization.

Need deep AI expertise to get your models into production?

Black Sheep AI brings deep expertise in model quantization, deployment architecture, and production AI systems. We help enterprises bridge the gap between AI research and real-world production — from model optimisation to infrastructure design.

Talk to Our Team

← Previous: Why Collapse Tests Are Insufficient Next: SWAN on Apple Silicon →

← Back to all articles