SWAN: Data-Free Mixed-Precision Quantization via Multi-Metric Sensitivity Analysis

February 2026 · Black Sheep AI Research

We present SWAN (Statistical Weight Analysis for N-bit allocation) — a data-free, per-tensor mixed-precision quantization method that analyses 400B+ parameter models in under 13 minutes on commodity hardware, with no calibration data, no gradients, and no fine-tuning.

The Calibration Data Problem

Post-training quantization (PTQ) has become the primary means of deploying large language models on consumer hardware. Methods like GPTQ, AWQ, and SqueezeLLM achieve remarkable compression, but they share a common dependency: a representative calibration dataset.

This dependency creates practical problems that compound as models scale: representative data must be sourced and curated, every sensitivity measurement costs forward passes through the full model, and the result is biased toward whatever domain the calibration set happens to cover.

SWAN eliminates the dependency entirely. Instead of measuring sensitivity through forward passes, it computes four lightweight, data-free metrics directly on each weight tensor.

Four Complementary Sensitivity Metrics

Given a weight tensor W ∈ ℝ^{m×n}, SWAN computes four scores, each normalised to [0, 1] and each capturing a different aspect of quantization sensitivity.

1. SVD Spectral Concentration

Measures how concentrated the tensor's information is in its top singular values. Using randomised SVD (rank k=256 for efficiency), SWAN computes the fraction of total spectral energy in the top 10% of singular values:

s_svd = clip((c − 0.1) / 0.8, 0, 1),   where c = Σ_{top 10%} σᵢ² / Σ_{all} σᵢ²

High concentration means a few directions carry most of the information. Quantization is more likely to corrupt these critical components.
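As a concrete sketch, the concentration score can be computed with a plain (non-randomised) SVD for clarity; the function name `svd_concentration_score` and the `top_frac` parameter are our naming for illustration, not SWAN's API:

```python
import numpy as np

def svd_concentration_score(W: np.ndarray, top_frac: float = 0.10) -> float:
    """Sketch of the SVD spectral-concentration metric.

    Uses a full SVD for clarity; SWAN uses randomised SVD (rank k=256)
    for efficiency on large tensors.
    """
    s = np.linalg.svd(W, compute_uv=False)       # singular values, descending
    energy = s ** 2
    k = max(1, int(np.ceil(top_frac * len(s))))  # top 10% of singular values
    c = energy[:k].sum() / energy.sum()          # fraction of spectral energy
    return float(np.clip((c - 0.1) / 0.8, 0.0, 1.0))
```

A near-low-rank tensor (most energy in one direction) saturates the score at 1.0, while a well-spread random Gaussian tensor lands near the bottom of the range.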

2. Excess Kurtosis

Quantifies the "tailedness" of the weight distribution. High kurtosis means outliers force the quantization grid to accommodate a wider range, degrading precision for the majority of values:

s_kurt = clip(κ / 10, 0, 1),   where κ = (1/N) Σᵢ ((wᵢ − w̄) / σ)⁴ − 3

This proved to be the single strongest predictor of quantization difficulty, with a Spearman correlation of ρ = 0.80 against actual reconstruction error across 2,347 tensors.
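The excess-kurtosis score is a few lines of NumPy; `kurtosis_score` is our illustrative naming:

```python
import numpy as np

def kurtosis_score(W: np.ndarray) -> float:
    """Sketch of the excess-kurtosis sensitivity metric."""
    w = W.ravel().astype(np.float64)
    z = (w - w.mean()) / w.std()
    kappa = np.mean(z ** 4) - 3.0        # excess kurtosis: 0 for a Gaussian
    return float(np.clip(kappa / 10.0, 0.0, 1.0))
```

A Gaussian weight tensor scores near 0, while a heavy-tailed Laplace tensor (excess kurtosis 3) scores around 0.3, reflecting the outliers that stretch the quantization grid.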

3. Output Noise Amplification

Estimates how quantization noise propagates through the linear transformation — without any real input data. Using 32 random unit-norm probe vectors, SWAN measures how much simulated quantization noise gets amplified:

s_out = clip((log10(δ) + 2) / 3, 0, 1),   where δ is the mean noise-amplification factor measured over the 32 probes

The log-scale normalisation (a v2 improvement over linear normalisation) prevents saturation on smaller models and maps the wide range of amplification values into [0, 1] without loss of discriminating power.
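The post does not spell out how δ is defined, so the sketch below makes two labelled assumptions: δ is taken as the mean relative output perturbation ‖ΔW·x‖ / ‖W·x‖ over random unit-norm probes, and the quantization noise ΔW is simulated as uniform noise at a 4-bit step size. The function name is ours:

```python
import numpy as np

def output_noise_score(W: np.ndarray, n_probes: int = 32, seed: int = 0) -> float:
    """Sketch of the output-noise-amplification metric.

    ASSUMPTION: delta = mean ||dW @ x|| / ||W @ x|| over unit-norm probes x,
    with dW drawn as uniform noise at the 4-bit quantization step size.
    """
    rng = np.random.default_rng(seed)
    m, n = W.shape
    step = (W.max() - W.min()) / 15.0                      # 4-bit grid: 2^4 - 1 steps
    dW = rng.uniform(-step / 2, step / 2, size=W.shape)    # simulated RTN noise
    ratios = []
    for _ in range(n_probes):
        x = rng.standard_normal(n)
        x /= np.linalg.norm(x)                             # unit-norm probe vector
        ratios.append(np.linalg.norm(dW @ x) / (np.linalg.norm(W @ x) + 1e-12))
    delta = float(np.mean(ratios))
    return float(np.clip((np.log10(delta) + 2) / 3, 0.0, 1.0))
```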

4. Reconstruction Error Proxy

The most direct measure: simulate 4-bit group-wise round-to-nearest quantization (group size 128), then measure the normalised RMSE between original and dequantized weights:

s_err = clip((NRMSE − 0.005) / 0.045, 0, 1)

Unlike the other metrics, this simulates the actual quantization operation rather than using a proxy. It was introduced in v2 to replace a cross-layer positional heuristic that, as our correlation analysis revealed, was actually counterproductive (ρ = −0.47).
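The group-wise round-to-nearest simulation is straightforward to sketch. One labelled assumption: the post does not specify the NRMSE normaliser, so here the RMSE is divided by the tensor's full value range; the function names are ours:

```python
import numpy as np

def rtn_group_quantize(W: np.ndarray, bits: int = 4, group_size: int = 128) -> np.ndarray:
    """Group-wise asymmetric round-to-nearest quantize + dequantize along rows."""
    Wq = np.empty_like(W, dtype=np.float64)
    levels = 2 ** bits - 1
    for g in range(0, W.shape[1], group_size):
        blk = W[:, g:g + group_size].astype(np.float64)
        lo = blk.min(axis=1, keepdims=True)
        hi = blk.max(axis=1, keepdims=True)
        scale = np.maximum(hi - lo, 1e-12) / levels   # per-group step size
        q = np.round((blk - lo) / scale)              # round to nearest level
        Wq[:, g:g + group_size] = q * scale + lo      # dequantize back
    return Wq

def reconstruction_error_score(W: np.ndarray) -> float:
    """Sketch of the reconstruction-error proxy (RMSE normalised by range)."""
    Wq = rtn_group_quantize(W)
    nrmse = np.sqrt(np.mean((W - Wq) ** 2)) / (W.max() - W.min() + 1e-12)
    return float(np.clip((nrmse - 0.005) / 0.045, 0.0, 1.0))
```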

Composite Score and Bit Allocation

The four scores are combined into a weighted composite:

S = 0.20 · s_svd + 0.45 · s_kurt + 0.15 · s_out + 0.20 · s_err

Kurtosis receives the highest weight based on its empirically strongest correlation with actual quantization error. Bit-width is then allocated by threshold:

| Composite Score | Bit-Width | Rationale |
|---|---|---|
| S ≥ 0.90 or protected | 16-bit | Highly sensitive — embeddings, norms, routers |
| S ≥ 0.65 | 8-bit | Moderately sensitive — attention projections |
| S ≤ 0.10 | 2-bit | Extremely insensitive — maximum compression |
| Otherwise | 4-bit | Standard precision for the majority of tensors |
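Taken together, the composite weighting and the threshold table reduce to a few lines (function names are ours for illustration):

```python
def composite_score(s_svd: float, s_kurt: float, s_out: float, s_err: float) -> float:
    """Weighted composite: kurtosis dominates per its strongest correlation."""
    return 0.20 * s_svd + 0.45 * s_kurt + 0.15 * s_out + 0.20 * s_err

def allocate_bits(score: float, protected: bool = False) -> int:
    """Threshold-based bit allocation; 'protected' covers embeddings/norms/routers."""
    if protected or score >= 0.90:
        return 16
    if score >= 0.65:
        return 8
    if score <= 0.10:
        return 2
    return 4
```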

Empirical Validation

We validated SWAN across three model architectures spanning 8B to 400B+ parameters — including dense transformers and MoE models with up to 512 experts.

Metric Correlation with Quantization Error

On Qwen3.5-397B (2,347 tensors), all four metrics show highly significant correlation with actual 4-bit reconstruction error (p < 0.001):

| Metric | Spearman ρ | p-value |
|---|---|---|
| Kurtosis | 0.796 | < 0.001 |
| Output Sensitivity | 0.694 | < 0.001 |
| SVD Concentration | 0.399 | < 0.001 |
| Reconstruction Error | 0.374 | < 0.001 |

Critically, the maximum inter-metric correlation is |ρ| = 0.38, confirming that the four metrics capture genuinely distinct aspects of tensor sensitivity. This non-redundancy justifies the multi-metric composite approach — each metric adds unique information.

Perplexity Results

In controlled perplexity evaluation at matched group sizes on Qwen3.5-397B:

| Method | Group Size | Size (GB) | Avg Bits | Perplexity ↓ |
|---|---|---|---|---|
| SWAN v2 | 128 | 199.1 | 4.31 | 4.283 |
| Uniform 4-bit RTN | 128 | 196.0 | 4.25 | 4.298 |

SWAN outperforms uniform 4-bit quantization (4.283 vs 4.298 PPL) by selectively allocating 8-bit precision to the 4.3% of tensors identified as most sensitive — primarily shared expert gates, MTP layers, and linear attention projections — while keeping 95.2% of parameters at 4-bit.

Academic Benchmarks

At 4.31 average bits (199 GB), SWAN-quantized Qwen3.5-397B achieves:

| Benchmark | Score | Notes |
|---|---|---|
| MMLU-Pro (0-shot) | 77.1% | With native thinking enabled |
| ARC-Challenge (0-shot) | 96.0% | Near-perfect science reasoning |
| GSM8K (0-shot CoT) | 88.7% | Mathematical reasoning |
| HumanEval (pass@1) | 78.7% | Code generation |

The 96.0% ARC-Challenge score is particularly noteworthy — near-perfect science reasoning despite 95.2% of parameters being at 4-bit precision.

Consistent Patterns Across Architectures

Despite dramatic differences in model size (8B–400B+) and architecture (dense vs MoE), SWAN discovers remarkably consistent patterns across all three models tested: the same classes of tensors (embeddings, norms, routers, shared expert gates) surface as most sensitive in each.

Evolution from v1 to v2

SWAN's development was driven by empirical validation. The initial v1 metrics used linear output sensitivity normalisation and a U-shaped cross-layer positional heuristic. Rigorous correlation analysis on 2,347 tensors revealed three problems:

  1. Output sensitivity saturated at 1.0 on smaller models using linear normalisation.
  2. Cross-layer position showed negative correlation with actual reconstruction error (ρ = −0.47), counterproductively assigning more bits to less sensitive tensors.
  3. Kurtosis, the strongest predictor (ρ = 0.80), had the lowest weight.

These data-driven insights produced v2: log-scale output sensitivity, a reconstruction error proxy replacing position, kurtosis-dominant weighting, and tighter thresholds. The result: v2 achieves the same perplexity as v1 while using ~15% fewer bits (4.31 vs 5.06 avg), confirming that v1 was wasting ~0.8 bits/param on tensors that did not benefit from extra precision.

Key Insight: Group Size Dominance

Our controlled experiments reveal an important finding for the field: quantization group size is the dominant factor for perplexity, more impactful than per-tensor bit allocation. Halving the group size from 128 to 64 reduces perplexity by ~0.23 for both SWAN and uniform quantization, while SWAN's selective 8-bit allocation provides a ~0.015 improvement at matched group sizes. Future mixed-precision methods should consider group size optimisation alongside bit-width allocation.
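The direction of this effect is easy to demonstrate with the group-wise RTN simulation on a synthetic heavy-tailed tensor; this is an illustrative sketch of why finer groups help (they isolate outliers, shrinking each group's grid step), not a reproduction of the paper's perplexity numbers:

```python
import numpy as np

def rtn_error(W: np.ndarray, group_size: int, bits: int = 4) -> float:
    """RMS reconstruction error of group-wise round-to-nearest quantization."""
    levels = 2 ** bits - 1
    sq_err = 0.0
    for g in range(0, W.shape[1], group_size):
        blk = W[:, g:g + group_size]
        lo = blk.min(axis=1, keepdims=True)
        scale = np.maximum(blk.max(axis=1, keepdims=True) - lo, 1e-12) / levels
        deq = np.round((blk - lo) / scale) * scale + lo
        sq_err += float(np.sum((blk - deq) ** 2))
    return (sq_err / W.size) ** 0.5

rng = np.random.default_rng(0)
W = rng.standard_t(df=4, size=(256, 512))   # heavy-tailed synthetic "weights"
# Halving the group size shrinks each group's value range, and with it the
# grid step, so the error drops regardless of per-tensor bit allocation.
err_128, err_64 = rtn_error(W, 128), rtn_error(W, 64)
```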

Future Directions

Code and data are available at github.com/baa-ai/swan-quantization.

Need deep AI expertise to get your models into production?

Black Sheep AI brings deep expertise in model quantization, deployment architecture, and production AI systems. We help enterprises bridge the gap between AI research and real-world production — from model optimisation to infrastructure design.
