For over a decade, model quantization has required calibration data — representative samples that guide the compression process. SWAN eliminates this requirement entirely. The implications extend far beyond convenience: an entire category of ML infrastructure, data licensing, and pipeline complexity simply disappears.
The Calibration Tax
Every major quantization method in production today — GPTQ, AWQ, SqueezeLLM, QuIP, OmniQuant — requires a calibration dataset. This isn't a minor implementation detail. It's a structural dependency that cascades through the entire model deployment pipeline.
Here's what calibration actually costs you:
The Hidden Cost Chain
This is what we call the calibration tax: a mandatory overhead that every team pays, every time they quantize a model. It adds days to the deployment pipeline, creates legal exposure, requires GPU compute, and introduces a source of non-determinism that makes debugging production issues harder.
What SWAN Does Differently
SWAN computes four sensitivity metrics directly from the weight tensors:
| Metric | What It Measures | Data Needed |
|---|---|---|
| SVD Spectral Concentration | How much information is concentrated in the top singular values | None — pure linear algebra |
| Excess Kurtosis | How heavy the outlier tails are in the weight distribution | None — statistical moment |
| Output Noise Amplification | How much quantization noise gets amplified through the layer | None — random perturbation test |
| Reconstruction Error Proxy | How much the tensor changes under simulated 4-bit quantization | None — simulated round-trip |
Every single metric operates on the weight matrices alone. No forward passes. No activation statistics. No training data. No calibration samples. The model's weights contain enough information about their own sensitivity to drive intelligent bit-width allocation.
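To make the four metrics concrete, here is an illustrative sketch of how each can be computed from a weight matrix alone. This is not SWAN's implementation; the function name, the top-10% rank fraction, and the fixed perturbation seed are assumptions for demonstration:

```python
import numpy as np

def swan_style_metrics(W, rank_frac=0.1, n_bits=4, seed=0):
    """Illustrative data-free sensitivity metrics for one weight matrix."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the analysis deterministic
    W = W.astype(np.float64)

    # 1. SVD spectral concentration: energy share held by the top singular values.
    s = np.linalg.svd(W, compute_uv=False)
    k = max(1, int(len(s) * rank_frac))
    spectral = float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

    # 2. Excess kurtosis: outlier-tail weight relative to a Gaussian (which scores 0).
    z = (W.ravel() - W.mean()) / W.std()
    kurtosis = float(np.mean(z ** 4) - 3.0)

    # 3. Output noise amplification: gain applied to a random input perturbation.
    x = rng.standard_normal(W.shape[1])
    amplification = float(np.linalg.norm(W @ x) / np.linalg.norm(x))

    # 4. Reconstruction error proxy: relative error of a simulated 4-bit round-trip.
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(W).max() / qmax
    W_q = np.clip(np.round(W / scale), -qmax - 1, qmax) * scale
    recon_error = float(np.linalg.norm(W - W_q) / np.linalg.norm(W))

    return {"spectral": spectral, "kurtosis": kurtosis,
            "amplification": amplification, "recon_error": recon_error}
```

Nothing here touches activations or data: two of the metrics are closed-form linear algebra, and the other two use only synthetic perturbations of the weights themselves.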
This isn't a rough approximation. SWAN-quantized Qwen3.5-397B achieves 4.283 perplexity versus 4.298 for uniform 4-bit with calibration. The data-free method doesn't just match calibrated quantization — it beats it.
The Pipeline Before and After
Here's what a typical quantization pipeline looks like with calibration-dependent methods versus SWAN:
Traditional (GPTQ/AWQ)
- Download model weights
- Source calibration dataset
- Verify data licensing
- Prepare & tokenize calibration data
- Load model on GPU cluster
- Run calibration forward passes
- Tune quantization hyperparameters
- Apply quantization
- Validate output quality
- Debug calibration-data-dependent artifacts
- Deploy
6 steps require calibration data · GPU cluster needed · Hours to days
SWAN
- Download model weights
- Run SWAN analysis (CPU)
- Apply quantization
- Validate output quality
- Deploy
0 steps require calibration data · No GPU needed for analysis · 13 minutes
Six steps eliminated. Not automated — eliminated. The infrastructure to store, license, version, and process calibration datasets is no longer needed. The GPU compute for calibration passes is no longer needed. The debugging of calibration-dependent quality variations is no longer needed.
Why This Matters for Production
Deterministic reproducibility
Run SWAN twice on the same model. You get identical results. Every time. No variance from calibration data sampling, no sensitivity to sequence length choices, no dependency on random seeds. This matters enormously for regulated environments where model behaviour must be reproducible and auditable.
Zero data governance burden
In healthcare, finance, and government, using data — any data — triggers governance processes. Even "public" calibration datasets like WikiText or C4 may have terms that conflict with your organisation's data policies. With SWAN, there is no data to govern. The model's weights are the only input, and you already have a licence for those.
Instant model updates
When a new model version drops — Qwen4, Llama 5, whatever comes next — teams using calibration-dependent methods must restart the entire quantization pipeline: re-source appropriate calibration data (the new model may have different training characteristics), re-run calibration passes, and re-validate.
With SWAN: download new weights, run analysis, deploy. Thirteen minutes from download to production-ready quantization. When models are released monthly or faster, this speed difference is the difference between deploying state-of-the-art and always being one version behind.
Cross-domain portability
Calibration data is domain-specific. A model quantized with English text calibration may perform worse on code generation. A model calibrated on general text may degrade on medical terminology. SWAN's data-free approach means the quantization is domain-agnostic. The bit-width allocation reflects the intrinsic mathematical properties of each tensor, not the statistical properties of whatever calibration sample you happened to choose.
The Inverse Scaling Advantage
Here's perhaps the most counterintuitive property of SWAN: as models get larger, calibration-dependent methods get more expensive, while SWAN's cost grows far more slowly and its relative advantage widens.
Scaling Dynamics
Calibration-based (GPTQ/AWQ)
- 8B model: 1 GPU, ~30 minutes
- 70B model: 4 GPUs, ~2 hours
- 400B model: 8 GPUs, ~6+ hours
- More parameters = more memory = more GPUs = more cost
SWAN (data-free)
- 8B model: CPU, ~2 minutes
- 70B model: CPU, ~5 minutes
- 400B model: CPU, ~13 minutes
- Embarrassingly parallel across shards
The 400B model case is where this becomes dramatic. Running GPTQ calibration on Qwen3.5-397B requires loading the entire model into GPU memory for forward passes. That's at minimum 4–8 H100 GPUs costing $25,000+ each, running for hours. SWAN analyses the same model's safetensor shards on a single CPU in 13 minutes, processing each shard independently.
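The "embarrassingly parallel" claim is easy to sketch: each shard can be scored with no knowledge of any other shard, so a plain worker pool suffices. The helper below is illustrative only — in-memory dicts stand in for loaded safetensor shards, and a 4-bit round-trip error stands in for the full metric set:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def analyse_shard(shard):
    """Score every tensor in one shard; no other shard (and no GPU) is needed."""
    scores = {}
    for name, W in shard.items():
        qmax = 7                              # signed 4-bit: 2**(4-1) - 1
        scale = np.abs(W).max() / qmax
        W_q = np.clip(np.round(W / scale), -qmax - 1, qmax) * scale
        scores[name] = float(np.linalg.norm(W - W_q) / np.linalg.norm(W))
    return scores

def analyse_model(shards, workers=4):
    """Fan the per-shard analysis out across a worker pool and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        merged = {}
        for result in pool.map(analyse_shard, shards):
            merged.update(result)
    return merged
```

Because shards never interact, throughput scales with worker count and the peak memory footprint is one shard per worker, not the whole model.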
As models grow to 1 trillion parameters and beyond — and they will — calibration-based methods will require increasingly expensive GPU clusters just for the quantization step. SWAN will process them on whatever hardware can read files and do matrix math.
What the Industry Should Be Asking
If the model's weights contain enough information to drive intelligent bit-width allocation without any external data, why were we ever using calibration data in the first place?
The honest answer: because no one had demonstrated a sufficiently rigorous data-free alternative. The four-metric approach in SWAN — combining spectral analysis, distributional statistics, noise propagation, and simulated quantization error — captures enough complementary information about tensor sensitivity to make calibration data redundant. It took the right combination of metrics, not a fundamentally new kind of mathematics.
This suggests that calibration-based quantization was always solving two problems simultaneously: (1) understanding which parts of the model are sensitive, and (2) computing the optimal quantization parameters. Problem (1) can be solved from weights alone. Problem (2) — the actual rounding and scaling — is mechanical and doesn't need calibration either. The two problems were conflated, and the entire field assumed both required data.
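Problem (2) really is mechanical. A minimal symmetric per-tensor scheme — an assumption for illustration, not necessarily SWAN's exact scaling choice — is nothing more than scale, round, clip:

```python
import numpy as np

def quantize_symmetric(W, n_bits):
    """Mechanical half of quantization: pick a scale, round, clip. No data needed."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = float(np.abs(W).max()) / qmax     # one scale per tensor
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from integer codes and the scale."""
    return q.astype(np.float32) * scale
```

The sensitivity analysis (problem 1) decides which tensors get which `n_bits`; once that decision is made, the arithmetic above needs no calibration input at all.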
SWAN proves they don't.
The Broader Implication
Model compression is following the same arc as many technologies: an initial phase where external resources (calibration data, fine-tuning data, human feedback) are assumed to be essential, followed by the discovery that sufficiently clever analysis of the artefact itself renders those resources unnecessary.
If quantization can be data-free, what else can be? Pruning decisions based on weight statistics rather than gradient flow? Architecture search based on layer geometry rather than training experiments? The principle that a model's weights encode enough information about their own importance is a deeper insight than quantization alone.
SWAN is a proof of concept for a broader thesis: models know more about themselves than we've been giving them credit for. We just needed to ask the right questions — in the right mathematical language.
Code and data at github.com/baa-ai/swan-quantization.
Need deep AI expertise to get your models into production?
Black Sheep AI helps organisations eliminate pipeline complexity and deploy quantized models faster — from analysis to production in hours, not weeks. Deep expertise, no vendor lock-in.
Talk to Our Team