We ran SWAN across four production models — two dense, two Mixture-of-Experts — totalling over 20,000 weight tensors. We measured perplexity, ran academic benchmarks, discovered evaluation artifacts, and X-rayed architectures whose training code none of us had access to. Here’s what data-free mixed-precision quantization actually delivers, with no hedging.
The Core Claim, Tested
SWAN’s promise is simple: analyse a model’s weight statistics, assign each tensor the minimum bit width it can tolerate, and compress the model without needing a single sample of calibration data.
The evaluation spanned four architectures with very different characteristics:
| Model | Type | Total Params | Tensors | BF16 Size |
|---|---|---|---|---|
| Qwen3-8B | Dense | 8.19B | 399 | 15.3 GB |
| Qwen3-30B-A3B | MoE (128 experts) | 30.53B | 18,867 | 56.9 GB |
| GLM-4.7 | Dense | ~9B | ~400 | ~17 GB |
| GLM-4.7-Flash | MoE | ~31B | ~19,000 | 58.2 GB |
Every evaluation used the same protocol: WikiText-2 perplexity (2048 tokens, 256 samples, seed 42) on an Apple M2 Ultra with 192 GB unified memory. For Qwen3-8B, we added ARC-Challenge (25-shot) and HellaSwag (10-shot) benchmarks via lm-evaluation-harness.
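For reference, perplexity here is the exponentiated mean negative log-likelihood over the evaluated tokens. A minimal sketch of the metric itself (not the lm-evaluation-harness implementation, and with made-up log-probabilities):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Illustrative per-token log-probabilities from a model's forward pass.
logprobs = [-2.1, -0.3, -1.7, -0.9]
print(round(perplexity(logprobs), 3))  # 3.49
```

Lower is better: a perplexity of 3.49 means the model is, on average, as uncertain as a uniform choice over ~3.5 tokens.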
Value #1: Better Compression at Every Scale
The baseline comparison is uniform 4-bit quantization — every weight tensor gets the same bit width. This is what you get from mlx_lm.convert --quant or any standard quantization tool. SWAN consistently beats it.
| Model | Method | Size | PPL | ΔPPL vs BF16 | Compression |
|---|---|---|---|---|---|
| **Qwen3-8B (Dense)** | | | | | |
| Qwen3-8B | BF16 | 15.3 GB | 9.727 | — | 1.0× |
| Qwen3-8B | Uniform 4-bit | 4.1 GB | 10.250 | +5.4% | 3.77× |
| Qwen3-8B | SWAN v3 | 6.1 GB | 10.097 | +3.8% | 2.52× |
| **Qwen3-30B-A3B (MoE)** | | | | | |
| Qwen3-30B-A3B | BF16 | 56.9 GB | 8.728 | — | 1.0× |
| Qwen3-30B-A3B | Uniform 4-bit | 15.1 GB | 9.629 | +10.3% | 3.76× |
| Qwen3-30B-A3B | SWAN v3 | 16.2 GB | 9.041 | +3.6% | 3.51× |
| **GLM-4.7-Flash (MoE) — median PPL** | | | | | |
| GLM-4.7-Flash | BF16 | 58.2 GB | 8.706 | — | 1.0× |
| GLM-4.7-Flash | SWAN v3 | 15.9 GB | 9.084 | +4.3% | 3.66× |
The headline numbers: SWAN reduces perplexity degradation by 30% on the dense model (5.4% → 3.8%) and by 65% on the MoE model (10.3% → 3.6%). The MoE result is especially striking — uniform 4-bit loses over 10% of quality while SWAN holds the line at under 4%, with only marginally less compression.
This isn’t magic. MoE models have 128 experts per layer, each with different weight characteristics. Treating every expert identically is wasteful. SWAN identifies which experts are sensitive and allocates bits accordingly — some get 2-bit, many stay at 4-bit, critical ones get 8-bit or 16-bit.
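That allocation step can be pictured as a thresholding pass over per-tensor sensitivity scores. A hedged sketch (the thresholds, score scale, and tensor names below are illustrative, not SWAN's actual parameters):

```python
def assign_bits(score, thresholds=(0.3, 0.6, 0.85)):
    """Map a normalised sensitivity score in [0, 1] to a bit width.

    Robust tensors (low score) get 2-bit, most get 4-bit, and the most
    sensitive tensors get 8- or 16-bit. Cut points are hypothetical.
    """
    lo, mid, hi = thresholds
    if score < lo:
        return 2
    if score < mid:
        return 4
    if score < hi:
        return 8
    return 16

# Per-tensor scores from the analysis pass -> per-tensor bit widths.
scores = {"expert_0.w1": 0.12, "expert_7.w1": 0.55, "attn.q_proj": 0.91}
plan = {name: assign_bits(s) for name, s in scores.items()}
print(plan)  # {'expert_0.w1': 2, 'expert_7.w1': 4, 'attn.q_proj': 16}
```

The interesting part is upstream of this function: computing scores that actually predict quantization damage, per tensor, without running any data through the model.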
Value #2: Benchmark-Validated, Not Just Perplexity
Perplexity is necessary but not sufficient. A model could score well on next-token prediction while failing at actual reasoning tasks. So we ran two academic benchmarks on Qwen3-8B — ARC-Challenge (science reasoning, 25-shot) and HellaSwag (commonsense inference, 10-shot) — across all three conditions.
| Benchmark | BF16 | SWAN v3 | Δ | Uniform 4-bit | Δ |
|---|---|---|---|---|---|
| ARC-Challenge (acc_norm) | 44.62% | 43.43% | -1.19pp | 42.83% | -1.79pp |
| HellaSwag (acc_norm) | 60.04% | 58.16% | -1.88pp | 58.14% | -1.90pp |
On ARC-Challenge, SWAN gives up a third less accuracy than uniform 4-bit (1.19pp lost vs 1.79pp). The science reasoning that matters most — normalised accuracy on genuinely difficult questions — is better preserved by SWAN’s sensitivity-aware allocation.
HellaSwag shows near-identical results between SWAN and uniform 4-bit (-1.88pp vs -1.90pp). This makes sense: commonsense inference is distributed broadly across the model, so there’s less to gain from selective bit allocation. But SWAN still matches uniform 4-bit performance while using 50% more storage (6.1 GB vs 4.1 GB) — meaning the extra bits are being allocated to preserve the capabilities that do benefit from higher precision.
The cross-validation between perplexity and benchmarks matters. SWAN’s smaller PPL degradation (3.8% vs 5.4% for uniform 4-bit) translates into a measurable accuracy improvement on reasoning tasks. The metrics agree. This isn’t an evaluation artifact.
Value #3: No Data Required
This is the property that changes everything for production deployment.
Most competitive quantization methods — GPTQ, AWQ, SqueezeLLM — require calibration data: you feed representative samples through the model to measure activation patterns, then optimise quantization parameters against those observations. This creates three problems:
- Privacy exposure. If your model was trained on sensitive data, calibration samples may need to be drawn from similar distributions. In regulated industries — healthcare, finance, government — this can be a compliance blocker.
- Distribution bias. Calibration data determines which model behaviours are preserved. If your calibration set doesn’t represent real production queries, the quantised model may degrade on exactly the tasks that matter most.
- Pipeline friction. Every time you update a model, you need to source and validate calibration data, run it through the model, and hope the activation statistics are representative. This is a manual step that doesn’t belong in an automated CI/CD pipeline.
SWAN analyses only the weight tensors themselves. Four statistical metrics — spectral concentration (SVD), excess kurtosis, output noise amplification, and reconstruction error proxy — measured directly from the weights, with no forward pass required. The entire analysis runs in 3 minutes for an 8B model and 30 minutes for a 30B MoE model.
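Two of those four metrics are easy to sketch from the weights alone. The following is a minimal illustration, not SWAN's implementation; the top-k cutoff and the exact normalisations are assumptions:

```python
import numpy as np

def excess_kurtosis(w):
    """Fourth standardised moment minus 3; heavy-tailed (outlier-prone)
    weight distributions score high and are harder to quantize."""
    x = w.ravel() - w.mean()
    var = (x ** 2).mean()
    return (x ** 4).mean() / var ** 2 - 3.0

def spectral_concentration(w, k=8):
    """Fraction of spectral energy in the top-k singular values.

    A tensor whose energy concentrates in a few directions is more
    sensitive to quantization noise in those directions.
    """
    s = np.linalg.svd(w, compute_uv=False)
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

# A random Gaussian tensor: near-zero kurtosis, diffuse spectrum.
rng = np.random.default_rng(42)
w = rng.standard_normal((256, 256)).astype(np.float32)
print(round(excess_kurtosis(w), 3), round(spectral_concentration(w), 3))
```

Note that neither metric touches activations: everything is computed from the tensor itself, which is what makes the calibration-free property possible.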
This means SWAN can slot into an automated model registry. A new model checkpoint lands → SWAN analyses it → optimal bit allocation is determined → the quantised model is deployed. No human in the loop. No calibration data to curate.
Value #4: Models as Diagnostic Subjects
This was the unexpected discovery. SWAN’s per-tensor sensitivity analysis doesn’t just tell you how to quantise a model — it reveals how the model was built.
Consider the sensitivity score distributions across our test models:
| Model | Sensitivity Span | Max 2-bit | Max 16-bit | Interpretation |
|---|---|---|---|---|
| Qwen3-8B | 1.173 | 16.8% | 20.3% | Diverse — clear sensitive/robust layers |
| Qwen3-30B-A3B | 1.378 | 41.3% | 7.3% | Highly diverse — many robust experts |
| GLM-4.7-Flash | 0.073 | 0.0% | 3.3% | Homogeneous — all tensors look the same |
The Qwen models show a sensitivity span of 1.17–1.38: there are clearly robust tensors that tolerate aggressive compression and sensitive tensors that need protection. This is exactly the variance SWAN exploits.
GLM-4.7-Flash tells a radically different story. Its sensitivity span is 0.073 — 16–19× more homogeneous than the Qwen models. Every tensor looks nearly identical to SWAN’s analysis. This has two implications:
- For quantization: SWAN can’t differentiate tensors effectively, so it defaults to near-uniform allocation. The value proposition for mixed-precision is limited on highly regularised models.
- For the model builders: This homogeneity likely reflects heavy regularisation or normalisation during training. The same property that makes tensors indistinguishable to SWAN also correlates with the confidence calibration fragility we observed — 5 catastrophic sequences where the model assigns near-zero probability to correct tokens, producing perplexity spikes of 25,000–81,000.
SWAN became a model X-ray. Without access to training code, training data, or any insider knowledge, the sensitivity profile alone told us that GLM was trained differently from Qwen — and predicted the exact class of evaluation artifacts we later observed.
Value #5: Discovering Evaluation Blind Spots
SWAN’s evaluation on GLM-4.7-Flash produced a result that should have been a headline: the quantized model reported 12.5% lower perplexity than the BF16 baseline. A model compressed to one quarter of its size, scoring better than full precision.
We spent a week investigating. The finding: 5 out of 256 WikiText-2 test sequences produce catastrophic perplexity (25,000–81,000) in the full-precision model. Quantization noise accidentally stabilises these pathological sequences. Standard mean perplexity, dominated by these outliers, makes the quantised model look better.
The honest number — median perplexity — shows SWAN is 4.3% worse, exactly as expected from lossy compression.
This isn’t just a footnote. Perplexity is the most-reported metric in quantization research. If standard mean PPL can be dominated by 2% of evaluation sequences, then published results across the field may be unreliable. The same outlier dynamics that make quantization hard (high-kurtosis weight distributions) also make evaluating quantization unreliable (high-kurtosis sequence distributions).
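The mechanics of the artifact are easy to reproduce with synthetic numbers. The per-sequence values below are illustrative, shaped to match the pattern described above: a handful of catastrophic sequences drag the mean far from where almost every sequence sits.

```python
import statistics

# 256 per-sequence perplexities: 251 near 9, five catastrophic outliers,
# mimicking the GLM-4.7-Flash evaluation (values illustrative).
ppls = [8.7] * 251 + [25_000, 40_000, 55_000, 70_000, 81_000]

mean_ppl = statistics.mean(ppls)
median_ppl = statistics.median(ppls)
print(f"mean={mean_ppl:.1f}  median={median_ppl:.1f}")
# mean=1067.1  median=8.7
```

Under the mean, 2% of sequences contribute nearly all of the score; the median reflects what the model does on the other 98%.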
SWAN’s contribution here wasn’t the quantization itself — it was the rigour of the evaluation process. By investigating an anomalous result rather than celebrating it, we identified a systemic weakness in how the entire field measures compression quality.
Value #6: Purpose-Built for the MoE Era
The industry is moving to Mixture-of-Experts. Qwen3, DeepSeek-V3, Mixtral, DBRX, GLM-4 — the largest and most capable open models increasingly use sparse expert architectures. And MoE is where SWAN’s value proposition is strongest.
Here’s why. A 30B MoE model with 128 experts per layer has enormous internal diversity. Some experts activate frequently and encode critical knowledge. Others activate rarely and handle niche patterns. Uniform 4-bit treats them identically — and loses 10.3% on Qwen3-30B-A3B.
SWAN’s per-tensor analysis identifies the critical experts automatically. On Qwen3-30B-A3B, it allocates:
- 16.6% of tensors at 2-bit — robust experts that tolerate extreme compression
- 71.9% at 4-bit — the standard allocation for most weights
- 6.3% at 8-bit — moderately sensitive tensors
- 5.3% at 16-bit — the most sensitive attention and embedding layers
The result: 65% less quality degradation (3.6% vs 10.3%) with only 7% less compression (3.51× vs 3.76×). In efficiency terms, SWAN v3 on MoE pays roughly 1.0% of degradation per unit of compression (3.6% ÷ 3.51×) — the best ratio across all conditions tested.
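A back-of-envelope check on that allocation (assuming, as a simplification, that each bucket's share of parameters matches its share of tensors):

```python
# Effective bits per weight for the Qwen3-30B-A3B allocation above.
# Simplifying assumption: per-bucket parameter share == per-bucket
# tensor share (it isn't exactly; high-bit tensors tend to be smaller).
allocation = {2: 0.166, 4: 0.719, 8: 0.063, 16: 0.053}
eff_bits = sum(bits * frac for bits, frac in allocation.items())

params = 30.53e9
size_gb = params * eff_bits / 8 / 1e9
print(f"{eff_bits:.2f} bits/weight -> ~{size_gb:.1f} GB (measured: 16.2 GB)")
```

The ~17.4 GB estimate lands close to the measured 16.2 GB; the gap is consistent with the 16-bit buckets holding smaller tensors (attention, embeddings) than the expert buckets.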
As models get larger and more expert-heavy, this advantage compounds. SWAN doesn’t just compress MoE models — it understands their internal structure.
Where SWAN Falls Short
Honest evaluation means reporting limitations. We found three.
1. Highly regularised models neutralise it. GLM-4.7-Flash’s extreme weight homogeneity (sensitivity span 0.073) means SWAN can’t differentiate tensors effectively. If every tensor has nearly identical statistics, mixed-precision allocation has no leverage. SWAN doesn’t hurt — it just defaults to something close to uniform — but it can’t help either.
2. Dense models see modest gains. On Qwen3-8B, SWAN reduces PPL degradation from 5.4% to 3.8% — a real improvement, but with 50% more storage. The efficiency ratio is less compelling than on MoE. For dense models, the value is real but incremental.
3. Adaptive normalization can backfire. SWAN v2’s adaptive normalization amplifies tiny differences in narrow metric ranges, causing unnecessary bit upgrades on dense models. On Qwen3-8B, v2 actually performed worse than v1 (6.3% vs 4.1% degradation). The v3 hybrid approach addresses this, but the sensitivity to normalization strategy is a design constraint that requires continued attention.
The Composite Value
SWAN isn’t one thing. The evaluation revealed five distinct capabilities delivered by the same analysis pipeline:
| Capability | Evidence | Who Benefits |
|---|---|---|
| Data-free compression | 3–4% PPL degradation at 2.5–3.7× compression, no calibration data needed | Regulated industries, privacy-sensitive deployments |
| MoE-optimised allocation | 65% less quality loss vs uniform 4-bit on Qwen3-30B-A3B | Anyone deploying MoE models on constrained hardware |
| Model diagnostics | Predicted GLM’s calibration fragility from weight statistics alone | Model developers, quality assurance teams |
| Evaluation methodology | Discovered perplexity anomaly affecting published benchmarks | The entire quantization research community |
| Pipeline automation | 3–30 min end-to-end, no human-in-the-loop steps | MLOps teams running model registries |
Most quantization tools do exactly one thing: compress a model. SWAN does that, but the analysis it performs along the way turns out to be at least as valuable as the compression itself.
What This Means in Practice
A 30B MoE model in BF16 requires 57 GB of memory. Most edge devices — Mac laptops, workstations, embedded systems — cannot load it. At 16.2 GB after SWAN compression, it fits comfortably on a 32 GB M-series Mac with room for KV cache and inference overhead. The quality cost: 3.6% higher perplexity.
For the Qwen3-8B dense model, SWAN at 6.1 GB runs on any modern laptop with 8 GB of RAM, retaining 97.3% of ARC-Challenge accuracy and 96.9% of HellaSwag accuracy compared to the 15.3 GB BF16 version that wouldn’t fit in memory at all.
The models don’t just fit. They run faster. Smaller models mean less memory bandwidth pressure — the actual bottleneck on Apple Silicon and most inference hardware. In our benchmarks, the SWAN-quantised Qwen3-8B completed ARC-Challenge evaluation in 1,060 seconds vs 1,199 seconds for BF16 — 11.6% less wall-clock time from reduced model size alone.
The Bottom Line
SWAN delivers what the quantization field has been missing: intelligent, automated, data-free compression that understands model architecture.
On MoE models — the architecture the industry is converging on — SWAN achieves near-lossless compression at 3.5×. On dense models, it provides meaningful improvements over uniform quantization with complete transparency about the trade-offs. And the analysis it performs along the way reveals things about models that no other tool in the ecosystem can show you.
The results are real. Four models, 20,000 tensors, cross-validated with both perplexity and academic benchmarks, with every anomaly investigated rather than swept under the rug.
SWAN is open source. All evaluation data, per-tensor manifests, and analysis tools are available at github.com/baa-ai/swan-quantization. Hardware: Apple M2 Ultra, 192 GB unified memory. Software: MLX 0.30.3, Python 3.12.
Ready to deploy smarter models on less hardware?
Black Sheep AI brings deep expertise in model quantization, mixed-precision optimisation, and production AI systems. We help teams extract maximum intelligence from minimum hardware — using techniques like SWAN that go beyond one-size-fits-all compression.
Talk to Our Team