We evaluated SWAN across four models (Qwen3-8B, Qwen3-30B-A3B, GLM-4.7-Flash, GLM-4.7), three architectures (dense transformer, sparse MoE, dense MoE), and multiple SWAN versions (v1 through v3-opt). Here are the complete results.
Test Environment
All experiments were conducted on a single workstation with fixed seeds and consistent evaluation protocols. Reproducibility was a first-class concern throughout.
Hardware
- Compute: Apple M2 Ultra
- Memory: 192 GB Unified Memory
Software
- Python: 3.12.0
- MLX: 0.30.3
- mlx_lm: 0.30.4
- PyTorch: 2.6.0
Evaluation Protocol
- Perplexity: WikiText-2 test set, seq_len=2048, 256 samples, seed=42
- Benchmarks: lm-evaluation-harness v0.4.11 via mlx_lm.evaluate
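The perplexity figure reported throughout reduces to a simple computation over per-token negative log-likelihoods. A minimal sketch of that computation (not the mlx_lm implementation, which handles batching and sequence windowing):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy example: lower average NLL means lower (better) perplexity.
print(perplexity([2.1, 2.3, 2.0, 2.4]))
```

In practice the per-token NLLs come from the model's logits over the fixed 2048-token windows described above; the aggregation step is the same.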
Perplexity Results — Qwen3-8B (Dense Transformer, 8.19B Parameters)
Qwen3-8B is a standard dense transformer — the architecture where every parameter is active on every forward pass. This is SWAN’s primary design target, and the results demonstrate a clear progression from v1 through v3-opt.
| Condition | Avg Bits | Size (GB) | PPL | ΔPPL vs BF16 | Peak Mem (GB) |
|---|---|---|---|---|---|
| BF16 baseline | 16.00 | 15.26 | 9.727 | — | 17.44 |
| Uniform 4-bit | 4.00 | 4.05 | 10.250 | +5.4% | 6.36 |
| SWAN v1 (fixed norm) | 6.17 | 5.88 | 10.122 | +4.1% | 8.13 |
| SWAN v2 (adaptive norm) | 6.63 | 6.32 | 10.337 | +6.3% | 8.57 |
| SWAN v3 (hybrid) | ~5.82 | 6.95 | 10.102 | +3.9% | 9.25 |
| SWAN v3-opt | 5.82 | 6.05 | 10.097 | +3.8% | 8.30 |
Key observation: SWAN v3-opt achieves the best perplexity of any SWAN version (+3.8% vs BF16) at a model size of 6.05 GB — 60% smaller than BF16 and only 49% larger than uniform 4-bit. The threshold optimisation step in v3-opt eliminates the size overhead of v3’s hybrid approach while preserving its quality gains.
Note the v2 regression: adaptive norm allocation on a dense model with relatively homogeneous tensors can over-allocate bits without commensurate quality improvement. This is what motivated the v3 hybrid approach.
Perplexity Results — Qwen3-30B-A3B (Sparse MoE, 30.53B Total / 3.3B Active)
Qwen3-30B-A3B is a sparse Mixture-of-Experts model with 128 experts, of which 8 are active per token. The vast majority of parameters sit in expert FFN layers that are only activated for specific inputs. This architecture presents a fundamentally different quantisation challenge: most tensors are expert weights with similar structure, and the model is far more sensitive to uniform compression because each expert carries specialised knowledge.
| Condition | Avg Bits | Size (GB) | PPL | ΔPPL vs BF16 | Peak Mem (GB) |
|---|---|---|---|---|---|
| BF16 baseline | 16.00 | 56.87 | 8.728 | — | 59.18 |
| Uniform 4-bit | 4.00 | 15.11 | 9.629 | +10.3% | 17.42 |
| SWAN v1 (fixed norm) | 4.51 | 16.02 | 9.180 | +5.2% | 18.33 |
| SWAN v2 (adaptive norm) | 4.69 | 16.65 | 8.976 | +2.8% | 18.97 |
| SWAN v3 (hybrid) | ~4.5 | 16.20 | 9.041 | +3.6% | 18.51 |
| SWAN v2-opt | ~4.6 | 16.42 | 9.057 | +3.8% | 18.73 |
Key observation: On MoE, SWAN v2 is the clear winner at +2.8% degradation — less than a third of uniform 4-bit’s +10.3%. The v3 hybrid approach actually hurts MoE performance because its norm-based fallback pathway doesn’t account for the structural regularity of expert layers. This architecture-dependent behaviour is what led to the v4 auto-detection system, which routes dense models to v3-opt and MoE models to v2.
The uniform 4-bit result here is striking: 10.3% degradation means uniform quantisation destroys meaningful expert specialisation. Sensitivity-aware allocation preserves it by protecting the most critical expert weights.
Perplexity Results — GLM-4.7-Flash (Dense MoE, 31B Parameters)
GLM-4.7-Flash is a dense MoE architecture. Its perplexity results require careful interpretation due to the outlier sequence phenomenon documented in our previous article.
| Condition | Standard PPL | Median PPL | Size (GB) |
|---|---|---|---|
| BF16 | 11.344 | 8.706 | 58.2 |
| FP16 | 11.208 | 8.609 | 55.8 |
| Uniform 4-bit | 11.532 | — | 14.8 |
| SWAN v3 | 9.930* | 9.084 | 15.9 |
*Standard PPL is misleading. The 9.930 figure appears to show quantisation improving over the BF16 baseline. This is an artifact caused by 5 outlier sequences (PPL values of 25,000–81,000) that dominate the mean in BF16 but are partially suppressed by quantisation noise. Median PPL tells the true story: SWAN v3 degrades by 4.3% (9.084 vs 8.706), which is consistent with 3.66× compression. See article 18 for the full analysis.
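The mechanism behind this artifact is easy to demonstrate with synthetic numbers. The sketch below uses illustrative per-sequence PPLs (not the actual GLM-4.7-Flash sequence-level data) shaped like the evaluation: 256 sequences, five of them extreme outliers:

```python
import statistics

# Hypothetical per-sequence PPLs: mostly well-behaved, plus a few
# extreme outliers of the kind observed on GLM-4.7-Flash in BF16.
baseline = [8.7] * 251 + [25_000, 40_000, 55_000, 70_000, 81_000]
# Quantisation noise partially suppresses the outliers while slightly
# degrading the typical sequences.
quantised = [9.1] * 251 + [2_500, 4_000, 5_500, 7_000, 8_100]

for name, ppls in [("baseline", baseline), ("quantised", quantised)]:
    print(name, round(statistics.mean(ppls), 1), round(statistics.median(ppls), 1))
```

The mean drops after quantisation (the outliers shrink), falsely suggesting an improvement; the median rises from 8.7 to 9.1, correctly showing degradation on typical sequences.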
Academic Benchmark Results — Qwen3-8B
Perplexity measures prediction quality on raw text, but academic benchmarks test whether the model can still reason, recall knowledge, and follow instructions after compression. We ran ARC-Challenge (25-shot) and HellaSwag (10-shot) via mlx_lm.evaluate on Qwen3-8B across three conditions.
| Model | Size | ARC-C (25-shot) | HellaSwag (10-shot) | PPL |
|---|---|---|---|---|
| BF16 | 15.26 GB | 44.62% | 60.04% | 9.727 |
| SWAN v3-opt | 6.05 GB | 43.43% (-1.2%) | 58.16% (-1.9%) | 10.097 |
| Uniform 4-bit | 4.05 GB | 42.83% (-1.8%) | 58.14% (-1.9%) | 10.249 |
Key finding: The ordering BF16 > SWAN v3-opt > Uniform 4-bit is consistent across all three metrics. SWAN beats uniform 4-bit on ARC-Challenge (43.43% vs 42.83%), demonstrating that sensitivity-aware bit allocation preserves reasoning capability better than uniform compression. On HellaSwag, both quantised variants land within 0.02 percentage points of each other, suggesting this benchmark is less sensitive to per-tensor precision allocation.
The practical implication: SWAN v3-opt gives you better reasoning quality than uniform 4-bit at a 49% size premium (6.05 GB vs 4.05 GB). Whether that trade-off is worth it depends on your deployment constraints.
Bit Allocation Analysis
SWAN’s core mechanism is assigning different bit widths to different tensors based on their quantisation sensitivity. How those bits actually get distributed reveals how each model’s architecture interacts with the allocation algorithm.
Qwen3-8B Bit Distribution
| Bit Width | Uniform | SWAN v1 | SWAN v2 |
|---|---|---|---|
| 2-bit | 0% | 0% | 2.2% |
| 4-bit | 100% | 81.7% | 73.5% |
| 8-bit | 0% | 3.1% | 6.0% |
| 16-bit | 0% | 15.2% | 18.3% |
The v1→v2 progression shows more aggressive differentiation: v2’s adaptive norm scoring identifies a small tail of tensors (2.2%) that can tolerate aggressive 2-bit quantisation, freeing bits that are reallocated to the 18.3% of tensors kept at full 16-bit precision. The majority of tensors remain at 4-bit, but the edges of the distribution widen.
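A quick sanity check on the table: the tensor-count-weighted average bit width for v2 works out as follows. Note this need not match the "Avg Bits" column in the results tables, which is weighted by parameter count rather than tensor count:

```python
# SWAN v2 bit distribution for Qwen3-8B, as fractions of tensor count.
v2_dist = {2: 0.022, 4: 0.735, 8: 0.060, 16: 0.183}

# Tensor-weighted average; parameter-weighted averages differ because
# tensors vary widely in size (e.g. embeddings vs small norm layers).
avg_bits = sum(bits * frac for bits, frac in v2_dist.items())
print(round(avg_bits, 2))  # → 6.39
```

The gap between 6.39 (tensor-weighted) and the table's 6.63 (parameter-weighted) indicates that larger tensors skew slightly toward higher precision.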
Qwen3-30B-A3B Bit Distribution
| Bit Width | Uniform | SWAN v1 | SWAN v2 |
|---|---|---|---|
| 2-bit | 0% | 0% | 16.6% |
| 4-bit | 100% | 97.2% | 71.9% |
| 8-bit | 0% | 0.8% | 6.3% |
| 16-bit | 0% | 2.1% | 5.3% |
The MoE distribution is dramatically different. SWAN v2 pushes 16.6% of tensors to 2-bit — nearly all of these are expert FFN layers that the sensitivity analysis identifies as redundant or highly resilient to quantisation noise. This aggressive low-bit allocation for non-critical experts is precisely why SWAN v2 excels on MoE: it recognises that not all experts are equally important and compresses the least sensitive ones aggressively.
Meanwhile, only 5.3% of tensors get 16-bit treatment (vs 18.3% in Qwen3-8B), reflecting the fact that MoE models have fewer critical shared layers relative to their total parameter count.
Sensitivity Score Discrimination
SWAN’s effectiveness depends on there being meaningful variation in how sensitive different tensors are to quantisation. If all tensors respond similarly, there’s nothing for sensitivity-aware allocation to exploit. The sensitivity score span — the range between the most and least sensitive tensors — quantifies this opportunity.
| Model | Sensitivity Score Span | Max 2-bit % | Max 16-bit % |
|---|---|---|---|
| Qwen3-8B | 1.173 | 16.8% | 20.3% |
| Qwen3-30B-A3B | 1.378 | 41.3% | 7.3% |
| GLM-4.7-Flash | 0.073 | 0.0% | 3.3% |
Key insight: GLM-4.7-Flash's sensitivity span is 16-19× narrower than the Qwen models' (0.073 vs 1.173 and 1.378). With so little variation for SWAN to exploit, the algorithm correctly responds by making minimal allocation changes (0% at 2-bit, only 3.3% at 16-bit), effectively falling back to near-uniform quantisation.
This explains why SWAN’s benefit on GLM-4.7-Flash is modest compared to its impact on Qwen models. It is not a limitation of the algorithm — it is a property of the model. When the architecture produces uniformly sensitive tensors, the correct allocation is uniform. SWAN correctly identifies this and acts accordingly.
Conversely, Qwen3-30B-A3B’s high span (1.378) with 41.3% of tensors eligible for 2-bit quantisation explains why SWAN achieves its most dramatic improvement on this model — cutting uniform 4-bit’s 10.3% degradation down to 2.8%.
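The span-driven behaviour can be sketched as a thresholding scheme. The thresholds and the uniform-fallback cutoff below are hypothetical and the 8-bit tier is omitted for brevity; this illustrates the idea, not SWAN's actual allocation rule:

```python
def allocate_bits(scores, low=0.3, high=0.7):
    """Map per-tensor sensitivity scores to bit widths.
    Thresholds are illustrative, not SWAN's actual values."""
    lo, hi = min(scores), max(scores)
    span = hi - lo
    if span < 0.1:  # near-homogeneous model: fall back to uniform 4-bit
        return [4] * len(scores)
    norm = [(s - lo) / span for s in scores]
    return [2 if n < low else 16 if n > high else 4 for n in norm]

# A wide-span model (Qwen-like) gets differentiated treatment...
print(allocate_bits([0.1, 0.5, 0.9, 1.3]))    # → [2, 4, 4, 16]
# ...while a narrow-span model (GLM-like) collapses to uniform.
print(allocate_bits([0.50, 0.52, 0.55, 0.57]))  # → [4, 4, 4, 4]
```

The second case mirrors GLM-4.7-Flash: when the span is negligible, near-uniform allocation is not a failure mode but the correct answer.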
Compression Efficiency
Raw compression ratio and raw perplexity degradation do not tell the full story individually. Compression efficiency — defined as the ratio of perplexity degradation to compression ratio — captures how much quality you sacrifice per unit of compression.
| Model | Condition | Compression | PPL Degradation | Efficiency |
|---|---|---|---|---|
| Qwen3-8B | Uniform 4-bit | 3.77× | +5.4% | 1.43 |
| Qwen3-8B | SWAN v1 | 2.59× | +4.1% | 1.58 |
| Qwen3-30B-A3B | Uniform 4-bit | 3.76× | +10.3% | 2.74 |
| Qwen3-30B-A3B | SWAN v2 | 3.42× | +2.8% | 0.82 |
Lower efficiency scores are better — they mean less quality loss per unit of compression. SWAN v2 on the MoE model achieves 0.82, meaning each unit of compression costs less than 1% perplexity. Uniform 4-bit on the same model scores 2.74, paying nearly 3% perplexity per compression unit. This 3.3× efficiency advantage is SWAN’s strongest result across the entire evaluation.
For Qwen3-8B, uniform 4-bit actually has a better efficiency ratio (1.43 vs 1.58) because it achieves much higher compression (3.77× vs 2.59×). SWAN v1 preserves more quality but at a lower compression ratio, so the per-unit cost is slightly higher. This highlights that efficiency is not the only metric that matters — the absolute quality level and the absolute size also factor into deployment decisions.
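The efficiency metric itself is a one-line ratio; reproducing two rows of the table:

```python
def efficiency(ppl_degradation_pct, compression_ratio):
    """PPL degradation (%) paid per unit of compression; lower is better."""
    return ppl_degradation_pct / compression_ratio

# Figures from the Qwen3-30B-A3B rows of the table above.
print(round(efficiency(10.3, 3.76), 2))  # uniform 4-bit → 2.74
print(round(efficiency(2.8, 3.42), 2))   # SWAN v2 → 0.82
```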
Processing Time
SWAN adds an analysis phase before quantisation, where it profiles every tensor to compute sensitivity scores. The conversion step itself is fast; the analysis is the bottleneck.
| Phase | Qwen3-8B | Qwen3-30B-A3B |
|---|---|---|
| Tensors analysed | 399 | 18,867 |
| v3 analysis time | ~3.3 min | ~44 min |
| Conversion time | ~6s | ~15s |
The 47× increase in tensor count from Qwen3-8B to Qwen3-30B-A3B (399 to 18,867) drives a roughly 13× increase in analysis time. The sub-linear scaling reflects that many MoE expert tensors share similar shapes and can be analysed in batches. Conversion itself is negligible — under 15 seconds even for a 30B parameter model.
For deployment workflows, the analysis phase is a one-time cost. Once the sensitivity manifest is generated, subsequent conversions with different threshold settings (e.g., sweeping v3-opt parameters) require only the conversion step.
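The manifest-reuse workflow amounts to caching the expensive analysis output and re-reading it for each threshold sweep. The JSON layout below is a hypothetical simplification (the actual SWAN schema may differ), but it captures the shape of the workflow:

```python
import json

# Hypothetical manifest: one sensitivity score per tensor, written
# once by the analysis phase. Tensor names are illustrative.
manifest = {"model.layers.0.mlp.gate_proj": 0.91,
            "model.layers.0.mlp.up_proj": 0.34}

with open("sensitivity_manifest.json", "w") as f:
    json.dump(manifest, f)

# A later conversion run only re-reads the scores and re-applies
# (possibly different) thresholds -- no re-analysis needed.
with open("sensitivity_manifest.json") as f:
    scores = json.load(f)
high_precision = [name for name, s in scores.items() if s > 0.8]
print(high_precision)  # → ['model.layers.0.mlp.gate_proj']
```

This is why a v3-opt threshold sweep costs seconds per configuration rather than the ~44 minutes of a fresh MoE analysis.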
Key Takeaways
Across four models, three architectures, and multiple SWAN versions, several patterns emerge consistently:
- SWAN v3-opt is the recommended version for dense models. It achieves the lowest perplexity (+3.8% on Qwen3-8B) at a reasonable size (6.05 GB), with threshold optimisation eliminating the size overhead of v3’s hybrid approach.
- SWAN v2 remains best for MoE models. The v3 hybrid fallback pathway hurts MoE performance because it does not account for the structural regularity of expert layers. On Qwen3-30B-A3B, v2 achieves +2.8% vs v3’s +3.6%.
- v4 auto-detection solves the architecture routing problem. By automatically detecting model architecture and routing to the correct SWAN version (v3-opt for dense, v2 for MoE), v4 removes the need for users to understand these trade-offs.
- Threshold optimisation (v3-opt) saves 13% size at zero quality cost. The optimised thresholds produce a 6.05 GB model vs v3’s 6.95 GB, with perplexity actually improving from 10.102 to 10.097.
- Standard perplexity is unreliable. Always report median alongside mean. On GLM-4.7-Flash, the mean suggests quantisation improves quality; the median correctly shows 4.3% degradation.
- MoE models benefit disproportionately from sensitivity-aware quantisation. Uniform 4-bit costs Qwen3-30B-A3B 10.3% perplexity; SWAN v2 reduces this to 2.8%. On dense Qwen3-8B, the gap is narrower (5.4% vs 3.8%), because dense models have more homogeneous tensor sensitivity.
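The v4 routing decision described above can be sketched as a config inspection. The key names below are hypothetical stand-ins (real model configs name their MoE fields differently, e.g. `num_experts` vs `n_routed_experts`), not the actual v4 detection code:

```python
def route_swan_version(config: dict) -> str:
    """Pick a SWAN version from a model's config dict.
    Detection logic is illustrative, not the actual v4 implementation."""
    moe_keys = ("num_experts", "num_local_experts", "n_routed_experts")
    is_moe = any(config.get(k, 0) and config[k] > 1 for k in moe_keys)
    return "v2" if is_moe else "v3-opt"

print(route_swan_version({"num_experts": 128}))   # sparse MoE → 'v2'
print(route_swan_version({"hidden_size": 4096}))  # dense → 'v3-opt'
```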
Reproducibility
Every result in this article is fully reproducible:
- All random seeds fixed at 42 across all evaluation runs
- Per-tensor sensitivity manifests preserved as JSON files, enabling exact reproduction of bit allocation decisions
- All perplexity and benchmark results saved as JSON with full configuration metadata (model paths, quantisation settings, evaluation parameters)
- Code available at github.com/baa-ai/swan-quantization
Hardware: Apple M2 Ultra, 192 GB unified memory. Software: Python 3.12.0, MLX 0.30.3, mlx_lm 0.30.4, PyTorch 2.6.0.
Need help quantising your models for production?
Black Sheep AI brings deep expertise in model quantisation, mixed-precision optimisation, and production AI systems. Whether you're deploying on edge hardware or optimising cloud costs, we can help you find the right compression strategy for your architecture.
Talk to Our Team