We evaluated SWAN across four models (Qwen3-8B, Qwen3-30B-A3B, GLM-4.7-Flash, GLM-4.7), three architectures (dense transformer, sparse MoE, dense MoE), and multiple SWAN versions (v1 through v3-opt). Here are the complete results.
Test Environment
All experiments were conducted on a single workstation with fixed seeds and consistent evaluation protocols. Reproducibility was a first-class concern throughout.
Hardware
- Compute: Apple M2 Ultra
- Memory: 192 GB Unified Memory
Software
- Python: 3.12.0
- MLX: 0.30.3
- mlx_lm: 0.30.4
- PyTorch: 2.6.0
Evaluation Protocol
- Perplexity: WikiText-2 test set, seq_len=2048, 256 samples, seed=42
- Benchmarks: lm-evaluation-harness v0.4.11 via mlx_lm.evaluate
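The perplexity figure reported throughout reduces to a simple computation over per-token negative log-likelihoods. A minimal sketch of that computation (not the mlx_lm implementation, which handles batching and sequence windowing):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy example: lower average NLL means lower (better) perplexity.
print(perplexity([2.1, 2.3, 2.0, 2.4]))
```

In practice the per-token NLLs come from the model's logits over the fixed 2048-token windows described above; the aggregation step is the same.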
Perplexity Results — Qwen3-8B (Dense Transformer, 8.19B Parameters)
Qwen3-8B is a standard dense transformer — the architecture where every parameter is active on every forward pass. This is SWAN’s primary design target, and the results demonstrate a clear progression from v1 through v3-opt.
| Condition | Avg Bits | Size (GB) | PPL | ΔPPL vs BF16 | Peak Mem (GB) |
|---|---|---|---|---|---|
| BF16 baseline | 16.00 | 15.26 | 9.727 | — | 17.44 |
| Uniform 4-bit | 4.00 | 4.05 | 10.250 | +5.4% | 6.36 |
| SWAN v1 (fixed norm) | 6.17 | 5.88 | 10.122 | +4.1% | 8.13 |
| SWAN v2 (adaptive norm) | 6.63 | 6.32 | 10.337 | +6.3% | 8.57 |
| SWAN v3 (hybrid) | ~5.82 | 6.95 | 10.102 | +3.9% | 9.25 |
| SWAN v3-opt | 5.82 | 6.05 | 10.097 | +3.8% | 8.30 |
Key observation: SWAN v3-opt achieves the best perplexity of any SWAN version (+3.8% vs BF16) at a model size of 6.05 GB — 60% smaller than BF16 and only 49% larger than uniform 4-bit. The threshold optimisation step in v3-opt eliminates the size overhead of v3’s hybrid approach while preserving its quality gains.
Note the v2 regression: adaptive norm allocation on a dense model with relatively homogeneous tensors can over-allocate bits without commensurate quality improvement. This is what motivated the v3 hybrid approach.
Perplexity Results — Qwen3-30B-A3B (Sparse MoE, 30.53B Total / 3.3B Active)
Qwen3-30B-A3B is a sparse Mixture-of-Experts model with 128 experts, of which 8 are active per token. The vast majority of parameters sit in expert FFN layers that are only activated for specific inputs. This architecture presents a fundamentally different quantisation challenge: most tensors are expert weights with similar structure, and the model is far more sensitive to uniform compression because each expert carries specialised knowledge.
| Condition | Avg Bits | Size (GB) | PPL | ΔPPL vs BF16 | Peak Mem (GB) |
|---|---|---|---|---|---|
| BF16 baseline | 16.00 | 56.87 | 8.728 | — | 59.18 |
| Uniform 4-bit | 4.00 | 15.11 | 9.629 | +10.3% | 17.42 |
| SWAN v1 (fixed norm) | 4.51 | 16.02 | 9.180 | +5.2% | 18.33 |
| SWAN v2 (adaptive norm) | 4.69 | 16.65 | 8.976 | +2.8% | 18.97 |
| SWAN v3 (hybrid) | ~4.5 | 16.20 | 9.041 | +3.6% | 18.51 |
| SWAN v2-opt | ~4.6 | 16.42 | 9.057 | +3.8% | 18.73 |
Key observation: On MoE, SWAN v2 is the clear winner at +2.8% degradation — less than a third of uniform 4-bit’s +10.3%. The v3 hybrid approach actually hurts MoE performance because its norm-based fallback pathway doesn’t account for the structural regularity of expert layers. This architecture-dependent behaviour is what led to the v4 auto-detection system, which routes dense models to v3-opt and MoE models to v2.
The uniform 4-bit result here is striking: 10.3% degradation means uniform quantisation destroys meaningful expert specialisation. Sensitivity-aware allocation preserves it by protecting the most critical expert weights.
Perplexity Results — GLM-4.7-Flash (Dense MoE, 31B Parameters)
GLM-4.7-Flash is a dense MoE architecture. Its perplexity results require careful interpretation due to the outlier sequence phenomenon documented in our previous article.
| Condition | Standard PPL | Median PPL | Size (GB) |
|---|---|---|---|
| BF16 | 11.344 | 8.706 | 58.2 |
| FP16 | 11.208 | 8.609 | 55.8 |
| Uniform 4-bit | 11.532 | — | 14.8 |
| SWAN v3 | 9.930* | 9.084 | 15.9 |
*Standard PPL is misleading. The 9.930 figure appears to show quantisation improving over the BF16 baseline. This is an artifact caused by 5 outlier sequences (PPL values of 25,000–81,000) that dominate the mean in BF16 but are partially suppressed by quantisation noise. Median PPL tells the true story: SWAN v3 degrades by 4.3% (9.084 vs 8.706), which is consistent with 3.66× compression. See article 18 for the full analysis.
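The mechanism behind this artifact is easy to demonstrate with synthetic numbers. The sketch below uses illustrative per-sequence PPLs (not the actual GLM-4.7-Flash sequence-level data) shaped like the evaluation: 256 sequences, five of them extreme outliers:

```python
import statistics

# Hypothetical per-sequence PPLs: mostly well-behaved, plus a few
# extreme outliers of the kind observed on GLM-4.7-Flash in BF16.
baseline = [8.7] * 251 + [25_000, 40_000, 55_000, 70_000, 81_000]
# Quantisation noise partially suppresses the outliers while slightly
# degrading the typical sequences.
quantised = [9.1] * 251 + [2_500, 4_000, 5_500, 7_000, 8_100]

for name, ppls in [("baseline", baseline), ("quantised", quantised)]:
    print(name, round(statistics.mean(ppls), 1), round(statistics.median(ppls), 1))
```

The mean drops after quantisation (the outliers shrink), falsely suggesting an improvement; the median rises from 8.7 to 9.1, correctly showing degradation on typical sequences.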
Academic Benchmark Results — Qwen3-8B
Perplexity measures prediction quality on raw text, but academic benchmarks test whether the model can still reason, recall knowledge, and follow instructions after compression. We ran ARC-Challenge (25-shot) and HellaSwag (10-shot) via mlx_lm.evaluate on Qwen3-8B across three conditions.
| Model | Size | ARC-C (25-shot) | HellaSwag (10-shot) | PPL |
|---|---|---|---|---|
| BF16 | 15.26 GB | 44.62% | 60.04% | 9.727 |
| SWAN v3-opt | 6.05 GB | 43.43% (-1.2%) | 58.16% (-1.9%) | 10.097 |
| Uniform 4-bit | 4.05 GB | 42.83% (-1.8%) | 58.14% (-1.9%) | 10.249 |
Key finding: The ordering BF16 > SWAN v3-opt > Uniform 4-bit is consistent across all three metrics. SWAN beats uniform 4-bit on ARC-Challenge (43.43% vs 42.83%), demonstrating that sensitivity-aware bit allocation preserves reasoning capability better than uniform compression. On HellaSwag, both quantised variants land within 0.02 percentage points of each other, suggesting this benchmark is less sensitive to per-tensor precision allocation.
The practical implication: SWAN v3-opt gives you better reasoning quality than uniform 4-bit at a 49% size premium (6.05 GB vs 4.05 GB). Whether that trade-off is worth it depends on your deployment constraints.
Bit Allocation Analysis
SWAN’s core mechanism is assigning different bit widths to different tensors based on their quantisation sensitivity. How those bits actually get distributed reveals how each model’s architecture interacts with the allocation algorithm.
Qwen3-8B Bit Distribution
| Bit Width | Uniform | SWAN v1 | SWAN v2 |
|---|---|---|---|
| 2-bit | 0% | 0% | 2.2% |
| 4-bit | 100% | 81.7% | 73.5% |
| 8-bit | 0% | 3.1% | 6.0% |
| 16-bit | 0% | 15.2% | 18.3% |
The v1→v2 progression shows more aggressive differentiation: v2’s adaptive norm scoring identifies a small tail of tensors (2.2%) that can tolerate aggressive 2-bit quantisation, freeing bits that are reallocated to the 18.3% of tensors kept at full 16-bit precision. The majority of tensors remain at 4-bit, but the edges of the distribution widen.
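A quick sanity check on the table: the tensor-count-weighted average bit width for v2 works out as follows. Note this need not match the "Avg Bits" column in the results tables, which is weighted by parameter count rather than tensor count:

```python
# SWAN v2 bit distribution for Qwen3-8B, as fractions of tensor count.
v2_dist = {2: 0.022, 4: 0.735, 8: 0.060, 16: 0.183}

# Tensor-weighted average; parameter-weighted averages differ because
# tensors vary widely in size (e.g. embeddings vs small norm layers).
avg_bits = sum(bits * frac for bits, frac in v2_dist.items())
print(round(avg_bits, 2))  # → 6.39
```

The gap between 6.39 (tensor-weighted) and the table's 6.63 (parameter-weighted) indicates that larger tensors skew slightly toward higher precision.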
Qwen3-30B-A3B Bit Distribution
| Bit Width | Uniform | SWAN v1 | SWAN v2 |
|---|---|---|---|
| 2-bit | 0% | 0% | 16.6% |
| 4-bit | 100% | 97.2% | 71.9% |
| 8-bit | 0% | 0.8% | 6.3% |
| 16-bit | 0% | 2.1% | 5.3% |
The MoE distribution is dramatically different. SWAN v2 pushes 16.6% of tensors to 2-bit — nearly all of these are expert FFN layers that the sensitivity analysis identifies as redundant or highly resilient to quantisation noise. This aggressive low-bit allocation for non-critical experts is precisely why SWAN v2 excels on MoE: it recognises that not all experts are equally important and compresses the least sensitive ones aggressively.
Meanwhile, only 5.3% of tensors get 16-bit treatment (vs 18.3% in Qwen3-8B), reflecting the fact that MoE models have fewer critical shared layers relative to their total parameter count.
Sensitivity Score Discrimination
SWAN’s effectiveness depends on there being meaningful variation in how sensitive different tensors are to quantisation. If all tensors respond similarly, there’s nothing for sensitivity-aware allocation to exploit. The sensitivity score span — the range between the most and least sensitive tensors — quantifies this opportunity.
| Model | Sensitivity Score Span | Max 2-bit % | Max 16-bit % |
|---|---|---|---|
| Qwen3-8B | 1.173 | 16.8% | 20.3% |
| Qwen3-30B-A3B | 1.378 | 41.3% | 7.3% |
| GLM-4.7-Flash | 0.073 | 0.0% | 3.3% |
Key insight: GLM-4.7-Flash's sensitivity span is 16-19× narrower than the Qwen models' (0.073 vs 1.173 and 1.378). With so little variation for SWAN to exploit, the algorithm correctly responds by making minimal allocation changes (0% at 2-bit, only 3.3% at 16-bit), effectively falling back to near-uniform quantisation.
This explains why SWAN’s benefit on GLM-4.7-Flash is modest compared to its impact on Qwen models. It is not a limitation of the algorithm — it is a property of the model. When the architecture produces uniformly sensitive tensors, the correct allocation is uniform. SWAN correctly identifies this and acts accordingly.
Conversely, Qwen3-30B-A3B’s high span (1.378) with 41.3% of tensors eligible for 2-bit quantisation explains why SWAN achieves its most dramatic improvement on this model — cutting uniform 4-bit’s 10.3% degradation down to 2.8%.
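The span-driven behaviour can be sketched as a thresholding scheme. The thresholds and the uniform-fallback cutoff below are hypothetical and the 8-bit tier is omitted for brevity; this illustrates the idea, not SWAN's actual allocation rule:

```python
def allocate_bits(scores, low=0.3, high=0.7):
    """Map per-tensor sensitivity scores to bit widths.
    Thresholds are illustrative, not SWAN's actual values."""
    lo, hi = min(scores), max(scores)
    span = hi - lo
    if span < 0.1:  # near-homogeneous model: fall back to uniform 4-bit
        return [4] * len(scores)
    norm = [(s - lo) / span for s in scores]
    return [2 if n < low else 16 if n > high else 4 for n in norm]

# A wide-span model (Qwen-like) gets differentiated treatment...
print(allocate_bits([0.1, 0.5, 0.9, 1.3]))    # → [2, 4, 4, 16]
# ...while a narrow-span model (GLM-like) collapses to uniform.
print(allocate_bits([0.50, 0.52, 0.55, 0.57]))  # → [4, 4, 4, 4]
```

The second case mirrors GLM-4.7-Flash: when the span is negligible, near-uniform allocation is not a failure mode but the correct answer.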
Compression Efficiency
Raw compression ratio and raw perplexity degradation do not tell the full story individually. Compression efficiency — defined as the ratio of perplexity degradation to compression ratio — captures how much quality you sacrifice per unit of compression.
| Model | Condition | Compression | PPL Degradation | Efficiency |
|---|---|---|---|---|
| Qwen3-8B | Uniform 4-bit | 3.77× | +5.4% | 1.43 |
| Qwen3-8B | SWAN v1 | 2.59× | +4.1% | 1.58 |
| Qwen3-30B-A3B | Uniform 4-bit | 3.76× | +10.3% | 2.74 |
| Qwen3-30B-A3B | SWAN v2 | 3.42× | +2.8% | 0.82 |
Lower efficiency scores are better — they mean less quality loss per unit of compression. SWAN v2 on the MoE model achieves 0.82, meaning each unit of compression costs less than 1% perplexity. Uniform 4-bit on the same model scores 2.74, paying nearly 3% perplexity per compression unit. This 3.3× efficiency advantage is SWAN’s strongest result across the entire evaluation.
For Qwen3-8B, uniform 4-bit actually has a better efficiency ratio (1.43 vs 1.58) because it achieves much higher compression (3.77× vs 2.59×). SWAN v1 preserves more quality but at a lower compression ratio, so the per-unit cost is slightly higher. This highlights that efficiency is not the only metric that matters — the absolute quality level and the absolute size also factor into deployment decisions.
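The efficiency metric itself is a one-line ratio; reproducing two rows of the table:

```python
def efficiency(ppl_degradation_pct, compression_ratio):
    """PPL degradation (%) paid per unit of compression; lower is better."""
    return ppl_degradation_pct / compression_ratio

# Figures from the Qwen3-30B-A3B rows of the table above.
print(round(efficiency(10.3, 3.76), 2))  # uniform 4-bit → 2.74
print(round(efficiency(2.8, 3.42), 2))   # SWAN v2 → 0.82
```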
Processing Time
SWAN adds an analysis phase before quantisation, where it profiles every tensor to compute sensitivity scores. The conversion step itself is fast; the analysis is the bottleneck.
| Phase | Qwen3-8B | Qwen3-30B-A3B |
|---|---|---|
| Tensors analysed | 399 | 18,867 |
| v3 analysis time | ~3.3 min | ~44 min |
| Conversion time | ~6s | ~15s |
The 47× increase in tensor count from Qwen3-8B to Qwen3-30B-A3B (399 to 18,867) drives a roughly 13× increase in analysis time. The sub-linear scaling reflects that many MoE expert tensors share similar shapes and can be analysed in batches. Conversion itself is negligible — under 15 seconds even for a 30B parameter model.
For deployment workflows, the analysis phase is a one-time cost. Once the sensitivity manifest is generated, subsequent conversions with different threshold settings (e.g., sweeping v3-opt parameters) require only the conversion step.
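The manifest-reuse workflow amounts to caching the expensive analysis output and re-reading it for each threshold sweep. The JSON layout below is a hypothetical simplification (the actual SWAN schema may differ), but it captures the shape of the workflow:

```python
import json

# Hypothetical manifest: one sensitivity score per tensor, written
# once by the analysis phase. Tensor names are illustrative.
manifest = {"model.layers.0.mlp.gate_proj": 0.91,
            "model.layers.0.mlp.up_proj": 0.34}

with open("sensitivity_manifest.json", "w") as f:
    json.dump(manifest, f)

# A later conversion run only re-reads the scores and re-applies
# (possibly different) thresholds -- no re-analysis needed.
with open("sensitivity_manifest.json") as f:
    scores = json.load(f)
high_precision = [name for name, s in scores.items() if s > 0.8]
print(high_precision)  # → ['model.layers.0.mlp.gate_proj']
```

This is why a v3-opt threshold sweep costs seconds per configuration rather than the ~44 minutes of a fresh MoE analysis.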
Key Takeaways
Across four models, three architectures, and multiple SWAN versions, several patterns emerge consistently:
- SWAN v3-opt is the recommended version for dense models. It achieves the lowest perplexity (+3.8% on Qwen3-8B) at a reasonable size (6.05 GB), with threshold optimisation eliminating the size overhead of v3’s hybrid approach.
- SWAN v2 remains best for MoE models. The v3 hybrid fallback pathway hurts MoE performance because it does not account for the structural regularity of expert layers. On Qwen3-30B-A3B, v2 achieves +2.8% vs v3’s +3.6%.
- v4 auto-detection solves the architecture routing problem. By automatically detecting model architecture and routing to the correct SWAN version (v3-opt for dense, v2 for MoE), v4 removes the need for users to understand these trade-offs.
- Threshold optimisation (v3-opt) saves 13% size at zero quality cost. The optimised thresholds produce a 6.05 GB model vs v3’s 6.95 GB, with perplexity actually improving from 10.102 to 10.097.
- Standard perplexity is unreliable. Always report median alongside mean. On GLM-4.7-Flash, the mean suggests quantisation improves quality; the median correctly shows 4.3% degradation.
- MoE models benefit disproportionately from sensitivity-aware quantisation. Uniform 4-bit costs Qwen3-30B-A3B 10.3% perplexity; SWAN v2 reduces this to 2.8%. On dense Qwen3-8B, the gap is narrower (5.4% vs 3.8%), because dense models have more homogeneous tensor sensitivity.
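The v4 routing decision described above can be sketched as a config inspection. The key names below are hypothetical stand-ins (real model configs name their MoE fields differently, e.g. `num_experts` vs `n_routed_experts`), not the actual v4 detection code:

```python
def route_swan_version(config: dict) -> str:
    """Pick a SWAN version from a model's config dict.
    Detection logic is illustrative, not the actual v4 implementation."""
    moe_keys = ("num_experts", "num_local_experts", "n_routed_experts")
    is_moe = any(config.get(k, 0) and config[k] > 1 for k in moe_keys)
    return "v2" if is_moe else "v3-opt"

print(route_swan_version({"num_experts": 128}))   # sparse MoE → 'v2'
print(route_swan_version({"hidden_size": 4096}))  # dense → 'v3-opt'
```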
Reproducibility
Every result in this article is fully reproducible:
- All random seeds fixed at 42 across all evaluation runs
- Per-tensor sensitivity manifests preserved as JSON files, enabling exact reproduction of bit allocation decisions
- All perplexity and benchmark results saved as JSON with full configuration metadata (model paths, quantisation settings, evaluation parameters)
- Code available at github.com/baa-ai/swan-quantization
Hardware: Apple M2 Ultra, 192 GB unified memory. Software: Python 3.12.0, MLX 0.30.3, mlx_lm 0.30.4, PyTorch 2.6.0.
Need help quantising your models for production?
Black Sheep AI brings deep expertise in model quantisation, mixed-precision optimisation, and production AI systems. Whether you're deploying on edge hardware or optimising cloud costs, we can help you find the right compression strategy for your architecture.
Talk to Our Team