
SWAN Evaluation Results: Four Models, Three Architectures, One Framework

March 2026 · Black Sheep AI Research

We evaluated SWAN across four models (Qwen3-8B, Qwen3-30B-A3B, GLM-4.7-Flash, GLM-4.7), three architectures (dense transformer, sparse MoE, dense MoE), and multiple SWAN versions (v1 through v3-opt). Here are the complete results.

Test Environment

All experiments were conducted on a single workstation with fixed seeds and consistent evaluation protocols. Reproducibility was a first-class concern throughout.

Hardware

Apple M2 Ultra workstation with 192 GB unified memory.

Software

Python 3.12.0, MLX 0.30.3, mlx_lm 0.30.4, PyTorch 2.6.0.

Evaluation Protocol

Every condition was measured with fixed seeds under identical settings: perplexity on a held-out corpus, plus academic benchmarks run via mlx_lm.evaluate (ARC-Challenge 25-shot, HellaSwag 10-shot).

Perplexity Results — Qwen3-8B (Dense Transformer, 8.19B Parameters)

Qwen3-8B is a standard dense transformer — the architecture where every parameter is active on every forward pass. This is SWAN’s primary design target, and the results demonstrate a clear progression from v1 through v3-opt.

| Condition | Avg Bits | Size (GB) | PPL | ΔPPL vs BF16 | Peak Mem (GB) |
|---|---|---|---|---|---|
| BF16 baseline | 16.00 | 15.26 | 9.727 | — | 17.44 |
| Uniform 4-bit | 4.00 | 4.05 | 10.250 | +5.4% | 6.36 |
| SWAN v1 (fixed norm) | 6.17 | 5.88 | 10.122 | +4.1% | 8.13 |
| SWAN v2 (adaptive norm) | 6.63 | 6.32 | 10.337 | +6.3% | 8.57 |
| SWAN v3 (hybrid) | ~5.82 | 6.95 | 10.102 | +3.9% | 9.25 |
| SWAN v3-opt | 5.82 | 6.05 | 10.097 | +3.8% | 8.30 |

Key observation: SWAN v3-opt achieves the best perplexity of any SWAN version (+3.8% vs BF16) at a model size of 6.05 GB — 60% smaller than BF16 and only 49% larger than uniform 4-bit. The threshold optimisation step in v3-opt eliminates the size overhead of v3’s hybrid approach while preserving its quality gains.

Note the v2 regression: adaptive norm allocation on a dense model with relatively homogeneous tensors can over-allocate bits without commensurate quality improvement. This is what motivated the v3 hybrid approach.
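For intuition on why per-tensor sensitivity varies at all, a toy symmetric round-to-nearest quantiser is enough: a single outlier value forces a coarse scale on the whole tensor and inflates reconstruction error. This is a sketch for illustration only, not SWAN's (or MLX's) actual quantisation kernel, and the two example tensors are hypothetical.

```python
# Toy symmetric round-to-nearest quantiser — for intuition only; this is
# not SWAN's (or MLX's) actual quantisation kernel.
def quantise_roundtrip(values, bits):
    levels = 2 ** (bits - 1) - 1                # 7 positive levels at 4-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Two hypothetical weight tensors: one smooth, one with a single outlier.
smooth = [0.1 * i for i in range(-10, 11)]      # values in [-1.0, 1.0]
spiky = smooth[:-1] + [8.0]                     # same values plus one outlier

errors = {}
for name, tensor in (("smooth", smooth), ("spiky", spiky)):
    errors[name] = mse(tensor, quantise_roundtrip(tensor, bits=4))
    print(f"{name}: 4-bit reconstruction MSE = {errors[name]:.4f}")
```

The outlier tensor pays a far larger 4-bit error, which is exactly the kind of variation a sensitivity score is meant to surface.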

Perplexity Results — Qwen3-30B-A3B (Sparse MoE, 30.53B Total / 3.3B Active)

Qwen3-30B-A3B is a sparse Mixture-of-Experts model with 128 experts, of which 8 are active per token. The vast majority of parameters sit in expert FFN layers that are only activated for specific inputs. This architecture presents a fundamentally different quantisation challenge: most tensors are expert weights with similar structure, and the model is far more sensitive to uniform compression because each expert carries specialised knowledge.

| Condition | Avg Bits | Size (GB) | PPL | ΔPPL vs BF16 | Peak Mem (GB) |
|---|---|---|---|---|---|
| BF16 baseline | 16.00 | 56.87 | 8.728 | — | 59.18 |
| Uniform 4-bit | 4.00 | 15.11 | 9.629 | +10.3% | 17.42 |
| SWAN v1 (fixed norm) | 4.51 | 16.02 | 9.180 | +5.2% | 18.33 |
| SWAN v2 (adaptive norm) | 4.69 | 16.65 | 8.976 | +2.8% | 18.97 |
| SWAN v3 (hybrid) | ~4.5 | 16.20 | 9.041 | +3.6% | 18.51 |
| SWAN v2-opt | ~4.6 | 16.42 | 9.057 | +3.8% | 18.73 |

Key observation: On MoE, SWAN v2 is the clear winner at +2.8% degradation — less than a third of uniform 4-bit’s +10.3%. The v3 hybrid approach actually hurts MoE performance because its norm-based fallback pathway doesn’t account for the structural regularity of expert layers. This architecture-dependent behaviour is what led to the v4 auto-detection system, which routes dense models to v3-opt and MoE models to v2.

The uniform 4-bit result here is striking: 10.3% degradation means uniform quantisation destroys meaningful expert specialisation. Sensitivity-aware allocation preserves it by protecting the most critical expert weights.

Perplexity Results — GLM-4.7-Flash (Dense MoE, 31B Parameters)

GLM-4.7-Flash is a dense MoE architecture. Its perplexity results require careful interpretation due to the outlier sequence phenomenon documented in our previous article.

| Condition | Standard PPL | Median PPL | Size (GB) |
|---|---|---|---|
| BF16 | 11.344 | 8.706 | 58.2 |
| FP16 | 11.208 | 8.609 | 55.8 |
| Uniform 4-bit | 11.532 | — | 14.8 |
| SWAN v3 | 9.930* | 9.084 | 15.9 |

*Standard PPL is misleading. The 9.930 figure appears to show quantisation improving over the BF16 baseline. This is an artifact caused by 5 outlier sequences (PPL values of 25,000–81,000) that dominate the mean in BF16 but are partially suppressed by quantisation noise. Median PPL tells the true story: SWAN v3 degrades by 4.3% (9.084 vs 8.706), which is consistent with 3.66× compression. See article 18 for the full analysis.
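The mechanics are easy to reproduce with toy numbers (the per-sequence PPLs below are hypothetical, not the actual GLM-4.7-Flash data): corpus-style perplexity aggregates log-losses across all sequences, so a handful of extreme outliers drag the aggregate up while the median barely moves.

```python
import math
import statistics

# Toy illustration: 495 well-behaved sequences plus 5 pathological outliers,
# mimicking the outlier-sequence phenomenon described above.
ppls = [8.7] * 495 + [25_000, 40_000, 55_000, 70_000, 81_000]

# Corpus-style PPL is exp(mean log-loss), so the outliers inflate it;
# the median of per-sequence PPLs is robust to them.
corpus_ppl = math.exp(statistics.mean(math.log(p) for p in ppls))
median_ppl = statistics.median(ppls)

print(f"corpus-style PPL: {corpus_ppl:.2f}")  # pulled up by 5 outliers
print(f"median PPL:       {median_ppl:.2f}")  # unmoved at 8.70
```

Quantisation noise that partially suppresses the outliers can therefore lower the corpus-style number without actually improving the model, which is why the median is the trustworthy column in the table above.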

Academic Benchmark Results — Qwen3-8B

Perplexity measures prediction quality on raw text, but academic benchmarks test whether the model can still reason, recall knowledge, and follow instructions after compression. We ran ARC-Challenge (25-shot) and HellaSwag (10-shot) via mlx_lm.evaluate on Qwen3-8B across three conditions.

| Condition | Size | ARC-C (25-shot) | HellaSwag (10-shot) | PPL |
|---|---|---|---|---|
| BF16 | 15.26 GB | 44.62% | 60.04% | 9.727 |
| SWAN v3-opt | 6.05 GB | 43.43% (-1.2%) | 58.16% (-1.9%) | 10.097 |
| Uniform 4-bit | 4.05 GB | 42.83% (-1.8%) | 58.14% (-1.9%) | 10.249 |

Key finding: The ordering BF16 > SWAN v3-opt > Uniform 4-bit is consistent across all three metrics. SWAN beats uniform 4-bit on ARC-Challenge (43.43% vs 42.83%), demonstrating that sensitivity-aware bit allocation preserves reasoning capability better than uniform compression. On HellaSwag, both quantised variants land within 0.02 percentage points of each other, suggesting this benchmark is less sensitive to per-tensor precision allocation.

The practical implication: SWAN v3-opt gives you better reasoning quality than uniform 4-bit at a 49% size premium (6.05 GB vs 4.05 GB). Whether that trade-off is worth it depends on your deployment constraints.

Bit Allocation Analysis

SWAN’s core mechanism is assigning different bit widths to different tensors based on their quantisation sensitivity. How those bits actually get distributed reveals how each model’s architecture interacts with the allocation algorithm.

Qwen3-8B Bit Distribution

| Bit Width | Uniform | SWAN v1 | SWAN v2 |
|---|---|---|---|
| 2-bit | 0% | 0% | 2.2% |
| 4-bit | 100% | 81.7% | 73.5% |
| 8-bit | 0% | 3.1% | 6.0% |
| 16-bit | 0% | 15.2% | 18.3% |

The v1→v2 progression shows more aggressive differentiation: v2’s adaptive norm scoring identifies a small tail of tensors (2.2%) that can tolerate aggressive 2-bit quantisation, freeing bits that are reallocated to the 18.3% of tensors kept at full 16-bit precision. The majority of tensors remain at 4-bit, but the edges of the distribution widen.

Qwen3-30B-A3B Bit Distribution

| Bit Width | Uniform | SWAN v1 | SWAN v2 |
|---|---|---|---|
| 2-bit | 0% | 0% | 16.6% |
| 4-bit | 100% | 97.2% | 71.9% |
| 8-bit | 0% | 0.8% | 6.3% |
| 16-bit | 0% | 2.1% | 5.3% |

The MoE distribution is dramatically different. SWAN v2 pushes 16.6% of tensors to 2-bit — nearly all of these are expert FFN layers that the sensitivity analysis identifies as redundant or highly resilient to quantisation noise. This aggressive low-bit allocation for non-critical experts is precisely why SWAN v2 excels on MoE: it recognises that not all experts are equally important and compresses the least sensitive ones aggressively.

Meanwhile, only 5.3% of tensors get 16-bit treatment (vs 18.3% in Qwen3-8B), reflecting the fact that MoE models have fewer critical shared layers relative to their total parameter count.
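The allocation behaviour described above can be sketched as a rank-based policy: order tensors by sensitivity, then assign bit widths by quantile. This is a minimal illustration, assuming per-tensor scores already exist; the cut-off fractions below are hypothetical, not SWAN's actual thresholds.

```python
# Minimal sketch of rank-based bit allocation. The quantile cut-offs
# (q2, q8, q16) are hypothetical illustrations, not SWAN's real thresholds.
def allocate_bits(scores, q2=0.05, q8=0.85, q16=0.95):
    ranked = sorted(scores, key=scores.get)   # least → most sensitive
    n = len(ranked)
    bits = {}
    for i, name in enumerate(ranked):
        frac = i / n
        if frac < q2:
            bits[name] = 2                    # most resilient tail
        elif frac < q8:
            bits[name] = 4                    # the bulk stays at 4-bit
        elif frac < q16:
            bits[name] = 8
        else:
            bits[name] = 16                   # most sensitive tensors
    return bits

# 100 synthetic tensors with evenly spread sensitivity scores.
scores = {f"tensor.{i}": i / 100 for i in range(100)}
bits = allocate_bits(scores)
counts = {b: list(bits.values()).count(b) for b in (2, 4, 8, 16)}
print(counts)  # {2: 5, 4: 80, 8: 10, 16: 5}
```

Under such a policy, moving the cut-offs per architecture is exactly what produces the different distributions seen in the two tables: wider 2-bit tails for MoE expert layers, larger 16-bit heads for dense models.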

Sensitivity Score Discrimination

SWAN’s effectiveness depends on there being meaningful variation in how sensitive different tensors are to quantisation. If all tensors respond similarly, there’s nothing for sensitivity-aware allocation to exploit. The sensitivity score span — the range between the most and least sensitive tensors — quantifies this opportunity.

| Model | Sensitivity Score Span | Max 2-bit % | Max 16-bit % |
|---|---|---|---|
| Qwen3-8B | 1.173 | 16.8% | 20.3% |
| Qwen3-30B-A3B | 1.378 | 41.3% | 7.3% |
| GLM-4.7-Flash | 0.073 | 0.0% | 3.3% |

Key insight: GLM-4.7-Flash tensors are 18× more homogeneous than Qwen tensors. With a sensitivity score span of just 0.073, there is almost no variation for SWAN to exploit. The algorithm correctly responds by making minimal allocation changes (0% at 2-bit, only 3.3% at 16-bit), effectively falling back to near-uniform quantisation.

This explains why SWAN’s benefit on GLM-4.7-Flash is modest compared to its impact on Qwen models. It is not a limitation of the algorithm — it is a property of the model. When the architecture produces uniformly sensitive tensors, the correct allocation is uniform. SWAN correctly identifies this and acts accordingly.

Conversely, Qwen3-30B-A3B’s high span (1.378) with 41.3% of tensors eligible for 2-bit quantisation explains why SWAN achieves its most dramatic improvement on this model — cutting uniform 4-bit’s 10.3% degradation down to 2.8%.
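The span metric itself is simply the maximum minus the minimum of the per-tensor scores. A toy sketch with hypothetical score values (chosen only to contrast a heterogeneous model with a near-homogeneous one):

```python
# Hypothetical per-tensor sensitivity scores for two toy models;
# span = max score - min score, the quantity tabulated above.
model_scores = {
    "heterogeneous (Qwen-like)": [0.10, 0.45, 0.80, 1.25],
    "homogeneous (GLM-like)": [0.50, 0.52, 0.55, 0.57],
}

spans = {}
for name, scores in model_scores.items():
    spans[name] = max(scores) - min(scores)
    print(f"{name}: span = {spans[name]:.3f}")
```

A wide span leaves room for aggressive reallocation; a narrow one means near-uniform quantisation is already close to optimal.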

Compression Efficiency

Raw compression ratio and raw perplexity degradation do not tell the full story individually. Compression efficiency — defined as the ratio of perplexity degradation to compression ratio — captures how much quality you sacrifice per unit of compression.

| Model | Condition | Compression | PPL Degradation | Efficiency |
|---|---|---|---|---|
| Qwen3-8B | Uniform 4-bit | 3.77× | +5.4% | 1.43 |
| Qwen3-8B | SWAN v1 | 2.59× | +4.1% | 1.58 |
| Qwen3-30B-A3B | Uniform 4-bit | 3.76× | +10.3% | 2.74 |
| Qwen3-30B-A3B | SWAN v2 | 3.42× | +2.8% | 0.82 |

Lower efficiency scores are better — they mean less quality loss per unit of compression. SWAN v2 on the MoE model achieves 0.82, meaning each unit of compression costs less than 1% perplexity. Uniform 4-bit on the same model scores 2.74, paying nearly 3% perplexity per compression unit. This 3.3× efficiency advantage is SWAN’s strongest result across the entire evaluation.

For Qwen3-8B, uniform 4-bit actually has a better efficiency ratio (1.43 vs 1.58) because it achieves much higher compression (3.77× vs 2.59×). SWAN v1 preserves more quality but at a lower compression ratio, so the per-unit cost is slightly higher. This highlights that efficiency is not the only metric that matters — the absolute quality level and the absolute size also factor into deployment decisions.
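The metric is simple enough to recompute directly from the table:

```python
# Compression efficiency as defined above: PPL degradation (%) divided by
# compression ratio; lower is better. Inputs are taken from the table.
def efficiency(ppl_degradation_pct: float, compression_ratio: float) -> float:
    return ppl_degradation_pct / compression_ratio

rows = [
    ("Qwen3-8B / Uniform 4-bit", 5.4, 3.77),
    ("Qwen3-8B / SWAN v1", 4.1, 2.59),
    ("Qwen3-30B-A3B / Uniform 4-bit", 10.3, 3.76),
    ("Qwen3-30B-A3B / SWAN v2", 2.8, 3.42),
]
for name, dppl, ratio in rows:
    print(f"{name}: {efficiency(dppl, ratio):.2f}")
```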

Processing Time

SWAN adds an analysis phase before quantisation, where it profiles every tensor to compute sensitivity scores. The conversion step itself is fast; the analysis is the bottleneck.

| Phase | Qwen3-8B | Qwen3-30B-A3B |
|---|---|---|
| Tensors analysed | 399 | 18,867 |
| v3 analysis time | ~3.3 min | ~44 min |
| Conversion time | ~6 s | ~15 s |

The 47× increase in tensor count from Qwen3-8B to Qwen3-30B-A3B (399 to 18,867) drives a roughly 13× increase in analysis time. The sub-linear scaling reflects that many MoE expert tensors share similar shapes and can be analysed in batches. Conversion itself is negligible — under 15 seconds even for a 30B parameter model.

For deployment workflows, the analysis phase is a one-time cost. Once the sensitivity manifest is generated, subsequent conversions with different threshold settings (e.g., sweeping v3-opt parameters) require only the conversion step.
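That workflow amounts to a cache around the analysis step. The sketch below illustrates the shape of it; the file name, function names, and dummy scores are hypothetical placeholders, not SWAN's actual API.

```python
import json
import pathlib
import tempfile

def analyse_sensitivity(model_path: str) -> dict:
    """Stand-in for the slow per-tensor profiling pass (minutes to ~1 hour).
    The real analysis emits a per-tensor sensitivity manifest."""
    return {"tensor.0": 0.12, "tensor.1": 0.87}   # dummy scores

def get_manifest(model_path: str, cache: pathlib.Path) -> dict:
    """Run the analysis once, then reuse the cached manifest for every
    subsequent conversion (e.g. a v3-opt threshold sweep)."""
    if cache.exists():
        return json.loads(cache.read_text())      # fast path: reuse
    manifest = analyse_sensitivity(model_path)    # slow path: analyse once
    cache.write_text(json.dumps(manifest))
    return manifest

cache = pathlib.Path(tempfile.mkdtemp()) / "sensitivity_manifest.json"
first = get_manifest("Qwen3-8B", cache)    # triggers the analysis
second = get_manifest("Qwen3-8B", cache)   # served from the cached manifest
```

With the manifest cached, each threshold sweep only pays the seconds-long conversion step, not the minutes-to-hour analysis.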

Key Takeaways

Across four models, three architectures, and multiple SWAN versions, several patterns emerge consistently:

- For dense transformers, SWAN v3-opt is the best configuration: +3.8% perplexity at 6.05 GB, with benchmark accuracy above uniform 4-bit.
- For sparse MoE, SWAN v2 wins decisively: +2.8% degradation versus +10.3% for uniform 4-bit, driven by aggressive 2-bit allocation for resilient expert tensors.
- The best SWAN version is architecture-dependent, which is what motivated v4's auto-detection (dense → v3-opt, MoE → v2).
- Sensitivity score span predicts how much SWAN can help; on near-homogeneous models such as GLM-4.7-Flash, the algorithm correctly degrades to near-uniform allocation.
- Standard (mean) perplexity can mislead when a few outlier sequences dominate; median perplexity is the more robust summary.
- Sensitivity analysis is a one-time cost (minutes to under an hour); conversion itself takes seconds.

Reproducibility

Every result in this article is fully reproducible:

Hardware: Apple M2 Ultra, 192 GB unified memory.
Software: Python 3.12.0, MLX 0.30.3, mlx_lm 0.30.4, PyTorch 2.6.0.

Need help quantizing your models for production?

Black Sheep AI brings deep expertise in model quantization, mixed-precision optimisation, and production AI systems. Whether you're deploying on edge hardware or optimising cloud costs, we can help you find the right compression strategy for your architecture.
