We expected to find massive redundancy in a 256-expert MoE model. Instead we found that removing even 5% of the least-used experts causes catastrophic quality collapse: a 13.2x perplexity blow-up that no amount of memory savings could justify.
The Experiment
We profiled Qwen3.5-35B-A3B (256 experts per layer, 8 active per token) across 100 diverse prompts spanning code, math, creative writing, factual Q&A, multilingual text, and instruction following. Then we zeroed out the bottom N% least-activated experts and measured WikiText-2 perplexity.
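The profiling step can be sketched as follows. This is a minimal numpy illustration, not the actual harness: the router logits here are random stand-ins (in the real experiment they come from forward hooks on each MoE layer across the 100 prompts), and the token count is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)
N_EXPERTS, TOP_K = 256, 8

# Stand-in router logits, one row per token. In the real experiment these
# would be collected via forward hooks on each MoE layer during the
# 100-prompt sweep; the values and token count here are illustrative.
logits = rng.normal(size=(10_000, N_EXPERTS))

# Count how often each expert lands in the per-token top-k selection.
topk = np.argsort(logits, axis=1)[:, -TOP_K:]
counts = np.bincount(topk.ravel(), minlength=N_EXPERTS)

# "Prune" the bottom 5% least-activated experts by building a keep-mask;
# zeroing those experts' outputs in the model has the same effect.
n_prune = round(0.05 * N_EXPERTS)            # 13 experts per layer
pruned = np.argsort(counts)[:n_prune]
keep = np.ones(N_EXPERTS, dtype=bool)
keep[pruned] = False
```

With the real activation counts, `keep` is the mask applied per layer before re-measuring WikiText-2 perplexity.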
The results were unambiguous:
| Experts Pruned | WikiText-2 PPL | Degradation |
|---|---|---|
| 0% (baseline) | 6.580 | — |
| 5% | 86.906 | 13.2x |
| 10% | 15,894 | 2,416x |
| 25% | 906,762 | 137,807x |
Removing just 5% of the least-used experts — roughly 13 out of 256 per layer — causes a 13.2x perplexity blow-up. At 10%, the model is effectively destroyed. At 25%, it produces random noise.
Why It Fails So Catastrophically
The router was trained with all experts available. Expert specialization is distributed — even “rare” experts handle specific token patterns that no other expert covers. Zeroing them creates holes in the model’s coverage that cannot be compensated for by the remaining experts.
The softmax routing distributes probability mass across the selected experts. When a pruned expert is selected, the router still assigns it non-zero probability but receives a zero vector back, so that share of the weighted sum is simply lost. This is not a graceful degradation; it is a structural failure in the forward pass.
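This failure mode can be seen in a toy top-k forward pass. Everything here is a stand-in: expert outputs are random vectors rather than FFNs, and dimensions are arbitrary. The point is that a pruned expert keeps its softmax weight while contributing nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, D = 256, 8, 16

# Random stand-ins: one output vector per expert for a single token,
# plus one router logit per expert. In a real MoE, experts are FFNs.
expert_out = rng.normal(size=(N_EXPERTS, D))
router_logits = rng.normal(size=N_EXPERTS)

def moe_forward(zeroed=frozenset()):
    idx = np.argsort(router_logits)[-TOP_K:]     # top-k selected experts
    w = np.exp(router_logits[idx])
    w /= w.sum()                                 # softmax over the top-k
    out = np.zeros(D)
    for e, weight in zip(idx, w):
        # A pruned expert keeps its routing weight but returns a zero
        # vector, so that slice of probability mass is simply lost.
        contrib = np.zeros(D) if e in zeroed else expert_out[e]
        out += weight * contrib
    return out

full = moe_forward()
top_expert = int(np.argsort(router_logits)[-1])
broken = moe_forward(zeroed={top_expert})
gap = np.linalg.norm(full - broken)              # the hole in the output
```

Nothing renormalizes the remaining weights or redirects the lost mass; the layer output is silently shifted for every token that routes to the missing expert.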
Unlike dense layers where removing a neuron slightly reduces capacity, removing an MoE expert eliminates an entire specialization pathway. The router has no mechanism to redirect traffic to an equivalent expert because no equivalent exists.
The Activation Profile Tells a Different Story
With only 5 prompts, 30% of experts appeared dead. At 100 prompts, only 0.6% were truly unused. Expert usage is prompt-dependent — a “dead” expert on English text may be critical for code or math.
| Metric | Value |
|---|---|
| Gini coefficient | 0.53 (moderate concentration) |
| Entropy ratio | 0.91 (fairly uniform) |
| Top-10 expert share | 20.4% (vs 3.9% if perfectly uniform) |
| Dead experts (100 prompts) | 1.5 of 256 on average (0.6%) |
The activation profile reveals moderate concentration but very few truly unused experts. The routing is uneven, not sparse — most experts are used, just at different frequencies. This distinction is critical: uneven usage is an opportunity for tiered quantization, not a license to prune.
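The concentration metrics in the table above are all computable from per-expert activation counts. A sketch, using synthetic counts skewed so the top 10 experts take roughly 20% of traffic (the real profile comes from the model, not from this toy):

```python
import numpy as np

def gini(counts):
    """Gini coefficient: 0 = perfectly uniform usage, ->1 = one expert takes all."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

def entropy_ratio(counts):
    """Shannon entropy of the usage distribution over its uniform maximum."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(counts)))

# Synthetic counts: 10 "hot" experts, 246 lightly used ones.
counts = np.concatenate([np.full(10, 50.0), np.full(246, 8.0)])
top10_share = np.sort(counts)[::-1][:10].sum() / counts.sum()
```

A uniform profile gives Gini 0 and entropy ratio 1.0; values like 0.53 and 0.91 land squarely in "uneven but not sparse" territory.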
Implications
Expert pruning is not a viable compression strategy for MoE models. The correct approach is mixed-precision quantization — keeping all experts but at different bit-widths proportional to their importance.
MINT’s tiered quantization (DynaMINT) preserves all experts at appropriate precision levels, achieving only +0.5% PPL degradation, versus the +1,220% degradation from removing just 5% of experts. The difference is more than three orders of magnitude in quality impact.
- Pruning 5% of experts: +1,220% PPL (catastrophic)
- DynaMINT tiered quantization: +0.5% PPL (production-ready)
- The lesson: compress precision, not coverage
What This Means for the Field
Papers claiming “18% safely prunable” likely tested with too few prompts. Our results show that diverse prompts dramatically reduce apparent redundancy. At 5 prompts, 30% of experts appear dead. At 100 prompts, that number drops to 0.6%. The sample-size effect alone can make pruning appear viable when it is not.
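The sample-size effect itself is easy to reproduce with a toy simulation. The routing distribution below is a synthetic Zipf-like profile and the per-prompt token count is an assumption; only the qualitative behavior (apparent dead experts shrink as prompts accumulate) carries over to the real model.

```python
import numpy as np

rng = np.random.default_rng(7)
N_EXPERTS, TOP_K, TOKENS_PER_PROMPT = 256, 8, 50

# Synthetic Zipf-like routing distribution: a few hot experts, a long
# tail of rarely-selected ones. The real profile comes from the model.
p = 1.0 / np.arange(1, N_EXPERTS + 1) ** 1.5
p /= p.sum()

def apparent_dead(checkpoints):
    """Cumulatively sample prompts; report unseen-expert counts at checkpoints."""
    seen = np.zeros(N_EXPERTS, dtype=bool)
    done, dead = 0, {}
    for n in checkpoints:
        for _ in range(n - done):
            picks = rng.choice(N_EXPERTS, size=TOKENS_PER_PROMPT * TOP_K, p=p)
            seen[picks] = True
        done = n
        dead[n] = int(N_EXPERTS - seen.sum())
    return dead

dead = apparent_dead([5, 100])
```

Because `seen` only accumulates, the apparent dead-expert count is monotonically non-increasing in the number of prompts, which is exactly why small profiling sets overstate prunability.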
The MoE routing mechanism relies on expert diversity, not individual expert quality. Each expert covers a different region of the input space. Removing any expert — even a rarely-used one — creates a gap that the remaining experts cannot fill. The router was never trained to compensate for missing experts.
The path forward for MoE compression is quantization-aware tiering: use activation profiles to inform bit-width allocation, not to decide which experts to remove.
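A minimal sketch of what activation-informed bit-width allocation looks like. The tier cut-points and bit-widths below are illustrative placeholders, not MINT's actual configuration, and the activation counts are synthetic:

```python
import numpy as np

def allocate_bits(counts, tiers=((0.10, 8), (0.50, 6), (1.00, 4))):
    """Map activation frequency to bit-width: hotter experts get more bits.

    `tiers` is (cumulative_fraction, bits): here the top 10% of experts
    get 8 bits, the next 40% get 6, the rest get 4. These cut-points are
    illustrative, not MINT's real schedule.
    """
    order = np.argsort(counts)[::-1]          # most-used experts first
    bits = np.empty(len(counts), dtype=int)
    start = 0
    for frac, width in tiers:
        end = int(round(frac * len(counts)))
        bits[order[start:end]] = width
        start = end
    return bits

# Synthetic activation counts for one layer's 256 experts.
counts = np.random.default_rng(1).poisson(40, size=256)
bits = allocate_bits(counts)
```

Every expert keeps a nonzero bit-width, so no routing pathway is destroyed; the average of roughly 5.2 bits/weight is where the memory savings come from.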
Full results and methodology available at huggingface.co/spaces/baa-ai/MoE-Expert-Quantization. All evaluations use WikiText-2 test split, sequence length 2048, 128 sequences, seed 42.
Read the Full Paper
The complete MoE expert quantization paper, including expert activation profiling, per-expert mixed-bit allocation, and evaluation across 512-expert architectures, is available on our Hugging Face Space:
MoE Expert Quantization: Per-Expert Mixed-Precision for Mixture-of-Experts Models — Full Paper
huggingface.co/spaces/baa-ai/MoE-Expert-Quantization

Licensed under CC BY-NC-ND 4.0