We expected to find massive redundancy in a 256-expert MoE model. Instead we found that removing even 5% of the least-used experts causes catastrophic quality collapse: a 13.2x perplexity blow-up that no amount of memory savings could justify.
The Experiment
We profiled Qwen3.5-35B-A3B (256 experts per layer, 8 active per token) across 100 diverse prompts spanning code, math, creative writing, factual Q&A, multilingual text, and instruction following. Then we zeroed out the bottom N% least-activated experts and measured WikiText-2 perplexity.
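The profiling step can be sketched as follows. This is a minimal numpy illustration, not the actual harness: the router logits here are random stand-ins (in the real experiment they come from forward hooks on each MoE layer across the 100 prompts), and the token count is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)
N_EXPERTS, TOP_K = 256, 8

# Stand-in router logits, one row per token. In the real experiment these
# would be collected via forward hooks on each MoE layer during the
# 100-prompt sweep; the values and token count here are illustrative.
logits = rng.normal(size=(10_000, N_EXPERTS))

# Count how often each expert lands in the per-token top-k selection.
topk = np.argsort(logits, axis=1)[:, -TOP_K:]
counts = np.bincount(topk.ravel(), minlength=N_EXPERTS)

# "Prune" the bottom 5% least-activated experts by building a keep-mask;
# zeroing those experts' outputs in the model has the same effect.
n_prune = round(0.05 * N_EXPERTS)            # 13 experts per layer
pruned = np.argsort(counts)[:n_prune]
keep = np.ones(N_EXPERTS, dtype=bool)
keep[pruned] = False
```

With the real activation counts, `keep` is the mask applied per layer before re-measuring WikiText-2 perplexity.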
The results were unambiguous:
| Experts Pruned | WikiText-2 PPL | Degradation |
|---|---|---|
| 0% (baseline) | 6.580 | — |
| 5% | 86.906 | 13.2x |
| 10% | 15,894 | 2,416x |
| 25% | 906,762 | 137,807x |
Removing just 5% of the least-used experts — roughly 13 out of 256 per layer — causes a 13.2x perplexity blow-up. At 10%, the model is effectively destroyed. At 25%, it produces random noise.
Why It Fails So Catastrophically
The router was trained with all experts available. Expert specialization is distributed — even “rare” experts handle specific token patterns that no other expert covers. Zeroing them creates holes in the model’s coverage that cannot be compensated for by the remaining experts.
The softmax routing distributes probability mass across the selected experts. When a pruned expert is selected, the router still assigns it non-zero probability but receives a zero vector back, so that share of the weighted sum is simply lost. This is not a graceful degradation; it is a structural failure in the forward pass.
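This failure mode can be seen in a toy top-k forward pass. Everything here is a stand-in: expert outputs are random vectors rather than FFNs, and dimensions are arbitrary. The point is that a pruned expert keeps its softmax weight while contributing nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, D = 256, 8, 16

# Random stand-ins: one output vector per expert for a single token,
# plus one router logit per expert. In a real MoE, experts are FFNs.
expert_out = rng.normal(size=(N_EXPERTS, D))
router_logits = rng.normal(size=N_EXPERTS)

def moe_forward(zeroed=frozenset()):
    idx = np.argsort(router_logits)[-TOP_K:]     # top-k selected experts
    w = np.exp(router_logits[idx])
    w /= w.sum()                                 # softmax over the top-k
    out = np.zeros(D)
    for e, weight in zip(idx, w):
        # A pruned expert keeps its routing weight but returns a zero
        # vector, so that slice of probability mass is simply lost.
        contrib = np.zeros(D) if e in zeroed else expert_out[e]
        out += weight * contrib
    return out

full = moe_forward()
top_expert = int(np.argsort(router_logits)[-1])
broken = moe_forward(zeroed={top_expert})
gap = np.linalg.norm(full - broken)              # the hole in the output
```

Nothing renormalizes the remaining weights or redirects the lost mass; the layer output is silently shifted for every token that routes to the missing expert.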
Unlike dense layers where removing a neuron slightly reduces capacity, removing an MoE expert eliminates an entire specialization pathway. The router has no mechanism to redirect traffic to an equivalent expert because no equivalent exists.
The Activation Profile Tells a Different Story
With only 5 prompts, 30% of experts appeared dead. At 100 prompts, only 0.6% were truly unused. Expert usage is prompt-dependent — a “dead” expert on English text may be critical for code or math.
| Metric | Value |
|---|---|
| Gini coefficient | 0.53 (moderate concentration) |
| Entropy ratio | 0.91 (fairly uniform) |
| Top-10 expert share | 20.4% (vs 3.9% if perfectly uniform) |
| Dead experts (100 prompts) | 1.5 of 256 on average (0.6%) |
The activation profile reveals moderate concentration but very few truly unused experts. The routing is uneven, not sparse — most experts are used, just at different frequencies. This distinction is critical: uneven usage is an opportunity for tiered quantization, not a license to prune.
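The concentration metrics in the table above are all computable from per-expert activation counts. A sketch, using synthetic counts skewed so the top 10 experts take roughly 20% of traffic (the real profile comes from the model, not from this toy):

```python
import numpy as np

def gini(counts):
    """Gini coefficient: 0 = perfectly uniform usage, ->1 = one expert takes all."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

def entropy_ratio(counts):
    """Shannon entropy of the usage distribution over its uniform maximum."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(counts)))

# Synthetic counts: 10 "hot" experts, 246 lightly used ones.
counts = np.concatenate([np.full(10, 50.0), np.full(246, 8.0)])
top10_share = np.sort(counts)[::-1][:10].sum() / counts.sum()
```

A uniform profile gives Gini 0 and entropy ratio 1.0; values like 0.53 and 0.91 land squarely in "uneven but not sparse" territory.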
Implications
Expert pruning is not a viable compression strategy for MoE models. The correct approach is mixed-precision quantization — keeping all experts but at different bit-widths proportional to their importance.
MINT’s tiered quantization (DynaMINT) preserves all experts at appropriate precision levels, achieving only +0.5% PPL degradation, versus the +1,220% degradation from removing just 5% of experts. The difference is more than three orders of magnitude in quality impact.
- Pruning 5% of experts: +1,220% PPL (catastrophic)
- DynaMINT tiered quantization: +0.5% PPL (production-ready)
- The lesson: compress precision, not coverage
What This Means for the Field
Papers claiming “18% safely prunable” likely tested with too few prompts. Our results show that diverse prompts dramatically reduce apparent redundancy. At 5 prompts, 30% of experts appear dead. At 100 prompts, that number drops to 0.6%. The sample-size effect alone can make pruning appear viable when it is not.
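The sample-size effect itself is easy to reproduce with a toy simulation. The routing distribution below is a synthetic Zipf-like profile and the per-prompt token count is an assumption; only the qualitative behavior (apparent dead experts shrink as prompts accumulate) carries over to the real model.

```python
import numpy as np

rng = np.random.default_rng(7)
N_EXPERTS, TOP_K, TOKENS_PER_PROMPT = 256, 8, 50

# Synthetic Zipf-like routing distribution: a few hot experts, a long
# tail of rarely-selected ones. The real profile comes from the model.
p = 1.0 / np.arange(1, N_EXPERTS + 1) ** 1.5
p /= p.sum()

def apparent_dead(checkpoints):
    """Cumulatively sample prompts; report unseen-expert counts at checkpoints."""
    seen = np.zeros(N_EXPERTS, dtype=bool)
    done, dead = 0, {}
    for n in checkpoints:
        for _ in range(n - done):
            picks = rng.choice(N_EXPERTS, size=TOKENS_PER_PROMPT * TOP_K, p=p)
            seen[picks] = True
        done = n
        dead[n] = int(N_EXPERTS - seen.sum())
    return dead

dead = apparent_dead([5, 100])
```

Because `seen` only accumulates, the apparent dead-expert count is monotonically non-increasing in the number of prompts, which is exactly why small profiling sets overstate prunability.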
The MoE routing mechanism relies on expert diversity, not individual expert quality. Each expert covers a different region of the input space. Removing any expert — even a rarely-used one — creates a gap that the remaining experts cannot fill. The router was never trained to compensate for missing experts.
The path forward for MoE compression is quantization-aware tiering: use activation profiles to inform bit-width allocation, not to decide which experts to remove.
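A minimal sketch of what activation-informed bit-width allocation looks like. The tier cut-points and bit-widths below are illustrative placeholders, not MINT's actual configuration, and the activation counts are synthetic:

```python
import numpy as np

def allocate_bits(counts, tiers=((0.10, 8), (0.50, 6), (1.00, 4))):
    """Map activation frequency to bit-width: hotter experts get more bits.

    `tiers` is (cumulative_fraction, bits): here the top 10% of experts
    get 8 bits, the next 40% get 6, the rest get 4. These cut-points are
    illustrative, not MINT's real schedule.
    """
    order = np.argsort(counts)[::-1]          # most-used experts first
    bits = np.empty(len(counts), dtype=int)
    start = 0
    for frac, width in tiers:
        end = int(round(frac * len(counts)))
        bits[order[start:end]] = width
        start = end
    return bits

# Synthetic activation counts for one layer's 256 experts.
counts = np.random.default_rng(1).poisson(40, size=256)
bits = allocate_bits(counts)
```

Every expert keeps a nonzero bit-width, so no routing pathway is destroyed; the average of roughly 5.2 bits/weight is where the memory savings come from.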
Full results and methodology available at huggingface.co/spaces/baa-ai/MoE-Expert-Quantization. All evaluations use WikiText-2 test split, sequence length 2048, 128 sequences, seed 42.
Read the Full Paper
The complete MoE expert quantization paper, including expert activation profiling, per-expert mixed-bit allocation, and evaluation across 512-expert architectures, is available on our Hugging Face Space:
MoE Expert Quantization: Per-Expert Mixed-Precision for Mixture-of-Experts Models — Full Paper
huggingface.co/spaces/baa-ai/MoE-Expert-Quantization

Licensed under CC BY-NC-ND 4.0