Why You Can't Prune MoE Experts, Even the Ones Nobody Uses
MoE Research

March 2026 · Black Sheep AI Research

We expected to find massive redundancy in a 256-expert MoE model. Instead, we found that removing even 5% of the least-used experts causes catastrophic quality collapse: a 13.2x perplexity blow-up from which no amount of fine-tuning can recover.

The Result

We profiled expert activation across 100 diverse prompts on a 256-expert MoE model, then zeroed out the least-activated experts and measured perplexity. The results were unambiguous:

Experts Pruned    PPL        Degradation
0% (baseline)     6.580      -
5%                86.906     13.2x
10%               15,894     2,416x
25%               906,762    137,807x

Removing just 5% of the least-used experts, roughly 13 out of 256 per layer, causes a 13.2x perplexity blow-up. At 10%, the model is effectively destroyed. At 25%, it produces random noise.
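The selection step is straightforward, which is part of why pruning looks tempting. A minimal sketch, assuming a per-layer activation histogram has already been collected during profiling (the counts and the helper name below are illustrative, not our actual pipeline):

```python
import numpy as np

NUM_EXPERTS = 256
rng = np.random.default_rng(42)

# Hypothetical activation counts per expert, i.e. how often the router
# selected each expert across the profiling prompts (illustrative values).
activation_counts = rng.integers(0, 10_000, size=NUM_EXPERTS).astype(float)

def least_used_experts(counts, fraction):
    """Indices of the least-activated experts for a given pruning fraction."""
    k = round(len(counts) * fraction)
    return np.argsort(counts)[:k]

pruned = least_used_experts(activation_counts, 0.05)
print(f"pruning {len(pruned)} of {NUM_EXPERTS} experts")  # 13 experts at 5%
```

At 5%, this selects 13 of 256 experts per layer; zeroing those 13 weight tensors is all it takes to produce the 13.2x blow-up in the table above.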

Why It Fails So Catastrophically

The router was trained with all experts available. Expert specialization is distributed: even “rare” experts handle specific token patterns that no other expert covers. Zeroing them creates holes in the model’s coverage that the remaining experts cannot compensate for.

The softmax routing distributes probability mass across experts. When a selected expert has been zeroed out, its contribution to the weighted sum vanishes, yet the router still assigns it non-zero probability and receives nothing back. This is not graceful degradation; it is a structural failure in the forward pass.
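The failure mode is visible in a toy forward pass. The dimensions, logits, and expert outputs below are illustrative, and the sketch follows the convention of not renormalizing routing weights over the surviving experts:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 4

# Toy router: softmax over logits, then route the token to the top-k experts.
logits = rng.normal(size=NUM_EXPERTS)
probs = np.exp(logits) / np.exp(logits).sum()
top_k = np.argsort(probs)[-TOP_K:]

# Toy per-expert outputs for a single token.
expert_out = rng.normal(size=(NUM_EXPERTS, DIM))

def moe_output(pruned=frozenset()):
    """Weighted sum of the selected experts' outputs.

    A pruned expert contributes zeros, but the router still spends its
    routing weight on it: no renormalization, no redirection.
    """
    out = np.zeros(DIM)
    for e in top_k:
        contribution = np.zeros(DIM) if e in pruned else expert_out[e]
        out += probs[e] * contribution
    return out

full = moe_output()
degraded = moe_output(pruned={int(top_k[0])})  # prune one selected expert
print(np.linalg.norm(full), np.linalg.norm(degraded))
```

Pruning any selected expert silently deletes its term from the sum; prune every selected expert and the layer emits exactly zero for that token.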

Unlike dense layers where removing a neuron slightly reduces capacity, removing an MoE expert eliminates an entire specialization pathway. The router has no mechanism to redirect traffic to an equivalent expert because no equivalent exists.

Why Pruning Looks Viable But Isn’t

With only 5 prompts, 30% of experts appeared dead. At 100 prompts, only 0.6% were truly unused. Expert usage is prompt-dependent — a “dead” expert on English text may be critical for code or math. The routing is uneven, not sparse: most experts are used, just at different frequencies. This distinction is critical — uneven usage is an opportunity for tiered quantization, not a license to prune.
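The sample-size effect is easy to reproduce in simulation. The sketch below assumes a long-tailed power-law routing distribution (an assumption for illustration, not our measured profile) and counts how many experts merely *appear* dead at each prompt budget:

```python
import numpy as np

NUM_EXPERTS, TOKENS_PER_PROMPT = 256, 512
rng = np.random.default_rng(42)

# Assumed long-tailed routing distribution: every expert has non-zero
# probability, but frequencies differ by orders of magnitude.
freq = np.arange(1, NUM_EXPERTS + 1, dtype=float) ** -1.5
freq /= freq.sum()

# Draw one large token stream; smaller prompt budgets are prefixes of it,
# so the apparent dead-expert count is monotone by construction.
stream = rng.choice(NUM_EXPERTS, size=100 * TOKENS_PER_PROMPT, p=freq)

def apparent_dead_fraction(num_prompts):
    """Fraction of experts never activated in the first num_prompts prompts."""
    seen = np.unique(stream[: num_prompts * TOKENS_PER_PROMPT])
    return 1.0 - len(seen) / NUM_EXPERTS

dead = {n: apparent_dead_fraction(n) for n in (5, 25, 100)}
print(dead)  # the dead-expert fraction shrinks as prompt diversity grows
```

Even this crude model reproduces the qualitative pattern: a large fraction of the tail looks dead at 5 prompts and nearly all of it surfaces by 100.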

Implications

Expert pruning is not a viable compression strategy for MoE models. The correct approach is tiered quantization: keep all experts, but store each at a bit-width proportional to its importance.

RAM’s tiered quantization (DynaMINT) preserves all experts at appropriate precision levels, incurring only +0.5% PPL degradation, compared to the 1,220% degradation from removing just 5% of experts. The gap in quality impact is more than three orders of magnitude.

What This Means for the Field

Papers claiming “18% safely prunable” likely tested with too few prompts. Our results show that diverse prompts dramatically reduce apparent redundancy. At 5 prompts, 30% of experts appear dead. At 100 prompts, that number drops to 0.6%. The sample-size effect alone can make pruning appear viable when it is not.

The MoE routing mechanism relies on expert diversity, not individual expert quality. Each expert covers a different region of the input space. Removing any expert, even a rarely-used one, creates a gap that the remaining experts cannot fill. The router was never trained to compensate for missing experts.

The path forward for MoE compression is quantization-aware tiering: use activation profiles to inform bit-width allocation, not to decide which experts to remove.
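A sketch of the tiering idea, not the DynaMINT allocation itself: the quantile cut-points and bit-widths below are placeholders, and the activation frequencies are randomly generated stand-ins for a real profiling pass.

```python
import numpy as np

# Hypothetical activation frequencies per expert (stand-in for a profile).
rng = np.random.default_rng(7)
freq = rng.random(256)

def tiered_bits(freq, tiers=((0.5, 2), (0.9, 4), (1.0, 8))):
    """Assign a bit-width to every expert by activation-frequency quantile.

    With the placeholder tiers: experts below the 50th percentile get
    2 bits, the next 40% get 4 bits, and the most-used 10% keep 8 bits.
    No expert is removed.
    """
    ranks = np.argsort(np.argsort(freq)) / (len(freq) - 1)  # 0 = rarest
    bits = np.empty(len(freq), dtype=int)
    prev = 0.0
    for quantile, width in tiers:
        mask = (ranks >= prev) & (ranks <= quantile)
        bits[mask] = width
        prev = quantile
    return bits

bits = tiered_bits(freq)
print(f"average bits/expert: {bits.mean():.2f}")
```

The activation profile decides *how many bits* each expert gets, never *whether* it exists: the rarest expert still participates in the forward pass, just at lower precision.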


Full results and methodology available at huggingface.co/spaces/baa-ai/MoE-Expert-Quantization. All evaluations use WikiText-2 test split, sequence length 2048, 128 sequences, seed 42.

Read the Full Paper

The complete MoE expert quantization paper, including expert activation profiling, per-expert mixed-bit allocation, and evaluation across 512-expert architectures, is available on our HuggingFace:

MoE Expert Quantization: Per-Expert Mixed-Precision for Mixture-of-Experts Models, Full Paper

huggingface.co/spaces/baa-ai/MoE-Expert-Quantization

Licensed under CC BY-NC-ND 4.0

Continue Reading

Related research from our team.

What 100 Prompts Reveal About Expert Routing in 256-Expert MoE Models
MoE Research

Profiling expert activation across 100 diverse prompts reveals the dramatic sample-size effect on apparent redundancy.

Mean Perplexity Is Lying to You
RAM Research

Standard perplexity evaluation produces misleading quality orderings. Here’s why, and what to report instead.

View All Research