We pruned 18% of experts from a 512-expert MoE model based on activation profiling. The model passed all automated quality tests. Then we looked at the actual responses.
Introduction
Mixture-of-Experts models contain vast numbers of expert sub-networks — Qwen3.5-397B has 512 experts per layer across 60 layers, totaling 30,720 expert instances. Our activation profiling (Article 1) showed that 18.1% of these experts (5,562 instances) were activated less than 0.05% of the time across 150 calibration prompts.
The obvious move: prune them. Zero their weights, mask them out of the router, reclaim the capacity. The technique worked perfectly in every automated test. It failed silently and dangerously on real-world tasks.
This article documents the pruning technique, the automated test results that gave us false confidence, and the quality regressions we discovered through manual inspection.
The Pruning Technique
Router Masking
MoE models use a router (gate) network to select which experts process each token. The router produces logits for all experts, applies softmax to get probabilities, then selects the top-k experts.
To prune an expert, we don't need to remove it from the model — we just need to ensure it's never selected. We do this by setting the router's gate weight row for that expert to an extreme negative value:
# gate_weight shape: [num_experts, hidden_dim] = [512, 4096]
# For each pruned expert index:
gate_weight[expert_idx, :] = -1e9
After softmax, a logit of -1e9 produces a probability of approximately 0. The expert is never selected by top-k routing. This has zero runtime cost — the softmax and top-k computation happens regardless, and the pruned expert's probability is simply negligible.
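To make the masking concrete, here is a minimal NumPy sketch of top-k routing with masked logits. Everything here is a stand-in: `k=8`, the 512-expert width, and the pruned indices are illustrative values, not the production router.

```python
import numpy as np

def top_k_route(logits, k=8):
    """Stand-in router: softmax over expert logits, then pick top-k."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    selected = set(np.argsort(probs)[-k:].tolist())
    return selected, probs

num_experts = 512
rng = np.random.default_rng(0)
logits = rng.normal(size=num_experts).astype(np.float32)

pruned = [3, 17, 200]   # hypothetical pruned expert indices
logits[pruned] = -1e9   # the extreme-negative mask described above

selected, probs = top_k_route(logits, k=8)
assert not selected & set(pruned)            # pruned experts never chosen
assert all(probs[e] == 0.0 for e in pruned)  # exp(-1e9) underflows to exactly 0
```

In float32 the masked logits underflow to a probability of exactly zero after softmax, so top-k can never select them regardless of what the other 509 experts score.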
Weight Zeroing
For completeness, we also zero the expert's weights. This doesn't affect inference (the expert is never selected) but ensures the model file doesn't contain stale parameters:
# For fused gate_up_proj: shape [512, dim1, dim2]
expert_weights[expert_idx] = 0
Post-Processing Implementation
We bake the router masks into the converted model's safetensors files:
import re

from safetensors import safe_open
from safetensors.torch import save_file

# Matches router gate keys, e.g. "model.layers.12.mlp.gate.weight";
# the exact key naming depends on the conversion.
gate_pattern = re.compile(r"layers\.(\d+)\.mlp\.gate\.weight")

# model_shards: paths to the converted safetensors shards
# pruned_experts: {layer_idx: [expert indices]} from the profiling run
for shard_file in model_shards:
    tensors = {}
    with safe_open(str(shard_file), framework="pt") as f:
        for key in f.keys():
            tensors[key] = f.get_tensor(key)
    # Find gate weight tensors and mask pruned experts
    for key in tensors:
        match = gate_pattern.search(key)
        if match:
            layer_idx = int(match.group(1))
            for expert_idx in pruned_experts[layer_idx]:
                tensors[key][expert_idx, :] = -1e9
    save_file(tensors, str(shard_file))
Critical note: We discovered that using mx.save_safetensors for this post-processing step corrupts bfloat16 data (see Article 4). The safetensors.torch.save_file function must be used instead.
What We Pruned
From the Qwen3.5-397B activation profiling (150 calibration prompts, 7.8M activation records):
| Layer Range | Experts Pruned | Percentage | Notes |
|---|---|---|---|
| Layers 0-5 | 352 | 11.5% | High pruning in earliest layers |
| Layer 0 alone | 166 | 32.4% | Highest single-layer pruning |
| Layers 6-18 | 57 | 1.9% | Very few prunable in early-mid layers |
| Layers 19-35 | 1,611 | 18.8% | Moderate pruning |
| Layers 36-59 | 3,542 | 23.2% | Heaviest pruning in late layers |
| Total | 5,562 | 18.1% | Out of 30,720 |
The pruning is not uniform across layers. Late layers (36-59) have the most pruned experts because those layers develop more specialized, less frequently activated experts.
Automated Test Results: False Confidence
We ran the standard 15-prompt collapse test suite against three variants:
| Variant | Pass | Warn | Fail | Avg Time |
|---|---|---|---|---|
| ExpertQuant (with pruning) | 15 | 0 | 0 | 47.3s |
| Baseline (no pruning) | 15 | 1 | 0 | 8.2s |
| Hybrid (with pruning) | 15 | 0 | 0 | 7.7s |
All variants pass. The pruned model actually has fewer warnings than the unpruned baseline. By every automated metric, the pruning is safe.
The collapse tests check for:
- Minimum response length
- Keyword presence (e.g., "class" for a coding prompt)
- Repetition detection
- Empty responses
These are necessary but deeply insufficient quality checks.
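To see how easily a broken response clears this bar, here is a hypothetical sketch of such a check (not our actual test harness). The contaminated Spanish translation shown under Failure 1 below passes it cleanly:

```python
def collapse_check(response, min_len=10, keyword=None):
    """Hypothetical stand-in for the collapse tests listed above."""
    if not response.strip():
        return "fail"                      # empty response
    if len(response) < min_len:
        return "fail"                      # too short
    if keyword is not None and keyword not in response:
        return "fail"                      # expected keyword missing
    words = response.split()
    if len(words) >= 8 and len(set(words)) < len(words) // 4:
        return "warn"                      # crude repetition heuristic
    return "pass"

# No keyword check for this prompt, and min_len is trivially cleared:
bad = "El clima está hermoso hoy y我 plane to go hiking in the mountains."
assert collapse_check(bad) == "pass"
```

Nothing in the check inspects language consistency, factual content, or whether the response actually answers the prompt.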
What Manual Inspection Revealed
When we compared actual responses between the pruned (Hybrid) and unpruned (baseline) models, three categories of failure emerged.
Failure 1: Language Contamination
Prompt: "Translate the following to Spanish: 'The weather is beautiful today and I plan to go hiking in the mountains.'"
Unpruned model:
El clima está hermoso hoy y planeo ir de caminata a las montañas.
Pruned model:
El clima está hermoso hoy y我 plane to go hiking in the mountains.
The pruned model produces a Chinese character (我, "I") mid-sentence and fails to translate the second half. This passed the collapse test because the test only checks for minimum length (10 characters) and has no keyword check for this prompt.
Failure 2: Factual Hallucination
Prompt: "Explain what a monad is in functional programming, with a practical example in Haskell or Python."
Unpruned model:
This is a great question about functional programming concepts. Let me explain monads clearly with practical examples. A monad is a design pattern in functional programming that allows for sequencing...
Pruned model:
Actually, there no such term as "monad" in functional programming. You may be confusing it with one of the following similar concepts...
The pruned model confidently denies the existence of a fundamental programming concept. This passed the collapse test because there's no keyword check for this prompt (check: None).
Failure 3: Abbreviated Reasoning
Prompt: "A farmer needs to cross a river with a fox, a chicken, and a bag of corn..."
Unpruned model:
1,119 characters — complete step-by-step solution with all river crossings
Pruned model:
206 characters — mostly restates the problem, does not provide the solution
The pruned model's response is one-fifth the length and doesn't actually solve the puzzle. It passed because the test only checks for the keyword "chicken" (present in the restated problem) and minimum length of 80 characters.
Root Cause Analysis
The Calibration Bias Problem
Our activation profiling used 150 prompts across six domains — but all prompts were primarily in English. The domain distribution:
- coding: 25 prompts
- math: 25 prompts
- reasoning: 25 prompts
- agent_tooluse: 25 prompts
- english_general: 25 prompts
- multilingual: 25 prompts (but relatively simple translations)
Experts that specialize in:
- Non-English language generation (Spanish, Chinese, Japanese, etc.)
- Niche domain knowledge (functional programming, category theory, formal logic)
- Extended reasoning chains (multi-step problem solving)
...may fire infrequently in this calibration set but be essential when needed.
Activation Frequency ≠ Importance
This is the core insight: an expert that activates 0.03% of the time isn't 30x less important than one that activates 1% of the time. It might handle a rare but critical capability. When it's pruned, that capability vanishes entirely.
Consider the analogy: a fire extinguisher is "activated" 0% of the time during normal operations. That doesn't make it safe to remove.
The Long-Tail Knowledge Problem
A 512-expert model distributes knowledge across many specialists. The long tail of rarely-activated experts collectively contains substantial world knowledge — functional programming concepts, grammar rules for specific languages, domain-specific reasoning patterns.
Pruning 18.1% of experts removes this long tail entirely. The model retains its common-case capabilities (English coding, basic math, general knowledge) but loses rare capabilities disproportionately.
Quantifying the Damage
We ran academic benchmarks on the pruned model (thinking mode disabled, which affects absolute scores but not the relative comparison):
| Benchmark | Pruned Model | Note |
|---|---|---|
| MMLU-Pro | 43.6% (41/94) | History: 0%, Biology: 20%, Law: 22% |
| ARC-Challenge | 74.7% (112/150) | |
| GSM8K | 38.0% (19/50) | |
| HumanEval | 20.0% (6/30) | |
Without a baseline run on the unpruned model using the same benchmark protocol, we can't attribute all degradation to pruning (Qwen3.5 with thinking disabled has lower baseline scores). However, the category breakdown in MMLU-Pro is suggestive: History at 0% and Biology at 20% point to knowledge-domain experts being pruned.
Recommendations
1. Lower the Pruning Threshold
Our threshold was activation frequency < 0.05%. A more conservative threshold of < 0.01% (catching only near-dead experts that essentially never fire across the 150 calibration prompts) would prune far fewer experts, with higher confidence that nothing important is removed.
For Qwen3.5-397B, using 0.01% would reduce the pruned set from 5,562 to approximately 800-1,200 experts (rough estimate based on activation frequency distribution).
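A simple threshold sweep over the profiled frequencies makes the tradeoff visible. The sketch below uses synthetic long-tailed frequencies, since the real per-expert profile lives in the Article 1 data:

```python
import numpy as np

def prunable_at(activation_freq, threshold):
    """Count experts whose activation frequency falls below threshold."""
    return int((activation_freq < threshold).sum())

# Synthetic long-tailed frequencies for 30,720 expert instances;
# illustrative only, not the real profiling output.
rng = np.random.default_rng(42)
freq = rng.pareto(0.6, size=30_720) * 1e-4

for thr in (0.05e-2, 0.01e-2):  # 0.05% and 0.01% as fractions
    n = prunable_at(freq, thr)
    print(f"threshold {thr:.2%}: {n} experts ({n / freq.size:.1%})")
```

Running the same sweep against the real activation records would give the actual pruned-set size at each candidate threshold before committing to one.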
2. Use Diverse Calibration Data
The calibration set must reflect the model's intended use cases. For a multilingual model:
- Include prompts in all target languages
- Include specialized domain content (formal logic, niche science, legal terminology)
- Weight the calibration toward the tail, not the head, of the usage distribution
3. Test Beyond Collapse Detection
Automated collapse tests (keyword matching, length thresholds, repetition detection) are a necessary but insufficient quality bar. At minimum:
- Run perplexity evaluation on diverse held-out text
- Include domain-specific probes (multilingual translation, niche knowledge questions)
- Perform manual spot-checks on a curated set of adversarial prompts targeting expected weaknesses
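As one example of a domain-specific probe, a script-contamination check for translation outputs takes only a few lines. This is a hypothetical helper, not part of our suite; it would have flagged Failure 1 automatically:

```python
import unicodedata

def contains_cjk(text):
    # CJK code points carry "CJK" in their Unicode character names.
    return any("CJK" in unicodedata.name(ch, "") for ch in text)

def probe_translation(response):
    """Fail a Spanish-translation probe if CJK characters leak in."""
    return "fail" if contains_cjk(response) else "pass"

good = "El clima está hermoso hoy y planeo ir de caminata a las montañas."
bad = "El clima está hermoso hoy y我 plane to go hiking in the mountains."
assert probe_translation(good) == "pass"
assert probe_translation(bad) == "fail"
```

Accented Latin characters pass untouched, so the probe is safe for any Latin-script target language.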
4. Consider Soft Pruning
Instead of hard pruning (setting gate weights to -1e9), consider soft down-weighting: reduce the gate weight magnitude by a factor (e.g., multiply by 0.1) so the expert is rarely selected but still available for tokens that strongly need it. This preserves the long-tail safety net.
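A minimal sketch of this idea, using NumPy arrays in place of the real tensors; the 0.1 factor is the assumed starting point mentioned above, not a tuned value:

```python
import numpy as np

def soft_prune(gate_weight, expert_indices, factor=0.1):
    """Scale (rather than mask) the gate rows for low-activity experts.

    The expert's routing logits shrink, so it rarely wins top-k, but a
    token that strongly activates it can still route there.
    """
    out = gate_weight.copy()      # leave the original tensor intact
    out[expert_indices] *= factor
    return out

# gate_weight shape mirrors the [num_experts, hidden_dim] layout above
gate = np.ones((512, 4096), dtype=np.float32)
soft = soft_prune(gate, [3, 17, 200])
```

Unlike the -1e9 mask, this keeps the long-tail safety net: an expert that is genuinely the best match for a token can still clear the top-k cutoff.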
5. Validate Pruning Incrementally
Instead of pruning 18% at once, prune in stages: 2%, 5%, 10%, 15%, measuring quality at each stage. This identifies the threshold where degradation appears and allows for a data-driven pruning budget.
Conclusion
Expert pruning in MoE models is appealing — it's zero-cost at inference and can simplify the routing landscape. But our experience shows that activation frequency during calibration is a poor proxy for expert importance. The 18.1% of experts we pruned from Qwen3.5-397B included specialists for multilingual generation, niche domain knowledge, and extended reasoning — capabilities that are rare in calibration but essential in production.
The most dangerous aspect is that standard automated quality tests don't catch these regressions. The pruned model passed 15/15 collapse tests with 0 warnings. Only manual inspection revealed the Chinese characters in Spanish translations and the confident denial that monads exist.
If you're considering expert pruning for MoE models, our recommendation is: prune very conservatively (only truly dead experts), use diverse calibration data, and invest in quality evaluation that goes well beyond collapse detection.
Next in this series: MLX Quantization on Apple Silicon — Engineering Pitfalls and Workarounds — the bugs we found in MLX's quantization pipeline and how to work around them.