We pruned 18% of experts from a 512-expert MoE model based on activation profiling. The model passed all automated quality tests. Then we looked at the actual responses.
Introduction
Mixture-of-Experts models contain vast numbers of expert sub-networks — Qwen3.5-397B has 512 experts per layer across 60 layers, totaling 30,720 expert instances. Our activation profiling (Article 1) showed that 18.1% of these experts (5,562 instances) were activated less than 0.05% of the time across 150 calibration prompts.
The obvious move: prune them. Zero their weights, mask them out of the router, reclaim the capacity. The technique worked perfectly in every automated test. It failed silently and dangerously on real-world tasks.
This article documents the pruning technique, the automated test results that gave us false confidence, and the quality regressions we discovered through manual inspection.
The Pruning Technique
Router Masking
MoE models use a router (gate) network to select which experts process each token. The router produces logits for all experts, applies softmax to get probabilities, then selects the top-k experts.
To prune an expert, we don't need to remove it from the model — we just need to ensure it's never selected. We do this by setting the router's gate weight row for that expert to an extreme negative value:
# gate_weight shape: [num_experts, hidden_dim] = [512, 4096]
# For each pruned expert index:
gate_weight[expert_idx, :] = -1e9
After softmax, a logit of -1e9 produces a probability of approximately 0. The expert is never selected by top-k routing. This has zero runtime cost — the softmax and top-k computation happens regardless, and the pruned expert's probability is simply negligible.
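To make the masking concrete, here is a minimal NumPy sketch of top-k routing with masked logits. Everything here is a stand-in: `k=8`, the 512-expert width, and the pruned indices are illustrative values, not the production router.

```python
import numpy as np

def top_k_route(logits, k=8):
    """Stand-in router: softmax over expert logits, then pick top-k."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    selected = set(np.argsort(probs)[-k:].tolist())
    return selected, probs

num_experts = 512
rng = np.random.default_rng(0)
logits = rng.normal(size=num_experts).astype(np.float32)

pruned = [3, 17, 200]   # hypothetical pruned expert indices
logits[pruned] = -1e9   # the extreme-negative mask described above

selected, probs = top_k_route(logits, k=8)
assert not selected & set(pruned)            # pruned experts never chosen
assert all(probs[e] == 0.0 for e in pruned)  # exp(-1e9) underflows to exactly 0
```

In float32 the masked logits underflow to a probability of exactly zero after softmax, so top-k can never select them regardless of what the other 509 experts score.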
Weight Zeroing
For completeness, we also zero the expert's weights. This doesn't affect inference (the expert is never selected) but ensures the model file doesn't contain stale parameters:
# For fused gate_up_proj: shape [512, dim1, dim2]
expert_weights[expert_idx] = 0
Post-Processing Implementation
We bake the router masks into the converted model's safetensors files:
import re

from safetensors import safe_open
from safetensors.torch import save_file

# Matches router gate keys, e.g. "model.layers.12.mlp.gate.weight";
# the exact key naming depends on the conversion.
gate_pattern = re.compile(r"layers\.(\d+)\.mlp\.gate\.weight")

# model_shards: paths to the converted safetensors shards
# pruned_experts: {layer_idx: [expert indices]} from the profiling run
for shard_file in model_shards:
    tensors = {}
    with safe_open(str(shard_file), framework="pt") as f:
        for key in f.keys():
            tensors[key] = f.get_tensor(key)
    # Find gate weight tensors and mask pruned experts
    for key in tensors:
        match = gate_pattern.search(key)
        if match:
            layer_idx = int(match.group(1))
            for expert_idx in pruned_experts[layer_idx]:
                tensors[key][expert_idx, :] = -1e9
    save_file(tensors, str(shard_file))
Critical note: We discovered that using mx.save_safetensors for this post-processing step corrupts bfloat16 data (see Article 4). The safetensors.torch.save_file function must be used instead.
What We Pruned
From the Qwen3.5-397B activation profiling (150 calibration prompts, 7.8M activation records):
| Layer Range | Experts Pruned | Percentage | Notes |
|---|---|---|---|
| Layers 0-5 | 352 | 11.5% | High pruning in earliest layers |
| Layer 0 alone | 166 | 32.4% | Highest single-layer pruning |
| Layers 6-18 | 57 | 1.9% | Very few prunable in early-mid layers |
| Layers 19-35 | 1,611 | 18.8% | Moderate pruning |
| Layers 36-59 | 3,542 | 23.2% | Heaviest pruning in late layers |
| Total | 5,562 | 18.1% | Out of 30,720 |
The pruning is not uniform across layers. Late layers (36-59) have the most pruned experts because those layers develop more specialized, less frequently activated experts.
Automated Test Results: False Confidence
We ran the standard 15-prompt collapse test suite against three variants:
| Variant | Pass | Warn | Fail | Avg Time |
|---|---|---|---|---|
| ExpertQuant (with pruning) | 15 | 0 | 0 | 47.3s |
| Baseline (no pruning) | 15 | 1 | 0 | 8.2s |
| Hybrid (with pruning) | 15 | 0 | 0 | 7.7s |
All variants pass. The pruned model actually has fewer warnings than the unpruned baseline. By every automated metric, the pruning is safe.
The collapse tests check for:
- Minimum response length
- Keyword presence (e.g., "class" for a coding prompt)
- Repetition detection
- Empty responses
These are necessary but deeply insufficient quality checks.
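To see how easily a broken response clears this bar, here is a hypothetical sketch of such a check (not our actual test harness). The contaminated Spanish translation shown under Failure 1 below passes it cleanly:

```python
def collapse_check(response, min_len=10, keyword=None):
    """Hypothetical stand-in for the collapse tests listed above."""
    if not response.strip():
        return "fail"                      # empty response
    if len(response) < min_len:
        return "fail"                      # too short
    if keyword is not None and keyword not in response:
        return "fail"                      # expected keyword missing
    words = response.split()
    if len(words) >= 8 and len(set(words)) < len(words) // 4:
        return "warn"                      # crude repetition heuristic
    return "pass"

# No keyword check for this prompt, and min_len is trivially cleared:
bad = "El clima está hermoso hoy y我 plane to go hiking in the mountains."
assert collapse_check(bad) == "pass"
```

Nothing in the check inspects language consistency, factual content, or whether the response actually answers the prompt.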
What Manual Inspection Revealed
When we compared actual responses between the pruned (Hybrid) and unpruned (baseline) models, three categories of failure emerged.
Failure 1: Language Contamination
Prompt: "Translate the following to Spanish: 'The weather is beautiful today and I plan to go hiking in the mountains.'"
Unpruned model:
El clima está hermoso hoy y planeo ir de caminata a las montañas.
Pruned model:
El clima está hermoso hoy y我 plane to go hiking in the mountains.
The pruned model produces a Chinese character (我, "I") mid-sentence and fails to translate the second half. This passed the collapse test because the test only checks for minimum length (10 characters) and has no keyword check for this prompt.
Failure 2: Factual Hallucination
Prompt: "Explain what a monad is in functional programming, with a practical example in Haskell or Python."
Unpruned model:
This is a great question about functional programming concepts. Let me explain monads clearly with practical examples. A monad is a design pattern in functional programming that allows for sequencing...
Pruned model:
Actually, there no such term as "monad" in functional programming. You may be confusing it with one of the following similar concepts...
The pruned model confidently denies the existence of a fundamental programming concept. This passed the collapse test because there's no keyword check for this prompt (check: None).
Failure 3: Abbreviated Reasoning
Prompt: "A farmer needs to cross a river with a fox, a chicken, and a bag of corn..."
Unpruned model:
1,119 characters — complete step-by-step solution with all river crossings
Pruned model:
206 characters — mostly restates the problem, does not provide the solution
The pruned model's response is one-fifth the length and doesn't actually solve the puzzle. It passed because the test only checks for the keyword "chicken" (present in the restated problem) and minimum length of 80 characters.
Root Cause Analysis
The Calibration Bias Problem
Our activation profiling used 150 prompts across six domains — but all prompts were primarily in English. The domain distribution:
- coding: 25 prompts
- math: 25 prompts
- reasoning: 25 prompts
- agent_tooluse: 25 prompts
- english_general: 25 prompts
- multilingual: 25 prompts (but relatively simple translations)
Experts that specialize in:
- Non-English language generation (Spanish, Chinese, Japanese, etc.)
- Niche domain knowledge (functional programming, category theory, formal logic)
- Extended reasoning chains (multi-step problem solving)
...may fire infrequently in this calibration set but be essential when needed.
Activation Frequency ≠ Importance
This is the core insight: an expert that activates 0.03% of the time isn't 30x less important than one that activates 1% of the time. It might handle a rare but critical capability. When it's pruned, that capability vanishes entirely.
Consider the analogy: a fire extinguisher is "activated" 0% of the time during normal operations. That doesn't make it safe to remove.
The Long-Tail Knowledge Problem
A 512-expert model distributes knowledge across many specialists. The long tail of rarely-activated experts collectively contains substantial world knowledge — functional programming concepts, grammar rules for specific languages, domain-specific reasoning patterns.
Pruning 18.1% of experts removes this long tail entirely. The model retains its common-case capabilities (English coding, basic math, general knowledge) but loses rare capabilities disproportionately.
Quantifying the Damage
We ran academic benchmarks on the pruned model (thinking mode disabled, which affects absolute scores but not the relative comparison):
| Benchmark | Pruned Model | Note |
|---|---|---|
| MMLU-Pro | 43.6% (41/94) | History: 0%, Biology: 20%, Law: 22% |
| ARC-Challenge | 74.7% (112/150) | |
| GSM8K | 38.0% (19/50) | |
| HumanEval | 20.0% (6/30) | |
Without a baseline run on the unpruned model using the same benchmark protocol, we can't attribute all degradation to pruning (Qwen3.5 with thinking disabled has lower baseline scores). However, the category breakdown in MMLU-Pro is suggestive: History at 0% and Biology at 20% point to knowledge-domain experts being pruned.
Recommendations
1. Lower the Pruning Threshold
Our threshold was activation frequency < 0.05%. A more conservative threshold of < 0.01% (catching only near-dead experts that essentially never fire across the 150 calibration prompts) would prune far fewer experts, with higher confidence that nothing important is removed.
For Qwen3.5-397B, using 0.01% would reduce the pruned set from 5,562 to approximately 800-1,200 experts (rough estimate based on activation frequency distribution).
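A simple threshold sweep over the profiled frequencies makes the tradeoff visible. The sketch below uses synthetic long-tailed frequencies, since the real per-expert profile lives in the Article 1 data:

```python
import numpy as np

def prunable_at(activation_freq, threshold):
    """Count experts whose activation frequency falls below threshold."""
    return int((activation_freq < threshold).sum())

# Synthetic long-tailed frequencies for 30,720 expert instances;
# illustrative only, not the real profiling output.
rng = np.random.default_rng(42)
freq = rng.pareto(0.6, size=30_720) * 1e-4

for thr in (0.05e-2, 0.01e-2):  # 0.05% and 0.01% as fractions
    n = prunable_at(freq, thr)
    print(f"threshold {thr:.2%}: {n} experts ({n / freq.size:.1%})")
```

Running the same sweep against the real activation records would give the actual pruned-set size at each candidate threshold before committing to one.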
2. Use Diverse Calibration Data
The calibration set must reflect the model's intended use cases. For a multilingual model:
- Include prompts in all target languages
- Include specialized domain content (formal logic, niche science, legal terminology)
- Weight the calibration toward the tail, not the head, of the usage distribution
3. Test Beyond Collapse Detection
Automated collapse tests (keyword matching, length thresholds, repetition detection) are a necessary but insufficient quality bar. At minimum:
- Run perplexity evaluation on diverse held-out text
- Include domain-specific probes (multilingual translation, niche knowledge questions)
- Perform manual spot-checks on a curated set of adversarial prompts targeting expected weaknesses
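As one example of a domain-specific probe, a script-contamination check for translation outputs takes only a few lines. This is a hypothetical helper, not part of our suite; it would have flagged Failure 1 automatically:

```python
import unicodedata

def contains_cjk(text):
    # CJK code points carry "CJK" in their Unicode character names.
    return any("CJK" in unicodedata.name(ch, "") for ch in text)

def probe_translation(response):
    """Fail a Spanish-translation probe if CJK characters leak in."""
    return "fail" if contains_cjk(response) else "pass"

good = "El clima está hermoso hoy y planeo ir de caminata a las montañas."
bad = "El clima está hermoso hoy y我 plane to go hiking in the mountains."
assert probe_translation(good) == "pass"
assert probe_translation(bad) == "fail"
```

Accented Latin characters pass untouched, so the probe is safe for any Latin-script target language.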
4. Consider Soft Pruning
Instead of hard pruning (setting gate weights to -1e9), consider soft down-weighting: reduce the gate weight magnitude by a factor (e.g., multiply by 0.1) so the expert is rarely selected but still available for tokens that strongly need it. This preserves the long-tail safety net.
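A minimal sketch of this idea, using NumPy arrays in place of the real tensors; the 0.1 factor is the assumed starting point mentioned above, not a tuned value:

```python
import numpy as np

def soft_prune(gate_weight, expert_indices, factor=0.1):
    """Scale (rather than mask) the gate rows for low-activity experts.

    The expert's routing logits shrink, so it rarely wins top-k, but a
    token that strongly activates it can still route there.
    """
    out = gate_weight.copy()      # leave the original tensor intact
    out[expert_indices] *= factor
    return out

# gate_weight shape mirrors the [num_experts, hidden_dim] layout above
gate = np.ones((512, 4096), dtype=np.float32)
soft = soft_prune(gate, [3, 17, 200])
```

Unlike the -1e9 mask, this keeps the long-tail safety net: an expert that is genuinely the best match for a token can still clear the top-k cutoff.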
5. Validate Pruning Incrementally
Instead of pruning 18% at once, prune in stages: 2%, 5%, 10%, 15%, measuring quality at each stage. This identifies the threshold where degradation appears and allows for a data-driven pruning budget.
Conclusion
Expert pruning in MoE models is appealing — it's zero-cost at inference and can simplify the routing landscape. But our experience shows that activation frequency during calibration is a poor proxy for expert importance. The 18.1% of experts we pruned from Qwen3.5-397B included specialists for multilingual generation, niche domain knowledge, and extended reasoning — capabilities that are rare in calibration but essential in production.
The most dangerous aspect is that standard automated quality tests don't catch these regressions. The pruned model passed 15/15 collapse tests with 0 warnings. Only manual inspection revealed the Chinese characters in Spanish translations and the confident denial that monads exist.
If you're considering expert pruning for MoE models, our recommendation is: prune very conservatively (only truly dead experts), use diverse calibration data, and invest in quality evaluation that goes well beyond collapse detection.
Next in this series: MLX Quantization on Apple Silicon — Engineering Pitfalls and Workarounds — the bugs we found in MLX's quantization pipeline and how to work around them.