How we profiled 42,752 expert instances across two large MoE models (12,032 in one, 30,720 in the other), what the activation patterns revealed, and why the numbers challenge common assumptions about expert redundancy.
Introduction
MoE models like Qwen3-235B and Qwen3.5-397B route each token to a small subset of "expert" sub-networks. The idea behind expert-aware quantization is straightforward: figure out which experts matter most, give them more bits, and compress the rest. Simple enough in theory.
The catch is that you need to actually know which experts matter. We built a profiling system that captures activation patterns across domains, and the results surprised us.
This article covers the methodology, the patterns we found, and why the numbers look so different between 128-expert and 512-expert architectures.
The Models
We profiled two models:
| Property | Qwen3-235B-A22B | Qwen3.5-397B-A17B |
|---|---|---|
| Total parameters | 235B | 397B |
| Active per token | ~22B | ~17B |
| Layers | 94 | 60 |
| Experts per layer | 128 | 512 |
| Top-k routing | 8 | 10 |
| Shared expert | No | Yes |
| Routing | Softmax top-k | Softmax top-k |
| Total expert instances | 12,032 | 30,720 |
Profiling Methodology
Calibration Data
We ran 150 calibration prompts through each model, tagged by domain:
- Coding: algorithm implementation, debugging, code review
- Math: calculus, proofs, word problems
- Reasoning: logic puzzles, comparisons, analysis
- Agent/Tool-use: structured responses, tool calling patterns
- English general: creative writing, knowledge questions, conversation
- Multilingual: translation, non-English generation
Each prompt triggered a full forward pass. We captured every expert activation: which expert was selected, its softmax routing score, and the domain tag of the input.
Hooking Into the Router
For MLX models, the router lives inside each MoE layer's __call__ method. We needed to intercept the routing decision after softmax but before expert dispatch. The obvious approach of patching __call__ on the instance with types.MethodType doesn't work: when you write layer(x), Python looks up __call__ on the object's type (its __class__), not on the instance, so the instance-level patch is silently ignored. We used __class__ swapping instead:
# This DOES NOT work for __call__:
# layer.__call__ = types.MethodType(hooked_call, layer)

# This DOES work: swap the class itself.
original_class = layer.__class__
HookedClass = type(
    f"Hooked{original_class.__name__}",
    (original_class,),
    {"__call__": hooked_call},
)
layer.__class__ = HookedClass
This gives us the routing weights, selected expert indices, and softmax probabilities at every layer for every token.
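The difference between the two patching approaches can be reproduced without MLX. The following is a minimal sketch on a toy class; ToyLayer and install_hook are hypothetical names used only for illustration, and the hook just logs arguments and outputs rather than real routing tensors.

```python
import types


class ToyLayer:
    """Hypothetical stand-in for an MoE layer (not an MLX class)."""

    def __call__(self, x):
        return x * 2


def install_hook(layer, log):
    """Swap layer.__class__ for a subclass whose __call__ logs each call."""
    original_class = layer.__class__

    def hooked_call(self, *args, **kwargs):
        out = original_class.__call__(self, *args, **kwargs)
        log.append((args, out))
        return out

    layer.__class__ = type(
        f"Hooked{original_class.__name__}",
        (original_class,),
        {"__call__": hooked_call},
    )
    return original_class  # keep this to undo the swap


layer = ToyLayer()

# Instance-level patch: stored on the instance, but layer(3) never consults it.
layer.__call__ = types.MethodType(lambda self, x: -1, layer)
ignored_result = layer(3)  # still runs ToyLayer.__call__
del layer.__call__

# Class swap: layer(3) now goes through hooked_call.
log = []
original_class = install_hook(layer, log)
hooked_result = layer(3)

# Undo the swap once profiling is done.
layer.__class__ = original_class
```

Because the hooked class subclasses the original, isinstance checks elsewhere in the model keep passing, and restoring the saved class removes the hook cleanly.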
What We Captured
For each expert activation, we recorded:
- Layer index and expert index
- Softmax routing score (the probability assigned by the router)
- Domain tag of the input prompt
- Token position (prefill vs generation)
For Qwen3.5-397B, this produced 7.8 million activation records from 150 calibration samples. The whole run took about 11 minutes.
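As an illustration of the record shape described above, here is a minimal sketch of one activation record and a per-expert roll-up. The ActivationRecord and summarize names and the field layout are assumptions made for this example, not the tracker's actual schema.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivationRecord:
    """One routed (token, expert) pair. Hypothetical layout."""
    layer: int
    expert: int
    score: float      # softmax routing probability
    domain: str       # domain tag of the input prompt
    is_prefill: bool  # True for prefill tokens, False for generated tokens


def summarize(records):
    """Aggregate records into per-expert count, mean score, and domain counts."""
    counts = Counter()
    score_sums = defaultdict(float)
    domains = defaultdict(Counter)
    for r in records:
        key = (r.layer, r.expert)
        counts[key] += 1
        score_sums[key] += r.score
        domains[key][r.domain] += 1
    return {
        key: {
            "count": n,
            "mean_score": score_sums[key] / n,
            "domains": dict(domains[key]),
        }
        for key, n in counts.items()
    }
```

Keying on (layer, expert) matches the "expert instance" unit used in the tables, since the same expert index in two different layers is a different sub-network.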
Key Finding: Softmax Confidence Scales Inversely With Expert Count
This was the biggest surprise. The softmax routing scores look fundamentally different between 128-expert and 512-expert models:
| Metric | Qwen3-235B (128 experts) | Qwen3.5-397B (512 experts) |
|---|---|---|
| Median softmax score | 0.078 | 0.016 |
| P95 softmax score | 0.142 | 0.030 |
| P99 softmax score | 0.210 | 0.048 |
| Top-k | 8 | 10 |
When 512 experts compete for routing probability, individual scores get crushed. A "highly confident" routing decision in a 512-expert model (P99 score 0.048) would fall below the median (0.078) in a 128-expert model.
What this means for quantization: Threshold-based classification schemes need model-specific calibration. A fixed threshold like "score > 0.1 = critical" works fine for 128-expert models but classifies zero experts as critical in 512-expert models.
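One simple way to make scores comparable across expert counts is to express each score as a multiple of the uniform baseline 1/num_experts. The sketch below applies that idea to the percentiles in the table; relative_score and rescale_threshold are hypothetical helpers, and linear scaling is an assumption rather than the calibration we actually shipped.

```python
def relative_score(score, num_experts):
    """Express a softmax score as a multiple of the uniform baseline 1/num_experts."""
    return score * num_experts


def rescale_threshold(threshold, from_experts, to_experts):
    """Carry a threshold tuned for one expert count over to another (linear assumption)."""
    return threshold * from_experts / to_experts


# Percentiles copied from the table above.
table = {
    "Qwen3-235B": (128, {"median": 0.078, "p95": 0.142, "p99": 0.210}),
    "Qwen3.5-397B": (512, {"median": 0.016, "p95": 0.030, "p99": 0.048}),
}

for name, (num_experts, stats) in table.items():
    rel = {k: round(relative_score(v, num_experts), 1) for k, v in stats.items()}
    print(name, rel)

# The fixed "score > 0.1" rule from a 128-expert model, carried to 512 experts.
print(rescale_threshold(0.1, 128, 512))
```

On these numbers the two models look far more alike once normalized: the median sits at roughly 10x the uniform baseline for the 128-expert model and roughly 8x for the 512-expert model, even though the raw medians differ by almost 5x.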
Expert Classification
We classified every expert instance into four tiers based on activation frequency, domain specificity, and confidence scores:
Qwen3-235B-A22B (12,032 total expert instances)
| Tier | Count | Percentage | Bits |
|---|---|---|---|
| Critical | 23 | 0.19% | 8 |
| Standard | 10,845 | 90.14% | 4 |
| Deprioritized | 365 | 3.03% | 2 |
| Prune | 799 | 6.64% | 0 |
Qwen3.5-397B-A17B (30,720 total expert instances)
| Tier | Count | Percentage | Bits |
|---|---|---|---|
| Critical | 879 | 2.86% | 8 |
| Standard | 22,466 | 73.13% | 4 |
| Deprioritized | 1,813 | 5.90% | 2 |
| Prune | 5,562 | 18.11% | 0 |
The 512-expert model has a much longer tail of low-activation experts. Nearly 1 in 5 experts (18.1%) landed in the prunable category, meaning their activation frequency fell below 0.05%. That makes sense: with 4x as many experts per layer, there's simply more room for redundancy.
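To see what these tier counts imply for storage, the average bit width per expert instance can be computed directly from the two tables. This is a back-of-the-envelope sketch: it assumes every expert within a model has the same parameter count, counts pruned experts as 0 bits, and ignores attention weights, routers, the shared expert, and quantization metadata such as scales.

```python
# Bits per tier and tier counts, copied from the tables above.
TIER_BITS = {"critical": 8, "standard": 4, "deprioritized": 2, "prune": 0}

TIER_COUNTS = {
    "Qwen3-235B-A22B": {
        "critical": 23, "standard": 10_845, "deprioritized": 365, "prune": 799,
    },
    "Qwen3.5-397B-A17B": {
        "critical": 879, "standard": 22_466, "deprioritized": 1_813, "prune": 5_562,
    },
}


def mean_bits(counts, bits=TIER_BITS):
    """Average bits per expert instance, weighting each tier by its count."""
    total = sum(counts.values())
    return sum(n * bits[tier] for tier, n in counts.items()) / total


for model, counts in TIER_COUNTS.items():
    print(f"{model}: {sum(counts.values())} experts, {mean_bits(counts):.2f} bits on average")
```

Under those assumptions the expert weights average about 3.68 bits for the 128-expert model and about 3.27 bits for the 512-expert model, with the difference driven almost entirely by the larger prune tier.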
Domain Specificity
Most experts are generalists. The domain specificity score (how unevenly an expert's activations distribute across domains) stays low across both models:
Domain specificity distribution (Qwen3.5-397B):
Mean: 0.047
Median: 0.039
P90: 0.082
P99: 0.198
A score near the mean of 0.047 means the expert fires almost uniformly across all six domains. Only the top 1% of experts (specificity > 0.198) show strong domain specialization.
Here's the thing, though. Domain specificity measured against our calibration set doesn't capture true specialization. An expert that only fires on Haskell monad questions or Swahili grammar won't show up as "domain-specific" if those inputs aren't in the calibration set. This limitation came back to bite us hard when we later pruned experts (see Article 3).
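For readers who want a concrete handle on "how unevenly activations distribute across domains", one common choice is one minus the normalized entropy of the per-domain activation counts. The sketch below uses that definition as an assumption; it is not necessarily the exact formula behind the numbers above, and domain_specificity is a hypothetical name.

```python
import math


def domain_specificity(domain_counts):
    """Return 0.0 for perfectly uniform activations, 1.0 for a single-domain expert.

    domain_counts maps every calibration domain (including those with zero
    activations) to the number of times the expert fired on that domain.
    Assumed definition: 1 - H(p) / log(num_domains).
    """
    total = sum(domain_counts.values())
    num_domains = len(domain_counts)
    if total == 0 or num_domains < 2:
        return 0.0
    probs = [c / total for c in domain_counts.values() if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(num_domains)


domains = ["coding", "math", "reasoning", "agent", "english", "multilingual"]
generalist = {d: 100 for d in domains}
specialist = {d: 0 for d in domains} | {"math": 600}

print(domain_specificity(generalist))
print(domain_specificity(specialist))
```

The sketch also makes the limitation concrete: the score only sees the domains passed in, so an expert that specializes in something outside the calibration set is indistinguishable from one that rarely fires at all. Raw counts are used here as well, so domains with more calibration tokens would need normalizing first.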
Layer-Level Patterns
Expert activation isn't uniform across layers:
Qwen3.5-397B critical expert distribution by layer:
Layer 41: 94 critical experts ████████████████████████████
Layer 40: 50 critical experts ███████████████
Layer 36: 39 critical experts ████████████
Layer 19: 37 critical experts ███████████
Layer 29: 36 critical experts ███████████
Layer 27: 34 critical experts ██████████
Layer 25: 33 critical experts ██████████
Layer 37: 33 critical experts ██████████
Layer 26: 32 critical experts ██████████
Layer 28: 31 critical experts █████████
Layer 45: 31 critical experts █████████
...
Layers 0-18: 0-5 critical experts each
Critical experts cluster in the middle-to-late layers (25-45). Early layers (0-18) have very few critical experts but plenty of prunable ones. Layer 0 alone had 166 experts classified as prunable.
The pattern suggests early MoE layers handle broad, redundant routing where many experts can substitute for each other. Later layers develop specialized, non-redundant experts that can't be easily replaced.
Practical Recommendations
- Calibration set design matters enormously. Our 150 English-focused samples missed experts that are critical for multilingual and niche-domain tasks. A production profiling run should include diverse languages and specialized domains proportional to expected usage.
- Don't use fixed thresholds across model sizes. Scale your classification thresholds relative to 1/num_experts to account for softmax dilution.
- Profile more samples than you think you need. With 512 experts, 150 samples give only about 51 activations per expert on average. Rare but important experts may fire only a handful of times.
- Layer position is informative. Early layers are safe targets for aggressive quantization. Late layers need more care.
Code Availability
The profiling tool (MoEActivationTracker) supports Qwen3, Qwen3.5, and Llama-4 Maverick architectures. It captures per-expert activation frequency, domain distribution, softmax confidence, and cluster assignments.
Next in this series: Per-Expert Mixed-Bit Quantization via Mask-and-Combine Dispatch, how we used these profiling results to build a custom quantization kernel, and why it was too slow for production.
Read the Full Paper
The full MoE expert quantization paper, covering expert activation profiling, per-expert mixed-bit allocation, and evaluation across 512-expert architectures, is available on our HuggingFace:
MoE Expert Quantization: Per-Expert Mixed-Precision for Mixture-of-Experts Models, Full Paper
huggingface.co/spaces/baa-ai/MoE-Expert-Quantization
Licensed under CC BY-NC-ND 4.0