How we profiled 30,720 experts across two large MoE models, what the activation patterns revealed, and why the numbers challenge common assumptions about expert redundancy.
Introduction
Mixture-of-Experts (MoE) models like Qwen3-235B and Qwen3.5-397B route each token to a small subset of "expert" sub-networks. The premise of expert-aware quantization is simple: if we know which experts matter most, we can allocate more bits to critical experts and fewer bits (or none at all) to unimportant ones.
But to do that, you first need to know which experts matter. We built a profiling system that captures expert activation patterns across domains, and the results challenged several of our assumptions.
This article covers the profiling methodology, the patterns we discovered, and why the numbers differ dramatically between 128-expert and 512-expert architectures.
The Models
We profiled two models:
| Property | Qwen3-235B-A22B | Qwen3.5-397B-A17B |
|---|---|---|
| Total parameters | 235B | 397B |
| Active per token | ~22B | ~17B |
| Layers | 94 | 60 |
| Experts per layer | 128 | 512 |
| Top-k routing | 8 | 10 |
| Shared expert | No | Yes |
| Routing | Softmax top-k | Softmax top-k |
| Total expert instances | 12,032 | 30,720 |
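Both models use softmax top-k routing: the router produces one logit per expert, softmaxes over all of them, and dispatches the token to the k highest-probability experts. A minimal sketch of that selection step (illustrative, not the models' actual router code):

```python
import math

def route_top_k(logits, k):
    """Softmax top-k routing: softmax over all expert logits, then
    keep the k highest-probability experts. A sketch of the scheme
    in the table above, not the models' actual implementation."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    top = ranked[:k]
    return top, [probs[i] for i in top]

# Hypothetical 8-expert layer with top-2 routing
experts, scores = route_top_k([2.0, 0.5, 1.0, -1.0, 0.1, 3.0, 0.0, 0.7], k=2)
# experts -> [5, 0]; each score is that expert's softmax probability
```

The per-expert softmax scores returned here are exactly what the profiling below records, which is why they shrink as the number of competing experts grows.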
Profiling Methodology
Calibration Data
We ran 150 calibration prompts through each model, tagged by domain:
- Coding — algorithm implementation, debugging, code review
- Math — calculus, proofs, word problems
- Reasoning — logic puzzles, comparisons, analysis
- Agent/Tool-use — structured responses, tool calling patterns
- English general — creative writing, knowledge questions, conversation
- Multilingual — translation, non-English generation
Each prompt generated a full forward pass, and we captured every expert activation — which expert was selected, its softmax routing score, and the domain tag of the input.
Hooking Into the Router
For MLX models, the router is inside each MoE layer's __call__ method. We needed to intercept the routing decision after softmax but before expert dispatch. The standard approach of patching __call__ with types.MethodType fails because Python's dunder method lookup goes through __class__, not the instance. Our solution was __class__ swapping:
# This DOES NOT work for __call__:
# layer.__call__ = types.MethodType(hooked_call, layer)

# This DOES work — swap the class itself:
original_class = layer.__class__
HookedClass = type(
    f"Hooked{original_class.__name__}",
    (original_class,),
    {"__call__": hooked_call},
)
layer.__class__ = HookedClass
This gives us access to the routing weights, selected expert indices, and softmax probabilities at every layer for every token.
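The lookup behavior can be verified with a toy module outside MLX (the `Layer` class and the hook body below are hypothetical stand-ins; the mechanism is the same):

```python
import types

class Layer:
    """Hypothetical stand-in for an MLX MoE layer, for demonstration."""
    def __call__(self, x):
        return x + 1

layer = Layer()

# Patching the instance attribute is silently ignored for dunders:
# Python resolves layer(x) via type(layer).__call__, not the instance.
layer.__call__ = types.MethodType(lambda self, x: x + 100, layer)
assert layer(1) == 2  # the original __call__ still runs

# Swapping __class__ to a dynamically created subclass does work.
def hooked_call(self, x):
    out = Layer.__call__(self, x)  # run the original forward pass
    return out * 10                # stand-in for "record routing info"

layer.__class__ = type("HookedLayer", (Layer,), {"__call__": hooked_call})
assert layer(1) == 20  # the hook now intercepts every call
```

Because the hooked class subclasses the original, all other attributes and methods of the layer keep working unchanged.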
What We Captured
For each expert activation, we recorded:
- Layer index and expert index
- Softmax routing score (the probability assigned by the router)
- Domain tag of the input prompt
- Token position (prefill vs generation)
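In code, one record per expert selection is enough to reconstruct every statistic used later. A sketch of the record shape (field names are illustrative; the actual tracker's schema may differ):

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class ExpertActivation:
    """One profiling record per expert selection (illustrative schema)."""
    layer: int         # which MoE layer fired
    expert: int        # expert index within that layer
    score: float       # softmax routing probability
    domain: str        # domain tag of the calibration prompt
    is_prefill: bool   # prefill token vs generated token

records = [
    ExpertActivation(layer=41, expert=207, score=0.031, domain="coding", is_prefill=True),
    ExpertActivation(layer=41, expert=12, score=0.009, domain="coding", is_prefill=False),
]

# Per-expert activation counts fall out of a simple group-by
counts = Counter((r.layer, r.expert) for r in records)
```

Activation frequency, domain distribution, and confidence percentiles are all aggregations over this flat log.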
For Qwen3.5-397B, this produced 7.8 million activation records from 150 calibration samples, taking about 11 minutes.
Key Finding: Softmax Confidence Scales Inversely With Expert Count
This was the most surprising result. The softmax routing scores look fundamentally different between 128-expert and 512-expert models:
| Metric | Qwen3-235B (128 experts) | Qwen3.5-397B (512 experts) |
|---|---|---|
| Median softmax score | 0.078 | 0.016 |
| P95 softmax score | 0.142 | 0.030 |
| P99 softmax score | 0.210 | 0.048 |
| Top-k | 8 | 10 |
With 512 experts competing for routing probability, individual expert scores are much lower. A "highly confident" routing decision in a 512-expert model (a P99 score of 0.048) would fall below the median score (0.078) of the 128-expert model.
Implication for quantization: Threshold-based classification schemes need model-specific calibration. A fixed threshold like "score > 0.1 = critical" works for 128-expert models but classifies zero experts as critical in 512-expert models.
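One way to transfer a threshold between architectures is to rescale it by the ratio of expert counts, on the assumption that scores dilute roughly as 1/num_experts (a sketch of the recommendation, not the authors' exact rule):

```python
def rescale_threshold(base_threshold: float, base_num_experts: int,
                      target_num_experts: int) -> float:
    """Rescale a softmax-score threshold tuned on one expert count to
    another, assuming scores dilute roughly as 1/num_experts.
    Illustrative sketch, not the authors' exact calibration rule."""
    return base_threshold * (base_num_experts / target_num_experts)

# A 0.1 "critical" cutoff tuned on 128 experts becomes 0.025 at 512,
# which sits below that model's P99 score of 0.048, so some experts
# can actually qualify as critical.
threshold_512 = rescale_threshold(0.1, 128, 512)
```

The scaling is only a first-order correction; the per-model percentile tables above are the safer basis for production thresholds.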
Expert Classification
We classified every expert instance into four tiers based on activation frequency, domain specificity, and confidence scores:
Qwen3-235B-A22B (12,032 total expert instances)
| Tier | Count | Percentage | Bits |
|---|---|---|---|
| Critical | 23 | 0.19% | 8 |
| Standard | 10,845 | 90.14% | 4 |
| Deprioritized | 365 | 3.03% | 2 |
| Prune | 799 | 6.64% | 0 |
Qwen3.5-397B-A17B (30,720 total expert instances)
| Tier | Count | Percentage | Bits |
|---|---|---|---|
| Critical | 879 | 2.86% | 8 |
| Standard | 22,466 | 73.13% | 4 |
| Deprioritized | 1,813 | 5.90% | 2 |
| Prune | 5,562 | 18.11% | 0 |
The 512-expert model has a much larger tail of low-activation experts. Nearly 1 in 5 experts (18.1%) was classified as prunable based on activation frequency below 0.05%. This makes intuitive sense — with 4x more experts per layer, there's more room for redundancy.
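A tiering rule in the spirit of this scheme might look like the following. The 0.05% prune cutoff comes from the text; the other cutoffs and the `peak_score` signal are assumptions for illustration, not the authors' exact values:

```python
def classify_expert(activation_freq: float, peak_score: float,
                    num_experts: int) -> str:
    """Illustrative four-tier classification. Only the 0.05% prune
    threshold is from the article; the rest are assumed cutoffs."""
    if activation_freq < 0.0005:         # fired on < 0.05% of tokens
        return "prune"
    if peak_score > 4.0 / num_experts:   # unusually confident routing
        return "critical"
    if activation_freq < 0.005:          # rarely used
        return "deprioritized"
    return "standard"

# Bit widths per tier, as in the tables above
BITS = {"critical": 8, "standard": 4, "deprioritized": 2, "prune": 0}
```

Note the critical-tier cutoff is expressed relative to 1/num_experts, so the same rule remains meaningful across the 128- and 512-expert models.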
Domain Specificity
Most experts are generalists. The domain specificity score (how unevenly an expert's activations distribute across domains) is low across both models:
Domain specificity distribution (Qwen3.5-397B):
Mean: 0.047
Median: 0.039
P90: 0.082
P99: 0.198
A mean score of 0.047 means the typical expert's activations are spread nearly uniformly across all six domains. Only ~1% of experts show strong domain specialization (specificity above the P99 of 0.198).
However — and this is critical — domain specificity measured against our calibration set doesn't capture true specialization. An expert that fires only on Haskell monad questions or Swahili grammar won't register as "domain-specific" in a calibration set that doesn't include those inputs. This limitation directly caused quality issues when we later pruned experts (see Article 3 in this series).
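The article does not give its exact specificity formula; one natural choice consistent with the description (0 = uniform across domains, higher = more concentrated) is one minus the normalized entropy of the expert's domain distribution, assumed here for illustration:

```python
import math

def domain_specificity(counts: dict) -> float:
    """Specificity = 1 - normalized entropy of the expert's activation
    distribution over domains: 0.0 for perfectly uniform, 1.0 for all
    activations in one domain. Assumed metric; the article does not
    specify its exact formula."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(len(counts))

# A near-uniform generalist scores close to 0
generalist = {"code": 100, "math": 98, "reason": 102,
              "agent": 99, "en": 101, "multi": 100}
# A heavily concentrated expert scores close to 1
specialist = {"code": 590, "math": 2, "reason": 3,
              "agent": 2, "en": 1, "multi": 2}
```

Under this metric, a specialist that never fires on the calibration set at all simply produces no counts, which is exactly the blind spot described above.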
Layer-Level Patterns
Expert activation isn't uniform across layers:
Qwen3.5-397B critical expert distribution by layer:
Layer 41: 94 critical experts ████████████████████████████
Layer 40: 50 critical experts ███████████████
Layer 36: 39 critical experts ████████████
Layer 19: 37 critical experts ███████████
Layer 29: 36 critical experts ███████████
Layer 27: 34 critical experts ██████████
Layer 25: 33 critical experts ██████████
Layer 37: 33 critical experts ██████████
Layer 26: 32 critical experts ██████████
Layer 28: 31 critical experts █████████
Layer 45: 31 critical experts █████████
...
Layers 0-18: 0-5 critical experts each
The middle-to-late layers (25-45) concentrate the critical experts. Early layers (0-18) have very few critical experts but many prunable ones — layer 0 alone had 166 experts classified as prunable.
This suggests that early MoE layers handle broad, redundant routing patterns (many experts can substitute for each other), while later layers develop specialized, non-redundant experts.
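Layer-level views like the chart above are a one-line aggregation over the tier assignments (a sketch assuming a `{(layer, expert): tier}` map as produced by a classification pass):

```python
from collections import Counter

def critical_per_layer(tiers: dict) -> Counter:
    """Count critical experts per layer from a {(layer, expert): tier}
    map. Hypothetical data layout, for illustration."""
    counts = Counter()
    for (layer, _expert), tier in tiers.items():
        if tier == "critical":
            counts[layer] += 1
    return counts

tiers = {(41, 0): "critical", (41, 3): "critical",
         (41, 9): "standard", (0, 7): "prune"}
per_layer = critical_per_layer(tiers)
```

The same group-by over the "prune" tier surfaces the early-layer redundancy, e.g. the 166 prunable experts in layer 0.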
Practical Recommendations
- Calibration set design matters enormously. Our 150 English-focused samples missed experts that are critical for multilingual and niche-domain tasks. A production profiling run should include diverse languages and specialized domains proportional to expected usage.
- Don't use fixed thresholds across model sizes. Scale your classification thresholds relative to 1/num_experts to account for softmax dilution.
- Profile more samples than you think you need. With 512 experts per layer, 150 samples produced only ~254 activations per expert instance on average (7.8 million records across 30,720 experts). Rare but important experts may fire only a handful of times.
- Layer position is informative. Early layers are safe targets for aggressive quantization. Late layers require more care.
Code Availability
The profiling tool (MoEActivationTracker) supports Qwen3, Qwen3.5, and Llama-4 Maverick architectures. It captures per-expert activation frequency, domain distribution, softmax confidence, and cluster assignments.
Next in this series: Per-Expert Mixed-Bit Quantization via Mask-and-Combine Dispatch — how we used these profiling results to build a custom quantization kernel, and why it was too slow for production.