How concentrated is expert routing in a 256-expert MoE model? The answer depends entirely on how many prompts you test with — and the difference is dramatic.
The Setup
We built a lightweight activation profiler that monkey-patches MLX MoE gate modules to capture routing decisions. The profiler intercepts the softmax output of each router, records which experts are selected (top-k), and accumulates per-expert activation counts across all tokens and layers.
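The accumulation step can be sketched independently of MLX internals. A minimal version, assuming a generic `(tokens, experts)` probability matrix rather than the actual MLX gate output — the shapes and simulated data below are illustrative, not the profiler's real code:

```python
import numpy as np

def record_routing(gate_probs: np.ndarray, top_k: int, counts: np.ndarray) -> None:
    """Accumulate per-expert selection counts from one gate's softmax output.

    gate_probs: (num_tokens, num_experts) router probabilities.
    counts:     (num_experts,) running activation counts, updated in place.
    """
    # Indices of the top-k experts per token (order within the top-k is irrelevant).
    topk_idx = np.argpartition(gate_probs, -top_k, axis=-1)[:, -top_k:]
    np.add.at(counts, topk_idx.ravel(), 1)

# Illustrative: 4 tokens routed over 8 experts, top-2 selection.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(8), size=4)
counts = np.zeros(8, dtype=np.int64)
record_routing(probs, top_k=2, counts=counts)
print(counts.sum())  # 4 tokens x 2 experts = 8 selections
```

In the real profiler this function would be called from the patched gate's forward pass, once per MoE layer per batch, with the counts tensor keyed by layer.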
We ran the profiler on Qwen3.5-35B-A3B — a model with 256 experts per layer, 8 active per token, and 40 MoE layers — across 100 diverse prompts spanning:
- Code generation (Python, Rust, JavaScript)
- Mathematical reasoning and proofs
- Creative writing and storytelling
- Factual Q&A and knowledge retrieval
- Multilingual text (Chinese, French, Spanish, Arabic)
- Instruction following and structured output
The Sample Size Effect
This is the most important methodological finding in the study. The number of prompts used for profiling dramatically changes the apparent redundancy of the expert pool:
| Prompts | Dead Experts (avg per layer) | Apparent Redundancy |
|---|---|---|
| 5 | 76.8 / 256 | 30.0% |
| 20 | 18.2 / 256 | 7.1% |
| 50 | 4.9 / 256 | 1.9% |
| 100 | 1.5 / 256 | 0.6% |
At 5 prompts, nearly a third of experts appear unused. At 100 prompts, that figure drops to 0.6%. Papers evaluating MoE routing with small prompt sets massively overestimate redundancy. This is not a subtle effect — it is a 50x reduction in apparent dead experts between 5 and 100 prompts.
The mechanism is straightforward: expert specialization is domain-dependent. An expert that never fires on English prose may be essential for code. An expert dormant during factual Q&A may activate heavily on mathematical notation. Small prompt sets sample a narrow slice of the input distribution and miss these domain-specific activations.
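A toy simulation makes the mechanism concrete. The domain structure and all numbers below are invented for illustration, not the study's measurements: each domain routes to its own subset of experts, so few prompts sample few domains and domain-specific experts look dead.

```python
import numpy as np

rng = np.random.default_rng(42)
NUM_EXPERTS, DOMAINS = 256, 6

# Toy assumption: experts are partitioned into domain-specific pools.
domain_experts = np.array_split(rng.permutation(NUM_EXPERTS), DOMAINS)

def dead_experts(num_prompts: int) -> int:
    """Count experts that never fire across a random sample of prompts."""
    counts = np.zeros(NUM_EXPERTS, dtype=np.int64)
    for _ in range(num_prompts):
        d = rng.integers(DOMAINS)                 # this prompt's domain
        pool = domain_experts[d]
        fired = rng.choice(pool, size=min(8, len(pool)), replace=False)
        counts[fired] += 1                        # top-8 routing within the pool
    return int((counts == 0).sum())

for n in (5, 20, 50, 100):
    print(n, "prompts ->", dead_experts(n), "apparently dead experts")
```

The simulated dead-expert count falls steeply with prompt count for the same qualitative reason as the real measurements: coverage of the domain pools, not model redundancy, is what small samples measure.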
Routing Concentration
At 100 prompts, the full routing profile stabilizes. Here are the key concentration metrics:
| Metric | Value | Interpretation |
|---|---|---|
| Entropy ratio | 0.91 | Fairly uniform (1.0 = perfect uniformity) |
| Gini coefficient | 0.53 | Moderate concentration (0 = equal, 1 = monopoly) |
| Top-10 expert share | 20.4% | 5.2x their “fair share” (uniform = 3.9%) |
| Dead experts | 1.5 / 256 | 0.6% truly unused at 100 prompts |
The routing is concentrated but not sparse. Most experts are used, just unevenly. The top-10 experts get roughly 5x their fair share of traffic, while long-tail experts handle small but non-negligible fractions of tokens. The Gini coefficient of 0.53 indicates moderate inequality — comparable to income distribution in a mid-inequality economy, not the extreme concentration you might expect.
The entropy ratio of 0.91 confirms that Qwen3.5’s load-balancing loss during training was effective: the router uses most of the expert pool, just not uniformly. This is arguably the ideal outcome for quantization-aware tiering — enough concentration to justify differential precision, but not so much that most experts are wasted.
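Both concentration metrics are standard and can be recomputed from the raw activation counts. A generic implementation (not necessarily the paper's exact code):

```python
import numpy as np

def entropy_ratio(counts: np.ndarray) -> float:
    """Shannon entropy of the activation distribution over max entropy (uniform)."""
    p = counts / counts.sum()
    p = p[p > 0]                           # convention: 0 * log(0) = 0
    h = -(p * np.log(p)).sum()
    return float(h / np.log(len(counts)))

def gini(counts: np.ndarray) -> float:
    """Gini coefficient: 0 = perfectly equal load, 1 = one expert takes all."""
    x = np.sort(counts.astype(np.float64))
    n = len(x)
    cum = np.cumsum(x)
    return float((n + 1 - 2 * (cum / cum[-1]).sum()) / n)

uniform = np.full(256, 100)
print(f"entropy_ratio={entropy_ratio(uniform):.3f}  gini={gini(uniform):.3f}")
```

Feeding the per-expert counts from the profiler into these two functions reproduces the table above: an entropy ratio near 1.0 with a mid-range Gini means broad but uneven usage.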
What This Means for Quantization
Activation frequency provides a useful signal for tiered quantization. Our DynaMINT experiment demonstrates this directly: assigning higher precision to frequently-used experts and lower precision to rare ones yields only +0.5% PPL degradation — a negligible quality cost for meaningful compression gains.
But activation frequency is not a safe pruning signal. Our companion pruning study shows that removing just 5% of least-used experts causes a 13x PPL blow-up. The correct use of activation data is to inform bit-width allocation, not to decide which experts to remove.
The distinction is critical:
- Tiered quantization (activation-guided bit allocation): +0.5% PPL — production-viable
- Expert pruning (activation-guided removal): +1,220% PPL — catastrophic
The same signal produces opposite outcomes depending on how you act on it.
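A minimal sketch of the safe use of the signal — activation-guided bit allocation. The tier fractions and bit-widths here are illustrative assumptions, not DynaMINT's actual allocation policy, and crucially every expert keeps nonzero precision:

```python
import numpy as np

def assign_bits(counts: np.ndarray, hot_frac: float = 0.1,
                hot_bits: int = 6, base_bits: int = 4, cold_bits: int = 3) -> np.ndarray:
    """Map per-expert activation counts to per-expert bit-widths.

    Hottest experts get more bits, the coldest tail gets fewer,
    and no expert is ever removed (bits never reach zero).
    """
    order = np.argsort(counts)[::-1]        # expert indices, most-used first
    n = len(counts)
    n_hot = max(1, int(hot_frac * n))
    bits = np.full(n, base_bits)
    bits[order[:n_hot]] = hot_bits          # top tier: high precision
    bits[order[-n_hot:]] = cold_bits        # coldest tier: low precision
    return bits

# Illustrative profile: 256 experts with Poisson-distributed activation counts.
counts = np.random.default_rng(1).poisson(50, size=256)
bits = assign_bits(counts)
```

The same `counts` array fed into a pruning decision (drop the bottom tier entirely) is what produces the catastrophic PPL blow-up; the tiering policy above only reduces precision there.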
Implications for MoE Design
MoE models at 256 experts achieve good load balancing (entropy ratio 0.91) but not perfect uniformity. The remaining concentration creates an opportunity for quantization-aware tiering that uniform quantization leaves on the table.
Framework support for per-expert precision would unlock this directly. Currently, MLX and most inference frameworks quantize all experts within a layer to the same bit-width. A per-expert precision API — where each expert in a SwitchLinear layer can have its own bits and group_size — would enable DynaMINT-style tiering without the Python dispatch overhead that currently limits throughput.
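What such an API might look like is sketched below. To be clear, this is hypothetical: `ExpertQuantSpec` and a per-expert `bits`/`group_size` argument do not exist in MLX today; the sketch only illustrates the shape of the proposed interface.

```python
from dataclasses import dataclass

@dataclass
class ExpertQuantSpec:
    """Hypothetical per-expert precision spec for a SwitchLinear-style layer.

    Not an MLX API — a sketch of what per-expert quantization config could be.
    """
    bits: int
    group_size: int = 64

def layer_spec(bits_per_expert: list[int], group_size: int = 64) -> list[ExpertQuantSpec]:
    """One spec per expert in a MoE layer, e.g. derived from an activation profile."""
    return [ExpertQuantSpec(bits=b, group_size=group_size) for b in bits_per_expert]

# Illustrative 8-expert layer: 4 hot experts at 6-bit, the rest at 4-bit.
spec = layer_spec([6, 6, 6, 6, 4, 4, 4, 4])
```

A framework consuming such a spec could fuse the per-expert dequantization into the routed matmul kernel, which is exactly the Python-dispatch overhead the text notes is currently the throughput bottleneck.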
For researchers evaluating MoE routing, the methodological takeaway is clear: use at least 100 diverse prompts before drawing conclusions about expert redundancy. With fewer prompts, you are measuring the narrowness of your prompt set, not the redundancy of the model.
Profiler code and full activation data available at github.com/baa-ai/MINT. Full paper: “MINT: Budget-Aware Data-Free Mixed-Precision Quantization for Large Language Models via Rate-Distortion Optimization” (baa.ai, 2026). Expert routing data collected on Qwen3.5-35B-A3B using MLX on Apple M2 Ultra 192GB.
Read the Full Paper
The complete MoE expert quantization paper, including expert activation profiling, per-expert mixed-bit allocation, and evaluation across 512-expert architectures, is available on our HuggingFace Space:
MoE Expert Quantization: Per-Expert Mixed-Precision for Mixture-of-Experts Models — Full Paper
huggingface.co/spaces/baa-ai/MoE-Expert-Quantization

Licensed under CC BY-NC-ND 4.0