MoE Routing Layers Converge Across Subjects: No Free Lunch for Domain-Specific Targeting

We profiled MoE gate routing across 14 academic domains expecting to find subject-specific “knowledge layers.” Instead, all subjects route through the same three layers. Domain-specific LoRA targeting is indistinguishable from generic mid-layer targeting.

The Hypothesis

Mixture-of-Experts models route different tokens to different experts. An appealing idea for targeted fine-tuning: if “chemistry knowledge” routes through specific layers and experts, you could inject chemistry knowledge by training LoRA only on those layers, leaving other domains undisturbed. This would enable domain-specific knowledge injection without cross-domain interference.

We tested this on Qwen3.5-35B-A3B (128 experts per MoE layer, 40 MoE layers) by profiling router gate activations across 14 MMLU-Pro subjects with 100 questions per subject.

Methodology

For each subject, we logged per-layer expert activation distributions across 100 domain-specific prompts. We computed per-subject specificity as the KL divergence between each subject’s routing distribution and the pooled (all-subjects) distribution at each layer. For each subject, the top-3 layers by KL divergence represent the layers where that subject’s routing is most distinctive from the average.
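
The specificity score described above can be sketched in a few lines of Python. This is a minimal illustration, not the profiling code used for the study: the function names, the KL direction (subject relative to pooled), the epsilon smoothing, and the toy activation counts are all assumptions made for the example.

```python
import math

def normalize(counts):
    """Turn raw expert-activation counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) for unused experts."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def top_k_layers(subject_counts, pooled_counts, k=3):
    """Rank layers by KL(subject || pooled) and return the k most distinctive.

    Both arguments map layer index -> list of per-expert activation counts.
    """
    scores = {
        layer: kl_divergence(normalize(subject_counts[layer]),
                             normalize(pooled_counts[layer]))
        for layer in subject_counts
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Toy data: 4 layers with 4 experts each (hypothetical, not from the study).
pooled = {layer: [25, 25, 25, 25] for layer in range(4)}
chemistry = {
    0: [25, 25, 25, 25],   # identical to pooled, so KL is 0
    1: [70, 10, 10, 10],   # strongly skewed toward one expert
    2: [40, 20, 20, 20],
    3: [30, 30, 20, 20],
}

top3, scores = top_k_layers(chemistry, pooled)
print(top3)
```

In the real setting each layer would hold 128 expert counts, and the pooled counts would be the sum over all 14 subjects rather than a hand-written uniform distribution.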

The key metric is mean pairwise Jaccard overlap of the top-3 layer sets across all subject pairs. If subjects route through different layers, Jaccard should be low (<0.3). If they share the same layers, Jaccard will be high (>0.6).

Results: The Shared Backbone

Subject	Top-3 Routing Layers
Chemistry	20, 8, 14
Math	14, 20, 18
Business	14, 19, 8
Engineering	8, 9, 14
Physics	8, 20, 14
Law	14, 19, 15
Other	20, 8, 14
Computer Science	20, 8, 15
Health	20, 8, 14
History	14, 20, 9
Philosophy	8, 20, 14
Economics	8, 20, 14
Psychology	20, 8, 9
Biology	20, 8, 13

Layers 8, 14, and 20 form a shared routing backbone: each appears in the top-3 of 11 of 14 subjects. Most of the variation is confined to a single slot (Law, with two non-backbone layers, is the one exception), and the alternatives that fill it (9, 13, 15, 18, 19) appear sporadically with no subject-specific consistency.

The Numbers

Metric	Value	Interpretation
Mean pairwise Jaccard (top-3)	0.496	Moderate overlap; a Jaccard of 0.5 corresponds to a pair of subjects sharing two of their three top layers
Layers appearing in ≥11/14 subjects	3 (layers 8, 14, 20)	A fixed backbone, not subject-specific routing
Subjects with exact {8,14,20} top-3	6 / 14	Nearly half the subjects have an identical top-3 set
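
These figures can be re-derived directly from the top-3 sets listed under Results. The sketch below does that using only the Python standard library; the variable names are illustrative, and the layer sets are copied from the results table.

```python
from collections import Counter
from itertools import combinations

# Top-3 routing layers per subject, copied from the results table.
TOP3 = {
    "Chemistry": {20, 8, 14},
    "Math": {14, 20, 18},
    "Business": {14, 19, 8},
    "Engineering": {8, 9, 14},
    "Physics": {8, 20, 14},
    "Law": {14, 19, 15},
    "Other": {20, 8, 14},
    "Computer Science": {20, 8, 15},
    "Health": {20, 8, 14},
    "History": {14, 20, 9},
    "Philosophy": {8, 20, 14},
    "Economics": {8, 20, 14},
    "Psychology": {20, 8, 9},
    "Biology": {20, 8, 13},
}

def jaccard(a, b):
    """Jaccard overlap of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# All 91 unordered subject pairs.
pairs = list(combinations(TOP3.values(), 2))
mean_jaccard = sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# How many subjects include each layer in their top-3.
layer_counts = Counter(layer for layers in TOP3.values() for layer in layers)

backbone = {8, 14, 20}
exact_matches = sum(1 for layers in TOP3.values() if layers == backbone)

print(f"mean pairwise Jaccard: {mean_jaccard:.3f}")  # 0.496
print(sorted(l for l, c in layer_counts.items() if c >= 11))  # [8, 14, 20]
print(f"exact backbone matches: {exact_matches} / {len(TOP3)}")  # 6 / 14
```

The 15 pairs drawn from the six identical subjects each contribute a Jaccard of 1.0, while pairs that differ in one layer contribute 0.5 and pairs that differ in two contribute 0.2, which is how the mean lands near 0.5 despite the strong backbone.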

What This Means for Targeted LoRA

The premise of “subject-specific LoRA targeting” requires subjects to route through different layers. When nearly all subjects share the same top layers, targeting “chemistry layers” vs. “physics layers” vs. “generic knowledge layers” is the same operation: you are always targeting {8, 14, 20}.

This is a negative result. At the layer level, MoE routing on this model does not provide a subject-specific signal that can be exploited for targeted fine-tuning. The gate nominally passes our pre-registered threshold (mean pairwise Jaccard < 0.6, measured 0.496), but the pass is driven entirely by the variable third layer; the core backbone is universal.

Why the Backbone Is Universal

Two likely explanations:

Load balancing during pre-training pushes expert usage toward uniformity in every layer. The auxiliary load-balancing loss may thereby prevent any single layer from becoming a domain-specific bottleneck.
KL-based specificity measures general routing distinctiveness, not domain encoding. Layers 8, 14, and 20 may be where routing is most variable (highest entropy), making all subjects appear “distinctive” at the same layers without actually routing differently.
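
One way to probe the second explanation is to rank layers by the entropy of their pooled routing distribution and check whether the high-entropy layers coincide with the KL-derived backbone. The sketch below shows the shape of that check only: the per-layer distributions are hypothetical toy values over four experts, chosen to illustrate the computation, and say nothing about what the real model does.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a routing distribution over experts."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def top_k(scores, k=3):
    """Return the set of k keys with the highest scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

# Hypothetical pooled routing distributions per layer (toy values, 4 experts).
pooled = {
    8:  [0.25, 0.25, 0.25, 0.25],
    9:  [0.85, 0.05, 0.05, 0.05],
    13: [0.70, 0.10, 0.10, 0.10],
    14: [0.30, 0.30, 0.20, 0.20],
    20: [0.40, 0.20, 0.20, 0.20],
}

kl_backbone = {8, 14, 20}  # backbone layers reported in the results table

entropy_by_layer = {layer: entropy(p) for layer, p in pooled.items()}
entropy_top3 = top_k(entropy_by_layer)

# A large overlap would be consistent with KL ranking tracking entropy
# rather than domain encoding; a small overlap would argue against it.
print(sorted(entropy_top3 & kl_backbone))
```

Running the same comparison on the logged 128-expert distributions would show whether the backbone is an artifact of where routing is most spread out.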

Practical Takeaway

For LoRA-based knowledge injection on this architecture:

Use generic mid-layer targeting (layers {8, 14, 20} for Qwen3.5-35B-A3B); subject-specific layer selection adds complexity without benefit.
Domain specificity may exist at the expert level within a layer, but exploiting it requires per-expert LoRA (currently infeasible due to Metal buffer limits at scale).
Do not assume router profiles imply targetable structure; a diverse routing profile does not mean subjects can be isolated by layer.

This negative result simplifies the LoRA targeting decision: on models with this routing structure, just target the highest-activity mid-layers generically. The “routing-aware” approach collapses to the generic approach, saving the engineering cost of per-subject routing analysis.

Model: Qwen3.5-35B-A3B (hybrid MoE, 128 experts, 40 MoE layers). Profiled: 14 MMLU-Pro subjects, 100 prompts per subject. Metric: KL divergence of per-subject routing vs. pooled routing, top-3 layers by KL, mean pairwise Jaccard overlap. Temperature 0, thinking disabled.
