
MoE Routing Layers Converge Across Subjects: No Free Lunch for Domain-Specific Targeting

May 2026 · Black Sheep AI Research

We profiled MoE router gate activations across 14 academic domains, expecting to find subject-specific “knowledge layers.” Instead, the same three layers (8, 14, and 20) dominate across subjects, each appearing in 11 of 14 subjects’ top-3. Domain-specific LoRA targeting is therefore effectively indistinguishable from generic mid-layer targeting.

The Hypothesis

Mixture-of-Experts models route different tokens to different experts. An appealing idea for targeted fine-tuning: if “chemistry knowledge” routes through specific layers and experts, you could inject chemistry knowledge by training LoRA only on those layers, leaving other domains undisturbed. This would enable domain-specific knowledge injection without cross-domain interference.

We tested this on Qwen3.5-35B-A3B (128 experts per MoE layer, 40 MoE layers) by profiling router gate activations across 14 MMLU-Pro subjects with 100 questions per subject.

Methodology

For each subject, we logged per-layer expert activation distributions across 100 domain-specific prompts. We computed per-subject specificity as the KL divergence between each subject’s routing distribution and the pooled (all-subjects) distribution at each layer. For each subject, the top-3 layers by KL divergence represent the layers where that subject’s routing is most distinctive from the average.
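The per-layer specificity computation described above can be sketched as follows. This is a minimal illustration, not the profiling code used for the study: the data layout (a dict mapping layer index to per-expert activation counts), the function names, and the toy numbers are all assumptions, and the pooled distribution is assumed to be the sum of the per-subject counts.

```python
import math

def normalize(counts):
    """Turn raw per-expert activation counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) over experts. Terms with p_i == 0 contribute nothing.
    Assumes q_i > 0 wherever p_i > 0, which holds when q pools p's own counts."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def top_k_layers(subject_counts, pooled_counts, k=3):
    """Rank layers by how far the subject's routing departs from the pooled routing.
    Both arguments are hypothetical dicts: {layer_index: [count per expert]}."""
    scores = {
        layer: kl_divergence(normalize(subject_counts[layer]),
                             normalize(pooled_counts[layer]))
        for layer in subject_counts
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Toy data (made up): 2 subjects, 3 layers, 4 experts per layer.
chem = {0: [10, 10, 10, 10], 1: [40, 0, 0, 0], 2: [20, 10, 5, 5]}
law = {0: [10, 10, 10, 10], 1: [0, 40, 0, 0], 2: [10, 20, 5, 5]}
pooled = {layer: [a + b for a, b in zip(chem[layer], law[layer])] for layer in chem}

chem_top, chem_scores = top_k_layers(chem, pooled, k=2)
```

In the toy data, layer 0 routes identically for both subjects, so its KL divergence is zero, while layer 1 sends each subject to a different expert and ranks first.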

The key metric is mean pairwise Jaccard overlap of the top-3 layer sets across all subject pairs. If subjects route through different layers, Jaccard should be low (<0.3). If they share the same layers, Jaccard will be high (>0.6).

Results: The Shared Backbone

| Subject | Top-3 Routing Layers |
| --- | --- |
| Chemistry | 20, 8, 14 |
| Math | 14, 20, 18 |
| Business | 14, 19, 8 |
| Engineering | 8, 9, 14 |
| Physics | 8, 20, 14 |
| Law | 14, 19, 15 |
| Other | 20, 8, 14 |
| Computer Science | 20, 8, 15 |
| Health | 20, 8, 14 |
| History | 14, 20, 9 |
| Philosophy | 8, 20, 14 |
| Economics | 8, 20, 14 |
| Psychology | 20, 8, 9 |
| Biology | 20, 8, 13 |

Layers 8, 14, and 20 form a shared routing backbone: each appears in the top-3 of 11 of the 14 subjects. Most of the variation is confined to a single slot (Law is the one subject that keeps only one backbone layer), and the alternatives that fill it (9, 13, 15, 18, 19) appear sporadically, with no subject-specific consistency.

The Numbers

| Metric | Value | Interpretation |
| --- | --- | --- |
| Mean pairwise Jaccard (top-3) | 0.496 | Moderate overlap; a Jaccard of 0.5 corresponds to a pair sharing two of its three layers |
| Layers appearing in ≥11/14 subjects | 3 (layers 8, 14, 20) | A fixed backbone, not subject-specific routing |
| Subjects with exact {8, 14, 20} top-3 | 6 / 14 | Nearly half the subjects have identical top-3 sets |
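These summary figures can be recomputed directly from the per-subject top-3 table in the results. The sketch below hard-codes that table and uses only the standard library; the variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Top-3 routing layers per subject, copied from the results table.
TOP3 = {
    "Chemistry": {20, 8, 14},
    "Math": {14, 20, 18},
    "Business": {14, 19, 8},
    "Engineering": {8, 9, 14},
    "Physics": {8, 20, 14},
    "Law": {14, 19, 15},
    "Other": {20, 8, 14},
    "Computer Science": {20, 8, 15},
    "Health": {20, 8, 14},
    "History": {14, 20, 9},
    "Philosophy": {8, 20, 14},
    "Economics": {8, 20, 14},
    "Psychology": {20, 8, 9},
    "Biology": {20, 8, 13},
}

def jaccard(a, b):
    """Jaccard overlap of two layer sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# All unordered subject pairs: C(14, 2) = 91.
pairs = list(combinations(TOP3.values(), 2))
mean_jaccard = sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# How many subjects list each layer in their top-3.
layer_counts = Counter(layer for layers in TOP3.values() for layer in layers)
backbone = sorted(layer for layer, n in layer_counts.items() if n >= 11)

# Subjects whose top-3 is exactly the backbone.
exact_matches = sum(1 for layers in TOP3.values() if layers == {8, 14, 20})

print(round(mean_jaccard, 3))  # 0.496
print(backbone)                # [8, 14, 20]
print(exact_matches)           # 6
```

Because every set has exactly three layers, each pairwise Jaccard can only take the values 0, 0.2, 0.5, or 1, so the mean of 0.496 reflects a mix of identical pairs, pairs sharing two layers, and pairs sharing one or none.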

What This Means for Targeted LoRA

The premise of “subject-specific LoRA targeting” requires subjects to route through different layers. When all subjects share the same top layers, targeting “chemistry layers” vs. “physics layers” vs. “generic knowledge layers” is the same operation: you are always targeting {8, 14, 20}.

This is a negative result. At the layer level, MoE routing on this model does not provide a subject-specific signal that can be exploited for targeted fine-tuning. The mean Jaccard of 0.496 nominally clears our pre-registered threshold (Jaccard < 0.6), but that margin comes from the variable non-backbone slot; the core backbone itself is universal.

Why the Backbone Is Universal

Two likely explanations:

Practical Takeaway

For LoRA-based knowledge injection on this architecture:

This negative result simplifies the LoRA targeting decision: on models with this routing structure, just target the highest-activity mid-layers generically. The “routing-aware” approach collapses to the generic approach, saving the engineering cost of per-subject routing analysis.
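As a rough illustration of what generic targeting could look like in practice, the sketch below builds a regular expression that selects adapter target modules only in the backbone layers. Everything about the module naming is an assumption: the path layout `...layers.<idx>.<block>.<proj>`, the projection names `q_proj` and `v_proj`, and the idea that the layer numbers reported here map one-to-one onto block indices in the checkpoint. Some LoRA libraries (PEFT, for example) accept a regex string for target selection, but check the actual module names of the model before relying on a pattern like this.

```python
import re

BACKBONE_LAYERS = [8, 14, 20]  # shared top layers from the results

def layer_pattern(layers, projections=("q_proj", "v_proj")):
    """Build a regex matching the chosen projections in the chosen layers only.
    The module path layout is hypothetical; adapt it to the real checkpoint."""
    layer_alt = "|".join(str(i) for i in layers)
    proj_alt = "|".join(projections)
    return rf".*\.layers\.({layer_alt})\..*\.({proj_alt})"

pattern = re.compile(layer_pattern(BACKBONE_LAYERS))

# Hypothetical module names, for demonstration only.
module_names = [
    "model.layers.8.self_attn.q_proj",
    "model.layers.18.self_attn.q_proj",
    "model.layers.20.self_attn.v_proj",
    "model.layers.20.mlp.gate",
]
selected = [name for name in module_names if pattern.fullmatch(name)]
print(selected)
```

Using a full match keeps layer 18 from being picked up by the pattern for layer 8, which is an easy mistake to make with substring matching.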


Model: Qwen3.5-35B-A3B (hybrid MoE, 128 experts, 40 MoE layers). Profiled: 14 MMLU-Pro subjects, 100 prompts per subject. Metric: KL divergence of per-subject routing vs. pooled routing, top-3 layers by KL, mean pairwise Jaccard overlap. Temperature 0, thinking disabled.
