We profiled MoE gate routing across 14 academic domains expecting to find subject-specific “knowledge layers.” Instead, all subjects route through the same three layers. Domain-specific LoRA targeting is indistinguishable from generic mid-layer targeting.
The Hypothesis
Mixture-of-Experts models route different tokens to different experts. An appealing idea for targeted fine-tuning: if “chemistry knowledge” routes through specific layers and experts, you could inject chemistry knowledge by training LoRA only on those layers, leaving other domains undisturbed. This would enable domain-specific knowledge injection without cross-domain interference.
We tested this on Qwen3.5-35B-A3B (128 experts per MoE layer, 40 MoE layers) by profiling router gate activations across 14 MMLU-Pro subjects with 100 questions per subject.
Methodology
For each subject, we logged per-layer expert activation distributions across 100 domain-specific prompts. We computed per-subject specificity as the KL divergence between each subject’s routing distribution and the pooled (all-subjects) distribution at each layer. For each subject, the top-3 layers by KL divergence represent the layers where that subject’s routing is most distinctive from the average.
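The specificity computation described above can be sketched in a few lines. This is a minimal illustration, not the profiling code itself: the nested `counts[subject][layer]` layout, the function names, and pooling by summing raw counts across subjects are all assumptions made for the example.

```python
import math

def normalise(counts):
    """Turn per-expert selection counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def top_k_layers(counts, k=3):
    """counts[subject][layer] is a list of per-expert selection counts.

    Returns {subject: [k layer indices, highest KL first]}.
    """
    subjects = list(counts)
    n_layers = len(counts[subjects[0]])

    # Pooled distribution per layer: sum counts over subjects, then normalise
    # (assumption: pooling is done on raw counts, not on averaged distributions).
    pooled = []
    for layer in range(n_layers):
        per_expert = [sum(col) for col in zip(*(counts[s][layer] for s in subjects))]
        pooled.append(normalise(per_expert))

    result = {}
    for s in subjects:
        kl = [kl_divergence(normalise(counts[s][layer]), pooled[layer])
              for layer in range(n_layers)]
        # Rank layers by how far this subject's routing sits from the pooled routing.
        result[s] = sorted(range(n_layers), key=lambda l: kl[l], reverse=True)[:k]
    return result

# Toy data: 2 subjects, 3 layers, 2 experts per layer (hypothetical counts).
toy = {
    "chemistry": [[5, 5], [9, 1], [6, 4]],
    "law":       [[5, 5], [1, 9], [4, 6]],
}
print(top_k_layers(toy, k=2))  # {'chemistry': [1, 2], 'law': [1, 2]}
```

In the toy data both subjects are most distinctive at layer 1, even though they prefer opposite experts there. Ranking layers by KL alone cannot tell those two situations apart.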
The key metric is mean pairwise Jaccard overlap of the top-3 layer sets across all subject pairs. If subjects route through different layers, Jaccard should be low (<0.3). If they share the same layers, Jaccard will be high (>0.6).
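To calibrate these thresholds, it helps to note that the Jaccard index of two three-element sets can take only four values. Assuming the standard definition, with $m$ the number of layers a pair of subjects shares:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad |A| = |B| = 3
\;\Rightarrow\;
J = \frac{m}{6 - m} \in \left\{ 0,\ \tfrac{1}{5},\ \tfrac{1}{2},\ 1 \right\}
\quad \text{for } m = |A \cap B| \in \{0, 1, 2, 3\}.
```

A mean below 0.3 therefore requires most pairs to share at most one layer, while a mean above 0.6 requires a substantial fraction of pairs to be identical.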
Results: The Shared Backbone
| Subject | Top-3 Routing Layers |
|---|---|
| Chemistry | 20, 8, 14 |
| Math | 14, 20, 18 |
| Business | 14, 19, 8 |
| Engineering | 8, 9, 14 |
| Physics | 8, 20, 14 |
| Law | 14, 19, 15 |
| Other | 20, 8, 14 |
| Computer Science | 20, 8, 15 |
| Health | 20, 8, 14 |
| History | 14, 20, 9 |
| Philosophy | 8, 20, 14 |
| Economics | 8, 20, 14 |
| Psychology | 20, 8, 9 |
| Biology | 20, 8, 13 |
Layers 8, 14, and 20 form a shared routing backbone: each appears in 11 of 14 subjects’ top-3. In all but one subject the variation is confined to a single slot (Law is the exception, keeping only layer 14), and the alternatives (9, 13, 15, 18, 19) appear sporadically with no subject-specific consistency.
The Numbers
| Metric | Value | Interpretation |
|---|---|---|
| Mean pairwise Jaccard (top-3) | 0.496 | Moderate overlap; a Jaccard of 0.5 corresponds to a pair sharing two of its three layers |
| Layers appearing in ≥11/14 subjects | 3 (layers 8, 14, 20) | A fixed backbone, not subject-specific routing |
| Subjects with exact {8,14,20} top-3 | 6 / 14 | Nearly half the subjects are identical |
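
The three figures in this table can be recomputed from the top-3 sets listed in the results table. The sketch below uses only that data; the helper names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Top-3 layer sets copied from the results table.
TOP3 = {
    "Chemistry": {20, 8, 14}, "Math": {14, 20, 18}, "Business": {14, 19, 8},
    "Engineering": {8, 9, 14}, "Physics": {8, 20, 14}, "Law": {14, 19, 15},
    "Other": {20, 8, 14}, "Computer Science": {20, 8, 15}, "Health": {20, 8, 14},
    "History": {14, 20, 9}, "Philosophy": {8, 20, 14}, "Economics": {8, 20, 14},
    "Psychology": {20, 8, 9}, "Biology": {20, 8, 13},
}

def jaccard(a, b):
    """Jaccard index of two sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

def summarise(top3, backbone_min=11):
    # Mean Jaccard over all unordered subject pairs (91 pairs for 14 subjects).
    pairs = list(combinations(top3.values(), 2))
    mean_j = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    # How many subjects include each layer in their top-3.
    layer_counts = Counter(layer for layers in top3.values() for layer in layers)
    backbone = sorted(l for l, c in layer_counts.items() if c >= backbone_min)
    # Subjects whose top-3 is exactly the backbone set.
    exact = sum(1 for layers in top3.values() if layers == set(backbone))
    return mean_j, backbone, exact

mean_j, backbone, exact = summarise(TOP3)
print(f"{mean_j:.3f}", backbone, exact)  # 0.496 [8, 14, 20] 6
```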
What This Means for Targeted LoRA
The premise of “subject-specific LoRA targeting” requires subjects to route through different layers. When all subjects share the same top layers, targeting “chemistry layers” vs. “physics layers” vs. “generic knowledge layers” is the same operation: you are always targeting {8, 14, 20}.
This is a negative result. At the layer level, MoE routing on this model does not provide a subject-specific signal that can be exploited for targeted fine-tuning. The measured overlap (0.496) nominally passes our pre-registered threshold (Jaccard < 0.6), but the pass is driven entirely by the variable third layer; the core backbone is universal.
Why the Backbone Is Universal
Two likely explanations:
- Load balancing during pre-training pushes each layer’s router toward uniform expert usage. This auxiliary load-balancing loss may keep any single layer from becoming a domain-specific bottleneck.
- KL-based specificity measures general routing distinctiveness, not domain encoding. Layers 8, 14, and 20 may be where routing is most variable (highest entropy), making all subjects appear “distinctive” at the same layers without actually routing differently.
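One way to probe the second explanation is to rank layers by the entropy of their pooled routing distribution and compare that ranking with the KL-based top-3. The sketch below assumes "most variable" means highest Shannon entropy of the pooled expert distribution, which is only one possible reading; the data layout and function names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a probability distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def top_layers_by_entropy(pooled_counts, k=3):
    """pooled_counts[layer] holds per-expert selection counts summed over all subjects.

    Returns the k layer indices with the highest routing entropy, highest first.
    """
    scores = []
    for counts in pooled_counts:
        total = sum(counts)
        scores.append(entropy([c / total for c in counts]))
    return sorted(range(len(scores)), key=lambda l: scores[l], reverse=True)[:k]

# Toy pooled counts for 4 layers with 4 experts each (hypothetical numbers).
pooled = [
    [10, 0, 0, 0],     # all tokens on one expert: entropy 0
    [4, 3, 2, 1],
    [3, 3, 2, 2],
    [25, 25, 25, 25],  # uniform: maximum entropy, log(4)
]
print(top_layers_by_entropy(pooled, k=2))  # [3, 2]
```

If the entropy ranking on the real profile also returned layers 8, 14, and 20, that would be consistent with the backbone reflecting general routing variability rather than domain encoding. It would not prove it.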
Practical Takeaway
For LoRA-based knowledge injection on this architecture:
- Use generic mid-layer targeting (layers {8, 14, 20} for Qwen3.5-35B-A3B); subject-specific layer selection adds complexity without benefit.
- Domain specificity may exist at the expert level within a layer, but exploiting it requires per-expert LoRA (currently infeasible due to Metal buffer limits at scale).
- Do not assume router profiles imply targetable structure; a diverse routing profile does not mean subjects can be isolated by layer.
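In practice, generic targeting amounts to filtering module names by layer index before attaching adapters. The sketch below is framework-agnostic and rests on assumptions: the `model.layers.<i>.…` naming scheme and the `q_proj`/`v_proj` projection names are hypothetical placeholders, and the real names depend on the model implementation and LoRA library in use.

```python
import re

BACKBONE_LAYERS = {8, 14, 20}
# Hypothetical projection names; substitute whatever the model actually exposes.
PROJECTIONS = ("q_proj", "v_proj")

# Capture the layer index; anchoring on the "layers." segment keeps
# "layers.8." from matching "layers.18." by accident.
_LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)\.")

def select_lora_targets(module_names, layers=BACKBONE_LAYERS, projections=PROJECTIONS):
    """Return the module names that sit in the chosen layers and end in a chosen projection."""
    selected = []
    for name in module_names:
        match = _LAYER_RE.search(name)
        if match is None:
            continue  # not inside a numbered layer (e.g. embeddings)
        leaf = name.rsplit(".", 1)[-1]
        if int(match.group(1)) in layers and leaf in projections:
            selected.append(name)
    return selected

names = [
    "model.embed_tokens",
    "model.layers.8.self_attn.q_proj",
    "model.layers.18.self_attn.q_proj",
    "model.layers.14.mlp.gate",
    "model.layers.20.self_attn.v_proj",
]
print(select_lora_targets(names))
# ['model.layers.8.self_attn.q_proj', 'model.layers.20.self_attn.v_proj']
```

Because the layer set is a plain argument, the same filter would serve a per-subject set if a different model did show subject-specific layers.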
This negative result simplifies the LoRA targeting decision: on models with this routing structure, just target the mid-layers with the highest routing KL generically. The “routing-aware” approach collapses to the generic approach, saving the engineering cost of per-subject routing analysis.
Model: Qwen3.5-35B-A3B (hybrid MoE, 128 experts, 40 MoE layers). Profiled: 14 MMLU-Pro subjects, 100 prompts per subject. Metric: KL divergence of per-subject routing vs. pooled routing, top-3 layers by KL, mean pairwise Jaccard overlap. Temperature 0, thinking disabled.