Layer-Level vs Expert-Level Granularity in MoE Quantization

February 2026 · Black Sheep AI Research

We compared three granularities of bit allocation for MoE quantization. The finest granularity was the slowest and barely the best.

Introduction

When quantizing a Mixture-of-Experts model, you face a granularity decision: at what level do you assign bit widths?

Intuition says finer granularity should be better: allocate bits exactly where they matter. We tested all three approaches on Qwen3-235B-A22B (128 experts, 94 layers) and Qwen3.5-397B-A17B (512 experts, 60 layers). The results surprised us.

The Three Approaches

Uniform Quantization (Baseline)

Standard mlx_lm.convert with q_bits=4:


from mlx_lm import convert

convert(hf_path=source, mlx_path=output, quantize=True, q_bits=4)

Every QuantizedSwitchLinear uses 4-bit weights. Simple, fast, well-supported.

Layer-Level Quantization

Uses a custom quant_predicate that returns different bit widths for different layers:


def layer_level_predicate(path, module):
    if "switch_mlp" in path:
        layer_idx = extract_layer(path)  # parse the layer index from the path
        if layer_idx in high_priority_layers:
            return {"bits": 8, "group_size": 64}
    return True  # fall through to the default 4-bit settings

All experts in a promoted layer share 8-bit precision. This works with standard QuantizedSwitchLinear — no custom kernels needed.

Expert-Level Quantization (MixedBitSwitchGLU)

Our custom implementation that groups experts by bit width within each layer (see Article 2):


# Per-expert classification from activation profiling:
# Expert 47: critical → 8-bit
# Expert 12: standard → 4-bit
# Expert 201: deprioritized → 2-bit
# Expert 389: prune → 0-bit (removed)

Requires custom MixedBitSwitchGLU with mask-and-combine dispatch.
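The dispatch pattern can be sketched without any framework. A purely illustrative, framework-free version (the names, shapes, and per-token loop are assumptions for clarity, not the article's MixedBitSwitchGLU API, which batches each group through one quantized kernel):

```python
# Minimal sketch of mask-and-combine dispatch. Experts are grouped by bit
# width; each group would run one batched matmul at its precision, and
# per-token masks route each token to its assigned expert's output.

def mask_and_combine(x, expert_ids, expert_groups, weights):
    """x: list of token vectors; expert_ids: routed expert per token;
    expert_groups: {bit_width: set of expert indices};
    weights: {expert_idx: weight matrix as list of rows (d_in x d_out)}."""
    out = [None] * len(x)
    for bits, experts in expert_groups.items():
        # In the real kernel, all experts in this group share one quantized
        # matmul at `bits` precision; here each token is handled in turn.
        for t, (vec, e) in enumerate(zip(x, expert_ids)):
            if e in experts:
                w = weights[e]
                out[t] = [sum(vec[i] * w[i][j] for i in range(len(vec)))
                          for j in range(len(w[0]))]
    return out
```

Pruned (0-bit) experts simply have no group, so tokens routed to them would be re-routed or dropped upstream.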

Layer Selection Strategy: Critical-Expert Threshold

For layer-level quantization, we needed a rule to decide which layers get 8-bit. Our approach: count the critical experts in each layer from the activation profiling, and if a layer has at least N critical experts, promote the entire layer to 8-bit.

For Qwen3.5-397B with threshold N=30:


Layers promoted to 8-bit (11 of 60):
Layer 41:  94 critical experts  → 8-bit
Layer 40:  50 critical experts  → 8-bit
Layer 36:  39 critical experts  → 8-bit
Layer 19:  37 critical experts  → 8-bit
Layer 29:  36 critical experts  → 8-bit
Layer 27:  34 critical experts  → 8-bit
Layer 25:  33 critical experts  → 8-bit
Layer 37:  33 critical experts  → 8-bit
Layer 26:  32 critical experts  → 8-bit
Layer 28:  31 critical experts  → 8-bit
Layer 45:  31 critical experts  → 8-bit

The remaining 49 layers stay at 4-bit. This is straightforward — no custom kernels, no mask-and-combine, standard MLX inference path.
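The selection rule itself is only a few lines. A sketch, with toy data standing in for the profiling output (the real (layer, expert) criticality labels come from the activation-profiling pass):

```python
from collections import Counter

def select_8bit_layers(critical_experts, threshold):
    """critical_experts: iterable of (layer_idx, expert_idx) pairs flagged
    critical by profiling. Returns the layer indices promoted to 8-bit."""
    per_layer = Counter(layer for layer, _expert in critical_experts)
    return {layer for layer, n in per_layer.items() if n >= threshold}

# Toy example: layer 41 has 94 critical experts, layer 3 has one.
labels = [(41, e) for e in range(94)] + [(3, 0)]
promoted = select_8bit_layers(labels, threshold=30)  # → {41}
```

Everything not in the returned set keeps the default 4-bit settings.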

Quality Comparison

Qwen3-235B-A22B — Five Versions

We built five versions varying the number of 8-bit layers, all using layer-level quantization:

Version        8-bit layers  MMLU-Pro  ARC    GSM8K  HumanEval  Size
Uniform 4-bit  0             72.1%     96.0%  88.7%  78.7%      ~140 GB
v2             17            76.7%     96.2%  92.0%  88.0%      149 GB
v3             25            68.6%     95.4%  93.0%  88.0%      151 GB
v4             35            71.7%     96.2%  94.0%  84.0%      153 GB
v4b            40            69.3%     96.2%  95.0%  86.0%      153 GB

Official BF16 reference: MMLU-Pro 75.7%, GSM8K 91.5%, HumanEval 80.5%

Key Findings

1. v2 (17 layers at 8-bit) is the sweet spot. It scores above the BF16 reference on every benchmark for which we have a reference number (MMLU-Pro, GSM8K, HumanEval). Adding more 8-bit layers (v3, v4, v4b) does not help; it substantially degrades MMLU-Pro.

2. The quality curve is non-monotonic. Going from 17→25→35→40 8-bit layers, MMLU-Pro goes 76.7%→68.6%→71.7%→69.3%. More precision doesn't always help.

3. GSM8K is the exception. Math performance (GSM8K) does improve monotonically with more 8-bit layers: 92%→93%→94%→95%. This suggests math reasoning benefits from higher precision across more layers, even as other capabilities degrade.

4. The v3 anomaly. Version 3 (25 layers) showed a sharp MMLU-Pro drop to 68.6%. The v3b control experiment (rerunning v2's exact config) reproduced v2's 76.7%, confirming the drop was real and caused by the 8 additional 8-bit layers, not run-to-run randomness.

Why Does More Precision Hurt?

Our hypothesis: promoting a layer from 4-bit to 8-bit changes the relative precision balance between that layer and its neighbors. When a critical layer is at 8-bit and its non-critical neighbors are at 4-bit, the model can rely on the critical layer for precision-sensitive decisions. But when too many layers are at 8-bit, the precision differential disappears, and the model may amplify quantization noise from the remaining 4-bit layers differently.

This is analogous to how adding contrast to some elements of an image makes them stand out, but adding contrast to everything returns you to a flat image.

Expert-Level vs Layer-Level

For Qwen3-235B, we compared the expert-level MixedBitSwitchGLU against the layer-level v2 configuration:

Approach                 MMLU-Pro  ARC    GSM8K  HumanEval  Speed  Size
Uniform 4-bit            72.1%     96.0%  88.7%  78.7%      ~16s   ~140 GB
Layer-level (v2)         76.7%     96.2%  92.0%  88.0%      ~16s   149 GB
Expert-level (MixedBit)  76.7%     96.2%  92.0%  88.0%      ~21s   149 GB

The layer-level and expert-level approaches produce identical benchmark scores on Qwen3-235B. For quality, the expert-level allocation (23 experts at 8-bit, 10,845 at 4-bit, 365 at 2-bit, 799 pruned) is equivalent to the layer-level allocation (17 layers at 8-bit, 77 at 4-bit), but the layer-level version is roughly 30% faster (~16 s vs ~21 s) because it uses standard kernels.
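The expert-level allocation accounts for every expert in the model, which is worth sanity-checking:

```python
# Cross-check the expert-level bucket counts for Qwen3-235B-A22B
# (94 layers x 128 experts per layer):
buckets = {"8-bit": 23, "4-bit": 10_845, "2-bit": 365, "pruned": 799}
total = sum(buckets.values())
assert total == 94 * 128  # 12,032 experts, all accounted for
```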

Qwen3.5-397B Comparison

For the larger 512-expert model, the speed difference is even more dramatic:

Approach                      Collapse Tests  Speed  Size    Kernels
Uniform 4-bit                 15/15           ~8s    209 GB  Standard
Layer-level (11 layers@8bit)  15/15           7.7s   236 GB  Standard
Expert-level (MixedBit)       15/15           47.3s  176 GB  Custom

Expert-level is 6x slower than layer-level despite producing equivalent collapse test results. The only advantage is size: 176 GB vs 236 GB, a 60 GB saving from the 2-bit and pruned experts.

The Size-Speed-Quality Tradeoff

Visualizing the three approaches:


                    Quality
                      ▲
                      │
    Expert-level ─────┤───── Layer-level
         (slow)       │        (fast)
                      │
                      │
                      │
    Uniform 4-bit ────┤
                      │
                      └──────────────────► Speed
                   slow              fast


Size:  Expert-level < Uniform < Layer-level
       (176 GB)      (209 GB)   (236 GB)
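The size ordering follows from bytes-per-weight arithmetic. A back-of-envelope estimator, assuming a 2-byte scale and 2-byte bias per quantization group (typical for MLX-style affine quantization; the constants are assumptions, not figures from this article):

```python
# Approximate storage cost of one weight under group-wise affine
# quantization: the packed bits plus the amortized per-group scale and bias.

def bytes_per_weight(bits, group_size, scale_bytes=2, bias_bytes=2):
    return bits / 8 + (scale_bytes + bias_bytes) / group_size

cost_4bit = bytes_per_weight(4, 128)  # 0.53125 bytes/weight
cost_8bit = bytes_per_weight(8, 64)   # 1.0625 bytes/weight
```

An 8-bit group-64 weight costs about twice its 4-bit group-128 counterpart, which is why promoting layers grows the model while 2-bit and pruned experts shrink it.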

Decision Framework

When choosing a quantization granularity for your MoE model:

Use Uniform Quantization When:

- Simplicity matters most: one mlx_lm.convert call, no profiling data needed
- A uniform 4-bit model already meets your quality bar
- You want the smallest engineering surface and full tooling support

Use Layer-Level Quantization When:

- Activation profiling shows a handful of layers concentrate the critical experts
- You want the quality gains of targeted precision without custom kernels
- Inference speed matters: it runs on the standard MLX inference path

Use Expert-Level Quantization When:

- Model size is the binding constraint: 2-bit and pruned experts saved 60 GB here
- You can accept much slower inference and maintaining custom kernels

Never Use More Than ~20-30% of Layers at 8-Bit

Our data consistently shows that promoting too many layers to 8-bit degrades quality. Stick to the critical layers identified by activation profiling and leave the rest at 4-bit.

Practical Implementation

Layer-level quantization requires only a custom predicate function for mlx_lm.convert:


import re
from mlx_lm import convert

# Layers to promote to 8-bit (from activation profiling)
LAYERS_8BIT = {19, 25, 26, 27, 28, 29, 36, 37, 40, 41, 45}

def layer_predicate(path: str, module) -> bool | dict:
    """Custom quantization predicate for layer-level MoE quantization."""
    if "switch_mlp" in path:
        # Extract layer index from path like:
        # "language_model.model.layers.41.mlp.switch_mlp.gate_proj"
        match = re.search(r"layers\.(\d+)\.", path)
        if match and int(match.group(1)) in LAYERS_8BIT:
            return {"bits": 8, "group_size": 64}
    return True  # Default quantization

convert(
    hf_path=source,
    mlx_path=output,
    quantize=True,
    q_bits=4,
    q_group_size=128,
    quant_predicate=layer_predicate,
    dtype="bfloat16",
)

No custom inference code. No custom kernels. Standard mlx_lm.load and mlx_lm.generate work unchanged.
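Because the predicate is plain Python, it can be sanity-checked without loading a model. The paths below mimic the naming shown in the code comment above and are illustrative:

```python
import re

LAYERS_8BIT = {19, 25, 26, 27, 28, 29, 36, 37, 40, 41, 45}

def layer_predicate(path, module=None):
    if "switch_mlp" in path:
        match = re.search(r"layers\.(\d+)\.", path)
        if match and int(match.group(1)) in LAYERS_8BIT:
            return {"bits": 8, "group_size": 64}
    return True

# A promoted layer's expert weights get the 8-bit override...
assert layer_predicate("model.layers.41.mlp.switch_mlp.gate_proj") == \
    {"bits": 8, "group_size": 64}
# ...while other layers and non-expert weights keep the 4-bit defaults.
assert layer_predicate("model.layers.2.mlp.switch_mlp.gate_proj") is True
assert layer_predicate("model.layers.41.self_attn.q_proj") is True
```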

Conclusion

The granularity of bit allocation in MoE quantization has diminishing returns:

- Uniform → layer-level: large quality gains (on Qwen3-235B, MMLU-Pro 72.1% → 76.7% and HumanEval 78.7% → 88.0%) for a ~9 GB size increase
- Layer-level → expert-level: identical benchmark scores, substantially slower inference, and only a size saving to show for the extra machinery

Layer-level quantization is the practical sweet spot. It captures the benefit of activation profiling (promoting critical layers) without the engineering and performance costs of per-expert bit allocation.

The biggest surprise: more 8-bit layers can hurt. The relationship between precision allocation and model quality is non-monotonic. Profile your experts, promote only the layers with the most critical experts, and leave the rest alone.


Next in this series: Why Collapse Tests Are Insufficient for Quantization Quality Assessment — how a model can score 15/15 on automated tests while producing Chinese characters in Spanish translations.

