We compared three granularities of bit allocation for MoE quantization. The finest granularity was the slowest, and no better in quality than the middle one.
Introduction
When quantizing a Mixture-of-Experts model, you face a granularity decision: at what level do you assign bit widths?
- Uniform: Every expert in every layer gets the same bits (e.g., all 4-bit)
- Layer-level: All experts within a layer share the same bits, but different layers can have different bits (e.g., layer 41 at 8-bit, layer 3 at 4-bit)
- Expert-level: Each individual expert gets its own bits (e.g., expert 47 in layer 41 at 8-bit, expert 12 in the same layer at 4-bit)
The intuition says finer granularity should be better — more precise allocation of precision where it matters. We tested all three approaches on Qwen3-235B-A22B (128 experts, 94 layers) and Qwen3.5-397B-A17B (512 experts, 60 layers). The results surprised us.
The Three Approaches
Uniform Quantization (Baseline)
Standard mlx_lm.convert with q_bits=4:
convert(hf_path=source, mlx_path=output, quantize=True, q_bits=4)
Every QuantizedSwitchLinear uses 4-bit weights. Simple, fast, well-supported.
Layer-Level Quantization
Uses a custom quant_predicate that returns different bit widths for different layers:
def layer_level_predicate(path, module):
    if "switch_mlp" in path:
        layer_idx = extract_layer(path)
        if layer_idx in high_priority_layers:
            return {"bits": 8, "group_size": 64}
    return True  # default 4-bit
All experts in a promoted layer share 8-bit precision. This works with standard QuantizedSwitchLinear — no custom kernels needed.
Expert-Level Quantization (MixedBitSwitchGLU)
Our custom implementation that groups experts by bit width within each layer (see Article 2):
# Per-expert classification from activation profiling:
# Expert 47: critical → 8-bit
# Expert 12: standard → 4-bit
# Expert 201: deprioritized → 2-bit
# Expert 389: prune → 0-bit (removed)
Requires custom MixedBitSwitchGLU with mask-and-combine dispatch.
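The per-expert classification above can be sketched as a thresholding pass over profiled criticality scores. This is a minimal illustration, not our actual profiling pipeline: the function name, score scale, and threshold values are all placeholders.

```python
def classify_experts(scores, hi=0.9, mid=0.2, lo=0.02):
    """Map per-expert criticality scores to bit widths.

    scores: dict expert_id -> criticality in [0, 1] (from activation profiling).
    The thresholds here are illustrative placeholders.
    Returns dict expert_id -> bits, where 0 means the expert is pruned.
    """
    bits = {}
    for eid, s in scores.items():
        if s >= hi:
            bits[eid] = 8      # critical
        elif s >= mid:
            bits[eid] = 4      # standard
        elif s >= lo:
            bits[eid] = 2      # deprioritized
        else:
            bits[eid] = 0      # prune
    return bits

# Mirrors the four example experts above:
print(classify_experts({47: 0.95, 12: 0.5, 201: 0.05, 389: 0.001}))
# → {47: 8, 12: 4, 201: 2, 389: 0}
```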
Layer Selection Strategy: "Any Critical Wins"
For layer-level quantization, we needed a rule to decide which layers get 8-bit. Our approach: count critical experts per layer from the activation profiling. If a layer has ≥ N critical experts, promote the entire layer to 8-bit.
For Qwen3.5-397B with threshold N=30:
Layers promoted to 8-bit (11 of 60):
Layer 41: 94 critical experts → 8-bit
Layer 40: 50 critical experts → 8-bit
Layer 36: 39 critical experts → 8-bit
Layer 19: 37 critical experts → 8-bit
Layer 29: 36 critical experts → 8-bit
Layer 27: 34 critical experts → 8-bit
Layer 25: 33 critical experts → 8-bit
Layer 37: 33 critical experts → 8-bit
Layer 26: 32 critical experts → 8-bit
Layer 28: 31 critical experts → 8-bit
Layer 45: 31 critical experts → 8-bit
The remaining 49 layers stay at 4-bit. This is straightforward — no custom kernels, no mask-and-combine, standard MLX inference path.
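The "any critical wins" rule itself is a one-liner. A minimal sketch, assuming you already have a per-layer count of critical experts from profiling (the toy counts below are illustrative, not the full Qwen3.5 table):

```python
def select_8bit_layers(critical_counts, threshold=30):
    """'Any critical wins' layer selection.

    critical_counts: dict layer_idx -> number of critical experts in that
    layer (from activation profiling). A layer is promoted to 8-bit when
    its count reaches the threshold; all others stay at the 4-bit default.
    """
    return {layer for layer, n in critical_counts.items() if n >= threshold}

# Toy counts for four layers; with N=30, only layers 40 and 41 are promoted.
counts = {41: 94, 40: 50, 3: 2, 17: 29}
print(sorted(select_8bit_layers(counts, threshold=30)))  # → [40, 41]
```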
Quality Comparison
Qwen3-235B-A22B — Five Versions
We built five versions varying the number of 8-bit layers, all using layer-level quantization:
| Version | 8-bit Layers | MMLU-Pro | ARC | GSM8K | HumanEval | Size |
|---|---|---|---|---|---|---|
| Uniform 4-bit | 0 | 72.1% | 96.0% | 88.7% | 78.7% | ~140 GB |
| v2 | 17 | 76.7% | 96.2% | 92.0% | 88.0% | 149 GB |
| v3 | 25 | 68.6% | 95.4% | 93.0% | 88.0% | 151 GB |
| v4 | 35 | 71.7% | 96.2% | 94.0% | 84.0% | 153 GB |
| v4b | 40 | 69.3% | 96.2% | 95.0% | 86.0% | 153 GB |
Official BF16 reference: MMLU-Pro 75.7%, GSM8K 91.5%, HumanEval 80.5%
Key Findings
1. v2 (17 layers at 8-bit) is the sweet spot. It scores higher than the BF16 reference on every benchmark we measured. Adding more 8-bit layers (v3, v4, v4b) does not improve — it actually degrades MMLU-Pro substantially.
2. The quality curve is non-monotonic. Going from 17→25→35→40 8-bit layers, MMLU-Pro goes 76.7%→68.6%→71.7%→69.3%. More precision doesn't always help.
3. GSM8K is the exception. Math performance does improve monotonically with more 8-bit layers: 92%→93%→94%→95%. This suggests mathematical reasoning benefits from higher precision across more layers, even as other capabilities degrade.
4. The v3 anomaly. Version 3 (25 layers) had a sharp MMLU drop to 68.6%. The v3→v3b control experiment (rerunning v2's exact config) reproduced v2's scores (76.7%), confirming the drop was real and caused by the additional 8 layers, not randomness.
Why Does More Precision Hurt?
Our hypothesis: promoting a layer from 4-bit to 8-bit changes the relative precision balance between that layer and its neighbors. When a critical layer is at 8-bit and its non-critical neighbors are at 4-bit, the model can rely on the critical layer for precision-sensitive decisions. But when too many layers are at 8-bit, the precision differential disappears, and the model may amplify quantization noise from the remaining 4-bit layers differently.
This is analogous to how adding contrast to some elements of an image makes them stand out, but adding contrast to everything returns you to a flat image.
Expert-Level vs Layer-Level
For Qwen3-235B, the expert-level MixedBitSwitchGLU (v2) achieved:
| Approach | MMLU-Pro | ARC | GSM8K | HumanEval | Speed | Size |
|---|---|---|---|---|---|---|
| Uniform 4-bit | 72.1% | 96.0% | 88.7% | 78.7% | ~16s | ~140 GB |
| Layer-level (v2) | 76.7% | 96.2% | 92.0% | 88.0% | ~16s | 149 GB |
| Expert-level (MixedBit) | 76.7% | 96.2% | 92.0% | 88.0% | ~21s | 149 GB |
The layer-level and expert-level approaches produce identical benchmark scores on Qwen3-235B. The expert-level allocation (23 experts at 8-bit, 10,845 at 4-bit, 365 at 2-bit, 799 pruned) is equivalent to the layer-level allocation (17 layers at 8-bit, 77 at 4-bit) for quality — but the expert-level version is about 30% slower (~21s vs ~16s) because of its custom mask-and-combine dispatch.
Qwen3.5-397B Comparison
For the larger 512-expert model, the speed difference is even more dramatic:
| Approach | Collapse Tests | Speed | Size | Kernels |
|---|---|---|---|---|
| Uniform 4-bit | 15/15 | ~8s | 209 GB | Standard |
| Layer-level (11 layers@8bit) | 15/15 | 7.7s | 236 GB | Standard |
| Expert-level (MixedBit) | 15/15 | 47.3s | 176 GB | Custom |
Expert-level is 6x slower than layer-level despite producing equivalent collapse test results. The only advantage is size: 176 GB vs 236 GB, a 60 GB saving from the 2-bit and pruned experts.
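The size advantage is simple arithmetic: a weight stored at b bits costs b/8 bytes, so 2-bit and pruned experts shrink the expert block directly. A back-of-envelope sketch with round toy numbers (not the real Qwen3.5 parameter counts; it also ignores group-wise scales/biases and non-expert weights):

```python
def expert_bytes(params_per_expert, bits):
    """Approximate storage for one expert's weights at a given bit width
    (ignores quantization scale/bias overhead)."""
    return params_per_expert * bits / 8

# Toy illustration: moving 1000 experts of 100M params each from uniform
# 4-bit to a 700/200/100 mix of 4-bit / 2-bit / pruned shrinks the
# expert weights by 20%.
p = 100_000_000
uniform = 1000 * expert_bytes(p, 4)
mixed = 700 * expert_bytes(p, 4) + 200 * expert_bytes(p, 2) + 100 * 0
print(round(1 - mixed / uniform, 2))  # → 0.2
```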
The Size-Speed-Quality Tradeoff
Visualizing the three approaches:
                 Quality
                    ▲
                    │
  Expert-level ─────┼───── Layer-level
     (slow)         │        (fast)
                    │
                    │
                    ├───── Uniform 4-bit
                    │        (fast, lower quality)
                    └──────────────────► Speed
                 slow                  fast

Size: Expert-level < Uniform < Layer-level
       (176 GB)     (209 GB)   (236 GB)
- If you optimize for quality + speed: layer-level wins (same quality as expert-level, standard kernel speed)
- If you optimize for quality + size: expert-level wins (same quality as layer-level, smallest size from 2-bit/pruned experts)
- If you optimize for speed and simplicity alone: uniform wins (no custom code, medium size, but the lowest quality of the three)
Decision Framework
When choosing a quantization granularity for your MoE model:
Use Uniform Quantization When:
- You don't have activation profiling data
- You need the simplest deployment
- Your model fits comfortably in memory at 4-bit
- You can accept ~3-5% quality loss vs BF16
Use Layer-Level Quantization When:
- You have activation profiling data
- You want maximum quality with standard kernels
- You can afford 5-15% extra model size (for 8-bit layers)
- You need interactive inference speed
Use Expert-Level Quantization When:
- Memory is your binding constraint
- You can tolerate substantial speed overhead (~30% on Qwen3-235B, ~6x on Qwen3.5-397B in our tests)
- You're running batch inference (speed matters less)
- You need the absolute smallest model possible
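The three checklists above can be folded into a tiny selector. This is a sketch only: the boolean inputs are crude simplifications of the real deployment constraints.

```python
def choose_granularity(has_profiling, memory_bound, batch_only):
    """Pick a quantization granularity from coarse deployment constraints.

    has_profiling: activation profiling data is available.
    memory_bound:  model size is the binding constraint.
    batch_only:    inference is offline/batch, so speed matters less.
    """
    if not has_profiling:
        return "uniform"       # no data to drive non-uniform allocation
    if memory_bound and batch_only:
        return "expert-level"  # smallest model, tolerate slow kernels
    return "layer-level"       # best quality/speed with standard kernels

print(choose_granularity(True, True, False))  # → layer-level
```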
Never Use More Than ~20-30% of Layers at 8-Bit
Our data consistently shows that promoting too many layers to 8-bit degrades quality. Stick to the critical layers identified by activation profiling and leave the rest at 4-bit.
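This guideline can be enforced mechanically before running a conversion. A minimal guard, where the 25% cap is our rule of thumb from the data above, not a hard limit:

```python
def check_promotion_budget(layers_8bit, num_layers, max_frac=0.25):
    """Fail fast when too many layers are promoted to 8-bit.

    Our benchmarks suggest quality degrades past roughly 20-30% of
    layers at 8-bit; max_frac encodes that rule of thumb.
    Returns the promoted fraction when within budget.
    """
    frac = len(layers_8bit) / num_layers
    if frac > max_frac:
        raise ValueError(
            f"{len(layers_8bit)}/{num_layers} layers at 8-bit "
            f"({frac:.0%}) exceeds the {max_frac:.0%} budget"
        )
    return frac

# The Qwen3.5-397B selection: 11 of 60 layers (~18%) is within budget.
print(round(check_promotion_budget(
    {19, 25, 26, 27, 28, 29, 36, 37, 40, 41, 45}, 60), 3))  # → 0.183
```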
Practical Implementation
Layer-level quantization requires only a custom predicate function for mlx_lm.convert:
import re

# Layers to promote to 8-bit (from activation profiling)
LAYERS_8BIT = {19, 25, 26, 27, 28, 29, 36, 37, 40, 41, 45}

def layer_predicate(path: str, module) -> bool | dict:
    """Custom quantization predicate for layer-level MoE quantization."""
    if "switch_mlp" in path:
        # Extract layer index from path like:
        # "language_model.model.layers.41.mlp.switch_mlp.gate_proj"
        match = re.search(r"layers\.(\d+)\.", path)
        if match and int(match.group(1)) in LAYERS_8BIT:
            return {"bits": 8, "group_size": 64}
    return True  # Default quantization

convert(
    hf_path=source,
    mlx_path=output,
    quantize=True,
    q_bits=4,
    q_group_size=128,
    quant_predicate=layer_predicate,
    dtype="bfloat16",
)
No custom inference code. No custom kernels. Standard mlx_lm.load and mlx_lm.generate work unchanged.
Conclusion
The granularity of bit allocation in MoE quantization has diminishing returns:
- Uniform → Layer-level: Significant quality improvement (+4.6% MMLU-Pro), minimal speed cost, standard kernels
- Layer-level → Expert-level: No measurable quality improvement, a speed cost ranging from ~30% (Qwen3-235B) to ~6x (Qwen3.5-397B), and a requirement for custom kernels
Layer-level quantization is the practical sweet spot. It captures the benefit of activation profiling (promoting critical layers) without the engineering and performance costs of per-expert bit allocation.
The biggest surprise: more 8-bit layers can hurt. The relationship between precision allocation and model quality is non-monotonic. Profile your experts, promote only the layers with the most critical experts, and leave the rest alone.
Next in this series: Why Collapse Tests Are Insufficient for Quantization Quality Assessment — how a model can score 15/15 on automated tests while producing Chinese characters in Spanish translations.