We compared three granularities of bit allocation for MoE quantization. The finest granularity was the slowest, and no better in quality than the middle one.
Introduction
When quantizing a Mixture-of-Experts model, you face a granularity decision: at what level do you assign bit widths?
- Uniform: Every expert in every layer gets the same bits (e.g., all 4-bit)
- Layer-level: All experts within a layer share the same bits, but different layers can have different bits (e.g., layer 41 at 8-bit, layer 3 at 4-bit)
- Expert-level: Each individual expert gets its own bits (e.g., expert 47 in layer 41 at 8-bit, expert 12 in the same layer at 4-bit)
The intuition says finer granularity should be better — more precise allocation of precision where it matters. We tested all three approaches on Qwen3-235B-A22B (128 experts, 94 layers) and Qwen3.5-397B-A17B (512 experts, 60 layers). The results surprised us.
The Three Approaches
Uniform Quantization (Baseline)
Standard mlx_lm.convert with q_bits=4:
convert(hf_path=source, mlx_path=output, quantize=True, q_bits=4)
Every QuantizedSwitchLinear uses 4-bit weights. Simple, fast, well-supported.
Layer-Level Quantization
Uses a custom quant_predicate that returns different bit widths for different layers:
def layer_level_predicate(path, module):
    if "switch_mlp" in path:
        layer_idx = extract_layer(path)
        if layer_idx in high_priority_layers:
            return {"bits": 8, "group_size": 64}
    return True  # default 4-bit
All experts in a promoted layer share 8-bit precision. This works with standard QuantizedSwitchLinear — no custom kernels needed.
Expert-Level Quantization (MixedBitSwitchGLU)
Our custom implementation that groups experts by bit width within each layer (see Article 2):
# Per-expert classification from activation profiling:
# Expert 47: critical → 8-bit
# Expert 12: standard → 4-bit
# Expert 201: deprioritized → 2-bit
# Expert 389: prune → 0-bit (removed)
Requires custom MixedBitSwitchGLU with mask-and-combine dispatch.
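The per-expert classification above can be sketched as a thresholding pass over profiled criticality scores. This is a minimal illustration, not our actual profiling pipeline: the function name, score scale, and threshold values are all placeholders.

```python
def classify_experts(scores, hi=0.9, mid=0.2, lo=0.02):
    """Map per-expert criticality scores to bit widths.

    scores: dict expert_id -> criticality in [0, 1] (from activation profiling).
    The thresholds here are illustrative placeholders.
    Returns dict expert_id -> bits, where 0 means the expert is pruned.
    """
    bits = {}
    for eid, s in scores.items():
        if s >= hi:
            bits[eid] = 8      # critical
        elif s >= mid:
            bits[eid] = 4      # standard
        elif s >= lo:
            bits[eid] = 2      # deprioritized
        else:
            bits[eid] = 0      # prune
    return bits

# Mirrors the four example experts above:
print(classify_experts({47: 0.95, 12: 0.5, 201: 0.05, 389: 0.001}))
# → {47: 8, 12: 4, 201: 2, 389: 0}
```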
Layer Selection Strategy: "Any Critical Wins"
For layer-level quantization, we needed a rule to decide which layers get 8-bit. Our approach: count critical experts per layer from the activation profiling. If a layer has ≥ N critical experts, promote the entire layer to 8-bit.
For Qwen3.5-397B with threshold N=30:
Layers promoted to 8-bit (11 of 60):
Layer 41: 94 critical experts → 8-bit
Layer 40: 50 critical experts → 8-bit
Layer 36: 39 critical experts → 8-bit
Layer 19: 37 critical experts → 8-bit
Layer 29: 36 critical experts → 8-bit
Layer 27: 34 critical experts → 8-bit
Layer 25: 33 critical experts → 8-bit
Layer 37: 33 critical experts → 8-bit
Layer 26: 32 critical experts → 8-bit
Layer 28: 31 critical experts → 8-bit
Layer 45: 31 critical experts → 8-bit
The remaining 49 layers stay at 4-bit. This is straightforward — no custom kernels, no mask-and-combine, standard MLX inference path.
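The "any critical wins" rule itself is a one-liner. A minimal sketch, assuming you already have a per-layer count of critical experts from profiling (the toy counts below are illustrative, not the full Qwen3.5 table):

```python
def select_8bit_layers(critical_counts, threshold=30):
    """'Any critical wins' layer selection.

    critical_counts: dict layer_idx -> number of critical experts in that
    layer (from activation profiling). A layer is promoted to 8-bit when
    its count reaches the threshold; all others stay at the 4-bit default.
    """
    return {layer for layer, n in critical_counts.items() if n >= threshold}

# Toy counts for four layers; with N=30, only layers 40 and 41 are promoted.
counts = {41: 94, 40: 50, 3: 2, 17: 29}
print(sorted(select_8bit_layers(counts, threshold=30)))  # → [40, 41]
```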
Quality Comparison
Qwen3-235B-A22B — Five Versions
We built five versions varying the number of 8-bit layers, all using layer-level quantization:
| Version | 8-bit Layers | MMLU-Pro | ARC | GSM8K | HumanEval | Size |
|---|---|---|---|---|---|---|
| Uniform 4-bit | 0 | 72.1% | 96.0% | 88.7% | 78.7% | ~140 GB |
| v2 | 17 | 76.7% | 96.2% | 92.0% | 88.0% | 149 GB |
| v3 | 25 | 68.6% | 95.4% | 93.0% | 88.0% | 151 GB |
| v4 | 35 | 71.7% | 96.2% | 94.0% | 84.0% | 153 GB |
| v4b | 40 | 69.3% | 96.2% | 95.0% | 86.0% | 153 GB |
Official BF16 reference: MMLU-Pro 75.7%, GSM8K 91.5%, HumanEval 80.5%
Key Findings
1. v2 (17 layers at 8-bit) is the sweet spot. It scores higher than the BF16 reference on every benchmark we measured. Adding more 8-bit layers (v3, v4, v4b) does not improve — it actually degrades MMLU-Pro substantially.
2. The quality curve is non-monotonic. Going from 17→25→35→40 8-bit layers, MMLU-Pro goes 76.7%→68.6%→71.7%→69.3%. More precision doesn't always help.
3. GSM8K is the exception. Math performance does improve monotonically with more 8-bit layers: 92%→93%→94%→95%. This suggests mathematical reasoning benefits from higher precision across more layers, even as other capabilities degrade.
4. The v3 anomaly. Version 3 (25 layers) had a sharp MMLU drop to 68.6%. The v3→v3b control experiment (rerunning v2's exact config) reproduced v2's scores (76.7%), confirming the drop was real and caused by the additional 8 layers, not randomness.
Why Does More Precision Hurt?
Our hypothesis: promoting a layer from 4-bit to 8-bit changes the relative precision balance between that layer and its neighbors. When a critical layer is at 8-bit and its non-critical neighbors are at 4-bit, the model can rely on the critical layer for precision-sensitive decisions. But when too many layers are at 8-bit, the precision differential disappears, and the model may amplify quantization noise from the remaining 4-bit layers differently.
This is analogous to how adding contrast to some elements of an image makes them stand out, but adding contrast to everything returns you to a flat image.
Expert-Level vs Layer-Level
For Qwen3-235B, the expert-level MixedBitSwitchGLU (v2) achieved:
| Approach | MMLU-Pro | ARC | GSM8K | HumanEval | Speed | Size |
|---|---|---|---|---|---|---|
| Uniform 4-bit | 72.1% | 96.0% | 88.7% | 78.7% | ~16s | ~140 GB |
| Layer-level (v2) | 76.7% | 96.2% | 92.0% | 88.0% | ~16s | 149 GB |
| Expert-level (MixedBit) | 76.7% | 96.2% | 92.0% | 88.0% | ~21s | 149 GB |
The layer-level and expert-level approaches produce identical benchmark scores on Qwen3-235B. The expert-level allocation (23 experts at 8-bit, 10,845 at 4-bit, 365 at 2-bit, 799 pruned) is equivalent to the layer-level allocation (17 layers at 8-bit, 77 at 4-bit) for quality — but the expert-level version is about 30% slower (~21s vs ~16s) because of its custom mask-and-combine dispatch.
Qwen3.5-397B Comparison
For the larger 512-expert model, the speed difference is even more dramatic:
| Approach | Collapse Tests | Speed | Size | Kernels |
|---|---|---|---|---|
| Uniform 4-bit | 15/15 | ~8s | 209 GB | Standard |
| Layer-level (11 layers@8bit) | 15/15 | 7.7s | 236 GB | Standard |
| Expert-level (MixedBit) | 15/15 | 47.3s | 176 GB | Custom |
Expert-level is 6x slower than layer-level despite producing equivalent collapse test results. The only advantage is size: 176 GB vs 236 GB, a 60 GB saving from the 2-bit and pruned experts.
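The size advantage is simple arithmetic: a weight stored at b bits costs b/8 bytes, so 2-bit and pruned experts shrink the expert block directly. A back-of-envelope sketch with round toy numbers (not the real Qwen3.5 parameter counts; it also ignores group-wise scales/biases and non-expert weights):

```python
def expert_bytes(params_per_expert, bits):
    """Approximate storage for one expert's weights at a given bit width
    (ignores quantization scale/bias overhead)."""
    return params_per_expert * bits / 8

# Toy illustration: moving 1000 experts of 100M params each from uniform
# 4-bit to a 700/200/100 mix of 4-bit / 2-bit / pruned shrinks the
# expert weights by 20%.
p = 100_000_000
uniform = 1000 * expert_bytes(p, 4)
mixed = 700 * expert_bytes(p, 4) + 200 * expert_bytes(p, 2) + 100 * 0
print(round(1 - mixed / uniform, 2))  # → 0.2
```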
The Size-Speed-Quality Tradeoff
Visualizing the three approaches:
                 Quality
                    ▲
                    │
  Expert-level ─────┼───── Layer-level
     (slow)         │        (fast)
                    │
                    │
                    ├───── Uniform 4-bit
                    │        (fast, lower quality)
                    └──────────────────► Speed
                 slow                  fast

Size: Expert-level < Uniform < Layer-level
       (176 GB)     (209 GB)   (236 GB)
- If you optimize for quality + speed: layer-level wins (same quality as expert-level, standard kernel speed)
- If you optimize for quality + size: expert-level wins (same quality as layer-level, smallest size from 2-bit/pruned experts)
- If you optimize for speed and simplicity alone: uniform wins (no custom code, medium size, but the lowest quality of the three)
Decision Framework
When choosing a quantization granularity for your MoE model:
Use Uniform Quantization When:
- You don't have activation profiling data
- You need the simplest deployment
- Your model fits comfortably in memory at 4-bit
- You can accept ~3-5% quality loss vs BF16
Use Layer-Level Quantization When:
- You have activation profiling data
- You want maximum quality with standard kernels
- You can afford 5-15% extra model size (for 8-bit layers)
- You need interactive inference speed
Use Expert-Level Quantization When:
- Memory is your binding constraint
- You can tolerate substantial speed overhead (~30% on Qwen3-235B, ~6x on Qwen3.5-397B in our tests)
- You're running batch inference (speed matters less)
- You need the absolute smallest model possible
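The three checklists above can be folded into a tiny selector. This is a sketch only: the boolean inputs are crude simplifications of the real deployment constraints.

```python
def choose_granularity(has_profiling, memory_bound, batch_only):
    """Pick a quantization granularity from coarse deployment constraints.

    has_profiling: activation profiling data is available.
    memory_bound:  model size is the binding constraint.
    batch_only:    inference is offline/batch, so speed matters less.
    """
    if not has_profiling:
        return "uniform"       # no data to drive non-uniform allocation
    if memory_bound and batch_only:
        return "expert-level"  # smallest model, tolerate slow kernels
    return "layer-level"       # best quality/speed with standard kernels

print(choose_granularity(True, True, False))  # → layer-level
```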
Never Use More Than ~20-30% of Layers at 8-Bit
Our data consistently shows that promoting too many layers to 8-bit degrades quality. Stick to the critical layers identified by activation profiling and leave the rest at 4-bit.
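This guideline can be enforced mechanically before running a conversion. A minimal guard, where the 25% cap is our rule of thumb from the data above, not a hard limit:

```python
def check_promotion_budget(layers_8bit, num_layers, max_frac=0.25):
    """Fail fast when too many layers are promoted to 8-bit.

    Our benchmarks suggest quality degrades past roughly 20-30% of
    layers at 8-bit; max_frac encodes that rule of thumb.
    Returns the promoted fraction when within budget.
    """
    frac = len(layers_8bit) / num_layers
    if frac > max_frac:
        raise ValueError(
            f"{len(layers_8bit)}/{num_layers} layers at 8-bit "
            f"({frac:.0%}) exceeds the {max_frac:.0%} budget"
        )
    return frac

# The Qwen3.5-397B selection: 11 of 60 layers (~18%) is within budget.
print(round(check_promotion_budget(
    {19, 25, 26, 27, 28, 29, 36, 37, 40, 41, 45}, 60), 3))  # → 0.183
```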
Practical Implementation
Layer-level quantization requires only a custom predicate function for mlx_lm.convert:
import re

# Layers to promote to 8-bit (from activation profiling)
LAYERS_8BIT = {19, 25, 26, 27, 28, 29, 36, 37, 40, 41, 45}

def layer_predicate(path: str, module) -> bool | dict:
    """Custom quantization predicate for layer-level MoE quantization."""
    if "switch_mlp" in path:
        # Extract layer index from path like:
        # "language_model.model.layers.41.mlp.switch_mlp.gate_proj"
        match = re.search(r"layers\.(\d+)\.", path)
        if match and int(match.group(1)) in LAYERS_8BIT:
            return {"bits": 8, "group_size": 64}
    return True  # Default quantization

convert(
    hf_path=source,
    mlx_path=output,
    quantize=True,
    q_bits=4,
    q_group_size=128,
    quant_predicate=layer_predicate,
    dtype="bfloat16",
)
No custom inference code. No custom kernels. Standard mlx_lm.load and mlx_lm.generate work unchanged.
Conclusion
The granularity of bit allocation in MoE quantization has diminishing returns:
- Uniform → Layer-level: Significant quality improvement (+4.6% MMLU-Pro), minimal speed cost, standard kernels
- Layer-level → Expert-level: No measurable quality improvement, a speed cost ranging from ~30% (Qwen3-235B) to ~6x (Qwen3.5-397B), and a requirement for custom kernels
Layer-level quantization is the practical sweet spot. It captures the benefit of activation profiling (promoting critical layers) without the engineering and performance costs of per-expert bit allocation.
The biggest surprise: more 8-bit layers can hurt. The relationship between precision allocation and model quality is non-monotonic. Profile your experts, promote only the layers with the most critical experts, and leave the rest alone.
Next in this series: Why Collapse Tests Are Insufficient for Quantization Quality Assessment — how a model can score 15/15 on automated tests while producing Chinese characters in Spanish translations.