Individual expert analysis for small MoE models, k-means clustering for large ones, and conservative aggregation that ensures no expert is the weak link. MINT’s approach to the hardest quantization target in AI.
Why MoE Models Are the Hardest Quantization Target
Mixture-of-Experts models like Mixtral (8 experts), Qwen3-30B (64 experts per layer), and Qwen3.5-397B (512 experts per layer) achieve excellent quality-per-FLOP because only a fraction of experts activate per token. But they have a brutal memory problem: every expert must reside in memory even though most are idle at any given moment. A 109B MoE model needs 200+ GB in full precision.
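The memory arithmetic behind that figure is simple (a rough sketch assuming 16-bit weights; it ignores embeddings, activations, and KV cache):

```python
# Back-of-the-envelope full-precision footprint for a 109B-parameter MoE model.
params = 109e9        # total parameters, all experts included
bytes_per_param = 2   # bf16 / fp16

print(f"{params * bytes_per_param / 1e9:.0f} GB")  # -> 218 GB
```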
Quantization is essential for MoE deployment on anything less than a data-center node. But MoE models make quantization harder than dense models in three specific ways.
Problem 1: Expert Heterogeneity
In a dense model, all weight matrices at the same position across layers have roughly similar characteristics. The gate projection in layer 10 looks statistically similar to the gate projection in layer 20. This makes uniform quantization—same bit-width for all layers—a reasonable starting point.
In an MoE model, this assumption breaks down. Experts within the same layer can have wildly different weight distributions, different sensitivities to quantization, and different importance to the model’s output. Expert 3 in layer 15 might have tight, well-behaved weight distributions that compress cleanly to 3-bit. Expert 47 in the same layer might have heavy-tailed distributions with outlier values that collapse at anything below 8-bit.
Treating them identically—same bit-width for all experts—wastes bits on the robust experts and under-protects the sensitive ones. The result is a model that is simultaneously over-compressed in some places and under-compressed in others.
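This heterogeneity is visible from weight statistics alone. A minimal sketch using synthetic stand-ins for two experts in the same layer (the Gaussian and Student-t distributions are illustrative assumptions, not real model weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a well-behaved expert (Gaussian weights) and a
# heavy-tailed expert (Student-t weights) at the same position in the layer.
expert_a = rng.normal(0.0, 0.02, size=(1024, 256))
expert_b = rng.standard_t(df=5, size=(1024, 256)) * 0.02

def excess_kurtosis(w: np.ndarray) -> float:
    """Fourth standardized moment minus 3; ~0 for Gaussian weights."""
    x = w.ravel()
    z = (x - x.mean()) / x.std()
    return float((z ** 4).mean() - 3.0)

print(f"expert A excess kurtosis: {excess_kurtosis(expert_a):+.2f}")  # near zero
print(f"expert B excess kurtosis: {excess_kurtosis(expert_b):+.2f}")  # strongly positive
```

The high-kurtosis expert has outlier weights that dominate the quantization scale, which is exactly why it degrades first under aggressive compression.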
Problem 2: Calibration Coverage
Calibration-based methods like GPTQ need calibration data that activates the experts being analyzed. The algorithm measures how activations flow through each expert and uses that information to decide how to compress it. But in a model with 512 experts per layer, any given token activates only a handful—typically 10, selected by top-k routing.
Getting adequate coverage for all 30,720 expert instances (60 layers × 512 experts) would require enormous calibration datasets. And even then, rare-but-important experts may be missed entirely. An expert that activates only for mathematical reasoning or code generation might see zero calibration samples, leaving it quantized based on no data at all.
This is why MINT’s data-free approach is especially valuable for MoE models. There is no calibration coverage problem because there is no calibration. Every expert is analyzed based on its weight statistics alone, regardless of how frequently it activates during inference.
MINT’s Two-Phase Expert Strategy
MINT uses different strategies based on expert count, recognizing that what works for 8 experts does not scale to 512.
Small MoE models (≤32 experts per layer)
For models like Mixtral (8 experts) or smaller MoE architectures, MINT analyzes each expert individually. It computes all sensitivity features—rate-distortion curves, spectral analysis, kurtosis, SQNR—for every expert weight matrix. This is feasible because a 16-expert model with 6 weight types per expert has just 96 matrices per layer. Full analysis at this scale takes minutes, not hours.
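A sketch of one such weight-only feature, simulated quantization error, applied per matrix (the symmetric round-to-nearest scheme and function names here are illustrative simplifications, not MINT's exact implementation):

```python
import numpy as np

def quant_error_features(w: np.ndarray, bits: int) -> dict:
    """Simulate symmetric round-to-nearest quantization and measure error."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    w_hat = np.round(w / scale) * scale          # quantize, then dequantize
    err = w - w_hat
    nrmse = np.sqrt((err ** 2).mean()) / np.sqrt((w ** 2).mean())
    sqnr_db = 10 * np.log10((w ** 2).mean() / (err ** 2).mean())
    return {"nrmse": float(nrmse), "sqnr_db": float(sqnr_db)}

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(1024, 256))        # stand-in expert matrix

feats = {bits: quant_error_features(w, bits) for bits in (3, 4, 8)}
for bits, f in feats.items():
    print(f"{bits}-bit: NRMSE {f['nrmse']:.3f}, SQNR {f['sqnr_db']:.1f} dB")
```

No activations appear anywhere in this computation: the score comes from the weights themselves, which is what makes the approach data-free.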
Large MoE models (>32 experts per layer)
For models like Qwen3-30B (64 experts) or Qwen3.5-397B (512 experts), analyzing every expert individually becomes prohibitively slow. Instead, MINT uses k-means clustering on lightweight statistics—weight norm, standard deviation, kurtosis—to group similar experts together. It then samples one representative expert per cluster for full analysis.
This reduces a 512-expert model to approximately 8–16 representative analyses per layer while still capturing the statistical diversity among experts. The clustering ensures that experts with different statistical profiles end up in different groups, so no category of expert is missed. A cluster of robust, low-kurtosis experts gets one representative. A cluster of sensitive, heavy-tailed experts gets another. Each representative’s analysis is applied to all members of its cluster.
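The clustering step can be sketched with a plain NumPy Lloyd's-algorithm k-means over lightweight per-expert statistics (the feature choices, cluster count, and synthetic experts are assumptions for illustration):

```python
import numpy as np

def expert_stats(w: np.ndarray) -> np.ndarray:
    """Lightweight per-expert features: norm, std, excess kurtosis."""
    x = w.ravel()
    z = (x - x.mean()) / x.std()
    return np.array([np.linalg.norm(x), x.std(), (z ** 4).mean() - 3.0])

def kmeans(X: np.ndarray, k: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """Plain Lloyd's algorithm on standardized features."""
    X = (X - X.mean(0)) / (X.std(0) + 1e-12)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

# 64 synthetic experts: half tight Gaussian, half heavy-tailed.
rng = np.random.default_rng(0)
experts = [rng.normal(0, 0.02, (256, 64)) for _ in range(32)] + \
          [rng.standard_t(5, (256, 64)) * 0.02 for _ in range(32)]
X = np.stack([expert_stats(w) for w in experts])
labels = kmeans(X, k=4)

# One representative per cluster gets the full (expensive) analysis.
reps = {int(c): int(np.where(labels == c)[0][0]) for c in np.unique(labels)}
print(reps)
```

Only the representatives pass through the full sensitivity pipeline; every other expert inherits its cluster representative's result.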
Conservative Aggregation: The Safety Net
When multiple experts share an allocation decision—whether because they are in the same cluster or because the allocator groups them for efficiency—MINT uses worst-case aggregation:
| Metric | Aggregation Rule | Rationale |
|---|---|---|
| NRMSE | max(NRMSE across experts) | Use the most sensitive expert’s error |
| SQNR | min(SQNR across experts) | Use the least robust expert’s safety margin |
| Size | sum(size across experts) | Total storage for the group |
This guarantees that no individual expert is catastrophically degraded, even if it is the most sensitive one in the group. The cost is slightly conservative allocation—a few more bits than the average expert needs—but the safety margin prevents silent failures. In a model where any expert might be the one that handles a critical query, you cannot afford to leave the weakest expert unprotected.
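The aggregation rules above translate directly into code. A minimal sketch (the metric names, dict layout, and numbers are assumptions, not MINT's actual interfaces):

```python
def aggregate_group(expert_metrics: list[dict]) -> dict:
    """Worst-case aggregation over experts that share one allocation decision."""
    return {
        "nrmse": max(m["nrmse"] for m in expert_metrics),      # most sensitive expert
        "sqnr_db": min(m["sqnr_db"] for m in expert_metrics),  # least robust expert
        "size_bytes": sum(m["size_bytes"] for m in expert_metrics),  # total storage
    }

group = [
    {"nrmse": 0.03, "sqnr_db": 30.5, "size_bytes": 12_582_912},
    {"nrmse": 0.11, "sqnr_db": 19.2, "size_bytes": 12_582_912},  # the weak link
    {"nrmse": 0.05, "sqnr_db": 26.0, "size_bytes": 12_582_912},
]
print(aggregate_group(group))
# The group is scored by its weakest member: NRMSE 0.11, SQNR 19.2 dB.
```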
Results on Real MoE Models
MINT’s matched-size comparison with GPTQ across three MoE families shows consistent improvements:
| Model | Experts | Method | Size (GB) | Mean PPL | Δ PPL |
|---|---|---|---|---|---|
| Qwen3-30B | 64/layer | GPTQ | 16.0 | 9.122 | — |
| Qwen3-30B | 64/layer | MINT | 16.1 | 8.970 | −1.7% |
| Qwen2-57B | 56/layer | GPTQ | 29.9 | 6.390 | — |
| Qwen2-57B | 56/layer | MINT | 29.9 | 6.329 | −0.95% |
| Mixtral-8x7B | 8/layer | GPTQ | 87.0 | 4.608 | — |
| Mixtral-8x7B | 8/layer | MINT | 24.5 | 4.264 | −7.5% |
MINT outperforms GPTQ on every MoE model tested, and the gap is largest on models where calibration coverage is most challenging. The Mixtral result is particularly striking: MINT at 24.5 GB beats GPTQ at 87.0 GB. Better quality at less than a third of the size, with no calibration data required.
The Llama-4-Scout Story
The 109-billion-parameter Llama-4-Scout model with 16 experts demonstrates the full power of MoE-aware quantization. This model needs over 200 GB in full precision—far beyond any consumer hardware. The question is: how small can you make it without breaking it?
| Configuration | Size (GB) | Perplexity | Status |
|---|---|---|---|
| Without safety veto | 34.6 | 23.6 | Completely broken |
| With 9 dB safety floor | 46.9 | 8.7 | Usable |
| At 64 GB budget | 58 | 7.7 | Excellent (beats uniform 4-bit) |
| At 192 GB budget | 163 | 7.4 | Near-optimal |
The 64 GB configuration is the standout result. A 109-billion-parameter MoE model compressed to fit on a 64 GB Mac, running at perplexity 7.7—better than naive uniform 4-bit quantization (perplexity 7.9). No calibration data. No expert profiling runs. No GPU required for the analysis.
The safety veto result is equally instructive. Without the SQNR safety floor, the allocator compresses too aggressively and a handful of sensitive expert tensors collapse, dragging perplexity from 8.7 to 23.6. The 9 dB safety floor catches these tensors and forces them to higher bit-widths, adding 12 GB of storage but recovering the model’s quality entirely. This is conservative aggregation doing its job.
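A safety veto of this kind can be sketched as a post-pass over the allocator's output (the data structures, tensor names, and SQNR values below are hypothetical illustrations, not MINT's actual interfaces):

```python
def apply_safety_floor(allocations: dict, sqnr_at_bits: dict,
                       floor_db: float = 9.0, max_bits: int = 8) -> dict:
    """Bump any tensor whose predicted SQNR falls below the safety floor.

    allocations:  tensor name -> bit-width chosen by the budget allocator.
    sqnr_at_bits: tensor name -> {bit-width: predicted SQNR in dB}.
    """
    vetoed = dict(allocations)
    for name, bits in allocations.items():
        while bits < max_bits and sqnr_at_bits[name][bits] < floor_db:
            bits += 1  # spend more bits until this tensor clears the floor
        vetoed[name] = bits
    return vetoed

# A sensitive expert tensor sits below the 9 dB floor at 3-bit and gets bumped;
# a robust one already clears the floor and keeps its aggressive allocation.
sqnr = {"expert_47.down_proj": {3: 4.1, 4: 10.3, 5: 16.8},
        "expert_3.down_proj":  {3: 12.7}}
print(apply_safety_floor({"expert_47.down_proj": 3, "expert_3.down_proj": 3}, sqnr))
# -> expert_47.down_proj bumped to 4 bits; expert_3.down_proj stays at 3.
```

The veto trades a little storage for a hard guarantee: no tensor ships below the floor, which is what prevents the silent collapse seen in the first table row.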
Implications for MoE Deployment
MoE architectures are the future of efficient LLM design. Models like Mixtral, DBRX, Llama 4, and Qwen3 demonstrate that MoE delivers better quality per FLOP than dense architectures. The trajectory is clear: more experts, more parameters, better quality, same inference cost.
But MoE deployment on consumer hardware has been blocked by the memory wall. Every expert must be in memory, and with hundreds of experts per layer, the total parameter count dwarfs what consumer devices can hold. A 109B MoE model that uses only 17B parameters per forward pass still needs 200+ GB to store all the experts.
Budget-targeted, expert-aware, data-free quantization removes that block. A 109B MoE model that needed 200+ GB now fits in 47–58 GB—consumer Mac territory. A 30B MoE model compresses to 16 GB—laptop territory. The memory wall that kept MoE models confined to data centers is coming down.
The combination matters: budget-targeted (hit any memory target exactly), expert-aware (different allocation per expert group based on actual sensitivity), and data-free (no calibration, no coverage gaps, no expert profiling runs). Remove any one of these properties and MoE quantization either wastes memory, breaks sensitive experts, or requires impractical calibration infrastructure. Together, they make MoE deployment on consumer hardware not just possible but practical.
This article describes expert-grouped allocation as implemented in MINT (Memory-Informed N-bit Tuning), developed at baa.ai. All benchmark results are from matched-size comparisons using standard perplexity evaluation on WikiText-2 test split. For full methodology, see the MINT paper. The full pipeline is open source at github.com/baa-ai/MINT.