What if your quantizer could tell the difference between a critical expert and a rarely-used one — and assign bits accordingly? DynaMINT does exactly this, combining MINT’s rate-distortion optimizer with activation-guided expert tiering.
The Problem
Uniform quantization treats all 256 experts the same. But some experts handle 3x more traffic than others. Critical experts deserve higher precision; deprioritized ones can be aggressively compressed. The question is whether activation frequency is a reliable signal for bit-width allocation — and whether the quality cost is acceptable.
Standard mixed-precision methods like MINT assign bits based on weight sensitivity metrics (spectral concentration, kurtosis, noise amplification). These metrics capture how much a tensor’s output degrades under quantization, but they are blind to how often each expert is actually used during inference. DynaMINT adds the missing dimension: activation frequency.
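To make one of these sensitivity signals concrete, the sketch below computes excess kurtosis over a weight tensor: a heavy-tailed tensor has outliers that widen the quantization range and inflate per-weight rounding error, so high kurtosis flags fragility. This is an illustrative stand-in under our own assumptions, not MINT's actual metric implementation; `excess_kurtosis` and the toy tensors are hypothetical names.

```python
import numpy as np

def excess_kurtosis(w):
    """Fourth standardized moment minus 3 (approximately 0 for a Gaussian)."""
    w = np.asarray(w, dtype=np.float64).ravel()
    z = (w - w.mean()) / w.std()
    return float((z ** 4).mean() - 3.0)

rng = np.random.default_rng(0)
gaussian_like = rng.normal(size=10_000)           # near-zero excess kurtosis
heavy_tailed = rng.standard_t(df=3, size=10_000)  # outlier-prone "fragile" tensor
```

A tensor like `heavy_tailed` would score far above `gaussian_like`, signaling that it degrades more under coarse quantization.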
The Pipeline
DynaMINT operates in three stages:
- Profile: Run 100 diverse prompts through the model, capturing expert routing decisions at every MoE layer. Record per-expert activation counts across all tokens and layers.
- Tier: Rank experts by activation frequency. Assign each expert to one of four tiers: critical (top ~20%, 8-bit), standard (next ~65%, 4-bit), deprioritized (next ~12%, 2-bit), and prunable (bottom ~3.6%, zeroed).
- Dispatch: DynaMINTSwitchGLU, a modified MoE dispatch layer, performs a tier lookup at inference time, routing each token’s expert selection to the correct precision bank via separate gather_qmm calls.
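The Profile and Tier stages can be sketched in a few lines. This is a hypothetical illustration, not the DynaMINT code: `assign_tiers` and the toy counts are our own names, and the tier cut-offs mirror the percentages above.

```python
def assign_tiers(activation_counts, n_experts):
    """Rank experts by activation count and bucket them into four tiers.

    Cut-offs follow the tiering above: top 20% -> 8-bit, next 65% -> 4-bit,
    next ~11% -> 2-bit, bottom ~3.6% -> zeroed.
    """
    ranked = sorted(range(n_experts),
                    key=lambda e: activation_counts[e], reverse=True)
    tiers = {}
    for rank, expert in enumerate(ranked):
        frac = rank / n_experts
        if frac < 0.20:
            tiers[expert] = ("critical", 8)
        elif frac < 0.85:
            tiers[expert] = ("standard", 4)
        elif frac < 0.964:
            tiers[expert] = ("deprioritized", 2)
        else:
            tiers[expert] = ("prunable", 0)
    return tiers

# Toy profile: 10 experts with skewed traffic. A real profile would count
# router selections over ~100 prompts at every MoE layer.
counts = {e: 1000 // (e + 1) for e in range(10)}
tiers = assign_tiers(counts, 10)
```

With only 10 toy experts the prunable bucket stays empty; at 10,240 experts the bottom ~3.6% lands there.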
Tier Distribution
Across all 40 MoE layers of Qwen3.5-35B-A3B (256 experts each, 10,240 total experts):
| Tier | Experts | Share | Precision |
|---|---|---|---|
| Critical | 2,040 | 19.9% | 8-bit |
| Standard | 6,640 | 64.8% | 4-bit |
| Deprioritized | 1,187 | 11.6% | 2-bit |
| Prunable | 373 | 3.6% | Zeroed |
The distribution is intentionally conservative: nearly two-thirds of experts remain at standard 4-bit precision, and only 3.6% are zeroed — far below the 5% threshold where our pruning experiments showed catastrophic failure.
Results
| Configuration | PPL | Delta | Speed (tok/s) |
|---|---|---|---|
| Baseline MINT | 6.580 | — | 70.1 |
| DynaMINT (tiered) | 6.613 | +0.5% | 9.6 |
The quality result is compelling: +0.5% PPL degradation with tiered precision across all 10,240 experts. DynaMINT produces coherent chain-of-thought responses on all test prompts, with no detectable quality difference in human evaluation.
The speed result is not: 9.6 tok/s vs 70.1 tok/s baseline. This 7x slowdown is entirely due to the Python-level dispatch implementation, not an inherent limitation of the approach.
Why This Matters
DynaMINT shows that activation-guided tiering complements MINT: the profiler provides the “which experts matter” signal, and MINT provides the “how many bits” optimization. Together they enable per-expert precision at production quality.
The key insight is the separation of concerns: MINT’s rate-distortion analysis handles the within-expert sensitivity (which weight tensors are fragile), while DynaMINT’s activation profiler handles the between-expert importance (which experts carry more traffic). Neither signal alone captures the full picture.
The Speed Problem (and the Solution)
The 7x slowdown comes from Python-level per-tier dispatch. The current implementation performs 3 separate gather_qmm calls — one for each active precision tier — with Python orchestration between them. This is fundamentally a software engineering problem, not an algorithmic one.
Two paths to eliminate the overhead:
- Sorted dispatch: Sort tokens by their expert tier before dispatch, enabling a single fused kernel call per tier with batch-level parallelism
- Native MLX kernel: A custom Metal kernel that handles mixed-precision gather-scatter in a single pass, eliminating Python round-trips entirely
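The sorted-dispatch idea can be sketched as a toy under stated assumptions: `PrecisionBank` stands in for a bank of expert weights stored at one bit-width, and `batched_matmul` stands in for a fused gather_qmm-style call. The point is that grouping tokens by tier leaves exactly one batched call per active tier, with tokens routed to prunable experts contributing zeros.

```python
import numpy as np

class PrecisionBank:
    """Toy stand-in for expert weights stored at one bit-width."""
    def __init__(self, weights):            # weights: [n_experts, d, d]
        self.weights = weights

    def batched_matmul(self, x, experts):   # x: [n, d], experts: [n]
        # One batched call; stands in for a fused gather_qmm-style kernel.
        return np.einsum("nd,ndk->nk", x, self.weights[experts])

def tiered_dispatch(x, expert_ids, tier_of_expert, banks):
    """Group tokens by precision tier so each tier needs one batched call."""
    out = np.zeros_like(x)
    tiers = tier_of_expert[expert_ids]
    for tier, bank in banks.items():
        mask = tiers == tier
        if mask.any():
            out[mask] = bank.batched_matmul(x[mask], expert_ids[mask])
    return out  # tokens routed to tier-0 (prunable) experts stay zero
```

In a real kernel the per-tier masking would happen once via a stable sort over tiers, keeping each tier's tokens contiguous in memory before the fused call.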
The quality result (+0.5% PPL) justifies the engineering investment. Once the dispatch overhead is eliminated, DynaMINT should deliver better quality-per-bit than uniform quantization with no throughput penalty.
DynaMINT is part of the MINT research project. Code and full results available at github.com/baa-ai/MINT. Paper: “MINT: Budget-Aware Data-Free Mixed-Precision Quantization for Large Language Models via Rate-Distortion Optimization” (baa.ai, 2026).
Read the Full Paper
The complete MINT paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full MCKP allocation framework, is available on our HuggingFace:
MINT: Compute-Optimal Data-Free Mixed-Precision Quantization for LLMs — Full Paper
huggingface.co/spaces/baa-ai/MINT
Licensed under CC BY-NC-ND 4.0