What if your quantizer could tell the difference between a critical expert and a rarely-used one — and assign bits accordingly? DynaMINT does exactly this, combining MINT’s rate-distortion optimizer with activation-guided expert tiering.
The Problem
Uniform quantization treats all 256 experts the same. But some experts handle 3x more traffic than others. Critical experts deserve higher precision; deprioritized ones can be aggressively compressed. The question is whether activation frequency is a reliable signal for bit-width allocation — and whether the quality cost is acceptable.
Standard mixed-precision methods like MINT assign bits based on weight sensitivity metrics (spectral concentration, kurtosis, noise amplification). These metrics capture how much a tensor’s output degrades under quantization, but they are blind to how often each expert is actually used during inference. DynaMINT adds the missing dimension: activation frequency.
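To make one of these sensitivity signals concrete, the sketch below computes excess kurtosis over a weight tensor: a heavy-tailed tensor has outliers that widen the quantization range and inflate per-weight rounding error, so high kurtosis flags fragility. This is an illustrative stand-in under our own assumptions, not MINT's actual metric implementation; `excess_kurtosis` and the toy tensors are hypothetical names.

```python
import numpy as np

def excess_kurtosis(w):
    """Fourth standardized moment minus 3 (approximately 0 for a Gaussian)."""
    w = np.asarray(w, dtype=np.float64).ravel()
    z = (w - w.mean()) / w.std()
    return float((z ** 4).mean() - 3.0)

rng = np.random.default_rng(0)
gaussian_like = rng.normal(size=10_000)           # near-zero excess kurtosis
heavy_tailed = rng.standard_t(df=3, size=10_000)  # outlier-prone "fragile" tensor
```

A tensor like `heavy_tailed` would score far above `gaussian_like`, signaling that it degrades more under coarse quantization.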
The Pipeline
DynaMINT operates in three stages:
- Profile: Run 100 diverse prompts through the model, capturing expert routing decisions at every MoE layer. Record per-expert activation counts across all tokens and layers.
- Tier: Rank experts by activation frequency. Assign each expert to one of four tiers: critical (top ~20%, 8-bit), standard (next ~65%, 4-bit), deprioritized (next ~12%, 2-bit), and prunable (bottom ~3.6%, zeroed).
- Dispatch: DynaMINTSwitchGLU, a modified MoE dispatch layer, performs a tier lookup at inference time, routing each token’s expert selection to the correct precision bank via separate gather_qmm calls.
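The Profile and Tier stages can be sketched in a few lines. This is a hypothetical illustration, not the DynaMINT code: `assign_tiers` and the toy counts are our own names, and the tier cut-offs mirror the percentages above.

```python
def assign_tiers(activation_counts, n_experts):
    """Rank experts by activation count and bucket them into four tiers.

    Cut-offs follow the tiering above: top 20% -> 8-bit, next 65% -> 4-bit,
    next ~11% -> 2-bit, bottom ~3.6% -> zeroed.
    """
    ranked = sorted(range(n_experts),
                    key=lambda e: activation_counts[e], reverse=True)
    tiers = {}
    for rank, expert in enumerate(ranked):
        frac = rank / n_experts
        if frac < 0.20:
            tiers[expert] = ("critical", 8)
        elif frac < 0.85:
            tiers[expert] = ("standard", 4)
        elif frac < 0.964:
            tiers[expert] = ("deprioritized", 2)
        else:
            tiers[expert] = ("prunable", 0)
    return tiers

# Toy profile: 10 experts with skewed traffic. A real profile would count
# router selections over ~100 prompts at every MoE layer.
counts = {e: 1000 // (e + 1) for e in range(10)}
tiers = assign_tiers(counts, 10)
```

With only 10 toy experts the prunable bucket stays empty; at 10,240 experts the bottom ~3.6% lands there.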
Tier Distribution
Across all 40 MoE layers of Qwen3.5-35B-A3B (256 experts each, 10,240 total experts):
| Tier | Experts | Share | Precision |
|---|---|---|---|
| Critical | 2,040 | 19.9% | 8-bit |
| Standard | 6,640 | 64.8% | 4-bit |
| Deprioritized | 1,187 | 11.6% | 2-bit |
| Prunable | 373 | 3.6% | Zeroed |
The distribution is intentionally conservative: nearly two-thirds of experts remain at standard 4-bit precision, and only 3.6% are zeroed — far below the 5% threshold where our pruning experiments showed catastrophic failure.
Results
| Configuration | PPL | Delta | Speed (tok/s) |
|---|---|---|---|
| Baseline MINT | 6.580 | — | 70.1 |
| DynaMINT (tiered) | 6.613 | +0.5% | 9.6 |
The quality result is compelling: +0.5% PPL degradation with tiered precision across all 10,240 experts. DynaMINT produces coherent chain-of-thought responses on all test prompts, with no detectable quality difference in human evaluation.
The speed result is not: 9.6 tok/s vs 70.1 tok/s baseline. This 7x slowdown is entirely due to the Python-level dispatch implementation, not an inherent limitation of the approach.
Why This Matters
DynaMINT shows that activation-guided tiering complements MINT: the profiler provides the “which experts matter” signal, and MINT provides the “how many bits” optimization. Together they enable per-expert precision at production quality.
The key insight is the separation of concerns: MINT’s rate-distortion analysis handles the within-expert sensitivity (which weight tensors are fragile), while DynaMINT’s activation profiler handles the between-expert importance (which experts carry more traffic). Neither signal alone captures the full picture.
The Speed Problem (and the Solution)
The 7x slowdown comes from Python-level per-tier dispatch. The current implementation performs 3 separate gather_qmm calls — one for each active precision tier — with Python orchestration between them. This is fundamentally a software engineering problem, not an algorithmic one.
Two paths to eliminate the overhead:
- Sorted dispatch: Sort tokens by their expert tier before dispatch, enabling a single fused kernel call per tier with batch-level parallelism
- Native MLX kernel: A custom Metal kernel that handles mixed-precision gather-scatter in a single pass, eliminating Python round-trips entirely
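The sorted-dispatch idea can be sketched as a toy under stated assumptions: `PrecisionBank` stands in for a bank of expert weights stored at one bit-width, and `batched_matmul` stands in for a fused gather_qmm-style call. The point is that grouping tokens by tier leaves exactly one batched call per active tier, with tokens routed to prunable experts contributing zeros.

```python
import numpy as np

class PrecisionBank:
    """Toy stand-in for expert weights stored at one bit-width."""
    def __init__(self, weights):            # weights: [n_experts, d, d]
        self.weights = weights

    def batched_matmul(self, x, experts):   # x: [n, d], experts: [n]
        # One batched call; stands in for a fused gather_qmm-style kernel.
        return np.einsum("nd,ndk->nk", x, self.weights[experts])

def tiered_dispatch(x, expert_ids, tier_of_expert, banks):
    """Group tokens by precision tier so each tier needs one batched call."""
    out = np.zeros_like(x)
    tiers = tier_of_expert[expert_ids]
    for tier, bank in banks.items():
        mask = tiers == tier
        if mask.any():
            out[mask] = bank.batched_matmul(x[mask], expert_ids[mask])
    return out  # tokens routed to tier-0 (prunable) experts stay zero
```

In a real kernel the per-tier masking would happen once via a stable sort over tiers, keeping each tier's tokens contiguous in memory before the fused call.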
The quality result (+0.5% PPL) justifies the engineering investment. Once the dispatch overhead is eliminated, DynaMINT should deliver better quality-per-bit than uniform quantization with no throughput penalty.
DynaMINT is part of the MINT research project. Code and full results available at github.com/baa-ai/MINT. Paper: “MINT: Budget-Aware Data-Free Mixed-Precision Quantization for Large Language Models via Rate-Distortion Optimization” (baa.ai, 2026).
Read the Full Paper
The complete MINT paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full MCKP allocation framework, is available on our HuggingFace:
MINT: Compute-Optimal Data-Free Mixed-Precision Quantization for LLMs — Full Paper
huggingface.co/spaces/baa-ai/MINT
Licensed under CC BY-NC-ND 4.0