DynaMINT: When MINT Meets Expert-Aware Tiering
MoE Research

March 2026 · Black Sheep AI Research

What if your quantizer could tell the difference between a critical expert and a rarely-used one — and assign bits accordingly? DynaMINT does exactly this, combining MINT’s rate-distortion optimizer with activation-guided expert tiering.

The Problem

Uniform quantization treats all 256 experts in a layer the same. But some experts handle 3x more traffic than others. Critical experts deserve higher precision; deprioritized ones can be compressed aggressively. The question is whether activation frequency is a reliable signal for bit-width allocation, and whether the quality cost is acceptable.

Standard mixed-precision methods like MINT assign bits based on weight sensitivity metrics (spectral concentration, kurtosis, noise amplification). These metrics capture how much a tensor’s output degrades under quantization, but they are blind to how often each expert is actually used during inference. DynaMINT adds the missing dimension: activation frequency.
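
As a rough illustration of one such sensitivity proxy: excess kurtosis flags heavy-tailed weight tensors, whose outliers dominate the quantization grid and degrade badly at low bit-widths. The sketch below is illustrative only; it is not MINT's actual metric or implementation.

```python
# Illustrative only: excess kurtosis as one weight-sensitivity proxy.
# MINT's actual metrics (spectral concentration, noise amplification)
# are defined in the paper; this sketch just shows the flavor.

def excess_kurtosis(weights):
    """Fourth standardized moment minus 3. Heavy-tailed tensors
    (outlier-dominated) score high and tend to quantize poorly."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    m4 = sum((w - mean) ** 4 for w in weights) / n
    return m4 / (var ** 2) - 3.0

# A tensor with a few large outliers is more fragile under
# quantization than one with near-uniform weights.
smooth = [0.1 * i for i in range(-50, 50)]   # uniform-ish values
spiky = [0.0] * 98 + [5.0, -5.0]             # outlier-dominated
assert excess_kurtosis(spiky) > excess_kurtosis(smooth)
```

A metric like this captures per-tensor fragility, but says nothing about how often the tensor's expert is routed to, which is the gap DynaMINT fills.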

The Pipeline

DynaMINT operates in three stages:

1. Profile: record per-expert activation frequency across a set of calibration prompts.
2. Tier: rank experts by traffic and assign each to a precision tier (critical, standard, deprioritized, or prunable).
3. Allocate: run MINT's rate-distortion bit allocation within each tier's precision budget and quantize.

Tier Distribution

Across all 40 MoE layers of Qwen3.5-35B-A3B (256 experts each, 10,240 total experts):

Tier            Experts   Share   Precision
Critical          2,040   19.9%   8-bit
Standard          6,640   64.8%   4-bit
Deprioritized     1,187   11.6%   2-bit
Prunable            373    3.6%   Zeroed

The distribution is intentionally conservative: nearly two-thirds of experts remain at standard 4-bit precision, and only 3.6% are zeroed — far below the 5% threshold where our pruning experiments showed catastrophic failure.
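
The tiering itself can be sketched as a simple ranking by activation count. The quantile cutoffs below are illustrative assumptions chosen to roughly match the published shares; they are not the thresholds used in the experiments.

```python
# Minimal sketch of activation-guided tiering. The quantile cutoffs
# are illustrative assumptions, not DynaMINT's actual thresholds.

def assign_tiers(activation_counts,
                 critical_q=0.80,    # top ~20% of experts -> 8-bit
                 deprior_q=0.15,     # bottom ~15% -> 2-bit
                 prunable_q=0.036):  # bottom ~3.6% -> zeroed
    ranked = sorted(range(len(activation_counts)),
                    key=lambda e: activation_counts[e])
    n = len(ranked)
    tiers = {}
    for rank, expert in enumerate(ranked):
        frac = rank / n  # fraction of experts used less often
        if frac < prunable_q:
            tiers[expert] = "prunable"        # zeroed
        elif frac < deprior_q:
            tiers[expert] = "deprioritized"   # 2-bit
        elif frac < critical_q:
            tiers[expert] = "standard"        # 4-bit
        else:
            tiers[expert] = "critical"        # 8-bit
    return tiers

counts = list(range(1000))  # fake per-expert activation counts
tiers = assign_tiers(counts)
assert tiers[999] == "critical" and tiers[0] == "prunable"
```

Rank-based cutoffs like these keep the distribution stable across layers even when absolute traffic varies widely.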

Results

Configuration       PPL     Delta   Speed (tok/s)
Baseline MINT       6.580           70.1
DynaMINT (tiered)   6.613   +0.5%   9.6

The quality result is compelling: +0.5% PPL degradation with tiered precision across all 10,240 experts. DynaMINT produces coherent chain-of-thought responses on all test prompts, with no detectable quality difference in human evaluation.

The speed result is not: 9.6 tok/s vs 70.1 tok/s baseline. This 7x slowdown is entirely due to the Python-level dispatch implementation, not an inherent limitation of the approach.

Why This Matters

DynaMINT demonstrates that activation-guided tiering is complementary to MINT: the profiler provides the “which experts matter” signal, and MINT provides the “how many bits” optimization. Together they enable per-expert precision at production quality.

The key insight is the separation of concerns: MINT’s rate-distortion analysis handles the within-expert sensitivity (which weight tensors are fragile), while DynaMINT’s activation profiler handles the between-expert importance (which experts carry more traffic). Neither signal alone captures the full picture.
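
One way to picture how the two signals combine (a hedged sketch, not MINT's actual MCKP solver): weight each expert's per-bit-width distortion estimate by its traffic share, then greedily spend an average-bit budget where the weighted gain per extra bit is largest.

```python
# Hedged sketch: traffic-weighted greedy bit allocation.
# MINT's real allocator solves a multiple-choice knapsack (MCKP);
# this greedy version only illustrates weighting within-expert
# distortion by between-expert traffic.

def allocate_bits(distortion, traffic, avg_bit_budget):
    """distortion[e][b] = estimated quality loss of expert e at b bits;
    traffic[e] = activation share of expert e (sums to 1)."""
    bit_options = sorted(distortion[0])            # e.g. [2, 4, 8]
    bits = {e: bit_options[0] for e in range(len(traffic))}
    budget = avg_bit_budget * len(traffic)
    spent = sum(bits.values())
    while True:
        # Pick the single upgrade with the best traffic-weighted
        # distortion reduction per extra bit spent.
        best = None
        for e, b in bits.items():
            higher = [x for x in bit_options if x > b]
            if not higher:
                continue
            nb = higher[0]
            if spent - b + nb > budget:
                continue
            gain = traffic[e] * (distortion[e][b] - distortion[e][nb])
            gain_per_bit = gain / (nb - b)
            if best is None or gain_per_bit > best[0]:
                best = (gain_per_bit, e, nb)
        if best is None:
            break
        _, e, nb = best
        spent += nb - bits[e]
        bits[e] = nb
    return bits

# With identical per-expert distortion curves, traffic alone
# decides: the hot expert gets 8 bits, the cold one stays at 2.
curves = [{2: 1.0, 4: 0.4, 8: 0.1}, {2: 1.0, 4: 0.4, 8: 0.1}]
assert allocate_bits(curves, [0.9, 0.1], 5) == {0: 8, 1: 2}
```

The final assertion is the whole point in miniature: neither the distortion curve nor the traffic share alone would separate the two experts.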

The Speed Problem (and the Solution)

The 7x slowdown comes from Python-level per-tier dispatch. The current implementation performs 3 separate gather_qmm calls — one for each active precision tier — with Python orchestration between them. This is fundamentally a software engineering problem, not an algorithmic one.
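
The slow path looks roughly like this. `gather_qmm` below is a stand-in callable for MLX's quantized gather-matmul, and the tier bookkeeping and shapes are illustrative assumptions, not the actual implementation.

```python
# Sketch of the Python-level per-tier dispatch behind the slowdown:
# one gather_qmm-style call per active precision tier, with host-side
# index filtering in between, instead of a single fused kernel.

def moe_forward_tiered(x, routed_experts, tier_of, weights_by_tier,
                       gather_qmm):
    out = [0.0] * len(x)
    # Prunable experts are zeroed, so they never dispatch at all.
    for tier in ("critical", "standard", "deprioritized"):
        idx = [e for e in routed_experts if tier_of[e] == tier]
        if not idx:
            continue
        # Separate kernel launch per tier, plus Python round-trips.
        partial = gather_qmm(x, weights_by_tier[tier], idx)
        out = [o + p for o, p in zip(out, partial)]
    return out
```

Even with only three tiers, the host-side filtering and extra launches dominate when the per-call work is small, which is exactly the decode regime.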

Two paths could eliminate the overhead: fuse the three per-tier gather_qmm calls into a single mixed-precision kernel, or compile the tier orchestration into the execution graph so the per-tier calls launch without Python round-trips between them.

The quality result (+0.5% PPL) justifies the engineering investment. Once the dispatch overhead is eliminated, DynaMINT should deliver better quality-per-bit than uniform quantization with no throughput penalty.


DynaMINT is part of the MINT research project. Code and full results available at github.com/baa-ai/MINT. Paper: “MINT: Budget-Aware Data-Free Mixed-Precision Quantization for Large Language Models via Rate-Distortion Optimization” (baa.ai, 2026).

Read the Full Paper

The complete MINT paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full MCKP allocation framework, is available on our HuggingFace:

MINT: Compute-Optimal Data-Free Mixed-Precision Quantization for LLMs — Full Paper

huggingface.co/spaces/baa-ai/MINT

Licensed under CC BY-NC-ND 4.0

Continue Reading

Related research from our team.

Why You Can’t Prune MoE Experts — Even the Ones Nobody Uses (MoE Research)
Removing just 5% of least-used experts causes 13x perplexity blow-up. Rarely activated does not mean safely removable.

What 100 Prompts Reveal About Expert Routing in 256-Expert MoE Models (MoE Research)
Profiling expert activation across 100 diverse prompts reveals the dramatic sample-size effect on apparent redundancy.

Mean Perplexity Is Lying to You (MINT Research)
Standard perplexity evaluation produces misleading quality orderings. Here’s why, and what to report instead.

View All Research