Complete evaluation across 7 model families, 5 benchmark suites, and over 40,000 questions. MINT outperforms uniform quantization and calibration-based GPTQ on every model tested — while requiring zero GPUs and zero calibration data.
Claims about quantization quality are cheap. Numbers are not. We evaluated MINT across 7 model families spanning 8B to 109B parameters, dense and MoE architectures, using perplexity (WikiText-2), ARC-Challenge, Winogrande, HellaSwag, and MMLU. Every benchmark, every model, every number is reported below.
All experiments run on an Apple M2 Ultra (192 GB unified memory). Perplexity uses WikiText-2 test split, sequence length 2048, seed 42. Downstream benchmarks use lm-evaluation-harness via the MLX backend.
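Concretely, perplexity here is per-sequence: each 2048-token window gets its own PPL (exp of its mean per-token negative log-likelihood), and the mean/median columns below aggregate those per-sequence values. A minimal sketch of the metric, assuming this aggregation; `sequence_ppl` is an illustrative helper, not MINT's code:

```python
import math

SEQ_LEN = 2048  # evaluation sequence length, per the protocol above

def sequence_ppl(token_nlls):
    """Perplexity of one sequence: exp of the mean per-token NLL."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A sequence where the model assigns every token probability 1/8
# has per-token NLL ln(8) and therefore perplexity 8.
nlls = [math.log(8.0)] * SEQ_LEN
print(sequence_ppl(nlls))  # 8.0 (up to float rounding)
```

Aggregating per-sequence PPLs (rather than pooling all tokens) is what makes the mean-vs-median distinction in the tables below meaningful.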
1. Qwen3.5-35B-A3B — Scaling Study (PPL + MMLU)
MoE architecture, 35B total parameters, 3B active. Budget-constrained MCKP allocation at 5 levels from 21–72 GB. MMLU: 14,015 questions, 57 subjects, 5-shot.
| Model | Size | Avg bits | Bit distribution | PPL (mean) | PPL (median) | Med vs BF16 | MMLU (5-shot) |
|---|---|---|---|---|---|---|---|
| Uniform 4-bit | 20.4 GB | 4.0 | 100% 4b | 6.835 | 6.764 | +4.2% | 70.35% |
| MINT 21 GB | 21 GB | 4.5 | 28% 3b, 60% 4b, 8% 8b, 4% 16b | 6.713 | 6.630 | +2.1% | 70.22% |
| MINT 30 GB | 30 GB | 6.5 | 54% 4b, 38% 8b, 8% 16b | 6.627 | 6.535 | +0.6% | 70.91% |
| MINT 37 GB | 37 GB | 8.0 | 16% 4b, 75% 8b, 8% 16b | 6.582 | 6.505 | +0.2% | 71.71% |
| MINT 51 GB | 51 GB | 11.6 | 55% 8b, 45% 16b | 6.597 | 6.518 | +0.4% | 71.51% |
| BF16 | 72 GB | 16.0 | 100% 16b | 6.586 | 6.494 | baseline | ~72–73%* |
MINT 21 GB MMLU verified by an independent rerun (70.22% both times). *BF16 MMLU estimated from Qwen's official benchmarks. MINT 37 GB matches BF16 PPL (+0.2%) at 51% of its size while posting the peak MMLU score; returns diminish above 37 GB.
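The budget-constrained MCKP allocation above can be sketched in a few lines: each tensor offers a discrete menu of (bit-width, size, predicted error) options, and the allocator picks exactly one option per tensor to minimise total predicted error under the size budget. The tensor names, error numbers, and brute-force search below are illustrative only; MINT's real cost model and solver differ.

```python
# Multiple-choice knapsack sketch: pick exactly one bit-width per tensor
# to minimise total predicted error under a size budget.
from itertools import product

# (bits, size_in_units, predicted_error) options per tensor; toy numbers.
tensors = [
    [(3, 3, 0.9), (4, 4, 0.30), (8, 8, 0.05)],    # e.g. attention proj
    [(3, 6, 1.5), (4, 8, 0.50), (8, 16, 0.08)],   # e.g. MLP up-projection
    [(4, 4, 0.20), (8, 8, 0.03), (16, 16, 0.0)],  # e.g. embedding
]

def allocate(tensors, budget):
    """Brute force over all option combinations: fine for a sketch;
    a dynamic program handles real tensor counts."""
    best = None
    for choice in product(*tensors):
        size = sum(c[1] for c in choice)
        err = sum(c[2] for c in choice)
        if size <= budget and (best is None or err < best[0]):
            best = (err, [c[0] for c in choice])
    return best

print(allocate(tensors, budget=20))
```

Sweeping `budget` reproduces the pattern in the table above: each budget level yields a different bit distribution from the same per-tensor analysis.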
2. Qwen3-30B-A3B — Multi-Benchmark (PPL + ARC-C + Winogrande)
MoE architecture, 30B total, 3B active. ARC-Challenge: 1,172 questions (25-shot). Winogrande: 7,557 questions (5-shot).
| Model | Size | PPL | ARC-C | Winogrande |
|---|---|---|---|---|
| Min-safe (3-bit) | 13.4 GB | 10.494 | 67.24% | 71.35% |
| MINT (default) | 16.3 GB | 8.972 | 69.45% | 69.46% |
| Uniform 4-bit | 16.0 GB | 9.098 | 69.54% | 70.32% |
| MINT +30% | 21.2 GB | 8.756 | 69.45% | 70.09% |
| MINT +60% | 26.1 GB | 8.771 | 69.80% | 70.48% |
| 8-bit ref | 29.3 GB | 8.765 | 69.88% | 70.88% |
ARC-C spans only 2.7 pp from min-safe to 8-bit ref, while PPL spans 18% — confirming downstream benchmarks saturate at ~4-bit quality.
3. Mixtral-8x7B — Multi-Benchmark
MoE architecture, 47B total, 13B active. MINT outperforms uniform 4-bit on both ARC-C (+0.5 pp) and Winogrande (+1.3 pp) at identical 24.5 GB size.
| Model | Size | PPL | ARC-C | Winogrande |
|---|---|---|---|---|
| Min-safe (3-bit) | 19.4 GB | 4.926 | 67.92% | 80.19% |
| MINT (default) | 24.5 GB | 4.264 | 70.48% | 81.37% |
| Uniform 4-bit | 24.5 GB | 4.387 | 69.97% | 80.11% |
| MINT +30% | 31.9 GB | 4.218 | 70.99% | 81.69% |
| MINT +60% | 39.2 GB | 4.198 | 71.16% | 82.08% |
| 8-bit ref | 46.2 GB | 4.174 | 71.33% | 81.93% |
4. Qwen3-8B — Dense Model (PPL + ARC-C + HellaSwag + MMLU)
Dense 8B model. HellaSwag: 10,042 questions (10-shot). MMLU: 14,015 questions (5-shot). MINT achieves identical MMLU to BF16 despite 60% size reduction.
| Model | Size | PPL | ARC-C | HellaSwag | MMLU |
|---|---|---|---|---|---|
| BF16 | 15.3 GB | 9.727 | 44.62% | 60.04% | 22.95% |
| MINT | 6.1 GB | 10.097 | 43.43% | 58.16% | 22.95% |
| Uniform 4-bit | 4.1 GB | 10.249 | 42.83% | 58.14% | — |
MINT outperforms uniform 4-bit on ARC-C (+0.6 pp) and matches on HellaSwag, with identical MMLU to BF16 at 60% size reduction.
5. GLM-4.7-Flash — Min-Bits Analysis
Dense 30B model. Tests the impact of the --min-bits flag on quality. This model has extreme outlier sequences that make mean PPL unreliable — median is the correct metric.
| Model | Size | Avg bits | Bit distribution | PPL (mean) | PPL (median) | Med vs BF16 |
|---|---|---|---|---|---|---|
| Uniform 4-bit | 16.9 GB | 4.0 | 100% 4b | 14.749 | 10.076 | +19.0% |
| MINT (min-bits=3) | 15.1 GB | 4.0 | 56% 3b, 36% 4b, 5% 8b, 3% 16b | 10.228 | 9.310 | +9.9% |
| MINT (min-bits=4) | 16.5 GB | 4.4 | 97% 4b, 3% 16b | 12.541 | 8.723 | +3.0% |
| BF16 | 58.0 GB | 16.0 | 100% 16b | 11.516 | 8.470 | baseline |
min-bits=4 wins on median PPL (8.723 vs 9.310) despite min-bits=3 winning on mean PPL (10.228 vs 12.541). Mean PPL is unreliable on this model — BF16 mean (11.5) appears worse than quantized mean (10.2). Median is the correct metric.
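The inversion is easy to reproduce with toy numbers: a handful of extreme outlier sequences can make the better model's mean PPL look worse while the medians still rank the models correctly. The values below are illustrative, not the GLM measurements:

```python
import statistics

# Per-sequence perplexities for two hypothetical models. Model A is
# better on almost every sequence, but a few outlier sequences blow
# up its mean; the same failure mode described above.
model_a = [8.0] * 120 + [200.0] * 8   # better typical quality, rare outliers
model_b = [10.0] * 128                # uniformly mediocre

print(statistics.mean(model_a), statistics.median(model_a))
print(statistics.mean(model_b), statistics.median(model_b))
```

Model A's mean (20.0) exceeds Model B's (10.0), inverting the ranking, while the medians (8.0 vs 10.0) order the models correctly.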
6. Llama-4-Scout — MoE 109B
109B MoE with 16 routed experts + 1 shared expert per layer. Too large for BF16 on any consumer device (203 GB). MINT enables deployment across multiple hardware tiers from a single analysis pass.
| Model | Size | PPL | vs Uniform 4-bit |
|---|---|---|---|
| Uniform 4-bit | 56.9 GB | 7.899 | baseline |
| MINT (58 GB) | 58.0 GB | 7.628 | −3.4% |
| MINT (163 GB) | 163.2 GB | 7.359 | −6.8% |
| MINT min-safe | 46.9 GB | 8.675 | +9.8% |
| MINT no-safety (2-bit) | 34.6 GB | 23.577 | +198% |
SQNR safety veto at 9 dB prevents catastrophic 2-bit allocation. The quality cliff is between 2-bit and 3-bit (PPL triples), not between 3-bit and 4-bit (+12.5%).
7. GPTQ Head-to-Head: Data-Free vs Calibration
MINT vs GPTQ (calibration-based) at exactly matched model sizes. MINT outperforms on all three MoE families despite being entirely data-free.
| Model | Size | GPTQ PPL (median) | MINT PPL (median) | Delta |
|---|---|---|---|---|
| Qwen3-30B-A3B | 16.0 GB | 9.160 | 8.959 | −2.2% |
| Qwen2-57B-A14B | 30.0 GB | 6.396 | 6.335 | −1.0% |
| Mixtral-8x7B | 24.5 GB | 4.640 | 4.426 | −4.6% |
GPTQ relies on Hessian-based calibration over activation data; MINT uses none, yet still wins at every matched size, so calibration buys no advantage on these MoE families.
8. Throughput: Tokens Per Second
Apple M2 Ultra, MLX 0.31.1, gen_len=256, median of 3 runs. The question: does mixed-precision cost speed?
| Model | Method | Size | TPS p=128 | TPS p=512 | TPS p=2048 | TTFT p=2048 |
|---|---|---|---|---|---|---|
| Qwen3-30B | Uniform | 17.2 GB | 83.3 | 80.6 | 78.3 | 1,498 ms |
| Qwen3-30B | MINT | 17.2 GB | 80.2 | 76.9 | 75.3 | 1,509 ms |
| Qwen3-30B | Delta | — | −3.7% | −4.6% | −3.9% | +0.7% |
| GLM-4.7-Flash | Uniform | 16.9 GB | 55.5 | 55.3 | 54.7 | 1,849 ms |
| GLM-4.7-Flash | MINT | 16.7 GB | 54.4 | 54.2 | 54.5 | 2,440 ms |
| GLM-4.7-Flash | Delta | — | −2.0% | −2.0% | −0.4% | +32.0% |
Dense models: 2–4% generation overhead from mixed group sizes, negligible prefill impact on Qwen3-30B. GLM-4.7-Flash shows higher TTFT overhead due to a larger fraction of tensors at non-default configurations. Throughput cost is modest relative to the quality gains.
9. MLX Kernel Benchmark: Group Size Performance
MINT assigns group size 32 to 85% of tensors. Does the MLX quantized matmul kernel handle this efficiently? We benchmarked across MLX versions.
| Scenario | MLX 0.29.3 | MLX 0.31.1 | Status |
|---|---|---|---|
| Generation g32/g128 | 1.07–1.14x penalty | 1.01–1.10x | Mostly fixed |
| Generation g64/g128 | ~1.05x | 1.00x | Fixed |
| Prefill g32/g128 | 1.8–2.2x | 1.00x | Fully fixed |
| Prefill g64/g128 | ~1.3x | 1.00x | Fixed |
Upstream fixes in MLX PRs #1861 and #2031 resolved the group_size performance regression. MINT’s preference for g32 is now viable at full speed.
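For context on what `group_size` means: it controls how many consecutive weights share one scale/offset pair. A pure-Python sketch of group-wise affine quantization (an illustration of the scheme, not the MLX kernel itself):

```python
def quantize(w, bits=4, group_size=32):
    """Group-wise affine quantization: every `group_size` consecutive
    weights share one (scale, offset) pair. g32 stores 4x more metadata
    per weight than g128 but tracks local dynamic range more tightly."""
    assert len(w) % group_size == 0
    groups = []
    for i in range(0, len(w), group_size):
        g = w[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / (2 ** bits - 1) or 1.0  # avoid /0 on flat groups
        groups.append(([round((x - lo) / scale) for x in g], scale, lo))
    return groups

def dequantize(groups):
    return [q * scale + lo for qs, scale, lo in groups for q in qs]

w = [((i * 37) % 101 - 50) / 50 for i in range(256)]  # deterministic toy weights
for g in (32, 128):
    err = max(abs(a - b) for a, b in zip(w, dequantize(quantize(w, 4, g))))
    print(f"g{g}: max reconstruction error {err:.4f}")
```

The extra scales/offsets are why smaller groups historically carried a kernel-speed penalty: more metadata must be fetched per matmul tile, which is the overhead the MLX fixes above eliminated.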
10. Key Findings
MINT consistently outperforms uniform quantization and calibration-based GPTQ across all tested models:
- vs Uniform 4-bit: −2.0% to −3.4% median PPL at matched sizes
- vs GPTQ: −1.0% to −4.6% median PPL at matched sizes (data-free vs calibration-based)
- Downstream benchmarks preserved: All MINT-vs-uniform differences within standard error bounds across ARC-C, Winogrande, HellaSwag, MMLU (40,786 total questions)
- Budget-targeted: Single analysis produces deployments from 46.9 GB to 163 GB on Llama-4-Scout (109B)
- MMLU saturates at ~37 GB (8 avg bits) despite PPL continuing to improve — confirming PPL as the primary optimisation metric
- Mean PPL is unreliable: On GLM-4.7-Flash, mean gives inverted quality ordering; median gives correct ranking
- min-bits=4 (default) is optimal: 3-bit allocations save space but cost 6.7% median PPL on GLM-4.7-Flash
- Throughput overhead: 2–4% generation on dense models, negligible prefill
- Reproducibility: MINT 21 GB MMLU verified by independent rerun (70.22% both times)
All evaluations conducted on Apple M2 Ultra (192 GB unified memory). Perplexity: WikiText-2 test split, seq_len=2048, 128 sequences, seed 42. Downstream benchmarks: lm-evaluation-harness via MLX backend. ARC-Challenge (25-shot), Winogrande (5-shot), HellaSwag (10-shot), MMLU (5-shot, 14,015 questions). MLX 0.31.1, mlx_lm 0.30.4, Python 3.12.0. Full paper: baa.ai/articles/24-mint-paper.html. Code: github.com/baa-ai/MINT.