Complete evaluation across 7 model families, 5 benchmark suites, and over 40,000 questions. MINT outperforms uniform quantization and calibration-based GPTQ on every model tested — while requiring zero GPUs and zero calibration data.
Claims about quantization quality are cheap. Numbers are not. We evaluated MINT across 7 model families spanning 8B to 109B parameters, dense and MoE architectures, using perplexity (WikiText-2), ARC-Challenge, Winogrande, HellaSwag, and MMLU. Every benchmark, every model, every number is reported below.
All experiments run on an Apple M2 Ultra (192 GB unified memory). Perplexity uses WikiText-2 test split, sequence length 2048, seed 42. Downstream benchmarks use lm-evaluation-harness via the MLX backend.
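Concretely, perplexity here is per-sequence: each 2048-token window gets its own PPL (exp of its mean per-token negative log-likelihood), and the mean/median columns below aggregate those per-sequence values. A minimal sketch of the metric, assuming this aggregation; `sequence_ppl` is an illustrative helper, not MINT's code:

```python
import math

SEQ_LEN = 2048  # evaluation sequence length, per the protocol above

def sequence_ppl(token_nlls):
    """Perplexity of one sequence: exp of the mean per-token NLL."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A sequence where the model assigns every token probability 1/8
# has per-token NLL ln(8) and therefore perplexity 8.
nlls = [math.log(8.0)] * SEQ_LEN
print(sequence_ppl(nlls))  # 8.0 (up to float rounding)
```

Aggregating per-sequence PPLs (rather than pooling all tokens) is what makes the mean-vs-median distinction in the tables below meaningful.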
1. Qwen3.5-35B-A3B — Scaling Study (PPL + MMLU)
MoE architecture, 35B total parameters, 3B active. Budget-constrained MCKP allocation at 5 levels from 21–72 GB. MMLU: 14,015 questions, 57 subjects, 5-shot.
| Model | Size | Avg bits | Bit distribution | PPL (mean) | PPL (median) | Med vs BF16 | MMLU (5-shot) |
|---|---|---|---|---|---|---|---|
| Uniform 4-bit | 20.4 GB | 4.0 | 100% 4b | 6.835 | 6.764 | +4.2% | 70.35% |
| MINT 21 GB | 21 GB | 4.5 | 28% 3b, 60% 4b, 8% 8b, 4% 16b | 6.713 | 6.630 | +2.1% | 70.22% |
| MINT 30 GB | 30 GB | 6.5 | 54% 4b, 38% 8b, 8% 16b | 6.627 | 6.535 | +0.6% | 70.91% |
| MINT 37 GB | 37 GB | 8.0 | 16% 4b, 75% 8b, 8% 16b | 6.582 | 6.505 | +0.2% | 71.71% |
| MINT 51 GB | 51 GB | 11.6 | 55% 8b, 45% 16b | 6.597 | 6.518 | +0.4% | 71.51% |
| BF16 | 72 GB | 16.0 | 100% 16b | 6.586 | 6.494 | baseline | ~72–73%* |
MINT 21 GB MMLU verified by an independent rerun (70.22% both times). *BF16 MMLU estimated from Qwen's official benchmarks. MINT 37 GB matches BF16 PPL (+0.2%) at 51% of its size while posting the peak MMLU score; returns diminish above 37 GB.
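The budget-constrained MCKP allocation above can be sketched in a few lines: each tensor offers a discrete menu of (bit-width, size, predicted error) options, and the allocator picks exactly one option per tensor to minimise total predicted error under the size budget. The tensor names, error numbers, and brute-force search below are illustrative only; MINT's real cost model and solver differ.

```python
# Multiple-choice knapsack sketch: pick exactly one bit-width per tensor
# to minimise total predicted error under a size budget.
from itertools import product

# (bits, size_in_units, predicted_error) options per tensor; toy numbers.
tensors = [
    [(3, 3, 0.9), (4, 4, 0.30), (8, 8, 0.05)],    # e.g. attention proj
    [(3, 6, 1.5), (4, 8, 0.50), (8, 16, 0.08)],   # e.g. MLP up-projection
    [(4, 4, 0.20), (8, 8, 0.03), (16, 16, 0.0)],  # e.g. embedding
]

def allocate(tensors, budget):
    """Brute force over all option combinations: fine for a sketch;
    a dynamic program handles real tensor counts."""
    best = None
    for choice in product(*tensors):
        size = sum(c[1] for c in choice)
        err = sum(c[2] for c in choice)
        if size <= budget and (best is None or err < best[0]):
            best = (err, [c[0] for c in choice])
    return best

print(allocate(tensors, budget=20))
```

Sweeping `budget` reproduces the pattern in the table above: each budget level yields a different bit distribution from the same per-tensor analysis.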
2. Qwen3-30B-A3B — Multi-Benchmark (PPL + ARC-C + Winogrande)
MoE architecture, 30B total, 3B active. ARC-Challenge: 1,172 questions (25-shot). Winogrande: 7,557 questions (5-shot).
| Model | Size | PPL | ARC-C | Winogrande |
|---|---|---|---|---|
| Min-safe (3-bit) | 13.4 GB | 10.494 | 67.24% | 71.35% |
| MINT (default) | 16.3 GB | 8.972 | 69.45% | 69.46% |
| Uniform 4-bit | 16.0 GB | 9.098 | 69.54% | 70.32% |
| MINT +30% | 21.2 GB | 8.756 | 69.45% | 70.09% |
| MINT +60% | 26.1 GB | 8.771 | 69.80% | 70.48% |
| 8-bit ref | 29.3 GB | 8.765 | 69.88% | 70.88% |
ARC-C spans only 2.7 pp from min-safe to 8-bit ref, while PPL spans 18% — confirming downstream benchmarks saturate at ~4-bit quality.
3. Mixtral-8x7B — Multi-Benchmark
MoE architecture, 47B total, 13B active. MINT outperforms uniform 4-bit on both ARC-C (+0.5 pp) and Winogrande (+1.3 pp) at identical 24.5 GB size.
| Model | Size | PPL | ARC-C | Winogrande |
|---|---|---|---|---|
| Min-safe (3-bit) | 19.4 GB | 4.926 | 67.92% | 80.19% |
| MINT (default) | 24.5 GB | 4.264 | 70.48% | 81.37% |
| Uniform 4-bit | 24.5 GB | 4.387 | 69.97% | 80.11% |
| MINT +30% | 31.9 GB | 4.218 | 70.99% | 81.69% |
| MINT +60% | 39.2 GB | 4.198 | 71.16% | 82.08% |
| 8-bit ref | 46.2 GB | 4.174 | 71.33% | 81.93% |
4. Qwen3-8B — Dense Model (PPL + ARC-C + HellaSwag + MMLU)
Dense 8B model. HellaSwag: 10,042 questions (10-shot). MMLU: 14,015 questions (5-shot). MINT achieves identical MMLU to BF16 despite 60% size reduction.
| Model | Size | PPL | ARC-C | HellaSwag | MMLU |
|---|---|---|---|---|---|
| BF16 | 15.3 GB | 9.727 | 44.62% | 60.04% | 22.95% |
| MINT | 6.1 GB | 10.097 | 43.43% | 58.16% | 22.95% |
| Uniform 4-bit | 4.1 GB | 10.249 | 42.83% | 58.14% | — |
MINT outperforms uniform 4-bit on ARC-C (+0.6 pp) and matches on HellaSwag, with identical MMLU to BF16 at 60% size reduction.
5. GLM-4.7-Flash — Min-Bits Analysis
Dense 30B model. Tests the impact of the --min-bits flag on quality. This model has extreme outlier sequences that make mean PPL unreliable — median is the correct metric.
| Model | Size | Avg bits | Bit distribution | PPL (mean) | PPL (median) | Med vs BF16 |
|---|---|---|---|---|---|---|
| Uniform 4-bit | 16.9 GB | 4.0 | 100% 4b | 14.749 | 10.076 | +19.0% |
| MINT (min-bits=3) | 15.1 GB | 4.0 | 56% 3b, 36% 4b, 5% 8b, 3% 16b | 10.228 | 9.310 | +9.9% |
| MINT (min-bits=4) | 16.5 GB | 4.4 | 97% 4b, 3% 16b | 12.541 | 8.723 | +3.0% |
| BF16 | 58.0 GB | 16.0 | 100% 16b | 11.516 | 8.470 | baseline |
min-bits=4 wins on median PPL (8.723 vs 9.310) despite min-bits=3 winning on mean PPL (10.228 vs 12.541). Mean PPL is unreliable on this model — BF16 mean (11.5) appears worse than quantized mean (10.2). Median is the correct metric.
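The inversion is easy to reproduce with toy numbers: a handful of extreme outlier sequences can make the better model's mean PPL look worse while the medians still rank the models correctly. The values below are illustrative, not the GLM measurements:

```python
import statistics

# Per-sequence perplexities for two hypothetical models. Model A is
# better on almost every sequence, but a few outlier sequences blow
# up its mean; the same failure mode described above.
model_a = [8.0] * 120 + [200.0] * 8   # better typical quality, rare outliers
model_b = [10.0] * 128                # uniformly mediocre

print(statistics.mean(model_a), statistics.median(model_a))
print(statistics.mean(model_b), statistics.median(model_b))
```

Model A's mean (20.0) exceeds Model B's (10.0), inverting the ranking, while the medians (8.0 vs 10.0) order the models correctly.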
6. Llama-4-Scout — MoE 109B
109B MoE with 16 routed experts + 1 shared expert per layer. Too large for BF16 on any consumer device (203 GB). MINT enables deployment across multiple hardware tiers from a single analysis pass.
| Model | Size | PPL | vs Uniform 4-bit |
|---|---|---|---|
| Uniform 4-bit | 56.9 GB | 7.899 | baseline |
| MINT (58 GB) | 58.0 GB | 7.628 | −3.4% |
| MINT (163 GB) | 163.2 GB | 7.359 | −6.8% |
| MINT min-safe | 46.9 GB | 8.675 | +9.8% |
| MINT no-safety (2-bit) | 34.6 GB | 23.577 | +198% |
SQNR safety veto at 9 dB prevents catastrophic 2-bit allocation. The quality cliff is between 2-bit and 3-bit (PPL triples), not between 3-bit and 4-bit (+12.5%).
7. GPTQ Head-to-Head: Data-Free vs Calibration
MINT vs GPTQ (calibration-based) at exactly matched model sizes. MINT outperforms on all three MoE families despite being entirely data-free.
| Model | Size | GPTQ PPL (median) | MINT PPL (median) | Delta |
|---|---|---|---|---|
| Qwen3-30B-A3B | 16.0 GB | 9.160 | 8.959 | −2.2% |
| Qwen2-57B-A14B | 30.0 GB | 6.396 | 6.335 | −1.0% |
| Mixtral-8x7B | 24.5 GB | 4.640 | 4.426 | −4.6% |
GPTQ relies on Hessian-based calibration over activation data; MINT uses none, yet still wins at every matched size, so calibration buys no advantage on these MoE families.
8. Throughput: Tokens Per Second
Apple M2 Ultra, MLX 0.31.1, gen_len=256, median of 3 runs. The question: does mixed-precision cost speed?
| Model | Method | Size | TPS p=128 | TPS p=512 | TPS p=2048 | TTFT p=2048 |
|---|---|---|---|---|---|---|
| Qwen3-30B | Uniform | 17.2 GB | 83.3 | 80.6 | 78.3 | 1,498 ms |
| Qwen3-30B | MINT | 17.2 GB | 80.2 | 76.9 | 75.3 | 1,509 ms |
| Qwen3-30B | Delta | — | −3.7% | −4.6% | −3.9% | +0.7% |
| GLM-4.7-Flash | Uniform | 16.9 GB | 55.5 | 55.3 | 54.7 | 1,849 ms |
| GLM-4.7-Flash | MINT | 16.7 GB | 54.4 | 54.2 | 54.5 | 2,440 ms |
| GLM-4.7-Flash | Delta | — | −2.0% | −2.0% | −0.4% | +32.0% |
Dense models: 2–4% generation overhead from mixed group sizes, negligible prefill impact on Qwen3-30B. GLM-4.7-Flash shows higher TTFT overhead due to a larger fraction of tensors at non-default configurations. Throughput cost is modest relative to the quality gains.
9. MLX Kernel Benchmark: Group Size Performance
MINT assigns group size 32 to 85% of tensors. Does the MLX quantized matmul kernel handle this efficiently? We benchmarked across MLX versions.
| Scenario | MLX 0.29.3 | MLX 0.31.1 | Status |
|---|---|---|---|
| Generation g32/g128 | 1.07–1.14x penalty | 1.01–1.10x | Mostly fixed |
| Generation g64/g128 | ~1.05x | 1.00x | Fixed |
| Prefill g32/g128 | 1.8–2.2x | 1.00x | Fully fixed |
| Prefill g64/g128 | ~1.3x | 1.00x | Fixed |
Upstream fixes in MLX PRs #1861 and #2031 resolved the group_size performance regression. MINT’s preference for g32 is now viable at full speed.
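For context on what `group_size` means: it controls how many consecutive weights share one scale/offset pair. A pure-Python sketch of group-wise affine quantization (an illustration of the scheme, not the MLX kernel itself):

```python
def quantize(w, bits=4, group_size=32):
    """Group-wise affine quantization: every `group_size` consecutive
    weights share one (scale, offset) pair. g32 stores 4x more metadata
    per weight than g128 but tracks local dynamic range more tightly."""
    assert len(w) % group_size == 0
    groups = []
    for i in range(0, len(w), group_size):
        g = w[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / (2 ** bits - 1) or 1.0  # avoid /0 on flat groups
        groups.append(([round((x - lo) / scale) for x in g], scale, lo))
    return groups

def dequantize(groups):
    return [q * scale + lo for qs, scale, lo in groups for q in qs]

w = [((i * 37) % 101 - 50) / 50 for i in range(256)]  # deterministic toy weights
for g in (32, 128):
    err = max(abs(a - b) for a, b in zip(w, dequantize(quantize(w, 4, g))))
    print(f"g{g}: max reconstruction error {err:.4f}")
```

The extra scales/offsets are why smaller groups historically carried a kernel-speed penalty: more metadata must be fetched per matmul tile, which is the overhead the MLX fixes above eliminated.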
10. Key Findings
MINT consistently outperforms uniform quantization and calibration-based GPTQ across all tested models:
- vs Uniform 4-bit: −2.0% to −3.4% median PPL at matched sizes
- vs GPTQ: −1.0% to −4.6% median PPL at matched sizes (data-free vs calibration-based)
- Downstream benchmarks preserved: All MINT-vs-uniform differences within standard error bounds across ARC-C, Winogrande, HellaSwag, MMLU (40,786 total questions)
- Budget-targeted: Single analysis produces deployments from 46.9 GB to 163 GB on Llama-4-Scout (109B)
- MMLU saturates at ~37 GB (8 avg bits) despite PPL continuing to improve — confirming PPL as the primary optimisation metric
- Mean PPL is unreliable: On GLM-4.7-Flash, mean gives inverted quality ordering; median gives correct ranking
- min-bits=4 (default) is optimal: 3-bit allocations save space but cost 6.7% median PPL on GLM-4.7-Flash
- Throughput overhead: 2–4% generation on dense models, negligible prefill
- Reproducibility: MINT 21 GB MMLU verified by independent rerun (70.22% both times)
All evaluations conducted on Apple M2 Ultra (192 GB unified memory). Perplexity: WikiText-2 test split, seq_len=2048, 128 sequences, seed 42. Downstream benchmarks: lm-evaluation-harness via MLX backend. ARC-Challenge (25-shot), Winogrande (5-shot), HellaSwag (10-shot), MMLU (5-shot, 14,015 questions). MLX 0.31.1, mlx_lm 0.30.4, Python 3.12.0. Full paper: baa.ai/articles/24-mint-paper.html. Code: github.com/baa-ai/MINT.