
MINT Benchmark Results: 7 Models, 40,000+ Questions, One Winner

March 2026 · Black Sheep AI Research

Complete evaluation across 7 model families, 5 benchmark suites, and over 40,000 questions. MINT outperforms uniform quantization and calibration-based GPTQ on every model tested — while requiring zero GPUs and zero calibration data.

Claims about quantization quality are cheap. Numbers are not. We evaluated MINT across 7 model families spanning 8B to 109B parameters, dense and MoE architectures, using perplexity (WikiText-2), ARC-Challenge, Winogrande, HellaSwag, and MMLU. Every benchmark, every model, every number is reported below.

All experiments were run on an Apple M2 Ultra (192 GB unified memory). Perplexity uses the WikiText-2 test split, sequence length 2048, seed 42. Downstream benchmarks use lm-evaluation-harness via the MLX backend.
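For reference, perplexity here is the exponential of the mean per-token negative log-likelihood. A minimal numpy sketch (illustrative only, not the harness code):

```python
import numpy as np

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLLs in nats."""
    return float(np.exp(np.mean(token_nlls)))

# Toy example: a 2048-token sequence where the model assigns p = 0.5
# to every token, so each per-token NLL is ln(2).
nlls = np.full(2048, np.log(2.0))
print(perplexity(nlls))  # -> 2.0
```

Per-sequence perplexities computed this way are then aggregated as a mean and a median across the evaluation sequences.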

1. Qwen3.5-35B-A3B — Scaling Study (PPL + MMLU)

MoE architecture, 35B total parameters, 3B active. Budget-constrained MCKP allocation at 5 levels from 21–72 GB. MMLU: 14,015 questions, 57 subjects, 5-shot.

| Model | Size | Avg bits | Bit distribution | PPL (mean) | PPL (median) | Med vs BF16 | MMLU (5-shot) |
|---|---|---|---|---|---|---|---|
| Uniform 4-bit | 20.4 GB | 4.0 | 100% 4b | 6.835 | 6.764 | +4.2% | 70.35% |
| MINT 21 GB | 21 GB | 4.5 | 28% 3b, 60% 4b, 8% 8b, 4% 16b | 6.713 | 6.630 | +2.1% | 70.22% |
| MINT 30 GB | 30 GB | 6.5 | 54% 4b, 38% 8b, 8% 16b | 6.627 | 6.535 | +0.6% | 70.91% |
| MINT 37 GB | 37 GB | 8.0 | 16% 4b, 75% 8b, 8% 16b | 6.582 | 6.505 | +0.2% | 71.71% |
| MINT 51 GB | 51 GB | 11.6 | 55% 8b, 45% 16b | 6.597 | 6.518 | +0.4% | 71.51% |
| BF16 | 72 GB | 16.0 | 100% 16b | 6.586 | 6.494 | baseline | ~72–73%* |

The MINT 21 GB MMLU score was verified by an independent rerun (70.22% both times). *BF16 MMLU estimated from Qwen's official benchmarks. MINT 37 GB matches BF16 PPL (+0.2% median) at 51% of the size with peak MMLU; returns diminish above 37 GB.
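For intuition, budget-constrained MCKP allocation can be sketched as a greedy upgrade loop: start every tensor at the minimum bit-width, then repeatedly spend budget on the upgrade with the best error reduction per extra byte. The error model below (`sensitivity / 2**bits`) and the tensor sensitivities are invented for illustration; MINT's actual objective and solver are described in the paper.

```python
import heapq

def allocate_bits(tensors, budget_bytes, levels=(3, 4, 8, 16)):
    """Greedy MCKP sketch. `tensors` maps name -> (n_params, sensitivity);
    returns a bit-width per tensor that fits within `budget_bytes`."""
    err = lambda sens, bits: sens / 2 ** bits          # synthetic error model
    size = lambda n, bits: n * bits / 8                # bytes at a bit-width
    choice = {name: 0 for name in tensors}             # index into `levels`
    used = sum(size(n, levels[0]) for n, _ in tensors.values())
    heap = []

    def push(name):
        i = choice[name]
        if i + 1 < len(levels):
            n, s = tensors[name]
            gain = err(s, levels[i]) - err(s, levels[i + 1])
            extra = size(n, levels[i + 1]) - size(n, levels[i])
            heapq.heappush(heap, (-gain / extra, name, i))

    for name in tensors:
        push(name)
    while heap:
        _, name, i = heapq.heappop(heap)
        if i != choice[name]:                          # stale heap entry
            continue
        n, _ = tensors[name]
        extra = size(n, levels[i + 1]) - size(n, levels[i])
        if used + extra > budget_bytes:                # upgrade doesn't fit
            continue
        choice[name] = i + 1
        used += extra
        push(name)                                     # consider next level
    return {name: levels[i] for name, i in choice.items()}

# Hypothetical example: a sensitive attention tensor vs. a tolerant MLP tensor.
tensors = {"attn": (1_000_000, 8.0), "mlp": (4_000_000, 1.0)}
print(allocate_bits(tensors, budget_bytes=4_000_000))
# The high-sensitivity tensor ends up at a higher bit-width.
```

This mirrors the pattern visible in the bit distributions above: sensitive tensors are pushed to 8b/16b while tolerant ones stay at 3b/4b.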

2. Qwen3-30B-A3B — Multi-Benchmark (PPL + ARC-C + Winogrande)

MoE architecture, 30B total, 3B active. ARC-Challenge: 1,172 questions (25-shot). Winogrande: 7,557 questions (5-shot).

| Model | Size | PPL | ARC-C | Winogrande |
|---|---|---|---|---|
| Min-safe (3-bit) | 13.4 GB | 10.494 | 67.24% | 71.35% |
| MINT (default) | 16.3 GB | 8.972 | 69.45% | 69.46% |
| Uniform 4-bit | 16.0 GB | 9.098 | 69.54% | 70.32% |
| MINT +30% | 21.2 GB | 8.756 | 69.45% | 70.09% |
| MINT +60% | 26.1 GB | 8.771 | 69.80% | 70.48% |
| 8-bit ref | 29.3 GB | 8.765 | 69.88% | 70.88% |

ARC-C spans only 2.7 pp from min-safe to 8-bit ref, while PPL spans 18% — confirming downstream benchmarks saturate at ~4-bit quality.

3. Mixtral-8x7B — Multi-Benchmark

MoE architecture, 47B total, 13B active. MINT outperforms uniform 4-bit on both ARC-C (+0.5 pp) and Winogrande (+1.3 pp) at identical 24.5 GB size.

| Model | Size | PPL | ARC-C | Winogrande |
|---|---|---|---|---|
| Min-safe (3-bit) | 19.4 GB | 4.926 | 67.92% | 80.19% |
| MINT (default) | 24.5 GB | 4.264 | 70.48% | 81.37% |
| Uniform 4-bit | 24.5 GB | 4.387 | 69.97% | 80.11% |
| MINT +30% | 31.9 GB | 4.218 | 70.99% | 81.69% |
| MINT +60% | 39.2 GB | 4.198 | 71.16% | 82.08% |
| 8-bit ref | 46.2 GB | 4.174 | 71.33% | 81.93% |

4. Qwen3-8B — Dense Model (PPL + ARC-C + HellaSwag + MMLU)

Dense 8B model. HellaSwag: 10,042 questions (10-shot). MMLU: 14,015 questions (5-shot).

| Model | Size | PPL | ARC-C | HellaSwag | MMLU |
|---|---|---|---|---|---|
| BF16 | 15.3 GB | 9.727 | 44.62% | 60.04% | 22.95% |
| MINT | 6.1 GB | 10.097 | 43.43% | 58.16% | 22.95% |
| Uniform 4-bit | 4.1 GB | 10.249 | 42.83% | 58.14% | — |

MINT outperforms uniform 4-bit on ARC-C (+0.6 pp) and matches on HellaSwag, with identical MMLU to BF16 at 60% size reduction.

5. GLM-4.7-Flash — Min-Bits Analysis

Dense 30B model. Tests the impact of the --min-bits flag on quality. This model has extreme outlier sequences that make mean PPL unreliable — median is the correct metric.

| Model | Size | Avg bits | Bit distribution | PPL (mean) | PPL (median) | Med vs BF16 |
|---|---|---|---|---|---|---|
| Uniform 4-bit | 16.9 GB | 4.0 | 100% 4b | 14.749 | 10.076 | +19.0% |
| MINT (min-bits=3) | 15.1 GB | 4.0 | 56% 3b, 36% 4b, 5% 8b, 3% 16b | 10.228 | 9.310 | +9.9% |
| MINT (min-bits=4) | 16.5 GB | 4.4 | 97% 4b, 3% 16b | 12.541 | 8.723 | +3.0% |
| BF16 | 58.0 GB | 16.0 | 100% 16b | 11.516 | 8.470 | baseline |

min-bits=4 wins on median PPL (8.723 vs 9.310) even though min-bits=3 wins on mean PPL (10.228 vs 12.541). Mean PPL is unreliable on this model: the BF16 mean (11.5) appears worse than the quantized mean (10.2), a clear sign that a few outlier sequences dominate the average. Median is the correct metric here.
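The effect is easy to reproduce: a few pathological sequences drag the mean arbitrarily far from the bulk of the distribution while leaving the median essentially untouched. A synthetic numpy illustration (the outlier values are invented, not GLM measurements):

```python
import numpy as np

# Per-sequence perplexities for a hypothetical 128-sequence eval.
rng = np.random.default_rng(42)
ppls = rng.normal(8.5, 0.5, 128)     # typical sequences cluster near 8.5
ppls[:3] = [400.0, 250.0, 120.0]     # three pathological outlier sequences

print(float(np.mean(ppls)))          # dragged far above the bulk (~14)
print(float(np.median(ppls)))        # stays near 8.5
```

Three bad sequences out of 128 are enough to shift the mean by more than 60% while the median barely moves, which is exactly the pattern in the GLM-4.7-Flash table above.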

6. Llama-4-Scout — MoE 109B

109B MoE with 16 routed experts + 1 shared expert per layer. Too large for BF16 on any consumer device (203 GB). MINT enables deployment across multiple hardware tiers from a single analysis pass.

| Model | Size | PPL | vs Uniform 4-bit |
|---|---|---|---|
| Uniform 4-bit | 56.9 GB | 7.899 | baseline |
| MINT (58 GB) | 58.0 GB | 7.628 | −3.4% |
| MINT (163 GB) | 163.2 GB | 7.359 | −6.8% |
| MINT min-safe | 46.9 GB | 8.675 | +9.8% |
| MINT no-safety (2-bit) | 34.6 GB | 23.577 | +198% |

SQNR safety veto at 9 dB prevents catastrophic 2-bit allocation. The quality cliff is between 2-bit and 3-bit (PPL triples), not between 3-bit and 4-bit (+12.5%).
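A safety veto of this kind can be sketched as: fake-quantize each tensor, compute its SQNR in dB, and reject any bit-width that falls below the 9 dB floor. This is an illustrative reconstruction, not MINT's implementation; the symmetric per-group quantizer below is an assumption.

```python
import numpy as np

def sqnr_db(x, x_hat):
    """Signal-to-quantization-noise ratio: 10*log10(signal power / noise power)."""
    noise = x - x_hat
    return float(10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2)))

def fake_quantize(x, bits, group_size=32):
    """Symmetric per-group fake quantization (quantize, then dequantize)."""
    g = x.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    return (np.round(g / scale) * scale).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.normal(0, 1, (64, 128))          # stand-in weight tensor
for bits in (2, 3, 4):
    s = sqnr_db(w, fake_quantize(w, bits))
    verdict = "OK" if s >= 9.0 else "VETO"
    print(f"{bits}-bit: {s:.1f} dB -> {verdict}")
```

On Gaussian-like weights, 2-bit lands well below the 9 dB floor while 3-bit clears it comfortably, matching the cliff seen in the Llama-4-Scout table.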

7. GPTQ Head-to-Head: Data-Free vs Calibration

MINT vs GPTQ (calibration-based) at exactly matched model sizes. MINT outperforms on all three MoE families despite being entirely data-free.

| Model | Size | GPTQ PPL (median) | MINT PPL (median) | Delta |
|---|---|---|---|---|
| Qwen3-30B-A3B | 16.0 GB | 9.160 | 8.959 | −2.2% |
| Qwen2-57B-A14B | 30.0 GB | 6.396 | 6.335 | −1.0% |
| Mixtral-8x7B | 24.5 GB | 4.640 | 4.426 | −4.6% |

MINT is entirely data-free, requiring no calibration set; GPTQ relies on Hessian-based calibration with activation data. Despite that handicap, MINT wins at every matched size across all three MoE families.

8. Throughput: Tokens Per Second

Apple M2 Ultra, MLX 0.31.1, gen_len=256, median of 3 runs. The question: does mixed-precision cost speed?

| Model | Method | Size | TPS p=128 | TPS p=512 | TPS p=2048 | TTFT p=2048 |
|---|---|---|---|---|---|---|
| Qwen3-30B | Uniform | 17.2 GB | 83.3 | 80.6 | 78.3 | 1,498 ms |
| Qwen3-30B | MINT | 17.2 GB | 80.2 | 76.9 | 75.3 | 1,509 ms |
| Qwen3-30B | Delta | | −3.7% | −4.6% | −3.9% | +0.7% |
| GLM-4.7-Flash | Uniform | 16.9 GB | 55.5 | 55.3 | 54.7 | 1,849 ms |
| GLM-4.7-Flash | MINT | 16.7 GB | 54.4 | 54.2 | 54.5 | 2,440 ms |
| GLM-4.7-Flash | Delta | | −2.0% | −2.0% | −0.4% | +32.0% |

Generation overhead from mixed group sizes is 2–4%, with negligible prefill impact on Qwen3-30B. GLM-4.7-Flash shows higher TTFT overhead because a larger fraction of its tensors sit at non-default configurations. The throughput cost is modest relative to the quality gains.
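A median-of-3 harness for these two metrics is straightforward; the sketch below assumes a `generate` callable that yields tokens one at a time (a stand-in, not the mlx_lm API):

```python
import statistics
import time

def measure(generate, prompt_tokens, gen_len=256, runs=3):
    """Median-of-`runs` decode throughput (tokens/s) and TTFT (ms).

    TTFT is the time until the first token is yielded (prefill + first
    decode step); TPS is computed over the remaining gen_len - 1 tokens.
    """
    tps, ttft = [], []
    for _ in range(runs):
        t0 = time.perf_counter()
        it = generate(prompt_tokens, gen_len)
        next(it)                          # first token out -> TTFT
        t1 = time.perf_counter()
        for _ in it:                      # drain remaining tokens
            pass
        t2 = time.perf_counter()
        ttft.append((t1 - t0) * 1000)
        tps.append((gen_len - 1) / (t2 - t1))
    return statistics.median(tps), statistics.median(ttft)
```

Swapping in the real model's token generator reproduces the table's methodology: fixed prompt length, gen_len=256, median of 3 runs.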

9. MLX Kernel Benchmark: Group Size Performance

MINT assigns group size 32 to 85% of tensors. Does the MLX quantized matmul kernel handle this efficiently? We benchmarked across MLX versions.

| Scenario | MLX 0.29.3 | MLX 0.31.1 | Status |
|---|---|---|---|
| Generation g32/g128 | 1.07–1.14x penalty | 1.01–1.10x | Mostly fixed |
| Generation g64/g128 | ~1.05x | 1.00x | Fixed |
| Prefill g32/g128 | 1.8–2.2x | 1.00x | Fully fixed |
| Prefill g64/g128 | ~1.3x | 1.00x | Fixed |

Upstream fixes in MLX PRs #1861 and #2031 resolved the group_size performance regression. MINT’s preference for g32 is now viable at full speed.
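Why prefer g32 in the first place? Smaller groups let the quantization scale track local dynamic range, so a single outlier weight inflates the step size for 32 neighbours instead of 128. A numpy sketch with heavy-tailed synthetic weights (an illustration, not MINT's error model):

```python
import numpy as np

def group_quant_rmse(w, bits, group_size):
    """RMSE of symmetric per-group quantization at a given group size."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    w_hat = np.round(g / scale) * scale
    return float(np.sqrt(np.mean((g - w_hat) ** 2)))

rng = np.random.default_rng(0)
# Heavy-tailed weights: occasional outliers inflate the per-group scale.
w = rng.standard_t(df=3, size=(256, 512))
for gs in (32, 64, 128):
    print(gs, round(group_quant_rmse(w, bits=4, group_size=gs), 4))
```

On heavy-tailed weights the reconstruction error shrinks as the group size does, which is why full-speed g32 kernels matter for MINT.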

10. Key Findings

MINT consistently outperforms uniform quantization and calibration-based GPTQ across all tested models:

- Beats uniform 4-bit at matched or smaller size on every model tested, on both perplexity and downstream benchmarks.
- Beats calibration-based GPTQ at exactly matched sizes on all three MoE families, without any calibration data.
- Matches BF16 quality at roughly half the size: Qwen3.5-35B at 37 GB is within 0.2% of BF16 median PPL at 51% of the size.
- Downstream benchmarks saturate near 4-bit quality; median PPL separates methods more reliably than mean PPL on outlier-heavy models.
- The 9 dB SQNR safety veto prevents catastrophic 2-bit allocations; the quality cliff sits between 2-bit and 3-bit.
- Throughput cost is 2–4% on generation, with prefill largely unaffected on current MLX.

All evaluations conducted on Apple M2 Ultra (192 GB unified memory). Perplexity: WikiText-2 test split, seq_len=2048, 128 sequences, seed 42. Downstream benchmarks: lm-evaluation-harness via MLX backend. ARC-Challenge (25-shot), Winogrande (5-shot), HellaSwag (10-shot), MMLU (5-shot, 14,015 questions). MLX 0.31.1, mlx_lm 0.30.4, Python 3.12.0. Full paper: baa.ai/articles/24-mint-paper.html. Code: github.com/baa-ai/MINT.


Ready to see these results on your models?

Our team specialises in data-free model compression, budget-aware quantization, and production AI deployment on commodity hardware.

Talk to Our Team