Beyond Perplexity: Downstream Benchmarks Confirm MINT

March 2026 · Black Sheep AI Research

A perplexity improvement means nothing if it doesn’t translate to real tasks. We tested MINT on ARC-Challenge and Winogrande to find out.

The Gap Between Proxy and Reality

Perplexity is the universal metric for evaluating quantized models. It measures how well a model predicts the next token in a held-out corpus. But practitioners don’t deploy models to predict WikiText-2—they deploy them to answer questions, reason about problems, and follow instructions. The question that matters: do MINT’s perplexity gains translate to real downstream performance?

The Benchmarks

We evaluated MINT against uniform 4-bit quantization on two standard reasoning benchmarks using lm-evaluation-harness via the MLX backend:

- ARC-Challenge (25-shot, normalized accuracy): grade-school science questions that require multi-step reasoning.
- Winogrande (5-shot, accuracy): commonsense pronoun-resolution problems designed to resist surface-level cues.

Both benchmarks are standard in the LLM evaluation literature and test capabilities that matter for real deployments.

The Results

Model           Method         Size (GB)   ARC-C (acc_norm)   Winogrande (acc)
Qwen3-30B-A3B   MINT           16.3        69.5 ± 1.3         69.5 ± 1.3
Qwen3-30B-A3B   Uniform 4-bit  16.0        69.5 ± 1.3         70.3 ± 1.3
Mixtral-8x7B    MINT           24.5        70.5 ± 1.3         81.4 ± 1.1
Mixtral-8x7B    Uniform 4-bit  24.5        70.0 ± 1.3         80.1 ± 1.1

Mixtral: MINT Wins on Both Tasks

On Mixtral-8x7B, MINT outperforms uniform 4-bit on both benchmarks: +0.5 percentage points on ARC-Challenge and +1.3 percentage points on Winogrande, at exactly the same model size (24.5 GB). This is consistent with MINT’s large PPL advantage on this model (-4.6% median perplexity). The mixed-precision allocation that the MCKP solver produces—giving sensitive tensors finer group sizes and higher bit-widths—translates directly to better reasoning performance.
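
To make the allocation idea concrete, here is a toy sketch of the multiple-choice knapsack (MCKP) formulation: pick exactly one (bit-width, group-size) option per tensor to minimize total distortion under a size budget. All tensor names, sizes, and distortion numbers below are invented for illustration, and the brute-force `solve_mckp` is a hypothetical stand-in for the paper's actual solver.

```python
from itertools import product

# Hypothetical per-tensor quantization options: (label, size_gb, distortion).
# Numbers are illustrative only, not from the MINT paper.
choices = {
    "attn.q_proj": [("3-bit/g64", 0.9, 5.0), ("4-bit/g64", 1.2, 2.0), ("6-bit/g32", 1.8, 0.5)],
    "mlp.gate":    [("3-bit/g64", 1.5, 8.0), ("4-bit/g64", 2.0, 3.0), ("6-bit/g32", 3.0, 1.0)],
    "mlp.down":    [("3-bit/g64", 1.5, 4.0), ("4-bit/g64", 2.0, 1.5), ("6-bit/g32", 3.0, 0.4)],
}

def solve_mckp(choices, budget_gb):
    """Pick exactly one option per tensor, minimizing total distortion
    subject to a total-size budget (brute force; real solvers use DP or
    Lagrangian relaxation)."""
    best = None
    names = list(choices)
    for combo in product(*(choices[n] for n in names)):
        size = sum(c[1] for c in combo)
        if size > budget_gb:
            continue
        dist = sum(c[2] for c in combo)
        if best is None or dist < best[0]:
            best = (dist, size, dict(zip(names, (c[0] for c in combo))))
    return best

dist, size, alloc = solve_mckp(choices, budget_gb=5.9)
print(alloc, size, dist)
```

In this toy instance the solver spends the spare budget on the most sensitive tensor (`attn.q_proj` gets 6-bit with a finer group size) while leaving the rest at 4-bit, which is exactly the qualitative behaviour described above.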

Qwen3-30B: Statistically Indistinguishable

On Qwen3-30B-A3B, the two methods are statistically indistinguishable on both tasks. This is expected: MINT's perplexity advantage over uniform 4-bit is much narrower on this model (+2.3% PPL over BF16 for MINT versus +10.3% for uniform 4-bit, so MINT recovers 78% of the BF16-to-4-bit gap, but the remaining gap is small). When the perplexity difference is within a few percent, downstream task accuracy converges.
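
The "± 1.3" columns are binomial standard errors, and the indistinguishability claim can be sanity-checked by hand. A minimal sketch, assuming the commonly cited 1,267-item Winogrande eval split (an assumption about the harness's default split):

```python
import math

def acc_stderr(acc_pct, n):
    """Binomial standard error of an accuracy estimate, in percentage points."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# Reproduces roughly the +/- 1.3 reported in the table:
print(round(acc_stderr(69.5, 1267), 2))

# Two independent accuracies differ meaningfully only if the gap exceeds
# roughly twice the combined standard error:
gap = 70.3 - 69.5                   # Qwen3 Winogrande, uniform vs MINT
combined_se = math.hypot(1.3, 1.3)  # combined error of the two estimates
print(gap < 2 * combined_se)        # the 0.8 pp gap is well inside noise
```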

The key observation: MINT never sacrifices downstream accuracy for its perplexity gains. The mixed-precision allocation does not create pathological behaviour on reasoning tasks—it either matches or improves upon uniform quantization.

Why Perplexity Remains a Good Proxy

These results support perplexity as a reliable proxy for task performance in this quantization regime. When MINT achieves 4.6% lower PPL (Mixtral), it also achieves better task scores; when the PPL gap is narrow (Qwen3-30B), task scores are indistinguishable. Across our experiments, the mapping from perplexity to downstream accuracy was monotonic and predictable.

This is important because perplexity evaluation is fast (minutes) while full benchmark evaluation is slow (hours). Practitioners can use perplexity as a reliable filter during development and run downstream benchmarks as final validation.
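
For readers less familiar with the metric, the quantities above reduce to two one-line formulas. A minimal sketch (the numbers passed in are illustrative, not measurements from the paper):

```python
import math

def perplexity(token_nlls):
    """Perplexity is exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_change_pct(ppl_new, ppl_ref):
    """Relative PPL change; negative means the new model predicts better."""
    return 100.0 * (ppl_new - ppl_ref) / ppl_ref

# A model that assigns every token probability 1/5 has perplexity 5:
print(perplexity([math.log(5)] * 4))   # ~ 5.0

# Illustrative values only:
print(ppl_change_pct(4.77, 5.00))      # ~ -4.6, i.e. 4.6% lower PPL
```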

What This Means for Deployment

If you are deploying a quantized model and your selection criterion is “best task accuracy at a given memory budget,” MINT matches or beats uniform quantization on every test we ran. The mixed-precision allocation that optimizes perplexity also optimizes—or at minimum preserves—the capabilities that matter for real applications.


Data from the MINT paper: ‘MINT: Budget-Aware Data-Free Mixed-Precision Quantization for Large Language Models via Rate-Distortion Optimization’ (baa.ai, 2026). Benchmarks evaluated using lm-evaluation-harness via MLX backend. ARC-Challenge: 25-shot, normalized accuracy. Winogrande: 5-shot, standard accuracy. Full paper at baa.ai/articles/24-mint-paper.html. Code at github.com/baa-ai/MINT.

