When Data-Free Beats the Gold Standard
MINT Research

March 2026 · baa.ai

MINT outperforms GPTQ at matched model sizes across three MoE families—without ever seeing a single calibration sample. The calibration paradigm has a problem.

The Conventional Wisdom

GPTQ is the gold standard for post-training quantization. Published at ICLR 2023, it uses a calibration dataset and Hessian-based optimization to make intelligent compression decisions. The algorithm processes each layer sequentially, using second-order information from calibration data to determine how to quantize each weight column while compensating for the error introduced to subsequent columns.
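The core mechanism can be sketched in a few lines. This is a deliberately simplified toy version of the column-by-column error-compensation idea, not the full GPTQ implementation (which uses Cholesky factors, lazy batched updates, and careful dampening); the function names here are our own.

```python
import numpy as np

def quantize_rtn(x, bits=4):
    """Symmetric round-to-nearest quantization of a vector."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(x).max()
    scale = amax / qmax if amax > 0 else 1.0
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale

def gptq_like(W, H, bits=4):
    """Toy GPTQ-style sequential quantization.

    W: (rows, cols) weight matrix.
    H: (cols, cols) Hessian proxy, ~ X @ X.T from calibration activations.
    Quantize column j, then spread its rounding error onto the not-yet-
    quantized columns via the inverse Hessian, so later columns compensate.
    """
    W = W.astype(np.float64).copy()
    Hinv = np.linalg.inv(H + 1e-2 * np.eye(H.shape[0]))  # damped for stability
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quantize_rtn(W[:, j], bits)
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # Error compensation: adjust the remaining columns so they can
        # absorb the damage done by quantizing column j.
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q
```

The key point for what follows: `H` is computed from calibration activations, so every compensation step is only as good as the input distribution that produced it.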

The assumption has always been straightforward: more information leads to better decisions. A method that sees calibration data should always outperform one that does not. A data-free method is, in theory, flying blind—making compression decisions without knowing how the model will actually be used.

Our results challenge this assumption directly.

The Head-to-Head Results

We compared MINT against GPTQ at matched model sizes across three Mixture-of-Experts model families. For each comparison, we matched the compressed model sizes as closely as possible, so that any quality difference reflects the method rather than the size budget.

Model         Method  Size (GB)  Mean PPL  Median PPL  Δ PPL
Qwen3-30B     GPTQ    16.0       9.122     9.160       —
Qwen3-30B     MINT    16.1       8.970     9.020       −1.7%
Qwen2-57B     GPTQ    29.9       6.390     6.396       —
Qwen2-57B     MINT    29.9       6.329     6.356       −0.95%
Mixtral-8x7B  GPTQ    87.0†      4.608     4.640       —
Mixtral-8x7B  MINT    24.5       4.264     4.266       −4.6%

At matched model sizes, MINT beats GPTQ on every single MoE model tested. The margin ranges from −0.95% to −4.6% perplexity improvement. The Mixtral result is particularly striking—MINT achieves better perplexity at less than a third of the model size.

A data-free method outperforming the calibration-based gold standard. On every model. At matched sizes.

How Is This Possible?

Three factors explain why a data-free method can outperform a calibration-based one:

GPTQ uses fixed group sizes. GPTQ typically quantizes with group size 128 across all tensors. As we have shown in our group size research, 85% of tensors benefit from group size 32. GPTQ’s Hessian-based optimization is sophisticated, but it operates within a constrained configuration space that misses the single largest quality lever available.
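The effect of group size is easy to demonstrate on synthetic data. The sketch below (our own illustration, not MINT code) quantizes a weight vector containing a few large outliers, as commonly seen in LLM tensors: with group size 128, each outlier inflates the quantization scale for 128 neighboring weights; with group size 32, the damage is confined to 32.

```python
import numpy as np

def groupwise_quant_error(w, bits=4, group=128):
    """MSE of symmetric round-to-nearest quantization with one scale
    per `group` consecutive weights."""
    qmax = 2 ** (bits - 1) - 1
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.round(w / scale).clip(-qmax - 1, qmax) * scale
    return float(((w - q) ** 2).mean())

rng = np.random.default_rng(0)
# Synthetic weight row with a handful of large outliers
w = rng.normal(0, 0.02, size=4096)
w[rng.choice(4096, 8, replace=False)] *= 50  # outliers inflate group scales

err_128 = groupwise_quant_error(w, group=128)
err_32 = groupwise_quant_error(w, group=32)
print("g128:", err_128, "g32:", err_32)  # g32 isolates outliers -> lower error
```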

Calibration introduces distribution bias. GPTQ’s calibration-derived Hessian represents one input distribution. For general-purpose language models, any single calibration set is necessarily a narrow sample of all possible inputs. For MoE models especially, this is problematic: a single calibration sequence only activates a fraction of the experts in any given layer. Experts that are inactive during calibration receive quantization decisions based on incomplete information.

MINT captures the actual error surface. Instead of estimating quantization sensitivity from a proxy (calibration activations), MINT computes the actual quantization error for each tensor across every candidate configuration in the (bits, group size) space. The rate-distortion curves capture what actually happens when you quantize each tensor—not an estimate of what might happen based on one input distribution.
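A minimal sketch of this idea, under our own simplifying assumptions (round-to-nearest quantization, MSE as the distortion metric, fp16 scales; function names are illustrative, not MINT's API): enumerate every (bits, group size) candidate for a tensor and record its exact size and error, with no input data involved.

```python
import numpy as np

def quant_error(w, bits, group):
    """Exact (not estimated) MSE of group-wise RTN quantization."""
    qmax = 2 ** (bits - 1) - 1
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.round(w / scale).clip(-qmax - 1, qmax) * scale
    return float(((w - q) ** 2).mean())

def rate_distortion_curve(w, bit_opts=(2, 3, 4, 8), group_opts=(32, 64, 128)):
    """One (config, bits-per-weight, error) point per candidate.
    Size counts the quantized payload plus one fp16 scale per group."""
    points = []
    for b in bit_opts:
        for g in group_opts:
            bits_per_weight = b + 16 / g  # payload + amortized scale cost
            points.append(((b, g), bits_per_weight, quant_error(w, b, g)))
    return points

rng = np.random.default_rng(1)
curve = rate_distortion_curve(rng.normal(0, 0.02, size=8192))
for cfg, bpw, err in sorted(curve, key=lambda p: p[1]):
    print(cfg, round(bpw, 2), err)
```

Given such a curve per tensor, a budget-constrained allocator can then pick one configuration per tensor to minimize total error under a global size budget; the sketch above covers only the per-tensor measurement step.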

The MoE Effect

The gap between MINT and GPTQ is largest on Mixtral (−4.6%). This is not a coincidence: MoE architectures amplify the calibration coverage problem.

When you have 8 experts per layer, any given input token activates only 2 of them (in Mixtral’s top-2 routing). A calibration set of 128 sequences will activate different subsets of experts in different proportions. Some experts may be heavily activated, receiving accurate Hessian estimates. Others may be rarely activated, receiving noisy or biased estimates. The quantization decisions for under-activated experts are based on insufficient information—yet those experts are just as important during deployment when inputs that activate them arrive.
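A toy simulation makes the coverage skew concrete. We assume a fixed (slightly skewed) routing distribution and top-2 sampling without replacement via the Gumbel-top-k trick; real routers are learned and input-dependent, often more skewed than this, which makes coverage worse.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k = 8, 2
n_seqs, tokens_per_seq = 128, 2048
n_tokens = n_seqs * tokens_per_seq

# Skewed routing probabilities: learned routers rarely use experts uniformly
logits = rng.normal(0, 1.5, size=n_experts)
probs = np.exp(logits) / np.exp(logits).sum()

# Gumbel-top-k: sample top_k distinct experts per token, proportional to probs
gumbel = rng.gumbel(size=(n_tokens, n_experts))
topk = np.argsort(np.log(probs) + gumbel, axis=1)[:, -top_k:]
counts = np.bincount(topk.ravel(), minlength=n_experts)

share = counts / counts.sum()
print("most activated share:", share.max())
print("least activated share:", share.min())
# The least-used expert's Hessian estimate rests on far fewer tokens
# than the most-used expert's -- yet both get quantized.
```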

MINT sidesteps this problem entirely. Because it analyzes weight tensors directly—without any input data—every expert receives equal treatment. The quantization decisions for a rarely-activated expert are just as well-informed as those for a frequently-activated one.

As MoE models grow more popular (Qwen3, DeepSeek-V3, Llama-4 are all MoE), this advantage will become increasingly important. The trend in model architecture is toward more experts with sparser activation—exactly the conditions that make calibration coverage worse.

What This Means for Practitioners

If you are using GPTQ or AWQ today for MoE models, you may be paying for calibration infrastructure that is actually hurting your model quality. The GPU time, the calibration data curation, the iteration on calibration hyperparameters—all of this cost may be producing worse results than a method that uses none of it.

The recommendation is nuanced. Calibration-based methods remain excellent for dense models, where the calibration coverage problem is less severe—every weight participates in every forward pass, so any calibration set provides information about all parameters. But for MoE models, data-free methods deserve serious consideration.

The cost savings are a bonus on top of superior quality. No GPU required for quantization. No calibration dataset to curate and validate. No iteration on calibration hyperparameters. No risk of distribution mismatch between calibration and deployment. The entire quantization process runs on CPU in under an hour.

The Broader Implication

This result challenges a fundamental assumption in the quantization literature: that more information always leads to better decisions. The assumption seems self-evidently true—how could knowing less produce better results?

The answer is that calibration data is not free information. It comes with bias. A calibration set is a sample from a particular distribution, and the Hessian computed from that sample reflects that distribution’s characteristics. When the deployment distribution differs—or when the model architecture means that different parts of the model see different distributions (as in MoE)—the bias introduced by the narrow calibration set can outweigh the benefit of having calibration data at all.

As models grow more heterogeneous—more experts, more architectural variety, more specialized sub-networks—the calibration coverage problem will only get worse. A single calibration set cannot represent all the ways a complex model will be used. The data-free approach, by not introducing any distributional bias, avoids a problem that will become increasingly severe as model architectures continue to evolve.

Sometimes, the best information is no information at all—when the alternative is biased information that leads you astray.


GPTQ baselines from published 4-bit quantization with group size 128, calibration on C4 dataset (128 samples, 2048 sequence length). MINT results from budget-constrained rate-distortion allocation with per-tensor (bits, group size) optimization. All perplexity evaluations on WikiText-2 test split. †Mixtral GPTQ size reflects standard 4-bit g128 quantization of the full model. The full MINT pipeline is open source at github.com/baa-ai/MINT.

