When Data-Free Beats the Gold Standard
RAM Research

March 2026 · baa.ai

RAM outperforms GPTQ at matched model sizes across three MoE families, without ever seeing a single calibration sample. The calibration paradigm has a problem.

The Conventional Wisdom

GPTQ is the gold standard for post-training quantization. Published at ICLR 2023, it uses a calibration dataset and Hessian-based optimization to make intelligent compression decisions. The algorithm processes each layer sequentially, using second-order information from calibration data to determine how to quantize each weight column while compensating for the error introduced to subsequent columns.
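The sequential procedure is straightforward to sketch. The toy below omits GPTQ's Cholesky-based inverse updates, lazy batched updates, and per-group scales, keeping only the core quantize-then-compensate loop; it is a simplified illustration, not GPTQ's actual implementation.

```python
import numpy as np

def quantize_column(col, scale):
    """Round-to-nearest uniform quantization of one weight column."""
    return np.round(col / scale) * scale

def gptq_like(W, X, scale=0.1, damp=0.01):
    """Toy sketch of GPTQ-style sequential column quantization.

    W: (out_features, in_features) weights; X: (samples, in_features)
    calibration activations. Each column is quantized in turn, and the
    induced error is pushed onto the not-yet-quantized columns via the
    inverse Hessian, in the spirit of the Optimal Brain family of
    methods. Real GPTQ maintains the inverse with Cholesky updates;
    this keeps only the core idea.
    """
    W = W.astype(np.float64).copy()
    n = W.shape[1]
    H = X.T @ X + damp * np.eye(n)   # Hessian proxy from calibration data
    Hinv = np.linalg.inv(H)
    for j in range(n):
        q = quantize_column(W[:, j], scale)
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j] = q
        # compensate the remaining columns for this column's error
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return W
```

The key point for what follows: the quality of each compensation step depends entirely on how well `X.T @ X` estimates the true second-order statistics of the layer's inputs at deployment time.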

The assumption has always been straightforward: more information leads to better decisions. A method that sees calibration data should always outperform one that does not. A data-free method is, in theory, flying blind, making compression decisions without knowing how the model will actually be used.

Our results challenge this assumption directly.

The Head-to-Head Results

We compared RAM against GPTQ at matched model sizes across three Mixture-of-Experts model families. For each comparison, we ensured the compressed model sizes were as close as possible to provide a fair apples-to-apples evaluation.

Model          Method   Size (GB)   Mean PPL   Median PPL   Δ PPL
Qwen3-30B      GPTQ     16.0        9.122      9.160        -
Qwen3-30B      RAM      16.1        8.970      9.020        −1.7%
Qwen2-57B      GPTQ     29.9        6.390      6.396        -
Qwen2-57B      RAM      29.9        6.329      6.356        −0.95%
Mixtral-8x7B   GPTQ     87.0†       4.608      4.640        -
Mixtral-8x7B   RAM      24.5        4.264      4.266        −4.6%

At matched model sizes, RAM beats GPTQ on every MoE model tested, with perplexity improvements ranging from −0.95% to −4.6%. The Mixtral result is particularly striking: RAM achieves better perplexity at less than a third of the model size.

A data-free method outperforming the calibration-based gold standard. On every model. At matched sizes.

Why Does This Work?

RAM's proprietary optimization approach enables it to make more informed compression decisions at the per-tensor level than calibration-based methods. Rather than relying on external data to guide quantization, RAM operates directly on the model weights themselves, allowing it to find configurations that calibration-dependent pipelines systematically miss.
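The shape of this per-tensor optimization problem can be illustrated with a deliberately generic stand-in: a greedy allocator that assigns each tensor a bit-width under a global size budget, scoring candidate upgrades with a data-free round-to-nearest proxy error computed from the weights alone. Everything below (the proxy error, the greedy rule, the bit-width menu) is a hypothetical sketch, not RAM's actual algorithm, which is proprietary.

```python
import numpy as np

def rtn_error(w, bits):
    """Data-free proxy: round-to-nearest uniform quantization MSE."""
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((w - lo) / scale) * scale + lo
    return float(np.mean((w - q) ** 2))

def allocate(tensors, budget_bits_per_weight, choices=(2, 3, 4, 8)):
    """Greedy budget-constrained allocation (illustrative only).

    Start every tensor at the lowest width, then repeatedly spend the
    remaining budget on the single upgrade with the best proxy-error
    reduction per bit of added size.
    """
    bits = {name: choices[0] for name in tensors}
    sizes = {name: w.size for name, w in tensors.items()}
    budget = budget_bits_per_weight * sum(sizes.values())
    used = sum(bits[n] * sizes[n] for n in tensors)
    while True:
        best = None
        for name, w in tensors.items():
            ups = [b for b in choices if b > bits[name]]
            if not ups:
                continue
            b = ups[0]                       # next width up
            cost = (b - bits[name]) * sizes[name]
            if used + cost > budget:
                continue
            gain = rtn_error(w, bits[name]) - rtn_error(w, b)
            if best is None or gain / cost > best[0]:
                best = (gain / cost, name, b, cost)
        if best is None:
            break
        _, name, b, cost = best
        bits[name] = b
        used += cost
    return bits
```

Note that nothing in this loop looks at activations: the "information" being exploited is the structure of the weights themselves, which is exactly the regime a data-free method operates in.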

The advantage is particularly pronounced on Mixture-of-Experts architectures. The gap between RAM and GPTQ is largest on Mixtral (−4.6%), the model with the most experts per layer. As MoE models grow more popular (Qwen3, DeepSeek-V3, Llama-4 are all MoE), RAM's architecture-aware optimization becomes increasingly valuable. The trend in model design is toward more experts with sparser activation, exactly the conditions where RAM's proprietary approach delivers the largest gains.

What This Means for Practitioners

If you are using GPTQ or AWQ today for MoE models, you may be paying for calibration infrastructure that is actually hurting your model quality. The GPU time, the calibration data curation, the iteration on calibration hyperparameters: all of this cost may be producing worse results than a method that uses none of it.

The recommendation is nuanced. Calibration-based methods remain excellent for dense models, where the calibration coverage problem is less severe: every weight participates in every forward pass, so any calibration set provides information about all parameters. But for MoE models, data-free methods deserve serious consideration.
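The coverage gap is easy to see in a toy simulation. With top-k routing, even a perfectly balanced router sends each expert only k/N of the calibration tokens, and trained routers are often imbalanced. The skewed random gate below is a hypothetical stand-in for a trained router; the numbers are illustrative only, not measured from any real model.

```python
import numpy as np

# In a dense layer, every weight sees every calibration token.
# In an MoE layer with top-k routing, each expert sees only the
# tokens the router sends to it.
rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 64
n_tokens = 128 * 256             # e.g. 128 calibration samples of 256 tokens

gate = rng.standard_normal((d, n_experts))
bias = 2.0 * rng.standard_normal(n_experts)   # router imbalance (illustrative)
tokens = rng.standard_normal((n_tokens, d))
logits = tokens @ gate + bias

# top-k routing: each token goes to its k highest-scoring experts
chosen = np.argsort(logits, axis=1)[:, -top_k:]
counts = np.bincount(chosen.ravel(), minlength=n_experts)

share = counts / n_tokens
print("fraction of calibration tokens seen per expert:", np.round(share, 3))
# A perfectly balanced router would give each expert top_k / n_experts = 25%
# of tokens; imbalance pushes the worst-covered experts well below that.
```

The Hessian proxy a calibration-based method builds for a poorly covered expert is estimated from a small, skewed slice of the data, which is one plausible mechanism for the MoE results above.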

The cost savings are a bonus on top of superior quality. No GPU required for quantization. No calibration dataset to curate and validate. No iteration on calibration hyperparameters. No risk of distribution mismatch between calibration and deployment. The entire quantization process runs on CPU in under an hour.

The Broader Implication

This result challenges a fundamental assumption in the quantization literature: that more information always leads to better decisions. The assumption seems self-evidently true: how could knowing less produce better results?

The answer is that calibration data is not free information. It introduces constraints and assumptions that may not hold across all deployment scenarios. As models grow more heterogeneous, with more experts and more architectural variety, these constraints become more costly. RAM's proprietary optimization approach avoids these constraints entirely, producing better results by working directly with the model's own structure.

Sometimes the best information is no information at all, at least when the alternative is constrained information that limits your optimization space.


GPTQ baselines from published 4-bit quantization with group size 128, calibration on C4 dataset (128 samples, 2048 sequence length). RAM results from budget-constrained quality-optimization allocation with per-tensor (bits, group size) optimization. All perplexity evaluations on WikiText-2 test split. †Mixtral GPTQ size reflects standard 4-bit g128 quantization of the full model. The full RAM pipeline is open source at github.com/baa-ai/RAM.

Read the Full Paper

The complete RAM paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full optimal allocation framework, is available on our HuggingFace:

RAM: Compute-Optimal Proprietary Compression for LLMs (Full Paper)

huggingface.co/spaces/baa-ai/RAM

Licensed under CC BY-NC-ND 4.0
