RAM outperforms GPTQ at matched model sizes across three MoE families, without ever seeing a single calibration sample. The calibration paradigm has a problem.
The Conventional Wisdom
GPTQ is the gold standard for post-training quantization. Published at ICLR 2023, it uses a calibration dataset and Hessian-based optimization to make intelligent compression decisions. The algorithm processes each layer sequentially, using second-order information from calibration data to determine how to quantize each weight column while compensating for the error introduced to subsequent columns.
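The column-by-column loop can be sketched in a few lines. This is a simplified toy, not the reference implementation: it drops grouping and lazy batching, and uses the raw inverse Hessian where the published algorithm uses its Cholesky factorization. The function name and arguments are ours.

```python
import numpy as np

def gptq_quantize(W, X, bits=4, damp=0.01):
    """Toy GPTQ-style quantization with error compensation.

    W: (rows, cols) weight matrix; X: (n_samples, cols) calibration inputs.
    Quantize each weight column in turn, then spread the induced error
    onto the not-yet-quantized columns via the inverse Hessian.
    """
    cols = W.shape[1]
    H = X.T @ X / len(X)                             # second-order proxy from calibration data
    H += damp * np.mean(np.diag(H)) * np.eye(cols)   # damping for numerical stability
    Hinv = np.linalg.inv(H)
    Wq = W.astype(float).copy()
    for j in range(cols):
        col = Wq[:, j]
        # symmetric round-to-nearest grid for this column
        scale = np.max(np.abs(col)) / (2 ** (bits - 1) - 1) + 1e-12
        q = np.clip(np.round(col / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
        err = (col - q) / Hinv[j, j]
        Wq[:, j] = q
        # compensate the remaining columns for this column's quantization error
        Wq[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Wq
```

The key point for the discussion that follows: `X` comes from a calibration set, so every compensation step is shaped by whatever data happened to be in it.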
The assumption has always been straightforward: more information leads to better decisions. A method that sees calibration data should always outperform one that does not. A data-free method is, in theory, flying blind, making compression decisions without knowing how the model will actually be used.
Our results challenge this assumption directly.
The Head-to-Head Results
We compared RAM against GPTQ at matched model sizes across three Mixture-of-Experts model families. For each comparison, we ensured the compressed model sizes were as close as possible to provide a fair apples-to-apples evaluation.
| Model | Method | Size (GB) | Mean PPL | Median PPL | Δ PPL |
|---|---|---|---|---|---|
| Qwen3-30B | GPTQ | 16.0 | 9.122 | 9.160 | - |
| Qwen3-30B | RAM | 16.1 | 8.970 | 9.020 | −1.7% |
| Qwen2-57B | GPTQ | 29.9 | 6.390 | 6.396 | - |
| Qwen2-57B | RAM | 29.9 | 6.329 | 6.356 | −0.95% |
| Mixtral-8x7B | GPTQ | 87.0† | 4.608 | 4.640 | - |
| Mixtral-8x7B | RAM | 24.5 | 4.264 | 4.266 | −4.6% |
At matched model sizes, RAM beats GPTQ on every MoE model tested. The margin ranges from −0.95% to −4.6% perplexity improvement. The Mixtral result is particularly striking: RAM achieves better perplexity at less than a third of the model size.
A data-free method outperforming the calibration-based gold standard. On every model. At matched sizes.
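For concreteness, the Δ PPL column reads as the relative change in mean perplexity against the GPTQ baseline; a quick check against the Qwen3-30B row (the helper name is ours):

```python
def ppl_delta(baseline, candidate):
    """Relative perplexity change of candidate vs. baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline

# Qwen3-30B from the table: GPTQ mean PPL 9.122 vs RAM 8.970
print(round(ppl_delta(9.122, 8.970), 2))  # → -1.67, i.e. about -1.7%
```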
Why Does This Work?
RAM's proprietary optimization approach enables it to make more informed compression decisions at the per-tensor level than calibration-based methods. Rather than relying on external data to guide quantization, RAM operates directly on the model weights themselves, allowing it to find configurations that calibration-dependent pipelines systematically miss.
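RAM's actual allocator is proprietary, but the general shape of a data-free, budget-constrained per-tensor bit allocation (the "(bits, group size) optimization" mentioned in the footnote) can be sketched generically: score each tensor's quantization error at each candidate bit-width using only the weights, then greedily spend the size budget where it reduces error most. Everything below, including round-to-nearest scoring and the greedy rule, is our illustration, not RAM's method.

```python
import numpy as np

def rtn_mse(w, bits, group=64):
    """Weight-space MSE of round-to-nearest quantization at a given bit-width.
    Computed from the weights alone -- no calibration data involved."""
    g = w.reshape(-1, group)
    scale = np.max(np.abs(g), axis=1, keepdims=True) / (2 ** (bits - 1) - 1) + 1e-12
    q = np.round(g / scale) * scale
    return float(np.mean((g - q) ** 2))

def allocate_bits(tensors, avg_bits=4.0, choices=(2, 3, 4, 8)):
    """Greedy budget-constrained allocation (illustrative sketch):
    start every tensor at the cheapest bit-width, then repeatedly upgrade
    the tensor whose next step buys the largest MSE reduction per extra
    bit, as long as the average bit-width stays within budget."""
    alloc = {name: choices[0] for name in tensors}
    total = sum(w.size for w in tensors.values())
    used = lambda: sum(alloc[n] * w.size for n, w in tensors.items()) / total
    while True:
        best = None
        for name, w in tensors.items():
            i = choices.index(alloc[name])
            if i + 1 == len(choices):
                continue                                  # already at max precision
            nb = choices[i + 1]
            extra = (nb - alloc[name]) * w.size / total   # budget cost of the upgrade
            if used() + extra > avg_bits:
                continue                                  # upgrade would bust the budget
            gain = (rtn_mse(w, alloc[name]) - rtn_mse(w, nb)) / extra
            if best is None or gain > best[0]:
                best = (gain, name, nb)
        if best is None:
            break
        alloc[best[1]] = best[2]
    return alloc
```

Because the error model is computed from the weights themselves, sensitive tensors get more bits regardless of which inputs the deployed model will see.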
The advantage is particularly pronounced on Mixture-of-Experts architectures: the gap between RAM and GPTQ is largest on Mixtral (−4.6%). As MoE models grow more popular (Qwen3, DeepSeek-V3, and Llama-4 are all MoE), RAM's architecture-aware optimization becomes increasingly valuable. The trend in model design is toward more experts with sparser activation, exactly the conditions under which a data-free approach delivers the largest gains.
What This Means for Practitioners
If you are using GPTQ or AWQ for MoE models today, you may be paying for calibration infrastructure that actively hurts model quality. The GPU time, the calibration data curation, the iteration on calibration hyperparameters: all of that cost may be buying worse results than a method that uses none of it.
The recommendation is nuanced. Calibration-based methods remain excellent for dense models, where the calibration coverage problem is less severe: every weight participates in every forward pass, so any reasonable calibration set provides information about all parameters. But for MoE models, data-free methods deserve serious consideration.
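The coverage argument can be made concrete with a toy simulation: under sparse top-k routing, a small calibration set simply never activates some experts, leaving a calibration-based method with no signal for their weights. This assumes uniform random routing purely for illustration; real routers are learned and typically skewed, which tends to make coverage worse, not better.

```python
import numpy as np

def expert_coverage(n_experts, top_k, n_tokens, seed=0):
    """Fraction of experts that receive at least one token under random
    top-k routing -- a toy model of MoE calibration coverage."""
    rng = np.random.default_rng(seed)
    hit = np.zeros(n_experts, dtype=bool)
    for _ in range(n_tokens):
        # each token is routed to top_k distinct experts
        hit[rng.choice(n_experts, size=top_k, replace=False)] = True
    return hit.mean()

# A dense layer touches every weight on every token: coverage is always 1.0.
# A sparse MoE layer with a small calibration set leaves experts unseen:
print(expert_coverage(n_experts=64, top_k=2, n_tokens=16))  # well below 1.0
```

With 16 calibration tokens and top-2 routing over 64 experts, at most 32 experts can be touched at all, so more than half the layer's weights are quantized with zero calibration signal.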
The cost savings are a bonus on top of superior quality. No GPU required for quantization. No calibration dataset to curate and validate. No iteration on calibration hyperparameters. No risk of distribution mismatch between calibration and deployment. The entire quantization process runs on CPU in under an hour.
The Broader Implication
This result challenges a fundamental assumption in the quantization literature: that more information always leads to better decisions. The assumption seems self-evidently true: how could knowing less produce better results?
The answer is that calibration data is not free information. It introduces constraints and assumptions that may not hold across all deployment scenarios. As models grow more heterogeneous, with more experts and more architectural variety, these constraints become more costly. RAM's proprietary optimization approach avoids these constraints entirely, producing better results by working directly with the model's own structure.
Sometimes the best information is no information at all, at least when the alternative is constrained information that limits your optimization space.
GPTQ baselines from published 4-bit quantization with group size 128, calibration on C4 dataset (128 samples, 2048 sequence length). RAM results from budget-constrained quality-optimization allocation with per-tensor (bits, group size) optimization. All perplexity evaluations on WikiText-2 test split. †Mixtral GPTQ size reflects standard 4-bit g128 quantization of the full model. The full RAM pipeline is open source at github.com/baa-ai/RAM.
Read the Full Paper
The complete RAM paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full optimal allocation framework, is available on our Hugging Face:
RAM: Compute-Optimal Proprietary Compression for LLMs, Full Paper
huggingface.co/spaces/baa-ai/RAM

Licensed under CC BY-NC-ND 4.0