RAM outperforms GPTQ at matched model sizes across three MoE families, without ever seeing a single calibration sample. The calibration paradigm has a problem.
The Conventional Wisdom
GPTQ is the gold standard for post-training quantization. Published at ICLR 2023, it uses a calibration dataset and Hessian-based optimization to make intelligent compression decisions. The algorithm processes each layer sequentially, using second-order information from calibration data to quantize the weights one column at a time and to update the not-yet-quantized columns so that they compensate for the error each quantized column introduces.
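The column-by-column procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference GPTQ implementation: it omits blocking, grouping and activation ordering, fixes one scale per output row up front, and the function names, the damping default and the random stand-in data are assumptions made for the example.

```python
import numpy as np

def quantize_rtn(w, scale, bits=4):
    """Round to the nearest point on a symmetric integer grid."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_like_quantize(W, X, bits=4, damp=0.01):
    """Quantize W (out x in) one column at a time with error compensation.

    X is the (in x n_samples) calibration input to the layer.
    Returns the quantized weights and the per-row scales that were used.
    """
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax             # one scale per output row
    H = 2.0 * X @ X.T                                # Hessian of the layer-wise squared error
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)   # damping for numerical stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T       # upper Cholesky factor of H^-1
    Q = np.zeros_like(W)
    for i in range(n_in):
        q = quantize_rtn(W[:, i], scale, bits)
        Q[:, i] = q
        err = (W[:, i] - q) / U[i, i]
        # Push the quantization error of column i onto the remaining columns.
        W[:, i:] -= np.outer(err, U[i, i:])
    return Q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
# Correlated input features, so that compensation has something to exploit.
X = rng.normal(size=(16, 16)) @ rng.normal(size=(16, 256))

Q, scale = gptq_like_quantize(W, X)
Q_rtn = quantize_rtn(W, scale[:, None])

print("layer output error, round-to-nearest:", np.linalg.norm(W @ X - Q_rtn @ X))
print("layer output error, GPTQ-like:       ", np.linalg.norm(W @ X - Q @ X))
```

The only place calibration data enters is the matrix `H`, which is why the quality of the result depends on how well `X` represents the inputs the layer sees in deployment.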
The assumption has always been straightforward: more information means better decisions. A method that sees calibration data should always beat one that doesn't. A data-free method is, in theory, flying blind, making compression decisions with no idea how the model will actually be used.
Our results say otherwise.
The Head-to-Head Results
We compared RAM against GPTQ at matched model sizes across three Mixture-of-Experts model families. For each comparison, we made sure the compressed model sizes were as close as possible for a fair apples-to-apples evaluation.
| Model | Method | Size (GB) | Mean PPL | Median PPL | Δ PPL |
|---|---|---|---|---|---|
| Qwen3-30B | GPTQ | 16.0 | 9.122 | 9.160 | - |
| Qwen3-30B | RAM | 16.1 | 8.970 | 9.020 | −1.7% |
| Qwen2-57B | GPTQ | 29.9 | 6.390 | 6.396 | - |
| Qwen2-57B | RAM | 29.9 | 6.329 | 6.356 | −0.95% |
| Mixtral-8x7B | GPTQ | 87.0† | 4.608 | 4.640 | - |
| Mixtral-8x7B | RAM | 24.5 | 4.264 | 4.266 | −4.6% |
RAM beats GPTQ on all three MoE models we tested, with perplexity lower by 0.95% to 4.6%. Sizes are matched to within 0.1 GB for the two Qwen models. The Mixtral result stands out: RAM achieves better perplexity at less than a third of the listed GPTQ model size (24.5 GB against 87.0 GB).
A data-free method beating the calibration-based gold standard. On every model. At matched or smaller sizes.
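For readers who want to check the arithmetic, here is a minimal sketch of how a Δ PPL figure can be derived. It assumes Δ PPL is the relative change in mean perplexity against the GPTQ baseline, which the post does not state explicitly, and it is shown for the two size-matched Qwen rows, where it reproduces the −1.7% and −0.95% in the table after rounding. The function name is hypothetical.

```python
def relative_change_pct(baseline_ppl, new_ppl):
    """Percentage change in perplexity relative to the baseline (negative is better)."""
    return (new_ppl - baseline_ppl) / baseline_ppl * 100.0

# (GPTQ mean PPL, RAM mean PPL) taken from the table above.
rows = {
    "Qwen3-30B": (9.122, 8.970),
    "Qwen2-57B": (6.390, 6.329),
}

for model, (gptq_ppl, ram_ppl) in rows.items():
    print(f"{model}: {relative_change_pct(gptq_ppl, ram_ppl):+.2f}%")
```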
Why Does This Work?
RAM's proprietary optimization approach lets it make more informed compression decisions at the per-tensor level than calibration-based methods can. Instead of relying on external data to guide quantization, RAM works directly on the model weights themselves. That lets it find configurations that calibration-dependent pipelines systematically miss.
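RAM's optimization is proprietary and is not described in this post, so the following is not RAM's algorithm. As a generic illustration of what a data-free, per-tensor comparison can look like, this sketch scores a few (bits, group size) settings for one tensor using only the weights: round-to-nearest reconstruction error on one side, storage cost on the other. The function names, the symmetric grid, the 16-bit scale per group and the random stand-in tensor are all assumptions made for the example.

```python
import random

def groupwise_rtn_mse(weights, bits, group_size):
    """Mean squared error of symmetric round-to-nearest with one scale per group."""
    qmax = 2 ** (bits - 1) - 1
    squared_error = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        peak = max(abs(w) for w in group)
        scale = peak / qmax if peak > 0 else 1.0
        for w in group:
            level = max(-qmax - 1, min(qmax, round(w / scale)))
            squared_error += (w - level * scale) ** 2
    return squared_error / len(weights)

def bits_per_weight(bits, group_size, scale_bits=16):
    """Storage cost per weight, counting one scale per group (assumed layout)."""
    return bits + scale_bits / group_size

random.seed(0)
tensor = [random.gauss(0.0, 0.02) for _ in range(4096)]  # stand-in for real weights

candidates = [(3, 64), (4, 128), (4, 32), (8, 128)]
for bits, group_size in candidates:
    mse = groupwise_rtn_mse(tensor, bits, group_size)
    bpw = bits_per_weight(bits, group_size)
    print(f"bits={bits} group={group_size:>3} bpw={bpw:.3f} mse={mse:.3e}")
```

No calibration input appears anywhere in the scoring, which is the defining property of a data-free method. How such per-tensor scores are combined under a global size budget is the part this sketch leaves out.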
The advantage is especially pronounced on Mixture-of-Experts architectures. The gap between RAM and GPTQ is largest on Mixtral (−4.6%), even though Mixtral has the fewest experts per layer of the three models tested (8, against 64 for Qwen2-57B and 128 for Qwen3-30B), so expert count alone does not explain the margin. As MoE models grow more popular (Qwen3, DeepSeek-V3, and Llama-4 are all MoE), RAM's architecture-aware optimization becomes increasingly valuable. Model design is trending toward more experts with sparser activation, conditions under which each expert sees a smaller share of any calibration set.
What This Means for Practitioners
If you're using GPTQ or AWQ today for MoE models, you may be paying for calibration infrastructure that's actually hurting your model quality. The GPU time, the calibration data curation, and the iteration on calibration hyperparameters may all be producing worse results than a method that uses none of them.
The picture is nuanced, though. Calibration-based methods remain excellent for dense models, where the calibration coverage problem is less severe. In a dense model every weight participates in every forward pass, so any calibration set gives information about all parameters. In an MoE model each token is routed to only a few experts, so a small calibration set can leave individual experts with few samples. For MoE models, data-free methods deserve serious consideration.
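The coverage argument can be made concrete with a little arithmetic. The sketch below assumes perfectly uniform routing, which real routers do not guarantee, and uses the calibration set size from the GPTQ baseline notes (128 samples of 2048 tokens). The two expert layouts are hypothetical examples, not measurements of any model in the table.

```python
def tokens_per_expert(n_samples, seq_len, n_experts, top_k):
    """Expected calibration tokens reaching each expert in one layer,
    assuming every expert is equally likely to be selected."""
    return n_samples * seq_len * top_k / n_experts

N_SAMPLES, SEQ_LEN = 128, 2048
dense_tokens = N_SAMPLES * SEQ_LEN  # a dense layer sees every token

# Hypothetical MoE layouts: (experts per layer, experts active per token).
layouts = {
    "8 experts, top-2": (8, 2),
    "128 experts, top-8": (128, 8),
}

print(f"dense layer: {dense_tokens} tokens")
for name, (n_experts, top_k) in layouts.items():
    expected = tokens_per_expert(N_SAMPLES, SEQ_LEN, n_experts, top_k)
    print(f"{name}: {expected:.0f} tokens per expert "
          f"({expected / dense_tokens:.1%} of the dense case)")
```

Skewed routing makes the picture worse for rarely selected experts, since their share falls below the uniform average.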
The cost savings are a bonus on top of better quality. No GPU required for quantization. No calibration dataset to curate and validate. No iteration on calibration hyperparameters. No risk of distribution mismatch between calibration and deployment. The entire quantization process runs on CPU in under an hour.
The Broader Implication
This result challenges a fundamental assumption in the quantization literature: that more information always leads to better decisions. It seems self-evidently true. How could knowing less produce better results?
The answer is that calibration data isn't free information. It introduces constraints and assumptions that may not hold across all deployment scenarios. As models grow more heterogeneous, with more experts and more architectural variety, those constraints become more costly. RAM's proprietary optimization avoids them entirely, producing better results by working directly with the model's own structure.
Sometimes the best information is no information at all, when the alternative is constrained information that limits your optimization space.
GPTQ baselines use published 4-bit quantization with group size 128, calibrated on the C4 dataset (128 samples, sequence length 2048). RAM results use budget-constrained quality-optimization allocation with per-tensor (bits, group size) optimization. All perplexity evaluations use the WikiText-2 test split. †Mixtral GPTQ size reflects standard 4-bit g128 quantization of the full model. The full RAM pipeline is open source at github.com/baa-ai/RAM.
Read the Full Paper
The full RAM paper covers formal derivations, benchmark results across 7 model families and 40,000+ questions, and the optimal allocation framework. It's on our HuggingFace:
RAM: Compute-Optimal Proprietary Compression for LLMs, Full Paper
huggingface.co/spaces/baa-ai/RAM
Licensed under CC BY-NC-ND 4.0