When Better Is Worse: What We Learned by Trying to Improve Our Quantizer

March 2026 · Black Sheep AI Research

Lower reconstruction error. Worse model quality. Here’s why that happens — and what it means for LLM quantization research.

When we released RAM, our proprietary compression framework, a natural question followed: why use round-to-nearest (RTN) — the simplest possible quantizer — as the base method? Surely replacing it with something more sophisticated would improve results?

We thought so too. We tested multiple alternatives to RTN. Every one of them either failed to improve model quality or was undeployable. The failures revealed something more interesting than a successful upgrade would have.

The finding

We measured two things for each alternative quantizer: per-tensor reconstruction error (how well the quantized weights match the originals) and model-level perplexity (how well the model actually performs). The assumption — shared by most of the quantization literature — is that reducing the first should improve the second.

It doesn’t. We found methods that uniformly improved reconstruction error across every tensor tested, yet made model-level perplexity worse. We found methods that produced exactly zero improvement in the data-free setting. And we found methods that improved reconstruction metrics but could not be deployed on standard inference frameworks.

The core issue is that reconstruction error measures average fidelity across all weights, while perplexity measures whether the specific weight values that drive model behavior survived quantization. Methods that minimize mean squared error can sacrifice critical outlier weights to reduce average error — the reconstruction metric improves, but the model gets worse.
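This failure mode is easy to reproduce in miniature. The sketch below is a toy illustration (not RAM's actual quantizer): a tensor of many small weights plus one outlier is quantized at two bits, once over the full min/max range and once over a clipped range that fits the bulk of the weights. The clipped version has lower MSE, yet it crushes the outlier that the full-range version preserves exactly.

```python
import numpy as np

# Toy tensor: 1000 small weights plus one large outlier.
w = np.concatenate([np.linspace(-0.1, 0.1, 1000), [1.0]])

def uniform_quantize(w, lo, hi, bits=2):
    """Uniform quantization to 2**bits levels over [lo, hi]."""
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((np.clip(w, lo, hi) - lo) / scale)
    return q * scale + lo

# Full min/max range (RTN-style): the bulk is rounded coarsely,
# but the outlier lands exactly on the top quantization level.
rtn = uniform_quantize(w, w.min(), w.max())

# Clipped range: much lower error on the 1000 small weights,
# at the cost of crushing the outlier down to 0.1.
clipped = uniform_quantize(w, -0.1, 0.1)

mse_rtn = np.mean((w - rtn) ** 2)
mse_clipped = np.mean((w - clipped) ** 2)
# mse_clipped comes out lower, while the outlier error grows from ~0 to ~0.9.
```

Whether the clipped version is actually better for the end task depends entirely on whether that outlier channel matters to the model, which per-tensor MSE cannot see.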

Three takeaways for the field

Per-tensor reconstruction error is not a reliable proxy for model quality

This is the central finding and it has implications beyond RAM. Many quantization papers report reconstruction quality, MSE, or the Frobenius norm of the quantization residual as evidence of quality. Our experiments produced a clean counterexample: a method that uniformly improves reconstruction error across all tensors while uniformly degrading the metric that matters. Average fidelity and survival of the behavior-critical weights are different quantities, and they can move in opposite directions.

The practical consequence is that any quantization method optimizing a per-tensor reconstruction objective — without activation-aware weighting — should be validated end-to-end on model-level metrics before drawing quality conclusions. Per-tensor metrics are useful for comparing configurations of the same quantization method (as RAM’s quality curves do), but unreliable for comparing different quantization methods.

Data-free compression methods have a hard ceiling

Our results reveal a structural limitation: without activation data, you cannot determine which weight errors the model is sensitive to. Methods that try to improve per-tensor quantization in the data-free setting either collapse to RTN, make things worse, or improve a metric that doesn’t predict the outcome.

This doesn’t mean data-free methods are inferior — RAM outperforms calibration-based GPTQ at matched sizes across multiple model families. But the advantage comes from allocation (deciding where to spend bits) rather than from per-tensor quantization quality (deciding how to round). In the data-free setting, RTN appears to be at or near the Pareto frontier for per-tensor quantization. The gains are in the meta-problem, not the inner problem.
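The split between the meta-problem and the inner problem can be made concrete with a small sketch. The code below is a hypothetical illustration of budget-constrained allocation, not RAM's published algorithm: each tensor reports a loss at each candidate bit-width, and a greedy pass spends the global bit budget where the marginal loss reduction per bit is largest, while the rounding itself stays plain RTN.

```python
import heapq

def allocate_bits(losses, sizes, budget_bits, min_b=2, max_b=8):
    """Greedy bit allocation under a global budget (illustrative sketch).

    losses[t][b]: per-tensor loss of tensor t at bit-width b.
    sizes[t]:     parameter count of tensor t.
    budget_bits:  total bits available across all parameters.
    """
    n = len(sizes)
    alloc = [min_b] * n
    spent = sum(min_b * s for s in sizes)
    # Max-heap (via negation) of marginal loss reduction per bit spent.
    heap = [(-(losses[t][min_b] - losses[t][min_b + 1]) / sizes[t], t)
            for t in range(n)]
    heapq.heapify(heap)
    while heap:
        _, t = heapq.heappop(heap)
        if spent + sizes[t] > budget_bits:
            continue  # this upgrade no longer fits the budget
        alloc[t] += 1
        spent += sizes[t]
        if alloc[t] < max_b:
            gain = (losses[t][alloc[t]] - losses[t][alloc[t] + 1]) / sizes[t]
            heapq.heappush(heap, (-gain, t))
    return alloc
```

In this framing, a better inner quantizer only changes the entries of the losses table; the data-free advantage comes from how the budget is spent across that table.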

RTN’s simplicity is a feature, not a limitation

Round-to-nearest has no hyperparameters, no convergence criteria, no failure modes, and runs in microseconds per tensor. Every alternative we tested was slower, more complex, and — when it could be evaluated end-to-end — worse. RTN’s min/max clipping may seem naive, but it provides an implicit form of outlier preservation that more “sophisticated” methods optimize away. The combination of speed, reliability, and adequate quality makes RTN a strong choice for proprietary compression.
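For concreteness, here is what that simplicity looks like. This is the generic min/max form of RTN described above (illustrative, not RAM's exact code): a handful of vectorized lines, one pass, nothing to tune.

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Round-to-nearest with min/max range: one pass, no hyperparameters."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    codes = np.round((w - lo) / scale).astype(np.int32)
    return codes, scale, lo

def rtn_dequantize(codes, scale, lo):
    return codes * scale + lo

# Because the range is exactly [min, max], the extreme weights map to the
# endpoint codes and are reconstructed almost exactly: min/max clipping is
# itself a crude but effective form of outlier preservation.
```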

What would actually help

If per-tensor quantization improvements don’t translate to model quality, what would? Based on our results, we believe the answer lies in three directions that we haven’t yet explored:

First, activation-aware allocation — using calibration data not to improve individual tensor quantization, but to better weight the optimal allocation objective. RAM currently uses reconstruction quality as the per-tensor loss; weighting by activation-derived sensitivity could improve the allocation decisions without changing the base quantizer.
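One way this could look in practice, sketched under stated assumptions rather than as a planned implementation: keep the quantizer fixed and replace the plain reconstruction loss with one weighted by per-channel activation statistics. Here `act_scale` is a hypothetical sensitivity vector, e.g. the mean absolute activation per input channel measured on a small calibration set.

```python
import numpy as np

def weighted_recon_loss(w, w_q, act_scale):
    """Activation-weighted reconstruction loss for a linear layer.

    w, w_q:     original and quantized weights, shape (out, in).
    act_scale:  hypothetical per-input-channel sensitivity, shape (in,),
                e.g. mean |activation| from calibration data.
    """
    # Errors on channels the model drives hard count for more.
    err = (w - w_q) * act_scale[None, :]
    return float(np.mean(err ** 2))
```

The base quantizer is untouched; only the per-tensor loss that the allocation optimizes would change, which is exactly where the results above suggest the leverage is.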

Second, GPTQ inside RAM — using calibration-based GPTQ as the per-tensor quantizer while keeping RAM’s budget-constrained allocation on top. This sacrifices the data-free property but would test whether the allocation framework provides orthogonal gains on top of a genuinely better quantizer.

Third, better evaluation metrics — our finding that reconstruction error doesn’t predict perplexity suggests that the field needs per-tensor quality metrics that correlate with model-level outcomes. This likely requires activation data, bringing us back to the fundamental tension between data-free operation and quality prediction.

Conclusion

We set out to improve RAM’s base quantizer and instead learned something more valuable: that the quantization research community’s standard quality metric (per-tensor reconstruction error) can be actively misleading, and that round-to-nearest may be closer to optimal in the data-free setting than its simplicity suggests.

We published these negative results in the RAM paper because they save other researchers from repeating the same experiments and because the underlying finding — reconstruction error doesn’t predict model quality — is relevant to anyone evaluating quantization methods. In a field that tends to publish only improvements, we think the failures are at least as informative.


Code: github.com/baa-ai/RAM — Pre-quantized models: huggingface.co/baa-ai

Read the Full Paper

The complete RAM paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full optimal allocation framework, is available on our HuggingFace space:

RAM: Compute-Optimal Proprietary Compression for LLMs — Full Paper

huggingface.co/spaces/baa-ai/RAM

Licensed under CC BY-NC-ND 4.0

Continue Reading

Related research from our team.

Does Quantization Actually Regularize? We Tested It.
We tested the regularization hypothesis rigorously. The results challenge a common assumption.

Eight Things Our Benchmarks Reveal That Nobody Expected
Surprising findings from our benchmark suite that challenge conventional quantization wisdom.