One Number That Prevents Catastrophic Quantization
MINT Research

March 2026 · baa.ai

A natural gap of nearly 2 dB in signal-to-quantization-noise ratio separates catastrophic from usable compression. An SQNR floor of 9 dB is the universal safety threshold, validated across models from 8B to 109B parameters, dense and MoE.

The Problem

Aggressive quantization can fail silently. A model might pass basic tests—generate coherent sentences, answer simple questions—but produce garbage on certain inputs because critical weight tensors were compressed beyond their tolerance. The difference between a usable quantized model and a catastrophically broken one is often a cliff, not a slope.

Teams discover this cliff through expensive trial and error. Quantize aggressively, evaluate on benchmarks, find degradation, back off, repeat. Each iteration costs GPU-hours and engineering time. Worse, benchmark evaluations may not catch the failure modes that matter in production—the cliff can be input-dependent, surfacing only on edge cases that standard evaluations miss.

What if you could know, before any deployment, whether a quantized model has crossed the line?

What Is SQNR?

Signal-to-quantization-noise ratio measures how much the quantized weights differ from the originals. For a weight tensor W quantized to Q(W) with bit-width b and group size g:

SQNR(b, g) = 10 · log₁₀( ‖W‖²_F / ‖W − Q(W; b, g)‖²_F ) dB

where ‖·‖_F denotes the Frobenius norm.

Higher is better. Think of it like audio: below a certain threshold, the signal is drowned in noise. Above that threshold, the noise is present but tolerable. The question is: where exactly is the threshold?
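The definition is simple enough to compute directly. The sketch below pairs it with a plain symmetric round-to-nearest group quantizer; this quantizer is an illustrative stand-in, not the scheme used by MINT, GPTQ, or any specific pipeline, and `quantize_rtn` and `sqnr_db` are hypothetical helper names.

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int, group_size: int) -> np.ndarray:
    """Symmetric round-to-nearest quantization with per-group absmax scales.
    Illustrative only; production quantizers differ in clipping and scaling."""
    flat = w.reshape(-1, group_size)        # tensor size must divide evenly
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                 # guard all-zero groups
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)

def sqnr_db(w: np.ndarray, w_q: np.ndarray) -> float:
    """SQNR = 10 * log10(||W||_F^2 / ||W - Q(W)||_F^2), in dB."""
    err = w - w_q
    return 10.0 * np.log10(np.sum(w * w) / np.sum(err * err))

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
for bits in (2, 3, 4):
    print(f"{bits}-bit, g=64: {sqnr_db(w, quantize_rtn(w, bits, 64)):.1f} dB")
```

On Gaussian-like weights this reproduces the qualitative picture discussed below: 2-bit lands well under 9 dB while 3- and 4-bit clear it comfortably.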

The Gap We Discovered

When we computed SQNR across every weight tensor in multiple models and every configuration in our search space, a striking pattern emerged. There is a clean, natural gap in the SQNR distribution between configurations that work and configurations that catastrophically fail.

| Config (b, g) | Min | P5 | Median | P95 | Max | Tensors < 9 dB | Tensors < 15 dB |
|---|---|---|---|---|---|---|---|
| (2, 32) | 5.1 | 7.2 | 8.0 | 8.1 | 8.7 | 691 | 691 |
| (2, 64) | 2.5 | 5.6 | 6.8 | 6.9 | 7.2 | 691 | 691 |
| (3, 64) | 10.4 | 13.0 | 14.2 | 14.3 | 14.6 | 0 | 691 |
| (4, 32) | 19.4 | 21.3 | 22.0 | 22.1 | 22.8 | 0 | 0 |

All SQNR statistics are in dB; the last two columns count tensors below each threshold.

Every single 2-bit configuration falls below 9 dB. Every single 3-bit configuration exceeds 10 dB. There is a clean, natural gap between 8.7 dB (the highest 2-bit SQNR) and 10.4 dB (the lowest 3-bit SQNR). A threshold at 9 dB sits perfectly in this gap—a no-man’s-land that no configuration actually occupies.

The boundary between catastrophic and usable is not gradual. It is a gap of nearly 2 dB with nothing in it.

The Floor Sweep

To validate that this gap translates to real model quality, we ran a floor sweep on Llama-4-Scout—a 109-billion-parameter MoE model. We constrained MINT’s allocator to enforce minimum SQNR thresholds and measured the resulting perplexity:

| SQNR Floor | Avg Bits | Size (GB) | Mean PPL | Assessment |
|---|---|---|---|---|
| 0 dB | 2.00 | 34.62 | 23.577 | Catastrophic |
| 9 dB | 3.00 | 46.93 | 8.675 | Usable (+9.8%) |
| 9 dB + 50 GB | 3.48 | 51.98 | 7.980 | Good (+1.0%) |
| 15 dB | 4.00 | 56.16 | 7.709 | Best (−2.4%) |

The jump from 0 dB to 9 dB is a 2.7× reduction in perplexity—from catastrophic to usable in a single threshold step. Above 9 dB, improvements are incremental. Below it, the model is fundamentally broken. This single threshold catches every catastrophic configuration.
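A floor-constrained allocator of this kind can be sketched in a few lines. This is not MINT's actual allocator; `allocate_bits`, `quantize_rtn`, and `sqnr_db` are hypothetical helpers, and the search is a simple greedy pass that takes, for each tensor, the cheapest bit-width meeting the floor.

```python
import numpy as np

def quantize_rtn(w, bits, group_size=64):
    """Symmetric round-to-nearest group quantization (illustrative)."""
    flat = w.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)

def sqnr_db(w, w_q):
    err = w - w_q
    return 10.0 * np.log10(np.sum(w * w) / np.sum(err * err))

def allocate_bits(tensors, floor_db=9.0, choices=(2, 3, 4)):
    """Per tensor, pick the cheapest bit-width whose SQNR meets the floor."""
    plan = {}
    for name, w in tensors.items():
        for bits in choices:
            if sqnr_db(w, quantize_rtn(w, bits)) >= floor_db:
                plan[name] = bits
                break
        else:
            plan[name] = choices[-1]  # floor unreachable: fall back to widest
    return plan
```

Raising `floor_db` from 9 to 15 pushes tensors from 3-bit to 4-bit allocations, which is the same move the sweep makes between its 9 dB and 15 dB rows.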

A One-Line Quality Gate

The practical implication is remarkably simple. Before deploying any quantized model, ask one question:

“Does any tensor in this model have SQNR below 9 dB?”

That’s it. One check, evaluated in seconds, before any deployment. Computing SQNR requires only the original weights and the quantized weights—no calibration data, no inference pass, no GPU. It is a pure mathematical comparison that runs on CPU.
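As a sketch, the gate is just a loop over tensor pairs. `sqnr_gate` is a hypothetical helper name, and plain NumPy arrays stand in for a model's state dict.

```python
import numpy as np

def sqnr_gate(original, quantized, floor_db=9.0):
    """Check every tensor pair; return (passed, {name: sqnr} for offenders)."""
    offenders = {}
    for name, w in original.items():
        err = w - quantized[name]
        sqnr = 10.0 * np.log10(np.sum(w * w) / np.sum(err * err))
        if sqnr < floor_db:
            offenders[name] = float(sqnr)
    return len(offenders) == 0, offenders
```

A pipeline would call this once after quantization and refuse to export if any offenders are reported, along with the names of the tensors that need a wider bit-width.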

This is not specific to MINT. Any quantization pipeline—GPTQ, AWQ, llama.cpp, custom solutions—can implement this check. It prevents the worst-case scenario: an aggressively quantized model shipping to production and failing unpredictably on inputs that happen to stress the tensors that were compressed too far.

Why This Generalizes

The 9 dB gap is structural, not accidental. It appears consistently across models from 8 billion to 109 billion parameters, across both dense architectures and Mixture-of-Experts models. The reason is rooted in the physics of round-to-nearest quantization.

Two bits can represent exactly four values. For a weight distribution with any meaningful variance, four values simply cannot approximate the distribution with acceptable fidelity. The quantization error is not just high—it is structurally high, because the representation lacks the resolution to track the signal. Three bits provide eight values, and this doubling of representational capacity crosses a fundamental threshold where the noise drops below the signal.

This is why the gap is so clean. It is not a statistical artifact of particular weight distributions—it is a consequence of information theory. Two bits are categorically insufficient for the weight distributions that appear in modern language models. The 9 dB threshold marks the boundary of that categorical insufficiency.

Because the gap is structural, it generalizes. New architectures, new model sizes, new training procedures—as long as the weight distributions have the characteristics typical of transformer models (which they do, consistently), the 9 dB floor will hold.
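The structural claim is easy to probe on synthetic data. The sketch below quantizes many random Gaussian tensors (a rough stand-in for transformer weight statistics) at 2 and 3 bits and checks where each family lands relative to 9 dB; the quantizer is the same illustrative round-to-nearest scheme as in the earlier sketch, not MINT's.

```python
import numpy as np

def quantize_rtn(w, bits, group_size=64):
    """Symmetric round-to-nearest group quantization (illustrative)."""
    flat = w.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)

def sqnr_db(w, w_q):
    err = w - w_q
    return 10.0 * np.log10(np.sum(w * w) / np.sum(err * err))

rng = np.random.default_rng(1)
two_bit, three_bit = [], []
for _ in range(50):
    w = rng.standard_normal((128, 128))
    two_bit.append(sqnr_db(w, quantize_rtn(w, 2)))
    three_bit.append(sqnr_db(w, quantize_rtn(w, 3)))
print(f"2-bit max: {max(two_bit):.1f} dB, 3-bit min: {min(three_bit):.1f} dB")
```

On this synthetic data every 2-bit result sits below 9 dB and every 3-bit result above it, mirroring the gap in the measured distributions.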


SQNR distributions computed across all weight tensors in Llama-4-Scout (691 quantisable tensors). Floor sweep also conducted on Llama-4-Scout (109B parameters, MoE). Perplexity evaluated on WikiText-2 test split. All thresholds validated across multiple model families and architectures. The full MINT pipeline is open source at github.com/baa-ai/MINT.


Ready to add a universal safety gate to your quantization pipeline?

Our team specialises in data-free model compression, budget-aware quantization, and production AI deployment on commodity hardware.

Talk to Our Team