Eight Measurements per Tensor
MINT Research

March 2026 · baa.ai

SWAN v1 used 4-bit NRMSE to predict 4-bit allocation—circular reasoning that missed counterintuitive optima. MINT computes the full error surface and finds allocations that look wrong but are provably right.

The Circularity Problem

SWAN v1 and most sensitivity-based quantization methods share a fundamental flaw: they measure how much error a tensor produces when quantized to 4-bit, then use that measurement to decide whether the tensor should be 4-bit. This is circular: the 4-bit error is used to predict the 4-bit allocation, so the metric is partly predicting its own label.

The problem becomes concrete when you consider edge cases. If a tensor has high error at 4-bit but low error at 3-bit, a single-point metric misses this entirely. The sensitivity metric flags the tensor as “highly sensitive,” and the allocator responds by assigning it 8-bit—an expensive allocation that wastes memory budget. Meanwhile, 3-bit with a smaller group size might have been perfectly adequate, at a fraction of the storage cost.

This is not a theoretical concern. Weight matrices have complex error surfaces. The relationship between bit-width, group size, and quantization error is non-linear and often non-monotonic. A single measurement at one operating point cannot capture this complexity. It is like measuring temperature at noon and concluding you know the weather.

What a Rate-Distortion Curve Reveals

For each tensor, MINT computes the normalized root mean squared error (NRMSE) at 8 different (bit-width, group-size) configurations:

Configuration   Bit-width   Group Size
1               2           32
2               3           64
3               4           32
4               4           64
5               4           128
6               8           64
7               8           128
8               16          (no grouping)

This produces a curve showing how quality degrades as compression increases. Some tensors have steep curves—quality drops sharply with compression, and these need more bits. Others have flat curves—quality barely changes regardless of how aggressively you compress, and these can be squeezed hard. The shape of the curve matters more than any single point on it.

Think of it this way: a single-point metric tells you the altitude at one location. A rate-distortion curve gives you the topographic map.
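As a concrete sketch, the eight measurements are just eight round trips over the same weights. The snippet below uses a plain symmetric absmax quantizer for illustration; `CONFIGS`, `nrmse`, and `rd_curve` are names chosen here, not MINT's actual kernel or API.

```python
import numpy as np

# The eight (bit-width, group-size) operating points from the table above.
# Group size None means per-tensor scaling (no grouping). A sketch using a
# plain symmetric absmax quantizer, not MINT's exact implementation.
CONFIGS = [(2, 32), (3, 64), (4, 32), (4, 64), (4, 128),
           (8, 64), (8, 128), (16, None)]

def nrmse(w, bits, group):
    """Round-trip quantize-dequantize error, normalized by the weight RMS."""
    flat = w.reshape(-1).astype(np.float64)
    g = len(flat) if group is None else group
    qmax = 2 ** (bits - 1) - 1
    err = np.empty_like(flat)
    for i in range(0, len(flat), g):
        block = flat[i:i + g]
        scale = max(np.abs(block).max(), 1e-12) / qmax   # absmax per group
        q = np.clip(np.round(block / scale), -qmax - 1, qmax)
        err[i:i + g] = block - q * scale                 # dequantize, compare
    return float(np.sqrt(np.mean(err ** 2) / np.mean(flat ** 2)))

def rd_curve(w):
    """Eight NRMSE values, one per configuration: the rate-distortion curve."""
    return [nrmse(w, b, g) for b, g in CONFIGS]
```

Running `rd_curve` on any weight matrix yields the curve directly: more bits give lower NRMSE, and the spacing between the eight values is the shape information the next section extracts.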

Four Features from the Curve

From each tensor’s rate-distortion curve, MINT extracts four features that capture the shape of the error surface:

Area under curve (fauc). The total sensitivity across all operating points. A tensor with high AUC is sensitive everywhere—it needs protection regardless of the target bit-width. A tensor with low AUC is robust everywhere—compress it freely.

4-to-8 bit ratio (fm48). How much quality improves when going from 4-bit to 8-bit. A high ratio means the tensor benefits enormously from the extra bits—it is worth the memory cost. A low ratio means the tensor is already doing fine at 4-bit—spending 8-bit on it is waste.

2-to-4 bit ratio (fm24). How much quality improves when going from 2-bit to 4-bit. This captures behavior at the aggressive end of the compression spectrum. Some tensors collapse at 2-bit but recover at 4-bit; others are already broken at 4-bit, making the 2-to-4 transition irrelevant.

Local slope (fslope). How steeply quality changes at the specific operating point under consideration. A steep slope means small changes in bit allocation produce large changes in quality—the tensor is at a critical inflection point. A shallow slope means the tensor is in a stable region.

Together, these four features capture the shape of the error surface, not just a single value. They tell the allocator not just “how sensitive is this tensor?” but “how does this tensor’s sensitivity change across the compression spectrum?”
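One plausible way to compute the four features from an 8-point curve is sketched below. The index layout follows the configuration table earlier (index 0 is 2-bit/group-32, index 7 is 16-bit/no grouping); the exact definitions used inside MINT may differ.

```python
import numpy as np

# Curve indices follow the configuration table: 0..7 correspond to
# (2,32) (3,64) (4,32) (4,64) (4,128) (8,64) (8,128) (16,none).
# Illustrative definitions only; MINT's exact formulas may differ.
def curve_features(curve, eps=1e-12):
    c = np.asarray(curve, dtype=float)
    f_auc = float(c.mean())                # overall sensitivity level
    f_m48 = float(c[3] / (c[5] + eps))     # 4-bit vs 8-bit error, group 64
    f_m24 = float(c[0] / (c[2] + eps))     # 2-bit vs 4-bit error, group 32
    f_slope = float((c[5] - c[3]) / 4.0)   # error change per bit near 4-bit
    return f_auc, f_m48, f_m24, f_slope
```

A steep tensor shows up as `f_m48` well above 1 with a strongly negative `f_slope` (extra bits buy a lot); a flat tensor has ratios near 1 and a slope near zero.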

When the Allocator Disagrees with Intuition

The MCKP solver sometimes produces allocations that look wrong to a human reviewer. For example, it might keep a seemingly unimportant MLP intermediate tensor at 8-bit while quantizing a seemingly critical attention projection matrix to 4-bit. An engineer reviewing this allocation would likely flag it as a bug.

It is not a bug. The MLP tensor has a steep rate-distortion curve—the quality gain from upgrading 4-bit to 8-bit is large relative to the storage cost. Every additional bit allocated to this tensor produces a significant reduction in total model error. The attention tensor, despite its architectural importance, has a flat rate-distortion curve—it tolerates 4-bit quantization with minimal quality loss. Allocating 8-bit to it would consume budget without meaningfully improving the model.

The solver optimizes total model quality, not per-tensor intuition. It does not care about the tensor’s name, its position in the architecture, or its role in the computation graph. It cares about one thing: given a fixed memory budget, which allocation minimizes the sum of distortion across all tensors? The answer sometimes contradicts what an expert would guess, and the answer is provably correct.

Why Multi-Point Beats Single-Point

A single-point metric tells you one thing: how much error at one configuration. It cannot tell you:

• Whether the tensor’s error curve is steep or flat

• Whether a different bit-width or group-size would dramatically change the error

• Whether the tensor is at a knee in the curve—a point where a small budget increase yields a large quality gain

The rate-distortion curve captures all of this. It is the difference between photographing a landscape from one angle and mapping it from eight. With one photograph, you know what the landscape looks like from where you are standing. With eight measurements from different vantage points, you can reconstruct the terrain.

This matters most for tensors at the boundary—the ones that are borderline cases for any given bit-width. These are exactly the tensors where allocation decisions have the largest impact on model quality, and exactly the tensors where single-point metrics are most misleading.

From Heuristics to Optimization

With single-point metrics, allocation is necessarily heuristic: set thresholds, sort by sensitivity, assign bit-widths based on rules of thumb. This works adequately for simple cases but breaks down as models get larger and the configuration space grows. There is no way to prove that a heuristic allocation is optimal, and no way to know how far from optimal it is.

With multi-point rate-distortion curves, allocation becomes a formal optimization problem. Each tensor has a known cost (storage in bytes) and a known benefit (error reduction) at each configuration. The problem of finding the allocation that minimizes total distortion subject to a memory budget is a Multiple-Choice Knapsack Problem (MCKP), and it has a provably optimal solution.

The allocator does not guess. It solves. This is why MINT can offer budget-targeted quantization: given any memory budget—47 GB, 58 GB, 64 GB, any number—the MCKP solver produces the allocation that minimizes total distortion for that exact budget. You cannot do this with single-point metrics because you do not know how quality changes across configurations. You only know one point on the curve, and you cannot optimize over a surface you have not measured.
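The optimization above can be sketched as a small dynamic program over a discretized budget. Each tensor contributes a list of (cost, distortion) options, one per configuration, and exactly one option must be chosen per tensor. This is an illustrative solver with integer costs, not MINT's actual implementation.

```python
# Minimal MCKP sketch: each tensor picks exactly one (cost, distortion)
# option; minimize total distortion under an integer cost budget. Costs
# are assumed pre-discretized (e.g. to MiB). Not MINT's actual solver.
def solve_mckp(tensors, budget):
    INF = float("inf")
    dp = [0.0] + [INF] * budget          # dp[b] = min distortion at cost b
    back = []                            # back-pointers to recover choices
    for options in tensors:
        nxt = [INF] * (budget + 1)
        pick = [None] * (budget + 1)
        for b, d in enumerate(dp):
            if d == INF:
                continue
            for k, (cost, dist) in enumerate(options):
                nb = b + cost
                if nb <= budget and d + dist < nxt[nb]:
                    nxt[nb] = d + dist
                    pick[nb] = (b, k)    # came from cost b, chose option k
        dp = nxt
        back.append(pick)
    best = min(range(budget + 1), key=dp.__getitem__)
    alloc, b = [], best                  # walk back-pointers to the start
    for pick in reversed(back):
        b, k = pick[b]
        alloc.append(k)
    return list(reversed(alloc)), dp[best]
```

On a toy instance with two tensors, each offering a cheap 4-bit-like option and a costly 8-bit-like option, a budget of 3 units makes the solver upgrade the tensor with the large error gap and keep the other cheap — the same counterintuitive behavior described above, now as a provable minimum.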

The Computational Cost

Computing 8 configurations per tensor sounds expensive. It is not.

Each NRMSE computation is a round-trip quantize-dequantize on a single weight matrix. The process is straightforward: take the original weights, quantize them to the target configuration, dequantize back to floating point, and measure the normalized root mean squared error between the original and the round-tripped values. This takes milliseconds per tensor.

For Qwen3-30B with 18,867 tensors, the entire feature extraction process—including spectral analysis, kurtosis computation, and all 8 rate-distortion curve measurements per tensor—takes approximately 50 minutes on a CPU. No GPU required. The allocation step itself, solving the MCKP, takes less than 1 second.

The investment in multi-point measurement is modest. The payoff—provably optimal allocation at any memory budget—is substantial. Fifty minutes of CPU time buys you something that heuristic methods cannot provide at any cost: the mathematical guarantee that no better allocation exists for the given budget.


This article describes rate-distortion curve analysis as implemented in MINT (Memory-Informed N-bit Tuning), developed at baa.ai. For technical details on the MCKP formulation and the full feature extraction pipeline, see the MINT paper. The full pipeline is open source at github.com/baa-ai/MINT.
