The current allocator uses raw NRMSE as its sole loss function. But MINT computes far richer per-tensor sensitivity features that reveal structure invisible to reconstruction error alone.
The Limitation MINT Acknowledges
MINT’s allocator uses NRMSE — normalised root mean squared error — as its sole loss function. NRMSE measures weight reconstruction fidelity: how closely the quantized weights match the originals. But not all tensors contribute equally to downstream loss. A tensor with high NRMSE might have minimal impact on output quality, while a tensor with modest NRMSE might be critical for specific capabilities.
The MINT paper explicitly flags this as a limitation and documents three families of sensitivity features that are computed during analysis but not yet used by the allocator. These features represent a research direction for future versions — and understanding them reveals why quantization is harder than it looks.
Spectral Features: What the Singular Values Tell You
MINT computes three scale-invariant spectral features from the singular values of each weight tensor (via randomised SVD with rank k=256):
Stable rank. The effective dimensionality of the weight matrix, defined as the ratio of the squared Frobenius norm to the squared spectral norm. A tensor with low stable rank concentrates most of its information in a few directions — quantization is more likely to corrupt these critical components. Formally: r_s(W) = ‖W‖_F² / ‖W‖_2².
Spectral tail mass. The fraction of energy outside the top k/10 singular values. High tail mass means information is distributed across many singular directions — the tensor may tolerate quantization better because no single direction dominates. Low tail mass means a few directions carry most of the signal, making quantization more dangerous.
Approximate log spectral spread. Based on the top-k truncated SVD, this measures how spread out the singular value spectrum is. A large spread (akin to a high condition number) suggests the tensor has both very important and very unimportant directions — uniform quantization across all elements may be particularly wasteful.
These features are computed from the weight tensors alone — no calibration data needed — and capture structural properties that NRMSE misses.
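The three definitions are compact enough to sketch directly. The NumPy sketch below follows the feature definitions above, but uses an exact truncated SVD where the paper uses randomised SVD with k=256; the function name and returned keys are illustrative, not MINT's actual API.

```python
import numpy as np

def spectral_features(W: np.ndarray, k: int = 256) -> dict:
    """Sketch of the three spectral features, from weights alone.

    Assumption: an exact truncated SVD stands in for the paper's
    randomised SVD; the feature definitions are unchanged.
    """
    s = np.linalg.svd(W, compute_uv=False)[:k]   # top-k singular values, descending
    energy = s ** 2

    # Stable rank: squared Frobenius norm over squared spectral norm.
    stable_rank = float((W ** 2).sum() / energy[0])

    # Spectral tail mass: fraction of (top-k) energy outside the top k/10.
    head = max(1, k // 10)
    tail_mass = float(energy[head:].sum() / energy.sum())

    # Approximate log spectral spread: log-ratio of the largest to the
    # smallest retained singular value (condition-number-like).
    log_spread = float(np.log(s[0] / s[-1]))

    return {"stable_rank": stable_rank,
            "tail_mass": tail_mass,
            "log_spread": log_spread}
```

A near-isotropic random matrix scores high on stable rank and tail mass; a nearly low-rank tensor scores low on both, flagging it as more fragile under quantization.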
Per-Group Kurtosis: Where the Outliers Hide
Kurtosis measures how heavy-tailed a distribution is. Standard kurtosis computed over an entire tensor operates at the wrong granularity — quantization happens at the group level (groups of 32, 64, or 128 parameters), so the relevant question is whether individual groups contain outliers.
MINT reshapes each weight tensor into groups of 128 and computes excess kurtosis for each group. Four features are extracted from the distribution of per-group kurtosis values:
- P90 kurtosis: The 90th percentile — how leptokurtic are the worst groups?
- Tail spread (P99 − P50): How much variation exists between typical and extreme groups?
- Outlier group fraction: What fraction of groups contain at least one value beyond 3 standard deviations?
- Max-median gap: The difference between the most extreme group and the typical group.
This aligns the statistical metric with the quantization granularity, addressing the critique from recent work (KurTail, EMNLP 2025) that global kurtosis operates at the wrong level.
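A minimal sketch of the group-wise computation, assuming within-group standardisation and the group size of 128 given above (the function and key names are illustrative, not MINT's actual API):

```python
import numpy as np

def group_kurtosis_features(W: np.ndarray, group_size: int = 128) -> dict:
    """Per-group excess kurtosis features, sketched from the text.

    Assumptions: groups are taken over the flattened weight tensor,
    and trailing elements that do not fill a group are dropped.
    """
    w = W.ravel()
    n = (w.size // group_size) * group_size
    g = w[:n].reshape(-1, group_size)

    # Standardise each group, then take the fourth moment minus 3.
    z = (g - g.mean(axis=1, keepdims=True)) / (g.std(axis=1, keepdims=True) + 1e-12)
    kurt = (z ** 4).mean(axis=1) - 3.0           # excess kurtosis per group

    p50, p90, p99 = np.percentile(kurt, [50, 90, 99])
    return {
        "p90_kurtosis": float(p90),                               # worst groups
        "tail_spread": float(p99 - p50),                          # typical vs extreme
        "outlier_group_fraction": float((np.abs(z) > 3).any(axis=1).mean()),
        "max_median_gap": float(kurt.max() - p50),                # most extreme vs typical
    }
```

For Gaussian-like weights every feature sits near zero; heavy-tailed groups push P90 kurtosis and the max-median gap up, exposing exactly the groups a fixed-scale quantizer handles worst.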
Norm-Guided Output Noise Amplification: Simulating Activations Without Data
The most sophisticated feature addresses a fundamental question: how much does quantization noise in this tensor amplify into output noise?
MINT cannot observe actual activations (it’s data-free), but it can approximate the input distribution using a clever proxy. Each linear layer is preceded by a normalisation layer with a learned scale parameter γ. MINT samples probe vectors from a Gaussian distribution scaled by γ — encoding the model’s own channel importance without any calibration data.
Using 32 probe vectors and the actual RTN quantization residual at 4-bit, MINT computes the ratio of output noise to clean output: f_out = ‖ΔW·X‖_F / ‖W·X‖_F. This measures how much the quantization error is amplified through each specific tensor, accounting for the tensor’s role in the network.
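As a sketch, the probe construction and the f_out ratio look like this, with a simple per-group absmax round-to-nearest quantizer standing in for MINT's actual RTN kernel (the quantizer details and all names here are assumptions):

```python
import numpy as np

def rtn_quantize(W: np.ndarray, bits: int = 4, group_size: int = 128) -> np.ndarray:
    """Per-group absmax round-to-nearest quantization (illustrative stand-in)."""
    w = W.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                       # guard all-zero groups
    q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return q.reshape(W.shape)

def output_noise_amplification(W: np.ndarray, gamma: np.ndarray,
                               n_probes: int = 32, seed: int = 0) -> float:
    """f_out = ||dW @ X||_F / ||W @ X||_F with probes x ~ N(0, diag(gamma^2))."""
    rng = np.random.default_rng(seed)
    # gamma scales each input channel: the preceding norm layer's learned scale.
    X = gamma[:, None] * rng.standard_normal((W.shape[1], n_probes))
    dW = rtn_quantize(W) - W                      # 4-bit RTN residual
    return float(np.linalg.norm(dW @ X) / np.linalg.norm(W @ X))
```

A ratio much larger than the plain weight NRMSE signals that the channels the norm layer emphasises coincide with where the quantization error concentrates — the case NRMSE alone cannot detect.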
This feature is particularly interesting because it bridges the gap between weight-only analysis (data-free) and activation-aware methods (calibration-required). The norm parameters encode a compressed representation of the activation distribution that the model learned during training.
Why These Features Aren’t Used Yet
The MINT paper is transparent about why these features remain exploratory: the current NRMSE-based allocator already outperforms GPTQ (a calibration-based method) across all tested models. Adding complexity to the loss function introduces risks — learned importance weights could overfit to the evaluation benchmark, or the additional features could disagree with NRMSE in ways that hurt rather than help.
There is one model where the limitation matters: GLM-4.7-Flash. On this model, SWAN v1’s threshold heuristic (which implicitly captures some sensitivity information through its multi-metric scoring) achieves slightly better median PPL (9.084 vs 9.210) by assigning 141 tensors to 8-bit. MINT’s NRMSE-based allocator kept all quantizable tensors at 4-bit within the budget. The spectral and kurtosis features might identify the tensors that v1 correctly promoted — but this is speculative and requires systematic validation.
The Research Direction
Incorporating these features into the allocator requires solving a chicken-and-egg problem: you need ground-truth tensor importance labels to train an importance predictor, but tensor importance is defined by downstream loss, which you can only measure by running expensive evaluations. The MINT paper suggests Fisher information or Hessian traces as candidates for per-tensor importance weighting that could be computed data-free.
The vision: a future version of MINT where the allocator combines NRMSE (reconstruction fidelity) with spectral features (structural vulnerability), kurtosis features (outlier exposure), and output noise amplification (downstream impact estimation) to make allocation decisions that match or exceed what activation-aware methods achieve — while remaining entirely data-free.
Data from the MINT paper: “MINT: Budget-Aware Data-Free Mixed-Precision Quantization for Large Language Models via Rate-Distortion Optimization” (baa.ai, 2026). Sensitivity features documented in Appendix C of the paper. All features computed from weight tensors alone via randomised SVD (k=256) and group-wise statistical analysis. Full paper at baa.ai/articles/24-mint-paper.html. Code at github.com/baa-ai/MINT.