The 16-Bit Allocation Mistake
MINT Research


March 2026 · baa.ai

SWAN v1 kept 5.6% of parameters at 16-bit precision to protect sensitive tensors. MINT shows those tensors are better served by 4-bit with group size 32 — comparable quality at 25% of the storage cost.

The intuition that led us astray

When we built SWAN v1, our sensitivity analysis identified certain tensors — embeddings, attention projections in early layers, language model heads — as highly sensitive to quantization. The natural response was to protect them with maximum precision: keep them at 16-bit.

This consumed 5.6% of the total parameter count, and we also allocated 4.0% at 2-bit on the aggressive end. Both decisions, it turns out, were wrong.

What MINT discovered

MINT’s joint optimization over bit-width AND group size found that every tensor previously allocated 16-bit is better served by 4-bit with group size 32. The 1.3% of tensors that appear at 16-bit in MINT’s allocation are exclusively 1D tensors (biases) that are too small for group quantization — they would have been kept at 16-bit regardless.

Dimension            v1 (SWAN)         MINT
16-bit allocation    5.6% of params    ~0% (1D biases only)
2-bit allocation     4.0% of params    0%
Group size           Fixed (128)       Per-tensor (85% chose g32)
Safety floor         None              SQNR 9 dB veto

The math of why g32 beats 16-bit

16-bit costs 2 bytes per parameter. 4-bit with group size 32 costs approximately 0.5 bytes per parameter (4 bits for the weight, plus a shared scale/zero-point amortized over each group of 32 weights). That is roughly 25% of the storage cost.
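The arithmetic can be checked in a few lines of Python. The 16-bit scale and 4-bit zero-point per group are illustrative assumptions about the storage format, not MINT's actual encoding; with them, the overhead lands slightly above the rounded 0.5 bytes figure.

```python
def bytes_per_param(weight_bits, group_size, scale_bits=16, zero_bits=4):
    """Effective bytes per parameter: the weight's own bits plus the
    per-group scale/zero-point overhead amortized over the group.
    scale_bits and zero_bits are assumed values, not MINT's format."""
    return (weight_bits + (scale_bits + zero_bits) / group_size) / 8

fp16_cost = 2.0                           # 16-bit: 2 bytes per parameter
g32_cost = bytes_per_param(4, 32)         # 4-bit, group size 32
print(f"4-bit g32: {g32_cost:.3f} bytes/param "
      f"({g32_cost / fp16_cost:.1%} of 16-bit)")
```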

But the quality difference is negligible because group size 32 provides such fine-grained quantization that the per-group scale factors can closely approximate the original weight distribution. The 16-bit allocation was buying precision the model did not need, at 4× the cost.

Think of it this way: with group size 128, each scale factor must represent 128 weights. Some of those weights may have very different magnitudes, and the shared scale cannot capture all of them well. With group size 32, each scale factor only needs to represent 32 weights — a much more homogeneous group. The quantization error per weight drops dramatically, closing the gap with 16-bit precision to the point where the remaining difference is not worth 4× the storage.
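The effect is easy to reproduce with a toy experiment. The sketch below uses symmetric per-group absmax quantization on synthetic heavy-tailed weights; both the scheme and the weight distribution are simplifying assumptions for illustration, not MINT's pipeline.

```python
import random

def quantize_groups(weights, bits, group_size):
    """Symmetric per-group absmax quantization: each group of
    `group_size` weights shares one scale, and weights map to
    integers in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        out.extend(round(w / scale) * scale for w in group)
    return out

def rms_error(a, b):
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

random.seed(0)
# Heavy-tailed weights: mostly small values with occasional outliers,
# the regime where one shared scale over 128 weights hurts most.
w = [random.gauss(0, 1) * (10 if random.random() < 0.02 else 1)
     for _ in range(4096)]

for g in (32, 128):
    print(f"g{g}: RMS error {rms_error(w, quantize_groups(w, 4, g)):.4f}")
```

An outlier inflates the shared scale for its whole group; with g32 that damage is confined to 32 weights instead of 128, so the RMS error drops.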

The 2-bit mistake on the other end

SWAN v1 also allocated 4.0% of parameters at 2-bit. MINT blocks all 2-bit allocations via the SQNR safety veto at 9 dB. Every single 2-bit configuration in our analysis fell below this threshold.

2-bit quantization is not “very aggressive” — it is structurally catastrophic for transformer weight matrices. The difference between 2-bit and 3-bit is not gradual; there is a cliff in the SQNR distribution. At 2-bit, a weight can only take one of four values. For a tensor with any meaningful variance in its weight distribution, four reconstruction levels cannot preserve the signal. The quantization noise overwhelms the signal, and the tensor effectively becomes random.

The SQNR safety veto catches this automatically. Any configuration where the signal-to-quantization-noise ratio falls below 9 dB is rejected before it can enter the optimization. This is not a hyperparameter we tuned — it is a structural property of the noise floor visible in every model we analyzed.
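A minimal SQNR check makes the cliff visible. The sketch below measures SQNR for symmetric per-group absmax quantization of Gaussian weights; the quantizer and the weight distribution are simplifying assumptions, but the 2-bit collapse below the 9 dB floor shows up regardless.

```python
import math
import random

def sqnr_db(weights, bits, group_size=32):
    """Signal-to-quantization-noise ratio in dB under symmetric
    per-group absmax quantization."""
    qmax = 2 ** (bits - 1) - 1
    sig = noise = 0.0
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        for w in group:
            q = round(w / scale) * scale
            sig += w * w
            noise += (w - q) ** 2
    return 10 * math.log10(sig / noise)

random.seed(0)
w = [random.gauss(0, 1) for _ in range(4096)]

SQNR_FLOOR_DB = 9.0                       # MINT's safety veto threshold
for bits in (2, 3, 4, 5):
    s = sqnr_db(w, bits)
    verdict = "ok" if s >= SQNR_FLOOR_DB else "VETOED"
    print(f"{bits}-bit: {s:5.1f} dB  {verdict}")
```

At 2-bit, most Gaussian weights round to zero against the outlier-driven scale, so noise power approaches signal power; at 3-bit and above the SQNR jumps well clear of the floor.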

Both extremes were wrong

SWAN v1’s allocation ranged from 2-bit to 16-bit. MINT’s allocation ranges from 3-bit (minimum safe) to 8-bit. The allocation space collapsed at both ends.

The bottom was raised by the SQNR safety floor (no more catastrophic 2-bit). The top was lowered by joint group-size optimization (no more wasteful 16-bit). The result is a tighter, more efficient allocation band that achieves better quality at smaller total size.

The optimal allocation space is narrower than anyone expected: 3-bit to 8-bit, with 85% of tensors choosing group size 32.

The storage savings

Eliminating the 5.6% at 16-bit and redirecting those parameters to 4-bit g32 saves significant storage. On a 30B parameter model:

5.6% at 16-bit = ~1.68 billion parameters × 2 bytes = ~3.36 GB. The same parameters at 4-bit g32 cost ~0.84 GB. That is roughly 2.5 GB of savings from the 16-bit elimination alone. Combined with the 2-bit parameters being raised to 3-bit (a small size increase but critical quality improvement), the net effect is a model that is both smaller and better.
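The same arithmetic as a quick check (decimal gigabytes, and the rounded ~0.5 bytes/param cost of 4-bit g32 quoted above):

```python
params_total = 30e9                       # 30B-parameter model
frac_16bit = 0.056                        # SWAN v1's 16-bit share
GB = 1e9                                  # decimal gigabytes, as in the text

params_16 = frac_16bit * params_total     # ~1.68 billion parameters
cost_fp16 = params_16 * 2.0 / GB          # 16-bit: 2 bytes per parameter
cost_g32 = params_16 * 0.5 / GB           # 4-bit g32: ~0.5 bytes per parameter

print(f"16-bit: {cost_fp16:.2f} GB -> 4-bit g32: {cost_g32:.2f} GB "
      f"(saves {cost_fp16 - cost_g32:.2f} GB)")
```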

The freed budget can be reallocated to protect other sensitive tensors — giving them 5-bit or 6-bit instead of 4-bit, exactly where the rate-distortion curve says it matters most. This is the power of joint optimization: savings in one part of the model fund improvements in another.
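That reallocation loop can be sketched as a toy greedy rate-distortion allocator. Everything below (tensor names, sizes, per-option byte costs, and error numbers) is invented for illustration, and the greedy rule is a simplification of joint optimization, not MINT's actual optimizer.

```python
def greedy_allocate(tensors, budget_bytes):
    """Start every tensor at its cheapest option, then repeatedly buy
    the single bit-width upgrade with the best error reduction per
    extra byte until the budget is exhausted."""
    alloc = {name: 0 for name in tensors}                 # option index
    spent = sum(t["size"] * t["options"][0][1] for t in tensors.values())
    while True:
        best, best_gain, best_cost = None, 0.0, 0.0
        for name, t in tensors.items():
            i = alloc[name]
            if i + 1 >= len(t["options"]):
                continue                                  # already at top
            (_, c0, e0), (_, c1, e1) = t["options"][i], t["options"][i + 1]
            cost = t["size"] * (c1 - c0)                  # extra bytes
            if spent + cost > budget_bytes:
                continue                                  # does not fit
            gain = (e0 - e1) / cost                       # error per byte
            if gain > best_gain:
                best, best_gain, best_cost = name, gain, cost
        if best is None:
            return alloc
        alloc[best] += 1
        spent += best_cost

# Hypothetical tensors: options are (bits, bytes/param, modeled error),
# sorted from cheapest to most precise. All numbers are made up.
tensors = {
    "embed": {"size": 100,
              "options": [(3, 0.40, 9.0), (4, 0.58, 3.0), (5, 0.70, 1.5)]},
    "mlp":   {"size": 400,
              "options": [(3, 0.40, 4.0), (4, 0.58, 2.5)]},
}
print(greedy_allocate(tensors, budget_bytes=310))
```

With this budget the sensitive "embed" tensor is upgraded twice before the larger "mlp" tensor gets its first upgrade, mirroring how savings in one part of the model fund precision exactly where the rate-distortion curve says it matters.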

Lessons for the field

The conventional wisdom that “some tensors need full precision” deserves scrutiny. With sufficiently fine-grained quantization (small group sizes), the gap between 4-bit and 16-bit narrows dramatically. The industry should stop thinking in terms of “protect or compress” and start thinking in terms of “how many bits AND what group size gives the best rate-distortion tradeoff.”

Three specific lessons emerge from this analysis:

Group size is a first-class optimization variable. Fixing group size at 128 and only varying bit-width leaves enormous efficiency on the table. MINT’s joint optimization shows that 85% of tensors prefer group size 32 — a choice that was never even considered in most quantization pipelines.

Safety floors are non-negotiable. Without the SQNR 9 dB veto, optimizers will happily assign 2-bit to tensors where it causes catastrophic damage. The optimizer minimizes total error and cannot distinguish between “acceptable degradation” and “structural collapse” without an explicit safety constraint.

Human intuition about sensitivity is unreliable. Our experts identified the right tensors as sensitive but prescribed the wrong remedy. The correct response to sensitivity is not maximum precision — it is the minimum precision that preserves the signal, paired with the right group size. That answer requires optimization, not intuition.


Comparison data from MINT Appendix B (v1 vs MINT allocation). The full MINT pipeline is open source at github.com/baa-ai/MINT.


Ready to stop wasting storage on 16-bit allocations?

Our team specialises in data-free model compression, budget-aware quantization, and production AI deployment on commodity hardware.

Talk to Our Team