The conventional wisdom says embeddings must stay at 16-bit. The optimizer says otherwise—and it’s right.
The Hard-Coded Approach
Every existing quantization framework has a list of hard-coded protection rules. Embeddings stay at 16-bit. The LM head stays at 16-bit. LayerNorm parameters stay at full precision. These rules are binary: either a tensor is protected or it isn’t. There is no mechanism to express “this tensor is important but could be compressed if the budget demands it.”
SWAN v1 followed this convention: certain tensor types were unconditionally protected at 16-bit. The result was 5.6% of parameters locked at 16-bit regardless of whether that was the best use of those bytes.
The Soft Prior Approach
MINT replaces binary rules with soft multiplicative priors (denoted π) that enter the optimization objective. Instead of “this tensor stays at 16-bit,” MINT says “the optimizer needs X times the quality improvement per byte to justify compressing this tensor.”
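One way to picture how a multiplicative prior enters the objective (a minimal sketch; the function name and numbers here are hypothetical, not MINT's actual API):

```python
def effective_cost(distortion_delta: float, bytes_saved: float, prior: float) -> float:
    """Prior-weighted distortion increase per byte saved.

    A prior of pi makes compressing this tensor look pi times more
    costly, so the allocator needs pi times the quality-per-byte
    payoff before it will choose the compression. (Illustrative
    helper, not MINT's implementation.)
    """
    return prior * distortion_delta / bytes_saved

# Same measured distortion, same bytes saved -- only the prior differs:
default_cost = effective_cost(distortion_delta=0.02, bytes_saved=1.0, prior=1.0)
embed_cost = effective_cost(distortion_delta=0.02, bytes_saved=1.0, prior=10.0)
assert embed_cost == 10 * default_cost  # the embedding looks 10x more costly to compress
```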
Soft protection priors
| Tensor Type | Prior (π) | Rationale |
|---|---|---|
| Embedding | 10.0 | Lookup tables; quantization corrupts rare tokens |
| LM head | 10.0 | Final projection; directly affects output distribution |
| LayerNorm / RMSNorm | ∞ | Tiny (<0.01% of params); critical for stability |
| MoE router | 8.0 | Controls expert routing; errors cascade |
| Vision encoder | 8.0 | Cross-modal alignment is sensitive |
| First layer | 3.0 | No error correction from prior layers |
| Last layer | 2.0 | Directly precedes output |
| Default | 1.0 | No bias |
A prior of π=10 means the allocator treats compression of that tensor as 10x more costly per unit of error. The tensor can still be compressed—but only if the quality-per-byte tradeoff is strong enough. A prior of infinity means genuine hard protection (LayerNorm/RMSNorm only, because they’re tiny and critical).
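To make the semantics concrete, here is a greedy sketch of prior-aware allocation. The paper solves the problem exactly as an MCKP; this greedy version (with hypothetical tensor names and sizes) only illustrates how π biases the ranking of candidate upgrades:

```python
from dataclasses import dataclass

@dataclass
class Upgrade:
    tensor: str
    extra_bytes: float      # byte cost of the upgrade vs. the cheapest config
    error_reduction: float  # measured distortion drop from the upgrade
    prior: float            # pi from the table above

def allocate(upgrades: list[Upgrade], budget_bytes: float) -> list[str]:
    """Take upgrades in order of prior-weighted benefit per byte
    until the budget is spent. Greedy stand-in for the exact MCKP
    solver, for illustration only."""
    ranked = sorted(upgrades,
                    key=lambda u: u.prior * u.error_reduction / u.extra_bytes,
                    reverse=True)
    chosen, spent = [], 0.0
    for u in ranked:
        if spent + u.extra_bytes <= budget_bytes:
            chosen.append(u.tensor)
            spent += u.extra_bytes
    return chosen

# A huge embedding with pi=10 still loses to a small first-layer tensor
# with pi=3, because benefit is divided by size:
opts = [
    Upgrade("embed_tokens", extra_bytes=1e9, error_reduction=0.5, prior=10.0),
    Upgrade("layer0.attn.q_proj", extra_bytes=1e7, error_reduction=0.1, prior=3.0),
]
print(allocate(opts, budget_bytes=2e7))
```

Even with π=10, the embedding's per-byte score (10 × 0.5 / 10⁹) is far below the small tensor's (3 × 0.1 / 10⁷), so only the first-layer tensor is upgraded, mirroring the allocation described below.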
What the Optimizer Actually Decides
The results are surprising. On Qwen3-30B-A3B at the 19 GB budget:
- Embeddings and LM head: stay at 4-bit despite π=10. They are too large—upgrading them to 8-bit would consume budget that produces more quality when distributed across thousands of other tensors as finer group sizes.
- First-layer attention: upgraded to 8-bit (14% of tensors). The prior of 3.0 combined with high sensitivity makes these efficient upgrade candidates.
- Last-layer attention: also upgraded to 8-bit. The prior of 2.0 tips the balance for tensors that directly precede output.
- MoE experts: the most sensitive expert weights receive 8-bit; the rest stay at 4-bit g32. The optimizer differentiates within the same tensor type based on measured sensitivity.
The key insight: embeddings at π=10 are still not worth protecting at 16-bit because their sheer size makes the per-byte quality improvement terrible. The optimizer correctly identifies that those bytes produce 10x more quality when spent on group-size reduction across thousands of 4-bit tensors.
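A back-of-envelope calculation shows the scale of the tradeoff. Assuming a hypothetical 300M-parameter embedding and 4 bytes of fp16 scale/bias metadata per quantization group (the convention consistent with the byte figures in the next section):

```python
def overhead_per_param(group_size: int, meta_bytes_per_group: float = 4.0) -> float:
    """Scale/bias overhead in bytes per parameter: fp16 scale + fp16 bias
    (4 bytes) shared by each group of `group_size` parameters."""
    return meta_bytes_per_group / group_size

embed_params = 300e6                        # hypothetical embedding size
upgrade_cost = embed_params * (1.0 - 0.5)   # 8-bit vs 4-bit values: +0.5 B/param

# Spending those bytes instead on refining other tensors from g128 to g32:
refine_cost = overhead_per_param(32) - overhead_per_param(128)  # +0.09375 B/param
params_refined = upgrade_cost / refine_cost
print(f"Same bytes refine g128 -> g32 for {params_refined / 1e9:.1f}B parameters")
```

Upgrading one 300M-parameter embedding to 8-bit costs as many bytes as tightening the group size on roughly 1.6B parameters elsewhere, which is why the optimizer declines the upgrade.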
Why v1’s 16-Bit Allocation Was Wasteful
SWAN v1 allocated 5.6% of quantizable parameters to 16-bit via its threshold heuristic. MINT's MCKP solver never assigns 16-bit to quantizable tensors. The math is straightforward: keeping a tensor at 16-bit costs 2 bytes per parameter. Reducing it to 4-bit g32 costs approximately 0.625 bytes per parameter (0.5 bytes for the quantized values + 0.125 bytes for scale/bias overhead). The saved 1.375 bytes per parameter, redistributed across thousands of tensors as finer group sizes, produces strictly more quality improvement than keeping the original tensor at 16-bit.
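The per-parameter arithmetic above checks out directly:

```python
# 4-bit values with group size 32, fp16 scale + fp16 bias (4 B) per group:
value_bytes = 4 / 8            # 0.5 B/param for the 4-bit values
overhead = 4.0 / 32            # 0.125 B/param of scale/bias metadata
cost_4bit_g32 = value_bytes + overhead   # 0.625 B/param
saved_vs_16bit = 2.0 - cost_4bit_g32     # 1.375 B/param freed for reallocation
assert cost_4bit_g32 == 0.625
assert saved_vs_16bit == 1.375
```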
This is the fundamental advantage of optimization over heuristics: the optimizer considers the global tradeoff across all tensors simultaneously, while heuristics make local decisions that may be individually reasonable but globally suboptimal.
First Layer vs Last Layer: Asymmetric Protection
The different priors for first (π=3.0) and last (π=2.0) layers reflect a subtle architectural insight. The first layer has no prior layers to correct its errors—quantization noise introduced here propagates through the entire network. The last layer directly precedes the output projection but benefits from error correction by all preceding layers. The optimizer uses these asymmetric priors to make different allocation decisions for architecturally similar tensors.
Vision Encoders and MoE Routers
For multimodal models, vision encoder tensors receive π=8.0—cross-modal alignment learned during training is sensitive to weight perturbation. MoE router tensors also receive π=8.0 because routing errors cascade: a wrong routing decision means the wrong expert processes the token, and no amount of accuracy within the expert can compensate.
These are not hard protections. At extremely tight budgets, the optimizer can and will compress these tensors. But the prior ensures it exhausts all cheaper options first.
The Design Principle
Soft priors encode domain knowledge without constraining the optimizer. They express the relative importance of tensor types, not their absolute protection status. The optimizer combines these priors with measured per-tensor sensitivity (from the rate-distortion curves) and the global budget constraint to make allocation decisions that no hand-tuned heuristic could match.
The result: MINT uses every byte of the budget more efficiently than any system that hard-codes protection rules.
Data from the MINT paper: “MINT: Budget-Aware Data-Free Mixed-Precision Quantization for Large Language Models via Rate-Distortion Optimization” (baa.ai, 2026). Soft prior values from Table 1 of the paper. Allocation analysis on Qwen3-30B-A3B at 19 GB budget. Full paper at baa.ai/articles/24-mint-paper.html. Code at github.com/baa-ai/MINT.