MINT Research Paper

MINT: Compute-Optimal Data-Free Mixed-Precision Quantization for Large Language Models

March 2026 · baa.ai

We present MINT (Memory-Informed N-bit Tuning), a data-free mixed-precision quantization framework that formulates per-tensor bit-width and group-size allocation as a compute-optimal allocation problem. Given a user-specified memory budget, MINT jointly selects the optimal (bit-width, group-size) configuration for each weight tensor by solving a Multiple-Choice Knapsack Problem (MCKP) over per-tensor rate-distortion curves.

The framework introduces three key innovations: (1) budget-targeted quantization—users specify an exact memory target (e.g., “fit in 4 GB for iPhone” or “fit in 24 GB for RTX 4090”) and MINT produces the provably optimal allocation for that budget, with a fitted prediction curve that estimates output quality before running the pipeline; (2) joint bit-width and group-size optimization that treats group size as a first-class allocation variable, revealing that group-size selection provides larger quality gains than bit-width changes; and (3) an SQNR safety veto with an empirically validated 9 dB threshold that exploits the natural gap between catastrophic 2-bit quantization (SQNR < 9 dB, PPL triples) and usable 3-bit quantization (SQNR > 10 dB).

We evaluate MINT on six model families spanning 8B–109B parameters across dense and Mixture-of-Experts architectures. In matched-size comparisons against GPTQ—a calibration-based method—across three MoE families, MINT consistently outperforms GPTQ despite being entirely data-free. The entire pipeline requires no calibration data, no gradient computation, and completes in under 50 minutes on commodity hardware.

Try MINT yourself

The full pipeline is open source under the MIT licence. Analyse, allocate, and quantize on your own hardware.

View on GitHub


1. Introduction

Post-training quantization (PTQ) has become the primary means of deploying large language models on consumer hardware. Methods such as GPTQ [1], AWQ [2], and SqueezeLLM [3] achieve remarkable compression, but they share a common requirement: a representative calibration dataset. This introduces practical concerns—calibration data may be unavailable for proprietary models, the chosen distribution may not generalize to deployment domains, and calibration demands substantial compute.

Existing data-free approaches [6,7,8] typically apply uniform bit-widths or rely on single sensitivity metrics with hand-tuned thresholds. These approaches face two fundamental limitations. First, threshold-based allocation produces fixed bit-width decisions regardless of the deployment memory budget—the user cannot specify “quantize this model to fit in 6 GB” and receive a provably optimal allocation. Second, single-point error proxies create circularity: using 4-bit reconstruction error to decide 4-bit allocation means the method partly predicts its own label.

We address both limitations with MINT (Memory-Informed N-bit Tuning), which reformulates mixed-precision quantization as a constrained optimization problem:

min over {(bi, gi)}   Σi πi · αi · NRMSEi(bi, gi)    s.t.   Σi sizei(bi, gi) ≤ B

where bi and gi are the bit-width and group size for tensor i, B is the user’s memory budget, πi is a soft protection prior, and αi is a learned importance weight. The key insight is that both bit-width and group size are allocation variables—prior work optimizes bit-width alone, but our evidence shows group-size selection often provides larger quality improvements than bit-width changes.

Contributions

2. Related Work

Calibration-based PTQ

GPTQ [1], AWQ [2], SqueezeLLM [3], SpQR [4], QuIP [5], and SmoothQuant [18] represent the dominant paradigm in post-training quantization. All require calibration data to compute sensitivity information, weight scaling factors, or Hessian approximations. While highly effective, this requirement limits applicability when calibration data is unavailable or unrepresentative of deployment domains.

Data-free quantization

EasyQuant [6], MXQ [7], HQQ [19], and HIGGS [8] eliminate the need for calibration data. These methods typically apply uniform bit-widths across all tensors. MINT differs by formulating allocation as constrained optimization over joint (bit-width, group-size) configurations, enabling budget-targeted deployment and per-tensor mixed-precision decisions.

Sensitivity-based mixed-precision

LLM-MQ [10], SliM-LLM [20], and CherryQ [21] use sensitivity metrics to guide mixed-precision allocation. However, all require calibration data to compute their sensitivity scores. MINT is the first method to combine data-free sensitivity analysis with constrained optimization over both bit-width and group-size variables.

MoE quantization

MC-MoE [12] and MoEQuant [13] address the specific challenges of quantizing Mixture-of-Experts models. Both require calibration data to determine expert importance. MINT’s data-free approach avoids coverage problems inherent in calibration-based MoE quantization, where calibration sequences may not activate all experts.

3. Method

3.1 Pass 1: Feature Extraction and Rate-Distortion Curves

3.1.1 Spectral Features

We extract three scale-invariant features from the singular values of each weight matrix, computed via randomized SVD with rank k=256:

Stable rank measures the effective dimensionality of the weight matrix:

rs(W) = ||W||F² / ||W||2² = Σi σi² / σ1²

Spectral tail mass captures how much energy resides outside the top singular values:

τ(W) = 1 − (Σi=1..⌊r/10⌋ σi²) / (Σi σi²)

Log condition number measures the ratio of largest to smallest singular values:

κ(W) = min(10, log10(σ1 / (σmin + ε)))
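The three spectral features can be computed directly from the singular values. A minimal numpy sketch (function name is ours), using a full SVD in place of the paper's rank-256 randomized SVD:

```python
import numpy as np

def spectral_features(W, eps=1e-12):
    """Scale-invariant spectral features of a 2D weight matrix.

    Illustrative sketch: uses a full SVD for simplicity; the pipeline
    described in the paper uses a rank-256 randomized SVD for speed.
    """
    s = np.linalg.svd(W, compute_uv=False)   # singular values, descending
    s2 = s ** 2
    stable_rank = s2.sum() / s2[0]           # ||W||_F^2 / ||W||_2^2
    r = len(s)
    head = s2[: max(1, r // 10)].sum()       # energy in the top ~10% of the spectrum
    tail_mass = 1.0 - head / s2.sum()
    log_cond = min(10.0, np.log10(s[0] / (s[-1] + eps)))
    return stable_rank, tail_mass, log_cond

rng = np.random.default_rng(0)
rs, tau, kappa = spectral_features(rng.standard_normal((256, 128)))
```

The `max(1, …)` guard simply keeps the head sum non-empty for very small matrices.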

3.1.2 Per-Group Kurtosis Features

We reshape the weight matrix W into K = ⌈mn/g⌉ groups of size g=128 and compute the excess kurtosis per group:

κj = (1/g) Σi ((wj,i − w̄j) / sj)⁴ − 3

where w̄j and sj are the mean and standard deviation of group j.
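The per-group statistic can be computed in a few lines. A minimal numpy sketch (function name is ours), padding a partial tail group with its last value:

```python
import numpy as np

def group_kurtosis(W, g=128):
    """Excess kurtosis per quantization group (illustrative sketch).

    Flattens W into groups of size g and returns one excess-kurtosis
    value per group; Gaussian-distributed groups score near 0.
    """
    w = W.reshape(-1)
    pad = (-w.size) % g
    if pad:
        w = np.concatenate([w, np.full(pad, w[-1])])  # pad the tail group
    groups = w.reshape(-1, g)
    mu = groups.mean(axis=1, keepdims=True)
    sd = groups.std(axis=1, keepdims=True) + 1e-12
    return (((groups - mu) / sd) ** 4).mean(axis=1) - 3.0

rng = np.random.default_rng(0)
k = group_kurtosis(rng.standard_normal((64, 64)), g=128)
```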

From the distribution of per-group kurtosis values, we derive four summary features.

3.1.3 Norm-Guided Output Noise Amplification

Rather than using random Gaussian inputs, we construct probes that respect the input distribution implied by the preceding LayerNorm:

xj ~ N(0, diag(γ²))

where γ is the preceding LayerNorm scale vector. We then measure how quantization noise propagates through the layer:

ΔW = Q(W; 4, 128) − W
fout = ||ΔW · X||F / ||W · X||F   averaged over 32 probes
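A sketch of the probe measurement, using a plain round-to-nearest group quantizer for Q (the paper states MINT assumes round-to-nearest; function names are ours). For simplicity the ratio is aggregated over the probe batch rather than averaged per probe:

```python
import numpy as np

def quantize_rtn(W, bits, g):
    """Round-to-nearest affine quantization with per-group scale and zero-point."""
    w = W.reshape(-1)
    pad = (-w.size) % g
    w = np.concatenate([w, np.zeros(pad)])
    groups = w.reshape(-1, g)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1) + 1e-12
    deq = np.round((groups - lo) / scale) * scale + lo
    return deq.reshape(-1)[: W.size].reshape(W.shape)

def noise_amplification(W, gamma, n_probes=32, seed=0):
    """f_out: relative output perturbation under (4, 128) quantization,
    measured with LayerNorm-shaped probes x ~ N(0, diag(gamma^2))."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((W.shape[1], n_probes)) * gamma[:, None]
    dW = quantize_rtn(W, bits=4, g=128) - W
    return np.linalg.norm(dW @ X) / np.linalg.norm(W @ X)

rng = np.random.default_rng(1)
W = rng.standard_normal((128, 256))
f_out = noise_amplification(W, gamma=np.ones(256))
```

With γ = 1 the probes reduce to standard Gaussians; a non-uniform γ reweights input dimensions the way the preceding LayerNorm would.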

3.1.4 Rate-Distortion Curves

For each tensor, we compute the normalized root mean squared error (NRMSE) at multiple (bit-width, group-size) configurations:

NRMSEi(b, g) = RMS(Q(Wi; b, g) − Wi) / RMS(Wi)

evaluated at the configuration set C = {(2,32), (3,64), (4,32), (4,64), (4,128), (8,64), (8,128), (16,0)}. From the rate-distortion curve we derive four summary features: fauc (area under curve), fm48 (4-to-8 bit NRMSE ratio), fm24 (2-to-4 bit NRMSE ratio), and fslope (local slope at the operating point).

3.1.5 SQNR Safety Veto

We compute the signal-to-quantization-noise ratio for each tensor at each configuration:

SQNRi(b, g) = 10 · log10(||Wi||F² / ||Wi − Q(Wi; b, g)||F²)   dB

Configurations with SQNR < 9 dB are excluded from the allocation candidate set. This threshold is empirically validated in Section 4.3.
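Sections 3.1.4 and 3.1.5 can be sketched together: a round-to-nearest group quantizer (the quantizer MINT assumes), the NRMSE rate-distortion curve, and the SQNR map with the 9 dB veto. Function names are ours; the 16-bit passthrough configuration (16, 0) is omitted since it introduces no quantization error:

```python
import numpy as np

def quantize_rtn(W, bits, g):
    """Round-to-nearest affine quantization with per-group scale and zero-point."""
    w = W.reshape(-1)
    pad = (-w.size) % g
    w = np.concatenate([w, np.zeros(pad)])
    groups = w.reshape(-1, g)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1) + 1e-12
    deq = np.round((groups - lo) / scale) * scale + lo
    return deq.reshape(-1)[: W.size].reshape(W.shape)

def nrmse(W, bits, g):
    d = quantize_rtn(W, bits, g) - W
    return np.sqrt((d ** 2).mean()) / np.sqrt((W ** 2).mean())

def sqnr_db(W, bits, g):
    err = W - quantize_rtn(W, bits, g)
    return 10.0 * np.log10((W ** 2).sum() / ((err ** 2).sum() + 1e-20))

configs = [(2, 32), (3, 64), (4, 32), (4, 64), (4, 128), (8, 64), (8, 128)]
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
curve = {cfg: nrmse(W, *cfg) for cfg in configs}               # RD curve
safe = [cfg for cfg in configs if sqnr_db(W, *cfg) >= 9.0]     # SQNR safety veto
```

On this synthetic Gaussian tensor the curve is monotone in bits, and smaller groups yield lower error at a fixed bit-width, mirroring the allocation behaviour reported in Section 4.2.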

3.2 Pass 2: Normalization, Priors, and Allocation

3.2.1 eCDF Normalization

Each raw feature is normalized to a percentile rank via the empirical cumulative distribution function:

f̃i = |{j : fj ≤ fi}| / T

where T is the total number of tensors. This produces uniform marginals regardless of the original feature scale or distribution.
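The normalization is a one-liner in practice; a minimal sketch (function name is ours, and the O(T²) scan is fine at the tensor counts involved):

```python
import numpy as np

def ecdf_normalize(values):
    """Map each raw feature value f_i to its percentile rank |{j : f_j <= f_i}| / T."""
    v = np.asarray(values, dtype=float)
    return np.array([(v <= x).sum() for x in v]) / v.size

ranks = ecdf_normalize([0.7, -3.2, 0.1])
```

Whatever the input scale, the outputs always lie in (0, 1] with uniform marginals, which is what makes features from different families comparable.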

3.2.2 Soft Protection Priors

Certain tensor categories require stronger protection during quantization. Rather than hard-coding binary keep/quantize rules, MINT uses multiplicative soft priors that inflate the apparent cost of quantizing sensitive tensors:

| Tensor category | Prior π |
|---|---|
| Embedding | 10.0 |
| LM head | 10.0 |
| LayerNorm | ∞ (excluded) |
| MoE router | 8.0 |
| Vision | 8.0 |
| First layer | 3.0 |
| Last layer | 2.0 |
| Default | 1.0 |

Table 1: Soft protection priors by tensor category.
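A prior lookup along these lines is straightforward; a sketch, assuming illustrative tensor-name substrings (real checkpoint naming conventions vary by model family, and the function name is ours):

```python
def protection_prior(tensor_name, layer_idx=None, n_layers=None):
    """Soft protection prior per Table 1 (illustrative name matching)."""
    name = tensor_name.lower()
    if "norm" in name:
        return float("inf")              # LayerNorm: excluded from quantization
    for key, prior in [("embed", 10.0), ("lm_head", 10.0),
                       ("router", 8.0), ("vision", 8.0)]:
        if key in name:
            return prior
    if layer_idx == 0:
        return 3.0                       # first transformer layer
    if n_layers is not None and layer_idx == n_layers - 1:
        return 2.0                       # last transformer layer
    return 1.0

p = protection_prior("model.layers.5.self_attn.q_proj.weight", 5, 32)
```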

3.2.3 Budget-Constrained Allocation (MCKP)

The quantized size of each tensor under configuration (b, g) is:

sizei(b, g) = ⌈ni · b / 8⌉ + ⌊ni / g⌋ · 4   bytes

where ni is the parameter count of tensor i; the second term accounts for the per-group scale and zero-point metadata.

We solve the resulting Multiple-Choice Knapsack Problem using one of three interchangeable solvers.
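As one plausible instance of such a solver, the sketch below implements a greedy upgrade heuristic for the MCKP: every tensor starts at its cheapest admissible configuration, and the allocator repeatedly applies the upgrade with the best weighted-distortion reduction per extra byte until the budget is exhausted. The helper follows the size formula above; names are ours, and this is illustrative rather than necessarily one of the paper's three solvers:

```python
def tensor_size(n, bits, g):
    """Quantized size in bytes: packed weights plus 4 bytes of scale/zero-point per group."""
    groups = 0 if g == 0 else n // g
    return (n * bits + 7) // 8 + groups * 4

def mckp_greedy(tensors, budget):
    """Greedy upgrade heuristic for the Multiple-Choice Knapsack Problem.

    tensors: list of (n_params, weight, rd) where rd maps (bits, g) -> NRMSE
             and weight plays the role of pi_i * alpha_i.
    """
    choice, used = [], 0
    for n, w, rd in tensors:                       # start at the cheapest config
        cheapest = min(rd, key=lambda c: tensor_size(n, *c))
        choice.append(cheapest)
        used += tensor_size(n, *cheapest)
    if used > budget:
        raise ValueError("budget below minimum feasible size")
    improved = True
    while improved:
        improved, best = False, (0.0, None, None)
        for i, (n, w, rd) in enumerate(tensors):   # best upgrade per extra byte
            cur = choice[i]
            for cfg in rd:
                ds = tensor_size(n, *cfg) - tensor_size(n, *cur)
                dd = w * (rd[cur] - rd[cfg])
                if ds > 0 and dd > 0 and used + ds <= budget and dd / ds > best[0]:
                    best = (dd / ds, i, cfg)
        if best[1] is not None:
            _, i, cfg = best
            n = tensors[i][0]
            used += tensor_size(n, *cfg) - tensor_size(n, *choice[i])
            choice[i] = cfg
            improved = True
    return choice, used

rd = {(4, 128): 0.10, (8, 128): 0.01}
choice, used = mckp_greedy([(1024, 1.0, rd), (1024, 1.0, rd)], budget=1700)
```

With a 1700-byte budget only one of the two identical tensors can be upgraded to 8-bit; the heuristic spends the remaining bytes on the single best-value upgrade and stops.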

3.3 Joint Bit-Width and Group-Size Optimization

A key innovation of MINT is treating group size as a first-class allocation variable rather than a fixed hyperparameter. After the SQNR veto, the surviving candidate set for a typical tensor is C = {(4,32), (4,64), (4,128), (8,64), (8,128)}, where smaller group sizes provide finer-grained quantization parameters at the cost of increased scale/zero-point overhead. Our results show that 85% of tensors are allocated (4,32), the smallest available group size, indicating that the quality benefit of finer groups outweighs their storage overhead for the vast majority of tensors.

3.4 Pipeline Summary

Algorithm 1: MINT Pipeline

Input: Model directory, budget B, SQNR floor τ
Output: Per-tensor manifest {(bi, gi)}

// Pass 1: Feature extraction
for each shard in model:
    for each 2D tensor Wi with n ≥ 1024:
        Extract LayerNorm γ from preceding norm layer
        Compute spectral features (rs, τ, κ)
        Compute per-group kurtosis features
        Compute output noise amplification fout
        Compute RD curve: NRMSEi(b, g) for all (b, g) ∈ C
        Compute SQNR map: SQNRi(b, g) for all (b, g) ∈ C

// Pass 2: Allocation
Fit eCDF normalizer over all collected features
Compute soft protection priors πi
Filter configurations by SQNR ≥ τ
Run MCKP solver with budget B
return manifest {(bi, gi)} for each tensor

3.5 Expert Handling for MoE Models

Mixture-of-Experts models pose unique challenges for per-tensor quantization because expert weight matrices within the same layer may have very different sensitivity characteristics.

MINT uses one of two strategies, depending on the number of experts.

For expert groups, we use conservative aggregation:

NRMSEG = maxe NRMSE(e)     SQNRG = mine SQNR(e)     sizeG = Σe size(e)
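The aggregation rule is a direct worst-case reduction; a minimal sketch (names and the dict layout are ours):

```python
def aggregate_expert_group(per_expert):
    """Conservative aggregation of per-expert statistics into one allocation unit.

    per_expert: list of dicts with 'nrmse', 'sqnr', and 'size' per expert.
    """
    return {
        "nrmse": max(e["nrmse"] for e in per_expert),  # worst-case distortion
        "sqnr":  min(e["sqnr"]  for e in per_expert),  # worst-case safety margin
        "size":  sum(e["size"]  for e in per_expert),  # total footprint
    }

group = aggregate_expert_group([
    {"nrmse": 0.10, "sqnr": 20.0, "size": 100},
    {"nrmse": 0.30, "sqnr": 12.0, "size": 100},
])
```

Taking the max NRMSE and min SQNR means the shared configuration is safe for the most sensitive expert in the group, at the cost of slight conservatism for the others.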

4. Experiments

We evaluate MINT on six model families: Qwen3-8B, Qwen3-30B-A3B, Qwen2-57B-A14B, Mixtral-8x7B, GLM-4.7-Flash, and Llama-4-Scout. All experiments use an Apple M2 Ultra with 192 GB unified memory. Perplexity is evaluated on WikiText-2 test with 128 sequences of 2048 tokens (seed=42).

4.1 Main Results

| Model | Method | Size (GB) | Mean PPL | Median PPL | Δ vs BF16 |
|---|---|---|---|---|---|
| Qwen3-8B (dense, 8B parameters) | | | | | |
| Qwen3-8B | BF16 | 15.26 | 9.727 | — | — |
| Qwen3-8B | AWQ | 4.05 | 10.50 | — | +8.1% |
| Qwen3-8B | GPTQ | 4.05 | 10.30 | — | +6.1% |
| Qwen3-8B | Uniform 4-bit | 4.05 | 10.249 | — | +5.4% |
| Qwen3-8B | v1 (SWAN) | 6.05 | 10.097 | — | +3.8% |
| Qwen3-8B | MINT | 6.00 | 10.039 | — | +3.2% |
| Qwen3-30B-A3B (MoE, 30B parameters, 3B active) | | | | | |
| Qwen3-30B | BF16 | 56.87 | 8.728 | — | — |
| Qwen3-30B | Uniform 4-bit | 15.11 | 9.629 | — | +10.3% |
| Qwen3-30B | v1 (SWAN) | 16.73 | 8.924 | 8.974 | +2.8% |
| Qwen3-30B | MINT (16 GB) | 16.29 | 8.930 | 8.971 | +2.3% |
| Qwen3-30B | MINT (17 GB) | 17.39 | 8.858 | 8.912 | +1.5% |
| Qwen3-30B | MINT (19 GB) | 19.01 | 8.782 | 8.798 | +0.6% |
| GLM-4.7-Flash (dense, 30B parameters) | | | | | |
| GLM-4.7 | BF16 | 58.16 | 11.344 | 8.706 | — |
| GLM-4.7 | Uniform 4-bit | 14.82 | ~11.46 | — | +31.6% |
| GLM-4.7 | v1 (SWAN) | 15.92 | 9.930 | 9.084 | +4.3% |
| GLM-4.7 | MINT | 15.82 | 9.427 | 9.210 | +5.8% |
| Llama-4-Scout (MoE, 109B parameters, 17B active, 16 experts) | | | | | |
| Scout | BF16 | ~203 | (exceeds memory) | | |
| Scout | MINT (no safety) | 34.62 | 23.577 | 23.714 | +198% |
| Scout | MINT (min-safe) | 46.93 | 8.675 | 8.786 | +9.8% |
| Scout | MINT (50 GB) | 51.98 | 7.980 | 8.284 | +1.0% |
| Scout | Uniform 4-bit | 56.9 | 7.899 | — | — |
| Scout | v1 (SWAN) | 59.5 | 7.628 | — | −3.4% |
| Scout | MINT (64 GB) | 58.03 | 7.703 | 8.070 | −2.5% |
| Scout | MINT (192 GB) | 163.24 | 7.359 | 7.691 | −6.8% |

Table 3: Main perplexity results across four model families. Best results per model highlighted.

4.2 Joint Bit-Width and Group-Size Allocation

Table 4 shows the allocation breakdown for Qwen3-30B at the 19 GB budget. The dominant configuration is (4,32)—4-bit with group size 32—which is selected for 85.2% of tensors.

| Configuration (bits, group) | Tensors | Fraction |
|---|---|---|
| (4, 32) | 15,908 | 85.2% |
| (4, 64) | 9 | <0.1% |
| (4, 128) | 96 | 0.5% |
| (8, 128) | 2,612 | 14.0% |
| (8, 64) | 1 | <0.1% |
| (16, 0) | 241 | 1.3% |
| Total | 18,867 | 100% |

Table 4: Per-tensor allocation breakdown for Qwen3-30B-A3B at the 19 GB budget.

4.3 SQNR Safety Veto Validation

Table 5 shows the SQNR distribution across configurations for Llama-4-Scout. There is a clear gap between 2-bit (all tensors below 9 dB) and 3-bit (all tensors above 10 dB).

| Config (b, g) | Min | P5 | Median | P95 | Max | <9 dB | <15 dB |
|---|---|---|---|---|---|---|---|
| (2, 32) | 5.1 | 7.2 | 8.0 | 8.1 | 8.7 | 691 | 691 |
| (2, 64) | 2.5 | 5.6 | 6.8 | 6.9 | 7.2 | 691 | 691 |
| (3, 64) | 10.4 | 13.0 | 14.2 | 14.3 | 14.6 | 0 | 691 |
| (4, 32) | 19.4 | 21.3 | 22.0 | 22.1 | 22.8 | 0 | 0 |
Table 5: SQNR distribution across configurations for Llama-4-Scout.

Floor Sweep

| SQNR floor | Avg bits | Size (GB) | Mean PPL | Median PPL | Assessment |
|---|---|---|---|---|---|
| 0 dB | 2.00 | 34.62 | 23.577 | 23.714 | Catastrophic |
| 9 dB | 3.00 | 46.93 | 8.675 | 8.786 | Usable (+9.8%) |
| 9 dB + 50 GB | 3.48 | 51.98 | 7.980 | 8.284 | Good (+1.0%) |
| 15 dB | 4.00 | 56.16 | 7.709 | 8.076 | Best (−2.4%) |
| 15 dB + 58 GB | 4.01 | 58.03 | 7.703 | 8.070 | Best (−2.5%) |

Table 6: SQNR floor sweep on Llama-4-Scout.

4.4 Budget-Targeted Deployment

| Budget | Actual size (GB) | Mean PPL | Median PPL | Note |
|---|---|---|---|---|
| 15.1 GB | 15.11 | 9.629 | — | Uniform 4-bit floor |
| 15.3 GB | 16.13 | 8.970 | 9.020 | iPhone 16 Pro |
| 15.5 GB | 16.29 | 8.930 | 8.971 | |
| 16.7 GB | 17.39 | 8.858 | 8.912 | |
| 19.2 GB | 19.01 | 8.782 | 8.798 | |
| 20.0 GB | 19.32 | 8.784 | 8.803 | RTX 4070 |
| 25.0 GB | 27.39 | 8.760 | 8.779 | RTX 4090 |
| 30.0 GB | 30.75 | 8.657 | 8.684 | Mac M4 Pro |
| BF16 | 56.87 | 8.728 | — | |

Table 7: Budget curve for Qwen3-30B-A3B.

The relationship between budget and perplexity is well-described by a fitted prediction curve:

PPL(B) = 8.371 + 0.494 / (B − 15.099)^0.135     (fit RMSE = 0.025)
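The fitted curve can be evaluated directly before running the pipeline. The coefficients below are the paper's; the function name and the domain guard are ours:

```python
def predicted_ppl(budget_gb, a=8.371, b=0.494, b0=15.099, p=0.135):
    """Budget -> perplexity prediction curve for Qwen3-30B-A3B (Table 7 fit).

    Defined only above the uniform 4-bit floor b0, where the model
    has headroom to spend extra bytes on finer configurations.
    """
    assert budget_gb > b0, "curve is defined only above the 4-bit floor"
    return a + b / (budget_gb - b0) ** p

ppl_19 = predicted_ppl(19.01)   # ≈ 8.78, close to the measured 8.782
```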

| Budget | Actual size (GB) | Mean PPL | Median PPL | Note |
|---|---|---|---|---|
| No safety | 34.62 | 23.577 | 23.714 | Catastrophic |
| Min-safe | 46.93 | 8.675 | 8.786 | |
| 50 GB | 51.98 | 7.980 | 8.284 | |
| 56.9 GB | 56.9 | 7.899 | — | Uniform 4-bit |
| 64 GB | 58.03 | 7.703 | 8.070 | 64 GB device |
| 192 GB | 163.24 | 7.359 | 7.691 | 192 GB device |
| BF16 | ~203 | (exceeds memory) | | |
BF16~203exceeds memory

Table 8: Budget curve for Llama-4-Scout (109B MoE).

4.5 Matched-Size Comparison with GPTQ

| Model | Method | Size (GB) | Mean PPL | Median PPL | Δ PPL |
|---|---|---|---|---|---|
| Qwen3-30B | GPTQ | 16.0 | 9.122 | 9.160 | |
| Qwen3-30B | MINT | 16.1 | 8.970 | 9.020 | −1.7% |
| Qwen2-57B | GPTQ | 29.9 | 6.390 | 6.396 | |
| Qwen2-57B | MINT | 29.9 | 6.329 | 6.356 | −0.95% |
| Mixtral-8x7B | GPTQ | 87.0† | 4.608 | 4.640 | |
| Mixtral-8x7B | Uniform 4-bit | 24.5 | 4.471 | 4.461 | |
| Mixtral-8x7B | MINT | 24.5 | 4.264 | 4.266 | −4.6% |
Table 9: Matched-size comparison with GPTQ. MINT consistently outperforms despite being data-free. †The GPTQ Mixtral checkpoint uses a different storage format.

4.6 Mean vs Median Perplexity

| Model | Method | Mean PPL | Median PPL | Outliers |
|---|---|---|---|---|
| GLM-4.7 | BF16 | 11.344 | 8.706 | 5 |
| GLM-4.7 | v1 (SWAN) | 9.930 | 9.084 | 4 |
| GLM-4.7 | MINT | 9.427 | 9.210 | 0 |

Table 10: Mean vs median perplexity. GLM-4.7 BF16 shows a 30% gap between mean and median due to 5 outlier sequences.

4.7 Analysis Efficiency

| Model | Tensors | Analysis time | Allocation time | Total |
|---|---|---|---|---|
| Qwen3-8B | 399 | 3 min | <1 s | ~10 min |
| Qwen3-30B | 18,867 | 50 min | <1 s | ~54 min |
| GLM-4.7 | 9,703 | 39 min | <1 s | ~44 min |
| Scout | ~1,000 | 45 min | <1 s | ~50 min |

Table 11: Analysis timing on Apple M2 Ultra 192 GB.

5. Discussion

Group size as the primary quality lever. Our most surprising finding is that group-size selection matters more than bit-width selection. At the 19 GB budget for Qwen3-30B, 85.2% of tensors are allocated (4,32) rather than (4,128). The additional overhead of storing per-group scales and zero-points at g=32 (4× more groups than g=128) is more than compensated by the reduction in quantization error.
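The storage side of this trade-off can be checked directly with the size formula from Section 3.2.3; the arithmetic below is a quick sanity check (the helper name is ours):

```python
def tensor_size(n, bits, g):
    """Size formula from Section 3.2.3: packed weights + 4 bytes per group."""
    return (n * bits + 7) // 8 + (n // g) * 4

n = 1 << 20                                        # a 1M-parameter tensor
payload = n * 4 // 8                               # raw 4-bit payload in bytes
s32 = tensor_size(n, 4, 32)
s128 = tensor_size(n, 4, 128)
overhead32 = (s32 - payload) / payload             # scale/zero overhead at g=32
overhead128 = (s128 - payload) / payload           # scale/zero overhead at g=128
```

At 4-bit, g=32 carries 25% metadata overhead versus 6.25% at g=128, so choosing the finer grouping costs roughly 18% more storage per tensor; the allocator evidently finds that this buys more quality than the equivalent bytes spent on bit-width upgrades.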

SQNR veto catches catastrophic configurations. The SQNR safety veto is essential for MoE models. On Llama-4-Scout, disabling the veto produces a model that appears compact (34.6 GB) but is completely unusable (PPL 23.6). The 9 dB threshold exploits a natural gap in the SQNR distribution: all 2-bit configurations fall below 9 dB while all 3-bit configurations exceed 10 dB.

16-bit allocation is not needed. In v1 (SWAN), 5.6% of parameters were allocated 16-bit precision. MINT's joint optimization reveals that this is unnecessary: the same tensors are better served by 4-bit with group size 32, which provides comparable quality at roughly a quarter of the storage cost.

MINT vs GPTQ. MINT consistently outperforms GPTQ at matched model sizes across three MoE families, despite being entirely data-free. We attribute this to three factors: (1) GPTQ uses fixed group sizes; (2) GPTQ’s calibration-derived Hessian may not represent the full input distribution; and (3) MINT’s per-tensor RD curves capture the actual quantization error surface rather than a proxy.

When the allocator disagrees with intuition. The MCKP solver occasionally produces counterintuitive allocations—for example, keeping a seemingly unimportant tensor at 8-bit while quantizing an attention projection to 4-bit. These decisions are correct in the rate-distortion sense: the “unimportant” tensor has a steep RD curve while the attention tensor has a flat curve.

Limitations. MINT currently supports only weight-only quantization; activation quantization is not addressed. The method assumes round-to-nearest quantization. Soft protection priors are hand-specified. The prediction curve is model-specific. Runtime latency is not optimized—smaller group sizes may increase dequantization overhead on some hardware.

6. Conclusion

We have presented MINT, a data-free mixed-precision quantization framework that formulates per-tensor bit-width and group-size allocation as a budget-constrained optimization problem. By solving a Multiple-Choice Knapsack Problem over per-tensor rate-distortion curves, MINT enables hardware-targeted deployment where users specify an exact memory budget and receive a provably optimal allocation.

Our key findings are: (1) group-size selection is the primary quality lever, with 85% of tensors preferring g=32 over conventional g=128; (2) the SQNR safety veto with a 9 dB threshold reliably prevents catastrophic quantization; (3) MINT consistently outperforms the calibration-based GPTQ method at matched sizes across multiple MoE architectures; and (4) median perplexity should be reported alongside means.

MINT requires no calibration data, no gradient computation, and completes in under 50 minutes on commodity hardware.

References

[1] Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR, 2023.

[2] Lin, J., Tang, J., Tang, H., et al. AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration. MLSys, 2024.

[3] Kim, S., Hooper, C., Gholami, A., et al. SqueezeLLM: Dense-and-Sparse Quantization. ICML, 2024.

[4] Dettmers, T., Svirschevski, R., et al. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. ICLR, 2024.

[5] Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. QuIP: 2-Bit Quantization of Large Language Models with Guarantees. NeurIPS, 2024.

[6] Tang, Z., et al. EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs. 2024.

[7] Zhang, Y., Chen, D., and Li, B. MXQ: Mixed-Precision Quantization for Efficient LLM Deployment. ICAART, 2025.

[8] Badri, H., et al. HIGGS: Hardware-Independent Graph-Guided Search for LLM Quantization. NAACL, 2025.

[9] Zhao, Y., et al. KurTail: Kurtosis-Based Tail-Aware Quantization. EMNLP, 2025.

[10] Li, W., et al. LLM-MQ: Mixed-Precision Quantization for Efficient LLM Deployment. 2024.

[11] Li, Z., et al. MixLLM: Mixed-Precision Large Language Model Quantization. 2025.

[12] Wei, Y., et al. MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs. ICLR, 2025.

[13] Huang, L., et al. MoEQuant: Expert-Aware Quantization for Mixture-of-Experts Models. 2025.

[14] Xie, Y., et al. QuantMoEBench: Benchmarking Quantization for MoE Models. NeurIPS, 2024.

[15] Bondarenko, Y., Nagel, M., and Blankevoort, T. Quantization as Regularization. EMNLP, 2023.

[16] Apple. MLX: An Array Framework for Apple Silicon. 2023.

[17] Qwen Team. Qwen3.5 Technical Report. 2026.

[18] Xiao, G., et al. SmoothQuant: Accurate and Efficient Post-Training Quantization. ICML, 2023.

[19] Badri, H., and Shaji, H. HQQ: Half-Quadratic Quantization. 2024.

[20] Huang, J., et al. SliM-LLM: Salience-Driven Mixed-Precision Quantization. ICML, 2025.

[21] Shang, Y., et al. CherryQ: Cherry-Picked Quantization for LLMs. NeurIPS, 2024.

[22] Park, S., et al. HESTIA: Hardware-Efficient STochastic Integer Arithmetic for LLM Inference. 2026.

[23] Xu, C., et al. Qwen3 Quantization: Efficient Deployment of the Qwen3 Family. 2025.

[24] Meta AI. Llama 4: Open Foundation Models. 2025.


Appendix A: Gap Closure Summary

| Model | Size (GB) | Δ vs BF16 | Uniform 4-bit Δ | Gap closed |
|---|---|---|---|---|
| Qwen3-8B | 6.0 | +3.2% | +5.4% | 41% |
| Qwen3-30B | 16.3 | +2.3% | +10.3% | 78% |
| Qwen3-30B | 17.4 | +1.5% | +10.3% | 86% |
| Qwen3-30B | 19.0 | +0.6% | +10.3% | 94% |
| GLM-4.7 | 15.8 | +5.8% | +31.6% | 82% |
| Scout | 58.0 | −2.5%* | — | — |
| Scout | 163.2 | −6.8%* | — | — |

Table A1: Gap closure summary. *vs uniform 4-bit.

Appendix B: Comparison with v1 (SWAN)

| Dimension | v1 (SWAN) | MINT |
|---|---|---|
| Objective | Weighted sum + thresholds | Constrained optimization |
| Error metric | Single-point 4-bit NRMSE | Multi-point RD curve (8 configs) |
| Group size | Fixed hyperparameter | Per-tensor variable (85% chose g32) |
| Protection | Binary hard-coded rules | Soft priors in objective |
| Safety floor | None (allows SQNR < 5 dB) | SQNR veto at 9 dB |
| Budget | No user control | User-specified |
| 16-bit allocation | 5.6% of params | 0% (redirected to g32 overhead) |
| 2-bit allocation | 4.0% of params | 0% (blocked by SQNR floor) |
| Quality prediction | Not possible | Fitted curve |

Table B1: Architectural comparison between v1 (SWAN) and MINT.

Appendix C: Reproduction Details

Evaluation protocol

Quantization settings
