MINT Research Paper

MINT: Compute-Optimal Data-Free Mixed-Precision Quantization for Large Language Models

March 2026 · baa.ai

We present MINT (Memory-Informed N-bit Tuning), a data-free mixed-precision quantization framework that formulates per-tensor bit-width and group-size allocation as a compute-optimal allocation problem. Given a user-specified memory budget, MINT jointly selects the optimal (bit-width, group-size) configuration for each weight tensor by solving a Multiple-Choice Knapsack Problem (MCKP) over per-tensor rate-distortion curves.

The framework introduces three key innovations: (1) budget-targeted quantization—users specify an exact memory target (e.g., “fit in 4 GB for iPhone” or “fit in 24 GB for RTX 4090”) and MINT produces the provably optimal allocation for that budget, with a fitted prediction curve that estimates output quality before running the pipeline; (2) joint bit-width and group-size optimization that treats group size as a first-class allocation variable, revealing that group-size selection provides larger quality gains than bit-width changes; and (3) an SQNR safety veto with an empirically validated 9 dB threshold that exploits the natural gap between catastrophic 2-bit quantization (SQNR < 9 dB, PPL triples) and usable 3-bit quantization (SQNR > 10 dB).

We evaluate MINT on six model families spanning 8B–109B parameters across dense and Mixture-of-Experts architectures. In matched-size comparisons against GPTQ—a calibration-based method—across three MoE families, MINT consistently outperforms GPTQ despite being entirely data-free. The entire pipeline requires no calibration data, no gradient computation, and completes in under 50 minutes on commodity hardware.

Try MINT yourself

The full pipeline is open source under the MIT licence. Analyse, allocate, and quantize on your own hardware.

View on GitHub


1. Introduction

Post-training quantization (PTQ) has become the primary means of deploying large language models on consumer hardware. Methods such as GPTQ [1], AWQ [2], and SqueezeLLM [3] achieve remarkable compression, but they share a common requirement: a representative calibration dataset. This introduces practical concerns—calibration data may be unavailable for proprietary models, the chosen distribution may not generalize to deployment domains, and calibration demands substantial compute.

Existing data-free approaches [6,7,8] typically apply uniform bit-widths or rely on single sensitivity metrics with hand-tuned thresholds. These approaches face two fundamental limitations. First, threshold-based allocation produces fixed bit-width decisions regardless of the deployment memory budget—the user cannot specify “quantize this model to fit in 6 GB” and receive a provably optimal allocation. Second, single-point error proxies create circularity: using 4-bit reconstruction error to decide 4-bit allocation means the method partly predicts its own label.

We address both limitations with MINT (Memory-Informed N-bit Tuning), which reformulates mixed-precision quantization as a constrained optimization problem:

min over {(bi, gi)}   Σi πi · αi · NRMSEi(bi, gi)    s.t.   Σi sizei(bi, gi) ≤ B

where bi and gi are the bit-width and group size for tensor i, B is the user’s memory budget, πi is a soft protection prior, and αi is a learned importance weight. The key insight is that both bit-width and group size are allocation variables—prior work optimizes bit-width alone, but our evidence shows group-size selection often provides larger quality improvements than bit-width changes.

Contributions

2. Related Work

Calibration-based PTQ

GPTQ [1], AWQ [2], SqueezeLLM [3], SpQR [4], QuIP [5], and SmoothQuant [18] represent the dominant paradigm in post-training quantization. All require calibration data to compute sensitivity information, weight scaling factors, or Hessian approximations. While highly effective, this requirement limits applicability when calibration data is unavailable or unrepresentative of deployment domains.

Data-free quantization

EasyQuant [6], MXQ [7], HQQ [19], and HIGGS [8] eliminate the need for calibration data. These methods typically apply uniform bit-widths across all tensors. MINT differs by formulating allocation as constrained optimization over joint (bit-width, group-size) configurations, enabling budget-targeted deployment and per-tensor mixed-precision decisions.

Sensitivity-based mixed-precision

LLM-MQ [10], SliM-LLM [20], and CherryQ [21] use sensitivity metrics to guide mixed-precision allocation. However, all require calibration data to compute their sensitivity scores. MINT is the first method to combine data-free sensitivity analysis with constrained optimization over both bit-width and group-size variables.

MoE quantization

MC-MoE [12] and MoEQuant [13] address the specific challenges of quantizing Mixture-of-Experts models. Both require calibration data to determine expert importance. MINT’s data-free approach avoids coverage problems inherent in calibration-based MoE quantization, where calibration sequences may not activate all experts.

3. Method

3.1 Pass 1: Feature Extraction and Rate-Distortion Curves

3.1.1 Spectral Features

We extract three scale-invariant features from the singular values of each weight matrix, computed via randomized SVD with rank k=256:

Stable rank measures the effective dimensionality of the weight matrix:

rs(W) = ||W||F² / ||W||2² = Σi σi² / σ1²

Spectral tail mass captures how much energy resides outside the top singular values:

τ(W) = 1 − (Σi=1..⌊r/10⌋ σi²) / (Σi σi²)

Log condition number measures the ratio of largest to smallest singular values:

κ(W) = min(10, log10(σ1 / (σmin + ε)))
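The three spectral features can be computed directly from the singular values. A minimal numpy sketch (function name is ours), using a full SVD in place of the paper's rank-256 randomized SVD:

```python
import numpy as np

def spectral_features(W, eps=1e-12):
    """Scale-invariant spectral features of a 2D weight matrix.

    Illustrative sketch: uses a full SVD for simplicity; the pipeline
    described in the paper uses a rank-256 randomized SVD for speed.
    """
    s = np.linalg.svd(W, compute_uv=False)   # singular values, descending
    s2 = s ** 2
    stable_rank = s2.sum() / s2[0]           # ||W||_F^2 / ||W||_2^2
    r = len(s)
    head = s2[: max(1, r // 10)].sum()       # energy in the top ~10% of the spectrum
    tail_mass = 1.0 - head / s2.sum()
    log_cond = min(10.0, np.log10(s[0] / (s[-1] + eps)))
    return stable_rank, tail_mass, log_cond

rng = np.random.default_rng(0)
rs, tau, kappa = spectral_features(rng.standard_normal((256, 128)))
```

The `max(1, …)` guard simply keeps the head sum non-empty for very small matrices.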

3.1.2 Per-Group Kurtosis Features

We reshape the weight matrix W into K = ⌈mn/g⌉ groups of size g=128 and compute the excess kurtosis per group:

κj = (1/g) Σi ((wj,i − w̄j) / sj)⁴ − 3

where w̄j and sj are the mean and standard deviation of group j.
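The per-group statistic can be computed in a few lines. A minimal numpy sketch (function name is ours), padding a partial tail group with its last value:

```python
import numpy as np

def group_kurtosis(W, g=128):
    """Excess kurtosis per quantization group (illustrative sketch).

    Flattens W into groups of size g and returns one excess-kurtosis
    value per group; Gaussian-distributed groups score near 0.
    """
    w = W.reshape(-1)
    pad = (-w.size) % g
    if pad:
        w = np.concatenate([w, np.full(pad, w[-1])])  # pad the tail group
    groups = w.reshape(-1, g)
    mu = groups.mean(axis=1, keepdims=True)
    sd = groups.std(axis=1, keepdims=True) + 1e-12
    return (((groups - mu) / sd) ** 4).mean(axis=1) - 3.0

rng = np.random.default_rng(0)
k = group_kurtosis(rng.standard_normal((64, 64)), g=128)
```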

From the distribution of per-group kurtosis values, we derive four summary features.

3.1.3 Norm-Guided Output Noise Amplification

Rather than using random Gaussian inputs, we construct probes that respect the input distribution implied by the preceding LayerNorm:

xj ~ N(0, diag(γ²))

where γ is the preceding LayerNorm scale vector. We then measure how quantization noise propagates through the layer:

ΔW = Q(W; 4, 128) − W
fout = ||ΔW · X||F / ||W · X||F   averaged over 32 probes
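A sketch of the probe measurement, using a plain round-to-nearest group quantizer for Q (the paper states MINT assumes round-to-nearest; function names are ours). For simplicity the ratio is aggregated over the probe batch rather than averaged per probe:

```python
import numpy as np

def quantize_rtn(W, bits, g):
    """Round-to-nearest affine quantization with per-group scale and zero-point."""
    w = W.reshape(-1)
    pad = (-w.size) % g
    w = np.concatenate([w, np.zeros(pad)])
    groups = w.reshape(-1, g)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1) + 1e-12
    deq = np.round((groups - lo) / scale) * scale + lo
    return deq.reshape(-1)[: W.size].reshape(W.shape)

def noise_amplification(W, gamma, n_probes=32, seed=0):
    """f_out: relative output perturbation under (4, 128) quantization,
    measured with LayerNorm-shaped probes x ~ N(0, diag(gamma^2))."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((W.shape[1], n_probes)) * gamma[:, None]
    dW = quantize_rtn(W, bits=4, g=128) - W
    return np.linalg.norm(dW @ X) / np.linalg.norm(W @ X)

rng = np.random.default_rng(1)
W = rng.standard_normal((128, 256))
f_out = noise_amplification(W, gamma=np.ones(256))
```

With γ = 1 the probes reduce to standard Gaussians; a non-uniform γ reweights input dimensions the way the preceding LayerNorm would.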

3.1.4 Rate-Distortion Curves

For each tensor, we compute the normalized root mean squared error (NRMSE) at multiple (bit-width, group-size) configurations:

NRMSEi(b, g) = RMS(Q(Wi; b, g) − Wi) / RMS(Wi)

evaluated at the configuration set C = {(2,32), (3,64), (4,32), (4,64), (4,128), (8,64), (8,128), (16,0)}. From the rate-distortion curve we derive four summary features: fauc (area under curve), fm48 (4-to-8 bit NRMSE ratio), fm24 (2-to-4 bit NRMSE ratio), and fslope (local slope at the operating point).

3.1.5 SQNR Safety Veto

We compute the signal-to-quantization-noise ratio for each tensor at each configuration:

SQNRi(b, g) = 10 · log10(||Wi||F² / ||Wi − Q(Wi; b, g)||F²)   dB

Configurations with SQNR < 9 dB are excluded from the allocation candidate set. This threshold is empirically validated in Section 4.3.
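Sections 3.1.4 and 3.1.5 can be sketched together: a round-to-nearest group quantizer (the quantizer MINT assumes), the NRMSE rate-distortion curve, and the SQNR map with the 9 dB veto. Function names are ours; the 16-bit passthrough configuration (16, 0) is omitted since it introduces no quantization error:

```python
import numpy as np

def quantize_rtn(W, bits, g):
    """Round-to-nearest affine quantization with per-group scale and zero-point."""
    w = W.reshape(-1)
    pad = (-w.size) % g
    w = np.concatenate([w, np.zeros(pad)])
    groups = w.reshape(-1, g)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1) + 1e-12
    deq = np.round((groups - lo) / scale) * scale + lo
    return deq.reshape(-1)[: W.size].reshape(W.shape)

def nrmse(W, bits, g):
    d = quantize_rtn(W, bits, g) - W
    return np.sqrt((d ** 2).mean()) / np.sqrt((W ** 2).mean())

def sqnr_db(W, bits, g):
    err = W - quantize_rtn(W, bits, g)
    return 10.0 * np.log10((W ** 2).sum() / ((err ** 2).sum() + 1e-20))

configs = [(2, 32), (3, 64), (4, 32), (4, 64), (4, 128), (8, 64), (8, 128)]
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
curve = {cfg: nrmse(W, *cfg) for cfg in configs}               # RD curve
safe = [cfg for cfg in configs if sqnr_db(W, *cfg) >= 9.0]     # SQNR safety veto
```

On this synthetic Gaussian tensor the curve is monotone in bits, and smaller groups yield lower error at a fixed bit-width, mirroring the allocation behaviour reported in Section 4.2.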

3.2 Pass 2: Normalization, Priors, and Allocation

3.2.1 eCDF Normalization

Each raw feature is normalized to a percentile rank via the empirical cumulative distribution function:

f̃i = |{j : fj ≤ fi}| / T

where T is the total number of tensors. This produces uniform marginals regardless of the original feature scale or distribution.
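The normalization is a one-liner in practice; a minimal sketch (function name is ours, and the O(T²) scan is fine at the tensor counts involved):

```python
import numpy as np

def ecdf_normalize(values):
    """Map each raw feature value f_i to its percentile rank |{j : f_j <= f_i}| / T."""
    v = np.asarray(values, dtype=float)
    return np.array([(v <= x).sum() for x in v]) / v.size

ranks = ecdf_normalize([0.7, -3.2, 0.1])
```

Whatever the input scale, the outputs always lie in (0, 1] with uniform marginals, which is what makes features from different families comparable.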

3.2.2 Soft Protection Priors

Certain tensor categories require stronger protection during quantization. Rather than hard-coding binary keep/quantize rules, MINT uses multiplicative soft priors that inflate the apparent cost of quantizing sensitive tensors:

| Tensor category | Prior π |
|---|---|
| Embedding | 10.0 |
| LM head | 10.0 |
| LayerNorm | ∞ (excluded) |
| MoE router | 8.0 |
| Vision | 8.0 |
| First layer | 3.0 |
| Last layer | 2.0 |
| Default | 1.0 |

Table 1: Soft protection priors by tensor category.
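A prior lookup along these lines is straightforward; a sketch, assuming illustrative tensor-name substrings (real checkpoint naming conventions vary by model family, and the function name is ours):

```python
def protection_prior(tensor_name, layer_idx=None, n_layers=None):
    """Soft protection prior per Table 1 (illustrative name matching)."""
    name = tensor_name.lower()
    if "norm" in name:
        return float("inf")              # LayerNorm: excluded from quantization
    for key, prior in [("embed", 10.0), ("lm_head", 10.0),
                       ("router", 8.0), ("vision", 8.0)]:
        if key in name:
            return prior
    if layer_idx == 0:
        return 3.0                       # first transformer layer
    if n_layers is not None and layer_idx == n_layers - 1:
        return 2.0                       # last transformer layer
    return 1.0

p = protection_prior("model.layers.5.self_attn.q_proj.weight", 5, 32)
```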

3.2.3 Budget-Constrained Allocation (MCKP)

The quantized size of each tensor under configuration (b, g) is:

sizei(b, g) = ⌈ni · b / 8⌉ + ⌊ni / g⌋ · 4   bytes

where ni is the parameter count of tensor i; the second term accounts for the per-group scale and zero-point metadata.

We solve the resulting Multiple-Choice Knapsack Problem using one of three interchangeable solvers.
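As one plausible instance of such a solver, the sketch below implements a greedy upgrade heuristic for the MCKP: every tensor starts at its cheapest admissible configuration, and the allocator repeatedly applies the upgrade with the best weighted-distortion reduction per extra byte until the budget is exhausted. The helper follows the size formula above; names are ours, and this is illustrative rather than necessarily one of the paper's three solvers:

```python
def tensor_size(n, bits, g):
    """Quantized size in bytes: packed weights plus 4 bytes of scale/zero-point per group."""
    groups = 0 if g == 0 else n // g
    return (n * bits + 7) // 8 + groups * 4

def mckp_greedy(tensors, budget):
    """Greedy upgrade heuristic for the Multiple-Choice Knapsack Problem.

    tensors: list of (n_params, weight, rd) where rd maps (bits, g) -> NRMSE
             and weight plays the role of pi_i * alpha_i.
    """
    choice, used = [], 0
    for n, w, rd in tensors:                       # start at the cheapest config
        cheapest = min(rd, key=lambda c: tensor_size(n, *c))
        choice.append(cheapest)
        used += tensor_size(n, *cheapest)
    if used > budget:
        raise ValueError("budget below minimum feasible size")
    improved = True
    while improved:
        improved, best = False, (0.0, None, None)
        for i, (n, w, rd) in enumerate(tensors):   # best upgrade per extra byte
            cur = choice[i]
            for cfg in rd:
                ds = tensor_size(n, *cfg) - tensor_size(n, *cur)
                dd = w * (rd[cur] - rd[cfg])
                if ds > 0 and dd > 0 and used + ds <= budget and dd / ds > best[0]:
                    best = (dd / ds, i, cfg)
        if best[1] is not None:
            _, i, cfg = best
            n = tensors[i][0]
            used += tensor_size(n, *cfg) - tensor_size(n, *choice[i])
            choice[i] = cfg
            improved = True
    return choice, used

rd = {(4, 128): 0.10, (8, 128): 0.01}
choice, used = mckp_greedy([(1024, 1.0, rd), (1024, 1.0, rd)], budget=1700)
```

With a 1700-byte budget only one of the two identical tensors can be upgraded to 8-bit; the heuristic spends the remaining bytes on the single best-value upgrade and stops.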

3.3 Joint Bit-Width and Group-Size Optimization

A key innovation of MINT is treating group size as a first-class allocation variable rather than a fixed hyperparameter. After the SQNR veto, the surviving candidate set for a typical tensor is C = {(4,32), (4,64), (4,128), (8,64), (8,128)}, where smaller group sizes provide finer-grained quantization parameters at the cost of increased scale/zero-point overhead. Our results show that 85% of tensors are allocated (4,32), the smallest available group size, indicating that the quality benefit of finer groups outweighs their storage overhead for the vast majority of tensors.

3.4 Pipeline Summary

Algorithm 1: MINT Pipeline

Input: Model directory, budget B, SQNR floor τ
Output: Per-tensor manifest {(bi, gi)}

// Pass 1: Feature extraction
for each shard in model:
    for each 2D tensor Wi with n ≥ 1024:
        Extract LayerNorm γ from preceding norm layer
        Compute spectral features (rs, τ, κ)
        Compute per-group kurtosis features
        Compute output noise amplification fout
        Compute RD curve: NRMSEi(b, g) for all (b, g) ∈ C
        Compute SQNR map: SQNRi(b, g) for all (b, g) ∈ C

// Pass 2: Allocation
Fit eCDF normalizer over all collected features
Compute soft protection priors πi
Filter configurations by SQNR ≥ τ
Run MCKP solver with budget B
return manifest {(bi, gi)} for each tensor

3.5 Expert Handling for MoE Models

Mixture-of-Experts models pose unique challenges for per-tensor quantization because expert weight matrices within the same layer may have very different sensitivity characteristics.

MINT uses one of two strategies, depending on the number of experts.

For expert groups, we use conservative aggregation:

NRMSEG = maxe NRMSE(e)     SQNRG = mine SQNR(e)     sizeG = Σe size(e)
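The aggregation rule is a direct worst-case reduction; a minimal sketch (names and the dict layout are ours):

```python
def aggregate_expert_group(per_expert):
    """Conservative aggregation of per-expert statistics into one allocation unit.

    per_expert: list of dicts with 'nrmse', 'sqnr', and 'size' per expert.
    """
    return {
        "nrmse": max(e["nrmse"] for e in per_expert),  # worst-case distortion
        "sqnr":  min(e["sqnr"]  for e in per_expert),  # worst-case safety margin
        "size":  sum(e["size"]  for e in per_expert),  # total footprint
    }

group = aggregate_expert_group([
    {"nrmse": 0.10, "sqnr": 20.0, "size": 100},
    {"nrmse": 0.30, "sqnr": 12.0, "size": 100},
])
```

Taking the max NRMSE and min SQNR means the shared configuration is safe for the most sensitive expert in the group, at the cost of slight conservatism for the others.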

4. Experiments

We evaluate MINT on six model families: Qwen3-8B, Qwen3-30B-A3B, Qwen2-57B-A14B, Mixtral-8x7B, GLM-4.7-Flash, and Llama-4-Scout. All experiments use an Apple M2 Ultra with 192 GB unified memory. Perplexity is evaluated on WikiText-2 test with 128 sequences of 2048 tokens (seed=42).

4.1 Main Results

| Model | Method | Size (GB) | Mean PPL | Median PPL | Δ vs BF16 |
|---|---|---|---|---|---|
| Qwen3-8B (dense, 8B parameters) | | | | | |
| Qwen3-8B | BF16 | 15.26 | 9.727 | — | — |
| Qwen3-8B | AWQ | 4.05 | 10.50 | — | +8.1% |
| Qwen3-8B | GPTQ | 4.05 | 10.30 | — | +6.1% |
| Qwen3-8B | Uniform 4-bit | 4.05 | 10.249 | — | +5.4% |
| Qwen3-8B | v1 (SWAN) | 6.05 | 10.097 | — | +3.8% |
| Qwen3-8B | MINT | 6.00 | 10.039 | — | +3.2% |
| Qwen3-30B-A3B (MoE, 30B parameters, 3B active) | | | | | |
| Qwen3-30B | BF16 | 56.87 | 8.728 | — | — |
| Qwen3-30B | Uniform 4-bit | 15.11 | 9.629 | — | +10.3% |
| Qwen3-30B | v1 (SWAN) | 16.73 | 8.924 | 8.974 | +2.8% |
| Qwen3-30B | MINT (16 GB) | 16.29 | 8.930 | 8.971 | +2.3% |
| Qwen3-30B | MINT (17 GB) | 17.39 | 8.858 | 8.912 | +1.5% |
| Qwen3-30B | MINT (19 GB) | 19.01 | 8.782 | 8.798 | +0.6% |
| GLM-4.7-Flash (dense, 30B parameters) | | | | | |
| GLM-4.7 | BF16 | 58.16 | 11.344 | 8.706 | — |
| GLM-4.7 | Uniform 4-bit | 14.82 | ~11.46 | — | +31.6% |
| GLM-4.7 | v1 (SWAN) | 15.92 | 9.930 | 9.084 | +4.3% |
| GLM-4.7 | MINT | 15.82 | 9.427 | 9.210 | +5.8% |
| Llama-4-Scout (MoE, 109B parameters, 17B active, 16 experts) | | | | | |
| Scout | BF16 | ~203 | (exceeds memory) | | |
| Scout | MINT (no safety) | 34.62 | 23.577 | 23.714 | +198% |
| Scout | MINT (min-safe) | 46.93 | 8.675 | 8.786 | +9.8% |
| Scout | MINT (50 GB) | 51.98 | 7.980 | 8.284 | +1.0% |
| Scout | Uniform 4-bit | 56.9 | 7.899 | — | — |
| Scout | v1 (SWAN) | 59.5 | 7.628 | — | −3.4% |
| Scout | MINT (64 GB) | 58.03 | 7.703 | 8.070 | −2.5% |
| Scout | MINT (192 GB) | 163.24 | 7.359 | 7.691 | −6.8% |

Table 3: Main perplexity results across four model families. Best results per model highlighted.

4.2 Joint Bit-Width and Group-Size Allocation

Table 4 shows the allocation breakdown for Qwen3-30B at the 19 GB budget. The dominant configuration is (4,32)—4-bit with group size 32—which is selected for 85.2% of tensors.

| Configuration (bits, group) | Tensors | Fraction |
|---|---|---|
| (4, 32) | 15,908 | 85.2% |
| (4, 64) | 9 | <0.1% |
| (4, 128) | 96 | 0.5% |
| (8, 128) | 2,612 | 14.0% |
| (8, 64) | 1 | <0.1% |
| (16, 0) | 241 | 1.3% |
| Total | 18,867 | 100% |

Table 4: Per-tensor allocation breakdown for Qwen3-30B-A3B at the 19 GB budget.

4.3 SQNR Safety Veto Validation

Table 5 shows the SQNR distribution across configurations for Llama-4-Scout. There is a clear gap between 2-bit (all tensors below 9 dB) and 3-bit (all tensors above 10 dB).

| Config (b, g) | Min | P5 | Median | P95 | Max | <9 dB | <15 dB |
|---|---|---|---|---|---|---|---|
| (2, 32) | 5.1 | 7.2 | 8.0 | 8.1 | 8.7 | 691 | 691 |
| (2, 64) | 2.5 | 5.6 | 6.8 | 6.9 | 7.2 | 691 | 691 |
| (3, 64) | 10.4 | 13.0 | 14.2 | 14.3 | 14.6 | 0 | 691 |
| (4, 32) | 19.4 | 21.3 | 22.0 | 22.1 | 22.8 | 0 | 0 |
Table 5: SQNR distribution across configurations for Llama-4-Scout.

Floor Sweep

| SQNR floor | Avg bits | Size (GB) | Mean PPL | Median PPL | Assessment |
|---|---|---|---|---|---|
| 0 dB | 2.00 | 34.62 | 23.577 | 23.714 | Catastrophic |
| 9 dB | 3.00 | 46.93 | 8.675 | 8.786 | Usable (+9.8%) |
| 9 dB + 50 GB | 3.48 | 51.98 | 7.980 | 8.284 | Good (+1.0%) |
| 15 dB | 4.00 | 56.16 | 7.709 | 8.076 | Best (−2.4%) |
| 15 dB + 58 GB | 4.01 | 58.03 | 7.703 | 8.070 | Best (−2.5%) |

Table 6: SQNR floor sweep on Llama-4-Scout.

4.4 Budget-Targeted Deployment

| Budget | Actual size (GB) | Mean PPL | Median PPL | Note |
|---|---|---|---|---|
| 15.1 GB | 15.11 | 9.629 | — | Uniform 4-bit floor |
| 15.3 GB | 16.13 | 8.970 | 9.020 | iPhone 16 Pro |
| 15.5 GB | 16.29 | 8.930 | 8.971 | |
| 16.7 GB | 17.39 | 8.858 | 8.912 | |
| 19.2 GB | 19.01 | 8.782 | 8.798 | |
| 20.0 GB | 19.32 | 8.784 | 8.803 | RTX 4070 |
| 25.0 GB | 27.39 | 8.760 | 8.779 | RTX 4090 |
| 30.0 GB | 30.75 | 8.657 | 8.684 | Mac M4 Pro |
| BF16 | 56.87 | 8.728 | — | |

Table 7: Budget curve for Qwen3-30B-A3B.

The relationship between budget and perplexity is well-described by a fitted prediction curve:

PPL(B) = 8.371 + 0.494 / (B − 15.099)^0.135     (fit RMSE = 0.025)
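The fitted curve can be evaluated directly before running the pipeline. The coefficients below are the paper's; the function name and the domain guard are ours:

```python
def predicted_ppl(budget_gb, a=8.371, b=0.494, b0=15.099, p=0.135):
    """Budget -> perplexity prediction curve for Qwen3-30B-A3B (Table 7 fit).

    Defined only above the uniform 4-bit floor b0, where the model
    has headroom to spend extra bytes on finer configurations.
    """
    assert budget_gb > b0, "curve is defined only above the 4-bit floor"
    return a + b / (budget_gb - b0) ** p

ppl_19 = predicted_ppl(19.01)   # ≈ 8.78, close to the measured 8.782
```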

| Budget | Actual size (GB) | Mean PPL | Median PPL | Note |
|---|---|---|---|---|
| No safety | 34.62 | 23.577 | 23.714 | Catastrophic |
| Min-safe | 46.93 | 8.675 | 8.786 | |
| 50 GB | 51.98 | 7.980 | 8.284 | |
| 56.9 GB | 56.9 | 7.899 | — | Uniform 4-bit |
| 64 GB | 58.03 | 7.703 | 8.070 | 64 GB device |
| 192 GB | 163.24 | 7.359 | 7.691 | 192 GB device |
| BF16 | ~203 | (exceeds memory) | | |
BF16~203exceeds memory

Table 8: Budget curve for Llama-4-Scout (109B MoE).

4.5 Matched-Size Comparison with GPTQ

| Model | Method | Size (GB) | Mean PPL | Median PPL | Δ PPL |
|---|---|---|---|---|---|
| Qwen3-30B | GPTQ | 16.0 | 9.122 | 9.160 | |
| Qwen3-30B | MINT | 16.1 | 8.970 | 9.020 | −1.7% |
| Qwen2-57B | GPTQ | 29.9 | 6.390 | 6.396 | |
| Qwen2-57B | MINT | 29.9 | 6.329 | 6.356 | −0.95% |
| Mixtral-8x7B | GPTQ | 87.0† | 4.608 | 4.640 | |
| Mixtral-8x7B | Uniform 4-bit | 24.5 | 4.471 | 4.461 | |
| Mixtral-8x7B | MINT | 24.5 | 4.264 | 4.266 | −4.6% |
Table 9: Matched-size comparison with GPTQ. MINT consistently outperforms despite being data-free. †The GPTQ Mixtral checkpoint uses a different storage format.

4.6 Mean vs Median Perplexity

| Model | Method | Mean PPL | Median PPL | Outliers |
|---|---|---|---|---|
| GLM-4.7 | BF16 | 11.344 | 8.706 | 5 |
| GLM-4.7 | v1 (SWAN) | 9.930 | 9.084 | 4 |
| GLM-4.7 | MINT | 9.427 | 9.210 | 0 |

Table 10: Mean vs median perplexity. GLM-4.7 BF16 shows a 30% gap between mean and median due to 5 outlier sequences.

4.7 Analysis Efficiency

| Model | Tensors | Analysis time | Allocation time | Total |
|---|---|---|---|---|
| Qwen3-8B | 399 | 3 min | <1 s | ~10 min |
| Qwen3-30B | 18,867 | 50 min | <1 s | ~54 min |
| GLM-4.7 | 9,703 | 39 min | <1 s | ~44 min |
| Scout | ~1,000 | 45 min | <1 s | ~50 min |

Table 11: Analysis timing on Apple M2 Ultra 192 GB.

5. Discussion

Group size as the primary quality lever. Our most surprising finding is that group-size selection matters more than bit-width selection. At the 19 GB budget for Qwen3-30B, 85.2% of tensors are allocated (4,32) rather than (4,128). The additional overhead of storing per-group scales and zero-points at g=32 (4× more groups than g=128) is more than compensated by the reduction in quantization error.
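The storage side of this trade-off can be checked directly with the size formula from Section 3.2.3; the arithmetic below is a quick sanity check (the helper name is ours):

```python
def tensor_size(n, bits, g):
    """Size formula from Section 3.2.3: packed weights + 4 bytes per group."""
    return (n * bits + 7) // 8 + (n // g) * 4

n = 1 << 20                                        # a 1M-parameter tensor
payload = n * 4 // 8                               # raw 4-bit payload in bytes
s32 = tensor_size(n, 4, 32)
s128 = tensor_size(n, 4, 128)
overhead32 = (s32 - payload) / payload             # scale/zero overhead at g=32
overhead128 = (s128 - payload) / payload           # scale/zero overhead at g=128
```

At 4-bit, g=32 carries 25% metadata overhead versus 6.25% at g=128, so choosing the finer grouping costs roughly 18% more storage per tensor; the allocator evidently finds that this buys more quality than the equivalent bytes spent on bit-width upgrades.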

SQNR veto catches catastrophic configurations. The SQNR safety veto is essential for MoE models. On Llama-4-Scout, disabling the veto produces a model that appears compact (34.6 GB) but is completely unusable (PPL 23.6). The 9 dB threshold exploits a natural gap in the SQNR distribution: all 2-bit configurations fall below 9 dB while all 3-bit configurations exceed 10 dB.

16-bit allocation is not needed. In v1 (SWAN), 5.6% of parameters were allocated 16-bit precision. MINT's joint optimization reveals that this is unnecessary: the same tensors are better served by 4-bit with group size 32, which provides comparable quality at roughly a quarter of the storage cost.

MINT vs GPTQ. MINT consistently outperforms GPTQ at matched model sizes across three MoE families, despite being entirely data-free. We attribute this to three factors: (1) GPTQ uses fixed group sizes; (2) GPTQ’s calibration-derived Hessian may not represent the full input distribution; and (3) MINT’s per-tensor RD curves capture the actual quantization error surface rather than a proxy.

When the allocator disagrees with intuition. The MCKP solver occasionally produces counterintuitive allocations—for example, keeping a seemingly unimportant tensor at 8-bit while quantizing an attention projection to 4-bit. These decisions are correct in the rate-distortion sense: the “unimportant” tensor has a steep RD curve while the attention tensor has a flat curve.

Limitations. MINT currently supports only weight-only quantization; activation quantization is not addressed. The method assumes round-to-nearest quantization. Soft protection priors are hand-specified. The prediction curve is model-specific. Runtime latency is not optimized—smaller group sizes may increase dequantization overhead on some hardware.

6. Conclusion

We have presented MINT, a data-free mixed-precision quantization framework that formulates per-tensor bit-width and group-size allocation as a budget-constrained optimization problem. By solving a Multiple-Choice Knapsack Problem over per-tensor rate-distortion curves, MINT enables hardware-targeted deployment where users specify an exact memory budget and receive a provably optimal allocation.

Our key findings are: (1) group-size selection is the primary quality lever, with 85% of tensors preferring g=32 over conventional g=128; (2) the SQNR safety veto with a 9 dB threshold reliably prevents catastrophic quantization; (3) MINT consistently outperforms the calibration-based GPTQ method at matched sizes across multiple MoE architectures; and (4) median perplexity should be reported alongside means.

MINT requires no calibration data, no gradient computation, and completes in under 50 minutes on commodity hardware.

References

[1] Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR, 2023.

[2] Lin, J., Tang, J., Tang, H., et al. AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration. MLSys, 2024.

[3] Kim, S., Hooper, C., Gholami, A., et al. SqueezeLLM: Dense-and-Sparse Quantization. ICML, 2024.

[4] Dettmers, T., Svirschevski, R., et al. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. ICLR, 2024.

[5] Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. QuIP: 2-Bit Quantization of Large Language Models with Guarantees. NeurIPS, 2024.

[6] Tang, Z., et al. EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs. 2024.

[7] Zhang, Y., Chen, D., and Li, B. MXQ: Mixed-Precision Quantization for Efficient LLM Deployment. ICAART, 2025.

[8] Badri, H., et al. HIGGS: Hardware-Independent Graph-Guided Search for LLM Quantization. NAACL, 2025.

[9] Zhao, Y., et al. KurTail: Kurtosis-Based Tail-Aware Quantization. EMNLP, 2025.

[10] Li, W., et al. LLM-MQ: Mixed-Precision Quantization for Efficient LLM Deployment. 2024.

[11] Li, Z., et al. MixLLM: Mixed-Precision Large Language Model Quantization. 2025.

[12] Wei, Y., et al. MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs. ICLR, 2025.

[13] Huang, L., et al. MoEQuant: Expert-Aware Quantization for Mixture-of-Experts Models. 2025.

[14] Xie, Y., et al. QuantMoEBench: Benchmarking Quantization for MoE Models. NeurIPS, 2024.

[15] Bondarenko, Y., Nagel, M., and Blankevoort, T. Quantization as Regularization. EMNLP, 2023.

[16] Apple. MLX: An Array Framework for Apple Silicon. 2023.

[17] Qwen Team. Qwen3.5 Technical Report. 2026.

[18] Xiao, G., et al. SmoothQuant: Accurate and Efficient Post-Training Quantization. ICML, 2023.

[19] Badri, H., and Shaji, H. HQQ: Half-Quadratic Quantization. 2024.

[20] Huang, J., et al. SliM-LLM: Salience-Driven Mixed-Precision Quantization. ICML, 2025.

[21] Shang, Y., et al. CherryQ: Cherry-Picked Quantization for LLMs. NeurIPS, 2024.

[22] Park, S., et al. HESTIA: Hardware-Efficient STochastic Integer Arithmetic for LLM Inference. 2026.

[23] Xu, C., et al. Qwen3 Quantization: Efficient Deployment of the Qwen3 Family. 2025.

[24] Meta AI. Llama 4: Open Foundation Models. 2025.


Appendix A: Gap Closure Summary

| Model | Size (GB) | Δ vs BF16 | Uniform 4-bit Δ | Gap closed |
|---|---|---|---|---|
| Qwen3-8B | 6.0 | +3.2% | +5.4% | 41% |
| Qwen3-30B | 16.3 | +2.3% | +10.3% | 78% |
| Qwen3-30B | 17.4 | +1.5% | +10.3% | 86% |
| Qwen3-30B | 19.0 | +0.6% | +10.3% | 94% |
| GLM-4.7 | 15.8 | +5.8% | +31.6% | 82% |
| Scout | 58.0 | −2.5%* | — | — |
| Scout | 163.2 | −6.8%* | — | — |

Table A1: Gap closure summary. *vs uniform 4-bit.

Appendix B: Comparison with v1 (SWAN)

| Dimension | v1 (SWAN) | MINT |
|---|---|---|
| Objective | Weighted sum + thresholds | Constrained optimization |
| Error metric | Single-point 4-bit NRMSE | Multi-point RD curve (8 configs) |
| Group size | Fixed hyperparameter | Per-tensor variable (85% chose g32) |
| Protection | Binary hard-coded rules | Soft priors in objective |
| Safety floor | None (allows SQNR < 5 dB) | SQNR veto at 9 dB |
| Budget | No user control | User-specified |
| 16-bit allocation | 5.6% of params | 0% (redirected to g32 overhead) |
| 2-bit allocation | 4.0% of params | 0% (blocked by SQNR floor) |
| Quality prediction | Not possible | Fitted curve |

Table B1: Architectural comparison between v1 (SWAN) and MINT.

Appendix C: Reproduction Details

Evaluation protocol

Quantization settings
