We present MINT (Memory-Informed N-bit Tuning), a data-free mixed-precision quantization framework that formulates per-tensor bit-width and group-size allocation as a compute-optimal allocation problem. Given a user-specified memory budget, MINT jointly selects the optimal (bit-width, group-size) configuration for each weight tensor by solving a Multiple-Choice Knapsack Problem (MCKP) over per-tensor rate-distortion curves.
The framework introduces three key innovations: (1) budget-targeted quantization—users specify an exact memory target (e.g., “fit in 4 GB for iPhone” or “fit in 24 GB for RTX 4090”) and MINT produces the provably optimal allocation for that budget, with a fitted prediction curve that estimates output quality before running the pipeline; (2) joint bit-width and group-size optimization that treats group size as a first-class allocation variable, revealing that group-size selection provides larger quality gains than bit-width changes; and (3) an SQNR safety veto with an empirically validated 9 dB threshold that exploits the natural gap between catastrophic 2-bit quantization (SQNR < 9 dB, PPL triples) and usable 3-bit quantization (SQNR > 10 dB).
We evaluate MINT on six model families spanning 8B–109B parameters across dense and Mixture-of-Experts architectures. In matched-size comparisons against GPTQ—a calibration-based method—across three MoE families, MINT consistently outperforms GPTQ despite being entirely data-free. The entire pipeline requires no calibration data, no gradient computation, and completes in under 50 minutes on commodity hardware.
Try MINT yourself
The full pipeline is open source under the MIT license. Analyze, allocate, and quantize on your own hardware.
1. Introduction
Post-training quantization (PTQ) has become the primary means of deploying large language models on consumer hardware. Methods such as GPTQ [1], AWQ [2], and SqueezeLLM [3] achieve remarkable compression, but they share a common requirement: a representative calibration dataset. This introduces practical concerns—calibration data may be unavailable for proprietary models, the chosen distribution may not generalize to deployment domains, and calibration demands substantial compute.
Existing data-free approaches [6,7,8] typically apply uniform bit-widths or rely on single sensitivity metrics with hand-tuned thresholds. These approaches face two fundamental limitations. First, threshold-based allocation produces fixed bit-width decisions regardless of the deployment memory budget—the user cannot specify “quantize this model to fit in 6 GB” and receive a provably optimal allocation. Second, single-point error proxies create circularity: using 4-bit reconstruction error to decide 4-bit allocation means the method partly predicts its own label.
We address both limitations with MINT (Memory-Informed N-bit Tuning), which reformulates mixed-precision quantization as a constrained optimization problem:

minimize Σᵢ αᵢ πᵢ Dᵢ(bᵢ, gᵢ)   subject to   Σᵢ Sᵢ(bᵢ, gᵢ) ≤ B

where bᵢ and gᵢ are the bit-width and group size for tensor i, Dᵢ is its measured rate-distortion (Section 3.1.4), Sᵢ is its quantized size (Section 3.2.3), B is the user's memory budget, πᵢ is a soft protection prior, and αᵢ is a learned importance weight. The key insight is that both bit-width and group size are allocation variables—prior work optimizes bit-width alone, but our evidence shows group-size selection often provides larger quality improvements than bit-width changes.
Contributions
- A compute-optimal formulation (MCKP) that jointly optimizes per-tensor bit-width and group size.
- Budget-targeted deployment: users specify exact memory target and receive optimal allocation.
- Quantizer-aligned sensitivity features: per-group kurtosis, norm-guided probes, multi-point rate-distortion curves.
- An SQNR safety veto with empirically determined 9 dB threshold.
- Evidence that group-size selection is the primary quality lever (85% of tensors get g32 vs conventional g128).
- Recommendation to report median perplexity alongside means.
- Evidence that MINT consistently outperforms calibration-based GPTQ at matched sizes.
- Expert-grouped allocation for MoE models.
- Scalability to 109B parameters.
2. Related Work
Calibration-based PTQ
GPTQ [1], AWQ [2], SqueezeLLM [3], SpQR [4], QuIP [5], and SmoothQuant [18] represent the dominant paradigm in post-training quantization. All require calibration data to compute sensitivity information, weight scaling factors, or Hessian approximations. While highly effective, this requirement limits applicability when calibration data is unavailable or unrepresentative of deployment domains.
Data-free quantization
EasyQuant [6], MXQ [7], HQQ [19], and HIGGS [8] eliminate the need for calibration data. These methods typically apply uniform bit-widths across all tensors. MINT differs by formulating allocation as constrained optimization over joint (bit-width, group-size) configurations, enabling budget-targeted deployment and per-tensor mixed-precision decisions.
Sensitivity-based mixed-precision
LLM-MQ [10], SliM-LLM [20], and CherryQ [21] use sensitivity metrics to guide mixed-precision allocation. However, all require calibration data to compute their sensitivity scores. MINT is the first method to combine data-free sensitivity analysis with constrained optimization over both bit-width and group-size variables.
MoE quantization
MC-MoE [12] and MoEQuant [13] address the specific challenges of quantizing Mixture-of-Experts models. Both require calibration data to determine expert importance. MINT’s data-free approach avoids coverage problems inherent in calibration-based MoE quantization, where calibration sequences may not activate all experts.
3. Method
3.1 Pass 1: Feature Extraction and Rate-Distortion Curves
3.1.1 Spectral Features
We extract three scale-invariant features from the singular values of each weight matrix, computed via randomized SVD with rank k=256:
Stable rank measures the effective dimensionality of the weight matrix:

r_s = ‖W‖²_F / σ₁² = (Σⱼ σⱼ²) / σ₁²

Spectral tail mass captures how much energy resides outside the top singular values:

τ_tail = 1 − (Σ_{j≤k₀} σⱼ²) / ‖W‖²_F

for a cutoff k₀ on the top singular values. Log condition number measures the ratio of largest to smallest singular values:

κ_log = log(σ₁ / σ_k)
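The three spectral features can be sketched as follows. This is an illustrative implementation using a full SVD for clarity (MINT uses a randomized SVD with rank k=256 on large tensors); the tail cutoff `k0` is an assumed parameter, not taken from the paper.

```python
import numpy as np

def spectral_features(W, k0=16):
    # Scale-invariant features from the singular values of W.
    s = np.linalg.svd(W, compute_uv=False)       # singular values, descending
    energy = s ** 2
    stable_rank = energy.sum() / energy[0]       # ||W||_F^2 / sigma_1^2
    tail_mass = 1.0 - energy[:k0].sum() / energy.sum()   # energy outside top-k0
    log_cond = float(np.log(s[0] / s[-1]))       # log(sigma_max / sigma_min)
    return float(stable_rank), float(tail_mass), log_cond

rng = np.random.default_rng(0)
r_s, t_mass, k_log = spectral_features(rng.standard_normal((256, 256)))
```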
3.1.2 Per-Group Kurtosis Features
We reshape the weight matrix W into K = ⌈mn/g⌉ groups of size g=128 and compute the excess kurtosis per group:

κⱼ = E[(w − μⱼ)⁴] / σⱼ⁴ − 3

where μⱼ and σⱼ are the mean and standard deviation of group j.
From the distribution of per-group kurtosis values, we derive four features:
- fkurt90 = P90 — 90th percentile of group kurtosis values
- fratio = P99/P50 — ratio of extreme to typical kurtosis
- foutlier = fraction of groups with >3σ outliers
- fmaxmed = max(κj)/median(κj) — worst-case to typical ratio
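The per-group kurtosis features above can be sketched in plain Python. This is a simplified sketch: the percentile handling is naive and the outlier-fraction feature is omitted for brevity.

```python
import statistics

def group_kurtosis_features(weights, g=128):
    # Split the flattened weights into groups of g and compute excess
    # kurtosis per group, then summarize the distribution of group values.
    groups = [weights[i:i + g] for i in range(0, len(weights), g)]
    kappas = []
    for grp in groups:
        mu = statistics.fmean(grp)
        var = statistics.fmean([(w - mu) ** 2 for w in grp])
        m4 = statistics.fmean([(w - mu) ** 4 for w in grp])
        kappas.append(m4 / var ** 2 - 3.0)       # excess kurtosis
    ks = sorted(kappas)
    pct = lambda p: ks[min(len(ks) - 1, int(p / 100 * len(ks)))]
    return {
        "f_kurt90": pct(90),                                  # P90
        "f_ratio": pct(99) / pct(50),                         # P99 / P50
        "f_maxmed": max(kappas) / statistics.median(kappas),  # max / median
    }

# Deterministic demo: pseudo-uniform values with a few large outliers
demo = [float((i * 37) % 101) for i in range(4096)]
demo[::512] = [1000.0] * len(demo[::512])
feats = group_kurtosis_features(demo)
```

Groups that contain an outlier show strongly positive excess kurtosis, which is exactly the heavy-tail signal these features are designed to surface.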
3.1.3 Norm-Guided Output Noise Amplification
Rather than using random Gaussian inputs, we construct probes that respect the input distribution implied by the preceding LayerNorm:

x = γ ⊙ z,  z ∼ N(0, I)

where γ is the preceding LayerNorm scale vector. We then measure how quantization noise propagates through the layer:

f_out = ‖ΔW · X‖_F / ‖W · X‖_F

averaged over 32 probes, where ΔW = Q(W) − W is the quantization error.
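A minimal sketch of the probe-based measurement, assuming the layer input is well modeled as a standard normal vector scaled elementwise by γ (the exact probe recipe is an assumption here):

```python
import numpy as np

def output_noise_amplification(W, W_q, gamma, n_probes=32, seed=0):
    # f_out = ||dW x|| / ||W x|| averaged over probes x = gamma * z, z ~ N(0, I)
    rng = np.random.default_rng(seed)
    dW = W_q - W
    ratios = []
    for _ in range(n_probes):
        x = gamma * rng.standard_normal(W.shape[1])
        ratios.append(np.linalg.norm(dW @ x) / np.linalg.norm(W @ x))
    return float(np.mean(ratios))

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64))
W_q = W + 0.01 * rng.standard_normal((64, 64))   # stand-in for quantized weights
f_out = output_noise_amplification(W, W_q, gamma=np.ones(64))
```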
3.1.4 Rate-Distortion Curves
For each tensor, we compute the normalized root mean squared error (NRMSE) at multiple (bit-width, group-size) configurations:

NRMSE(b, g) = ‖W − Q_{b,g}(W)‖_F / ‖W‖_F

evaluated at the configuration set C = {(2,32), (3,64), (4,32), (4,64), (4,128), (8,64), (8,128), (16,0)}, where (16,0) denotes unquantized 16-bit storage. From the rate-distortion curve we derive four summary features: fauc (area under curve), fm48 (4-to-8 bit NRMSE ratio), fm24 (2-to-4 bit NRMSE ratio), and fslope (local slope at the operating point).
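A sketch of the rate-distortion measurement with group-wise round-to-nearest. Asymmetric min-max scaling is assumed here; the paper's quantizer details may differ.

```python
import numpy as np

def rtn_dequantize(w, bits, g):
    # Group-wise asymmetric round-to-nearest: quantize, then dequantize.
    w = w.reshape(-1, g)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)
    scale[scale == 0] = 1.0                      # guard constant groups
    q = np.round((w - lo) / scale)
    return (q * scale + lo).reshape(-1)

def nrmse(w, bits, g):
    deq = rtn_dequantize(w, bits, g)
    return float(np.linalg.norm(w - deq) / np.linalg.norm(w))

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
curve = {(b, g): nrmse(w, b, g)
         for (b, g) in [(2, 32), (4, 32), (4, 128), (8, 128)]}
```

On Gaussian weights the curve behaves as the method expects: error falls with more bits and with smaller groups.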
3.1.5 SQNR Safety Veto
We compute the signal-to-quantization-noise ratio for each tensor at each configuration:

SQNR(b, g) = 10 log₁₀( ‖W‖²_F / ‖W − Q_{b,g}(W)‖²_F )
Configurations with SQNR < 9 dB are excluded from the allocation candidate set. This threshold is empirically validated in Section 4.3.
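The veto reduces to a simple filter over each tensor's candidate configurations. The per-config SQNR values below are made-up numbers for illustration:

```python
import math

def sqnr_db(w, w_deq):
    # 10 * log10(signal power / quantization-noise power)
    sig = sum(x * x for x in w)
    noise = sum((x - y) ** 2 for x, y in zip(w, w_deq))
    return 10.0 * math.log10(sig / noise)

def apply_veto(sqnr_map, floor_db=9.0):
    # Keep only configurations whose SQNR clears the floor.
    return [cfg for cfg, db in sqnr_map.items() if db >= floor_db]

# Hypothetical per-config SQNR for one tensor (dB)
sqnr_map = {(2, 32): 7.8, (3, 64): 14.1, (4, 32): 22.0}
allowed = apply_veto(sqnr_map)   # 2-bit option is vetoed
```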
3.2 Pass 2: Normalization, Priors, and Allocation
3.2.1 eCDF Normalization
Each raw feature is normalized to a percentile rank via the empirical cumulative distribution function:

f̂ = rank(f) / T
where T is the total number of tensors. This produces uniform marginals regardless of the original feature scale or distribution.
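A minimal eCDF normalizer, fit once over the feature values of all tensors and then applied per feature:

```python
import bisect

def fit_ecdf(values):
    # Returns a function mapping a raw value to its percentile rank in (0, 1].
    xs = sorted(values)
    T = len(xs)
    return lambda v: bisect.bisect_right(xs, v) / T

normalize = fit_ecdf([0.1, 5.0, 2.0, 3.3])
```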
3.2.2 Soft Protection Priors
Certain tensor categories require stronger protection during quantization. Rather than hard-coding binary keep/quantize rules, MINT uses multiplicative soft priors that inflate the apparent cost of quantizing sensitive tensors:
| Tensor Category | Prior π |
|---|---|
| Embedding | 10.0 |
| LM head | 10.0 |
| LayerNorm | ∞ (excluded) |
| MoE router | 8.0 |
| Vision | 8.0 |
| First layer | 3.0 |
| Last layer | 2.0 |
| Default | 1.0 |
Table 1: Soft protection priors by tensor category.
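Table 1 translates into a simple name-based lookup. The substring patterns below are assumptions (checkpoint naming varies across model families); LayerNorm tensors are excluded from quantization upstream, so they never reach this function.

```python
def protection_prior(name, layer_idx, n_layers):
    # Soft priors from Table 1; patterns are illustrative assumptions.
    for pattern, prior in [("embed", 10.0), ("lm_head", 10.0),
                           ("router", 8.0), ("vision", 8.0)]:
        if pattern in name:
            return prior
    if layer_idx == 0:
        return 3.0              # first layer
    if layer_idx == n_layers - 1:
        return 2.0              # last layer
    return 1.0                  # default
```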
3.2.3 Budget-Constrained Allocation (MCKP)
The quantized size of each tensor under configuration (b, g) is:

S(b, g) = mn·b/8 + ⌈mn/g⌉·c bytes

where the first term stores the b-bit weights and c is the per-group metadata cost (scale and zero-point; e.g., 4 bytes for two 16-bit values).
We solve the resulting Multiple-Choice Knapsack Problem using one of three solvers:
- Greedy (default, <10ms): sorts candidates by efficiency ratio and greedily upgrades tensors from minimum-cost configuration.
- LP relaxation: linear programming relaxation provides an upper bound on quality.
- ILP exact: integer linear programming yields the provably optimal solution at higher compute cost.
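The greedy solver can be sketched as follows. Each tensor starts at its cheapest admissible configuration; we then repeatedly apply the upgrade with the best distortion reduction per extra byte until nothing more fits. Option costs are assumed strictly increasing within each tensor's list.

```python
def greedy_mckp(tensors, budget_bytes):
    # tensors: tensor id -> list of (cost_bytes, weighted_distortion)
    # options, sorted by increasing cost.
    choice = {t: 0 for t in tensors}                 # start at cheapest option
    spent = sum(opts[0][0] for opts in tensors.values())
    while True:
        best, best_eff = None, 0.0
        for t, opts in tensors.items():
            i = choice[t]
            if i + 1 < len(opts):
                dcost = opts[i + 1][0] - opts[i][0]  # extra bytes
                dgain = opts[i][1] - opts[i + 1][1]  # distortion reduction
                if spent + dcost <= budget_bytes and dgain / dcost > best_eff:
                    best, best_eff = t, dgain / dcost
        if best is None:
            break                                    # no affordable upgrade left
        i = choice[best]
        spent += tensors[best][i + 1][0] - tensors[best][i][0]
        choice[best] = i + 1
    return choice, spent

# Two tensors, each with (cost_bytes, weighted_distortion) options
tensors = {"a": [(10, 5.0), (20, 1.0)], "b": [(10, 4.0), (30, 3.9)]}
choice, spent = greedy_mckp(tensors, budget_bytes=40)
```

In this toy instance, tensor "a" gets upgraded (large gain per byte) while "b" stays at its cheapest option.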
3.3 Joint Bit-Width and Group-Size Optimization
A key innovation of MINT is treating group size as a first-class allocation variable rather than a fixed hyperparameter. The configuration space for each tensor is C = {(4,32), (4,64), (4,128), (8,64), (8,128)}, where smaller group sizes provide finer-grained quantization parameters at the cost of increased scale/zero-point overhead. Our results show that 85% of tensors are allocated (4,32)—the smallest available group size—indicating that the quality benefit of finer groups outweighs their storage overhead for the vast majority of tensors.
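The storage trade-off above can be made concrete. Assuming each group stores a 16-bit scale and a 16-bit zero-point (32 metadata bits per group, an assumption about the storage format), the effective bits per weight are b + 32/g:

```python
def effective_bits(b, g, meta_bits_per_group=32):
    # Payload bits plus amortized per-group metadata (scale + zero-point).
    return b + meta_bits_per_group / g

print(effective_bits(4, 32))    # 5.0 bits/weight
print(effective_bits(4, 128))   # 4.25 bits/weight
```

Moving from g=128 to g=32 at 4 bits thus costs roughly 18% more storage (5.0 vs 4.25 bits/weight); the allocation results indicate that the error reduction is worth this price for most tensors.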
3.4 Pipeline Summary
```
Input:  model directory, budget B, SQNR floor τ
Output: per-tensor manifest {(bᵢ, gᵢ)}

// Pass 1: feature extraction
for each shard in model:
    for each 2D tensor Wᵢ with n ≥ 1024:
        extract LayerNorm γ from the preceding norm layer
        compute spectral features (r_s, τ_tail, κ_log)
        compute per-group kurtosis features
        compute output noise amplification f_out
        compute RD curve NRMSEᵢ(b, g) for all (b, g) ∈ C
        compute SQNR map SQNRᵢ(b, g) for all (b, g) ∈ C

// Pass 2: allocation
fit eCDF normalizer over all collected features
compute soft protection priors πᵢ
filter configurations by SQNR ≥ τ
run MCKP solver with budget B
return manifest {(bᵢ, gᵢ)} for each tensor
```
3.5 Expert Handling for MoE Models
Mixture-of-Experts models pose unique challenges for per-tensor quantization because expert weight matrices within the same layer may have very different sensitivity characteristics.
MINT uses two strategies depending on the number of experts:
- Individual analysis (E ≤ 32): Each expert is analyzed independently. The worst-case representative determines the group’s allocation.
- Clustered analysis (E > 32): k-means clustering on lightweight statistics groups similar experts, and one representative is sampled per cluster.
For expert groups, we use conservative aggregation: the group-level distortion at each configuration is the maximum over member experts,

D_group(b, g) = max_e D_e(b, g)

so the worst-case expert governs the shared allocation.
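Assuming each expert's RD curve is a mapping from configuration to distortion, conservative aggregation is a per-configuration maximum over the group:

```python
def aggregate_expert_rd(expert_curves):
    # Group-level distortion = worst case over member experts, per config.
    configs = expert_curves[0].keys()
    return {c: max(curve[c] for curve in expert_curves) for c in configs}

# Two experts with hypothetical per-config distortions
e1 = {(4, 32): 0.010, (8, 128): 0.002}
e2 = {(4, 32): 0.030, (8, 128): 0.001}
group_rd = aggregate_expert_rd([e1, e2])
```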
4. Experiments
We evaluate MINT on six model families: Qwen3-8B, Qwen3-30B-A3B, Qwen2-57B-A14B, Mixtral-8x7B, GLM-4.7-Flash, and Llama-4-Scout. All experiments use an Apple M2 Ultra with 192 GB unified memory. Perplexity is evaluated on WikiText-2 test with 128 sequences of 2048 tokens (seed=42).
4.1 Main Results
| Model | Method | Size (GB) | Mean PPL | Median PPL | Δ vs BF16 |
|---|---|---|---|---|---|
| Qwen3-8B (dense, 8B parameters) | |||||
| Qwen3-8B | BF16 | 15.26 | 9.727 | — | — |
| Qwen3-8B | AWQ | 4.05 | 10.50 | — | +8.1% |
| Qwen3-8B | GPTQ | 4.05 | 10.30 | — | +6.1% |
| Qwen3-8B | Uniform 4-bit | 4.05 | 10.249 | — | +5.4% |
| Qwen3-8B | v1 (SWAN) | 6.05 | 10.097 | — | +3.8% |
| Qwen3-8B | MINT | 6.00 | 10.039 | — | +3.2% |
| Qwen3-30B-A3B (MoE, 30B parameters, 3B active) | |||||
| Qwen3-30B | BF16 | 56.87 | 8.728 | — | — |
| Qwen3-30B | Uniform 4-bit | 15.11 | 9.629 | — | +10.3% |
| Qwen3-30B | v1 (SWAN) | 16.73 | 8.924 | 8.974 | +2.8% |
| Qwen3-30B | MINT (16 GB) | 16.29 | 8.930 | 8.971 | +2.3% |
| Qwen3-30B | MINT (17 GB) | 17.39 | 8.858 | 8.912 | +1.5% |
| Qwen3-30B | MINT (19 GB) | 19.01 | 8.782 | 8.798 | +0.6% |
| GLM-4.7-Flash (dense, 30B parameters) | |||||
| GLM-4.7 | BF16 | 58.16 | 11.344 | 8.706 | — |
| GLM-4.7 | Uniform 4-bit | 14.82 | ~11.46 | — | +31.6% |
| GLM-4.7 | v1 (SWAN) | 15.92 | 9.930 | 9.084 | +4.3% |
| GLM-4.7 | MINT | 15.82 | 9.427 | 9.210 | +5.8% |
| Llama-4-Scout (MoE, 109B parameters, 17B active, 16 experts) | |||||
| Scout | BF16 | ~203 | exceeds memory | — | — |
| Scout | MINT (no safety) | 34.62 | 23.577 | 23.714 | +198% |
| Scout | MINT (min-safe) | 46.93 | 8.675 | 8.786 | +9.8% |
| Scout | MINT (50 GB) | 51.98 | 7.980 | 8.284 | +1.0% |
| Scout | Uniform 4-bit | 56.9 | 7.899 | — | — |
| Scout | v1 (SWAN) | 59.5 | 7.628 | — | −3.4% |
| Scout | MINT (64 GB) | 58.03 | 7.703 | 8.070 | −2.5% |
| Scout | MINT (192 GB) | 163.24 | 7.359 | 7.691 | −6.8% |
Table 3: Main perplexity results across four model families. Δ for Llama-4-Scout is relative to uniform 4-bit, since its BF16 baseline exceeds available memory.
4.2 Joint Bit-Width and Group-Size Allocation
Table 4 shows the allocation breakdown for Qwen3-30B at the 19 GB budget. The dominant configuration is (4,32)—4-bit with group size 32—which is selected for 85.2% of tensors.
| Configuration (bits, group) | Tensors | Fraction |
|---|---|---|
| (4, 32) | 15,908 | 85.2% |
| (4, 64) | 9 | <0.1% |
| (4, 128) | 96 | 0.5% |
| (8, 128) | 2,612 | 14.0% |
| (8, 64) | 1 | <0.1% |
| (16, 0) | 241 | 1.3% |
| Total | 18,867 | 100% |
Table 4: Per-tensor allocation breakdown for Qwen3-30B-A3B at the 19 GB budget.
4.3 SQNR Safety Veto Validation
Table 5 shows the SQNR distribution across configurations for Llama-4-Scout. There is a clear gap between 2-bit (all tensors below 9 dB) and 3-bit (all tensors above 10 dB).
| Config (b, g) | Min | P5 | Median | P95 | Max | <9 dB | <15 dB |
|---|---|---|---|---|---|---|---|
| (2, 32) | 5.1 | 7.2 | 8.0 | 8.1 | 8.7 | 691 | 691 |
| (2, 64) | 2.5 | 5.6 | 6.8 | 6.9 | 7.2 | 691 | 691 |
| (3, 64) | 10.4 | 13.0 | 14.2 | 14.3 | 14.6 | 0 | 691 |
| (4, 32) | 19.4 | 21.3 | 22.0 | 22.1 | 22.8 | 0 | 0 |
Table 5: SQNR distribution across configurations for Llama-4-Scout.
Floor Sweep
| SQNR Floor | Avg Bits | Size (GB) | Mean PPL | Median PPL | Assessment |
|---|---|---|---|---|---|
| 0 dB | 2.00 | 34.62 | 23.577 | 23.714 | Catastrophic |
| 9 dB | 3.00 | 46.93 | 8.675 | 8.786 | Usable (+9.8%) |
| 9 dB + 50 GB | 3.48 | 51.98 | 7.980 | 8.284 | Good (+1.0%) |
| 15 dB | 4.00 | 56.16 | 7.709 | 8.076 | Best (−2.4%) |
| 15 dB + 58 GB | 4.01 | 58.03 | 7.703 | 8.070 | Best (−2.5%) |
Table 6: SQNR floor sweep on Llama-4-Scout.
4.4 Budget-Targeted Deployment
| Budget | Actual Size (GB) | Mean PPL | Median PPL | Note |
|---|---|---|---|---|
| 15.1 GB | 15.11 | 9.629 | — | Uniform 4-bit floor |
| 15.3 GB | 16.13 | 8.970 | 9.020 | iPhone 16 Pro |
| 15.5 GB | 16.29 | 8.930 | 8.971 | — |
| 16.7 GB | 17.39 | 8.858 | 8.912 | — |
| 19.2 GB | 19.01 | 8.782 | 8.798 | — |
| 20.0 GB | 19.32 | 8.784 | 8.803 | RTX 4070 |
| 25.0 GB | 27.39 | 8.760 | 8.779 | RTX 4090 |
| 30.0 GB | 30.75 | 8.657 | 8.684 | Mac M4 Pro |
| BF16 | 56.87 | 8.728 | — | — |
Table 7: Budget curve for Qwen3-30B-A3B.
The relationship between budget and perplexity is well described by a fitted prediction curve, which MINT uses to estimate output quality before running the full pipeline.
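The paper's functional form is not reproduced here; as an illustrative stand-in, a simple reciprocal model PPL ≈ c₀ + c₁/size fits the Table 7 points reasonably well via closed-form least squares:

```python
def fit_reciprocal(points):
    # Least-squares fit of ppl ~ c0 + c1 / size.
    # Illustrative only: the actual prediction curve may use another form.
    xs = [1.0 / s for s, _ in points]
    ys = [p for _, p in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    c1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    c0 = my - c1 * mx
    return c0, c1

# (size GB, mean PPL) for Qwen3-30B-A3B, from Table 7
pts = [(16.13, 8.970), (16.29, 8.930), (17.39, 8.858),
       (19.01, 8.782), (19.32, 8.784), (27.39, 8.760), (30.75, 8.657)]
c0, c1 = fit_reciprocal(pts)
```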
| Budget | Actual Size (GB) | Mean PPL | Median PPL | Note |
|---|---|---|---|---|
| No safety | 34.62 | 23.577 | 23.714 | Catastrophic |
| Min-safe | 46.93 | 8.675 | 8.786 | — |
| 50 GB | 51.98 | 7.980 | 8.284 | — |
| 56.9 GB | 56.9 | 7.899 | — | Uniform 4-bit |
| 64 GB | 58.03 | 7.703 | 8.070 | 64 GB device |
| 192 GB | 163.24 | 7.359 | 7.691 | 192 GB device |
| BF16 | ~203 | — | — | exceeds memory |
Table 8: Budget curve for Llama-4-Scout (109B MoE).
4.5 Matched-Size Comparison with GPTQ
| Model | Method | Size (GB) | Mean PPL | Median PPL | Δ PPL |
|---|---|---|---|---|---|
| Qwen3-30B | GPTQ | 16.0 | 9.122 | 9.160 | — |
| Qwen3-30B | MINT | 16.1 | 8.970 | 9.020 | −1.7% |
| Qwen2-57B | GPTQ | 29.9 | 6.390 | 6.396 | — |
| Qwen2-57B | MINT | 29.9 | 6.329 | 6.356 | −0.95% |
| Mixtral-8x7B | GPTQ | 87.0† | 4.608 | 4.640 | — |
| Mixtral-8x7B | Uniform 4-bit | 24.5 | 4.471 | 4.461 | — |
| Mixtral-8x7B | MINT | 24.5 | 4.264 | 4.266 | −4.6% |
Table 9: Matched-size comparison with GPTQ. MINT consistently outperforms despite being data-free. †The available GPTQ Mixtral checkpoint uses a different storage format, so its size is not matched; Δ for Mixtral is therefore computed against uniform 4-bit at matched size.
4.6 Mean vs Median Perplexity
| Model | Method | Mean PPL | Median PPL | Outliers |
|---|---|---|---|---|
| GLM-4.7 | BF16 | 11.344 | 8.706 | 5 |
| GLM-4.7 | v1 (SWAN) | 9.930 | 9.084 | 4 |
| GLM-4.7 | MINT | 9.427 | 9.210 | 0 |
Table 10: Mean vs median perplexity. GLM-4.7 BF16 shows a 30% gap between mean and median due to 5 outlier sequences.
4.7 Analysis Efficiency
| Model | Tensors | Analysis Time | Allocation Time | Total |
|---|---|---|---|---|
| Qwen3-8B | 399 | 3 min | <1s | ~10 min |
| Qwen3-30B | 18,867 | 50 min | <1s | ~54 min |
| GLM-4.7 | 9,703 | 39 min | <1s | ~44 min |
| Scout | ~1,000 | 45 min | <1s | ~50 min |
Table 11: Analysis timing on Apple M2 Ultra 192 GB.
5. Discussion
Group size as the primary quality lever. Our most surprising finding is that group-size selection matters more than bit-width selection. At the 19 GB budget for Qwen3-30B, 85.2% of tensors are allocated (4,32) rather than (4,128). The additional overhead of storing per-group scales and zero-points at g=32 (4× more groups than g=128) is more than compensated by the reduction in quantization error.
SQNR veto catches catastrophic configurations. The SQNR safety veto is essential for MoE models. On Llama-4-Scout, disabling the veto produces a model that appears compact (34.6 GB) but is completely unusable (PPL 23.6). The 9 dB threshold exploits a natural gap in the SQNR distribution: all 2-bit configurations fall below 9 dB while all 3-bit configurations exceed 10 dB.
16-bit allocation is not needed. In v1 (SWAN), 5.6% of parameters were allocated 16-bit precision. MINT’s joint optimization reveals that this is unnecessary: the same tensors are better served by 4-bit with group size 32, which provides comparable quality at 25% of the storage cost.
MINT vs GPTQ. MINT consistently outperforms GPTQ at matched model sizes across three MoE families, despite being entirely data-free. We attribute this to three factors: (1) GPTQ uses fixed group sizes; (2) GPTQ’s calibration-derived Hessian may not represent the full input distribution; and (3) MINT’s per-tensor RD curves capture the actual quantization error surface rather than a proxy.
When the allocator disagrees with intuition. The MCKP solver occasionally produces counterintuitive allocations—for example, keeping a seemingly unimportant tensor at 8-bit while quantizing an attention projection to 4-bit. These decisions are correct in the rate-distortion sense: the “unimportant” tensor has a steep RD curve while the attention tensor has a flat curve.
Limitations. MINT currently supports only weight-only quantization; activation quantization is not addressed. The method assumes round-to-nearest quantization. Soft protection priors are hand-specified. The prediction curve is model-specific. Runtime latency is not optimized—smaller group sizes may increase dequantization overhead on some hardware.
6. Conclusion
We have presented MINT, a data-free mixed-precision quantization framework that formulates per-tensor bit-width and group-size allocation as a budget-constrained optimization problem. By solving a Multiple-Choice Knapsack Problem over per-tensor rate-distortion curves, MINT enables hardware-targeted deployment where users specify an exact memory budget and receive a provably optimal allocation.
Our key findings are: (1) group-size selection is the primary quality lever, with 85% of tensors preferring g=32 over conventional g=128; (2) the SQNR safety veto with a 9 dB threshold reliably prevents catastrophic quantization; (3) MINT consistently outperforms the calibration-based GPTQ method at matched sizes across multiple MoE architectures; and (4) median perplexity should be reported alongside means.
MINT requires no calibration data, no gradient computation, and completes in under 50 minutes on commodity hardware.
References
[1] Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR, 2023.
[2] Lin, J., Tang, J., Tang, H., et al. AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration. MLSys, 2024.
[3] Kim, S., Hooper, C., Gholami, A., et al. SqueezeLLM: Dense-and-Sparse Quantization. ICML, 2024.
[4] Dettmers, T., Svirschevski, R., et al. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. ICLR, 2024.
[5] Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. QuIP: 2-Bit Quantization of Large Language Models with Guarantees. NeurIPS, 2024.
[6] Tang, Z., et al. EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs. 2024.
[7] Zhang, Y., Chen, D., and Li, B. MXQ: Mixed-Precision Quantization for Efficient LLM Deployment. ICAART, 2025.
[8] Badri, H., et al. HIGGS: Hardware-Independent Graph-Guided Search for LLM Quantization. NAACL, 2025.
[9] Zhao, Y., et al. KurTail: Kurtosis-Based Tail-Aware Quantization. EMNLP, 2025.
[10] Li, W., et al. LLM-MQ: Mixed-Precision Quantization for Efficient LLM Deployment. 2024.
[11] Li, Z., et al. MixLLM: Mixed-Precision Large Language Model Quantization. 2025.
[12] Wei, Y., et al. MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs. ICLR, 2025.
[13] Huang, L., et al. MoEQuant: Expert-Aware Quantization for Mixture-of-Experts Models. 2025.
[14] Xie, Y., et al. QuantMoEBench: Benchmarking Quantization for MoE Models. NeurIPS, 2024.
[15] Bondarenko, Y., Nagel, M., and Blankevoort, T. Quantization as Regularization. EMNLP, 2023.
[16] Apple. MLX: An Array Framework for Apple Silicon. 2023.
[17] Qwen Team. Qwen3.5 Technical Report. 2026.
[18] Xiao, G., et al. SmoothQuant: Accurate and Efficient Post-Training Quantization. ICML, 2023.
[19] Badri, H., and Shaji, H. HQQ: Half-Quadratic Quantization. 2024.
[20] Huang, J., et al. SliM-LLM: Salience-Driven Mixed-Precision Quantization. ICML, 2025.
[21] Shang, Y., et al. CherryQ: Cherry-Picked Quantization for LLMs. NeurIPS, 2024.
[22] Park, S., et al. HESTIA: Hardware-Efficient STochastic Integer Arithmetic for LLM Inference. 2026.
[23] Xu, C., et al. Qwen3 Quantization: Efficient Deployment of the Qwen3 Family. 2025.
[24] Meta AI. Llama 4: Open Foundation Models. 2025.
Appendix A: Gap Closure Summary
| Model | Size (GB) | Δ vs BF16 | Uniform 4-bit Δ | Gap Closed |
|---|---|---|---|---|
| Qwen3-8B | 6.0 | +3.2% | +5.4% | 41% |
| Qwen3-30B | 16.3 | +2.3% | +10.3% | 78% |
| Qwen3-30B | 17.4 | +1.5% | +10.3% | 86% |
| Qwen3-30B | 19.0 | +0.6% | +10.3% | 94% |
| GLM-4.7 | 15.8 | +5.8% | +31.6% | 82% |
| Scout | 58.0 | −2.5%* | — | — |
| Scout | 163.2 | −6.8%* | — | — |
Table A1: Gap closure summary. *vs uniform 4-bit.
Appendix B: Comparison with v1 (SWAN)
| Dimension | v1 (SWAN) | MINT |
|---|---|---|
| Objective | Weighted sum + thresholds | Constrained optimization |
| Error metric | Single-point 4-bit NRMSE | Multi-point RD curve (8 configs) |
| Group size | Fixed hyperparameter | Per-tensor variable (85% chose g32) |
| Protection | Binary hard-coded rules | Soft priors in objective |
| Safety floor | None (allows SQNR < 5 dB) | SQNR veto at 9 dB |
| Budget | No user control | User-specified |
| 16-bit allocation | 5.6% of params | 0% (redirected to g32 overhead) |
| 2-bit allocation | 4.0% of params | 0% (blocked by SQNR floor) |
| Quality prediction | Not possible | Fitted curve |
Table B1: Architectural comparison between v1 (SWAN) and MINT.
Appendix C: Reproduction Details
- Python: 3.12.0
- MLX: 0.30.3
- mlx_lm: 0.30.4
- PyTorch: 2.6.0
- SciPy: latest
Evaluation protocol
- Perplexity dataset: WikiText-2 test split
- Sequence length: 2048 tokens
- Number of sequences: 128
- Random seed: 42
Quantization settings
- Method: Group-wise round-to-nearest (RTN)
- SVD: Randomized, rank k=256
- MCKP solver: Greedy efficiency ordering
- SQNR floor: 9 dB