How to Fit Massive Models onto Tiny Memory Footprints without Losing Accuracy
MINT Research


March 2026 · baa.ai

Specify “24 GB for RTX 4090” and receive the provably optimal quantization. MINT formulates model compression as a constrained optimization problem — and solves it in under a second.

The problem with today’s workflow

Today, deploying a model across different hardware means running the quantization pipeline multiple times: 4-bit for one GPU, 3-bit for a phone, 8-bit for a server. Each run requires separate calibration, separate validation, and separate engineering effort. There is no principled way to say “give me the best model that fits in 24 GB” and receive a mathematically optimal result.

The industry has accepted this as normal. It is not. It is the consequence of treating quantization as a per-configuration problem rather than as a single optimization problem with a budget constraint.

Quantization as a knapsack problem

MINT reformulates mixed-precision quantization as a Multiple-Choice Knapsack Problem (MCKP). Each tensor is an “item” with multiple configuration options — bit-width and group-size combinations. Each option has a “weight” (storage cost in bytes) and a “value” (quality cost as NRMSE). The memory budget is the knapsack capacity.

The solver finds the allocation that minimizes total quality loss while fitting within the budget. Formally:

min Σᵢ πᵢ · αᵢ · NRMSEᵢ(bᵢ, gᵢ)    subject to    Σᵢ sizeᵢ(bᵢ, gᵢ) ≤ B

Each tensor’s contribution is weighted by a soft protection prior (π) and a learned importance factor (α). The solver evaluates the full combinatorial space of bit-width and group-size assignments and returns the provably optimal allocation — not an approximation, not a heuristic, but the exact minimum-cost solution for the given budget.
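To make the formulation concrete, here is a minimal exact MCKP solver in Python. It is a dynamic-programming sketch over toy (size, quality-cost) options — not MINT’s implementation, which operates on real per-tensor NRMSE profiles and byte sizes — but it illustrates why the result is the exact minimum-cost allocation rather than a heuristic:

```python
from math import inf

def solve_mckp(tensors, budget):
    """Exact multiple-choice knapsack via dynamic programming.

    tensors: list of option lists; each option is (size, quality_cost),
             where quality_cost stands in for pi * alpha * NRMSE.
    budget:  capacity in integer size units.
    Returns (min_total_cost, chosen_option_indices), or None if infeasible.
    """
    # dp[w] = cheapest way to fill exactly w units with the tensors seen so far
    dp = [(0.0, [])] + [(inf, None)] * budget
    for options in tensors:
        new_dp = [(inf, None)] * (budget + 1)
        for w in range(budget + 1):
            cost, choices = dp[w]
            if choices is None:
                continue  # weight w unreachable so far
            for idx, (size, qcost) in enumerate(options):
                nw = w + size
                if nw <= budget and cost + qcost < new_dp[nw][0]:
                    new_dp[nw] = (cost + qcost, choices + [idx])
        dp = new_dp
    best = min(dp, key=lambda t: t[0])
    return None if best[1] is None else best

# Two toy tensors, each with (size, quality-cost) options for 4/6/8-bit:
tensors = [
    [(4, 0.30), (6, 0.10), (8, 0.02)],
    [(4, 0.50), (6, 0.20), (8, 0.05)],
]
print(solve_mckp(tensors, budget=12))  # picks the 6-bit option for both
```

Exactly one option is chosen per tensor, and every feasible combination is implicitly considered, which is what makes the solution provably optimal for the given budget.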

The budget curve

Running MINT across a range of budgets for Qwen3-30B-A3B produces a smooth quality curve from the minimum viable 4-bit floor to near-BF16 quality. Each point is the provably optimal allocation for that budget.

Budget     Actual Size (GB)   Mean PPL   Median PPL   Note
15.1 GB    15.11              9.629      —            Uniform 4-bit floor
15.3 GB    16.13              8.970      9.020        iPhone 16 Pro
16.7 GB    17.39              8.858      8.912
19.2 GB    19.01              8.782      8.798
20.0 GB    19.32              8.784      8.803        RTX 4070
25.0 GB    27.39              8.760      8.779        RTX 4090
30.0 GB    30.75              8.657      8.684        Mac M4 Pro
BF16       56.87              8.728      —


Predicting quality before you quantize

The relationship between budget and quality follows a fitted prediction curve:

PPL(B) = 8.371 + 0.494 / (B − 15.099)^0.135

with RMSE of only 0.025. This means you can predict the output quality at any budget before spending any compute. “Will this model be good enough on a 16 GB phone?” becomes a lookup, not an experiment.

The prediction curve captures the diminishing returns of additional memory: the first few gigabytes above the 4-bit floor produce dramatic quality gains, while additional memory beyond 20 GB yields increasingly marginal improvement. This is exactly the shape you would expect from a well-behaved rate-distortion function — and it means deployment decisions can be made analytically rather than empirically.
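Plugging the published coefficients into a few lines of Python makes the lookup concrete. The function below simply evaluates the fitted curve from the article; it is not part of the MINT codebase:

```python
def predicted_ppl(budget_gb):
    """Fitted budget-to-quality curve from the article (Qwen3-30B-A3B):
    PPL(B) = 8.371 + 0.494 / (B - 15.099)^0.135, for budgets B > 15.099 GB."""
    return 8.371 + 0.494 / (budget_gb - 15.099) ** 0.135

# "Will this model be good enough on a 16 GB phone?" becomes a lookup:
for b in (15.3, 20.0, 30.0):
    print(f"{b:4.1f} GB -> predicted PPL {predicted_ppl(b):.3f}")
```

Evaluating at 15.3 GB gives roughly 8.98, in line with the measured 8.970 from the budget table, and the curve flattens rapidly above 20 GB, matching the diminishing-returns shape described above.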

One pass, many targets

The model’s weights are analyzed once on CPU, producing a rate-distortion profile for every tensor. From that single analysis, optimal configurations can be generated for any number of memory budgets instantly, with no additional compute.

Four hardware targets? One pass. Twelve targets? Still one pass. The allocation step itself — the MCKP solver — takes less than one second regardless of model size.

This changes the economics of multi-target deployment entirely. Instead of running a separate quantization pipeline for each device class, teams run one analysis pass and generate every variant they need. The marginal cost of an additional hardware target drops from “hours of GPU compute” to “sub-second solver invocation.”
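The analyze-once, solve-many workflow can be sketched as follows. The brute-force solver and toy (size, quality-cost) options below are stand-ins for MINT’s MCKP solver and real per-tensor rate-distortion profile; only the structure of the workflow is taken from the article:

```python
import itertools

def profile_once(tensor_options):
    """Stand-in for the one-time CPU analysis: in MINT this produces a
    rate-distortion profile (NRMSE per bit-width/group-size) for every
    tensor. Here it is just a precomputed table of (size, cost) options."""
    return tensor_options  # computed once, reused for every budget

def optimal_for_budget(profile, budget):
    """Tiny exact solver (brute force over option combinations) standing in
    for the sub-second MCKP solver; fine for a handful of toy tensors."""
    best = None
    for combo in itertools.product(*profile):
        size = sum(s for s, _ in combo)
        cost = sum(c for _, c in combo)
        if size <= budget and (best is None or cost < best[0]):
            best = (cost, combo)
    return best

profile = profile_once([
    [(4, 0.30), (6, 0.10), (8, 0.02)],   # tensor 0: 4/6/8-bit options
    [(4, 0.50), (6, 0.20), (8, 0.05)],   # tensor 1
])
# One analysis pass, many targets: solve per budget with no re-analysis.
variants = {b: optimal_for_budget(profile, b) for b in (8, 12, 16)}
```

Adding a fourth or twelfth target is just another key in the `variants` dictionary — the expensive step, building the profile, never reruns.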

The 109B example

To demonstrate that budget-targeted quantization scales to the largest models, here is the budget curve for Llama-4-Scout — a 109-billion-parameter Mixture-of-Experts model:

Budget      Size (GB)   Mean PPL   Note
No safety   34.62       23.577     Catastrophic
Min-safe    46.93       8.675
50 GB       51.98       7.980
64 GB       58.03       7.703      64 GB device
192 GB      163.24      7.359      192 GB device

The same 109-billion-parameter model deployed optimally across hardware from 47 GB to 163 GB. No calibration data, no manual tuning, no iteration.

Note the catastrophic collapse at the “no safety” budget: perplexity nearly triples from 8.675 to 23.577 when the SQNR safety floor is removed. The safety veto is not conservative overhead — it is the difference between a usable model and garbage output.

From manual craft to API call

Budget-targeted quantization turns model compression from a multi-day engineering project into a single API call: specify your memory budget, receive the optimal model.

For model hubs serving thousands of models across dozens of hardware targets, this is the difference between quantization as a manual craft and quantization as automated infrastructure. The solver is exact, fast, and deterministic. The same budget always produces the same allocation. The quality is provably optimal for the given constraint.

The era of “run GPTQ three times and pick the one that fits” is over. Tell the solver how much memory your GPU has, and let the math do the rest.


Budget curves and allocation data from the MINT paper. All perplexity evaluations on WikiText-2 test split. The full MINT pipeline is open source at github.com/baa-ai/MINT.

