How to Fit Massive Models onto Tiny Memory Footprints without Losing Accuracy
MINT Research


March 2026 · baa.ai

Specify “24 GB for RTX 4090” and receive the provably optimal quantization. MINT formulates model compression as a constrained optimization problem — and solves it in under a second.

The problem with today’s workflow

Today, deploying a model across different hardware means running the quantization pipeline multiple times: 4-bit for one GPU, 3-bit for a phone, 8-bit for a server. Each run requires separate calibration, separate validation, and separate engineering effort. There is no principled way to say “give me the best model that fits in 24 GB” and receive a mathematically optimal result.

The industry has accepted this as normal. It is not. It is the consequence of treating quantization as a per-configuration problem rather than as a single optimization problem with a budget constraint.

Quantization as a knapsack problem

MINT reformulates mixed-precision quantization as a Multiple-Choice Knapsack Problem (MCKP). Each tensor is an “item” with multiple configuration options — bit-width and group-size combinations. Each option has a “weight” (storage cost in bytes) and a “value” (quality cost as NRMSE). The memory budget is the knapsack capacity.

The solver finds the allocation that minimizes total quality loss while fitting within the budget. Formally:

min Σᵢ πᵢ · αᵢ · NRMSEᵢ(bᵢ, gᵢ)    subject to    Σᵢ sizeᵢ(bᵢ, gᵢ) ≤ B

Each tensor’s contribution is weighted by a soft protection prior (π) and a learned importance factor (α). The solver evaluates the full combinatorial space of bit-width and group-size assignments and returns the provably optimal allocation — not an approximation, not a heuristic, but the exact minimum-cost solution for the given budget.
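To make the formulation concrete, here is a minimal exact MCKP solver in Python. It is a dynamic-programming sketch over toy (size, quality-cost) options — not MINT’s implementation, which operates on real per-tensor NRMSE profiles and byte sizes — but it illustrates why the result is the exact minimum-cost allocation rather than a heuristic:

```python
from math import inf

def solve_mckp(tensors, budget):
    """Exact multiple-choice knapsack via dynamic programming.

    tensors: list of option lists; each option is (size, quality_cost),
             where quality_cost stands in for pi * alpha * NRMSE.
    budget:  capacity in integer size units.
    Returns (min_total_cost, chosen_option_indices), or None if infeasible.
    """
    # dp[w] = cheapest way to fill exactly w units with the tensors seen so far
    dp = [(0.0, [])] + [(inf, None)] * budget
    for options in tensors:
        new_dp = [(inf, None)] * (budget + 1)
        for w in range(budget + 1):
            cost, choices = dp[w]
            if choices is None:
                continue  # weight w unreachable so far
            for idx, (size, qcost) in enumerate(options):
                nw = w + size
                if nw <= budget and cost + qcost < new_dp[nw][0]:
                    new_dp[nw] = (cost + qcost, choices + [idx])
        dp = new_dp
    best = min(dp, key=lambda t: t[0])
    return None if best[1] is None else best

# Two toy tensors, each with (size, quality-cost) options for 4/6/8-bit:
tensors = [
    [(4, 0.30), (6, 0.10), (8, 0.02)],
    [(4, 0.50), (6, 0.20), (8, 0.05)],
]
print(solve_mckp(tensors, budget=12))  # picks the 6-bit option for both
```

Exactly one option is chosen per tensor, and every feasible combination is implicitly considered, which is what makes the solution provably optimal for the given budget.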

The budget curve

Running MINT across a range of budgets for Qwen3-30B-A3B produces a smooth quality curve from the minimum viable 4-bit floor to near-BF16 quality. Each point is the provably optimal allocation for that budget.

Budget     Actual Size (GB)   Mean PPL   Median PPL   Note
15.1 GB    15.11              9.629      —            Uniform 4-bit floor
15.3 GB    16.13              8.970      9.020        iPhone 16 Pro
16.7 GB    17.39              8.858      8.912
19.2 GB    19.01              8.782      8.798
20.0 GB    19.32              8.784      8.803        RTX 4070
25.0 GB    27.39              8.760      8.779        RTX 4090
30.0 GB    30.75              8.657      8.684        Mac M4 Pro
BF16       56.87              8.728      —


Predicting quality before you quantize

The relationship between budget and quality follows a fitted prediction curve:

PPL(B) = 8.371 + 0.494 / (B − 15.099)^0.135

with RMSE of only 0.025. This means you can predict the output quality at any budget before spending any compute. “Will this model be good enough on a 16 GB phone?” becomes a lookup, not an experiment.

The prediction curve captures the diminishing returns of additional memory: the first few gigabytes above the 4-bit floor produce dramatic quality gains, while additional memory beyond 20 GB yields increasingly marginal improvement. This is exactly the shape you would expect from a well-behaved rate-distortion function — and it means deployment decisions can be made analytically rather than empirically.
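Plugging the published coefficients into a few lines of Python makes the lookup concrete. The function below simply evaluates the fitted curve from the article; it is not part of the MINT codebase:

```python
def predicted_ppl(budget_gb):
    """Fitted budget-to-quality curve from the article (Qwen3-30B-A3B):
    PPL(B) = 8.371 + 0.494 / (B - 15.099)^0.135, for budgets B > 15.099 GB."""
    return 8.371 + 0.494 / (budget_gb - 15.099) ** 0.135

# "Will this model be good enough on a 16 GB phone?" becomes a lookup:
for b in (15.3, 20.0, 30.0):
    print(f"{b:4.1f} GB -> predicted PPL {predicted_ppl(b):.3f}")
```

Evaluating at 15.3 GB gives roughly 8.98, in line with the measured 8.970 from the budget table, and the curve flattens rapidly above 20 GB, matching the diminishing-returns shape described above.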

One pass, many targets

The model’s weights are analyzed once on CPU, producing a rate-distortion profile for every tensor. From that single analysis, optimal configurations can be generated for any number of memory budgets instantly, with no additional compute.

Four hardware targets? One pass. Twelve targets? Still one pass. The allocation step itself — the MCKP solver — takes less than one second regardless of model size.

This changes the economics of multi-target deployment entirely. Instead of running a separate quantization pipeline for each device class, teams run one analysis pass and generate every variant they need. The marginal cost of an additional hardware target drops from “hours of GPU compute” to “sub-second solver invocation.”
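The analyze-once, solve-many workflow can be sketched as follows. The brute-force solver and toy (size, quality-cost) options below are stand-ins for MINT’s MCKP solver and real per-tensor rate-distortion profile; only the structure of the workflow is taken from the article:

```python
import itertools

def profile_once(tensor_options):
    """Stand-in for the one-time CPU analysis: in MINT this produces a
    rate-distortion profile (NRMSE per bit-width/group-size) for every
    tensor. Here it is just a precomputed table of (size, cost) options."""
    return tensor_options  # computed once, reused for every budget

def optimal_for_budget(profile, budget):
    """Tiny exact solver (brute force over option combinations) standing in
    for the sub-second MCKP solver; fine for a handful of toy tensors."""
    best = None
    for combo in itertools.product(*profile):
        size = sum(s for s, _ in combo)
        cost = sum(c for _, c in combo)
        if size <= budget and (best is None or cost < best[0]):
            best = (cost, combo)
    return best

profile = profile_once([
    [(4, 0.30), (6, 0.10), (8, 0.02)],   # tensor 0: 4/6/8-bit options
    [(4, 0.50), (6, 0.20), (8, 0.05)],   # tensor 1
])
# One analysis pass, many targets: solve per budget with no re-analysis.
variants = {b: optimal_for_budget(profile, b) for b in (8, 12, 16)}
```

Adding a fourth or twelfth target is just another key in the `variants` dictionary — the expensive step, building the profile, never reruns.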

The 109B example

To demonstrate that budget-targeted quantization scales to the largest models, here is the budget curve for Llama-4-Scout — a 109-billion-parameter Mixture-of-Experts model:

Budget      Size (GB)   Mean PPL   Note
No safety   34.62       23.577     Catastrophic
Min-safe    46.93       8.675
50 GB       51.98       7.980
64 GB       58.03       7.703      64 GB device
192 GB      163.24      7.359      192 GB device

The same 109-billion-parameter model deployed optimally across hardware from 47 GB to 163 GB. No calibration data, no manual tuning, no iteration.

Note the catastrophic collapse at the “no safety” budget: perplexity nearly triples from 8.675 to 23.577 when the SQNR safety floor is removed. The safety veto is not conservative overhead — it is the difference between a usable model and garbage output.

From manual craft to API call

Budget-targeted quantization turns model compression from a multi-day engineering project into a single API call: specify your memory budget, receive the optimal model.

For model hubs serving thousands of models across dozens of hardware targets, this is the difference between quantization as a manual craft and quantization as automated infrastructure. The solver is exact, fast, and deterministic. The same budget always produces the same allocation. The quality is provably optimal for the given constraint.

The era of “run GPTQ three times and pick the one that fits” is over. Tell the solver how much memory your GPU has, and let the math do the rest.


Budget curves and allocation data from the MINT paper. All perplexity evaluations on WikiText-2 test split. The full MINT pipeline is open source at github.com/baa-ai/MINT.

