Every result in the MINT paper was produced on a single Apple M2 Ultra. No NVIDIA GPUs. No cloud instances. No calibration data. Just Apple Silicon, unified memory, and the MLX framework. This article explains why that matters and what it means for anyone with a Mac.
The Hardware Nobody Expected
When quantization papers report their results, the hardware section typically reads like a shopping list from a data centre: 8×A100 80 GB, 4×H100, a DGX cluster. The MINT paper’s hardware section reads differently:
Reproduction environment (Appendix C):
- Hardware: Apple M2 Ultra, 192 GB unified memory
- Framework: MLX 0.30.3, mlx_lm 0.30.4
- Python: 3.12.0
- Quantization: Group-wise round-to-nearest (RTN)
That is the entire compute stack. A Mac Studio sitting on a desk. The same machine ran every experiment—from an 8B dense model to a 109B parameter Mixture-of-Experts model with 16 experts. The largest model (Llama-4-Scout at ~203 GB in BF16) cannot even fit in memory unquantized, yet MINT analysed, allocated, and quantized it to run on that same machine.
Why MLX Changes the Equation
Apple’s MLX framework is purpose-built for Apple Silicon. Three properties make it uniquely suited to MINT’s pipeline:
Unified Memory
CPU and GPU share the same memory pool. A 192 GB M2 Ultra can hold 192 GB of model weights without copying between devices. No PCIe bottleneck, no VRAM limit separate from system RAM.
Lazy Evaluation
MLX uses lazy evaluation—tensors are only materialised when needed. MINT streams through model shards one tensor at a time, computing features and rate-distortion curves without ever loading the full model into memory simultaneously.
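The streaming pattern itself is simple enough to sketch without MLX at all. The following pure-Python sketch is illustrative only: the shard dicts stand in for safetensors files, and the real pipeline would load each shard lazily through MLX rather than holding plain lists.

```python
from typing import Dict, Iterator, List, Tuple

def iter_tensors(shards) -> Iterator[Tuple[str, List[float]]]:
    """Yield (name, tensor) pairs one at a time.

    In the real pipeline each shard would be a safetensors file
    loaded lazily via MLX; here shards are plain dicts so the
    pattern is runnable anywhere.
    """
    for shard in shards:
        for name, tensor in shard.items():
            yield name, tensor
        # The shard goes out of scope here, so its memory is
        # reclaimable before the next shard is touched.

def peak_abs_stats(shards) -> Dict[str, float]:
    """Per-tensor max-|w|, computed without holding the full model."""
    return {name: max(abs(x) for x in t) for name, t in iter_tensors(shards)}

# Two toy "shards" standing in for safetensors files.
shards = [
    {"layers.0.attn.q_proj": [0.1, -0.9, 0.4],
     "layers.0.attn.k_proj": [2.5, -0.2]},
    {"layers.1.mlp.gate": [-1.7, 0.3, 0.3]},
]
stats = peak_abs_stats(shards)
print(stats)
```

Because only one shard is alive at a time, peak memory tracks the largest shard rather than the whole model, which is the same property MLX's lazy arrays provide at tensor granularity.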
Native Quantization
MLX supports group-wise quantization natively at 2, 4, and 8-bit with configurable group sizes (32, 64, 128). The quantized format is the deployment format—what MINT produces is exactly what mlx_lm loads for inference.
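For readers who want to see what group-wise RTN actually does, here is a minimal numpy sketch. It illustrates the scheme, not MINT's implementation; in the real pipeline MLX's native quantization plays this role on-device.

```python
import numpy as np

def rtn_quantize(w, bits=4, group_size=32):
    """Group-wise round-to-nearest (RTN) affine quantization.

    Each contiguous group of `group_size` weights gets its own scale
    and zero-point, so an outlier in one group cannot degrade the
    precision of the rest of the tensor.
    """
    flat = w.reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    levels = 2**bits - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.clip(np.round((flat - lo) / scale), 0, levels)
    return q.astype(np.uint8), scale, lo

def rtn_dequantize(q, scale, lo, shape):
    """Reverse the affine map to recover approximate weights."""
    return (q * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s, z = rtn_quantize(w, bits=4, group_size=32)
w_hat = rtn_dequantize(q, s, z, w.shape)
nrmse = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"4-bit g=32 NRMSE: {nrmse:.4f}")
```

Smaller groups mean each scale covers fewer weights, which is exactly the quality lever MINT exploits when it prefers g=32.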
This is not a compromise. Unified memory is an advantage for quantization workloads. Traditional GPU setups are constrained by VRAM (24 GB on an RTX 4090, 80 GB on an A100). A 109B parameter model in BF16 requires ~203 GB—no single GPU can hold it. On Apple Silicon, the M2 Ultra’s 192 GB is one contiguous memory space accessible by all compute units.
What MINT Actually Does on Apple Silicon
MINT’s pipeline has two passes, both designed for memory-efficient streaming on MLX:
Pass 1: Feature Extraction
For each weight tensor in the model (there are 18,867 in Qwen3-30B alone), MINT computes:
- Spectral features via randomised SVD (rank 256)—stable rank, spectral tail mass, log condition number
- Per-group kurtosis at group size 128—90th percentile, P99/P50 ratio, outlier fraction
- Norm-guided output noise amplification—probes shaped by the preceding LayerNorm’s scale vector
- Rate-distortion curves—NRMSE measured at 8 different (bit-width, group-size) configurations
- SQNR map—signal-to-quantization-noise ratio at each configuration
All of this runs on the M2 Ultra’s compute units via MLX. The randomised SVD, the kurtosis calculations, the quantise-measure-restore loop for rate-distortion curves—every operation is a native MLX tensor operation running on Apple Silicon.
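As an illustration of the spectral features, here is a numpy sketch of the randomised-SVD step (rank reduced from 256 to 32 to suit the toy matrix). The feature names follow the list above, but the exact formulas here are this sketch's assumptions, not the paper's.

```python
import numpy as np

def randomized_svd_features(w, rank=32, seed=0):
    """Approximate the top singular values with a randomized SVD
    sketch, then derive simple spectral features from them."""
    rng = np.random.default_rng(seed)
    m, n = w.shape
    # Sketch the column space with a Gaussian test matrix.
    omega = rng.standard_normal((n, rank))
    q, _ = np.linalg.qr(w @ omega)
    # Project into the small subspace and take an exact SVD there.
    b = q.T @ w
    sigma = np.linalg.svd(b, compute_uv=False)
    fro2 = np.sum(w.astype(np.float64) ** 2)
    return {
        # Stable rank: Frobenius energy relative to the top direction.
        "stable_rank": float(fro2 / sigma[0] ** 2),
        # Spread between largest and smallest captured singular value.
        "log_condition": float(np.log(sigma[0] / sigma[-1])),
        # Energy the rank-`rank` sketch fails to capture.
        "tail_mass": float(1.0 - np.sum(sigma**2) / fro2),
    }

rng = np.random.default_rng(1)
# Low-rank-plus-noise matrix: stable rank should come out small.
w = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 256))
w = w + 0.01 * rng.standard_normal((256, 256))
feats = randomized_svd_features(w, rank=32)
print(feats)
```

The point of the randomised sketch is cost: it touches the full matrix only through a few matrix products, which is why rank-256 SVD over 18,867 tensors stays tractable on one machine.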
Pass 2: Allocation
Pass 2 takes the collected features and solves a Multiple-Choice Knapsack Problem (MCKP) to find the optimal (bit-width, group-size) for each tensor under a user-specified memory budget. The MCKP solver runs in under a second for every model tested. The budget is the key input: “fit this model in 30 GB” or “fit it in 64 GB”—and MINT returns the provably optimal allocation.
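To make the allocation step concrete, here is a small dynamic-programming MCKP solver in plain Python. The tensor names, candidate configurations, sizes, and distortions are invented for illustration, and MINT's actual solver may differ in formulation.

```python
def solve_mckp(tensors, budget):
    """Multiple-Choice Knapsack by dynamic programming.

    tensors: {name: [(config, size, distortion), ...]}
    Pick exactly one config per tensor, minimising total
    distortion subject to total size <= budget (integer units).
    """
    INF = float("inf")
    dp = [0.0] * (budget + 1)   # min distortion reachable at each capacity
    trace = []                  # per-tensor back-pointers for reconstruction
    for name, choices in tensors.items():
        new = [INF] * (budget + 1)
        pick = [None] * (budget + 1)
        for cap in range(budget + 1):
            for cfg, size, dist in choices:
                if size <= cap and dp[cap - size] + dist < new[cap]:
                    new[cap] = dp[cap - size] + dist
                    pick[cap] = (cfg, size)
        dp = new
        trace.append((name, pick))
    cap = min(range(budget + 1), key=dp.__getitem__)
    total = dp[cap]
    allocation = {}
    for name, pick in reversed(trace):
        cfg, size = pick[cap]
        allocation[name] = cfg
        cap -= size
    return allocation, total

# Toy instance: sizes in arbitrary integer units, distortions invented.
tensors = {
    "attn.q_proj": [("4bit/g32", 5, 0.10), ("4bit/g128", 4, 0.30), ("2bit/g32", 3, 1.50)],
    "mlp.down":    [("8bit/g64", 8, 0.02), ("4bit/g32", 5, 0.25), ("2bit/g64", 2, 2.00)],
}
alloc, dist = solve_mckp(tensors, budget=10)
print(alloc, f"total distortion = {dist:.2f}")
```

Because the solve is decoupled from analysis and cheap, re-running it for a different budget is near-instant, which is exactly what makes "analyse once, deploy everywhere" practical.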
Timing: How Fast Is It?
From the MINT paper, Table 11—all timings on Apple M2 Ultra 192 GB:
| Model | Parameters | Tensors | Analysis | Allocation | Total |
|---|---|---|---|---|---|
| Qwen3-8B | 8B (dense) | 399 | 3 min | <1s | ~10 min |
| GLM-4.7-Flash | 30B (dense) | 9,703 | 39 min | <1s | ~44 min |
| Qwen3-30B-A3B | 30B (MoE) | 18,867 | 50 min | <1s | ~54 min |
| Llama-4-Scout | 109B (MoE) | ~1,000 | 45 min | <1s | ~50 min |
A 109B parameter model—analysed, optimised, and allocated in under 50 minutes. On a Mac. Compare this to calibration-based methods like GPTQ, which require GPU clusters, representative datasets, and hours of Hessian computation for models of this scale.
Note the allocation time: under 1 second in every case. This means once you’ve run the analysis pass, you can re-target the same model to any number of hardware budgets instantly. Analyse once, deploy everywhere.
Budget-Targeted Deployment: Name Your Hardware
MINT’s defining feature is budget-targeted quantization. You specify exactly how much memory you have, and MINT returns the optimal allocation. The paper demonstrates this with Qwen3-30B-A3B across specific Apple Silicon and GPU targets:
| Target Hardware | Memory Budget | Model Size | Mean PPL | Δ vs BF16 |
|---|---|---|---|---|
| iPhone 16 Pro | 15.3 GB | 16.13 GB | 8.970 | +2.8% |
| RTX 4070 | 20.0 GB | 19.32 GB | 8.784 | +0.6% |
| RTX 4090 | 25.0 GB | 27.39 GB | 8.760 | +0.4% |
| Mac M4 Pro | 30.0 GB | 30.75 GB | 8.657 | −0.8% |
| BF16 (no quantization) | — | 56.87 GB | 8.728 | — |
At the Mac M4 Pro budget of 30 GB, MINT produces a 30.75 GB quantized model with a mean perplexity of 8.657—slightly better than the full BF16 model (8.728) at 54% of the size. The MCKP allocator distributes bits where they matter most, closing 94% of the gap between uniform 4-bit and BF16 at the RTX 4070 budget point (19.32 GB).
The 109B Model on a Mac Studio
The most striking demonstration is Llama-4-Scout—Meta’s 109B parameter MoE model with 16 experts and 17B active parameters per token. In BF16, it requires ~203 GB. No consumer hardware can run it. Here is what MINT does with it on the M2 Ultra:
| Configuration | Size | Mean PPL | Assessment |
|---|---|---|---|
| BF16 | ~203 GB | — | Exceeds memory; cannot run |
| Uniform 4-bit | 56.9 GB | 7.899 | Baseline |
| MINT @ 192 GB | 163.24 GB | 7.359 | −6.8% vs uniform 4-bit |
| MINT @ 64 GB | 58.03 GB | 7.703 | −2.5% vs uniform 4-bit |
| MINT @ 50 GB | 51.98 GB | 7.980 | +1.0% vs uniform 4-bit |
| MINT min-safe | 46.93 GB | 8.675 | +9.8% vs uniform 4-bit |
At the M2 Ultra’s 192 GB budget, MINT produces a 163 GB model that is 6.8% better than uniform 4-bit quantization (7.359 vs 7.899). At 64 GB (feasible on an M4 Max), it still beats uniform 4-bit by 2.5%. A model that was previously impossible to deploy on consumer hardware now runs with excellent quality.
The key enabling technology is the SQNR safety veto. Without it, MINT would aggressively compress tensors to 2-bit, producing a compact 34.6 GB model that is catastrophically broken (PPL 23.6—nearly triple the baseline). The 9 dB SQNR threshold automatically blocks these dangerous configurations, and the MCKP allocator redistributes those bits to where they actually help.
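The veto itself is easy to sketch. Below, a simplistic RTN stand-in measures SQNR for a handful of candidate configurations and flags anything under the 9 dB floor; the weight matrix and the candidate list are illustrative, not the paper's actual search space.

```python
import numpy as np

def sqnr_db(w, w_hat):
    """Signal-to-quantization-noise ratio in decibels."""
    signal = np.sum(w.astype(np.float64) ** 2)
    noise = np.sum((w - w_hat).astype(np.float64) ** 2)
    return 10.0 * np.log10(signal / noise)

def rtn(w, bits, group_size):
    """Minimal group-wise RTN stand-in, just for measuring SQNR."""
    flat = w.reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / (2**bits - 1), 1.0)
    q = np.round((flat - lo) / scale)
    return (q * scale + lo).reshape(w.shape)

SQNR_FLOOR_DB = 9.0  # the paper's safety threshold

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)

admissible = []
for bits, g in [(2, 32), (2, 128), (4, 32), (4, 128), (8, 64)]:
    s = sqnr_db(w, rtn(w, bits, g))
    verdict = "ok" if s >= SQNR_FLOOR_DB else "VETOED"
    admissible.append(((bits, g), s >= SQNR_FLOOR_DB))
    print(f"{bits}-bit g={g}: {s:5.1f} dB  {verdict}")
```

Configurations that fail the floor never reach the knapsack, so the allocator can only spend its budget on choices that keep the tensor's output recognisable.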
Why Apple Silicon Is Ideal for This Workload
MINT’s pipeline has specific computational characteristics that map perfectly to Apple Silicon’s architecture:
Memory-bound, not compute-bound
Quantization analysis is dominated by reading weights and computing relatively simple per-tensor statistics. Unified memory eliminates the CPU↔GPU transfer bottleneck that plagues discrete GPU setups. The entire model is already where the compute happens.
Streaming access pattern
MINT processes tensors sequentially—load one shard, compute features for each tensor, move to the next shard. This linear access pattern is precisely what unified memory and MLX’s lazy evaluation are designed for. No random access across the full model, no gradient graph, no backpropagation.
No gradient computation
MINT is entirely data-free—no forward pass through the model, no loss computation, no backpropagation. The most expensive operation is randomised SVD (rank 256) per tensor, which runs efficiently on Apple Silicon’s GPU cores via MLX.
Analysis-to-inference continuity
The quantized model produced by MINT is in native MLX format. There is no conversion step. The output goes directly into mlx_lm for inference on the same machine. Analyse, quantize, and serve—all on one device.
The Apple Silicon Model Guide
Based on MINT’s budget curves and the memory capacities of current Apple Silicon configurations, here is what you can realistically run:
| Apple Silicon | Unified Memory | Usable Budget* | What Fits |
|---|---|---|---|
| M4 | 16–32 GB | ~12–24 GB | 8B dense, small MoE at tight budgets |
| M4 Pro | 24–48 GB | ~18–36 GB | 30B MoE (Qwen3-30B at 19 GB = +0.6% PPL) |
| M4 Max | 36–128 GB | ~28–100 GB | 109B MoE (Scout at 64 GB = −2.5% vs uniform 4-bit) |
| M2/M3 Ultra | 128–192 GB | ~100–160 GB | 109B MoE at high quality (−6.8% vs uniform 4-bit) |
| M4 Ultra (expected) | up to 512 GB | ~400 GB | 400B+ dense models at high precision |
*Usable budget accounts for OS, KV cache, and inference overhead. Actual available memory depends on workload.
The critical insight: MINT’s budget-targeted allocation means you don’t guess which quantization preset to use. You tell MINT exactly how much memory you have, and it returns the mathematically optimal allocation for that budget. Different Mac? Different budget. Same pipeline, different optimal answer.
Group Size g=32: Why It Matters for MLX
MINT’s most surprising finding is that 85.2% of tensors are allocated group size 32 rather than the conventional 128. This is directly relevant to MLX deployment:
- MLX natively supports g=32. The `mlx_lm.convert` tool accepts `--group-size 32` directly. No custom kernels needed.
- Storage overhead is modest. Going from g=128 to g=32 adds ~0.125 bytes/parameter for the extra scale and zero-point values. For a 30B model, that’s ~3.5 GB of additional overhead.
- Quality improvement is substantial. The finer granularity of g=32 captures per-group weight distributions more accurately, closing up to 94% of the gap between uniform 4-bit and BF16.
- Dequantization cost on Apple Silicon is low. MLX’s Metal-based dequantization kernels run on the GPU cores. The 4× increase in groups from g=128 to g=32 adds minimal latency because the operation is memory-bound and the scales are stored contiguously.
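The storage arithmetic is worth making explicit. This sketch assumes one fp16 scale and one fp16 zero-point per group (4 metadata bytes per group); if a format stores its metadata differently, the numbers shift accordingly.

```python
def bytes_per_param(bits, group_size, meta_bytes_per_group=4):
    """Storage cost per parameter for group-wise quantization.

    meta_bytes_per_group assumes one fp16 scale plus one fp16
    zero-point per group (2 + 2 bytes) — an assumption, not a
    statement about any particular deployment format.
    """
    return bits / 8 + meta_bytes_per_group / group_size

for g in (128, 64, 32):
    print(f"4-bit g={g}: {bytes_per_param(4, g):.4f} B/param")

# Extra metadata for a 30B-parameter model when moving g=128 -> g=32
delta_gb = (bytes_per_param(4, 32) - bytes_per_param(4, 128)) * 30e9 / 1e9
print(f"extra metadata at g=32: {delta_gb:.2f} GB")
```

Treat these as ballpark figures: the exact overhead depends on the metadata dtype the deployment format uses.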
No Calibration Data: Why This Matters for On-Device
Calibration-based methods like GPTQ and AWQ require representative input data to compute sensitivity. For on-device deployment, this creates three problems:
- Privacy. If you’re quantizing a model for local use on a Mac—perhaps for an enterprise deploying on-premises—sending proprietary data through a calibration pipeline on a cloud GPU defeats the purpose of local deployment.
- Representativeness. A calibration set from English Wikipedia may not represent the Japanese legal documents your deployment processes. MINT avoids this by using only the weights themselves.
- Compute requirements. Calibration requires forward passes through the model, which typically demand GPU clusters. MINT’s data-free pipeline runs entirely on the same Mac that will serve the model.
And the quality? In matched-size comparisons against GPTQ across three MoE model families, MINT wins every time:
| Model | GPTQ PPL | MINT PPL | MINT Advantage |
|---|---|---|---|
| Qwen3-30B-A3B | 9.122 | 8.970 | −1.7% |
| Qwen2-57B-A14B | 6.390 | 6.329 | −0.95% |
| Mixtral-8x7B | 4.608 | 4.264 | −4.6% |
A data-free method, running on a Mac, producing better results than the gold-standard calibration method running on GPU clusters with representative data.
The Workflow: From Download to Deployment
MINT is open source (MIT). Clone, install, and run it on your Mac today. For someone with a Mac and a model they want to deploy, the workflow on Apple Silicon looks like this:
1. Download the model from Hugging Face in safetensors format
2. Run MINT analysis—streams through shards, computes features and RD curves (~10–50 min depending on model size)
3. Specify your memory budget—e.g., “24 GB” for an M4 Pro with 32 GB total
4. MINT solves the MCKP—returns the optimal per-tensor (bit-width, group-size) manifest in <1 second
5. Apply quantization—produces MLX-native quantized weights
6. Run inference—load directly with `mlx_lm` on the same machine
Step 4 is the key innovation. Because allocation takes under 1 second, you can run it for every hardware target you care about from a single analysis pass. One analysis of Qwen3-30B produces optimal allocations for iPhone (15 GB), M4 Pro (30 GB), M4 Max (64 GB), and M2 Ultra (160 GB)—all in seconds.
What This Means
MINT on MLX eliminates three barriers to deploying large language models on Apple Silicon:
No GPU Required
The entire pipeline—analysis, allocation, quantization, and inference—runs on Apple Silicon. No NVIDIA hardware at any stage.
No Data Required
Zero calibration data. The model’s weights are the only input. Deploy models for domains where representative data doesn’t exist or can’t leave the building.
No Guessing
Tell MINT your memory budget. It solves for the provably optimal allocation. No manual tuning, no presets, no “try 4-bit and hope for the best.”
A Mac Studio with an M2 Ultra is now a complete model compression laboratory. It can analyse, optimise, quantize, and serve models up to 109B parameters—producing results that beat calibration-based methods running on GPU clusters. The entire MINT paper is proof.
For the full technical details, see the MINT paper. For findings on specific topics, see our articles on group size as the primary quality lever, the SQNR safety threshold, and budget-targeted quantization.