
MLX Quantization on Apple Silicon: How MINT Turns a Mac into a Model Compression Lab

March 2026 · Black Sheep AI Research

Every result in the MINT paper was produced on a single Apple M2 Ultra. No NVIDIA GPUs. No cloud instances. No calibration data. Just Apple Silicon, unified memory, and the MLX framework. This article explains why that matters and what it means for anyone with a Mac.

The Hardware Nobody Expected

When quantization papers report their results, the hardware section typically reads like a shopping list from a data centre: 8×A100 80 GB, 4×H100, a DGX cluster. The MINT paper’s hardware section reads differently:

Reproduction environment (Appendix C): a single Mac Studio with an Apple M2 Ultra and 192 GB of unified memory, running MLX.

That is the entire compute stack. A Mac Studio sitting on a desk. The same machine ran every experiment—from an 8B dense model to a 109B parameter Mixture-of-Experts model with 16 experts. The largest model (Llama-4-Scout at ~203 GB in BF16) cannot even fit in memory unquantized, yet MINT analysed, allocated, and quantized it to run on that same machine.

Why MLX Changes the Equation

Apple’s MLX framework is purpose-built for Apple Silicon. Three properties make it uniquely suited to MINT’s pipeline:

Unified Memory

CPU and GPU share the same memory pool. A 192 GB M2 Ultra can hold 192 GB of model weights without copying between devices. No PCIe bottleneck, no VRAM limit separate from system RAM.

Lazy Evaluation

MLX uses lazy evaluation—tensors are only materialised when needed. MINT streams through model shards one tensor at a time, computing features and rate-distortion curves without ever loading the full model into memory simultaneously.

Native Quantization

MLX supports group-wise quantization natively at 2, 4, and 8-bit with configurable group sizes (32, 64, 128). The quantized format is the deployment format—what MINT produces is exactly what mlx_lm loads for inference.
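As a concrete illustration of the scheme (a NumPy sketch of generic group-wise affine quantization, not MLX's actual kernel or packed storage format):

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=32):
    """Group-wise affine quantization: every `group_size` consecutive
    weights share one scale and offset, so an outlier only distorts
    its own group."""
    levels = 2**bits - 1
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.clip(np.round((groups - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo

def dequantize_groupwise(q, scale, lo, shape):
    """Restore approximate weights from codes, scales, and offsets."""
    return (q * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, scale, lo = quantize_groupwise(w)                 # 4-bit codes, g=32
w_hat = dequantize_groupwise(q, scale, lo, w.shape)
max_err = float(np.abs(w - w_hat).max())             # bounded by half a step per group
```

Smaller groups mean more scales to store (extra overhead) but tighter ranges per group; that quality-versus-size trade is exactly what MINT's allocator arbitrates per tensor.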

This is not a compromise. Unified memory is an advantage for quantization workloads. Traditional GPU setups are constrained by VRAM (24 GB on an RTX 4090, 80 GB on an A100). A 109B parameter model in BF16 requires ~203 GB—no single GPU can hold it. On Apple Silicon, the M2 Ultra’s 192 GB is one contiguous memory space accessible by all compute units.
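The memory arithmetic behind those figures is easy to verify. A quick check, counting weight storage only (this ignores per-group scales, activations, and the GB/GiB distinction, which is why real quantized checkpoints come out somewhat larger):

```python
def weight_footprint_gib(params, bits_per_weight):
    """Approximate weight-storage footprint in GiB."""
    return params * bits_per_weight / 8 / 2**30

# ~109B parameters at BF16 (16 bits) versus uniform 4-bit
bf16_gib = weight_footprint_gib(109e9, 16)  # ~203 GiB: exceeds any single GPU
int4_gib = weight_footprint_gib(109e9, 4)   # ~51 GiB before group-scale overhead
```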

What MINT Actually Does on Apple Silicon

MINT’s pipeline has two passes, both designed for memory-efficient streaming on MLX:

Pass 1: Feature Extraction

For each weight tensor in the model (there are 18,867 in Qwen3-30B alone), MINT computes a set of data-free features: spectral statistics from a randomised SVD, weight kurtosis, and per-configuration rate-distortion curves obtained through a quantise-measure-restore loop.

All of this runs on the M2 Ultra’s compute units via MLX. The randomised SVD, the kurtosis calculations, the quantise-measure-restore loop for rate-distortion curves—every operation is a native MLX tensor operation running on Apple Silicon.
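A rough sketch of what such a per-tensor feature pass could look like (illustrative NumPy with a naive min-max quantizer; MINT's actual features and quantizer are more involved):

```python
import numpy as np

def excess_kurtosis(w):
    """Heavy-tailed tensors (high kurtosis) have outliers that make
    coarse uniform quantization disproportionately lossy."""
    x = w.ravel() - w.mean()
    return float((x**4).mean() / (x**2).mean() ** 2 - 3.0)

def rd_point(w, bits, group_size=32):
    """One rate-distortion point via quantise-measure-restore:
    quantize the tensor, measure MSE against the original, discard."""
    levels = 2**bits - 1
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    w_hat = np.clip(np.round((g - lo) / scale), 0, levels) * scale + lo
    return bits, float(((g - w_hat) ** 2).mean())

rng = np.random.default_rng(0)
w = rng.standard_t(df=4, size=(128, 128)).astype(np.float32)  # heavy-tailed weights
features = {
    "kurtosis": excess_kurtosis(w),
    "rd_curve": [rd_point(w, b) for b in (2, 4, 8)],  # MSE falls as bits rise
}
```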

Pass 2: Allocation

Pass 2 takes the collected features and solves a Multiple-Choice Knapsack Problem (MCKP) to find the optimal (bit-width, group-size) for each tensor under a user-specified memory budget. The MCKP solver runs in under 1 second for any model size. The budget is the key input: “fit this model in 30 GB” or “fit it in 64 GB”—and MINT returns the provably optimal allocation.
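The structure of that optimisation can be sketched with a small dynamic program (illustrative; the paper does not specify that MINT's sub-second solver works this way). Each tensor contributes a menu of (cost, distortion) options, here in arbitrary integer size units with toy distortion numbers:

```python
def solve_mckp(options, budget):
    """Multiple-Choice Knapsack: pick exactly one (cost, distortion)
    option per tensor, minimising total distortion subject to
    total cost <= budget. DP over integer cost units."""
    INF = float("inf")
    best = [0.0] + [INF] * budget            # best[c]: min distortion at total cost c
    choice = [[None] * (budget + 1) for _ in options]
    for t, opts in enumerate(options):
        new = [INF] * (budget + 1)
        for c in range(budget + 1):
            if best[c] == INF:
                continue
            for i, (cost, dist) in enumerate(opts):
                nc = c + cost
                if nc <= budget and best[c] + dist < new[nc]:
                    new[nc] = best[c] + dist
                    choice[t][nc] = (i, c)   # remember option and predecessor cost
        best = new
    end = min(range(budget + 1), key=lambda c: best[c])
    picks, c = [], end
    for t in range(len(options) - 1, -1, -1):  # backtrack one choice per tensor
        i, c = choice[t][c]
        picks.append(i)
    return picks[::-1], best[end]

# toy menus: (size, distortion) for 2-, 4-, and 8-bit candidates
menus = [
    [(2, 9.0), (4, 1.0), (8, 0.1)],   # sensitive tensor: extra bits help a lot
    [(2, 0.5), (4, 0.4), (8, 0.39)],  # insensitive tensor: 2-bit is nearly free
]
picks, total_dist = solve_mckp(menus, budget=6)
# gives the sensitive tensor 4 bits and the insensitive one 2 bits
```

Sub-second allocation is plausible because the knapsack operates on a few thousand tensors with a handful of (bit-width, group-size) options each, never on the weights themselves.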

Timing: How Fast Is It?

From the MINT paper, Table 11—all timings on Apple M2 Ultra 192 GB:

| Model | Parameters | Tensors | Analysis | Allocation | Total |
|---|---|---|---|---|---|
| Qwen3-8B | 8B (dense) | 399 | 3 min | <1 s | ~10 min |
| GLM-4.7-Flash | 30B (dense) | 9,703 | 39 min | <1 s | ~44 min |
| Qwen3-30B-A3B | 30B (MoE) | 18,867 | 50 min | <1 s | ~54 min |
| Llama-4-Scout | 109B (MoE) | ~1,000 | 45 min | <1 s | ~50 min |

A 109B parameter model—analysed, optimised, and allocated in under 50 minutes. On a Mac. Compare this to calibration-based methods like GPTQ, which require GPU clusters, representative datasets, and hours of Hessian computation for models of this scale.

Note the allocation time: under 1 second in every case. This means once you’ve run the analysis pass, you can re-target the same model to any number of hardware budgets instantly. Analyse once, deploy everywhere.

Budget-Targeted Deployment: Name Your Hardware

MINT’s defining feature is budget-targeted quantization. You specify exactly how much memory you have, and MINT returns the optimal allocation. The paper demonstrates this with Qwen3-30B-A3B across specific Apple Silicon and GPU targets:

| Target Hardware | Memory Budget | Model Size | Mean PPL | Δ vs BF16 |
|---|---|---|---|---|
| iPhone 16 Pro | 15.3 GB | 16.13 GB | 8.970 | +2.8% |
| RTX 4070 | 20.0 GB | 19.32 GB | 8.784 | +0.6% |
| RTX 4090 | 25.0 GB | 27.39 GB | 8.760 | +0.4% |
| Mac M4 Pro | 30.0 GB | 30.75 GB | 8.657 | −0.8% |
| BF16 (no quantization) | – | 56.87 GB | 8.728 | – |

At the Mac M4 Pro budget of 30 GB, MINT produces a 30.75 GB quantized model with a mean perplexity of 8.657—within 1% of the full BF16 model (8.728) at 54% of the size. The MCKP allocator distributes bits where they matter most, closing 94% of the gap between uniform 4-bit and BF16 at the 19 GB budget point.

The 109B Model on a Mac Studio

The most striking demonstration is Llama-4-Scout—Meta’s 109B parameter MoE model with 16 experts and 17B active parameters per token. In BF16, it requires ~203 GB. No consumer hardware can run it. Here is what MINT does with it on the M2 Ultra:

| Configuration | Size | Mean PPL | Assessment |
|---|---|---|---|
| BF16 | ~203 GB | – | Exceeds memory; cannot run |
| Uniform 4-bit | 56.9 GB | 7.899 | Baseline |
| MINT @ 192 GB | 163.24 GB | 7.359 | −6.8% vs uniform 4-bit |
| MINT @ 64 GB | 58.03 GB | 7.703 | −2.5% vs uniform 4-bit |
| MINT @ 50 GB | 51.98 GB | 7.980 | +1.0% vs uniform 4-bit |
| MINT min-safe | 46.93 GB | 8.675 | +9.8% vs uniform 4-bit |

The BF16 version of Scout requires ~203 GB and cannot run on any consumer hardware. At the M2 Ultra’s 192 GB budget, MINT produces a 163 GB model that is 6.8% better than uniform 4-bit quantization (7.359 vs 7.899). At 64 GB (feasible on an M4 Max), it still beats uniform 4-bit by 2.5%. A model that was previously impossible to deploy on consumer hardware now runs with excellent quality.

The key enabling technology is the SQNR safety veto. Without it, MINT would aggressively compress tensors to 2-bit, producing a compact 34.6 GB model that is catastrophically broken (PPL 23.6—nearly triple the baseline). The 9 dB SQNR threshold automatically blocks these dangerous configurations, and the MCKP allocator redistributes those bits to where they actually help.
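The veto is a simple gate on top of the allocator's candidate set. A sketch, assuming SQNR is measured per tensor against each candidate reconstruction (the 9 dB threshold is from the paper; everything else here is illustrative):

```python
import numpy as np

def sqnr_db(w, w_hat):
    """Signal-to-quantization-noise ratio in decibels."""
    signal = float((w.astype(np.float64) ** 2).sum())
    noise = float(((w.astype(np.float64) - w_hat.astype(np.float64)) ** 2).sum())
    return 10.0 * np.log10(signal / noise)

def safe_candidates(candidates, threshold_db=9.0):
    """Drop any candidate config whose reconstruction falls below the
    SQNR threshold, no matter how many bytes it would save."""
    return [cfg for cfg, w, w_hat in candidates if sqnr_db(w, w_hat) >= threshold_db]

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
mild = w + 0.01 * rng.normal(size=w.shape).astype(np.float32)  # ~40 dB SQNR
severe = w + rng.normal(size=w.shape).astype(np.float32)       # ~0 dB SQNR
kept = safe_candidates([("4-bit/g32", w, mild), ("2-bit/g128", w, severe)])
# the catastrophic candidate is vetoed; the allocator must spend bits here instead
```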

Why Apple Silicon Is Ideal for This Workload

MINT’s pipeline has specific computational characteristics that map perfectly to Apple Silicon’s architecture:

Memory-bound, not compute-bound

Quantization analysis is dominated by reading weights and computing relatively simple per-tensor statistics. Unified memory eliminates the CPU↔GPU transfer bottleneck that plagues discrete GPU setups. The entire model is already where the compute happens.

Streaming access pattern

MINT processes tensors sequentially—load one shard, compute features for each tensor, move to the next shard. This linear access pattern is precisely what unified memory and MLX’s lazy evaluation are designed for. No random access across the full model, no gradient graph, no backpropagation.

No gradient computation

MINT is entirely data-free—no forward pass through the model, no loss computation, no backpropagation. The most expensive operation is randomised SVD (rank 256) per tensor, which runs efficiently on Apple Silicon’s GPU cores via MLX.
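Randomised SVD in the Halko et al. style keeps that operation cheap by working in a random low-dimensional subspace, at roughly O(mn·rank) cost instead of a full decomposition. A NumPy sketch (MINT's exact variant may differ):

```python
import numpy as np

def randomized_top_singular_values(A, rank=256, oversample=10, seed=0):
    """Project A onto a random (rank + oversample)-dimensional subspace,
    orthonormalise, then take an exact SVD of the small projection."""
    rng = np.random.default_rng(seed)
    k = min(rank + oversample, min(A.shape))
    omega = rng.normal(size=(A.shape[1], k))      # random test matrix
    Q, _ = np.linalg.qr(A @ omega)                # orthonormal basis for the range
    s = np.linalg.svd(Q.T @ A, compute_uv=False)  # SVD of a small k x n matrix
    return s[:rank]

rng = np.random.default_rng(1)
# synthetic near-low-rank "weight matrix"
A = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 300))
A += 0.01 * rng.normal(size=(300, 300))
s_fast = randomized_top_singular_values(A, rank=20)
s_full = np.linalg.svd(A, compute_uv=False)[:20]  # agrees to within a few percent
```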

Analysis-to-inference continuity

The quantized model produced by MINT is in native MLX format. There is no conversion step. The output goes directly into mlx_lm for inference on the same machine. Analyse, quantize, and serve—all on one device.

The Apple Silicon Model Guide

Based on MINT’s budget curves and the memory capacities of current Apple Silicon configurations, here is what you can realistically run:

| Apple Silicon | Unified Memory | Usable Budget* | What Fits |
|---|---|---|---|
| M4 | 16–32 GB | ~12–24 GB | 8B dense, small MoE at tight budgets |
| M4 Pro | 24–48 GB | ~18–36 GB | 30B MoE (Qwen3-30B at 19 GB = +0.6% PPL) |
| M4 Max | 36–128 GB | ~28–100 GB | 109B MoE (Scout at 64 GB = −2.5% vs uniform 4-bit) |
| M2/M3 Ultra | 128–192 GB | ~100–160 GB | 109B MoE at high quality (−6.8% vs uniform 4-bit) |
| M4 Ultra (expected) | up to 512 GB | ~400 GB | 400B+ dense models at high precision |

*Usable budget accounts for OS, KV cache, and inference overhead. Actual available memory depends on workload.

The critical insight: MINT’s budget-targeted allocation means you don’t guess which quantization preset to use. You tell MINT exactly how much memory you have, and it returns the mathematically optimal allocation for that budget. Different Mac? Different budget. Same pipeline, different optimal answer.

Group Size g=32: Why It Matters for MLX

MINT’s most surprising finding is that 85.2% of tensors are allocated group size 32 rather than the conventional 128. This maps directly onto MLX deployment: group size 32 is one of MLX's natively supported configurations, so the fine-grained allocation loads and runs in mlx_lm without conversion or custom kernels.

No Calibration Data: Why This Matters for On-Device

Calibration-based methods like GPTQ and AWQ require representative input data to compute sensitivity. For on-device deployment, this creates three problems:

  1. Privacy. If you’re quantizing a model for local use on a Mac—perhaps for an enterprise deploying on-premises—sending proprietary data through a calibration pipeline on a cloud GPU defeats the purpose of local deployment.
  2. Representativeness. A calibration set from English Wikipedia may not represent the Japanese legal documents your deployment processes. MINT avoids this by using only the weights themselves.
  3. Compute requirements. Calibration requires forward passes through the model, which typically demand GPU clusters. MINT’s data-free pipeline runs entirely on the same Mac that will serve the model.

And the quality? In matched-size comparisons against GPTQ across three MoE model families, MINT wins every time:

| Model | GPTQ PPL | MINT PPL | MINT Advantage |
|---|---|---|---|
| Qwen3-30B-A3B | 9.122 | 8.970 | −1.7% |
| Qwen2-57B-A14B | 6.390 | 6.329 | −0.95% |
| Mixtral-8x7B | 4.608 | 4.264 | −4.6% |

A data-free method, running on a Mac, producing better results than the gold-standard calibration method running on GPU clusters with representative data.

The Workflow: From Download to Deployment

For someone with a Mac and a model they want to deploy, the MINT workflow on Apple Silicon looks like this (MINT is open source under the MIT licence; clone it from GitHub and run it on your Mac today):
  1. Download the model from Hugging Face in safetensors format
  2. Run MINT analysis—streams through shards, computes features and RD curves (~10–50 min depending on model size)
  3. Specify your memory budget—e.g., “24 GB” for an M4 Pro with 32 GB total
  4. MINT solves MCKP—returns the optimal per-tensor (bit-width, group-size) manifest in <1 second
  5. Apply quantization—produces MLX-native quantized weights
  6. Run inference—load directly with mlx_lm on the same machine

Step 4 is the key innovation. Because allocation takes under 1 second, you can run it for every hardware target you care about from a single analysis pass. One analysis of Qwen3-30B produces optimal allocations for iPhone (15 GB), M4 Pro (30 GB), M4 Max (64 GB), and M2 Ultra (160 GB)—all in seconds.
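That re-targeting loop can be sketched end-to-end. Here a Lagrangian-relaxation allocator (a near-optimal stand-in for MINT's exact MCKP solver, operating over made-up cached rate-distortion menus) is re-run against several budgets:

```python
def allocate(rd_menus, budget, iters=60):
    """For a price lam on bytes, each tensor independently picks
    argmin(distortion + lam * cost); bisect lam until the total cost
    fits the budget. Feasible whenever the cheapest config fits."""
    def pick(lam):
        return [min(opts, key=lambda o: o[1] + lam * o[0]) for opts in rd_menus]
    lo, hi = 0.0, 1e9
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(cost for cost, _ in pick(mid)) > budget:
            lo = mid   # too expensive: raise the price of bytes
        else:
            hi = mid   # fits: try a lower price (better quality)
    return pick(hi)

# cached (size, distortion) menus from one analysis pass (toy numbers)
rd_menus = [
    [(2, 9.0), (4, 1.0), (8, 0.1)],
    [(2, 0.5), (4, 0.4), (8, 0.39)],
    [(2, 3.0), (4, 0.8), (8, 0.2)],
]
results = {b: allocate(rd_menus, b) for b in (8, 12, 24)}  # analyse once, re-target thrice
```

Each call touches only the cached menus, never the weights, which is why re-targeting a new budget is effectively instant once analysis has run.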

What This Means

MINT on MLX eliminates three barriers to deploying large language models on Apple Silicon:

No GPU Required

The entire pipeline—analysis, allocation, quantization, and inference—runs on Apple Silicon. No NVIDIA hardware at any stage.

No Data Required

Zero calibration data. The model’s weights are the only input. Deploy models for domains where representative data doesn’t exist or can’t leave the building.

No Guessing

Tell MINT your memory budget. It solves for the provably optimal allocation. No manual tuning, no presets, no “try 4-bit and hope for the best.”

A Mac Studio with an M2 Ultra is now a complete model compression laboratory. It can analyse, optimise, quantize, and serve models up to 109B parameters—producing results that beat calibration-based methods running on GPU clusters. The entire MINT paper is proof.

For the full technical details, see the MINT paper. For findings on specific topics, see our articles on group size as the primary quality lever, the SQNR safety threshold, and budget-targeted quantization.


Ready to quantize models on your Apple Silicon hardware?

We deploy MINT-optimised models on Mac Studios, Mac Pros, and Apple Silicon clusters. Data-free, budget-targeted, production-ready.
