Every result in the RAM paper was produced on a single Apple M2 Ultra. No NVIDIA GPUs. No cloud instances. No calibration data. Just Apple Silicon and unified memory. This article explains why that matters and what it means for anyone with a Mac.
The Hardware Nobody Expected
When quantization papers report their results, the hardware section typically reads like a shopping list from a data centre: 8×A100 80 GB, 4×H100, a DGX cluster. The RAM paper’s hardware section reads differently:
Reproduction environment (Appendix C):
- Hardware: Apple M2 Ultra, 192 GB unified memory
- Framework: MLX 0.30.3, mlx_lm 0.30.4
- Python: 3.12.0
- Quantization: Group-wise round-to-nearest (RTN)
That is the entire compute stack. A Mac Studio sitting on a desk. The same machine ran every experiment, from an 8B dense model to a 109B parameter Mixture-of-Experts model with 16 experts. The largest model (Llama-4-Scout at ~203 GB in BF16) cannot even fit in memory unquantized, yet RAM analysed, allocated, and quantized it to run on that same machine.
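Appendix C names the quantization scheme: group-wise round-to-nearest. A minimal, framework-agnostic sketch of that scheme in pure Python (illustrative only; MLX's real kernels pack the integers and run on the GPU):

```python
# Minimal sketch of group-wise round-to-nearest (RTN) quantization in
# pure Python. Illustrative only: MLX's real kernels pack the integers
# and run on the GPU.

def quantize_group(values, bits=4):
    """Map one group of floats to asymmetric bits-wide integer levels."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1               # 15 levels for 4-bit
    scale = (hi - lo) / levels or 1.0      # guard against constant groups
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo                    # ints + per-group scale/offset

def dequantize_group(q, scale, lo):
    return [lo + qi * scale for qi in q]

def rtn(weights, bits=4, group_size=32):
    """Group-wise RTN over a flat list of weights."""
    out = []
    for i in range(0, len(weights), group_size):
        q, scale, lo = quantize_group(weights[i:i + group_size], bits)
        out.extend(dequantize_group(q, scale, lo))
    return out

w = [0.013 * i - 0.2 for i in range(64)]   # toy "tensor"
w_hat = rtn(w, bits=4, group_size=32)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))  # bounded by scale/2
```

Each group stores its own scale and offset, which is exactly why smaller groups (the g=32 finding discussed later) track local weight distributions more tightly.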
Why MLX Changes the Equation
Apple’s MLX framework is purpose-built for Apple Silicon. Two properties make it uniquely suited to RAM’s proprietary compression pipeline:
Unified Memory
CPU and GPU share the same memory pool. A 192 GB M2 Ultra can hold 192 GB of model weights without copying between devices. No PCIe bottleneck, no VRAM limit separate from system RAM.
Native Quantization
MLX supports group-wise quantization natively at multiple bit-widths and group sizes. The quantized format is the deployment format: what RAM produces runs directly for inference on the same machine.
This is not a compromise. Unified memory is an advantage for quantization workloads. Traditional GPU setups are constrained by VRAM (24 GB on an RTX 4090, 80 GB on an A100). A 109B parameter model in BF16 requires ~203 GB; no single GPU can hold it. On Apple Silicon, the M2 Ultra's 192 GB is one contiguous memory space accessible by all compute units.
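The sizes quoted here follow from simple arithmetic. A quick sketch, assuming for the quantized case that metadata is an fp16 scale plus fp16 bias per group (an assumption about the storage format, not a documented MLX constant):

```python
# Back-of-the-envelope model sizes. Assumes quantization metadata is an
# fp16 scale + fp16 bias per group (32 bits), which is an assumption
# about the storage format, not a documented MLX constant.

def model_gib(params_b, bits, group_size=None, meta_bits=32):
    """Approximate size in GiB of a params_b-billion-parameter model."""
    params = params_b * 1e9
    total_bits = params * bits
    if group_size:
        total_bits += (params / group_size) * meta_bits
    return total_bits / 8 / 2**30

bf16 = model_gib(109, 16)               # Llama-4-Scout in BF16, ~203 GiB
q4 = model_gib(109, 4, group_size=32)   # 4-bit with g=32 metadata
```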
What RAM Actually Does on Apple Silicon
RAM’s proprietary compression pipeline analyses every weight tensor in a model (18,867 in Qwen3-30B alone) and determines the optimal compression strategy for each one. The entire pipeline runs natively on Apple Silicon.
The key input is your memory budget: “fit this model in 30 GB” or “fit it in 64 GB”. RAM’s allocator finds the provably optimal per-tensor configuration for that budget in under 1 second. No manual tuning, no presets.
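RAM's allocator itself is proprietary, but the shape of the problem can be sketched: given candidate configurations per tensor, each with a size and an estimated error, spend a fixed byte budget where it reduces error most. A toy greedy version under those assumptions, with made-up tensor names and costs:

```python
# Toy budget-targeted allocator sketch: NOT RAM's proprietary algorithm.
# Each tensor has candidate configs (size_bytes, error), cheapest first.
# Start everything at the cheapest config, then greedily spend the
# remaining budget where it buys the most error reduction per byte.
import heapq

def upgrade_gain(cfgs, i):
    """Error reduced per extra byte when moving config i -> i+1."""
    (s0, e0), (s1, e1) = cfgs[i], cfgs[i + 1]
    return (e0 - e1) / (s1 - s0)

def allocate(tensors, budget_bytes):
    choice = {name: 0 for name in tensors}
    spent = sum(cfgs[0][0] for cfgs in tensors.values())
    heap = [(-upgrade_gain(cfgs, 0), name)
            for name, cfgs in tensors.items() if len(cfgs) > 1]
    heapq.heapify(heap)
    while heap:
        _, name = heapq.heappop(heap)
        cfgs, i = tensors[name], choice[name]
        extra = cfgs[i + 1][0] - cfgs[i][0]
        if spent + extra > budget_bytes:
            continue  # cannot afford this upgrade; try the next best
        spent += extra
        choice[name] = i + 1
        if choice[name] + 1 < len(cfgs):
            heapq.heappush(heap, (-upgrade_gain(cfgs, choice[name]), name))
    return choice, spent

# Hypothetical tensors: two configs each (e.g. a low- and a high-bit option).
tensors = {"attn.q_proj": [(10, 1.0), (20, 0.2)],
           "mlp.gate":    [(10, 0.8), (20, 0.5)]}
choice, spent = allocate(tensors, budget_bytes=30)
```

A greedy pass like this is only a heuristic; the paper claims a provably optimal allocation, which this sketch does not reproduce.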
Timing: How Fast Is It?
From the RAM paper, Table 11, all timings on Apple M2 Ultra 192 GB:
| Model | Parameters | Tensors | Analysis | Allocation | Total |
|---|---|---|---|---|---|
| Qwen3-8B | 8B (dense) | 399 | 3 min | <1s | ~10 min |
| GLM-4.7-Flash | 30B (dense) | 9,703 | 39 min | <1s | ~44 min |
| Qwen3-30B-A3B | 30B (MoE) | 18,867 | 50 min | <1s | ~54 min |
| Llama-4-Scout | 109B (MoE) | ~1,000 | 45 min | <1s | ~50 min |
A 109B parameter model, analysed, optimised, and allocated in roughly 50 minutes. On a Mac. Compare this to calibration-based methods like GPTQ, which require GPU clusters, representative datasets, and hours of Hessian computation for models of this scale.
Note the allocation time: under 1 second in every case. This means once you’ve run the analysis pass, you can re-target the same model to any number of hardware budgets instantly. Analyse once, deploy everywhere.
Budget-Targeted Deployment: Name Your Hardware
RAM’s defining feature is budget-targeted quantization. You specify exactly how much memory you have, and RAM returns the optimal allocation. The paper demonstrates this with Qwen3-30B-A3B across specific Apple Silicon and GPU targets:
| Target Hardware | Memory Budget | Model Size | Mean PPL | Δ vs BF16 |
|---|---|---|---|---|
| iPhone 16 Pro | 15.3 GB | 16.13 GB | 8.970 | +2.8% |
| RTX 4070 | 20.0 GB | 19.32 GB | 8.784 | +0.6% |
| RTX 4090 | 25.0 GB | 27.39 GB | 8.760 | +0.4% |
| Mac M4 Pro | 30.0 GB | 30.75 GB | 8.657 | −0.8% |
| BF16 (no quantization) | - | 56.87 GB | 8.728 | - |
At the Mac M4 Pro budget of 30 GB, RAM produces a 30.75 GB quantized model with a mean perplexity of 8.657, within 1% of the full BF16 model (8.728) at 54% of the size. RAM’s proprietary allocator distributes bits where they matter most, closing 94% of the gap between uniform 4-bit and BF16 at the 19 GB budget point.
The 109B Model on a Mac Studio
The most striking demonstration is Llama-4-Scout, Meta’s 109B parameter MoE model with 16 experts and 17B active parameters per token. In BF16, it requires ~203 GB. No consumer hardware can run it. Here is what RAM does with it on the M2 Ultra:
| Configuration | Size | Mean PPL | Assessment |
|---|---|---|---|
| BF16 | ~203 GB | — | Exceeds memory; cannot run |
| Uniform 4-bit | 56.9 GB | 7.899 | Baseline |
| RAM @ 192 GB | 163.24 GB | 7.359 | −6.8% vs uniform 4-bit |
| RAM @ 64 GB | 58.03 GB | 7.703 | −2.5% vs uniform 4-bit |
| RAM @ 50 GB | 51.98 GB | 7.980 | +1.0% vs uniform 4-bit |
| RAM min-safe | 46.93 GB | 8.675 | +9.8% vs uniform 4-bit |
At the M2 Ultra’s 192 GB budget, RAM produces a 163 GB model that beats uniform 4-bit quantization by 6.8% (7.359 vs 7.899). At 64 GB (feasible on an M4 Max), it still beats uniform 4-bit by 2.5%. A model that was previously impossible to deploy on consumer hardware now runs with excellent quality.
RAM’s built-in safety mechanisms prevent over-compression. Without them, aggressive 2-bit compression would produce a compact 34.6 GB model that is catastrophically broken (PPL 23.6, nearly triple the baseline). RAM’s proprietary quality thresholds automatically block these dangerous configurations and redistribute capacity to where it actually helps.
Why Apple Silicon Is Ideal for This Workload
RAM’s proprietary compression pipeline has computational characteristics that map perfectly to Apple Silicon’s architecture:
Memory-bound, not compute-bound
Model compression is dominated by reading weights and computing per-tensor statistics. Unified memory eliminates the CPU↔GPU transfer bottleneck that plagues discrete GPU setups. The entire model is already where the compute happens.
No gradient computation
RAM is entirely data-free: no forward pass through the model, no loss computation, no backpropagation. This makes it fundamentally less demanding than calibration-based methods.
Analysis-to-inference continuity
The compressed model produced by RAM runs directly for inference on the same machine. Analyse, compress, and serve on one device, with no conversion steps required.
The Apple Silicon Model Guide
Based on RAM’s budget curves and the memory capacities of current Apple Silicon configurations, here is what you can realistically run:
| Apple Silicon | Unified Memory | Usable Budget* | What Fits |
|---|---|---|---|
| M4 | 16–32 GB | ~12–24 GB | 8B dense, small MoE at tight budgets |
| M4 Pro | 24–48 GB | ~18–36 GB | 30B MoE (Qwen3-30B at 19 GB = +0.6% PPL) |
| M4 Max | 36–128 GB | ~28–100 GB | 109B MoE (Scout at 64 GB = −2.5% vs uniform 4-bit) |
| M2/M3 Ultra | 128–192 GB | ~100–160 GB | 109B MoE at high quality (−6.8% vs uniform 4-bit) |
| M4 Ultra (expected) | up to 512 GB | ~400 GB | 400B+ dense models at high precision |
*Usable budget accounts for OS, KV cache, and inference overhead. Actual available memory depends on workload.
The critical insight: RAM’s budget-targeted allocation means you don’t guess which quantization preset to use. You tell RAM exactly how much memory you have, and it returns the mathematically optimal allocation for that budget. Different Mac? Different budget. Same pipeline, different optimal answer.
Group Size g=32: Why It Matters for MLX
RAM’s most surprising finding is that 85.2% of tensors are allocated group size 32 rather than the conventional 128. This is directly relevant to Apple Silicon deployment:
- Apple Silicon natively supports g=32. No custom kernels or workarounds needed.
- Storage overhead is modest. Going from g=128 to g=32 adds ~0.125 bytes/parameter. For a 30B model, that’s ~3.5 GB additional overhead.
- Quality improvement is substantial. The finer granularity of g=32 captures per-group weight distributions more accurately, closing up to 94% of the gap between uniform 4-bit and BF16.
- Performance cost is minimal. The 4× increase in groups from g=128 to g=32 adds negligible latency because the operation is memory-bound on Apple Silicon.
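The overhead figures follow from simple arithmetic, assuming an fp16 scale plus fp16 bias stored once per group (an assumption about the storage layout, not a documented MLX constant):

```python
# Per-parameter metadata cost of group-wise quantization, assuming an
# fp16 scale + fp16 bias (32 bits) stored once per group. This layout
# is an assumption, not a documented MLX constant.

def overhead_bytes_per_param(group_size, meta_bits=32):
    return meta_bits / group_size / 8

g32 = overhead_bytes_per_param(32)       # 0.125 bytes per parameter
g128 = overhead_bytes_per_param(128)     # 0.03125 bytes per parameter
g32_total_gib_30b = g32 * 30e9 / 2**30   # metadata for a 30B model at g=32
```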
No Calibration Data: Why This Matters for On-Device
Calibration-based methods like GPTQ and AWQ require representative input data to compute sensitivity. For on-device deployment, this creates three problems:
- Privacy. If you’re quantizing a model for local use on a Mac, perhaps for an enterprise deploying on-premises, sending proprietary data through a calibration pipeline on a cloud GPU defeats the purpose of local deployment.
- Representativeness. A calibration set from English Wikipedia may not represent the Japanese legal documents your deployment processes. RAM avoids this by using only the weights themselves.
- Compute requirements. Calibration requires forward passes through the model, which typically demand GPU clusters. RAM’s data-free pipeline runs entirely on the same Mac that will serve the model.
And the quality? In matched-size comparisons against GPTQ across three MoE model families, RAM wins every time:
| Model | GPTQ PPL | RAM PPL | RAM Advantage |
|---|---|---|---|
| Qwen3-30B-A3B | 9.122 | 8.970 | −1.7% |
| Qwen2-57B-A14B | 6.390 | 6.329 | −0.95% |
| Mixtral-8x7B | 4.608 | 4.264 | −4.6% |
A data-free method, running on a Mac, producing better results than the gold-standard calibration method running on GPU clusters with representative data.
The Workflow: From Download to Deployment
For someone with a Mac and a model they want to deploy, the RAM workflow on Apple Silicon looks like this:
Get the code
RAM is open source (MIT). Clone, install, and run on your Mac today.
1. Download the model from Hugging Face in safetensors format.
2. Run RAM analysis: the proprietary compression pipeline analyses the model (~10–50 min depending on model size).
3. Specify your memory budget, e.g. “24 GB” for an M4 Pro with 32 GB total.
4. RAM allocates: the optimal per-tensor compression configuration is returned in <1 second.
5. Apply compression: RAM produces optimised weights ready for inference.
6. Run inference on the same machine, with no conversion steps.
Step 4 is the key innovation. Because allocation takes under 1 second, you can target any number of hardware budgets from a single analysis pass. One analysis of Qwen3-30B produces optimal allocations for iPhone (15 GB), M4 Pro (30 GB), M4 Max (64 GB), and M2 Ultra (160 GB), all in seconds.
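The analyse-once, allocate-everywhere pattern behind the steps above can be sketched as follows; `analyse` and `allocate_for_budget` are illustrative stand-ins, not RAM's actual API:

```python
# Hypothetical sketch of "analyse once, deploy everywhere".
# `analyse` and `allocate_for_budget` are illustrative stand-ins,
# not RAM's actual API.

BUDGETS_GB = {"iPhone 16 Pro": 15, "M4 Pro": 30,
              "M4 Max": 64, "M2 Ultra": 160}

def analyse(model_name):
    """Stand-in for the slow analysis pass (minutes in practice):
    computes per-tensor statistics once."""
    return {"model": model_name, "stats": {}}

def allocate_for_budget(analysis, budget_gb):
    """Stand-in for the <1 s allocation step, which solves for the
    optimal per-tensor configuration under the given budget."""
    return {"model": analysis["model"], "budget_gb": budget_gb}

analysis = analyse("Qwen3-30B-A3B")             # run once
plans = {target: allocate_for_budget(analysis, gb)
         for target, gb in BUDGETS_GB.items()}  # re-target in seconds
```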
What This Means
RAM on MLX eliminates three barriers to deploying large language models on Apple Silicon:
No GPU Required
The entire pipeline (analysis, allocation, quantization, and inference) runs on Apple Silicon. No NVIDIA hardware at any stage.
No Data Required
Zero calibration data. The model’s weights are the only input. Deploy models for domains where representative data doesn’t exist or can’t leave the building.
No Guessing
Tell RAM your memory budget. It solves for the provably optimal allocation. No manual tuning, no presets, no “try 4-bit and hope for the best.”
A Mac Studio with an M2 Ultra is now a complete model compression laboratory. It can analyse, optimise, quantize, and serve models up to 109B parameters, producing results that beat calibration-based methods running on GPU clusters. The entire RAM paper is proof.
For the full technical details, see the RAM paper on HuggingFace.
Read the Full Paper
The complete RAM paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full optimal allocation framework, is available on our Hugging Face page:
RAM: Compute-Optimal Proprietary Compression for LLMs, Full Paper
huggingface.co/spaces/baa-ai/RAM
Licensed under CC BY-NC-ND 4.0