
MLX Quantization on Apple Silicon: How RAM Turns a Mac into a Model Compression Lab

March 2026 · Black Sheep AI Research

Every result in the RAM paper was produced on a single Apple M2 Ultra. No NVIDIA GPUs. No cloud instances. No calibration data. Just Apple Silicon and unified memory. This article explains why that matters and what it means for anyone with a Mac.

The Hardware Nobody Expected

When quantization papers report their results, the hardware section typically reads like a shopping list from a data centre: 8×A100 80 GB, 4×H100, a DGX cluster. The RAM paper’s hardware section reads differently:

Reproduction environment (Appendix C):

That is the entire compute stack. A Mac Studio sitting on a desk. The same machine ran every experiment, from an 8B dense model to a 109B parameter Mixture-of-Experts model with 16 experts. The largest model (Llama-4-Scout at ~203 GB in BF16) cannot even fit in memory unquantized, yet RAM analysed, allocated, and quantized it to run on that same machine.

Why MLX Changes the Equation

Apple’s MLX framework is purpose-built for Apple Silicon. Two properties make it uniquely suited to RAM’s proprietary compression pipeline:

Unified Memory

CPU and GPU share the same memory pool. A 192 GB M2 Ultra can hold 192 GB of model weights without copying between devices. No PCIe bottleneck, no VRAM limit separate from system RAM.

Native Quantization

MLX supports group-wise quantization natively at multiple bit-widths and group sizes. The quantized format is the deployment format: what RAM produces runs directly for inference on the same machine.
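As a sketch of what that group-wise format stores, here is a simplified NumPy version of affine group-wise quantization. It is an illustration of the scheme, not MLX's implementation: the real MLX kernels (`mx.quantize` / `mx.dequantize`) pack values into uint32 words and run on the GPU.

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=32):
    """Affine group-wise quantization: each group of `group_size`
    consecutive values on the last axis shares one scale and offset."""
    *lead, n = w.shape
    g = w.reshape(*lead, n // group_size, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / (2**bits - 1), 1.0)
    q = np.clip(np.round((g - lo) / scale), 0, 2**bits - 1)
    return q.astype(np.uint8), scale, lo

def dequantize_groupwise(q, scale, lo):
    """Reconstruct the weights; error is at most half a step per value."""
    g = q * scale + lo
    return g.reshape(*g.shape[:-2], -1)

w = np.random.default_rng(0).normal(size=(4, 64)).astype(np.float32)
q, scale, lo = quantize_groupwise(w, bits=4, group_size=32)
w_hat = dequantize_groupwise(q, scale, lo)
print(w_hat.shape)  # (4, 64), same shape as the original weights
```

Smaller groups give each scale fewer weights to cover, so they track local statistics more tightly at the cost of more scale/offset metadata.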

This is not a compromise. Unified memory is an advantage for quantization workloads. Traditional GPU setups are constrained by VRAM (24 GB on an RTX 4090, 80 GB on an A100); a 109B parameter model in BF16 requires ~203 GB, so no single GPU can hold it. On Apple Silicon, the M2 Ultra’s 192 GB is one contiguous memory space accessible by all compute units.
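The arithmetic behind these sizes is simple. The sketch below assumes an fp16 scale and fp16 bias per quantization group (32 metadata bits per group); the paper's exact on-disk format may differ, so treat the quantized figure as an estimate:

```python
GIB = 2**30

def size_gib(n_params, bits, group_size=None):
    """Approximate model size in GiB. Group-wise quantization adds
    32 bits of scale/bias metadata per group (fp16 each, assumed)."""
    bits_per_weight = bits + (32 / group_size if group_size else 0)
    return n_params * bits_per_weight / 8 / GIB

# Llama-4-Scout, 109B parameters:
print(round(size_gib(109e9, 16), 1))                # BF16: 203.0 GiB
print(round(size_gib(109e9, 4, group_size=64), 1))  # ~4.5 bits/weight: 57.1 GiB
```

The 4-bit estimate lands close to the 56.9 GB uniform 4-bit figure reported later in this article.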

What RAM Actually Does on Apple Silicon

RAM’s proprietary compression pipeline analyses every weight tensor in a model (18,867 tensors in Qwen3-30B alone) and determines the optimal compression strategy for each one. The entire pipeline runs natively on Apple Silicon.

The key input is your memory budget: “fit this model in 30 GB” or “fit it in 64 GB”. RAM’s allocator finds the provably optimal per-tensor configuration for that budget in under 1 second. No manual tuning, no presets.

Timing: How Fast Is It?

From the RAM paper, Table 11, all timings on Apple M2 Ultra 192 GB:

Model            Parameters    Tensors   Analysis   Allocation   Total
Qwen3-8B         8B (dense)    399       3 min      <1 s         ~10 min
GLM-4.7-Flash    30B (dense)   9,703     39 min     <1 s         ~44 min
Qwen3-30B-A3B    30B (MoE)     18,867    50 min     <1 s         ~54 min
Llama-4-Scout    109B (MoE)    ~1,000    45 min     <1 s         ~50 min

A 109B parameter model, analysed, optimised, and allocated in under 50 minutes. On a Mac. Compare this to calibration-based methods like GPTQ, which require GPU clusters, representative datasets, and hours of Hessian computation for models of this scale.

Note the allocation time: under 1 second in every case. This means once you’ve run the analysis pass, you can re-target the same model to any number of hardware budgets instantly. Analyse once, deploy everywhere.
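To see why re-targeting is cheap, here is a toy allocator. Everything below is invented for illustration (RAM's allocator is proprietary and provably optimal, whereas this is a plain greedy heuristic): the expensive part, per-tensor error estimates at each candidate bit-width, comes from the one-off analysis pass, and allocation is then a fast search over those cached numbers.

```python
from dataclasses import dataclass

@dataclass
class Option:
    bits: float   # effective bits per weight for this configuration
    error: float  # error estimate from the (cached) analysis pass

def allocate(tensors, budget_bytes):
    """Greedy budget-targeted allocation: start each tensor at its
    cheapest option, then repeatedly buy the upgrade with the best
    error reduction per byte until the budget is exhausted."""
    choice = {name: 0 for name in tensors}
    def size(name, i):
        n_params, opts = tensors[name]
        return n_params * opts[i].bits / 8
    used = sum(size(name, 0) for name in tensors)
    while True:
        best = None
        for name, (n_params, opts) in tensors.items():
            i = choice[name]
            if i + 1 < len(opts):
                gain = opts[i].error - opts[i + 1].error
                cost = size(name, i + 1) - size(name, i)
                if used + cost <= budget_bytes and (best is None or gain / cost > best[0]):
                    best = (gain / cost, name, cost)
        if best is None:
            return choice, used
        _, name, cost = best
        choice[name] += 1
        used += cost

# Invented analysis results: (parameter count, options sorted cheapest first).
tensors = {
    "attn.q": (1_000_000, [Option(2, 5.0), Option(4, 1.0), Option(8, 0.1)]),
    "mlp.up": (4_000_000, [Option(2, 2.0), Option(4, 0.5), Option(8, 0.05)]),
}
# Re-target the same analysis to two different budgets instantly:
print(allocate(tensors, budget_bytes=3_000_000)[0])  # {'attn.q': 2, 'mlp.up': 1}
print(allocate(tensors, budget_bytes=1_500_000)[0])  # {'attn.q': 1, 'mlp.up': 0}
```

The analysis pass is the slow, per-model step; everything in `allocate` is cheap dictionary arithmetic, which is why a new hardware budget costs less than a second.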

Budget-Targeted Deployment: Name Your Hardware

RAM’s defining feature is budget-targeted quantization. You specify exactly how much memory you have, and RAM returns the optimal allocation. The paper demonstrates this with Qwen3-30B-A3B across specific Apple Silicon and GPU targets:

Target Hardware          Memory Budget   Model Size   Mean PPL   Δ vs BF16
iPhone 16 Pro            15.3 GB         16.13 GB     8.970      +2.8%
RTX 4070                 20.0 GB         19.32 GB     8.784      +0.6%
RTX 4090                 25.0 GB         27.39 GB     8.760      +0.4%
Mac M4 Pro               30.0 GB         30.75 GB     8.657      −0.8%
BF16 (no quantization)   –               56.87 GB     8.728      –

At the Mac M4 Pro budget of 30 GB, RAM produces a 30.75 GB quantized model with a mean perplexity of 8.657, within 1% of the full BF16 model (8.728) at 54% of the size. RAM’s proprietary allocator distributes bits where they matter most, closing 94% of the gap between uniform 4-bit and BF16 at the 19 GB budget point.

The 109B Model on a Mac Studio

The most striking demonstration is Llama-4-Scout, Meta’s 109B parameter MoE model with 16 experts and 17B active parameters per token. In BF16, it requires ~203 GB. No consumer hardware can run it. Here is what RAM does with it on the M2 Ultra:

Configuration    Size        Mean PPL   Assessment
BF16             ~203 GB     –          Exceeds memory; cannot run
Uniform 4-bit    56.9 GB     7.899      Baseline
RAM @ 192 GB     163.24 GB   7.359      −6.8% vs uniform 4-bit
RAM @ 64 GB      58.03 GB    7.703      −2.5% vs uniform 4-bit
RAM @ 50 GB      51.98 GB    7.980      +1.0% vs uniform 4-bit
RAM min-safe     46.93 GB    8.675      +9.8% vs uniform 4-bit

At the M2 Ultra’s 192 GB budget, RAM produces a 163 GB model that is 6.8% better than uniform 4-bit quantization (7.359 vs 7.899). At 64 GB, feasible on an M4 Max, it still beats uniform 4-bit by 2.5%. A model that was previously impossible to deploy on consumer hardware now runs with excellent quality.
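The relative deltas in the table follow directly from the reported perplexities; a quick re-derivation against the uniform 4-bit baseline:

```python
# Perplexities from the Scout table; baseline is uniform 4-bit (7.899).
baseline = 7.899
ppl = {"RAM @ 192 GB": 7.359, "RAM @ 64 GB": 7.703,
       "RAM @ 50 GB": 7.980, "RAM min-safe": 8.675}
deltas = {k: round(100 * (v - baseline) / baseline, 1) for k, v in ppl.items()}
print(deltas)
# {'RAM @ 192 GB': -6.8, 'RAM @ 64 GB': -2.5, 'RAM @ 50 GB': 1.0, 'RAM min-safe': 9.8}
```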

RAM’s built-in safety mechanisms prevent over-compression. Without them, aggressive 2-bit compression would produce a compact 34.6 GB model that is catastrophically broken (PPL 23.6, nearly triple the baseline). RAM’s proprietary quality thresholds automatically block these dangerous configurations and redistribute capacity to where it actually helps.

Why Apple Silicon Is Ideal for This Workload

RAM’s proprietary compression pipeline has computational characteristics that map perfectly to Apple Silicon’s architecture:

Memory-bound, not compute-bound

Model compression is dominated by reading weights and computing per-tensor statistics. Unified memory eliminates the CPU↔GPU transfer bottleneck that plagues discrete GPU setups. The entire model is already where the compute happens.

No gradient computation

RAM is entirely data-free: no forward pass through the model, no loss computation, no backpropagation. This makes it fundamentally less demanding than calibration-based methods.

Analysis-to-inference continuity

The compressed model produced by RAM runs directly for inference on the same machine. Analyse, compress, and serve on one device, with no conversion steps required.

The Apple Silicon Model Guide

Based on RAM’s budget curves and the memory capacities of current Apple Silicon configurations, here is what you can realistically run:

Apple Silicon         Unified Memory   Usable Budget*   What Fits
M4                    16–32 GB         ~12–24 GB        8B dense, small MoE at tight budgets
M4 Pro                24–48 GB         ~18–36 GB        30B MoE (Qwen3-30B at 19 GB = +0.6% PPL)
M4 Max                36–128 GB        ~28–100 GB       109B MoE (Scout at 64 GB = −2.5% vs uniform 4-bit)
M2/M3 Ultra           128–192 GB       ~100–160 GB      109B MoE at high quality (−6.8% vs uniform 4-bit)
M4 Ultra (expected)   up to 512 GB     ~400 GB          400B+ dense models at high precision

*Usable budget accounts for OS, KV cache, and inference overhead. Actual available memory depends on workload.

The critical insight: RAM’s budget-targeted allocation means you don’t guess which quantization preset to use. You tell RAM exactly how much memory you have, and it returns the mathematically optimal allocation for that budget. Different Mac? Different budget. Same pipeline, different optimal answer.

Group Size g=32: Why It Matters for MLX

RAM’s most surprising finding is that 85.2% of tensors are allocated group size 32 rather than the conventional 128. This is directly relevant to Apple Silicon deployment: MLX supports g=32 natively, so RAM’s preferred configuration maps straight onto MLX’s built-in quantized format.
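One concrete consequence is metadata overhead. Assuming an fp16 scale and fp16 bias per group (32 bits of metadata per group; the exact format may differ), smaller groups buy finer-grained scales for a modest size cost:

```python
# Effective storage cost of 4-bit group-wise quantization by group size,
# assuming 32 bits of scale/bias metadata per group (fp16 each).
for g in (128, 64, 32):
    bpw = 4 + 32 / g
    print(f"g={g:>3}: {bpw:.2f} bits/weight")
# g=128: 4.25 bits/weight
# g= 64: 4.50 bits/weight
# g= 32: 5.00 bits/weight
```

So g=32 costs roughly 18% more storage than g=128 (5.00 / 4.25 ≈ 1.18) while giving each scale four times fewer weights to cover, a trade that RAM's allocator evidently finds worthwhile for most tensors.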

No Calibration Data: Why This Matters for On-Device

Calibration-based methods like GPTQ and AWQ require representative input data to compute sensitivity. For on-device deployment, this creates three problems:

  1. Privacy. If you’re quantizing a model for local use on a Mac, perhaps for an enterprise deploying on-premises, sending proprietary data through a calibration pipeline on a cloud GPU defeats the purpose of local deployment.
  2. Representativeness. A calibration set from English Wikipedia may not represent the Japanese legal documents your deployment processes. RAM avoids this by using only the weights themselves.
  3. Compute requirements. Calibration requires forward passes through the model, which typically demand GPU clusters. RAM’s data-free pipeline runs entirely on the same Mac that will serve the model.
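What can a data-free method even measure? Statistics computed from the weights alone. RAM's actual per-tensor statistics are proprietary; the sketch below uses two generic weight-only proxies (dynamic range and kurtosis), purely to show that a sensitivity signal exists without any input data:

```python
import numpy as np

def weight_stats(w):
    """Purely weight-derived statistics: no inputs, no forward pass."""
    flat = np.asarray(w, dtype=np.float64).ravel()
    mu, sigma = flat.mean(), flat.std()
    return {
        "range": float(flat.max() - flat.min()),
        "kurtosis": float(((flat - mu) ** 4).mean() / sigma**4),
    }

rng = np.random.default_rng(0)
stats_g = weight_stats(rng.normal(size=100_000))            # Gaussian-like tensor
stats_t = weight_stats(rng.standard_t(df=3, size=100_000))  # heavy-tailed tensor
print(stats_g["kurtosis"], stats_t["kurtosis"])  # Gaussian is near 3; heavy tails score much higher
```

A heavy-tailed tensor (high kurtosis) has outliers that a coarse quantizer either clips or wastes range on, so it would plausibly warrant more bits or a smaller group size.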

And the quality? In matched-size comparisons against GPTQ across three MoE model families, RAM wins every time:

Model            GPTQ PPL   RAM PPL   RAM Advantage
Qwen3-30B-A3B    9.122      8.970     −1.7%
Qwen2-57B-A14B   6.390      6.329     −0.95%
Mixtral-8x7B     4.608      4.264     −4.6%

A data-free method, running on a Mac, producing better results than the gold-standard calibration method running on GPU clusters with representative data.

The Workflow: From Download to Deployment

For someone with a Mac and a model they want to deploy, the RAM workflow on Apple Silicon looks like this:

Get the code

RAM is open source (MIT). Clone, install, and run on your Mac today.

  1. Download the model from Hugging Face in safetensors format
  2. Run RAM analysis: the proprietary compression pipeline analyses the model (~10–50 min, depending on model size)
  3. Specify your memory budget: e.g., “24 GB” for an M4 Pro with 32 GB total
  4. RAM allocates: the optimal per-tensor compression configuration comes back in <1 second
  5. Apply compression: this produces optimised weights ready for inference
  6. Run inference on the same machine, with no conversion steps

Step 4 is the key innovation. Because allocation takes under 1 second, you can target any number of hardware budgets from a single analysis pass. One analysis of Qwen3-30B produces optimal allocations for iPhone (15 GB), M4 Pro (30 GB), M4 Max (64 GB), and M2 Ultra (160 GB), all in seconds.

What This Means

RAM on MLX eliminates three barriers to deploying large language models on Apple Silicon:

No GPU Required

The entire pipeline (analysis, allocation, quantization, and inference) runs on Apple Silicon. No NVIDIA hardware at any stage.

No Data Required

Zero calibration data. The model’s weights are the only input. Deploy models for domains where representative data doesn’t exist or can’t leave the building.

No Guessing

Tell RAM your memory budget. It solves for the provably optimal allocation. No manual tuning, no presets, no “try 4-bit and hope for the best.”

A Mac Studio with an M2 Ultra is now a complete model compression laboratory. It can analyse, optimise, quantize, and serve models up to 109B parameters, producing results that beat calibration-based methods running on GPU clusters. The entire RAM paper is proof.

For the full technical details, see the RAM paper on HuggingFace.

Read the Full Paper

The complete RAM paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full optimal allocation framework, is available on our HuggingFace page:

RAM: Compute-Optimal Proprietary Compression for LLMs, Full Paper

huggingface.co/spaces/baa-ai/RAM

Licensed under CC BY-NC-ND 4.0

Continue Reading

Related research from our team.

RAM on Apple Silicon: Running 400B Parameter Models on a Single Mac
RAM Research

How RAM compression enables frontier-scale models to run entirely on Apple Silicon hardware.

View All Research