Every result in the RAM paper was produced on a single Apple M2 Ultra. No NVIDIA GPUs. No cloud instances. No calibration data. Just Apple Silicon and unified memory. This article explains why that matters and what it means for anyone with a Mac.
The Hardware Nobody Expected
When quantization papers report their results, the hardware section typically reads like a shopping list from a data centre: 8×A100 80 GB, 4×H100, a DGX cluster. The RAM paper’s hardware section reads differently:
Reproduction environment (Appendix C):
- Hardware: Apple M2 Ultra, 192 GB unified memory
- Framework: MLX 0.30.3, mlx_lm 0.30.4
- Python: 3.12.0
- Quantization: Group-wise round-to-nearest (RTN)
That is the entire compute stack. A Mac Studio sitting on a desk. The same machine ran every experiment, from an 8B dense model to a 109B parameter Mixture-of-Experts model with 16 experts. The largest model (Llama-4-Scout at ~203 GB in BF16) cannot even fit in memory unquantized, yet RAM analysed, allocated, and quantized it to run on that same machine.
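Appendix C names the quantization scheme: group-wise round-to-nearest. A minimal, framework-agnostic sketch of that scheme in pure Python (illustrative only; MLX's real kernels pack the integers and run on the GPU):

```python
# Minimal sketch of group-wise round-to-nearest (RTN) quantization in
# pure Python. Illustrative only: MLX's real kernels pack the integers
# and run on the GPU.

def quantize_group(values, bits=4):
    """Map one group of floats to asymmetric bits-wide integer levels."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1               # 15 levels for 4-bit
    scale = (hi - lo) / levels or 1.0      # guard against constant groups
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo                    # ints + per-group scale/offset

def dequantize_group(q, scale, lo):
    return [lo + qi * scale for qi in q]

def rtn(weights, bits=4, group_size=32):
    """Group-wise RTN over a flat list of weights."""
    out = []
    for i in range(0, len(weights), group_size):
        q, scale, lo = quantize_group(weights[i:i + group_size], bits)
        out.extend(dequantize_group(q, scale, lo))
    return out

w = [0.013 * i - 0.2 for i in range(64)]   # toy "tensor"
w_hat = rtn(w, bits=4, group_size=32)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))  # bounded by scale/2
```

Each group stores its own scale and offset, which is exactly why smaller groups (the g=32 finding discussed later) track local weight distributions more tightly.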
Why MLX Changes the Equation
Apple’s MLX framework is purpose-built for Apple Silicon. Two properties make it uniquely suited to RAM’s proprietary compression pipeline:
Unified Memory
CPU and GPU share the same memory pool. A 192 GB M2 Ultra can hold 192 GB of model weights without copying between devices. No PCIe bottleneck, no VRAM limit separate from system RAM.
Native Quantization
MLX supports group-wise quantization natively at multiple bit-widths and group sizes. The quantized format is the deployment format: what RAM produces runs directly for inference on the same machine.
This is not a compromise. Unified memory is an advantage for quantization workloads. Traditional GPU setups are constrained by VRAM (24 GB on an RTX 4090, 80 GB on an A100). A 109B parameter model in BF16 requires ~203 GB; no single GPU can hold it. On Apple Silicon, the M2 Ultra's 192 GB is one contiguous memory space accessible by all compute units.
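The sizes quoted here follow from simple arithmetic. A quick sketch, assuming for the quantized case that metadata is an fp16 scale plus fp16 bias per group (an assumption about the storage format, not a documented MLX constant):

```python
# Back-of-the-envelope model sizes. Assumes quantization metadata is an
# fp16 scale + fp16 bias per group (32 bits), which is an assumption
# about the storage format, not a documented MLX constant.

def model_gib(params_b, bits, group_size=None, meta_bits=32):
    """Approximate size in GiB of a params_b-billion-parameter model."""
    params = params_b * 1e9
    total_bits = params * bits
    if group_size:
        total_bits += (params / group_size) * meta_bits
    return total_bits / 8 / 2**30

bf16 = model_gib(109, 16)               # Llama-4-Scout in BF16, ~203 GiB
q4 = model_gib(109, 4, group_size=32)   # 4-bit with g=32 metadata
```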
What RAM Actually Does on Apple Silicon
RAM’s proprietary compression pipeline analyses every weight tensor in a model (18,867 in Qwen3-30B alone) and determines the optimal compression strategy for each one. The entire pipeline runs natively on Apple Silicon.
The key input is your memory budget: “fit this model in 30 GB” or “fit it in 64 GB”. RAM’s allocator finds the provably optimal per-tensor configuration for that budget in under 1 second. No manual tuning, no presets.
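RAM's allocator itself is proprietary, but the shape of the problem can be sketched: given candidate configurations per tensor, each with a size and an estimated error, spend a fixed byte budget where it reduces error most. A toy greedy version under those assumptions, with made-up tensor names and costs:

```python
# Toy budget-targeted allocator sketch: NOT RAM's proprietary algorithm.
# Each tensor has candidate configs (size_bytes, error), cheapest first.
# Start everything at the cheapest config, then greedily spend the
# remaining budget where it buys the most error reduction per byte.
import heapq

def upgrade_gain(cfgs, i):
    """Error reduced per extra byte when moving config i -> i+1."""
    (s0, e0), (s1, e1) = cfgs[i], cfgs[i + 1]
    return (e0 - e1) / (s1 - s0)

def allocate(tensors, budget_bytes):
    choice = {name: 0 for name in tensors}
    spent = sum(cfgs[0][0] for cfgs in tensors.values())
    heap = [(-upgrade_gain(cfgs, 0), name)
            for name, cfgs in tensors.items() if len(cfgs) > 1]
    heapq.heapify(heap)
    while heap:
        _, name = heapq.heappop(heap)
        cfgs, i = tensors[name], choice[name]
        extra = cfgs[i + 1][0] - cfgs[i][0]
        if spent + extra > budget_bytes:
            continue  # cannot afford this upgrade; try the next best
        spent += extra
        choice[name] = i + 1
        if choice[name] + 1 < len(cfgs):
            heapq.heappush(heap, (-upgrade_gain(cfgs, choice[name]), name))
    return choice, spent

# Hypothetical tensors: two configs each (e.g. a low- and a high-bit option).
tensors = {"attn.q_proj": [(10, 1.0), (20, 0.2)],
           "mlp.gate":    [(10, 0.8), (20, 0.5)]}
choice, spent = allocate(tensors, budget_bytes=30)
```

A greedy pass like this is only a heuristic; the paper claims a provably optimal allocation, which this sketch does not reproduce.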
Timing: How Fast Is It?
From the RAM paper, Table 11, all timings on Apple M2 Ultra 192 GB:
| Model | Parameters | Tensors | Analysis | Allocation | Total |
|---|---|---|---|---|---|
| Qwen3-8B | 8B (dense) | 399 | 3 min | <1s | ~10 min |
| GLM-4.7-Flash | 30B (dense) | 9,703 | 39 min | <1s | ~44 min |
| Qwen3-30B-A3B | 30B (MoE) | 18,867 | 50 min | <1s | ~54 min |
| Llama-4-Scout | 109B (MoE) | ~1,000 | 45 min | <1s | ~50 min |
A 109B parameter model, analysed, optimised, and allocated in roughly 50 minutes. On a Mac. Compare this to calibration-based methods like GPTQ, which require GPU clusters, representative datasets, and hours of Hessian computation for models of this scale.
Note the allocation time: under 1 second in every case. This means once you’ve run the analysis pass, you can re-target the same model to any number of hardware budgets instantly. Analyse once, deploy everywhere.
Budget-Targeted Deployment: Name Your Hardware
RAM’s defining feature is budget-targeted quantization. You specify exactly how much memory you have, and RAM returns the optimal allocation. The paper demonstrates this with Qwen3-30B-A3B across specific Apple Silicon and GPU targets:
| Target Hardware | Memory Budget | Model Size | Mean PPL | Δ vs BF16 |
|---|---|---|---|---|
| iPhone 16 Pro | 15.3 GB | 16.13 GB | 8.970 | +2.8% |
| RTX 4070 | 20.0 GB | 19.32 GB | 8.784 | +0.6% |
| RTX 4090 | 25.0 GB | 27.39 GB | 8.760 | +0.4% |
| Mac M4 Pro | 30.0 GB | 30.75 GB | 8.657 | −0.8% |
| BF16 (no quantization) | - | 56.87 GB | 8.728 | - |
At the Mac M4 Pro budget of 30 GB, RAM produces a 30.75 GB quantized model with a mean perplexity of 8.657, within 1% of the full BF16 model (8.728) at 54% of the size. RAM’s proprietary allocator distributes bits where they matter most, closing 94% of the gap between uniform 4-bit and BF16 at the 19 GB budget point.
The 109B Model on a Mac Studio
The most striking demonstration is Llama-4-Scout, Meta’s 109B parameter MoE model with 16 experts and 17B active parameters per token. In BF16, it requires ~203 GB. No consumer hardware can run it. Here is what RAM does with it on the M2 Ultra:
| Configuration | Size | Mean PPL | Assessment |
|---|---|---|---|
| BF16 | ~203 GB | — | Exceeds memory; cannot run |
| Uniform 4-bit | 56.9 GB | 7.899 | Baseline |
| RAM @ 192 GB | 163.24 GB | 7.359 | −6.8% vs uniform 4-bit |
| RAM @ 64 GB | 58.03 GB | 7.703 | −2.5% vs uniform 4-bit |
| RAM @ 50 GB | 51.98 GB | 7.980 | +1.0% vs uniform 4-bit |
| RAM min-safe | 46.93 GB | 8.675 | +9.8% vs uniform 4-bit |
At the M2 Ultra’s 192 GB budget, RAM produces a 163 GB model that beats uniform 4-bit quantization by 6.8% (7.359 vs 7.899). At 64 GB (feasible on an M4 Max), it still beats uniform 4-bit by 2.5%. A model that was previously impossible to deploy on consumer hardware now runs with excellent quality.
RAM’s built-in safety mechanisms prevent over-compression. Without them, aggressive 2-bit compression would produce a compact 34.6 GB model that is catastrophically broken (PPL 23.6, nearly triple the baseline). RAM’s proprietary quality thresholds automatically block these dangerous configurations and redistribute capacity to where it actually helps.
Why Apple Silicon Is Ideal for This Workload
RAM’s proprietary compression pipeline has computational characteristics that map perfectly to Apple Silicon’s architecture:
Memory-bound, not compute-bound
Model compression is dominated by reading weights and computing per-tensor statistics. Unified memory eliminates the CPU↔GPU transfer bottleneck that plagues discrete GPU setups. The entire model is already where the compute happens.
No gradient computation
RAM is entirely data-free: no forward pass through the model, no loss computation, no backpropagation. This makes it fundamentally less demanding than calibration-based methods.
Analysis-to-inference continuity
The compressed model produced by RAM runs directly for inference on the same machine. Analyse, compress, and serve on one device, with no conversion steps required.
The Apple Silicon Model Guide
Based on RAM’s budget curves and the memory capacities of current Apple Silicon configurations, here is what you can realistically run:
| Apple Silicon | Unified Memory | Usable Budget* | What Fits |
|---|---|---|---|
| M4 | 16–32 GB | ~12–24 GB | 8B dense, small MoE at tight budgets |
| M4 Pro | 24–48 GB | ~18–36 GB | 30B MoE (Qwen3-30B at 19 GB = +0.6% PPL) |
| M4 Max | 36–128 GB | ~28–100 GB | 109B MoE (Scout at 64 GB = −2.5% vs uniform 4-bit) |
| M2/M3 Ultra | 128–192 GB | ~100–160 GB | 109B MoE at high quality (−6.8% vs uniform 4-bit) |
| M4 Ultra (expected) | up to 512 GB | ~400 GB | 400B+ dense models at high precision |
*Usable budget accounts for OS, KV cache, and inference overhead. Actual available memory depends on workload.
The critical insight: RAM’s budget-targeted allocation means you don’t guess which quantization preset to use. You tell RAM exactly how much memory you have, and it returns the mathematically optimal allocation for that budget. Different Mac? Different budget. Same pipeline, different optimal answer.
Group Size g=32: Why It Matters for MLX
RAM’s most surprising finding is that 85.2% of tensors are allocated group size 32 rather than the conventional 128. This is directly relevant to Apple Silicon deployment:
- Apple Silicon natively supports g=32. No custom kernels or workarounds needed.
- Storage overhead is modest. Going from g=128 to g=32 adds ~0.125 bytes/parameter. For a 30B model, that’s ~3.5 GB additional overhead.
- Quality improvement is substantial. The finer granularity of g=32 captures per-group weight distributions more accurately, closing up to 94% of the gap between uniform 4-bit and BF16.
- Performance cost is minimal. The 4× increase in groups from g=128 to g=32 adds negligible latency because the operation is memory-bound on Apple Silicon.
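The overhead figures follow from simple arithmetic, assuming an fp16 scale plus fp16 bias stored once per group (an assumption about the storage layout, not a documented MLX constant):

```python
# Per-parameter metadata cost of group-wise quantization, assuming an
# fp16 scale + fp16 bias (32 bits) stored once per group. This layout
# is an assumption, not a documented MLX constant.

def overhead_bytes_per_param(group_size, meta_bits=32):
    return meta_bits / group_size / 8

g32 = overhead_bytes_per_param(32)       # 0.125 bytes per parameter
g128 = overhead_bytes_per_param(128)     # 0.03125 bytes per parameter
g32_total_gib_30b = g32 * 30e9 / 2**30   # metadata for a 30B model at g=32
```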
No Calibration Data: Why This Matters for On-Device
Calibration-based methods like GPTQ and AWQ require representative input data to compute sensitivity. For on-device deployment, this creates three problems:
- Privacy. If you’re quantizing a model for local use on a Mac, perhaps for an enterprise deploying on-premises, sending proprietary data through a calibration pipeline on a cloud GPU defeats the purpose of local deployment.
- Representativeness. A calibration set from English Wikipedia may not represent the Japanese legal documents your deployment processes. RAM avoids this by using only the weights themselves.
- Compute requirements. Calibration requires forward passes through the model, which typically demand GPU clusters. RAM’s data-free pipeline runs entirely on the same Mac that will serve the model.
And the quality? In matched-size comparisons against GPTQ across three MoE model families, RAM wins every time:
| Model | GPTQ PPL | RAM PPL | RAM Advantage |
|---|---|---|---|
| Qwen3-30B-A3B | 9.122 | 8.970 | −1.7% |
| Qwen2-57B-A14B | 6.390 | 6.329 | −0.95% |
| Mixtral-8x7B | 4.608 | 4.264 | −4.6% |
A data-free method, running on a Mac, producing better results than the gold-standard calibration method running on GPU clusters with representative data.
The Workflow: From Download to Deployment
For someone with a Mac and a model they want to deploy, the RAM workflow on Apple Silicon looks like this:
Get the code
RAM is open source (MIT). Clone, install, and run on your Mac today.
1. Download the model from Hugging Face in safetensors format.
2. Run RAM analysis: the proprietary compression pipeline analyses the model (~10–50 min depending on model size).
3. Specify your memory budget, e.g. “24 GB” for an M4 Pro with 32 GB total.
4. RAM allocates: the optimal per-tensor compression configuration is returned in <1 second.
5. Apply compression: RAM produces optimised weights ready for inference.
6. Run inference on the same machine, with no conversion steps.
Step 4 is the key innovation. Because allocation takes under 1 second, you can target any number of hardware budgets from a single analysis pass. One analysis of Qwen3-30B produces optimal allocations for iPhone (15 GB), M4 Pro (30 GB), M4 Max (64 GB), and M2 Ultra (160 GB), all in seconds.
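The analyse-once, allocate-everywhere pattern behind the steps above can be sketched as follows; `analyse` and `allocate_for_budget` are illustrative stand-ins, not RAM's actual API:

```python
# Hypothetical sketch of "analyse once, deploy everywhere".
# `analyse` and `allocate_for_budget` are illustrative stand-ins,
# not RAM's actual API.

BUDGETS_GB = {"iPhone 16 Pro": 15, "M4 Pro": 30,
              "M4 Max": 64, "M2 Ultra": 160}

def analyse(model_name):
    """Stand-in for the slow analysis pass (minutes in practice):
    computes per-tensor statistics once."""
    return {"model": model_name, "stats": {}}

def allocate_for_budget(analysis, budget_gb):
    """Stand-in for the <1 s allocation step, which solves for the
    optimal per-tensor configuration under the given budget."""
    return {"model": analysis["model"], "budget_gb": budget_gb}

analysis = analyse("Qwen3-30B-A3B")             # run once
plans = {target: allocate_for_budget(analysis, gb)
         for target, gb in BUDGETS_GB.items()}  # re-target in seconds
```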
What This Means
RAM on MLX eliminates three barriers to deploying large language models on Apple Silicon:
No GPU Required
The entire pipeline (analysis, allocation, quantization, and inference) runs on Apple Silicon. No NVIDIA hardware at any stage.
No Data Required
Zero calibration data. The model’s weights are the only input. Deploy models for domains where representative data doesn’t exist or can’t leave the building.
No Guessing
Tell RAM your memory budget. It solves for the provably optimal allocation. No manual tuning, no presets, no “try 4-bit and hope for the best.”
A Mac Studio with an M2 Ultra is now a complete model compression laboratory. It can analyse, optimise, quantize, and serve models up to 109B parameters, producing results that beat calibration-based methods running on GPU clusters. The entire RAM paper is proof.
For the full technical details, see the RAM paper on HuggingFace.
Read the Full Paper
The complete RAM paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full optimal allocation framework, is available on our Hugging Face page:
RAM: Compute-Optimal Proprietary Compression for LLMs, Full Paper
huggingface.co/spaces/baa-ai/RAM
Licensed under CC BY-NC-ND 4.0