A Mac Mini with 16 GB unified memory can read, transform, and write a 250 GB language model end-to-end without ever holding the full model in RAM. The bottleneck is disk space, not memory, and most Apple-Silicon ML pipelines are leaving this on the table.
The conventional wisdom is wrong
Quantizing a 250 GB model (say, a 122B-parameter MoE checkpoint in BF16) is generally treated as a workstation-class task. The rule of thumb most engineers use is RAM ≥ model size × 1.5, which works out to roughly 375 GB for a 250 GB model: more than a 192 GB Mac Studio M2 Ultra offers, and more than a 256 GB Linux box.
This rule is wrong for any pipeline built around streaming load. We have run end-to-end mixed-precision quantization of 250 GB BF16 models on a Mac Mini with 16 GB of unified memory, with <15 GB peak RSS at every stage. The disk fills up before the memory does.
What "lazy" means in MLX
In this pipeline, "lazy" means tensor data is read from disk only when a stage actually needs it. Both mlx_lm.convert() and the underlying safetensors reader support per-tensor / per-shard streaming, so neither materialises the whole model. The pipeline stages and their measured peaks:
| Stage | What it touches | Peak memory |
|---|---|---|
| Compression planning (numerical analysis on allocation tables) | JSON only, no model | ~100 MB |
| Conversion (BF16 → mixed-precision quantized) | One safetensors shard at a time | ~5–10 GB |
| Numerical verification (per-tensor SQNR vs source) | One tensor at a time | ~5–10 GB |
| Tokenizer + config staging | Small JSON + vocab files | <200 MB |
The conversion stage dominates and tops out around 10 GB, well below the 16 GB ceiling. The verification stage uses the same streaming primitive, safetensors.safe_open, to compare the BF16 source against the quantized output one tensor at a time, so it never holds two full copies of the model.
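The per-tensor SQNR check can be sketched as follows. This is a dependency-free illustration of the metric, not the pipeline's actual verifier: `sqnr_db` and `worst_sqnr` are hypothetical names, and the inputs are plain sequences of floats standing in for a source tensor and its dequantized counterpart.

```python
import math


def sqnr_db(reference, reconstructed):
    """Signal-to-quantization-noise ratio in dB for one tensor.

    SQNR = 10 * log10(sum(x^2) / sum((x - x_hat)^2))
    """
    if len(reference) != len(reconstructed):
        raise ValueError("tensors must have the same number of elements")
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, reconstructed))
    if noise == 0:
        return math.inf   # exact reconstruction
    if signal == 0:
        return -math.inf  # all-zero source with a non-zero error
    return 10.0 * math.log10(signal / noise)


def worst_sqnr(pairs):
    """Return (sqnr_db, name) for the worst tensor, or None if there are none.

    `pairs` is any iterable of (name, reference, reconstructed). Passing a
    generator keeps only one pair of tensors alive at a time.
    """
    return min(((sqnr_db(r, q), name) for name, r, q in pairs), default=None)
```

Feeding `worst_sqnr` a generator that yields one source/output pair at a time keeps the working set at the size of the largest single tensor rather than the whole model.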
The two engineering rules
Rule 1: never materialise the full model. Use safetensors.safe_open(...), iterate f.keys(), and fetch each tensor with f.get_tensor(name) rather than loading a whole file at once with load_file(...). Use mlx_lm.convert() rather than mlx_lm.load() → quantize → save. The naive PyTorch pattern of model = AutoModel.from_pretrained(...) is what makes a 250 GB model "need" 256 GB of RAM.
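A minimal sketch of the Rule 1 access pattern, built on the `safe_open`, `keys()`, and `get_tensor()` calls that safetensors provides. The helper names `stream_tensors` and `map_tensors` are hypothetical, as is the `opener` parameter, which exists only so the loop can be exercised with a stand-in reader. The default `framework="pt"` assumes PyTorch is installed; pick whichever backend your pipeline uses.

```python
from typing import Callable, Dict, Iterator, Optional, Tuple


def stream_tensors(
    path: str,
    framework: str = "pt",
    opener: Optional[Callable] = None,
) -> Iterator[Tuple[str, object]]:
    """Yield (name, tensor) pairs one at a time from a safetensors shard."""
    if opener is None:
        # Imported lazily so this module loads even without safetensors.
        from safetensors import safe_open as opener
    with opener(path, framework=framework) as f:
        for name in f.keys():
            # Only this tensor's bytes are read; nothing is cached here.
            yield name, f.get_tensor(name)


def map_tensors(path: str, fn: Callable, **kwargs) -> Dict[str, object]:
    """Apply fn to every tensor and keep only fn's (small) result.

    Each tensor becomes garbage as soon as the loop moves on, so peak memory
    is bounded by a couple of tensors rather than by the shard.
    """
    return {name: fn(tensor) for name, tensor in stream_tensors(path, **kwargs)}
```

For example, `map_tensors(shard_path, lambda t: tuple(t.shape))` collects every tensor shape in a shard while only ever holding one or two tensors in memory.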
Rule 2: compute on tensor metadata where you can. The shape and dtype of every tensor in a safetensors file are readable from the file's header without loading any tensor data. This is enough to plan a quantization budget, build a tensor-name → bit-width map, or compute the post-quantization size, all in well under 100 MB of working memory.
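A sketch of Rule 2 using only the standard library. It relies on the published safetensors layout: an 8-byte little-endian header length, followed by a JSON header that lists each tensor's dtype, shape, and data offsets. The function names and `example_policy` are hypothetical, and the size estimate ignores per-group scale and bias overhead, so treat it as a lower bound.

```python
import json
import struct
from math import prod


def read_header(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} without
    reading any tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def example_policy(name, info):
    # Hypothetical rule: matrices go to 4 bits, vectors and scalars stay at 16.
    return 4 if len(info["shape"]) >= 2 else 16


def planned_size_bytes(header, bits_for):
    """Estimate post-quantization size from shapes and a bit-width policy."""
    total_bits = 0
    for name, info in header.items():
        n_elems = prod(info["shape"])  # prod([]) == 1 covers scalars
        total_bits += n_elems * bits_for(name, info)
    return (total_bits + 7) // 8
```

Summing `planned_size_bytes(read_header(p), example_policy)` over every shard gives a disk estimate for the quantized output before any weights have been read.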
The disk side: actually a real bottleneck
If memory isn't the constraint, disk is. End-to-end conversion of a 250 GB BF16 model needs:
- 250 GB for the source (read-only, can be on an external drive)
- ~65 GB for the quantized output (at a 4-bit average, a quarter of the 16-bit source is 62.5 GB, plus quantization metadata)
- ~10 GB scratch space for shard intermediates during conversion
- Total: ~325 GB of disk
A 1 TB Mac Mini has plenty. A 256 GB MacBook Air does not. We've run the same pipeline on a base-model 16 GB / 256 GB MacBook Air by pointing the source-model directory at an external SSD over Thunderbolt; the only operational overhead is conversion runtime (Thunderbolt sequential read tops out around 2.5 GB/s in practice, so a 250 GB sequential pass takes roughly 100 seconds of pure I/O on top of the ~30 minutes of compute).
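The I/O arithmetic above can be reproduced with a small helper. This is a back-of-the-envelope sketch with hypothetical function names, and it assumes, as the text does, that read time is added to compute time rather than overlapped with it.

```python
def sequential_read_seconds(size_gb, throughput_gb_per_s):
    """Time for one sequential pass over size_gb at a sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_s


def total_runtime_minutes(size_gb, throughput_gb_per_s, compute_minutes):
    """Pessimistic total: compute time plus a non-overlapped read pass."""
    io_minutes = sequential_read_seconds(size_gb, throughput_gb_per_s) / 60.0
    return compute_minutes + io_minutes


# 250 GB at 2.5 GB/s is 100 s of pure I/O, under 2 minutes on top of ~30.
io_seconds = sequential_read_seconds(250, 2.5)
total_minutes = total_runtime_minutes(250, 2.5, 30)
```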
What this means
Apple Silicon democratises large-model engineering for anyone who can afford the disk. A Mac Mini at $599 with a 1 TB external SSD ($80) is a fully functional rig for compressing any publicly released open-weights checkpoint that fits on that disk alongside its quantized output. Larger releases, up to and including the 700B-parameter class (about 1.4 TB at BF16) and the new Kimi-K2.6, DeepSeek-V4, GLM-5, and Llama-4-Scout families, need a proportionally larger drive rather than more RAM, provided the conversion pipeline you use is shard-streaming end-to-end.
This isn't a clever hack. It's just disciplined use of the streaming primitives that safetensors and mlx_lm already publish. The reason most teams' pipelines blow up in memory is that they reach for transformers.AutoModel, which is built around the assumption that "model" means "everything in RAM at once." On Apple Silicon, that assumption is what costs you the workstation upgrade.
When the rule breaks
Two cases where the 16 GB rig genuinely doesn't work:
- Inference at full precision on a model larger than your unified memory. Streaming load doesn't help during forward passes, because the weights have to live somewhere addressable to the GPU. That kind of inference is what 192 GB Mac Studios are for.
- Training or fine-tuning at full precision. The reason is the same: weights, gradients, and activations all have to be resident on-device. LoRA fine-tuning at low rank is sometimes feasible with aggressive sharding, but full-precision SFT of a 250 GB model is out.
For everything in between (preparing a model for distribution, evaluating quality at multiple bit-widths, building a quantized release artifact), the 16 GB Mac is the right tool.
Source: the engineering practice is documented in RAM/RUN/README.md (open-source pipeline). The lazy-load behaviour is built into safetensors (mainline) and mlx_lm.convert (mainline).