Quantizing a Trillion-Parameter MoE in 15 GB of RAM

Standard mlx_lm.convert() materialises the entire BF16 model in unified memory before re-quantizing. For checkpoints at or near trillion-parameter scale, such as Kimi-K2.6 or DeepSeek-V4, that's up to about 2 TB of RAM. A shard-streaming converter does the same job at a peak of about 15 GB, regardless of source size. Here's how, and why it works.

The default pipeline OOMs at scale

mlx_lm.convert() is the official tool for converting a HuggingFace BF16 checkpoint to a mixed-precision MLX model. It's reliable, well-tested, and used in production by thousands of teams. It also assumes the entire BF16 model fits in unified memory at conversion time.

For dense models up to ~70 B parameters (140 GB in BF16), this assumption holds on a 192 GB Mac Studio. For the largest MoE checkpoints (Kimi-K2.6, DeepSeek-V4, GLM-5, Llama-4-Maverick), the BF16 intermediate would need up to ~2 TB of RAM, which no consumer Mac has and very few cloud instances offer at any reasonable price.
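The memory figures above follow from BF16 storing two bytes per parameter. A quick sketch of the arithmetic (the helper name is made up for illustration, and decimal gigabytes are assumed):

```python
BYTES_PER_BF16 = 2  # BF16 stores each parameter in 16 bits


def bf16_footprint_gb(n_params):
    """Size of n_params BF16 weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_BF16 / 1e9


dense_70b_gb = bf16_footprint_gb(70e9)  # dense 70 B model
moe_1t_gb = bf16_footprint_gb(1e12)     # trillion-parameter MoE

print(f"70 B dense: {dense_70b_gb:.0f} GB, 1 T MoE: {moe_1t_gb:.0f} GB")
```

The first figure fits in 192 GB of unified memory with room to spare; the second is roughly ten times larger than the machine.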

The result: teams either skip MLX conversion for these models entirely (forcing users onto GGUF with its own format losses) or rent A100 cloud nodes for the conversion step itself.

Neither is necessary.

The streaming converter

The fix is to operate one shard at a time. Source models are stored as a sequence of model-XXXXX-of-YYYYY.safetensors files, typically 4-8 GB each. The streaming converter:

Opens the safetensors index to know which tensor lives in which shard.
Loops over output shards (5 GB per output shard by default; HuggingFace ingestion balks above 10 GB).
For each tensor in the next output shard, reads it from its source shard, applies the bit-width assignment from the allocation manifest, quantizes to MLX's Quantized* types, and writes to the output shard.
Closes both source and output shards before moving to the next batch.

Peak memory holds at one source shard plus one output shard plus working space, typically ~15 GB total. The model size does not enter the calculation.
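The first step, working out which tensor lives in which shard, needs no tensor data at all. The sketch below inverts the index's weight_map and reads per-tensor byte sizes from a safetensors header (an 8-byte little-endian length followed by a JSON header). The function names are illustrative and not taken from the production converter.

```python
import json
import struct
from collections import defaultdict


def group_by_source_shard(weight_map):
    """Invert {tensor_name: shard_file} into {shard_file: [tensor_name, ...]}."""
    groups = defaultdict(list)
    for tensor_name, shard_file in weight_map.items():
        groups[shard_file].append(tensor_name)
    return dict(groups)


def read_tensor_sizes(shard_path):
    """Return {tensor_name: byte_size} from a safetensors header, without reading tensor data."""
    with open(shard_path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian header length
        header = json.loads(f.read(header_len))
    return {
        name: meta["data_offsets"][1] - meta["data_offsets"][0]
        for name, meta in header.items()
        if name != "__metadata__"  # optional free-form metadata entry, not a tensor
    }
```

Because only the headers are read, planning the whole conversion costs a few megabytes of RAM even for a checkpoint with thousands of shards.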

What stays the same

The quantization decisions, the post-quantization quality, and the quantized tensors themselves are identical to what mlx_lm.convert() would produce, given the same allocation manifest and bit-widths. The streaming converter is a drop-in replacement for the conversion step alone, not a different quantization method.

The manifest you feed in is the standard {tensor_name: bits} dict that any allocation method produces. If your team uses uniform 4-bit, the manifest is {tensor: 4 for tensor in all_tensors}. If you use a more sophisticated allocation, the manifest reflects that. The streaming converter doesn't care which.
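As a small illustration of the manifest format, here is one way to build a uniform manifest and layer a few per-tensor overrides on top. The helper names and the tensor names in the usage lines are hypothetical; real names come from the checkpoint's index.

```python
import json


def uniform_manifest(tensor_names, bits=4):
    """Assign the same bit-width to every tensor."""
    return {name: bits for name in tensor_names}


def apply_overrides(manifest, overrides):
    """Return a copy of the manifest with selected tensors set to different bit-widths."""
    merged = dict(manifest)
    merged.update(overrides)
    return merged


# Hypothetical tensor names, for illustration only.
names = ["model.embed_tokens.weight", "model.layers.0.self_attn.q_proj.weight", "lm_head.weight"]
manifest = apply_overrides(uniform_manifest(names), {"lm_head.weight": 8})
manifest_json = json.dumps(manifest, indent=2)  # what would be written to manifest.json
```

Either manifest is consumed the same way by the conversion loop, which only ever looks up one tensor name at a time.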

What changes

Three behaviours differ from the reference converter:

Output shard sizing is explicit. mlx_lm.convert() splits its output into files only after the whole model has been quantized in memory. Trillion-parameter quantized models still need to be sharded to ingest into HuggingFace; the streaming converter writes each shard directly during conversion, as soon as it fills, at a configurable size (default 5 GB).

Routed expert handling is uniform-bits. Manifest entries cover tensors the upstream allocator considered. For routed experts in models with hundreds of experts per layer, individual expert measurement is impractical at conversion time, so the streaming converter takes a single --expert-bits value (default 4) and applies it uniformly. This matches the design of every public MoE quantization pipeline we know of.

Memory peak is bounded, not predictable. Reading a shard, quantizing the tensors inside it, and writing an output shard involve several short-lived allocations along the way. The peak holds below 15 GB on every model we've tested up to 1 T parameters, but the exact number varies by model architecture and shard sizes. Treat 15 GB as the practical upper bound; on most models the peak is closer to 10 GB.
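The uniform expert rule can be written as a small lookup. This is a sketch under a naming assumption: routed experts appear as "...mlp.experts.<index>..." while shared experts use "shared_experts" and should keep following the manifest. Checkpoints that name things differently would need a different pattern, and a plain substring check on "experts" matches both kinds.

```python
def resolve_bits(tensor_name, manifest, expert_bits=4, default_bits=16):
    """Pick the bit-width for one tensor.

    Routed experts get the uniform expert_bits value; every other tensor
    follows the manifest, falling back to default_bits when unmapped.
    """
    # Assumed naming: ".experts." marks a routed expert. "shared_experts"
    # does not contain ".experts." and so falls through to the manifest.
    if ".experts." in tensor_name:
        return expert_bits
    return manifest.get(tensor_name, default_bits)
```

Keeping the rule in one function makes it easy to check against a checkpoint's tensor names before starting a multi-hour conversion.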

The recipe

import json
import os

import mlx.core as mx

source_dir = "models/Kimi-K2.6"
output_dir = "models/Kimi-K2.6-RAM"
expert_bits = 4  # uniform bit-width for routed experts (the --expert-bits value)
GROUP_SIZE = 64
TARGET_SHARD_BYTES = 5 * 1024**3

with open("manifest.json") as f:
    manifest = json.load(f)  # {tensor_name: bits}

with open(f"{source_dir}/model.safetensors.index.json") as f:
    index = json.load(f)["weight_map"]

os.makedirs(output_dir, exist_ok=True)
source_shards = sorted(set(index.values()))
output_shard_idx = 0
output_buffer = {}
output_shard_bytes = 0

for shard_name in source_shards:
    # mx.load reads safetensors (including BF16) straight into MLX arrays
    src = mx.load(f"{source_dir}/{shard_name}")
    for tensor_name, weight in src.items():
        bits = manifest.get(tensor_name, 16)  # default 16 for unmapped
        # Routed experts use the uniform expert_bits value
        if "experts" in tensor_name:
            bits = expert_bits

        # Quantize weight matrices only; norms, biases and 16-bit entries pass through unchanged
        if bits < 16 and weight.ndim >= 2 and weight.shape[-1] % GROUP_SIZE == 0:
            w_q, scales, biases = mx.quantize(weight, group_size=GROUP_SIZE, bits=bits)
            base = tensor_name.removesuffix(".weight")
            tensors = {tensor_name: w_q, f"{base}.scales": scales, f"{base}.biases": biases}
        else:
            tensors = {tensor_name: weight}
        mx.eval(*tensors.values())  # materialise now so nothing lazy keeps the source shard alive
        output_buffer.update(tensors)
        output_shard_bytes += sum(t.nbytes for t in tensors.values())

        if output_shard_bytes >= TARGET_SHARD_BYTES:
            mx.save_safetensors(f"{output_dir}/model-{output_shard_idx:05d}.safetensors", output_buffer)
            output_buffer.clear()
            output_shard_bytes = 0
            output_shard_idx += 1
    del src  # free source shard before opening the next

if output_buffer:
    mx.save_safetensors(f"{output_dir}/model-{output_shard_idx:05d}.safetensors", output_buffer)

This is the skeleton. The production version handles edge cases (FP8 source models, see FP8 Block-Scale Conversion; MoE expert layout reassembly; tied embeddings) but the core loop is exactly this shape.

When to reach for it

Source model > 60% of your unified memory. This is the threshold where standard convert starts swapping or OOMing. Below it, use mlx_lm.convert(); it's better-tested and slightly faster on small models.
Source model is multi-shard. If your model came as one safetensors file (~5 GB max), streaming offers no benefit.
You're shipping for HuggingFace. Output sharding is mandatory above 10 GB anyway.
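The first two criteria reduce to a one-line check. The sketch below is illustrative only: the function name is invented, and the 60% threshold is the rule of thumb quoted above rather than a measured constant.

```python
def should_stream(source_bytes, unified_memory_bytes, num_source_shards, threshold=0.60):
    """True when the source is multi-shard and exceeds the given fraction of unified memory."""
    return num_source_shards > 1 and source_bytes > threshold * unified_memory_bytes


GB = 1e9
# 140 GB dense checkpoint on a 192 GB machine: above 60% of memory (115.2 GB)
print(should_stream(140 * GB, 192 * GB, num_source_shards=30))
# 2 TB MoE checkpoint on the same machine
print(should_stream(2000 * GB, 192 * GB, num_source_shards=400))
# Single-file 5 GB model: streaming offers no benefit
print(should_stream(5 * GB, 192 * GB, num_source_shards=1))
```

Note that a 140 GB checkpoint on a 192 GB machine already crosses the 60% line, even though the BF16 weights nominally fit.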

The honest cost

Conversion runtime is ~10% slower than mlx_lm.convert() because the streaming converter pays per-shard open/close overhead. On a 250 GB source model, that's an extra ~3 minutes. Disk write rate is ~2.5 GB/s on a recent Mac internal SSD, so I/O is not the bottleneck; the per-tensor MLX quantize call is.

Worth the trade for being able to run conversion on a Mac Mini.
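A back-of-the-envelope check of the numbers above, using only the figures quoted in this section. Writing the full 250 GB is a deliberate overestimate, since the quantized output is smaller than the BF16 source.

```python
def write_seconds(total_gb, write_gb_per_s):
    """Time to write total_gb at a sustained rate of write_gb_per_s."""
    return total_gb / write_gb_per_s


def implied_baseline_minutes(extra_minutes, slowdown_percent):
    """Baseline runtime implied by an absolute overhead and a relative slowdown."""
    return extra_minutes * 100 / slowdown_percent


write_s = write_seconds(250, 2.5)               # seconds to write 250 GB at 2.5 GB/s
baseline_min = implied_baseline_minutes(3, 10)  # 3 extra minutes at ~10% slower
write_share = write_s / (baseline_min * 60)     # fraction of the runtime spent writing

print(f"write: {write_s:.0f} s, baseline: {baseline_min:.0f} min, share: {write_share:.1%}")
```

Even with the overestimate, disk writes account for a small fraction of the implied baseline, which is consistent with the quantize call dominating the runtime.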

Source: open-source convert_kimi_mlx.py in the RAM pipeline. Tested on Kimi-K2.6 (1.0 T), DeepSeek-V4 (685 B), GLM-5 (450 B), Llama-4-Maverick (400 B, 128 experts).