Metal Buffer Limits Block LoRA Scaling on MoE Models

We hit a wall training LoRA on 128-expert MoE layers. Metal caps out at 499,000 buffer descriptors, and that's it. Not a RAM problem. Our 192 GB machine crashed with only 60 GB in use. Only rank 2 makes it through alive.

The Problem

We ran into this while training LoRA adapters on Qwen3.5-35B-A3B’s MoE layers (128 experts per layer, SwitchLinear architecture). Nobody documents this failure mode. Not in MLX issues, not in Apple’s developer resources. Training just dies with:

[metal::malloc] Resource limit (499000) exceeded

Same crash on a 64 GB Mac Studio and a 192 GB Mac Pro. On the bigger machine, peak memory at the moment of failure was only ~60 GB. So it's not byte-OOM. It's a hard cap in the Metal API on how many buffer descriptors (handles) can exist at the same time.

The Mechanism

Here's what happens under the hood. LoRA on a SwitchLinear layer with 128 experts spawns LoRA parameters for every single expert. As the gradient-checkpointed forward/backward pass runs, Metal piles up buffer descriptors for each intermediate tensor. Several things drive the count up:

Number of experts (128 per layer)
Number of target layers (3 layers = {8, 14, 20})
LoRA rank (higher rank = more parameters = more descriptors per step)
Training iteration (buffers pile up across steps until something clears them)

With rank 2, the accumulation stays slow enough that you can finish 200 iterations before bumping into the 499,000 wall. Rank 4 and above? You hit the ceiling somewhere between iteration 40 and 125.

Rank Survival Matrix

LoRA Rank	Max Iterations Before Crash	Usable Checkpoints	Status
Rank 2	200+ (completes)	50, 100, 150, 200	Viable
Rank 4	~40	None	Dead
Rank 8	~100–125	50, 100	Partial

You'll notice something odd: rank 8 actually survives longer than rank 4. We don't think that's meaningful. It probably comes down to how MLX lays out LoRA parameter buffers differently at different sizes.

Workarounds Attempted (All Failed)

We tried everything we could think of. None of it solved the problem for ranks above 2.

Approach	Result
Gradient checkpointing	Already on by default in MLX for each decoder layer. Can't squeeze anything more out of it.
Batch size = 1	Already as small as it gets. Doesn't touch the descriptor count anyway.
clear-cache-threshold = 1	This one actually lets rank 2 finish by clearing descriptors every step. Still doesn't save rank 4+.
MLX_MAX_OPS_PER_BUFFER env var	Counterproductive. Every value we tried crashed sooner than the default. The results were all over the place (ops=64 → ~iter 60; ops=16 → ~iter 50; ops=4 → ~iter 70).
Mid-step mx.eval (graph cut)	Illegal in MLX. You get “eval during function transformations not allowed”. Can't call eval inside value_and_grad/compile/vmap.
Wired memory limit increase (180 GB)	Zero effect. The problem is descriptor count, not bytes.

Root Cause Analysis

As far as we can tell, 499,000 is a hard constant in the Metal API. That's the maximum number of buffer objects allowed to exist simultaneously in a Metal device’s allocation pool. Every LoRA parameter matrix, on every expert, on every layer, contributes buffers. Backpropagation adds transient buffers for intermediate gradients on top of that. The total scales roughly as:

buffers ∝ num_layers × num_experts × rank × (forward + backward intermediates)

Do the math for our setup: 3 layers × 128 experts × rank 4 × ~8 buffers per LoRA op (A, B, grad_A, grad_B, plus intermediates). That's roughly 12,288 buffers per step just from LoRA parameters alone, and they accumulate across the computation graph until the hard limit kills the run.

Practical Impact

If you're training LoRA on large MoE models through MLX on Apple Silicon, here's what you need to know:

Rank 2 is the ceiling for LoRA targeting SwitchLinear experts on models with ≥128 experts
Adding RAM won't help. A 192 GB machine crashes at the exact same point as a 64 GB machine
The error message is misleading. “Resource limit exceeded” sounds like you're out of memory, but it's actually a descriptor-count limit
clear-cache-threshold=1 is required even for rank 2 to finish reliably when targeting multiple layers

Does Rank Matter?

Here's the silver lining. In our tests, rank 8 at 100 iterations (the most we could squeeze out before the crash) scored within +0.5 percentage points of rank 2 at 200 iterations on held-out evaluation. So the rank ceiling doesn't seem to be what's actually limiting model quality, at least not for our task (MMLU-Pro multiple-choice knowledge injection). The real bottleneck is somewhere else entirely. We dig into that in our companion article on the stacking confound.

In theory, you could get past the rank limit on Apple Silicon by writing a custom top-k SwitchLinear implementation, maybe ~150 lines, that skips buffer creation for inactive experts during LoRA forward/backward. We didn't bother. The quality gap between rank 2 and rank 8 was negligible for what we were doing.

Model: Qwen3.5-35B-A3B (hybrid MoE, 128 experts per SwitchLinear layer, 40 MoE layers total). Target: layers {8,14,20} down_proj including switch_mlp. Hardware: Mac Studio M2 Ultra 64 GB, Mac Pro M2 Ultra 192 GB. Framework: MLX / mlx_lm.

Metal Buffer Limits Block LoRA Scaling on MoE Models

The Problem

The Mechanism

Rank Survival Matrix

Workarounds Attempted (All Failed)

Root Cause Analysis

Practical Impact

Does Rank Matter?

Continue Reading

Identity Survives LoRA Stacking

The Stacking Confound: Why LoRA Recovery Numbers Lie

MoE Routing Layers Converge Across Subjects