Metal Buffer Limits Block LoRA Scaling on MoE Models

May 2026 · Black Sheep AI Research

We hit a wall training LoRA adapters on 128-expert MoE layers: Metal allows at most 499,000 live buffer descriptors. This is not a RAM problem. Our 192 GB machine crashed with only ~60 GB in use. Only rank 2 trains to completion.

The Problem

We ran into this while training LoRA adapters on Qwen3.5-35B-A3B’s MoE layers (128 experts per layer, SwitchLinear architecture). We could not find this failure mode documented anywhere, neither in MLX issues nor in Apple’s developer resources. Training just dies with:

[metal::malloc] Resource limit (499000) exceeded

Same crash on a 64 GB Mac Studio and a 192 GB Mac Pro. On the bigger machine, peak memory at the moment of failure was only ~60 GB. So it's not byte-OOM. It's a hard cap in the Metal API on how many buffer descriptors (handles) can exist at the same time.
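
Since the crash looks like an out-of-memory failure at first glance, it can help to check logs for this specific message. The sketch below is a minimal, hypothetical helper (the function name and return values are ours, not part of MLX) that matches the error string quoted above and extracts the reported limit:

```python
import re

# Matches the message quoted above. The numeric limit is captured rather than
# hard-coded, in case it differs on other MLX or macOS versions (assumption).
RESOURCE_LIMIT_RE = re.compile(
    r"\[metal::malloc\] Resource limit \((\d+)\) exceeded"
)


def classify_crash(log_text):
    """Return ("resource_limit", limit) if the descriptor-limit message
    appears in log_text, otherwise ("other", None)."""
    match = RESOURCE_LIMIT_RE.search(log_text)
    if match:
        return "resource_limit", int(match.group(1))
    return "other", None


print(classify_crash("[metal::malloc] Resource limit (499000) exceeded"))
# ('resource_limit', 499000)
```

A run that returns "resource_limit" while peak memory sits far below physical RAM is the signature described in this section.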

The Mechanism

Here's what happens under the hood. LoRA on a SwitchLinear layer with 128 experts spawns LoRA parameters for every single expert. As the gradient-checkpointed forward/backward pass runs, Metal piles up buffer descriptors for each intermediate tensor. Several things drive the count up:

With rank 2, the accumulation stays slow enough that you can finish 200 iterations before bumping into the 499,000 wall. Rank 4 and above? You hit the ceiling somewhere between iteration 40 and 125.

Rank Survival Matrix

LoRA Rank | Max Iterations Before Crash | Usable Checkpoints | Status
Rank 2 | 200+ (completes) | 50, 100, 150, 200 | Viable
Rank 4 | ~40 | None | Dead
Rank 8 | ~100–125 | 50, 100 | Partial

You'll notice something odd: rank 8 actually survives longer than rank 4. We don't think that's meaningful. It probably comes down to how MLX lays out LoRA parameter buffers differently at different sizes.
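
The "Usable Checkpoints" column follows directly from where each run died. A small sketch, assuming checkpoints are written every 50 iterations over a 200-iteration run (both values are inferred from the matrix, not stated settings) and that a checkpoint only counts if its iteration finished before the crash:

```python
def usable_checkpoints(crash_iter, save_every=50, total_iters=200):
    """List the checkpoint iterations written before the run died.

    crash_iter: iteration at which the run crashed, or None if it completed.
    save_every, total_iters: assumed values inferred from the matrix above.
    """
    if crash_iter is None:
        last_completed = total_iters
    else:
        last_completed = min(crash_iter - 1, total_iters)
    return list(range(save_every, last_completed + 1, save_every))


print(usable_checkpoints(None))  # rank 2, completes: [50, 100, 150, 200]
print(usable_checkpoints(40))    # rank 4, dies near 40: []
print(usable_checkpoints(110))   # rank 8, dies past 100: [50, 100]
```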

Workarounds Attempted (All Failed)

We tried everything we could think of. None of it solved the problem for ranks above 2.

Approach | Result
Gradient checkpointing | Already on by default in MLX for each decoder layer. Can't squeeze anything more out of it.
Batch size = 1 | Already as small as it gets. Doesn't touch the descriptor count anyway.
clear-cache-threshold = 1 | This one actually lets rank 2 finish by clearing descriptors every step. Still doesn't save rank 4+.
MLX_MAX_OPS_PER_BUFFER env var | Counterproductive. Every value we tried crashed sooner than the default. The results were all over the place (ops=64 → ~iter 60; ops=16 → ~iter 50; ops=4 → ~iter 70).
Mid-step mx.eval (graph cut) | Illegal in MLX. You get “eval during function transformations not allowed”. Can't call eval inside value_and_grad/compile/vmap.
Wired memory limit increase (180 GB) | Zero effect. The problem is descriptor count, not bytes.

Root Cause Analysis

As far as we can tell, 499,000 is a hard constant in the Metal API. That's the maximum number of buffer objects allowed to exist simultaneously in a Metal device’s allocation pool. Every LoRA parameter matrix, on every expert, on every layer, contributes buffers. Backpropagation adds transient buffers for intermediate gradients on top of that. The total scales roughly as:

buffers ∝ num_layers × num_experts × rank × (forward + backward intermediates)

Do the math for our setup: 3 layers × 128 experts × rank 4 × ~8 buffers per LoRA op (A, B, grad_A, grad_B, plus intermediates). That's roughly 12,288 buffers per step just from LoRA parameters alone, and they accumulate across the computation graph until the hard limit kills the run.
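
The arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch, not a measurement: the function names are ours, the factor of 8 buffers per LoRA op is the rough estimate from the text, and `steps_until_limit` assumes the worst case in which no descriptor is ever released between steps.

```python
METAL_RESOURCE_LIMIT = 499_000  # limit reported in the error message


def lora_buffers_per_step(num_layers, num_experts, rank, buffers_per_op=8):
    """Rough per-step buffer count, following the proportionality above."""
    return num_layers * num_experts * rank * buffers_per_op


def steps_until_limit(per_step, limit=METAL_RESOURCE_LIMIT):
    """Whole steps that fit under the limit if nothing is ever released
    (a worst-case assumption, not observed allocator behavior)."""
    return limit // per_step


per_step = lora_buffers_per_step(num_layers=3, num_experts=128, rank=4)
print(per_step)                     # 12288
print(steps_until_limit(per_step))  # 40
```

For rank 4 this worst-case figure lands close to the ~40 iterations we observed. The same model predicts about 20 steps for rank 8, which actually ran to ~100–125, so treat it as a rough guide only: it ignores whatever the allocator releases or reuses between steps.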

Practical Impact

If you're training LoRA on large MoE models through MLX on Apple Silicon, here's what you need to know:

Does Rank Matter?

Here's the silver lining. In our tests, rank 8 at 100 iterations (the most we could squeeze out before the crash) scored within 0.5 percentage points of rank 2 at 200 iterations on held-out evaluation. So the rank ceiling doesn't seem to be what's actually limiting model quality, at least not for our task (MMLU-Pro multiple-choice knowledge injection). The real bottleneck is somewhere else entirely. We dig into that in our companion article on the stacking confound.

In theory, you could get past the rank limit on Apple Silicon by writing a custom top-k SwitchLinear implementation, maybe ~150 lines, that skips buffer creation for inactive experts during LoRA forward/backward. We didn't bother. The quality gap between rank 2 and rank 8 was negligible for what we were doing.
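
To make the idea concrete, here is a toy illustration in plain Python. It is not MLX code and not the implementation we had in mind; the routing indices are made up and the helper names are hypothetical. It only counts how many per-expert LoRA (A, B) pairs a step would need if inactive experts were skipped:

```python
def active_experts(topk_indices):
    """Return the sorted expert ids that received at least one token.

    topk_indices: one list of selected expert ids per token
    (hypothetical router output).
    """
    return sorted({expert for token in topk_indices for expert in token})


def lora_pairs_needed(topk_indices, num_experts=128, skip_inactive=True):
    """Count the per-expert LoRA (A, B) pairs a step would have to touch."""
    if not skip_inactive:
        return num_experts
    return len(active_experts(topk_indices))


# Two tokens, each routed to 3 experts (made-up ids).
routing = [[5, 17, 90], [5, 42, 101]]
print(active_experts(routing))                          # [5, 17, 42, 90, 101]
print(lora_pairs_needed(routing))                       # 5
print(lora_pairs_needed(routing, skip_inactive=False))  # 128
```

How much this would save in practice depends on how many distinct experts the tokens in a step actually hit.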


Model: Qwen3.5-35B-A3B (hybrid MoE, 128 experts per SwitchLinear layer, 40 MoE layers total). Target: layers {8,14,20} down_proj including switch_mlp. Hardware: Mac Studio M2 Ultra 64 GB, Mac Pro M2 Ultra 192 GB. Framework: MLX / mlx_lm.

Continue Reading

Related research from our team.

Identity Survives LoRA Stacking

Persona preservation holds at 18/20 across all stacking conditions on a 128-expert MoE model.

The Stacking Confound: Why LoRA Recovery Numbers Lie

~80% of apparent knowledge injection is a weight-perturbation artifact, not learned facts.

MoE Routing Layers Converge Across Subjects

Per-subject top-3 routing layers collapse to a shared backbone. Domain-specific targeting offers no advantage.
