Metal Buffer Limits Block LoRA Scaling on MoE Models

Training LoRA on 128-expert MoE layers hits a hard Metal buffer-count ceiling at 499,000 descriptors. This isn't a RAM problem: a machine with 192 GB of unified memory crashes at about 60 GB of usage. Only rank 2 completes a full run.

The Problem

When training LoRA adapters on Qwen3.5-35B-A3B's MoE layers (128 experts per layer, SwitchLinear architecture), we hit a hard failure mode that we could not find documented in any MLX issue or Apple developer resource. Training crashes with:

[metal::malloc] Resource limit (499000) exceeded

This occurs identically on a 64 GB Mac Studio and a 192 GB Mac Pro. Peak memory usage at the crash is only ~60 GB on the 192 GB machine. The failure is not a byte-level out-of-memory condition; it is a limit on the total number of Metal buffer descriptors (handles) that can be alive simultaneously.
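
Because the message is easy to mistake for an ordinary out-of-memory error, it can help to check training logs for it explicitly. The snippet below is a minimal sketch: the function name and the sample log text are hypothetical, and only the error line itself comes from the crash shown above.

```python
import re

# Matches the error line quoted above. The numeric limit is captured rather
# than hard-coded, in case it differs on another setup.
RESOURCE_LIMIT_RE = re.compile(
    r"\[metal::malloc\] Resource limit \((\d+)\) exceeded"
)


def parse_resource_limit(log_text):
    """Return the descriptor limit reported in the log, or None if absent."""
    match = RESOURCE_LIMIT_RE.search(log_text)
    return int(match.group(1)) if match else None


# Hypothetical log excerpt; only the second line is taken from the article.
sample = (
    "...earlier training output...\n"
    "[metal::malloc] Resource limit (499000) exceeded\n"
)

limit = parse_resource_limit(sample)
if limit is not None:
    print(f"Descriptor-count limit hit ({limit} buffers), not byte OOM.")
```

A check like this only classifies the failure; it does nothing to prevent it.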

The Mechanism

LoRA on a SwitchLinear layer with 128 experts creates LoRA parameters for each expert. During the gradient-checkpointed forward/backward pass, Metal accumulates buffer descriptors for every intermediate tensor. The count scales with:

Number of experts (128 per layer)
Number of target layers (3 layers = {8, 14, 20})
LoRA rank (higher rank = more parameters = more descriptors per step)
Training iteration (buffers accumulate across steps until cleared)

At rank 2, with the cache cleared every step (clear-cache-threshold = 1), accumulation is slow enough that all 200 iterations complete before the count reaches 499,000. At the higher ranks we tested (4 and 8), the ceiling is reached after roughly 40–125 iterations.

Rank Survival Matrix

LoRA Rank	Max Iterations Before Crash	Usable Checkpoints	Status
Rank 2	200+ (completes)	50, 100, 150, 200	Viable
Rank 4	~40	None	Dead
Rank 8	~100–125	50, 100	Partial

The non-monotonicity (rank 8 survives longer than rank 4) likely reflects memory layout differences in how MLX allocates LoRA parameter buffers at different sizes, not a meaningful trend.
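
The "Usable Checkpoints" column follows directly from where each run died. As a small sketch, assuming a checkpoint is written every 50 iterations (inferred from the table, not stated as a setting) and using the approximate crash points above, the helper below reproduces that column. The function name is hypothetical.

```python
def usable_checkpoints(last_completed_iter, save_every=50):
    """List the checkpoint iterations written at or before the last completed one."""
    return list(range(save_every, last_completed_iter + 1, save_every))


# Approximate last completed iteration per rank, taken from the table above.
# Rank 8 uses the lower bound of its ~100-125 range.
observed = {2: 200, 4: 40, 8: 100}

for rank, last_iter in observed.items():
    print(f"rank {rank}: checkpoints {usable_checkpoints(last_iter)}")
```

Under these assumptions, rank 4 never reaches its first checkpoint, which is why it yields nothing usable even though it does train for a few dozen steps.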

Workarounds Attempted (All Failed)

Approach	Result
Gradient checkpointing	Already enabled (MLX default for each decoder layer). No further reduction possible.
Batch size = 1	Already minimal. Doesn't affect descriptor count.
clear-cache-threshold = 1	Enables rank 2 to complete (clears descriptors every step). Doesn't save rank 4+.
MLX_MAX_OPS_PER_BUFFER env var	Counterproductive: all tested values crashed earlier than the default, with non-monotonic results (ops=64 → ~iter 60; ops=16 → ~iter 50; ops=4 → ~iter 70).
Mid-step mx.eval (graph cut)	Illegal in MLX: "eval during function transformations not allowed". mx.eval can't be called inside value_and_grad/compile/vmap.
Wired memory limit increase (180 GB)	No effect: the problem is descriptor count, not bytes.

Root Cause Analysis

The 499,000 limit appears to be a Metal API constant: the maximum number of buffer objects that can exist simultaneously in a Metal device's allocation pool. Each LoRA parameter matrix on each expert on each layer contributes buffers. During backpropagation, intermediate gradients create additional transient buffers. The total scales as:

buffers ∝ num_layers × num_experts × rank × (forward + backward intermediates)

With 3 layers × 128 experts × rank 4 × ~8 buffers per LoRA op (A, B, grad_A, grad_B, intermediates), that comes to ~12,288 buffers per step from the LoRA parameters alone, accumulating across the graph until the hard limit is reached.
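
The estimate above can be worked through for each rank. This is a back-of-the-envelope sketch, not a measurement: it takes the proportional formula at face value, uses the rough figure of 8 buffers per LoRA op, and assumes that no descriptors are released between steps.

```python
RESOURCE_LIMIT = 499_000   # limit reported in the error message
NUM_LAYERS = 3             # target layers {8, 14, 20}
NUM_EXPERTS = 128          # experts per SwitchLinear layer
BUFFERS_PER_LORA_OP = 8    # rough estimate: A, B, grad_A, grad_B, intermediates


def buffers_per_step(rank):
    """Estimated LoRA-related buffers created per training step."""
    return NUM_LAYERS * NUM_EXPERTS * rank * BUFFERS_PER_LORA_OP


def steps_until_limit(rank):
    """Whole steps before the estimate exceeds the limit, if nothing is freed."""
    return RESOURCE_LIMIT // buffers_per_step(rank)


for rank in (2, 4, 8):
    print(f"rank {rank}: {buffers_per_step(rank)} buffers/step, "
          f"~{steps_until_limit(rank)} steps to the limit")
# rank 2: 6144 buffers/step, ~81 steps to the limit
# rank 4: 12288 buffers/step, ~40 steps to the limit
# rank 8: 24576 buffers/step, ~20 steps to the limit
```

The rank 4 figure lines up with the observed crash at about iteration 40, but the rank 8 figure does not match the observed ~100–125 iterations. The simple proportional model is therefore a rough guide at best, consistent with the non-monotonic behavior noted in the Rank Survival Matrix.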

Practical Impact

For anyone training LoRA on large MoE models via MLX on Apple Silicon:

Rank 2 is the ceiling for LoRA targeting SwitchLinear experts on models with ≥128 experts
Adding RAM won't help: a 192 GB machine crashes at the same point as a 64 GB machine
The error message is misleading: "Resource limit exceeded" sounds like OOM but is a descriptor-count limit
clear-cache-threshold=1 is required even for rank 2 to complete reliably on multi-layer targets

Does Rank Matter?

In our experiments, rank 8 at 100 iterations (the most we could extract before the crash) scored within 0.5 percentage points of rank 2 at 200 iterations on held-out evaluation. The rank ceiling therefore doesn't appear to be the binding constraint on model quality, at least for this task (MMLU-Pro multiple-choice knowledge injection). The binding constraint lies elsewhere (see our companion article, "The Stacking Confound: Why LoRA Recovery Numbers Lie").

The only path to higher ranks on Apple Silicon would be a custom top-k SwitchLinear implementation (~150 lines) that avoids creating buffers for inactive experts during LoRA forward/backward. We did not pursue this, as the rank-2-vs-8 quality difference was negligible for our use case.
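
The bookkeeping behind that idea can be illustrated in a few lines of plain Python. This is not MLX code and not the implementation described above, which we did not write. It only shows how grouping tokens by their routed experts identifies the small set of experts whose LoRA matrices would need to be touched in a given step. The router output, the top-2 setting, and the function name are all hypothetical.

```python
from collections import defaultdict

NUM_EXPERTS = 128  # experts per SwitchLinear layer


def group_tokens_by_expert(topk_indices):
    """Map each active expert id to the token positions routed to it."""
    groups = defaultdict(list)
    for token_idx, experts in enumerate(topk_indices):
        for expert_id in experts:
            groups[expert_id].append(token_idx)
    return dict(groups)


# Hypothetical router output: 4 tokens, each routed to its top-2 experts.
topk_indices = [[3, 17], [3, 90], [17, 42], [90, 3]]

groups = group_tokens_by_expert(topk_indices)
print(sorted(groups))                      # [3, 17, 42, 90]
print(f"{len(groups)} of {NUM_EXPERTS} experts active")
```

In this toy case only 4 of the 128 experts receive tokens, so a layer that allocated LoRA buffers per active expert would create far fewer descriptors than one that allocates for all 128. Whether that saving is achievable in practice inside MLX is untested here.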

Model: Qwen3.5-35B-A3B (hybrid MoE, 128 experts per SwitchLinear layer, 40 MoE layers total). Target: layers {8,14,20} down_proj including switch_mlp. Hardware: Mac Studio M2 Ultra 64 GB, Mac Pro M2 Ultra 192 GB. Framework: MLX / mlx_lm.
