Metal Buffer Limits Block LoRA Scaling on MoE Models

May 2026 · Black Sheep AI Research

Training LoRA on 128-expert MoE layers hits a hard Metal buffer-count ceiling at 499,000 descriptors. This isn't a RAM problem: machines with 192 GB of unified memory crash at 60 GB usage. Only rank 2 completes a full 200-iteration run.

The Problem

When training LoRA adapters on Qwen3.5-35B-A3B's MoE layers (128 experts per layer, SwitchLinear architecture), we hit a hard failure mode that we could not find documented in any MLX issue or Apple developer resource. Training crashes with:

[metal::malloc] Resource limit (499000) exceeded

This occurs identically on a 64 GB Mac Studio and a 192 GB Mac Pro. Peak memory usage at crash is only ~60 GB on the 192 GB machine. The failure is not a byte-level out-of-memory condition; it is a Metal API limit on the total number of buffer descriptors (handles) that can be alive simultaneously.
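
Because the crash looks like an out-of-memory failure at first glance, it can help to check logs for this specific message. The following is a minimal sketch that assumes the error text matches the line quoted above exactly; the function names are hypothetical, and other MLX versions may word the message differently.

```python
import re
from typing import Optional

# Pattern for the resource-limit message quoted in this article.
# Assumption: the wording is stable; adjust if your MLX build differs.
RESOURCE_LIMIT_RE = re.compile(
    r"\[metal::malloc\] Resource limit \((\d+)\) exceeded"
)


def parse_resource_limit(message: str) -> Optional[int]:
    """Return the reported descriptor limit, or None if the message
    is not the resource-limit error."""
    match = RESOURCE_LIMIT_RE.search(message)
    return int(match.group(1)) if match else None


def is_descriptor_limit_crash(message: str) -> bool:
    """True when the crash is a descriptor-count limit rather than byte OOM."""
    return parse_resource_limit(message) is not None


crash_line = "[metal::malloc] Resource limit (499000) exceeded"
print(parse_resource_limit(crash_line))
```

A check like this only separates the two failure classes in logs; it does nothing to avoid the limit.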

The Mechanism

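LoRA on a SwitchLinear layer with 128 experts creates LoRA parameters for each expert. During the gradient-checkpointed forward/backward pass, Metal accumulates buffer descriptors for every intermediate tensor. The count scales with the number of targeted layers, the number of experts per layer, the LoRA rank, and the number of forward and backward intermediates per LoRA op.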

At rank 2, the accumulation is slow enough that 200 iterations complete before hitting 499,000. At rank 4 and above, the ceiling is reached within the first 40–125 iterations.

Rank Survival Matrix

| LoRA Rank | Max Iterations Before Crash | Usable Checkpoints | Status |
| --- | --- | --- | --- |
| Rank 2 | 200+ (completes) | 50, 100, 150, 200 | Viable |
| Rank 4 | ~40 | None | Dead |
| Rank 8 | ~100–125 | 50, 100 | Partial |

The non-monotonicity (rank 8 survives longer than rank 4) likely reflects memory layout differences in how MLX allocates LoRA parameter buffers at different sizes, not a meaningful trend.

Workarounds Attempted (None Enabled Rank 4+)

| Approach | Result |
| --- | --- |
| Gradient checkpointing | Already enabled (MLX default for each decoder layer). No further reduction possible. |
| Batch size = 1 | Already minimal. Doesn't affect descriptor count. |
| clear-cache-threshold = 1 | Enables rank 2 to complete (clears descriptors every step). Doesn't save rank 4+. |
| MLX_MAX_OPS_PER_BUFFER env var | Counterproductive: all values crashed earlier than default. Non-monotonic results (ops=64 → ~iter 60; ops=16 → ~iter 50; ops=4 → ~iter 70). |
| Mid-step mx.eval (graph cut) | Illegal in MLX: "eval during function transformations not allowed". Can't eval inside value_and_grad/compile/vmap. |
| Wired memory limit increase (180 GB) | No effect: the problem is descriptor count, not bytes. |
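
The one workaround that helped, clearing cached buffers every step, has to run between training steps rather than inside the differentiated function, for the same reason a mid-step mx.eval is rejected. The sketch below shows that placement in a plain Python loop. It is an illustration, not the mlx_lm trainer: `run_steps`, `train_step`, and `clear_cache` are hypothetical names, and which MLX call to pass as `clear_cache` depends on your MLX version, so it is left as an injected callable.

```python
from typing import Any, Callable, Iterable


def run_steps(
    batches: Iterable[Any],
    train_step: Callable[[Any], None],
    clear_cache: Callable[[], None],
    clear_every: int = 1,
) -> int:
    """Run train_step on each batch and call clear_cache every
    `clear_every` completed steps. Returns the number of cache clears."""
    clears = 0
    for step, batch in enumerate(batches, start=1):
        # The whole forward/backward pass happens inside train_step.
        train_step(batch)
        # Clearing happens only after the step has finished, never
        # inside the function being differentiated.
        if step % clear_every == 0:
            clear_cache()
            clears += 1
    return clears


# Stand-in callables so the sketch runs anywhere.
clear_log = []
total_clears = run_steps(
    batches=range(4),
    train_step=lambda batch: None,
    clear_cache=lambda: clear_log.append("cleared"),
    clear_every=1,
)
print(total_clears)
```

With `clear_every=1` the cache is cleared after every step, which mirrors the clear-cache-threshold = 1 setting in the table only in spirit; the real setting may use a different trigger.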

Root Cause Analysis

The 499,000 limit appears to be a Metal API constant: the maximum number of buffer objects that can exist simultaneously in a Metal device's allocation pool. Each LoRA parameter matrix on each expert on each layer contributes buffers. During backpropagation, intermediate gradients create additional transient buffers. The total scales as:

buffers ∝ num_layers × num_experts × rank × (forward + backward intermediates)

With 3 targeted layers × 128 experts × rank 4 × ~8 buffers per LoRA op (A, B, grad_A, grad_B, intermediates), that is ~12,288 buffers per step from LoRA parameters alone, accumulating across the graph until the hard limit is reached.
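
The estimate above can be reproduced with a few lines of arithmetic. This sketch uses the article's proportionality as if it were an exact count and adds one assumption of its own: that no descriptors are released between steps. The function names and the default of 8 buffers per LoRA op are illustrative.

```python
METAL_RESOURCE_LIMIT = 499_000  # limit reported in the crash message


def lora_buffers_per_step(
    num_layers: int, num_experts: int, rank: int, buffers_per_op: int = 8
) -> int:
    """Rough per-step buffer count from the proportionality in the text."""
    return num_layers * num_experts * rank * buffers_per_op


def steps_until_limit(per_step: int, limit: int = METAL_RESOURCE_LIMIT) -> int:
    """Whole steps that fit under the limit, ASSUMING nothing is ever
    released between steps (a simplification, not a measured behavior)."""
    return limit // per_step


per_step_rank4 = lora_buffers_per_step(num_layers=3, num_experts=128, rank=4)
print(per_step_rank4)                      # 12288
print(steps_until_limit(per_step_rank4))   # 40
```

Under that no-release assumption, rank 4 yields 40 steps, which is close to the ~40 iterations observed, though this may be coincidental. The same model gives 81 steps for rank 2 and 20 for rank 8, which does not match the observed results, so the linear estimate is at best a rough guide.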

Practical Impact

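For anyone training LoRA on large MoE models via MLX on Apple Silicon, the results above have direct consequences: in our runs only rank 2 completed 200 iterations (with clear-cache-threshold = 1), rank 4 and rank 8 crashed before finishing, and adding unified memory did not help because the limit is on descriptor count, not bytes.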

Does Rank Matter?

In our experiments, rank 8 at 100 iterations (the most we could extract before crash) scored within +0.5 percentage points of rank 2 at 200 iterations on held-out evaluation. The rank ceiling doesn't appear to be the binding constraint on model quality, at least for this task (MMLU-Pro multiple-choice knowledge injection). The true binding constraint is elsewhere (see our companion article on the stacking confound).

The only path to higher ranks on Apple Silicon would be a custom top-k SwitchLinear implementation (~150 lines) that avoids creating buffers for inactive experts during LoRA forward/backward. We did not pursue this, as the rank-2-vs-8 quality difference was negligible for our use case.
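
To make the top-k idea concrete, here is a minimal pure-Python sketch of applying LoRA only to the experts a router selects. It is not the ~150-line MLX implementation mentioned above and was not part of our experiments: the function names, the unweighted sum over experts, and the value of `k` are all assumptions for illustration, and whether this pattern reduces Metal descriptor counts in practice is untested here.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def top_k_experts(router_scores: Vector, k: int) -> List[int]:
    """Indices of the k highest-scoring experts, best first."""
    order = sorted(
        range(len(router_scores)),
        key=lambda i: router_scores[i],
        reverse=True,
    )
    return order[:k]


def lora_delta_active_only(
    x: Vector,
    router_scores: Vector,
    lora_a: Dict[int, Matrix],  # per expert: rank x in_dim
    lora_b: Dict[int, Matrix],  # per expert: out_dim x rank
    k: int,
) -> Tuple[Vector, List[int]]:
    """Sum the LoRA updates B_e (A_e x) over the top-k experts only.
    Experts outside the top-k are never touched, so no intermediates
    are created for them."""
    active = top_k_experts(router_scores, k)
    delta = [0.0] * len(lora_b[active[0]])
    for e in active:
        hidden = matvec(lora_a[e], x)       # rank-sized intermediate
        update = matvec(lora_b[e], hidden)  # out_dim-sized update
        delta = [d + u for d, u in zip(delta, update)]
    return delta, active


# Toy example: 4 experts, input dim 2, rank 1, output dim 2, top-2 routing.
x = [1.0, 2.0]
scores = [0.1, 0.9, 0.3, 0.8]
lora_a = {e: [[float(e), 0.0]] for e in range(4)}
lora_b = {e: [[1.0], [2.0]] for e in range(4)}
delta, active = lora_delta_active_only(x, scores, lora_a, lora_b, k=2)
print(active, delta)
```

In the toy example only two of the four experts produce intermediates; the intent of a real implementation would be to get the same effect for 128 experts per layer.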


Model: Qwen3.5-35B-A3B (hybrid MoE, 128 experts per SwitchLinear layer, 40 MoE layers total). Target: layers {8,14,20} down_proj including switch_mlp. Hardware: Mac Studio M2 Ultra 64 GB, Mac Pro M2 Ultra 192 GB. Framework: MLX / mlx_lm.

Continue Reading

Related research from our team.

Identity Survives LoRA Stacking

Persona preservation holds at 18/20 across all stacking conditions on a 128-expert MoE model.

The Stacking Confound: Why LoRA Recovery Numbers Lie

~80% of apparent knowledge injection is a weight-perturbation artifact, not learned facts.

MoE Routing Layers Converge Across Subjects

Per-subject top-3 routing layers collapse to a shared backbone. Domain-specific targeting offers no advantage.