Training LoRA on 128-expert MoE layers hits a hard Metal buffer-count ceiling at 499,000 descriptors. This isn't a RAM problem: machines with 192 GB of unified memory crash at 60 GB usage. Only rank 2 survives.
The Problem
When training LoRA adapters on Qwen3.5-35B-A3B's MoE layers (128 experts per layer, SwitchLinear architecture), we hit a hard failure mode that we could not find documented in any MLX issue or Apple developer resource. Training crashes with:
[metal::malloc] Resource limit (499000) exceeded
This occurs identically on a 64 GB Mac Studio and a 192 GB Mac Pro. Peak memory usage at the crash is only ~60 GB on the 192 GB machine. The failure is not a byte-level out-of-memory condition; it is a Metal limit on the total number of buffer descriptors (handles) that can be alive simultaneously.
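Because the message reads like an out-of-memory error, it can help to classify it mechanically. The following is a minimal sketch, assuming the message format quoted above; `parse_resource_limit` and `looks_like_descriptor_limit` are hypothetical helper names, and the 50% headroom threshold is an arbitrary illustration rather than a measured value.

```python
import re
from typing import Optional

# Pattern copied from the crash message quoted above. Other MLX or Metal
# versions may word the error differently (assumption).
RESOURCE_LIMIT_RE = re.compile(r"\[metal::malloc\] Resource limit \((\d+)\) exceeded")


def parse_resource_limit(message: str) -> Optional[int]:
    """Return the limit quoted in the error, or None if the message does not match."""
    match = RESOURCE_LIMIT_RE.search(message)
    return int(match.group(1)) if match else None


def looks_like_descriptor_limit(message: str, peak_gb: float, total_gb: float,
                                headroom: float = 0.5) -> bool:
    """Heuristic: a resource-limit error raised while memory use is well below
    capacity points at a descriptor-count limit rather than byte exhaustion."""
    return parse_resource_limit(message) is not None and peak_gb < headroom * total_gb


msg = "[metal::malloc] Resource limit (499000) exceeded"
print(parse_resource_limit(msg))                      # 499000
print(looks_like_descriptor_limit(msg, 60.0, 192.0))  # True
```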
The Mechanism
LoRA on a SwitchLinear layer with 128 experts creates LoRA parameters for each expert. During the gradient-checkpointed forward/backward pass, Metal accumulates buffer descriptors for every intermediate tensor. The count scales with:
- Number of experts (128 per layer)
- Number of target layers (3 layers = {8, 14, 20})
- LoRA rank (higher rank = more parameters = more descriptors per step)
- Training iteration (buffers accumulate across steps until cleared)
At rank 2, the accumulation is slow enough that 200 iterations complete before hitting 499,000. At rank 4 and above, the ceiling is reached within the first 40–125 iterations.
Rank Survival Matrix
| LoRA Rank | Max Iterations Before Crash | Usable Checkpoints | Status |
|---|---|---|---|
| Rank 2 | 200+ (completes) | 50, 100, 150, 200 | Viable |
| Rank 4 | ~40 | None | Dead |
| Rank 8 | ~100–125 | 50, 100 | Partial |
The non-monotonicity (rank 8 survives longer than rank 4) likely reflects memory layout differences in how MLX allocates LoRA parameter buffers at different sizes, not a meaningful trend.
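The "Usable Checkpoints" column follows directly from where each run died. A small sketch of that bookkeeping, assuming a checkpoint is written every 50 iterations (inferred from the checkpoint numbers in the table, not stated as a setting) and a planned run of 200 iterations; `usable_checkpoints` is a hypothetical helper:

```python
from typing import List, Optional


def usable_checkpoints(last_completed_iter: Optional[int],
                       save_every: int = 50,
                       max_iters: int = 200) -> List[int]:
    """List the checkpoints written before a run stopped.

    last_completed_iter is the last iteration that finished before the crash,
    or None if the run completed. save_every=50 is an assumption inferred from
    the table above.
    """
    last = max_iters if last_completed_iter is None else min(last_completed_iter, max_iters)
    return list(range(save_every, last + 1, save_every))


print(usable_checkpoints(None))  # rank 2, run completes: [50, 100, 150, 200]
print(usable_checkpoints(40))    # rank 4, crash near iteration 40: []
print(usable_checkpoints(125))   # rank 8, crash near iteration 125: [50, 100]
```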
Workarounds Attempted (None Rescued Rank 4+)
| Approach | Result |
|---|---|
| Gradient checkpointing | Already enabled (MLX default for each decoder layer). No further reduction possible. |
| Batch size = 1 | Already minimal. Doesn't affect descriptor count. |
| clear-cache-threshold = 1 | Enables rank 2 to complete (clears descriptors every step). Doesn't save rank 4+. |
| MLX_MAX_OPS_PER_BUFFER env var | Counterproductive: all values crashed earlier than the default. Results were non-monotonic (ops=64 → ~iter 60; ops=16 → ~iter 50; ops=4 → ~iter 70). |
| Mid-step mx.eval (graph cut) | Illegal in MLX: "eval during function transformations not allowed". Can't eval inside value_and_grad/compile/vmap. |
| Wired memory limit increase (180 GB) | No effect: the problem is descriptor count, not bytes. |
Root Cause Analysis
The 499,000 limit appears to be a Metal API constant: the maximum number of buffer objects that can exist simultaneously in a Metal device's allocation pool. Each LoRA parameter matrix on each expert on each layer contributes buffers. During backpropagation, intermediate gradients create additional transient buffers. The total scales as:
buffers ∝ num_layers × num_experts × rank × (forward + backward intermediates)
With 3 layers × 128 experts × rank 4 × ~8 buffers per LoRA op (A, B, grad_A, grad_B, intermediates), that is ~12,288 buffers per step from the LoRA parameters alone, accumulating across the graph until the hard limit is reached.
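The estimate above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch, not a measurement: it takes the proportionality literally, uses the rough figure of 8 buffers per LoRA op, and assumes nothing is released between steps. The function names are hypothetical.

```python
METAL_RESOURCE_LIMIT = 499_000  # value quoted in the crash message


def lora_buffers_per_step(num_layers: int, num_experts: int, rank: int,
                          buffers_per_op: int = 8) -> int:
    """Buffers per step under the scaling relation above (rough estimate)."""
    return num_layers * num_experts * rank * buffers_per_op


def steps_until_limit(per_step: int, limit: int = METAL_RESOURCE_LIMIT) -> int:
    """Whole steps that fit under the limit if no buffer is ever released (assumption)."""
    return limit // per_step


per_step = lora_buffers_per_step(num_layers=3, num_experts=128, rank=4)
print(per_step)                     # 12288
print(steps_until_limit(per_step))  # 40
```

Under these assumptions the rank 4 figure lands near the observed crash at about 40 iterations. The same linear model would predict an earlier crash for rank 8, which is not what the rank survival matrix shows, so it is best read as an order-of-magnitude check rather than a predictive formula.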
Practical Impact
For anyone training LoRA on large MoE models via MLX on Apple Silicon:
- Rank 2 is the ceiling for LoRA targeting SwitchLinear experts on models with ≥128 experts
- Adding RAM won't help: a 192 GB machine crashes at the same point as a 64 GB machine
- The error message is misleading: "Resource limit exceeded" sounds like OOM but is a descriptor-count limit
- clear-cache-threshold=1 is required even for rank 2 to complete reliably on multi-layer targets
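The effect of clearing every step can be sketched as a training loop that releases cached buffers between iterations. This is a framework-agnostic illustration, not the implementation behind the clear-cache-threshold option: `train_with_cache_clearing`, `step_fn`, and `clear_cache_fn` are hypothetical names. On MLX one would presumably pass a cache-clearing function such as `mx.clear_cache`, which is an assumption about the installed version and is not verified here.

```python
from typing import Callable, List


def train_with_cache_clearing(step_fn: Callable[[int], float],
                              clear_cache_fn: Callable[[], None],
                              iters: int,
                              clear_every: int = 1) -> List[float]:
    """Run `iters` training steps, releasing cached buffers every `clear_every` steps.

    The clear happens between steps, outside any differentiated function, so it
    does not run into the "eval during function transformations" restriction.
    """
    if clear_every < 1:
        raise ValueError("clear_every must be >= 1")
    losses = []
    for it in range(1, iters + 1):
        losses.append(step_fn(it))  # one full forward/backward/update (hypothetical)
        if it % clear_every == 0:
            clear_cache_fn()        # e.g. a framework cache-clear call (assumption)
    return losses


# Usage with stand-in callables, to show the call pattern only.
cleared = []
losses = train_with_cache_clearing(lambda it: 1.0 / it, lambda: cleared.append(True), iters=4)
print(len(losses), len(cleared))  # 4 4
```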
Does Rank Matter?
In our experiments, rank 8 at 100 iterations (the most we could extract before crash) scored within +0.5 percentage points of rank 2 at 200 iterations on held-out evaluation. The rank ceiling doesn't appear to be the binding constraint on model quality, at least for this task (MMLU-Pro multiple-choice knowledge injection). The true binding constraint is elsewhere (see our companion article on the stacking confound).
The only path to higher ranks on Apple Silicon would be a custom top-k SwitchLinear implementation (~150 lines) that avoids creating buffers for inactive experts during LoRA forward/backward. We did not pursue this, as the rank-2-vs-8 quality difference was negligible for our use case.
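The core idea of such an implementation, touching only the experts that the router actually selected, can be illustrated with the routing bookkeeping alone. This is a pure-Python sketch of the grouping step, not the ~150-line implementation mentioned above, and whether it would reduce Metal descriptor counts in practice is untested. The function names and the example routing are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_tokens_by_expert(topk_indices: Sequence[Sequence[int]]) -> Dict[int, List[int]]:
    """Map each selected expert to the tokens routed to it.

    topk_indices[t] holds the expert ids chosen for token t. Experts that
    receive no tokens never appear, so no LoRA work is scheduled for them.
    """
    groups: Dict[int, List[int]] = defaultdict(list)
    for token, experts in enumerate(topk_indices):
        for expert in experts:
            groups[expert].append(token)
    return dict(groups)


def active_fraction(topk_indices: Sequence[Sequence[int]], num_experts: int) -> float:
    """Fraction of experts that would need LoRA buffers for this batch."""
    return len(group_tokens_by_expert(topk_indices)) / num_experts


# Three tokens, each routed to two experts (made-up routing for illustration).
routing = [[3, 7], [7, 100], [3, 100]]
print(group_tokens_by_expert(routing))  # {3: [0, 2], 7: [0, 1], 100: [1, 2]}
print(active_fraction(routing, 128))    # 0.0234375
```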
Model: Qwen3.5-35B-A3B (hybrid MoE, 128 experts per SwitchLinear layer, 40 MoE layers total). Target: layers {8,14,20} down_proj including switch_mlp. Hardware: Mac Studio M2 Ultra 64 GB, Mac Pro M2 Ultra 192 GB. Framework: MLX / mlx_lm.