The conventional wisdom: power-of-2 bit widths are always faster. Our benchmark says otherwise — and the reason has nothing to do with the dequantization kernels themselves.
When we expanded RAM’s configuration space to include 5-bit and 6-bit quantization, we expected a throughput penalty. MLX’s Metal dequantization kernels are optimised for power-of-2 bit widths (2, 4, 8), which pack neatly into byte boundaries; non-power-of-2 widths require irregular bit extraction. We benchmarked three configurations on Qwen3.5-35B-A3B (M2 Ultra, 192 GB) to quantify the cost.
What we found was the opposite of what we expected.
## The benchmark
Three RAM speed modes, same 30 GB budget, same model, same hardware. Each tested at three prompt lengths (128, 512, 2048 tokens) with 256 tokens generated, 1 warmup run, 3 measured runs, median reported:
| Model Config | Bit Distribution | Gen TPS (128) | Gen TPS (512) | Gen TPS (2048) |
|---|---|---|---|---|
| Fast (4/8-bit only) | 54% 4-bit, 38% 8-bit | 69.6 | 68.7 | 68.6 |
| Balanced (mostly 6-bit) | 81% 6-bit, 8% 8-bit | 70.5 | 69.9 | 69.6 |
| Full (3–8-bit mix) | 15% 5-bit, 71% 6-bit | 68.7 | 67.9 | 67.7 |
The balanced mode — 81% of parameters at 6-bit, a non-power-of-2 width — is 1.3–1.7% faster at generation than the fast mode, which uses only power-of-2 bit widths. The advantage holds at all three prompt lengths.
Meanwhile, the full mode (which adds 5-bit into the mix) is 1.3% slower than fast mode. So it’s not that non-power-of-2 is inherently faster — something specific about the balanced allocation makes it win.
## The mechanism: kernel dispatch homogeneity
During autoregressive generation, each token requires a forward pass through every layer of the model. Each linear layer performs a quantised matrix multiplication, which dispatches a Metal compute kernel. The kernel is specialised for the layer’s bit width.
In the fast mode allocation, the model has 54% of parameters at 4-bit and 38% at 8-bit. This means the GPU alternates between 4-bit and 8-bit dequantization kernels as it moves through the network. Every alternation involves:
- A different compute pipeline state (the Metal shader for that bit width)
- Different memory access patterns (4-bit reads half the bytes of 8-bit per parameter)
- Different register allocation within the GPU threadgroup
In the balanced mode, 81% of layers use the same 6-bit kernel. The GPU settles into a rhythm: same pipeline, same access pattern, same register layout, layer after layer. The Metal command encoder can batch consecutive dispatches more efficiently when they use the same pipeline state. The GPU’s instruction cache stays warm. Branch prediction (yes, GPUs have it) stabilises.
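To make the dispatch-homogeneity argument concrete, here is a toy sketch that counts how often consecutive layers would require a different dequantization pipeline. The layer sequences are hypothetical stand-ins, not RAM’s actual allocations:

```python
# Toy illustration: count pipeline-state switches between consecutive
# quantized layers. The sequences below are hypothetical, not RAM's
# real per-layer allocations.

def count_switches(bit_widths):
    """Number of adjacent layer pairs that need a different kernel."""
    return sum(1 for a, b in zip(bit_widths, bit_widths[1:]) if a != b)

# "Fast"-style allocation: 4-bit and 8-bit interleaved through the network.
fast = [4, 8, 4, 4, 8, 4, 8, 4, 8, 8] * 10

# "Balanced"-style allocation: almost every layer at 6-bit.
balanced = [6] * 92 + [8] * 8

print(count_switches(fast))      # many pipeline changes
print(count_switches(balanced))  # almost none
```

The point is not the exact numbers but the shape: a mostly-uniform allocation makes pipeline switches rare, so consecutive dispatches can reuse the same compute pipeline state.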
This is why the full mode is slower despite also being mostly 6-bit (71%): it adds a 15% 5-bit allocation that reintroduces heterogeneity. The 5-bit kernel itself is also the slowest non-power-of-2 kernel due to its 8-values-per-5-bytes packing, but the dispatch switching matters more than the kernel’s intrinsic speed.
## Why prompt processing tells a different story
| Mode | Prompt TPS (2048) | TTFT (2048) |
|---|---|---|
| Fast | 1,760 tok/s | 1,164 ms |
| Balanced | 1,579 tok/s | 1,297 ms |
| Full | 1,629 tok/s | 1,257 ms |
Prompt processing flips the result: fast mode is 11% faster (1,760 vs 1,579 tok/s). This makes sense when you consider the difference in workload.
During generation, each layer processes a single token vector (batch size 1). The bottleneck is memory bandwidth: reading the weight matrix. The actual dequantization compute is small relative to the memory transfer, so kernel dispatch overhead dominates.
During prompt processing, each layer processes the full prompt (batch size 2048). The bottleneck shifts to compute: the quantised matrix multiplication is doing real work. Here, the intrinsic speed of the dequantization kernel matters more than dispatch overhead, and 4-bit’s byte-aligned unpacking is genuinely faster than 6-bit’s non-aligned extraction.
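A back-of-envelope arithmetic-intensity estimate makes the regime change visible. The layer shape below (4096 x 4096) is a hypothetical example, not Qwen’s actual dimensions:

```python
# Back-of-envelope arithmetic intensity for one quantized linear layer.
# d_in = d_out = 4096 is a hypothetical shape, not the model's real dims.

def arithmetic_intensity(d_in, d_out, batch, bits):
    """FLOPs per byte of weight traffic for a quantized matmul."""
    weight_bytes = d_in * d_out * bits / 8   # dominant memory traffic
    flops = 2 * d_in * d_out * batch         # one multiply-accumulate pair
    return flops / weight_bytes

# Generation: batch 1 -> a few FLOPs per weight byte (bandwidth bound).
gen = arithmetic_intensity(4096, 4096, batch=1, bits=6)

# Prompt processing: batch 2048 -> thousands of FLOPs per byte (compute bound).
prompt = arithmetic_intensity(4096, 4096, batch=2048, bits=6)

print(f"generation: {gen:.1f} FLOPs/byte, prompt: {prompt:.0f} FLOPs/byte")
```

Intensity scales linearly with batch size (it simplifies to 16 x batch / bits), which is why dispatch overhead dominates at batch 1 while intrinsic kernel speed dominates at batch 2048.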
## The practical takeaway
For interactive LLM use — chatbots, coding assistants, document analysis — generation speed is what users feel. Prompt processing happens once per turn; generation determines the sustained experience. A user does not notice a 130 ms difference in time-to-first-token (1,164 vs 1,297 ms), but they notice if tokens stream noticeably faster or slower.
At 69.6 tok/s versus 68.6 tok/s, the balanced mode is imperceptibly faster — but the real win is the quality: PPL 6.587 (matching BF16) versus 6.627 (+0.6% worse). You get better quality and faster generation by using a mostly-uniform 6-bit allocation instead of a heterogeneous 4/8-bit mix.
The broader lesson: bit-width uniformity matters for inference speed, not just bit-width magnitude. A mixed-precision allocation where most tensors share the same configuration can outperform one with “faster” individual kernels but more switching between them. Mixed-precision quantization frameworks should consider kernel dispatch homogeneity as an optimisation target alongside reconstruction error and model size.
## Methodology
All benchmarks on Apple Mac Studio, M2 Ultra (192 GB unified memory), MLX 0.31, macOS 15.4. Generation length 256 tokens, 1 warmup + 3 measured runs per configuration. Model: Qwen3.5-35B-A3B (35B MoE, 3B active parameters). Budget: 30 GB for all three modes. Throughput measured via mlx_lm.stream_generate() with Metal cache cleared between runs.
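The measurement protocol can be sketched as a minimal harness. The `generate` callable below is a placeholder for the real mlx_lm-based generation loop, which is not shown here:

```python
import statistics
import time

# Minimal sketch of the protocol above: 1 warmup run, 3 measured runs,
# median reported. `generate` stands in for the actual mlx_lm generation
# call; any callable that returns a token count works.

def benchmark_tps(generate, warmup=1, runs=3):
    for _ in range(warmup):
        generate()                  # discarded: warms caches, compiles kernels
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate()
        samples.append(tokens / (time.perf_counter() - start))
    return statistics.median(samples)

# Usage with a dummy workload standing in for real generation:
tps = benchmark_tps(lambda: 256)
```

The warmup run matters on Metal because the first dispatch of each pipeline pays shader-compilation and cache-population costs that would otherwise skew the first measured sample.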
The three modes differ only in which bit widths the proprietary optimisation allocator may use. The quality curves, quality safety threshold, protection priors, and budget constraint are identical. All three produce models within 0.4 GB of the 30 GB target.
Code: github.com/baa-ai/RAM — Pre-quantized models: huggingface.co/baa-ai
## Read the Full Paper
The complete RAM paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full optimal allocation framework, is available on our HuggingFace:
RAM: Compute-Optimal Proprietary Compression for LLMs — Full Paper
huggingface.co/spaces/baa-ai/RAM

Licensed under CC BY-NC-ND 4.0