Converting a hybrid SSM-attention model to GGUF cost us about 7 percentage points of MMLU-Pro accuracy relative to a native MLX build, even at the unquantized F16 ceiling. The loss comes from how llama.cpp's tensor-type taxonomy handles these architectures, not from quantization, and most teams don't know to measure it.
The result that surprised us
We measured a Qwen3.6-35B-A3B build at three points along the format axis, holding the underlying model and evaluation set constant (MMLU-Pro 140Q, seed 42, non-thinking sampling). The numbers:
| Build | Format | Size | MMLU-Pro 140Q |
|---|---|---|---|
| Reference quantized | MLX mixed-precision | 25.9 GB | 77.9% |
| Best GGUF we could produce | GGUF Path B + imatrix | 26.0 GB | 69.3% |
| GGUF ceiling (no quantization, format-only) | F16 GGUF | 65 GB | 70.7% |
Two findings stand out:
- F16 GGUF is already 7.2 pp below the MLX build (77.9% vs 70.7%). This is before any GGUF quantization: the conversion alone, taking a BF16 source model and writing it as F16 GGUF, costs about 7 pp on this model.
- Path B + imatrix recovers 98% of the GGUF ceiling (69.3 / 70.7 = 0.980). Within the GGUF format, our build is essentially as good as the format allows. The 7 pp gap is not something quantization can fix.
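The arithmetic behind these two findings can be reproduced from the table alone. The sketch below is illustrative: the dictionary keys are names chosen here, and the conversion from percentages to question counts assumes each score is simply correct answers divided by 140, which the post does not state explicitly.

```python
# Scores copied from the table above (MMLU-Pro 140Q, percent correct).
N_QUESTIONS = 140
scores = {
    "mlx_mixed": 77.9,
    "gguf_path_b_imatrix": 69.3,
    "gguf_f16": 70.7,
}

# Gap between the MLX reference and the unquantized GGUF build, in pp.
format_gap_pp = round(scores["mlx_mixed"] - scores["gguf_f16"], 1)

# Fraction of the GGUF ceiling retained by the quantized GGUF build.
ceiling_recovery = scores["gguf_path_b_imatrix"] / scores["gguf_f16"]

# Assumption: each percentage is (correct answers / 140), rounded to 0.1.
implied_correct = {k: round(v / 100 * N_QUESTIONS) for k, v in scores.items()}

print(f"format gap: {format_gap_pp} pp")
print(f"ceiling recovery: {ceiling_recovery:.3f}")
print(implied_correct)
```

Under that assumption the three builds correspond to 109, 97 and 99 correct answers respectively.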
Why is the format itself lossy?
Hybrid SSM-attention models like Qwen3.6, Mamba-Hybrid variants, and the linear-attention Llama variants don't fit cleanly into llama.cpp's GGUF tensor-type taxonomy. GGUF offers a fixed set of tensor types (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, F16), and its quantization recipes are optimised for transformer attention and MLP weights. The state-space modules (SSM transition matrices, gated linear convolutions) don't map well to those recipes. Crucially, llama.cpp's conversion script for hybrid models silently emits F16 for tensors it can't classify, and the conversion adds a few approximation steps in the math.
The 7 pp loss is the cumulative cost of:
- Numerical precision drift in the state-space dispatch (BF16 source → F16 GGUF representation)
- Convolution weight layout transformations during conversion
- Slight differences between how llama.cpp and the MLX inference path evaluate linear attention
- Saturation in the F16 representation for some narrow-distribution tensors
None of this is a quantization decision. F16 GGUF is unquantized and the loss is still there.
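One mechanism behind the BF16 → F16 items can be shown in isolation: BF16 has a wider exponent range than F16, so some values that BF16 stores survive the cast only as infinity or zero. The sketch below is a toy illustration, not a diagnosis of which Qwen3.6 tensors are affected. It assumes NumPy, emulates BF16 by truncating a float32 to its top 16 bits (real converters may round instead), and uses sample values picked purely for demonstration.

```python
import numpy as np

def to_bf16(x):
    """Emulate BF16 by keeping the top 16 bits of each float32 (truncation)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Illustrative values: one large, one tiny, one comfortably in range.
values = np.array([1.0e5, 1.0e-8, 0.1], dtype=np.float32)
bf16 = to_bf16(values)

# Cast the BF16-representable values to F16, silencing the overflow warning.
with np.errstate(over="ignore"):
    f16 = bf16.astype(np.float16)

print("F16 max finite value:", np.finfo(np.float16).max)
for b, h in zip(bf16, f16):
    print(f"bf16={float(b):.6g} -> f16={float(h):.6g}")
```

The large value overflows to infinity, the tiny one flushes to zero, and the in-range value converts exactly because F16 carries more mantissa bits than BF16.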
What this means in practice
If you're shipping a hybrid SSM-attention model to llama.cpp / Ollama / LM Studio, you're paying 7 pp on benchmarks like MMLU-Pro vs. what users on MLX get from the same source weights. There are three ways to handle this:
Option A: Ship MLX-only on Apple Silicon, GGUF as best-effort elsewhere.
This is what we recommend. MLX users get the full quality; GGUF users get a working but degraded build. Document the gap so users on llama.cpp don't compare your model to non-hybrid GGUF builds and conclude something is wrong.
Option B: Don't ship GGUF for hybrid architectures.
If your audience is overwhelmingly llama.cpp users, the 7 pp loss may be a dealbreaker. In practice this means dense-only model families on the GGUF channel.
Option C: Wait for llama.cpp to add native SSM tensor types.
This is the real fix and is in progress upstream as of mid-2026. When llama.cpp ships dedicated SSM dispatch (rather than F16 fall-through), the gap should close.
How to measure it on your own model
The recipe is straightforward. Just don't skip the F16 GGUF reference build, which is the step most teams omit:
- Convert your source model to F16 GGUF (no quantization). This is your format ceiling.
- Convert your source model with your favourite GGUF quantization recipe (Path B + imatrix is what we use).
- Run the same eval on the source MLX model, F16 GGUF, and quantized GGUF.
- The MLX→F16-GGUF gap is the format loss. The F16-GGUF→quantized-GGUF gap is the quantization loss. The two are separable.
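The first step, the F16 reference build, might be scripted roughly as follows. This is a sketch: it assumes a local llama.cpp checkout with the upstream `convert_hf_to_gguf.py` script and its `--outfile` and `--outtype` options, and the function name and both paths are hypothetical placeholders.

```python
import shlex

def f16_convert_command(model_dir, out_file, script="convert_hf_to_gguf.py"):
    """Build the command that writes an unquantized F16 GGUF from a model directory."""
    return ["python", script, model_dir, "--outfile", out_file, "--outtype", "f16"]

# Hypothetical paths; substitute your own checkout and model locations.
cmd = f16_convert_command("models/my-hybrid-model", "out/my-hybrid-model-f16.gguf")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```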
Most "is this quantization good?" comparisons silently elide the first gap by comparing quantized GGUF directly to the BF16 source. That charges the format loss to the quantization budget and makes the recipe look worse than it is.
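The separation can be written down as a small helper. This is a minimal sketch: `LossBreakdown` and `decompose` are names invented here, and the example call uses the three scores from the table earlier in this post, with the MLX build standing in as the reference.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LossBreakdown:
    format_loss_pp: float        # reference -> F16 GGUF
    quantization_loss_pp: float  # F16 GGUF -> quantized GGUF
    conflated_loss_pp: float     # reference -> quantized GGUF (the usual comparison)

def decompose(reference, f16_gguf, quantized_gguf):
    """Split the total accuracy drop (in pp) into format and quantization parts."""
    return LossBreakdown(
        format_loss_pp=round(reference - f16_gguf, 1),
        quantization_loss_pp=round(f16_gguf - quantized_gguf, 1),
        conflated_loss_pp=round(reference - quantized_gguf, 1),
    )

breakdown = decompose(reference=77.9, f16_gguf=70.7, quantized_gguf=69.3)
print(breakdown)
```

On these numbers the conflated comparison reports 8.6 pp, of which 7.2 pp is format loss and only 1.4 pp belongs to the quantization recipe.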
What to expect on dense models
For dense transformer models (Llama-3.X, Qwen3-dense, Gemma-4-dense), the MLX vs F16 GGUF gap is small (<1 pp typically). The format loss is mostly the SSM-specific issue. Dense models pay only the quantization-recipe cost, which Path B + imatrix handles well.
If your model has any of these architectural elements, expect the gap:
- State-space components (Mamba, Mamba-2, hybrid layers)
- Linear attention with kernel approximations (RWKV, RetNet-style, Qwen3.6-style linear_attn)
- Custom convolution gates
- Non-standard normalization positions (some signed-RMSNorm variants; see SmoothQuant Breaks on Signed RMSNorm)
The honest summary
GGUF is great for what it was built for: dense transformer attention/MLP weights at a small set of well-tuned bit-widths. Hybrid SSM-attention is outside its design centre. Ship native MLX on Apple Silicon, ship a GGUF with the format loss documented, and don't tune your quantization recipe trying to recover format loss it can't reach.
Source: measured 2026 on Qwen3.6-35B-A3B; full numbers in our public RAM pipeline documentation. The F16 GGUF reference build was produced with the upstream convert_hf_to_gguf.py; Path B + imatrix is the in-house quantization recipe.
Read more: A 16 GB Mac Mini Can Quantize a 250 GB Model, SmoothQuant Breaks on Signed RMSNorm: A Negative Result.