When Quantizing FP8 Models Produces Garbage: The Block-Scale Trap


May 2026 · Black Sheep AI Research

If you load DeepSeek-V4 weights with safetensors or mlx.load() and quantize the result, the output model is broken: the load path upcasts FP8 (E4M3) to BF16 without applying the per-block scale metadata. The same trap awaits any FP8-source model. Here's the diagnosis and the fix.

The symptom

You convert a public DeepSeek-V4 checkpoint to a mixed-precision MLX or GGUF build using a pipeline that handles other model families just fine. The conversion completes without errors. You load the resulting model and run any prompt through it.

The output is garbage: token gibberish, broken syntax, no coherence. This is not "low-quality output"; it is actually broken at the byte level.

This is not a quantization issue. The conversion never had the right weights to start with.

What's actually happening

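DeepSeek-V4 (and a growing number of recently published large MoE checkpoints) ships in FP8 (E4M3) format with per-block scale metadata stored alongside each tensor. Each weight tensor carries two pieces: the FP8 (E4M3) payload itself, and a companion per-block scale tensor stored under a name like weight_scale_inv.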

To recover the underlying floating-point values, the consumer must do:

weight_bf16 = fp8_decode(weight_fp8) × scale_per_block_broadcast

The catch: safetensors.safe_open() returns the FP8 tensor without applying the scale, and mx.load() does the same. Both libraries treat the scale tensor as just another tensor with a name like weight_scale_inv. If your conversion code reads weight and silently upcasts it to BF16 without multiplying by the scale, you get values that are ~3-4 orders of magnitude off from the true weights. Your quantizer then fits its quantization parameters to noise, and the output model produces noise.
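
To make the formula concrete, here is a minimal pure-Python sketch of both halves: decoding one E4M3 byte, then multiplying by the scale of the block it belongs to. It assumes the "fn" variant of E4M3 (exponent bias 7, no infinities, 0x7F and 0xFF as NaN), which is the one PyTorch exposes as torch.float8_e4m3fn. The function names and the tiny 2x2 block shape are illustrative only; the real block shape is whatever the checkpoint's publisher specifies.

```python
import math

def decode_e4m3fn(byte: int) -> float:
    """Decode one FP8 E4M3 (fn variant) byte: 1 sign, 4 exponent, 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # the only NaN pattern; this variant has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)

def dequantize_block_scaled(raw, scale_inv, block_rows, block_cols):
    """raw: 2-D list of bytes; scale_inv: 2-D list holding one scale per block."""
    return [
        [decode_e4m3fn(b) * scale_inv[r // block_rows][c // block_cols]
         for c, b in enumerate(row)]
        for r, row in enumerate(raw)
    ]

# Toy 2x4 tensor split into two 2x2 blocks (assumed shape, for illustration).
raw = [[0x7E, 0x38, 0x38, 0xB8],
       [0x30, 0x00, 0x40, 0x38]]
scale_inv = [[0.001, 0.5]]  # one scale per block

unscaled = [[decode_e4m3fn(b) for b in row] for row in raw]  # what the trap gives you
scaled = dequantize_block_scaled(raw, scale_inv, block_rows=2, block_cols=2)
print(unscaled[0])
print(scaled[0])
```

The first block's largest raw value decodes to 448, the E4M3 maximum, while its true value after scaling is about 0.448. Skipping the multiplication leaves every element in that block off by the same factor of 1000.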

How to diagnose

Three signals that tell you the FP8 trap is what bit you:

  1. The source model has files named *_scale_inv.safetensors or contains tensors with scale in the name. That's the metadata your conversion is dropping.
  2. The converted model loads without errors but produces gibberish on every prompt, including prompts as simple as "Hello". A real quantization quality issue degrades gradually; this is binary failure.
  3. A numerical comparison of the converted weights against correctly dequantized source weights shows a scaling error of roughly three orders of magnitude, not the small SQNR drift you'd expect from quantization rounding.

Run this in 30 seconds against any FP8 source model:

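from safetensors.torch import load_file

state = load_file("model-00001-of-XX.safetensors")
scale_keys = [k for k in state if k.endswith("_scale_inv")]
# A weight is FP8 if it has a companion <name>_scale_inv tensor in the same shard.
fp8_keys = [k.removesuffix("_scale_inv") for k in scale_keys]
fp8_keys = [k for k in fp8_keys if k in state]
print(f"tensors: {len(state)}  scales: {len(scale_keys)}  paired weights: {len(fp8_keys)}")
if fp8_keys:  # guard: a shard may hold no FP8 tensors at all
    name = fp8_keys[0]
    print(f"weight dtype: {state[name].dtype}")                # should be uint8 / float8_e4m3fn
    print(f"scale dtype:  {state[name + '_scale_inv'].dtype}")  # usually bfloat16 or float32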

If you see a non-zero count of _scale_inv keys, your conversion needs FP8 dequantization. If you see them and your conversion code never reads them, you have the bug.
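
The third signal, a scaling error of several orders of magnitude, can also be checked numerically once you have a correctly dequantized reference for at least one tensor. The following is a minimal sketch in plain Python; the helper names and the one-order-of-magnitude threshold are assumptions chosen for illustration, not part of any library.

```python
import math

def magnitude_gap(reference, candidate):
    """Return log10(mean|candidate| / mean|reference|) for two equal-length sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    ref_mean = sum(abs(x) for x in reference) / len(reference)
    cand_mean = sum(abs(x) for x in candidate) / len(candidate)
    if ref_mean == 0.0 or cand_mean == 0.0:
        raise ValueError("all-zero tensor; magnitude gap is undefined")
    return math.log10(cand_mean / ref_mean)

def looks_like_missing_scale(reference, candidate, threshold=1.0):
    """True when overall magnitudes differ by more than 10**threshold (assumed cutoff)."""
    return abs(magnitude_gap(reference, candidate)) > threshold

true_weights = [0.012, -0.034, 0.008, -0.021]
rounded = [0.0117, -0.0352, 0.0078, -0.0215]   # ordinary quantization drift
unscaled = [w * 1000 for w in true_weights]    # scale never applied

print(looks_like_missing_scale(true_weights, rounded))
print(looks_like_missing_scale(true_weights, unscaled))
```

Rounding drift moves the gap by a small fraction of a decade, while a dropped scale moves it by whole decades, so the two cases are easy to tell apart even with a crude threshold.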

The fix in pseudocode

for shard in source_shards:
    open shard with safetensors + torch (NOT mlx.load, mlx silently upcasts)
    for tensor_name, weight_fp8 in shard:
        if tensor_name in scale_inv_keys:
            continue  # we'll consume this paired with weights
        scale = shard.get(f"{tensor_name}_scale_inv")  # if exists
        if scale is not None:
            weight_bf16 = dequantize_e4m3(weight_fp8) * scale_broadcast
        else:
            weight_bf16 = weight_fp8.to(bfloat16)  # already-bf16 tensor, no scale
        # Now feed weight_bf16 into your quantization pipeline as if it were a regular bf16 source
        ...
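
The pseudocode above translates almost line for line into Python. The sketch below is deliberately framework-agnostic: shard is any mapping from tensor name to tensor, and dequantize is whatever routine applies the block scale in your stack. Both are placeholders, as is the assumption that a weight and its _scale_inv companion live in the same shard.

```python
SCALE_SUFFIX = "_scale_inv"  # naming convention described in this article

def iter_dequantized(shard, dequantize, passthrough=lambda t: t):
    """Yield (name, tensor) pairs that are ready for the quantization step.

    shard:       mapping of tensor name -> tensor for one shard
    dequantize:  callable(weight, scale) -> higher-precision tensor
    passthrough: callable(tensor) for tensors that have no scale companion
    """
    for name, tensor in shard.items():
        if name.endswith(SCALE_SUFFIX):
            continue  # consumed together with its weight below
        scale = shard.get(name + SCALE_SUFFIX)
        if scale is not None:
            yield name, dequantize(tensor, scale)
        else:
            yield name, passthrough(tensor)

# Toy usage with plain floats standing in for tensors.
toy_shard = {
    "layers.0.mlp.weight": 448.0,            # decoded FP8 value, scale not yet applied
    "layers.0.mlp.weight_scale_inv": 0.001,
    "layers.0.norm.weight": 1.0,             # stored without a companion scale
}
result = dict(iter_dequantized(toy_shard, dequantize=lambda w, s: w * s))
print(result)
```

Scale tensors never reach the output, so the downstream quantizer sees only weights. If a publisher splits a weight and its scale across shards, the lookup has to go through the checkpoint's index file instead of a single shard's keys.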

The dequantize_e4m3 step is supported natively in recent PyTorch releases (the float8 dtypes arrived in PyTorch 2.1):

torch.float8_e4m3fn  # the dtype
weight_bf16 = weight_fp8.to(torch.bfloat16)  # native PyTorch dequant; respects E4M3 layout

But the scale multiplication is your responsibility: neither safetensors, mlx_lm.convert, nor mlx.load does it for you.

Why MLX silently upcasts wrong

mx.load() reads the safetensors file and upcasts byte-tensors with FP8 dtype to BF16 by interpreting the bytes as E4M3 directly. That's the unscaled dequantization. The library has no way to know which tensor is the scale and which is the weight, because the safetensors metadata doesn't carry that semantic relationship. The relationship is a convention of the model publisher (in DeepSeek-V4's case, weight_name → weight_name_scale_inv).

safetensors.torch.load_file() is more honest: it gives you the FP8 bytes as torch.uint8 if PyTorch is older, or torch.float8_e4m3fn if newer, and leaves the dequantization to you. But it also doesn't apply the scale.

Which model families are affected

The full converter we ship

For DeepSeek-V4-family, our open RAM pipeline ships an FP8-aware conversion script (convert_v4_fp8_model.py) that:

  1. Streams shards (no full model in memory)
  2. Detects *_scale_inv companion tensors automatically
  3. Applies per-block scale during dequantization to BF16
  4. Hands clean BF16 to the standard MLX quantize step
  5. Writes output in fixed-size shards (default 5 GB)

Peak memory: ~15 GB regardless of source size. Output: a byte-correct mixed-precision MLX model that actually generates coherent text.
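
The fixed-size output sharding in step 5 can be pictured as a simple greedy grouping. The sketch below is not taken from convert_v4_fp8_model.py; the function name, the in-order packing policy, and reading "5 GB" as 5 * 10**9 bytes are all assumptions made for illustration.

```python
def plan_output_shards(tensor_sizes, max_shard_bytes=5 * 10**9):
    """Greedily group tensors, in order, into shards no larger than max_shard_bytes.

    tensor_sizes: iterable of (name, size_in_bytes) pairs
    A single tensor larger than the limit gets a shard of its own.
    """
    shards, current, current_bytes = [], [], 0
    for name, size in tensor_sizes:
        # Flush the current shard when adding this tensor would exceed the limit.
        if current and current_bytes + size > max_shard_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        shards.append(current)
    return shards

# Tiny byte counts so the grouping is easy to follow.
sizes = [("a", 3), ("b", 3), ("c", 5), ("d", 1), ("e", 9)]
plan = plan_output_shards(sizes, max_shard_bytes=6)
print(plan)
```

Because each shard is flushed as soon as it fills, only one output shard's worth of tensors has to be resident at a time, which fits the streaming design described above.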

What to do if you're hit

  1. Don't trust quantized output until you've verified the source-load step.
  2. Add a one-line sanity check after dequantization: assert weight_bf16.abs().max() > 1e-2 and weight_bf16.abs().max() < 1e3. Real BF16 weights live in this range; FP8-without-scale lives at ~1e-3 to 1e-1.
  3. If your pipeline doesn't have an explicit scale step, it has a silent bug for FP8 sources. Add the step before quantizing.

Source: documented as Symptom 4 in our public RAM pipeline notes. Bug surfaced converting DeepSeek-V4 in March 2026; pattern generalises to any FP8 checkpoint.

Read more: A 16 GB Mac Mini Can Quantize a 250 GB Model, Streaming-Shard MLX Conversion: Trillion-Param MoE in 15 GB RAM.