The Compression Variable Everyone Ignored
MINT Research

March 2026 · baa.ai

84% of tensors want group size 32, not 128. Per-tensor group-size selection provides larger quality gains than bit-width changes—and the industry has been treating it as a fixed default.

When practitioners think about quantization, they think about bit-width. Should this model be 4-bit or 8-bit? How low can we go before quality degrades? The entire quantization discourse—papers, blog posts, leaderboards—is framed around bits per weight.

But there is another variable hiding in plain sight: group size—the number of weights that share a single scale factor. The industry standardized on group size 128 years ago and never looked back. Our MINT research shows this is a significant mistake.

What Is Group Size?

Quantization works by mapping continuous floating-point weights to a small set of discrete values. To preserve accuracy, weights are divided into groups, and each group gets its own scale factor and zero-point. These per-group parameters allow the quantized representation to adapt to local weight distributions rather than forcing a single scale across an entire tensor.

Group size 128 means every 128 consecutive weights share one scale factor and one zero-point. Smaller groups—64, or 32—mean more fine-grained scaling. Each group can more closely track its local weight distribution, reducing quantization error. The tradeoff is storage: smaller groups require more scale/zero-point pairs, adding overhead to the compressed model.
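To make this concrete, here is a minimal sketch of asymmetric uniform group quantization in NumPy. The function names and the Laplace-distributed test weights are illustrative, not MINT's implementation; real weight tensors are heavy-tailed in a similar way.

```python
import numpy as np

def quantize_groups(w, bits=4, group_size=32):
    """Each group of `group_size` consecutive weights gets its own
    scale and zero-point (asymmetric uniform quantization)."""
    qmax = 2 ** bits - 1
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(g / scale + zero_point), 0, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize_groups(q, scale, zero_point):
    return ((q.astype(np.float32) - zero_point) * scale).reshape(-1)

# Finer groups track local structure more closely, so error drops.
rng = np.random.default_rng(0)
w = rng.laplace(size=4096).astype(np.float32)  # heavy-tailed, like real weights
for gs in (128, 64, 32):
    w_hat = dequantize_groups(*quantize_groups(w, bits=4, group_size=gs))
    rms = np.sqrt(np.mean((w - w_hat) ** 2))
    print(f"group size {gs:>3}: RMS error {rms:.4f}")
```

Because each group of 32 spans a narrower value range than the group of 128 that contains it, its scale is never larger, and the per-element quantization error bound shrinks accordingly.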

The conventional wisdom has been that this overhead is not worth the quality gain. Our data says otherwise.

The Evidence

On Qwen3-30B-A3B at a 19 GB memory budget, MINT’s rate-distortion allocator was given full freedom to choose any combination of bit-width and group size for each of the model’s 18,867 weight tensors. Here is what it chose:

Configuration (bits, group)   Tensors   Fraction
(4, 32)                        15,908     84.3%
(4, 64)                             9     <0.1%
(4, 128)                           96      0.5%
(8, 128)                        2,612     13.8%
(8, 64)                             1     <0.1%
(16, 0)                           241      1.3%

The allocator overwhelmingly chose 4-bit with group size 32. Not 4-bit with group size 128. Not 8-bit. The smallest available group size won 84.3% of all allocation decisions.

When given the freedom to choose, the optimizer picked finer groups over more bits 84% of the time.
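The allocation itself is a classic budget-constrained rate-distortion problem. The following is a hypothetical sketch, not the MINT code: each tensor offers a menu of (bits, group, size, distortion) options, and a bisection on the Lagrange multiplier finds per-tensor choices that fit the byte budget.

```python
def allocate(menus, budget_bytes):
    """menus: one list per tensor of (bits, group, size_bytes, distortion)
    tuples. Returns one chosen config per tensor within the budget."""
    def pick(lam):
        # Each tensor independently minimizes distortion + lam * size.
        choice = [min(m, key=lambda c: c[3] + lam * c[2]) for m in menus]
        return choice, sum(c[2] for c in choice)

    if sum(min(c[2] for c in m) for m in menus) > budget_bytes:
        raise ValueError("budget infeasible even at the smallest configs")
    lo, hi = 0.0, 1.0
    while pick(hi)[1] > budget_bytes:   # grow lambda until feasible
        hi *= 2.0
    for _ in range(60):                 # bisect to the budget boundary
        mid = (lo + hi) / 2.0
        if pick(mid)[1] > budget_bytes:
            lo = mid
        else:
            hi = mid
    return pick(hi)[0]

# Two toy tensors with made-up sizes and distortions.
menus = [
    [(4, 32, 110, 1.0), (4, 128, 100, 3.0), (8, 128, 200, 0.5)],
    [(4, 32, 110, 2.0), (8, 128, 200, 0.2)],
]
print(allocate(menus, budget_bytes=500))  # loose budget: both take 8-bit
print(allocate(menus, budget_bytes=250))  # tight budget: 4-bit configs win
```

When the budget tightens, the multiplier rises and the allocator trades precision for size, exactly the regime where group size 32 earns its overhead.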

Why Group Size 32 Wins

The arithmetic explains the result. Moving from group size 128 to group size 32 means storing 4× as many scale/zero-point pairs. With an FP16 scale and zero-point (4 bytes per group), 4-bit quantization at group size 32 carries about 0.125 bytes per parameter of metadata, up from roughly 0.031 at group size 128: an incremental cost of about 0.094 bytes per parameter.

Now compare that to upgrading a tensor from 4-bit to 8-bit, which costs 0.5 bytes per parameter: four times the entire group-32 metadata budget, and more than five times the incremental cost of the finer groups. Yet the quality gain from group size 32 is consistently larger than what those extra bits provide. You get more quality per byte from finer groups than from more bits.
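The arithmetic is easy to reproduce. This assumes, as above, an FP16 scale plus an FP16 zero-point, i.e. 4 metadata bytes per group:

```python
def bytes_per_param(bits, group_size, meta_bytes_per_group=4.0):
    """Payload bits plus per-group scale/zero-point metadata, amortized."""
    return bits / 8 + meta_bytes_per_group / group_size

g32, g128 = bytes_per_param(4, 32), bytes_per_param(4, 128)
b8 = bytes_per_param(8, 128)

print(f"4-bit g=32 : {g32:.5f} B/param")      # 0.62500
print(f"4-bit g=128: {g128:.5f} B/param")     # 0.53125
print(f"finer groups add   {g32 - g128:.5f}")  # 0.09375
print(f"8-bit upgrade adds {b8 - g128:.5f}")   # 0.50000
```
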

This makes intuitive sense. Weight distributions within a tensor are not uniform. They have local structure—clusters, outliers, regions of varying magnitude. A group size of 128 forces a single scale to cover a wide range of local distributions. Group size 32 allows the quantization to adapt to local structure four times more precisely, capturing variations that a coarser group would smooth over or distort.

Why Did 128 Become the Default?

Computational convenience and legacy. Early quantization frameworks needed a reasonable default, and 128 was chosen as a balance point—small enough to improve over per-tensor quantization, large enough to keep the overhead minimal and the implementation simple.

It became the default in GPTQ. Then llama.cpp. Then AWQ. Then MLX. Each framework inherited the convention from its predecessors. Nobody seriously questioned it because bit-width was the variable everyone optimized. Group size was treated as a constant, not a variable.

This is a classic case of premature standardization. A reasonable initial choice calcified into an unquestioned default, and the entire field optimized around it rather than through it.

What This Means for Deployed Models

Every quantized model running today with group size 128 is leaving quality on the table. The magnitude of the loss depends on the model and the bit-width, but the direction is consistent: finer group sizes improve quality at a cost that is smaller than the equivalent bit-width upgrade.

The fix is not complicated. Frameworks need to support variable group sizes per tensor—allowing different weight matrices in the same model to use different group sizes based on their individual sensitivity profiles. The format changes are modest. GGUF and safetensors can already store per-tensor metadata including group size. The real shift is conceptual: stop thinking about bit-width alone and start thinking about the joint configuration space of (bits, group size) pairs.

For practitioners deploying models today, the immediate takeaway is simple: if your framework supports group size 32, use it. The overhead is small, and the quality improvement is real and measurable.

Implications for Framework Developers

llama.cpp, vLLM, TensorRT-LLM, and MLX all need per-tensor variable group size support. Today, most of these frameworks apply a single group size across all tensors in a model. This is the equivalent of applying a single bit-width to every layer—something the field moved past years ago with mixed-precision quantization.

The storage formats are already capable. GGUF supports per-tensor block sizes. Safetensors can store arbitrary metadata per tensor. The bottleneck is in the inference kernels, which need to handle mixed group sizes within the same model. This is an engineering challenge, not a research challenge—the kernels need to dispatch to different dequantization routines based on per-tensor configuration, rather than assuming a global group size.
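As a sketch of what that dispatch could look like (a hypothetical structure, not any framework's actual API), a kernel table keyed on each tensor's (bits, group) pair replaces the global assumption:

```python
import numpy as np

def dequant_uniform(q, scale, zp):
    # Generic uniform dequant; real kernels would specialize per layout.
    return (q.astype(np.float32) - zp) * scale

# One entry per supported (bits, group) configuration.
KERNELS = {
    (4, 32): dequant_uniform,
    (4, 128): dequant_uniform,
    (8, 128): dequant_uniform,
}

def dequantize_tensor(name, configs, payloads):
    bits, group = configs[name]          # per-tensor metadata lookup
    kernel = KERNELS[(bits, group)]      # dispatch on (bits, group)
    q, scale, zp = payloads[name]
    return kernel(q, scale, zp)

# Illustrative tensor names and payloads.
configs = {"mlp.down_proj": (4, 32), "attn.q_proj": (8, 128)}
payloads = {
    "mlp.down_proj": (np.array([[0, 15]], dtype=np.uint8),
                      np.array([[0.1]]), np.array([[8.0]])),
    "attn.q_proj": (np.array([[0, 255]], dtype=np.uint8),
                    np.array([[0.01]]), np.array([[128.0]])),
}
print(dequantize_tensor("mlp.down_proj", configs, payloads))
```

The lookup is trivial; the engineering work is in making each specialized kernel fast and keeping the mixed-configuration weight layout cache-friendly.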

The frameworks that implement variable group size support first will deliver measurably better quality at the same model sizes. In a field where fractions of a perplexity point matter, this is a competitive advantage waiting to be claimed.


Data from MINT allocation analysis on Qwen3-30B-A3B. Allocation decisions made by rate-distortion optimization across the full (bits, group size) configuration space. All figures reflect the optimal budget-constrained solution at 19 GB target size. The full MINT pipeline is open source at github.com/baa-ai/MINT.

Want to unlock the quality your quantized models are leaving behind?

Our team specialises in data-free model compression, budget-aware quantization, and production AI deployment on commodity hardware.

Talk to Our Team