
What RAM Actually Delivers: Evidence from Four Models and 20,000 Tensors

March 2026 · Black Sheep AI Research

We ran RAM across four production models (two dense, two Mixture-of-Experts) totalling over 20,000 weight tensors. We measured perplexity, ran academic benchmarks, uncovered evaluation artifacts, and X-rayed architectures whose training code none of us had access to. Here’s what proprietary compression actually delivers, with no hedging.

The Core Claim, Tested

RAM’s promise is simple: analyse a model’s weight statistics, assign each tensor the minimum bit width it can tolerate, and compress the model without needing a single sample of calibration data.

The evaluation spanned four architectures with very different characteristics:

| Model | Type | Total Params | Tensors | BF16 Size |
|---|---|---|---|---|
| Qwen3-8B | Dense | 8.19B | 399 | 15.3 GB |
| Qwen3-30B-A3B | MoE (128 experts) | 30.53B | 18,867 | 56.9 GB |
| GLM-4.7 | Dense | ~9B | ~400 | ~17 GB |
| GLM-4.7-Flash | MoE | ~31B | ~19,000 | 58.2 GB |

Every evaluation used the same protocol: WikiText-2 perplexity (2048 tokens, 256 samples, seed 42) on an Apple M2 Ultra with 192 GB unified memory. For Qwen3-8B, we added ARC-Challenge (25-shot) and HellaSwag (10-shot) benchmarks via lm-evaluation-harness.
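As a reference for the metric itself: perplexity is the exponential of the mean per-token negative log-likelihood. A minimal sketch (illustrative only, not our harness code):

```python
import math

def perplexity(nlls):
    # Perplexity is exp() of the mean per-token negative log-likelihood
    return math.exp(sum(nlls) / len(nlls))

# Toy sequence: 2048 tokens whose average NLL is ln(9.727) reproduces
# the BF16 Qwen3-8B score reported below.
nlls = [math.log(9.727)] * 2048
print(round(perplexity(nlls), 3))  # 9.727
```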

Value #1: Better Compression at Every Scale

The baseline comparison is uniform 4-bit quantization: every weight tensor gets the same bit width. This is what you get from mlx_lm.convert --quant or any standard quantization tool. RAM consistently beats it.
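To make the baseline concrete, here is a pure-Python sketch of uniform group quantization. The group size and rounding scheme are illustrative; real tools such as mlx_lm operate on arrays with hardware-friendly layouts.

```python
def quantize_uniform(weights, bits=4, group_size=32):
    """Round-trip uniform quantization: each group stores a scale and
    offset plus one `bits`-bit integer code per weight."""
    levels = 2 ** bits - 1
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # guard against constant groups
        # encode to an integer code, then decode back to float
        out.extend(lo + round((w - lo) / scale) * scale for w in group)
    return out

w = [0.1 * i for i in range(64)]
wq = quantize_uniform(w)
max_err = max(abs(a - b) for a, b in zip(w, wq))
# round-trip error is bounded by half a quantization step
```

Every tensor gets the same `bits` regardless of how sensitive it is, which is exactly the assumption RAM drops.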

| Model | Method | Size | PPL | ΔPPL vs BF16 | Compression |
|---|---|---|---|---|---|
| **Qwen3-8B (Dense)** | | | | | |
| Qwen3-8B | BF16 | 15.3 GB | 9.727 | - | 1.0× |
| Qwen3-8B | Uniform 4-bit | 4.1 GB | 10.250 | +5.4% | 3.77× |
| Qwen3-8B | RAM v3 | 6.1 GB | 10.097 | +3.8% | 2.52× |
| **Qwen3-30B-A3B (MoE)** | | | | | |
| Qwen3-30B-A3B | BF16 | 56.9 GB | 8.728 | - | 1.0× |
| Qwen3-30B-A3B | Uniform 4-bit | 15.1 GB | 9.629 | +10.3% | 3.76× |
| Qwen3-30B-A3B | RAM v3 | 16.2 GB | 9.041 | +3.6% | 3.51× |
| **GLM-4.7-Flash (MoE), median PPL** | | | | | |
| GLM-4.7-Flash | BF16 | 58.2 GB | 8.706 | - | 1.0× |
| GLM-4.7-Flash | RAM v3 | 15.9 GB | 9.084 | +4.3% | 3.66× |

The headline numbers: RAM cuts perplexity degradation by roughly 30% on the dense model (5.4% → 3.8%) and by 65% on the MoE model (10.3% → 3.6%). The MoE result is especially striking: uniform 4-bit loses over 10% of quality, while RAM holds the line at under 4% with only marginally less compression.

This isn’t magic. MoE models have 128 experts per layer, each with different weight characteristics, and treating every expert identically is wasteful. RAM identifies which experts are sensitive and allocates bits accordingly: some get 2-bit, many stay at 4-bit, and critical ones get 8-bit or 16-bit.
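The allocation step can be sketched as a simple mapping from a per-tensor sensitivity score to a bit width. The thresholds below are invented for illustration; RAM's actual scoring and cutoffs are proprietary.

```python
def allocate_bits(sensitivities, thresholds=(0.25, 0.6, 0.9)):
    """Map a normalized sensitivity score in [0, 1] to a bit width:
    robust tensors get 2 bits, typical ones 4, sensitive ones 8, and
    the most critical stay at 16. Thresholds here are illustrative."""
    t_low, t_mid, t_high = thresholds
    bits = []
    for s in sensitivities:
        if s < t_low:
            bits.append(2)
        elif s < t_mid:
            bits.append(4)
        elif s < t_high:
            bits.append(8)
        else:
            bits.append(16)
    return bits

scores = [0.05, 0.3, 0.5, 0.7, 0.95]
print(allocate_bits(scores))  # [2, 4, 4, 8, 16]
```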

Value #2: Benchmark-Validated, Not Just Perplexity

Perplexity is necessary but not sufficient. A model could score well on next-token prediction while failing at actual reasoning tasks. So we ran two academic benchmarks on Qwen3-8B, ARC-Challenge (science reasoning, 25-shot) and HellaSwag (commonsense inference, 10-shot), across all three conditions.

| Benchmark | BF16 | RAM v3 | Δ | Uniform 4-bit | Δ |
|---|---|---|---|---|---|
| ARC-Challenge (acc_norm) | 44.62% | 43.43% | -1.19pp | 42.83% | -1.79pp |
| HellaSwag (acc_norm) | 60.04% | 58.16% | -1.88pp | 58.14% | -1.90pp |

On ARC-Challenge, RAM's accuracy loss is only two-thirds of uniform 4-bit's (1.19pp vs 1.79pp). The capability that matters most here, normalised accuracy on genuinely difficult science questions, is better preserved by RAM's sensitivity-aware allocation.

HellaSwag shows near-identical results between RAM and uniform 4-bit (-1.88pp vs -1.90pp). This makes sense: commonsense inference is distributed broadly across the model, so there’s less to gain from selective bit allocation. But RAM still matches uniform 4-bit performance while using 50% more storage (6.1 GB vs 4.1 GB), meaning the extra bits are being allocated to preserve the capabilities that do benefit from higher precision.

The cross-validation between perplexity and benchmarks matters. RAM's lower PPL degradation relative to uniform 4-bit (3.8% vs 5.4%) translates into a measurable accuracy improvement on reasoning tasks. The metrics agree; this isn't an evaluation artifact.

Value #3: No Data Required

This is the property that changes everything for production deployment.

Most competitive quantization methods, GPTQ, AWQ, SqueezeLLM, require calibration data: you feed representative samples through the model to measure activation patterns, then optimise quantization parameters against those observations. That dependency creates problems: the calibration set has to be curated, the resulting quality is coupled to how representative that set is, and in privacy-sensitive or regulated deployments the data may not be available at all.

RAM analyses only the weight tensors themselves: a proprietary sensitivity analysis, measured directly from the weights, with no forward pass required. The entire analysis runs in 3 minutes for an 8B model and 30 minutes for a 30B MoE model.

This means RAM can slot into an automated model registry. A new model checkpoint lands → RAM analyses it → optimal bit allocation is determined → the quantised model is deployed. No human in the loop. No calibration data to curate.

Value #4: Models as Diagnostic Subjects

This was the unexpected discovery. RAM's per-tensor sensitivity analysis doesn't just tell you how to quantise a model; it reveals how the model was built.

Consider the sensitivity score distributions across our test models:

| Model | Sensitivity Span | Max 2-bit | Max 16-bit | Interpretation |
|---|---|---|---|---|
| Qwen3-8B | 1.173 | 16.8% | 20.3% | Diverse, clear sensitive/robust layers |
| Qwen3-30B-A3B | 1.378 | 41.3% | 7.3% | Highly diverse, many robust experts |
| GLM-4.7-Flash | 0.073 | 0.0% | 3.3% | Homogeneous, all tensors look the same |

The Qwen models show a sensitivity span of 1.17–1.38: there are clearly robust tensors that tolerate aggressive compression and sensitive tensors that need protection. This is exactly the variance RAM exploits.
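A minimal sketch of what the span captures, assuming (as the table suggests) it is the spread between the most and least sensitive tensors; the score values below are invented:

```python
def sensitivity_span(scores):
    # Span as the spread of per-tensor sensitivity scores (max - min);
    # a wide span means mixed precision has leverage, a narrow one
    # means every tensor looks alike to the allocator.
    return max(scores) - min(scores)

diverse = [0.1, 0.4, 0.9, 1.2]      # Qwen-like profile
homogeneous = [0.50, 0.52, 0.55]    # GLM-like profile
print(round(sensitivity_span(diverse), 2))      # 1.1
print(round(sensitivity_span(homogeneous), 2))  # 0.05
```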

GLM-4.7-Flash tells a radically different story. Its sensitivity span is 0.073, roughly 18× more homogeneous than Qwen. Every tensor looks nearly identical to RAM's analysis, which implies both that the model was heavily regularised during training and that mixed-precision allocation has little leverage over it.

RAM became a model X-ray. Without access to training code, training data, or any insider knowledge, the sensitivity profile alone told us that GLM was trained differently from Qwen, and predicted the exact class of evaluation artifacts we later observed.

Value #5: Discovering Evaluation Blind Spots

RAM’s evaluation on GLM-4.7-Flash produced a result that should have been a headline: the quantized model reported 12.5% lower perplexity than the BF16 baseline. A model compressed to one quarter of its size, scoring better than full precision.

We spent a week investigating. The finding: 5 out of 256 WikiText-2 test sequences produce catastrophic perplexity (25,000–81,000) in the full-precision model. Quantization noise accidentally stabilises these pathological sequences. Standard mean perplexity, dominated by these outliers, makes the quantised model look better.

The honest number, median perplexity, shows RAM is 4.3% worse, exactly as expected from lossy compression.

This isn’t just a footnote. Perplexity is the most-reported metric in quantization research. If standard mean PPL can be dominated by 2% of evaluation sequences, then published results across the field may be unreliable. The same outlier dynamics that make quantization hard (heavy-tailed weight distributions) also make evaluating quantization unreliable (heavy-tailed sequence distributions).
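The outlier effect is easy to reproduce with synthetic numbers shaped like the 5-in-256 case above (the per-sequence values are illustrative):

```python
import statistics

# 251 well-behaved sequences plus 5 pathological ones, mirroring the
# WikiText-2 outliers described above (values are illustrative)
bf16 = [8.7] * 251 + [25_000, 40_000, 55_000, 70_000, 81_000]
quant = [9.1] * 251 + [30.0] * 5  # noise "stabilises" the outliers

# Mean is dominated by the 5 outliers: quantized looks *better*
print(statistics.mean(bf16) > statistics.mean(quant))      # True
# Median ignores them: quantized is slightly worse, as expected
print(statistics.median(quant) > statistics.median(bf16))  # True
```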

RAM’s contribution here wasn’t the quantization itself; it was the rigour of the evaluation process. By investigating an anomalous result rather than celebrating it, we identified a systemic weakness in how the entire field measures compression quality.

Value #6: Purpose-Built for the MoE Era

The industry is moving to Mixture-of-Experts. Qwen3, DeepSeek-V3, Mixtral, DBRX, GLM-4: the largest and most capable open models increasingly use sparse expert architectures. And MoE is where RAM’s value proposition is strongest.

Here’s why. A 30B MoE model with 128 experts per layer has enormous internal diversity. Some experts activate frequently and encode critical knowledge. Others activate rarely and handle niche patterns. Uniform 4-bit treats them identically, and loses 10.3% on Qwen3-30B-A3B.

RAM’s per-tensor analysis identifies the critical experts automatically. On Qwen3-30B-A3B, it sends roughly 41% of tensors down to 2-bit, keeps about 7% at 16-bit, and places the rest at 4-bit or 8-bit, per the sensitivity distribution above.

The result: 65% less quality degradation (3.6% vs 10.3%) with only 7% less compression (3.51× vs 3.76×). In efficiency terms, RAM v3 on MoE achieves 0.82% degradation per unit of compression, the best ratio across all conditions tested.

As models get larger and more expert-heavy, this advantage compounds. RAM doesn’t just compress MoE models, it understands their internal structure.

Where RAM Falls Short

Honest evaluation means reporting limitations. We found three.

1. Highly regularised models neutralise it. GLM-4.7-Flash’s extreme weight homogeneity (sensitivity span 0.073) means RAM can’t differentiate tensors effectively. If every tensor has nearly identical statistics, mixed-precision allocation has no leverage. RAM doesn’t hurt; it just defaults to something close to uniform. But it can’t help either.

2. Dense models see modest gains. On Qwen3-8B, RAM reduces PPL degradation from 5.4% to 3.8%, a real improvement, but with 50% more storage. The efficiency ratio is less compelling than on MoE. For dense models, the value is real but incremental.

3. Adaptive normalization can backfire. RAM v2’s adaptive normalization amplifies tiny differences in narrow metric ranges, causing unnecessary bit upgrades on dense models. On Qwen3-8B, v2 actually performed worse than v1 (6.3% vs 4.1% degradation). The v3 hybrid approach addresses this, but the sensitivity to normalization strategy is a design constraint that requires continued attention.

The Composite Value

RAM isn’t one thing. The evaluation revealed five distinct capabilities delivered by the same analysis pipeline:

| Capability | Evidence | Who Benefits |
|---|---|---|
| Data-free compression | 3–4% PPL degradation at 2.5–3.7× compression, no calibration data needed | Regulated industries, privacy-sensitive deployments |
| MoE-optimised allocation | 65% less quality loss vs uniform 4-bit on Qwen3-30B-A3B | Anyone deploying MoE models on constrained hardware |
| Model diagnostics | Predicted GLM's calibration fragility from weight statistics alone | Model developers, quality assurance teams |
| Evaluation methodology | Discovered perplexity anomaly affecting published benchmarks | The entire quantization research community |
| Pipeline automation | 3–30 min end-to-end, no human-in-the-loop steps | MLOps teams running model registries |

Most quantization tools do exactly one thing: compress a model. RAM does that, but the analysis it performs along the way turns out to be at least as valuable as the compression itself.

What This Means in Practice

A 30B MoE model in BF16 requires 57 GB of memory. Most edge devices (Mac laptops, workstations, embedded systems) cannot load it. At 16.2 GB after RAM compression, it fits comfortably on a 32 GB M-series Mac with room for KV cache and inference overhead. The quality cost: 3.6% higher perplexity.
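The arithmetic behind those figures is straightforward (parameter count from the table above; a GiB convention is assumed):

```python
def model_size_gib(params, bits):
    # Raw weight storage: params * bits / 8 bytes, expressed in GiB
    return params * bits / 8 / 2**30

PARAMS_30B = 30.53e9
print(round(model_size_gib(PARAMS_30B, 16), 1))  # 56.9, the BF16 figure

# Working backwards, the 16.2 GB RAM build implies an average of
# roughly 4.6 bits per weight across the mixed-precision allocation.
avg_bits = 16.2 * 2**30 * 8 / PARAMS_30B
print(round(avg_bits, 1))  # 4.6
```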

For the Qwen3-8B dense model, RAM's 6.1 GB build runs on any modern laptop with 8 GB of memory, retaining 97.3% of ARC-Challenge accuracy and 96.9% of HellaSwag accuracy compared to the 15.3 GB BF16 version, which wouldn't fit on such a machine at all.

The models don’t just fit; they run faster. Smaller models mean less memory-bandwidth pressure, the actual bottleneck on Apple Silicon and most inference hardware. In our benchmarks, the RAM-quantised Qwen3-8B completed ARC-Challenge evaluation in 1,060 seconds vs 1,199 seconds for BF16: 11.6% less wall-clock time from reduced model size alone.

The Bottom Line

RAM delivers what the quantization field has been missing: intelligent, automated, data-free compression that understands model architecture.

On MoE models, the architecture the industry is converging on, RAM achieves near-lossless compression at 3.5×. On dense models, it provides meaningful improvements over uniform quantization with complete transparency about the trade-offs. And the analysis it performs along the way reveals things about models that no other tool in the ecosystem can show you.

The results are real. Four models, 20,000 tensors, cross-validated with both perplexity and academic benchmarks, with every anomaly investigated rather than swept under the rug.

RAM is open source. All evaluation data, per-tensor manifests, and analysis tools are available at github.com/baa-ai/swan-quantization. Hardware: Apple M2 Ultra, 192 GB unified memory. Software: MLX 0.30.3, Python 3.12.

Read the Full Paper

The complete RAM paper, including formal derivations of the proprietary compression framework, evaluation across four models and 20,000+ tensors, and deployment methodology, is available on our HuggingFace:

RAM: Proprietary Compression via Proprietary Compression, Full Paper

huggingface.co/spaces/baa-ai/swan-paper

Licensed under CC BY-NC-ND 4.0

