We ran RAM across four production models, two dense and two Mixture-of-Experts, covering over 20,000 weight tensors. We measured perplexity, ran academic benchmarks, discovered evaluation artifacts, and X-rayed model architectures none of us had the training code for. Here's what proprietary compression actually delivers. No hedging.
The Core Claim, Tested
RAM's promise is simple: analyse a model's weight statistics, assign each tensor the minimum bit width it can tolerate, and compress the model without needing a single sample of calibration data.
We tested it across four architectures with very different characteristics:
| Model | Type | Total Params | Tensors | BF16 Size |
|---|---|---|---|---|
| Qwen3-8B | Dense | 8.19B | 399 | 15.3 GB |
| Qwen3-30B-A3B | MoE (128 experts) | 30.53B | 18,867 | 56.9 GB |
| GLM-4.7 | Dense | ~9B | ~400 | ~17 GB |
| GLM-4.7-Flash | MoE | ~31B | ~19,000 | 58.2 GB |
Every evaluation used the same protocol: WikiText-2 perplexity (2048 tokens, 256 samples, seed 42) on an Apple M2 Ultra with 192 GB unified memory. For Qwen3-8B we added ARC-Challenge (25-shot) and HellaSwag (10-shot) via lm-evaluation-harness.
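For readers who want to reproduce the shape of that protocol, here is a minimal sketch. It is not the harness we ran: `sample_windows` and `sequence_perplexity` are hypothetical helper names, random window sampling is an assumption about how the 256 samples are drawn, and the per-token log-probabilities would come from whatever model runner you use.

```python
import math
import random

def sample_windows(num_tokens, seq_len=2048, num_samples=256, seed=42):
    """Pick reproducible start offsets for fixed-length evaluation windows."""
    max_start = num_tokens - seq_len
    if max_start < 0:
        raise ValueError("corpus is shorter than one window")
    rng = random.Random(seed)  # fixed seed: every model sees the same windows
    return [rng.randint(0, max_start) for _ in range(num_samples)]

def sequence_perplexity(token_logprobs):
    """Perplexity of one sequence: exp of the mean negative log-likelihood."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

starts = sample_windows(num_tokens=300_000)

# Sanity check: if every token has probability 1/8, perplexity is 8.
ppl = sequence_perplexity([math.log(1 / 8)] * 2048)
```

Fixing the seed matters more than the sampling scheme: BF16, uniform 4-bit, and RAM must all be scored on the same windows for the deltas to mean anything.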
Value #1: Better Compression at Every Scale
The baseline is uniform 4-bit quantization: every weight tensor gets the same bit width. That's what you get from mlx_lm.convert --quant or any standard quantization tool. On both Qwen models, RAM loses less perplexity than that baseline, at the cost of a somewhat larger file.
| Model | Method | Size | PPL | ΔPPL vs BF16 | Compression |
|---|---|---|---|---|---|
| Qwen3-8B (Dense) | | | | | |
| Qwen3-8B | BF16 | 15.3 GB | 9.727 | - | 1.0× |
| Qwen3-8B | Uniform 4-bit | 4.1 GB | 10.250 | +5.4% | 3.77× |
| Qwen3-8B | RAM v3 | 6.1 GB | 10.097 | +3.8% | 2.52× |
| Qwen3-30B-A3B (MoE) | | | | | |
| Qwen3-30B-A3B | BF16 | 56.9 GB | 8.728 | - | 1.0× |
| Qwen3-30B-A3B | Uniform 4-bit | 15.1 GB | 9.629 | +10.3% | 3.76× |
| Qwen3-30B-A3B | RAM v3 | 16.2 GB | 9.041 | +3.6% | 3.51× |
| GLM-4.7-Flash (MoE), median PPL | | | | | |
| GLM-4.7-Flash | BF16 | 58.2 GB | 8.706 | - | 1.0× |
| GLM-4.7-Flash | RAM v3 | 15.9 GB | 9.084 | +4.3% | 3.66× |
The headline numbers: RAM cuts perplexity degradation by roughly 30% on the dense model (5.4% down to 3.8%) and by 65% on the MoE model (10.3% down to 3.6%). The MoE result is especially dramatic: uniform 4-bit loses over 10% of quality while RAM holds the line at under 4%, with only marginally less compression (3.51× vs 3.76×). On the dense model the gain costs more, since RAM's file is 6.1 GB against 4.1 GB for uniform 4-bit.
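The percentages can be recomputed directly from the perplexity columns. The sketch below uses only the table values; the function names are ours and purely illustrative.

```python
def ppl_degradation(ppl, baseline):
    """Relative perplexity increase over the BF16 baseline, in percent."""
    return 100.0 * (ppl - baseline) / baseline

def relative_reduction(ram_deg, uniform_deg):
    """Share of uniform 4-bit's degradation that RAM removes, in percent."""
    return 100.0 * (uniform_deg - ram_deg) / uniform_deg

# (BF16 PPL, uniform 4-bit PPL, RAM v3 PPL) copied from the table above.
results = {
    "Qwen3-8B": (9.727, 10.250, 10.097),
    "Qwen3-30B-A3B": (8.728, 9.629, 9.041),
}

reductions = {}
for name, (bf16, uniform, ram) in results.items():
    u = ppl_degradation(uniform, bf16)
    r = ppl_degradation(ram, bf16)
    reductions[name] = relative_reduction(r, u)
    print(f"{name}: uniform +{u:.1f}%, RAM +{r:.1f}%, "
          f"reduction {reductions[name]:.1f}%")
```

Working from the unrounded perplexities gives slightly different reductions than working from the rounded percentages, which is why the dense figure is best read as "about 30%".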
There's no magic here. Qwen3-30B-A3B has 128 experts per layer, each with different weight characteristics. Treating every expert identically is wasteful. RAM spots which tensors are sensitive and allocates bits accordingly. Some get 2-bit, many stay at 4-bit, and the critical ones get 8-bit or 16-bit.
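To make the allocation step concrete, here is a toy version. Everything in it is hypothetical: RAM's actual sensitivity scoring and cut-offs are not described in this post, so the thresholds, the tensor names, and the scores below are made up to show the shape of the idea only.

```python
from collections import Counter

# Hypothetical cut-offs on a sensitivity score in [0, 1]:
# (exclusive upper bound, bit width). Not RAM's real thresholds.
THRESHOLDS = [(0.25, 2), (0.75, 4), (0.90, 8)]
DEFAULT_BITS = 16  # anything above the last cut-off stays at 16-bit

def allocate_bits(score):
    """Map a sensitivity score to the narrowest bit width it falls under."""
    for upper, bits in THRESHOLDS:
        if score < upper:
            return bits
    return DEFAULT_BITS

# Made-up scores for a handful of tensors, for illustration only.
scores = {
    "layers.0.experts.17.up_proj": 0.10,
    "layers.0.experts.42.up_proj": 0.55,
    "layers.0.self_attn.q_proj": 0.82,
    "embed_tokens": 0.97,
}

plan = {name: allocate_bits(score) for name, score in scores.items()}
histogram = Counter(plan.values())  # how many tensors land at each width
```

The point of the sketch is that allocation is a per-tensor lookup once scores exist; all of the difficulty sits in producing scores that track real quality loss.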
Value #2: Benchmark-Validated, Not Just Perplexity
Perplexity is necessary but it's not the whole picture. A model could score well on next-token prediction while failing at actual reasoning. So we ran two academic benchmarks on Qwen3-8B: ARC-Challenge (science reasoning, 25-shot) and HellaSwag (commonsense inference, 10-shot) across all three conditions.
| Benchmark | BF16 | RAM v3 | Δ | Uniform 4-bit | Δ |
|---|---|---|---|---|---|
| ARC-Challenge (acc_norm) | 44.62% | 43.43% | -1.19pp | 42.83% | -1.79pp |
| HellaSwag (acc_norm) | 60.04% | 58.16% | -1.88pp | 58.14% | -1.90pp |
On ARC-Challenge, RAM loses about a third fewer accuracy points than uniform 4-bit (1.19pp vs 1.79pp, so RAM's loss is roughly 66% of the uniform loss). Normalised accuracy on genuinely hard science questions is better preserved by sensitivity-aware allocation.
HellaSwag shows near-identical results between the two (-1.88pp vs -1.90pp). That makes sense. Commonsense inference is spread broadly across the model, so there's less to gain from selective bit allocation. On this benchmark RAM only matches uniform 4-bit despite using about 50% more storage (6.1 GB vs 4.1 GB); the extra bits pay off on the capabilities that do benefit from higher precision, such as the ARC-Challenge reasoning above.
The cross-validation between perplexity and benchmarks is what matters here. RAM's lower PPL degradation (3.8% vs 5.4% for uniform 4-bit) lines up with a measurable accuracy gain on reasoning tasks. The metrics agree. This isn't an evaluation artifact.
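The benchmark deltas are easy to misread, so here is the arithmetic spelled out from the table values. The helper names are ours; nothing here comes from lm-evaluation-harness itself.

```python
def points_lost(baseline, quantized):
    """Accuracy drop in percentage points."""
    return baseline - quantized

def retained_pct(baseline, quantized):
    """Share of baseline accuracy kept after quantization, in percent."""
    return 100.0 * quantized / baseline

# acc_norm values copied from the table: (BF16, RAM v3, uniform 4-bit).
benchmarks = {
    "ARC-Challenge": (44.62, 43.43, 42.83),
    "HellaSwag": (60.04, 58.16, 58.14),
}

summary = {}
for name, (bf16, ram, uniform) in benchmarks.items():
    ram_loss = points_lost(bf16, ram)
    uni_loss = points_lost(bf16, uniform)
    summary[name] = {
        "ram_loss_pp": ram_loss,
        "uniform_loss_pp": uni_loss,
        # How much of uniform 4-bit's loss RAM avoids.
        "loss_avoided_pct": 100.0 * (uni_loss - ram_loss) / uni_loss,
        "ram_retained_pct": retained_pct(bf16, ram),
    }
    print(name, {k: round(v, 2) for k, v in summary[name].items()})
```

Note the difference between "points lost" and "accuracy retained": both are reported in this post, and they answer different questions.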
Value #3: No Data Required
This is the property that changes everything for production deployment.
Most competitive quantization methods (GPTQ, AWQ, SqueezeLLM) need calibration data. You feed representative samples through the model to measure activation patterns, then optimise quantization parameters against those observations. This creates three problems:
- Privacy exposure. If your model was trained on sensitive data, calibration samples may need to come from similar distributions. In regulated industries like healthcare, finance, or government, that can be a compliance blocker.
- Distribution bias. Calibration data determines which model behaviours get preserved. If your calibration set doesn't represent real production queries, the quantised model may degrade on exactly the tasks that matter most.
- Pipeline friction. Every time you update a model, you need to source and validate calibration data, run it through the model, and hope the activation statistics are representative. That's a manual step that doesn't belong in an automated CI/CD pipeline.
RAM analyses only the weight tensors themselves. It uses a proprietary sensitivity analysis measured directly from the weights, with no forward pass required. The entire analysis runs in 3 minutes for an 8B model and 30 minutes for a 30B MoE model.
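To show what "measured directly from the weights" can look like, here is a stand-in metric. This is explicitly not RAM's analysis, which this post does not describe: the sketch uses round-trip quantization error, a common weight-only proxy, on two synthetic tensors. No forward pass and no data are involved.

```python
import random

def quantize_roundtrip(weights, bits):
    """Symmetric uniform quantization to `bits`, then dequantization."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels at 4-bit
    scale = max(abs(w) for w in weights) / levels
    if scale == 0:
        return list(weights)  # all-zero tensor: nothing to quantize
    return [round(w / scale) * scale for w in weights]

def relative_error(weights, bits):
    """Relative L2 error introduced by quantizing to `bits`."""
    norm = sum(w * w for w in weights)
    if norm == 0:
        return 0.0
    deq = quantize_roundtrip(weights, bits)
    err = sum((w - q) ** 2 for w, q in zip(weights, deq))
    return (err / norm) ** 0.5

rng = random.Random(0)
# Two synthetic "tensors": one well-behaved, one with a few large outliers
# that stretch the quantization grid and crush the small weights.
smooth = [rng.gauss(0.0, 0.02) for _ in range(4096)]
spiky = smooth[:-4] + [0.8, -0.9, 1.1, -1.0]

for name, tensor in (("smooth", smooth), ("spiky", spiky)):
    print(name, {b: round(relative_error(tensor, b), 4) for b in (2, 4, 8)})
```

The outlier-heavy tensor shows a much larger error at the same bit width, which is the kind of signal a data-free method can use to decide where extra bits are worth spending.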
That means RAM can slot into an automated model registry. A new checkpoint lands, RAM analyses it, optimal bit allocation is determined, and the quantised model deploys. No human in the loop. No calibration data to curate.
Value #4: Models as Diagnostic Subjects
This was the unexpected discovery. RAM's per-tensor sensitivity analysis doesn't just tell you how to quantise a model. It reveals how the model was built.
Look at the sensitivity score distributions across our test models:
| Model | Sensitivity Span | Max 2-bit | Max 16-bit | Interpretation |
|---|---|---|---|---|
| Qwen3-8B | 1.173 | 16.8% | 20.3% | Diverse, clear sensitive/robust layers |
| Qwen3-30B-A3B | 1.378 | 41.3% | 7.3% | Highly diverse, many robust experts |
| GLM-4.7-Flash | 0.073 | 0.0% | 3.3% | Homogeneous, all tensors look the same |
The Qwen models show a sensitivity span of 1.17 to 1.38: there are clearly tough tensors that need protection and tolerant ones that can take aggressive compression. That variance is exactly what RAM exploits.
GLM-4.7-Flash tells a radically different story. Its sensitivity span is 0.073, roughly 16 to 19 times narrower than the Qwen models (1.173 and 1.378). Every tensor looks nearly identical to RAM's analysis. That has two implications:
- For quantization: RAM can't effectively differentiate tensors, so it defaults to near-uniform allocation. Mixed-precision has limited upside on heavily regularised models.
- For the model builders: This uniformity likely reflects heavy regularisation or normalisation during training. The same property that makes tensors indistinguishable to RAM also correlates with the confidence calibration fragility we observed: 5 catastrophic sequences where the model assigns near-zero probability to correct tokens, producing perplexity spikes of 25,000 to 81,000.
RAM became a model X-ray. Without access to training code, training data, or any insider knowledge, the sensitivity profile alone told us GLM was trained differently from Qwen. It also predicted the exact class of evaluation artifacts we later found.
Value #5: Discovering Evaluation Blind Spots
RAM's evaluation on GLM-4.7-Flash produced a result that should have been a headline: the quantized model reported 12.5% lower perplexity than the BF16 baseline. A model compressed to a quarter of its size, scoring better than full precision.
We spent a week digging into it. The answer: 5 out of 256 WikiText-2 test sequences produce catastrophic perplexity (25,000 to 81,000) in the full-precision model. Quantization noise accidentally stabilises these pathological sequences. Standard mean perplexity, dominated by those outliers, makes the quantised model look better.
The honest number, median perplexity, shows RAM is 4.3% worse. Exactly what you'd expect from lossy compression.
This isn't a footnote. Perplexity is the most-reported metric in quantization research. If standard mean PPL can be dominated by 2% of evaluation sequences, then published results across the field may be unreliable. The same outlier dynamics that make quantization hard (heavy-tailed weight distributions) also make evaluating quantization unreliable (heavy-tailed sequence distributions).
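The effect is easy to reproduce with synthetic numbers. The sketch below does not use our evaluation data: the per-sequence perplexities are made up to match the pattern described above (251 ordinary sequences, 5 pathological ones), and it assumes the reported mean is an average over per-sequence perplexities.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic per-sequence perplexities: 251 ordinary sequences near 8.7 and
# 5 pathological ones in the 25,000 to 81,000 range.
ordinary = [rng.gauss(8.7, 1.0) for _ in range(251)]
outliers = [25_000, 38_000, 52_000, 67_000, 81_000]
bf16 = ordinary + outliers

# Toy "quantized" model: every ordinary sequence gets 4% worse, while noise
# happens to damp the pathological ones (an assumption for illustration).
quantized = [p * 1.04 for p in ordinary] + [o * 0.8 for o in outliers]

stats = {}
for name, ppls in (("BF16", bf16), ("quantized", quantized)):
    stats[name] = (statistics.mean(ppls), statistics.median(ppls))
    print(f"{name}: mean={stats[name][0]:.1f} median={stats[name][1]:.2f}")
```

The mean says the quantized model improved; the median says it got 4% worse. Five sequences out of 256 are enough to flip the conclusion, which is why we report the median for GLM-4.7-Flash.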
RAM's contribution here wasn't the quantization itself. It was the rigour of the evaluation process. By investigating an anomalous result instead of celebrating it, we found a systemic weakness in how the entire field measures compression quality.
Value #6: Purpose-Built for the MoE Era
The industry is going MoE. Qwen3, DeepSeek-V3, Mixtral, DBRX, GLM-4: the largest and most capable open models increasingly use sparse expert architectures. And MoE is where RAM's value proposition is strongest.
Here's why. A 30B MoE model with 128 experts per layer has enormous internal diversity. Some experts activate frequently and encode critical knowledge. Others activate rarely and handle niche patterns. Uniform 4-bit treats them identically and loses 10.3% on Qwen3-30B-A3B.
RAM's per-tensor analysis spots the critical experts automatically. On Qwen3-30B-A3B, it allocates:
- 16.6% of tensors to 2-bit, robust experts that handle extreme compression just fine
- 71.9% at 4-bit, the standard allocation for most weights
- 6.3% at 8-bit, moderately sensitive tensors
- 5.3% at 16-bit, the most sensitive attention and embedding layers
The result: 65% less quality degradation (3.6% vs 10.3%) with only 7% less compression (3.51x vs 3.76x). In efficiency terms, dividing degradation by compression ratio, RAM v3 on MoE comes to about 1.0% degradation per unit of compression (3.6% / 3.51x). That's the best ratio across every condition we tested.
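Two quick calculations follow from the allocation and the results table. Both rest on assumptions we want to flag: the average bit width is weighted by tensor count (per-tensor parameter counts are not listed here, so the true bits-per-weight will differ), and "degradation per unit of compression" is taken to mean ΔPPL divided by compression ratio.

```python
# Share of tensors at each bit width on Qwen3-30B-A3B, from the list above.
allocation = {2: 0.166, 4: 0.719, 8: 0.063, 16: 0.053}

# Tensor-count-weighted average bit width (not parameter-weighted).
avg_bits = sum(bits * share for bits, share in allocation.items())

def degradation_per_compression(delta_ppl_pct, compression):
    """Assumed definition: percent PPL degradation per 1x of compression."""
    return delta_ppl_pct / compression

# (ΔPPL vs BF16 in percent, compression ratio) from the results table.
conditions = {
    "Qwen3-8B uniform 4-bit": (5.4, 3.77),
    "Qwen3-8B RAM v3": (3.8, 2.52),
    "Qwen3-30B-A3B uniform 4-bit": (10.3, 3.76),
    "Qwen3-30B-A3B RAM v3": (3.6, 3.51),
    "GLM-4.7-Flash RAM v3": (4.3, 3.66),
}

# Lower is better: less quality lost for each unit of size reduction.
ranked = sorted(
    conditions, key=lambda k: degradation_per_compression(*conditions[k])
)
for name in ranked:
    print(f"{degradation_per_compression(*conditions[name]):.2f}  {name}")
```

Under that definition the MoE RAM condition ranks first and uniform 4-bit on the same model ranks last, which matches the ordering described above.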
As models get larger and more expert-heavy, this advantage compounds. RAM doesn't just compress MoE models. It understands their internal structure.
Where RAM Falls Short
Honest evaluation means reporting limitations. We found three.
1. Highly regularised models neutralise it. GLM-4.7-Flash's extreme weight uniformity (sensitivity span 0.073) means RAM can't tell tensors apart effectively. If every tensor has nearly identical statistics, mixed-precision allocation has nothing to work with. RAM doesn't hurt in these cases; it just falls back to something close to uniform. But it can't help either.
2. Dense models see modest gains. On Qwen3-8B, RAM reduces PPL degradation from 5.4% to 3.8%. That's a real improvement, but with 50% more storage. The efficiency ratio is less compelling than on MoE. For dense models, the value is real but incremental.
3. Adaptive normalization can backfire. RAM v2's adaptive normalization amplifies tiny differences in narrow metric ranges, causing unnecessary bit upgrades on dense models. On Qwen3-8B, v2 actually performed worse than v1 (6.3% vs 4.1% degradation). The v3 hybrid approach fixes this, but the sensitivity to normalization strategy is a design constraint that needs continued attention.
The Composite Value
RAM isn't one thing. The evaluation revealed five distinct capabilities coming from the same analysis pipeline:
| Capability | Evidence | Who Benefits |
|---|---|---|
| Data-free compression | 3–4% PPL degradation at 2.5–3.7× compression, no calibration data needed | Regulated industries, privacy-sensitive deployments |
| MoE-optimised allocation | 65% less quality loss vs uniform 4-bit on Qwen3-30B-A3B | Anyone deploying MoE models on constrained hardware |
| Model diagnostics | Predicted GLM's calibration fragility from weight statistics alone | Model developers, quality assurance teams |
| Evaluation methodology | Found a mean-perplexity outlier anomaly that may affect published results | The entire quantization research community |
| Pipeline automation | 3–30 min end-to-end, no human-in-the-loop steps | MLOps teams running model registries |
Most quantization tools do exactly one thing: compress a model. RAM does that too, but the analysis it performs along the way turns out to be at least as valuable as the compression itself.
What This Means in Practice
A 30B MoE model in BF16 needs 57 GB of memory. Most edge devices, Mac laptops, workstations, and embedded systems can't load it. At 16.2 GB after RAM compression, it fits comfortably on a 32 GB M-series Mac with room for KV cache and inference overhead. The quality cost: 3.6% higher perplexity.
For the Qwen3-8B dense model, the RAM build at 6.1 GB runs on any modern laptop with 8 GB of memory. It retains 97.3% of ARC-Challenge accuracy (43.43% vs 44.62%) and 96.9% of HellaSwag accuracy (58.16% vs 60.04%) compared to the 15.3 GB BF16 version that wouldn't fit in memory at all.
The models don't just fit. They run faster. Smaller models mean less memory bandwidth pressure, which is the actual bottleneck on Apple Silicon and most inference hardware. In our benchmarks, the RAM-quantised Qwen3-8B finished ARC-Challenge evaluation in 1,060 seconds vs 1,199 seconds for BF16. That's 11.6% less wall-clock time, from reduced model size alone.
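A back-of-the-envelope model shows why size matters for speed. This is a roofline-style sketch, not a measurement: it assumes a dense model whose weights are all read once per generated token, and it uses 800 GB/s, Apple's published peak memory bandwidth for M2 Ultra, which sustained workloads will not reach.

```python
def time_saved_pct(baseline_s, quantized_s):
    """Reduction in wall-clock time relative to the baseline, in percent."""
    return 100.0 * (baseline_s - quantized_s) / baseline_s

def decode_tokens_per_s_ceiling(model_gb, bandwidth_gb_s):
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / model_gb

# Timings from the ARC-Challenge runs quoted above.
saved = time_saved_pct(baseline_s=1199, quantized_s=1060)

# Peak bandwidth is an assumption taken from Apple's spec sheet.
PEAK_BANDWIDTH_GB_S = 800.0
ceilings = {
    name: decode_tokens_per_s_ceiling(size_gb, PEAK_BANDWIDTH_GB_S)
    for name, size_gb in (("BF16", 15.3), ("RAM v3", 6.1))
}
print(f"time saved: {saved:.1f}%")
print({name: round(v, 1) for name, v in ceilings.items()})
```

The ceiling ratio is much larger than the measured saving. That gap is expected: benchmark scoring is dominated by prompt processing, which is more compute-bound than token-by-token generation, so the bandwidth advantage shows up most in interactive decoding.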
The Bottom Line
RAM delivers what the quantization field has been missing: intelligent, automated, data-free compression that understands model architecture.
On MoE models, the architecture the industry is converging on, RAM holds perplexity degradation to 3.6% at 3.5x compression. On dense models, it provides meaningful improvements over uniform quantization with complete transparency about the trade-offs. And the analysis it performs along the way reveals things about models that no other tool can show you.
The results are real. Four models, 20,000 tensors, cross-validated with both perplexity and academic benchmarks, with every anomaly investigated rather than swept under the rug.
RAM is open source. All evaluation data, per-tensor manifests, and analysis tools are available at github.com/baa-ai/swan-quantization. Hardware: Apple M2 Ultra, 192 GB unified memory. Software: MLX 0.30.3, Python 3.12.
Read the Full Paper
The full RAM paper covers formal derivations of the proprietary compression framework, evaluation across four models and 20,000+ tensors, and deployment methodology. It's on our HuggingFace:
RAM: Proprietary Compression via Proprietary Compression, Full Paper
huggingface.co/spaces/baa-ai/swan-paper
Licensed under CC BY-NC-ND 4.0