We tested RAM on four models (Qwen3-8B, Qwen3-30B-A3B, GLM-4.7-Flash, GLM-4.7) spanning three architectures (dense transformer, sparse MoE, dense MoE) across multiple RAM versions from v1 through v3-opt. Here's everything we found.
Test Environment
Every experiment ran on a single workstation with fixed seeds and the same evaluation protocol. Reproducibility wasn't an afterthought; it was a hard requirement from day one.
Hardware
- Compute: Apple M2 Ultra
- Memory: 192 GB Unified Memory
Software
- Python: 3.12.0
- MLX: 0.30.3
- mlx_lm: 0.30.4
- PyTorch: 2.6.0
Evaluation Protocol
- Perplexity: WikiText-2 test set, seq_len=2048, 256 samples, seed=42
- Benchmarks: lm-evaluation-harness v0.4.11 via mlx_lm.evaluate
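For readers who want the metric spelled out, here is a minimal sketch of the conventional perplexity definition: the exponential of the mean per-token negative log-likelihood. The function name and the sample losses are illustrative assumptions, not taken from the evaluation scripts, and the exact aggregation used in our runs (per token or per sequence) isn't specified in this article.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical per-token losses, for illustration only.
example_nlls = [2.1, 2.4, 2.3, 2.2]
print(f"PPL = {perplexity(example_nlls):.3f}")
```

A model that predicts every token with certainty has zero loss and a perplexity of 1; higher values mean the model is more "surprised" by the text.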
Perplexity Results: Qwen3-8B (Dense Transformer, 8.19B Parameters)
Qwen3-8B is a standard dense transformer where every parameter fires on every forward pass. This is RAM's primary design target. The results show a clear progression from v1 through v3-opt.
| Condition | Avg Bits | Size (GB) | PPL | ΔPPL vs BF16 | Peak Mem (GB) |
|---|---|---|---|---|---|
| BF16 baseline | 16.00 | 15.26 | 9.727 | n/a | 17.44 |
| Uniform 4-bit | 4.00 | 4.05 | 10.250 | +5.4% | 6.36 |
| RAM v1 (fixed norm) | 6.17 | 5.88 | 10.122 | +4.1% | 8.13 |
| RAM v2 (adaptive norm) | 6.63 | 6.32 | 10.337 | +6.3% | 8.57 |
| RAM v3 (hybrid) | ~5.82 | 6.95 | 10.102 | +3.9% | 9.25 |
| RAM v3-opt | 5.82 | 6.05 | 10.097 | +3.8% | 8.30 |
The standout result: RAM v3-opt hits the best perplexity of any RAM version (+3.8% vs BF16) at 6.05 GB. That's 60% smaller than BF16 and only 49% larger than uniform 4-bit. The threshold optimisation step in v3-opt strips away v3's size overhead while keeping its quality gains intact.
Notice the v2 regression. On a dense model with relatively uniform tensors, adaptive norm allocation over-spends bits without actually improving quality. That's what pushed us toward the v3 hybrid approach.
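The percentages quoted for Qwen3-8B are plain relative changes, and they can be re-derived from the table values. The sketch below does that; the helper name `pct_change` is ours, introduced only for illustration.

```python
def pct_change(new, base):
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


# PPL and size (GB) copied from the Qwen3-8B table.
bf16_ppl, bf16_gb = 9.727, 15.26
conditions = {
    "Uniform 4-bit": (10.250, 4.05),
    "RAM v3-opt": (10.097, 6.05),
}

for name, (ppl, gb) in conditions.items():
    print(
        f"{name}: PPL {pct_change(ppl, bf16_ppl):+.1f}%, "
        f"size {pct_change(gb, bf16_gb):+.1f}% vs BF16"
    )

# Size premium of v3-opt over uniform 4-bit.
print(f"v3-opt vs uniform 4-bit: {pct_change(6.05, 4.05):+.1f}% size")
```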
Perplexity Results: Qwen3-30B-A3B (Sparse MoE, 30.53B Total / 3.3B Active)
Qwen3-30B-A3B is a sparse Mixture-of-Experts model with 128 experts, 8 active per token. The vast majority of parameters live in expert FFN layers that only fire for specific inputs. This creates a fundamentally different quantisation challenge. Most tensors are expert weights with similar structure, and each expert carries specialised knowledge that makes the model far more sensitive to uniform compression.
| Condition | Avg Bits | Size (GB) | PPL | ΔPPL vs BF16 | Peak Mem (GB) |
|---|---|---|---|---|---|
| BF16 baseline | 16.00 | 56.87 | 8.728 | n/a | 59.18 |
| Uniform 4-bit | 4.00 | 15.11 | 9.629 | +10.3% | 17.42 |
| RAM v1 (fixed norm) | 4.51 | 16.02 | 9.180 | +5.2% | 18.33 |
| RAM v2 (adaptive norm) | 4.69 | 16.65 | 8.976 | +2.8% | 18.97 |
| RAM v3 (hybrid) | ~4.5 | 16.20 | 9.041 | +3.6% | 18.51 |
| RAM v2-opt | ~4.6 | 16.42 | 9.057 | +3.8% | 18.73 |
The takeaway: On MoE, RAM v2 wins clearly at +2.8% degradation. That's less than a third of uniform 4-bit's +10.3%. The v3 hybrid approach actually hurts MoE performance because its norm-based fallback doesn't account for the structural regularity of expert layers. This architecture-dependent behaviour is exactly what motivated v4's auto-detection system, which routes dense models to v3-opt and MoE models to v2.
That uniform 4-bit result is worth pausing on. A 10.3% degradation means uniform quantisation is destroying meaningful expert specialisation. Sensitivity-aware allocation preserves it by protecting the weights that matter most.
Perplexity Results: GLM-4.7-Flash (Dense MoE, 31B Parameters)
GLM-4.7-Flash uses a dense MoE architecture. Its perplexity results need careful reading because of the outlier sequence problem we covered in a previous article.
| Condition | Standard PPL | Median PPL | Size (GB) |
|---|---|---|---|
| BF16 | 11.344 | 8.706 | 58.2 |
| FP16 | 11.208 | 8.609 | 55.8 |
| Uniform 4-bit | 11.532 | n/a | 14.8 |
| RAM v3 | 9.930* | 9.084 | 15.9 |
*Standard PPL is misleading here. The 9.930 figure looks like quantisation improves on the BF16 baseline. It doesn't. Five outlier sequences (PPL values of 25,000 to 81,000) dominate the mean in BF16 but get partially suppressed by quantisation noise. Median PPL tells the real story: RAM v3 degrades by 4.3% (9.084 vs 8.706), which is consistent with 3.66x compression. See article 18 for the full breakdown.
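To see how a handful of sequences can drag the headline number around while the median stays put, here is a small sketch on synthetic data. It assumes the standard figure is the exponential of the mean per-sequence log-perplexity over equal-length sequences; that aggregation, the function names, and every number in the example are assumptions for illustration, not our measured values.

```python
import math
import statistics


def standard_ppl(seq_ppls):
    """exp of the mean per-sequence log-perplexity (equal-length sequences)."""
    return math.exp(statistics.fmean([math.log(p) for p in seq_ppls]))


def median_ppl(seq_ppls):
    """Median of the per-sequence perplexities; robust to a few outliers."""
    return statistics.median(seq_ppls)


# Synthetic data: 251 well-behaved sequences plus 5 extreme outliers.
typical = [8.7] * 251
outliers = [25_000.0, 40_000.0, 55_000.0, 70_000.0, 81_000.0]
seq_ppls = typical + outliers

print(f"standard PPL without outliers: {standard_ppl(typical):.3f}")
print(f"standard PPL with outliers:    {standard_ppl(seq_ppls):.3f}")
print(f"median PPL with outliers:      {median_ppl(seq_ppls):.3f}")
```

Five sequences out of 256 are enough to move the standard figure noticeably, while the median doesn't change at all. Anything that dampens those five sequences, including quantisation noise, will look like an improvement in the mean.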
Academic Benchmark Results: Qwen3-8B
Perplexity measures prediction quality on raw text. But can the model still reason, recall knowledge, and follow instructions after compression? We ran ARC-Challenge (25-shot) and HellaSwag (10-shot) via mlx_lm.evaluate on Qwen3-8B across three conditions to find out.
| Model | Size | ARC-C (25-shot) | HellaSwag (10-shot) | PPL |
|---|---|---|---|---|
| BF16 | 15.26 GB | 44.62% | 60.04% | 9.727 |
| RAM v3-opt | 6.05 GB | 43.43% (-1.2 pp) | 58.16% (-1.9 pp) | 10.097 |
| Uniform 4-bit | 4.05 GB | 42.83% (-1.8 pp) | 58.14% (-1.9 pp) | 10.249 |
What we found: The ordering BF16 > RAM v3-opt > Uniform 4-bit holds across all three metrics. RAM beats uniform 4-bit on ARC-Challenge (43.43% vs 42.83%), showing that sensitivity-aware bit allocation does a better job of preserving reasoning than uniform compression. On HellaSwag, both quantised variants land within 0.02 percentage points of each other. That benchmark just isn't very sensitive to per-tensor precision differences.
In practical terms, RAM v3-opt gives you better reasoning quality than uniform 4-bit at a 49% size premium (6.05 GB vs 4.05 GB). Whether that trade-off makes sense depends on your deployment constraints.
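One reading note on the benchmark table: the bracketed deltas are absolute differences in percentage points, not relative changes. The short sketch below computes both from the table values so the distinction is explicit; the helper names are ours.

```python
def pp_delta(base_pct, new_pct):
    """Absolute difference in percentage points."""
    return new_pct - base_pct


def relative_delta(base_pct, new_pct):
    """Relative change versus the baseline score, in percent."""
    return (new_pct - base_pct) / base_pct * 100.0


# (BF16 score, RAM v3-opt score) in percent, from the table above.
scores = {
    "ARC-C": (44.62, 43.43),
    "HellaSwag": (60.04, 58.16),
}

for task, (base, quant) in scores.items():
    print(
        f"{task}: {pp_delta(base, quant):+.1f} pp, "
        f"{relative_delta(base, quant):+.1f}% relative"
    )
```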
Bit Allocation Analysis
RAM's core trick is giving different bit widths to different tensors based on quantisation sensitivity. Looking at how those bits actually get distributed tells you a lot about how each model's architecture interacts with the allocator.
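As a rough mental model of sensitivity-based allocation, the sketch below maps a per-tensor score to one of the four bit widths used in this article via fixed cut-offs. This is not RAM's actual scoring or thresholding: the cut-off values, the tensor names, and the scores are all hypothetical, chosen only to show the shape of the idea.

```python
from collections import Counter

# Hypothetical cut-offs: (exclusive upper bound on score, bit width).
# Scores at or above the last bound fall through to DEFAULT_BITS.
THRESHOLDS = [(0.2, 2), (0.7, 4), (0.9, 8)]
DEFAULT_BITS = 16


def allocate_bits(score, thresholds=THRESHOLDS, default=DEFAULT_BITS):
    """Return the bit width for one tensor given its sensitivity score."""
    for upper, bits in thresholds:
        if score < upper:
            return bits
    return default


def bit_distribution(scores):
    """Percentage of tensors assigned to each bit width."""
    counts = Counter(allocate_bits(s) for s in scores.values())
    total = len(scores)
    return {bits: 100.0 * n / total for bits, n in sorted(counts.items())}


# Hypothetical tensor names and scores.
example_scores = {
    "layers.0.attn.q_proj": 0.95,
    "layers.0.mlp.up_proj": 0.45,
    "layers.1.mlp.down_proj": 0.10,
    "embed_tokens": 0.80,
}
print(bit_distribution(example_scores))
```

Note that a distribution like this counts tensors, not parameters, so it won't translate directly into an average bit width when tensor sizes differ.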
Qwen3-8B Bit Distribution
| Bit Width | Uniform | RAM v1 | RAM v2 |
|---|---|---|---|
| 2-bit | 0% | 0% | 2.2% |
| 4-bit | 100% | 81.7% | 73.5% |
| 8-bit | 0% | 3.1% | 6.0% |
| 16-bit | 0% | 15.2% | 18.3% |
From v1 to v2, you can see more aggressive differentiation. v2's adaptive norm scoring finds a small tail of tensors (2.2%) that can handle brutal 2-bit quantisation, freeing bits that get redirected to the 18.3% of tensors kept at full 16-bit precision. Most tensors stay at 4-bit, but the edges of the distribution get wider.
Qwen3-30B-A3B Bit Distribution
| Bit Width | Uniform | RAM v1 | RAM v2 |
|---|---|---|---|
| 2-bit | 0% | 0% | 16.6% |
| 4-bit | 100% | 97.2% | 71.9% |
| 8-bit | 0% | 0.8% | 6.3% |
| 16-bit | 0% | 2.1% | 5.3% |
The MoE distribution looks completely different. RAM v2 pushes 16.6% of tensors down to 2-bit. Nearly all of these are expert FFN layers that the sensitivity analysis flags as redundant or highly tolerant of quantisation noise. This aggressive low-bit allocation for non-critical experts is exactly why RAM v2 shines on MoE. It recognises that not all experts matter equally and compresses the least sensitive ones hard.
Only 5.3% of tensors get 16-bit treatment here (vs 18.3% in Qwen3-8B). That makes sense: MoE models have fewer critical shared layers relative to their total parameter count.
Sensitivity Score Discrimination
RAM only works well when there's real variation in how sensitive different tensors are to quantisation. If all tensors respond the same way, there's nothing for the allocator to exploit. The sensitivity score span, the range between the most and least sensitive tensors, quantifies this opportunity.
| Model | Sensitivity Score Span | Max 2-bit % | Max 16-bit % |
|---|---|---|---|
| Qwen3-8B | 1.173 | 16.8% | 20.3% |
| Qwen3-30B-A3B | 1.378 | 41.3% | 7.3% |
| GLM-4.7-Flash | 0.073 | 0.0% | 3.3% |
This is telling. GLM-4.7-Flash's sensitivity span is roughly 16x to 19x narrower than the Qwen models' (0.073 vs 1.173 and 1.378). With a span that small, there's almost no variation for RAM to work with. The algorithm correctly responds by barely changing anything (0% at 2-bit, only 3.3% at 16-bit), effectively falling back to near-uniform quantisation.
That explains why RAM's benefit on GLM-4.7-Flash is modest compared to its impact on Qwen models. It's not an algorithm problem; it's a model property. When the architecture produces uniformly sensitive tensors, the correct allocation is uniform. RAM correctly figures this out and acts accordingly.
On the other end, Qwen3-30B-A3B's high span (1.378) with 41.3% of tensors eligible for 2-bit quantisation explains why RAM scores its biggest win on this model, cutting uniform 4-bit's 10.3% degradation down to 2.8%.
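The span itself is just the maximum score minus the minimum score. Here is a small sketch of that calculation, plus the ratios between the three reported spans; the function name and the toy score list are illustrative assumptions.

```python
def score_span(scores):
    """Range between the most and least sensitive tensors."""
    return max(scores) - min(scores)


# Toy scores, only to show the calculation.
print(f"toy span: {score_span([0.2, 0.9, 0.5]):.2f}")

# Reported spans from the table above.
spans = {"Qwen3-8B": 1.173, "Qwen3-30B-A3B": 1.378, "GLM-4.7-Flash": 0.073}
glm_span = spans["GLM-4.7-Flash"]
ratios = {model: span / glm_span for model, span in spans.items()}

for model, ratio in ratios.items():
    print(f"{model}: {ratio:.1f}x the GLM-4.7-Flash span")
```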
Compression Efficiency
Raw compression ratio and raw perplexity degradation don't tell the full story on their own. Compression efficiency, defined here as perplexity degradation (in percent) divided by compression ratio (BF16 size over quantised size), captures how much quality you sacrifice per unit of compression.
| Model | Condition | Compression | PPL Degradation | Efficiency |
|---|---|---|---|---|
| Qwen3-8B | Uniform 4-bit | 3.77× | +5.4% | 1.43 |
| Qwen3-8B | RAM v1 | 2.59× | +4.1% | 1.58 |
| Qwen3-30B-A3B | Uniform 4-bit | 3.76× | +10.3% | 2.74 |
| Qwen3-30B-A3B | RAM v2 | 3.42× | +2.8% | 0.82 |
Lower is better here. RAM v2 on the MoE model hits 0.82, meaning each unit of compression costs less than 1% perplexity. Uniform 4-bit on the same model scores 2.74, paying nearly 3% perplexity per compression unit. That 3.3x efficiency gap is RAM's strongest result across the entire evaluation.
For Qwen3-8B, uniform 4-bit actually has a better efficiency ratio (1.43 vs 1.58) because it achieves much higher compression (3.77x vs 2.59x). RAM v1 preserves more quality but at a lower compression ratio, so the per-unit cost ends up slightly higher. This is a good reminder that efficiency isn't the only metric. Absolute quality and absolute size both matter for real deployment decisions.
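The efficiency column can be recomputed from the sizes and degradation figures in the earlier perplexity tables. The sketch below does exactly that; the function names are ours, and the inputs are copied from the tables.

```python
def compression_ratio(base_gb, quant_gb):
    """BF16 size divided by quantised size."""
    return base_gb / quant_gb


def efficiency(ppl_degradation_pct, ratio):
    """Percent perplexity lost per unit of compression (lower is better)."""
    return ppl_degradation_pct / ratio


# (model, condition, BF16 GB, quantised GB, PPL degradation %)
rows = [
    ("Qwen3-8B", "Uniform 4-bit", 15.26, 4.05, 5.4),
    ("Qwen3-8B", "RAM v1", 15.26, 5.88, 4.1),
    ("Qwen3-30B-A3B", "Uniform 4-bit", 56.87, 15.11, 10.3),
    ("Qwen3-30B-A3B", "RAM v2", 56.87, 16.65, 2.8),
]

for model, condition, base_gb, quant_gb, degradation in rows:
    ratio = compression_ratio(base_gb, quant_gb)
    print(
        f"{model} / {condition}: {ratio:.2f}x, "
        f"efficiency {efficiency(degradation, ratio):.2f}"
    )
```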
Processing Time
RAM adds an analysis phase before quantisation, profiling every tensor to compute sensitivity scores. The conversion itself is fast. The analysis is what takes time.
| Phase | Qwen3-8B | Qwen3-30B-A3B |
|---|---|---|
| Tensors analysed | 399 | 18,867 |
| v3 analysis time | ~3.3 min | ~44 min |
| Conversion time | ~6s | ~15s |
The tensor count jumps 47x from Qwen3-8B to Qwen3-30B-A3B (399 to 18,867), but analysis time only increases about 13x. The sub-linear scaling happens because many MoE expert tensors share similar shapes and can be analysed in batches. Conversion itself is negligible, about 15 seconds even for a 30B parameter model.
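The scaling claim is easy to check from the table. The sketch below computes the two ratios and the average analysis time per tensor; since the timings are approximate, treat the per-tensor figures as ballpark values only.

```python
# Figures from the table above; analysis times are approximate.
tensors = {"Qwen3-8B": 399, "Qwen3-30B-A3B": 18_867}
analysis_min = {"Qwen3-8B": 3.3, "Qwen3-30B-A3B": 44.0}

tensor_ratio = tensors["Qwen3-30B-A3B"] / tensors["Qwen3-8B"]
time_ratio = analysis_min["Qwen3-30B-A3B"] / analysis_min["Qwen3-8B"]

# Average analysis time per tensor, in milliseconds.
ms_per_tensor = {m: analysis_min[m] * 60_000 / tensors[m] for m in tensors}

print(f"tensor count ratio: {tensor_ratio:.1f}x")
print(f"analysis time ratio: {time_ratio:.1f}x")
for model, ms in ms_per_tensor.items():
    print(f"{model}: ~{ms:.0f} ms per tensor")
```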
In practice, the analysis is a one-time cost. Once you've generated the sensitivity manifest, trying different threshold settings (like sweeping v3-opt parameters) only requires the conversion step.
Key Takeaways
Across four models, three architectures, and multiple RAM versions, several patterns show up consistently:
- RAM v3-opt is the pick for dense models. It hits the lowest perplexity (+3.8% on Qwen3-8B) at a reasonable size (6.05 GB). Threshold optimisation cuts v3's size overhead while keeping quality intact.
- RAM v2 stays best for MoE. The v3 hybrid fallback pathway hurts MoE performance because it doesn't account for expert layer regularity. On Qwen3-30B-A3B, v2 hits +2.8% vs v3's +3.6%.
- v4 auto-detection solves the routing problem. It detects the model architecture and sends dense models to v3-opt and MoE models to v2 automatically. Users don't need to understand these trade-offs.
- Threshold optimisation (v3-opt) saves 13% size at zero quality cost. The optimised thresholds produce a 6.05 GB model vs v3's 6.95 GB, with perplexity actually ticking down from 10.102 to 10.097.
- Standard perplexity can't be trusted alone. Always report median alongside mean. On GLM-4.7-Flash, the mean suggests quantisation improves quality. The median correctly shows 4.3% degradation.
- MoE models gain the most from sensitivity-aware quantisation. Uniform 4-bit costs Qwen3-30B-A3B 10.3% perplexity; RAM v2 brings that to 2.8%. The gap is narrower on dense Qwen3-8B (5.4% vs 3.8%) because dense models have more uniform tensor sensitivity.
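The routing rule from the v4 takeaway (dense models to v3-opt, MoE models to v2) can be sketched in a few lines. This is not the v4 implementation: how v4 actually detects the architecture isn't described here, so the detection heuristic below, including the list of config keys, is purely an assumption for illustration.

```python
# Config keys that often signal an MoE model. This list is an assumption
# made for the sketch, not the detection logic used by v4.
MOE_KEYS = ("num_experts", "num_local_experts", "n_routed_experts")


def is_moe(config):
    """Guess whether a model config dict describes an MoE model."""
    return any(int(config.get(key) or 0) > 1 for key in MOE_KEYS)


def pick_strategy(config):
    """Route MoE models to v2 and dense models to v3-opt."""
    return "v2" if is_moe(config) else "v3-opt"


# Hypothetical, heavily trimmed config dicts.
dense_config = {"hidden_size": 4096, "num_hidden_layers": 36}
moe_config = {"hidden_size": 2048, "num_experts": 128, "num_experts_per_tok": 8}

print(pick_strategy(dense_config))
print(pick_strategy(moe_config))
```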
Reproducibility
Every result in this article can be reproduced exactly:
- All random seeds fixed at 42 across all evaluation runs
- Per-tensor sensitivity manifests saved as JSON files, so you can reproduce bit allocation decisions exactly
- All perplexity and benchmark results stored as JSON with full configuration metadata (model paths, quantisation settings, evaluation parameters)
- Code available at github.com/baa-ai/swan-quantization
Hardware: Apple M2 Ultra, 192 GB unified memory. Software: Python 3.12.0, MLX 0.30.3, mlx_lm 0.30.4, PyTorch 2.6.0.
Read the Full Paper
The full RAM paper covers formal derivations of the proprietary compression framework, evaluation across four models and 20,000+ tensors, and deployment methodology. It's available on our HuggingFace:
RAM: Proprietary Compression via Proprietary Compression, Full Paper
huggingface.co/spaces/baa-ai/swan-paper
Licensed under CC BY-NC-ND 4.0