
RAM Evaluation Results: Four Models, Three Architectures, One Framework

March 2026 · Black Sheep AI Research

We evaluated RAM across four models (Qwen3-8B, Qwen3-30B-A3B, GLM-4.7-Flash, GLM-4.7), three architectures (dense transformer, sparse MoE, dense MoE), and multiple RAM versions (v1 through v3-opt). Here are the complete results.

Test Environment

All experiments were conducted on a single workstation with fixed seeds and consistent evaluation protocols. Reproducibility was a first-class concern throughout.

Hardware

Apple M2 Ultra with 192 GB of unified memory.

Software

Python 3.12.0, MLX 0.30.3, mlx_lm 0.30.4, PyTorch 2.6.0.

Evaluation Protocol

Every condition was run with fixed seeds under the same harness: perplexity on a common evaluation set, plus ARC-Challenge (25-shot) and HellaSwag (10-shot) via mlx_lm.evaluate.

Perplexity Results — Qwen3-8B (Dense Transformer, 8.19B Parameters)

Qwen3-8B is a standard dense transformer — the architecture where every parameter is active on every forward pass. This is RAM’s primary design target, and the results demonstrate a clear progression from v1 through v3-opt.

| Condition | Avg Bits | Size (GB) | PPL | ΔPPL vs BF16 | Peak Mem (GB) |
|---|---|---|---|---|---|
| BF16 baseline | 16.00 | 15.26 | 9.727 | — | 17.44 |
| Uniform 4-bit | 4.00 | 4.05 | 10.250 | +5.4% | 6.36 |
| RAM v1 (fixed norm) | 6.17 | 5.88 | 10.122 | +4.1% | 8.13 |
| RAM v2 (adaptive norm) | 6.63 | 6.32 | 10.337 | +6.3% | 8.57 |
| RAM v3 (hybrid) | ~5.82 | 6.95 | 10.102 | +3.9% | 9.25 |
| RAM v3-opt | 5.82 | 6.05 | 10.097 | +3.8% | 8.30 |

Key observation: RAM v3-opt achieves the best perplexity of any RAM version (+3.8% vs BF16) at a model size of 6.05 GB — 60% smaller than BF16 and only 49% larger than uniform 4-bit. The threshold optimisation step in v3-opt eliminates the size overhead of v3’s hybrid approach while preserving its quality gains.
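As a quick cross-check on these numbers, on-disk size follows almost directly from parameter count and average bit width; the remainder is quantisation scales and metadata. A minimal sketch:

```python
def approx_size_gb(params_billions: float, avg_bits: float) -> float:
    """Approximate model size: parameters x bits per weight / 8 bits per byte.
    Slightly underestimates real files, which also store quantisation scales."""
    return params_billions * avg_bits / 8

# Qwen3-8B (8.19B parameters) at RAM v3-opt's 5.82 average bits:
print(round(approx_size_gb(8.19, 5.82), 2))  # 5.96 -- close to the reported 6.05 GB
```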

Note the v2 regression: adaptive norm allocation on a dense model with relatively homogeneous tensors can over-allocate bits without commensurate quality improvement. This is what motivated the v3 hybrid approach.

Perplexity Results — Qwen3-30B-A3B (Sparse MoE, 30.53B Total / 3.3B Active)

Qwen3-30B-A3B is a sparse Mixture-of-Experts model with 128 experts, of which 8 are active per token. The vast majority of parameters sit in expert FFN layers that are only activated for specific inputs. This architecture presents a fundamentally different quantisation challenge: most tensors are expert weights with similar structure, and the model is far more sensitive to uniform compression because each expert carries specialised knowledge.

| Condition | Avg Bits | Size (GB) | PPL | ΔPPL vs BF16 | Peak Mem (GB) |
|---|---|---|---|---|---|
| BF16 baseline | 16.00 | 56.87 | 8.728 | — | 59.18 |
| Uniform 4-bit | 4.00 | 15.11 | 9.629 | +10.3% | 17.42 |
| RAM v1 (fixed norm) | 4.51 | 16.02 | 9.180 | +5.2% | 18.33 |
| RAM v2 (adaptive norm) | 4.69 | 16.65 | 8.976 | +2.8% | 18.97 |
| RAM v3 (hybrid) | ~4.5 | 16.20 | 9.041 | +3.6% | 18.51 |
| RAM v2-opt | ~4.6 | 16.42 | 9.057 | +3.8% | 18.73 |

Key observation: On MoE, RAM v2 is the clear winner at +2.8% degradation — less than a third of uniform 4-bit’s +10.3%. The v3 hybrid approach actually hurts MoE performance because its norm-based fallback pathway doesn’t account for the structural regularity of expert layers. This architecture-dependent behaviour is what led to the v4 auto-detection system, which routes dense models to v3-opt and MoE models to v2.

The uniform 4-bit result here is striking: 10.3% degradation means uniform quantisation destroys meaningful expert specialisation. Sensitivity-aware allocation preserves it by protecting the most critical expert weights.

Perplexity Results — GLM-4.7-Flash (Dense MoE, 31B Parameters)

GLM-4.7-Flash is a dense MoE architecture. Its perplexity results require careful interpretation due to the outlier sequence phenomenon documented in our previous article.

| Condition | Standard PPL | Median PPL | Size (GB) |
|---|---|---|---|
| BF16 | 11.344 | 8.706 | 58.2 |
| FP16 | 11.208 | 8.609 | 55.8 |
| Uniform 4-bit | 11.532 | — | 14.8 |
| RAM v3 | 9.930* | 9.084 | 15.9 |

*Standard PPL is misleading. The 9.930 figure appears to show quantisation improving over the BF16 baseline. This is an artifact caused by 5 outlier sequences (PPL values of 25,000–81,000) that dominate the mean in BF16 but are partially suppressed by quantisation noise. Median PPL tells the true story: RAM v3 degrades by 4.3% (9.084 vs 8.706), which is consistent with 3.66× compression. See article 18 for the full analysis.
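The mean-versus-median effect is easy to reproduce with made-up numbers. The per-sequence PPLs below are illustrative, not the actual evaluation data, and real corpus perplexity aggregates log-likelihoods rather than averaging per-sequence PPLs, but the distortion mechanism is the same:

```python
import statistics

# A well-behaved bulk of sequences plus a handful of extreme outliers,
# mimicking the pattern described above (illustrative values only).
ppls = [8.7] * 995 + [25_000, 40_000, 55_000, 70_000, 81_000]

print(round(statistics.mean(ppls), 1))    # 279.7 -- five outliers dominate the mean
print(round(statistics.median(ppls), 1))  # 8.7   -- the median reports the typical sequence
```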

Academic Benchmark Results — Qwen3-8B

Perplexity measures prediction quality on raw text, but academic benchmarks test whether the model can still reason, recall knowledge, and follow instructions after compression. We ran ARC-Challenge (25-shot) and HellaSwag (10-shot) via mlx_lm.evaluate on Qwen3-8B across three conditions.

| Model | Size | ARC-C (25-shot) | HellaSwag (10-shot) | PPL |
|---|---|---|---|---|
| BF16 | 15.26 GB | 44.62% | 60.04% | 9.727 |
| RAM v3-opt | 6.05 GB | 43.43% (−1.2%) | 58.16% (−1.9%) | 10.097 |
| Uniform 4-bit | 4.05 GB | 42.83% (−1.8%) | 58.14% (−1.9%) | 10.249 |

Key finding: The ordering BF16 > RAM v3-opt > Uniform 4-bit is consistent across all three metrics. RAM beats uniform 4-bit on ARC-Challenge (43.43% vs 42.83%), demonstrating that sensitivity-aware bit allocation preserves reasoning capability better than uniform compression. On HellaSwag, both quantised variants land within 0.02 percentage points of each other, suggesting this benchmark is less sensitive to per-tensor precision allocation.

The practical implication: RAM v3-opt gives you better reasoning quality than uniform 4-bit at a 49% size premium (6.05 GB vs 4.05 GB). Whether that trade-off is worth it depends on your deployment constraints.

Bit Allocation Analysis

RAM’s core mechanism is assigning different bit widths to different tensors based on their quantisation sensitivity. How those bits actually get distributed reveals how each model’s architecture interacts with the allocation algorithm.
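To make the mechanism concrete, here is a toy quantile-threshold allocator. The cut-points and the score-to-bit-width rule are illustrative assumptions on our part, not RAM's actual scoring:

```python
import numpy as np

def allocate_bits(scores, q2=0.05, q8=0.85, q16=0.95):
    """Map per-tensor sensitivity scores to bit widths via quantile thresholds.
    Illustrative only: the least sensitive tensors get 2 bits, the most
    sensitive keep 16, and everything in between lands at 4 or 8."""
    lo, mid, hi = np.quantile(scores, [q2, q8, q16])
    bits = np.full(len(scores), 4)
    bits[scores <= lo] = 2
    bits[(scores > mid) & (scores <= hi)] = 8
    bits[scores > hi] = 16
    return bits

rng = np.random.default_rng(0)
scores = rng.normal(size=399)  # one hypothetical score per Qwen3-8B tensor
bits = allocate_bits(scores)
print({b: int((bits == b).sum()) for b in (2, 4, 8, 16)})
```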

Qwen3-8B Bit Distribution

| Bit Width | Uniform | RAM v1 | RAM v2 |
|---|---|---|---|
| 2-bit | 0% | 0% | 2.2% |
| 4-bit | 100% | 81.7% | 73.5% |
| 8-bit | 0% | 3.1% | 6.0% |
| 16-bit | 0% | 15.2% | 18.3% |

The v1→v2 progression shows more aggressive differentiation: v2’s adaptive norm scoring identifies a small tail of tensors (2.2%) that can tolerate aggressive 2-bit quantisation, freeing bits that are reallocated to the 18.3% of tensors kept at full 16-bit precision. The majority of tensors remain at 4-bit, but the edges of the distribution widen.
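A tensor-count-weighted average of the v2 column lands near the reported figure; the residual gap suggests the reported 6.63 average bits is parameter-weighted (an assumption on our part, since large tensors kept at high precision would pull it upward):

```python
# RAM v2 distribution for Qwen3-8B, taken from the table above
# (fractions of tensors, not of parameters).
dist = {2: 0.022, 4: 0.735, 8: 0.060, 16: 0.183}
avg_bits = sum(width * frac for width, frac in dist.items())
print(round(avg_bits, 2))  # 6.39 -- vs the reported 6.63 average bits
```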

Qwen3-30B-A3B Bit Distribution

| Bit Width | Uniform | RAM v1 | RAM v2 |
|---|---|---|---|
| 2-bit | 0% | 0% | 16.6% |
| 4-bit | 100% | 97.2% | 71.9% |
| 8-bit | 0% | 0.8% | 6.3% |
| 16-bit | 0% | 2.1% | 5.3% |

The MoE distribution is dramatically different. RAM v2 pushes 16.6% of tensors to 2-bit — nearly all of these are expert FFN layers that the sensitivity analysis identifies as redundant or highly resilient to quantisation noise. This aggressive low-bit allocation for non-critical experts is precisely why RAM v2 excels on MoE: it recognises that not all experts are equally important and compresses the least sensitive ones aggressively.

Meanwhile, only 5.3% of tensors get 16-bit treatment (vs 18.3% in Qwen3-8B), reflecting the fact that MoE models have fewer critical shared layers relative to their total parameter count.

Sensitivity Score Discrimination

RAM’s effectiveness depends on there being meaningful variation in how sensitive different tensors are to quantisation. If all tensors respond similarly, there’s nothing for sensitivity-aware allocation to exploit. The sensitivity score span — the range between the most and least sensitive tensors — quantifies this opportunity.
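The span itself is simply the range of the per-tensor scores. A minimal sketch with hypothetical score profiles (illustrative numbers, not measured data):

```python
import numpy as np

def score_span(scores) -> float:
    """Range between the most and least sensitive tensors."""
    return float(np.max(scores) - np.min(scores))

# Two hypothetical score profiles:
heterogeneous = np.array([0.10, 0.35, 0.62, 0.90, 1.28])  # wide span: room to differentiate
homogeneous = np.array([0.50, 0.51, 0.52, 0.53, 0.55])    # narrow span: near-uniform is correct

print(round(score_span(heterogeneous), 3))  # 1.18
print(round(score_span(homogeneous), 3))    # 0.05
```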

| Model | Sensitivity Score Span | Max 2-bit % | Max 16-bit % |
|---|---|---|---|
| Qwen3-8B | 1.173 | 16.8% | 20.3% |
| Qwen3-30B-A3B | 1.378 | 41.3% | 7.3% |
| GLM-4.7-Flash | 0.073 | 0.0% | 3.3% |

Key insight: GLM-4.7-Flash tensors are 16-19× more homogeneous than the two Qwen models' (a span of 0.073 against 1.173 and 1.378). With almost no variation for RAM to exploit, the algorithm correctly responds by making minimal allocation changes (0% at 2-bit, only 3.3% at 16-bit), effectively falling back to near-uniform quantisation.

This explains why RAM’s benefit on GLM-4.7-Flash is modest compared to its impact on Qwen models. It is not a limitation of the algorithm — it is a property of the model. When the architecture produces uniformly sensitive tensors, the correct allocation is uniform. RAM correctly identifies this and acts accordingly.

Conversely, Qwen3-30B-A3B’s high span (1.378) with 41.3% of tensors eligible for 2-bit quantisation explains why RAM achieves its most dramatic improvement on this model — cutting uniform 4-bit’s 10.3% degradation down to 2.8%.

Compression Efficiency

Raw compression ratio and raw perplexity degradation do not tell the full story individually. Compression efficiency — defined as the ratio of perplexity degradation to compression ratio — captures how much quality you sacrifice per unit of compression.

| Model | Condition | Compression | PPL Degradation | Efficiency |
|---|---|---|---|---|
| Qwen3-8B | Uniform 4-bit | 3.77× | +5.4% | 1.43 |
| Qwen3-8B | RAM v1 | 2.59× | +4.1% | 1.58 |
| Qwen3-30B-A3B | Uniform 4-bit | 3.76× | +10.3% | 2.74 |
| Qwen3-30B-A3B | RAM v2 | 3.42× | +2.8% | 0.82 |
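The efficiency column can be recomputed directly from the other two columns:

```python
def compression_efficiency(ppl_degradation_pct: float, compression_ratio: float) -> float:
    """Percentage points of perplexity degradation per unit of
    compression ratio; lower is better."""
    return ppl_degradation_pct / compression_ratio

print(round(compression_efficiency(10.3, 3.76), 2))  # 2.74 -- uniform 4-bit on Qwen3-30B-A3B
print(round(compression_efficiency(2.8, 3.42), 2))   # 0.82 -- RAM v2 on the same model
```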

Lower efficiency scores are better — they mean less quality loss per unit of compression. RAM v2 on the MoE model achieves 0.82, meaning each unit of compression costs less than 1% perplexity. Uniform 4-bit on the same model scores 2.74, paying nearly 3% perplexity per compression unit. This 3.3× efficiency advantage is RAM’s strongest result across the entire evaluation.

For Qwen3-8B, uniform 4-bit actually has a better efficiency ratio (1.43 vs 1.58) because it achieves much higher compression (3.77× vs 2.59×). RAM v1 preserves more quality but at a lower compression ratio, so the per-unit cost is slightly higher. This highlights that efficiency is not the only metric that matters — the absolute quality level and the absolute size also factor into deployment decisions.

Processing Time

RAM adds an analysis phase before quantisation, where it profiles every tensor to compute sensitivity scores. The conversion step itself is fast; the analysis is the bottleneck.

| Phase | Qwen3-8B | Qwen3-30B-A3B |
|---|---|---|
| Tensors analysed | 399 | 18,867 |
| v3 analysis time | ~3.3 min | ~44 min |
| Conversion time | ~6 s | ~15 s |

The 47× increase in tensor count from Qwen3-8B to Qwen3-30B-A3B (399 to 18,867) drives a roughly 13× increase in analysis time. The sub-linear scaling reflects that many MoE expert tensors share similar shapes and can be analysed in batches. Conversion itself is negligible — under 15 seconds even for a 30B parameter model.

For deployment workflows, the analysis phase is a one-time cost. Once the sensitivity manifest is generated, subsequent conversions with different threshold settings (e.g., sweeping v3-opt parameters) require only the conversion step.
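That workflow can be sketched as a simple manifest cache. The file name and the `analyse` callback here are hypothetical, not RAM's actual interface:

```python
import json
from pathlib import Path

MANIFEST = Path("sensitivity_manifest.json")  # hypothetical file name

def sensitivity_scores(tensors, analyse):
    """Run the expensive per-tensor analysis once and cache the scores;
    subsequent threshold sweeps load the manifest and pay only the
    (seconds-long) conversion cost."""
    if MANIFEST.exists():
        return json.loads(MANIFEST.read_text())
    scores = {name: analyse(t) for name, t in tensors.items()}
    MANIFEST.write_text(json.dumps(scores))
    return scores
```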

Key Takeaways

Across four models, three architectures, and multiple RAM versions, several patterns emerge consistently:

- Architecture determines the best RAM version. v3-opt wins on dense transformers (+3.8% PPL at a 60% size reduction); v2 wins on sparse MoE (+2.8% degradation against uniform 4-bit's +10.3%). This split is exactly what the v4 auto-detection system encodes.
- Sensitivity score span predicts RAM's headroom. A wide span (Qwen3-30B-A3B at 1.378) allows aggressive differentiation; a near-zero span (GLM-4.7-Flash at 0.073) correctly collapses to near-uniform allocation.
- Uniform 4-bit is most damaging on MoE, where it erases expert specialisation that sensitivity-aware allocation preserves.
- Mean perplexity can mislead when outlier sequences are present; median PPL gives the truer picture.
- Sensitivity analysis is a one-time cost (minutes to under an hour); conversion takes seconds, so threshold sweeps are cheap once the manifest exists.

Reproducibility

Every result in this article is fully reproducible:

Hardware: Apple M2 Ultra, 192 GB unified memory.
Software: Python 3.12.0, MLX 0.30.3, mlx_lm 0.30.4, PyTorch 2.6.0.

Read the Full Paper

The complete RAM paper, including formal derivations of the proprietary compression framework, evaluation across four models and 20,000+ tensors, and deployment methodology, is available on our HuggingFace:

RAM: Proprietary Compression via Proprietary Compression — Full Paper

huggingface.co/spaces/baa-ai/swan-paper

Licensed under CC BY-NC-ND 4.0

