RAM Evaluation Results
Evaluation Report

RAM Evaluation Results: Four Models, Three Architectures, One Framework

March 2026 · Black Sheep AI Research

We tested RAM on four models (Qwen3-8B, Qwen3-30B-A3B, GLM-4.7-Flash, GLM-4.7) spanning three architectures (dense transformer, sparse MoE, dense MoE) across multiple RAM versions from v1 through v3-opt. Here's everything we found.

Test Environment

Every experiment ran on a single workstation with fixed seeds and the same evaluation protocol. Reproducibility wasn't an afterthought; it was a hard requirement from day one.

Hardware

Software

Evaluation Protocol

Perplexity Results: Qwen3-8B (Dense Transformer, 8.19B Parameters)

Qwen3-8B is a standard dense transformer where every parameter fires on every forward pass. This is RAM's primary design target. The results show a clear progression from v1 through v3-opt.

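| Condition | Avg Bits | Size (GB) | PPL | ΔPPL vs BF16 | Peak Mem (GB) |
|---|---|---|---|---|---|
| BF16 baseline | 16.00 | 15.26 | 9.727 | - | 17.44 |
| Uniform 4-bit | 4.00 | 4.05 | 10.250 | +5.4% | 6.36 |
| RAM v1 (fixed norm) | 6.17 | 5.88 | 10.122 | +4.1% | 8.13 |
| RAM v2 (adaptive norm) | 6.63 | 6.32 | 10.337 | +6.3% | 8.57 |
| RAM v3 (hybrid) | ~5.82 | 6.95 | 10.102 | +3.9% | 9.25 |
| RAM v3-opt | 5.82 | 6.05 | 10.097 | +3.8% | 8.30 |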

The standout result: RAM v3-opt hits the best perplexity of any RAM version (+3.8% vs BF16) at 6.05 GB. That's 60% smaller than BF16 and only 49% larger than uniform 4-bit. The threshold optimisation step in v3-opt strips away v3's size overhead while keeping its quality gains intact.
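The percentages quoted for v3-opt follow directly from the table. A minimal sketch of the arithmetic, using only figures from the Qwen3-8B table (the helper name `relative_change` is ours, not part of RAM):

```python
def relative_change(value: float, baseline: float) -> float:
    """Return the relative change of value vs baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


# Figures copied from the Qwen3-8B table.
BF16_PPL, BF16_GB = 9.727, 15.26
UNIFORM4_GB = 4.05
V3_OPT_PPL, V3_OPT_GB = 10.097, 6.05

ppl_delta = relative_change(V3_OPT_PPL, BF16_PPL)
size_vs_bf16 = relative_change(V3_OPT_GB, BF16_GB)
size_vs_uniform4 = relative_change(V3_OPT_GB, UNIFORM4_GB)

print(f"PPL vs BF16:            {ppl_delta:+.1f}%")
print(f"Size vs BF16:           {size_vs_bf16:+.1f}%")
print(f"Size vs uniform 4-bit:  {size_vs_uniform4:+.1f}%")
```

The same helper applies to any other row of the table.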

Notice the v2 regression. On a dense model with relatively uniform tensors, adaptive norm allocation over-spends bits without actually improving quality. That's what pushed us toward the v3 hybrid approach.

Perplexity Results: Qwen3-30B-A3B (Sparse MoE, 30.53B Total / 3.3B Active)

Qwen3-30B-A3B is a sparse Mixture-of-Experts model with 128 experts, 8 active per token. The vast majority of parameters live in expert FFN layers that only fire for specific inputs. This creates a fundamentally different quantisation challenge. Most tensors are expert weights with similar structure, and each expert carries specialised knowledge that makes the model far more sensitive to uniform compression.

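| Condition | Avg Bits | Size (GB) | PPL | ΔPPL vs BF16 | Peak Mem (GB) |
|---|---|---|---|---|---|
| BF16 baseline | 16.00 | 56.87 | 8.728 | - | 59.18 |
| Uniform 4-bit | 4.00 | 15.11 | 9.629 | +10.3% | 17.42 |
| RAM v1 (fixed norm) | 4.51 | 16.02 | 9.180 | +5.2% | 18.33 |
| RAM v2 (adaptive norm) | 4.69 | 16.65 | 8.976 | +2.8% | 18.97 |
| RAM v3 (hybrid) | ~4.5 | 16.20 | 9.041 | +3.6% | 18.51 |
| RAM v2-opt | ~4.6 | 16.42 | 9.057 | +3.8% | 18.73 |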

The takeaway: On MoE, RAM v2 wins clearly at +2.8% degradation. That's less than a third of uniform 4-bit's +10.3%. The v3 hybrid approach actually hurts MoE performance because its norm-based fallback doesn't account for the structural regularity of expert layers. This architecture-dependent behaviour is exactly what motivated v4's auto-detection system, which routes dense models to v3-opt and MoE models to v2.
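The routing rule described above (dense models to v3-opt, MoE models to v2) can be sketched in a few lines. This is an illustration only: the source does not say how v4 detects the architecture, so the function name `select_ram_version` and the config keys it inspects are assumptions.

```python
def select_ram_version(config: dict) -> str:
    """Pick a RAM version from a model config (hypothetical rule, not v4's code)."""
    # Assumption: MoE checkpoints expose an expert count under one of these keys.
    num_experts = config.get("num_experts") or config.get("n_routed_experts") or 0
    # MoE models go to v2, dense models to v3-opt, as described in the text.
    return "v2" if num_experts > 1 else "v3-opt"


print(select_ram_version({"num_experts": 128}))    # MoE-style config
print(select_ram_version({"hidden_size": 4096}))   # dense-style config
```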

That uniform 4-bit result is worth pausing on. A 10.3% degradation means uniform quantisation is destroying meaningful expert specialisation. Sensitivity-aware allocation preserves it by protecting the weights that matter most.

Perplexity Results: GLM-4.7-Flash (Dense MoE, 31B Parameters)

GLM-4.7-Flash uses a dense MoE architecture. Its perplexity results need careful reading because of the outlier sequence problem we covered in a previous article.

| Condition | Standard PPL | Median PPL | Size (GB) |
|---|---|---|---|
| BF16 | 11.344 | 8.706 | 58.2 |
| FP16 | 11.208 | 8.609 | 55.8 |
| Uniform 4-bit | 11.532 | - | 14.8 |
| RAM v3 | 9.930* | 9.084 | 15.9 |

*Standard PPL is misleading here. The 9.930 figure looks like quantisation improves on the BF16 baseline. It doesn't. Five outlier sequences (PPL values of 25,000 to 81,000) dominate the mean in BF16 but get partially suppressed by quantisation noise. Median PPL tells the real story: RAM v3 degrades by 4.3% (9.084 vs 8.706), which is consistent with 3.66x compression. See article 18 for the full breakdown.
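To see why a handful of outliers can distort an average while leaving the median alone, here is a small sketch on synthetic numbers. The values below are invented for illustration and are not the article's data. The sketch also assumes a plain arithmetic mean over per-sequence perplexities, which may differ from how the standard PPL figure is aggregated.

```python
import statistics

# Synthetic per-sequence perplexities (NOT the article's data):
# 995 typical sequences between 8.0 and about 10.0, plus 5 extreme outliers.
typical = [8.0 + 0.002 * i for i in range(995)]
outliers = [25_000.0, 40_000.0, 55_000.0, 70_000.0, 81_000.0]
ppl = typical + outliers

mean_ppl = statistics.mean(ppl)
median_ppl = statistics.median(ppl)
median_without_outliers = statistics.median(typical)

print(f"mean with outliers:      {mean_ppl:.1f}")
print(f"median with outliers:    {median_ppl:.3f}")
print(f"median without outliers: {median_without_outliers:.3f}")
```

Five sequences out of a thousand are enough to move the mean by orders of magnitude, while the median barely shifts. That is why the median is the safer summary when outliers are present.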

Academic Benchmark Results: Qwen3-8B

Perplexity measures prediction quality on raw text. But can the model still reason, recall knowledge, and follow instructions after compression? We ran ARC-Challenge (25-shot) and HellaSwag (10-shot) via mlx_lm.evaluate on Qwen3-8B across three conditions to find out.

| Model | Size | ARC-C (25-shot) | HellaSwag (10-shot) | PPL |
|---|---|---|---|---|
| BF16 | 15.26 GB | 44.62% | 60.04% | 9.727 |
| RAM v3-opt | 6.05 GB | 43.43% (-1.2 pp) | 58.16% (-1.9 pp) | 10.097 |
| Uniform 4-bit | 4.05 GB | 42.83% (-1.8 pp) | 58.14% (-1.9 pp) | 10.249 |

What we found: The ordering BF16 > RAM v3-opt > Uniform 4-bit holds across all three metrics. RAM beats uniform 4-bit on ARC-Challenge (43.43% vs 42.83%), showing that sensitivity-aware bit allocation does a better job of preserving reasoning than uniform compression. On HellaSwag, both quantised variants land within 0.02 percentage points of each other. That benchmark just isn't very sensitive to per-tensor precision differences.

In practical terms, RAM v3-opt gives you better reasoning quality than uniform 4-bit at a 49% size premium (6.05 GB vs 4.05 GB). Whether that trade-off makes sense depends on your deployment constraints.
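The deltas shown next to the benchmark scores are differences in percentage points, not relative changes. A short sketch of both calculations, using the accuracy figures from the table (the dictionary layout and function names are ours):

```python
# Accuracy figures (percent) copied from the benchmark table.
BASELINE = {"arc_challenge": 44.62, "hellaswag": 60.04}
QUANTISED = {
    "RAM v3-opt": {"arc_challenge": 43.43, "hellaswag": 58.16},
    "Uniform 4-bit": {"arc_challenge": 42.83, "hellaswag": 58.14},
}


def point_delta(score: float, baseline: float) -> float:
    """Absolute difference in percentage points."""
    return score - baseline


def relative_delta(score: float, baseline: float) -> float:
    """Relative change versus the baseline, in percent."""
    return (score / baseline - 1.0) * 100.0


for name, scores in QUANTISED.items():
    for task, base in BASELINE.items():
        pp = point_delta(scores[task], base)
        rel = relative_delta(scores[task], base)
        print(f"{name:14s} {task:14s} {pp:+.2f} pp  ({rel:+.2f}% relative)")
```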

Bit Allocation Analysis

RAM's core trick is giving different bit widths to different tensors based on quantisation sensitivity. Looking at how those bits actually get distributed tells you a lot about how each model's architecture interacts with the allocator.

Qwen3-8B Bit Distribution

| Bit Width | Uniform | RAM v1 | RAM v2 |
|---|---|---|---|
| 2-bit | 0% | 0% | 2.2% |
| 4-bit | 100% | 81.7% | 73.5% |
| 8-bit | 0% | 3.1% | 6.0% |
| 16-bit | 0% | 15.2% | 18.3% |

From v1 to v2, you can see more aggressive differentiation. v2's adaptive norm scoring finds a small tail of tensors (2.2%) that can handle brutal 2-bit quantisation, freeing bits that get redirected to the 18.3% of tensors kept at full 16-bit precision. Most tensors stay at 4-bit, but the edges of the distribution get wider.

Qwen3-30B-A3B Bit Distribution

| Bit Width | Uniform | RAM v1 | RAM v2 |
|---|---|---|---|
| 2-bit | 0% | 0% | 16.6% |
| 4-bit | 100% | 97.2% | 71.9% |
| 8-bit | 0% | 0.8% | 6.3% |
| 16-bit | 0% | 2.1% | 5.3% |

The MoE distribution looks completely different. RAM v2 pushes 16.6% of tensors down to 2-bit. Nearly all of these are expert FFN layers that the sensitivity analysis flags as redundant or highly tolerant of quantisation noise. This aggressive low-bit allocation for non-critical experts is exactly why RAM v2 shines on MoE. It recognises that not all experts matter equally and compresses the least sensitive ones hard.

Only 5.3% of tensors get 16-bit treatment here (vs 18.3% in Qwen3-8B). That makes sense: MoE models have fewer critical shared layers relative to their total parameter count.
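The general idea of mapping per-tensor sensitivity to a bit width, then summarising the result as a distribution like the ones above, can be sketched as follows. This is not RAM's allocator: the thresholds, the score values, and the tensor names are all hypothetical, chosen only to show the mechanism.

```python
from collections import Counter

BIT_WIDTHS = (2, 4, 8, 16)


def allocate_bits(score: float, thresholds=(0.2, 0.7, 0.9)) -> int:
    """Map a sensitivity score to a bit width using hypothetical cut-offs."""
    low, mid, high = thresholds
    if score < low:
        return 2    # least sensitive: compress hard
    if score < mid:
        return 4    # default bucket
    if score < high:
        return 8
    return 16       # most sensitive: keep at full precision


def bit_distribution(scores: dict[str, float]) -> dict[int, float]:
    """Percentage of tensors (by count) assigned to each bit width."""
    counts = Counter(allocate_bits(s) for s in scores.values())
    return {b: 100.0 * counts.get(b, 0) / len(scores) for b in BIT_WIDTHS}


# Hypothetical scores for five made-up tensor names.
example_scores = {
    "layers.0.mlp.experts.0.up_proj": 0.05,
    "layers.0.mlp.experts.1.up_proj": 0.40,
    "layers.0.self_attn.q_proj": 0.55,
    "layers.0.self_attn.o_proj": 0.80,
    "embed_tokens": 0.95,
}
print(bit_distribution(example_scores))
```

Note that a count-based distribution like this weights every tensor equally. It will not in general match a parameter-weighted average bit width.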

Sensitivity Score Discrimination

RAM only works well when there's real variation in how sensitive different tensors are to quantisation. If all tensors respond the same way, there's nothing for the allocator to exploit. The sensitivity score span, the range between the most and least sensitive tensors, quantifies this opportunity.

| Model | Sensitivity Score Span | Max 2-bit % | Max 16-bit % |
|---|---|---|---|
| Qwen3-8B | 1.173 | 16.8% | 20.3% |
| Qwen3-30B-A3B | 1.378 | 41.3% | 7.3% |
| GLM-4.7-Flash | 0.073 | 0.0% | 3.3% |

This is telling. GLM-4.7-Flash's sensitivity span is roughly 16x to 19x narrower than the Qwen models' (0.073 vs 1.173 and 1.378). With a span that small, there's almost no variation for RAM to work with. The algorithm correctly responds by barely changing anything (0% at 2-bit, only 3.3% at 16-bit), effectively falling back to near-uniform quantisation.

That explains why RAM's benefit on GLM-4.7-Flash is modest compared to its impact on Qwen models. It's not an algorithm problem; it's a model property. When the architecture produces uniformly sensitive tensors, the correct allocation is uniform. RAM correctly figures this out and acts accordingly.

On the other end, Qwen3-30B-A3B's high span (1.378) with 41.3% of tensors eligible for 2-bit quantisation explains why RAM scores its biggest win on this model, cutting uniform 4-bit's 10.3% degradation down to 2.8%.
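For concreteness, here is a sketch of the span as defined in this section (most sensitive score minus least sensitive score), together with the span ratios implied by the table. The `score_span` helper and the sample scores are ours. The three span values are taken from the table.

```python
def score_span(scores: list[float]) -> float:
    """Range between the most and least sensitive tensor scores."""
    return max(scores) - min(scores)


# Toy example with made-up scores.
print(score_span([0.12, 0.45, 0.90, 1.05]))

# Spans reported in the table.
SPANS = {"Qwen3-8B": 1.173, "Qwen3-30B-A3B": 1.378, "GLM-4.7-Flash": 0.073}

glm_span = SPANS["GLM-4.7-Flash"]
ratios = {
    name: span / glm_span
    for name, span in SPANS.items()
    if name != "GLM-4.7-Flash"
}
for name, ratio in ratios.items():
    print(f"{name}: span is {ratio:.1f}x that of GLM-4.7-Flash")
```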

Compression Efficiency

Raw compression ratio and raw perplexity degradation don't tell the full story on their own. Compression efficiency, defined here as perplexity degradation in percent divided by compression ratio, captures how much quality you sacrifice per unit of compression.

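| Model | Condition | Compression | PPL Degradation | Efficiency |
|---|---|---|---|---|
| Qwen3-8B | Uniform 4-bit | 3.77× | +5.4% | 1.43 |
| Qwen3-8B | RAM v1 | 2.59× | +4.1% | 1.58 |
| Qwen3-30B-A3B | Uniform 4-bit | 3.76× | +10.3% | 2.74 |
| Qwen3-30B-A3B | RAM v2 | 3.42× | +2.8% | 0.82 |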

Lower is better here. RAM v2 on the MoE model hits 0.82, meaning each unit of compression costs less than 1% perplexity. Uniform 4-bit on the same model scores 2.74, paying nearly 3% perplexity per compression unit. That 3.3x efficiency gap is RAM's strongest result across the entire evaluation.

For Qwen3-8B, uniform 4-bit actually has a better efficiency ratio (1.43 vs 1.58) because it achieves much higher compression (3.77x vs 2.59x). RAM v1 preserves more quality but at a lower compression ratio, so the per-unit cost ends up slightly higher. This is a good reminder that efficiency isn't the only metric. Absolute quality and absolute size both matter for real deployment decisions.
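The efficiency column can be recomputed from the other two columns. A minimal sketch, using the rounded compression and degradation figures from the table (the function and variable names are ours):

```python
def efficiency(ppl_degradation_pct: float, compression_ratio: float) -> float:
    """Perplexity degradation (percent) paid per unit of compression."""
    return ppl_degradation_pct / compression_ratio


# (model, condition, compression ratio, PPL degradation in percent)
ROWS = [
    ("Qwen3-8B", "Uniform 4-bit", 3.77, 5.4),
    ("Qwen3-8B", "RAM v1", 2.59, 4.1),
    ("Qwen3-30B-A3B", "Uniform 4-bit", 3.76, 10.3),
    ("Qwen3-30B-A3B", "RAM v2", 3.42, 2.8),
]

results = {
    (model, cond): efficiency(deg, ratio) for model, cond, ratio, deg in ROWS
}
for (model, cond), eff in results.items():
    print(f"{model:14s} {cond:14s} {eff:.2f}")

# Efficiency gap on the MoE model: uniform 4-bit vs RAM v2.
moe_gap = (
    results[("Qwen3-30B-A3B", "Uniform 4-bit")]
    / results[("Qwen3-30B-A3B", "RAM v2")]
)
print(f"MoE efficiency gap: {moe_gap:.1f}x")
```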

Processing Time

RAM adds an analysis phase before quantisation, profiling every tensor to compute sensitivity scores. The conversion itself is fast. The analysis is what takes time.

| Phase | Qwen3-8B | Qwen3-30B-A3B |
|---|---|---|
| Tensors analysed | 399 | 18,867 |
| v3 analysis time | ~3.3 min | ~44 min |
| Conversion time | ~6s | ~15s |

The tensor count jumps 47x from Qwen3-8B to Qwen3-30B-A3B (399 to 18,867), but analysis time only increases about 13x. The sub-linear scaling happens because many MoE expert tensors share similar shapes and can be analysed in batches. Conversion itself is negligible, taking about 15 seconds even for a 30B parameter model.
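The scaling figures can be checked from the table. The sketch below also derives an average analysis time per tensor. That figure is our own derived quantity, not one the evaluation reports, and it is based on the approximate timings.

```python
# Figures copied from the processing-time table (timings are approximate).
TENSORS = {"Qwen3-8B": 399, "Qwen3-30B-A3B": 18_867}
ANALYSIS_MIN = {"Qwen3-8B": 3.3, "Qwen3-30B-A3B": 44.0}

tensor_ratio = TENSORS["Qwen3-30B-A3B"] / TENSORS["Qwen3-8B"]
time_ratio = ANALYSIS_MIN["Qwen3-30B-A3B"] / ANALYSIS_MIN["Qwen3-8B"]

# Average seconds of analysis per tensor (derived, not reported).
sec_per_tensor = {
    name: ANALYSIS_MIN[name] * 60.0 / TENSORS[name] for name in TENSORS
}

print(f"tensor count ratio:  {tensor_ratio:.1f}x")
print(f"analysis time ratio: {time_ratio:.1f}x")
for name, secs in sec_per_tensor.items():
    print(f"{name}: {secs:.2f} s per tensor")
```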

In practice, the analysis is a one-time cost. Once you've generated the sensitivity manifest, trying different threshold settings (like sweeping v3-opt parameters) only requires the conversion step.

Key Takeaways

Across four models, three architectures, and multiple RAM versions, several patterns show up consistently:

Reproducibility

Every result in this article can be reproduced exactly:

Hardware: Apple M2 Ultra, 192 GB unified memory. Software: Python 3.12.0, MLX 0.30.3, mlx_lm 0.30.4, PyTorch 2.6.0.

Read the Full Paper

The full RAM paper covers formal derivations of the proprietary compression framework, evaluation across four models and 20,000+ tensors, and deployment methodology. It's available on our HuggingFace:

RAM: Proprietary Compression via Proprietary Compression, Full Paper

huggingface.co/spaces/baa-ai/swan-paper

Licensed under CC BY-NC-ND 4.0
