RAM Benchmark Results
RAM Research

RAM Benchmark Results: 7 Models, 40,000+ Questions, One Winner

March 2026 · Black Sheep AI Research

Complete evaluation across 7 model families, 5 benchmark suites, and over 40,000 questions. RAM outperforms uniform quantization and calibration-based GPTQ on every model tested — while requiring zero GPUs and zero calibration data.

Claims about quantization quality are cheap. Numbers are not. We evaluated RAM across 7 model families spanning 8B to 109B parameters, dense and MoE architectures, using perplexity (WikiText-2), ARC-Challenge, Winogrande, HellaSwag, and MMLU. Every benchmark, every model, every number is reported below.

All experiments run on an Apple M2 Ultra (192 GB unified memory). Perplexity uses WikiText-2 test split, sequence length 2048, seed 42. Downstream benchmarks use lm-evaluation-harness via the MLX backend.

1. Qwen3.5-35B-A3B — Scaling Study (PPL + MMLU)

MoE architecture, 35B total parameters, 3B active. Budget-constrained optimal allocation at 5 levels from 21–72 GB. MMLU: 14,015 questions, 57 subjects, 5-shot.

Model Size Avg bits PPL (mean) PPL (median) Med vs BF16 MMLU (5-shot)
Uniform 4-bit20.4 GB4.0 6.8356.764+4.2%70.35%
RAM 21 GB21 GB4.5 6.7136.630+2.1%70.22%
RAM 30 GB30 GB6.5 6.6276.535+0.6%70.91%
RAM 37 GB37 GB8.0 6.5826.505+0.2%71.71%
RAM 51 GB51 GB11.6 6.5976.518+0.4%71.51%
BF1672 GB16.0 6.5866.494baseline~72–73%*

RAM 21 GB MMLU verified by independent rerun (70.22% both times). *BF16 MMLU estimated from Qwen official benchmarks. RAM 37 GB matches BF16 PPL (+0.2%) at 51% of the size with peak MMLU. Diminishing returns above 37 GB.

2. Qwen3-30B-A3B — Multi-Benchmark (PPL + ARC-C + Winogrande)

MoE architecture, 30B total, 3B active. ARC-Challenge: 1,172 questions (25-shot). Winogrande: 7,557 questions (5-shot).

Model Size PPL ARC-C Winogrande
Min-safe (3-bit)13.4 GB10.49467.24%71.35%
RAM (default)16.3 GB8.97269.45%69.46%
Uniform 4-bit16.0 GB9.09869.54%70.32%
RAM +30%21.2 GB8.75669.45%70.09%
RAM +60%26.1 GB8.77169.80%70.48%
8-bit ref29.3 GB8.76569.88%70.88%

ARC-C spans only 2.7 pp from min-safe to 8-bit ref, while PPL spans 18% — confirming downstream benchmarks saturate at ~4-bit quality.

3. Mixtral-8x7B — Multi-Benchmark

MoE architecture, 47B total, 13B active. RAM outperforms uniform 4-bit on both ARC-C (+0.5 pp) and Winogrande (+1.3 pp) at identical 24.5 GB size.

Model Size PPL ARC-C Winogrande
Min-safe (3-bit)19.4 GB4.92667.92%80.19%
RAM (default)24.5 GB4.26470.48%81.37%
Uniform 4-bit24.5 GB4.38769.97%80.11%
RAM +30%31.9 GB4.21870.99%81.69%
RAM +60%39.2 GB4.19871.16%82.08%
8-bit ref46.2 GB4.17471.33%81.93%

4. Qwen3-8B — Dense Model (PPL + ARC-C + HellaSwag + MMLU)

Dense 8B model. HellaSwag: 10,042 questions (10-shot). MMLU: 14,015 questions (5-shot). RAM achieves identical MMLU to BF16 despite 60% size reduction.

Model Size PPL ARC-C HellaSwag MMLU
BF1615.3 GB9.72744.62%60.04%22.95%
RAM6.1 GB10.09743.43%58.16%22.95%
Uniform 4-bit4.1 GB10.24942.83%58.14%

RAM outperforms uniform 4-bit on ARC-C (+0.6 pp) and matches on HellaSwag, with identical MMLU to BF16 at 60% size reduction.

5. GLM-4.7-Flash — Min-Bits Analysis

Dense 30B model. Tests the impact of the --min-bits flag on quality. This model has extreme outlier sequences that make mean PPL unreliable — median is the correct metric.

Model Size Avg bits PPL (mean) PPL (median) Med vs BF16
Uniform 4-bit16.9 GB4.0 14.74910.076+19.0%
RAM (min-bits=3)15.1 GB4.0 10.2289.310+9.9%
RAM (min-bits=4)16.5 GB4.4 12.5418.723+3.0%
BF1658.0 GB16.0 11.5168.470baseline

min-bits=4 wins on median PPL (8.723 vs 9.310) despite min-bits=3 winning on mean PPL (10.228 vs 12.541). Mean PPL is unreliable on this model — BF16 mean (11.5) appears worse than quantized mean (10.2). Median is the correct metric.

6. Llama-4-Scout — MoE 109B

109B MoE with 16 routed experts + 1 shared expert per layer. Too large for BF16 on any consumer device (203 GB). RAM enables deployment across multiple hardware tiers from a single analysis pass.

Model Size PPL vs Uniform 4-bit
Uniform 4-bit56.9 GB7.899baseline
RAM (58 GB)58.0 GB7.628−3.4%
RAM (163 GB)163.2 GB7.359−6.8%
RAM min-safe46.9 GB8.675+9.8%
RAM no-safety (2-bit)34.6 GB23.577+198%

quality safety threshold at 9 dB prevents catastrophic 2-bit allocation. The quality cliff is between 2-bit and 3-bit (PPL triples), not between 3-bit and 4-bit (+12.5%).

7. GPTQ Head-to-Head: Data-Free vs Calibration

RAM vs GPTQ (calibration-based) at exactly matched model sizes. RAM outperforms on all three MoE families despite being entirely data-free.

Model Size GPTQ PPL (median) RAM PPL (median) Delta
Qwen3-30B-A3B16.0 GB9.160 8.959−2.2%
Qwen2-57B-A14B30.0 GB6.396 6.335−1.0%
Mixtral-8x7B24.5 GB4.640 4.426−4.6%

RAM is data-free (no calibration). GPTQ uses Hessian-based calibration with activation data. RAM consistently wins at matched sizes across all three MoE families.

8. Throughput: Tokens Per Second

Apple M2 Ultra, MLX 0.31.1, gen_len=256, median of 3 runs. The question: does mixed-precision cost speed?

Model Method Size TPS p=128 TPS p=512 TPS p=2048 TTFT p=2048
Qwen3-30BUniform17.2 GB83.380.678.31,498 ms
RAM17.2 GB80.276.975.31,509 ms
Delta−3.7%−4.6%−3.9%+0.7%
GLM-4.7-FlashUniform16.9 GB55.555.354.71,849 ms
RAM16.7 GB54.454.254.52,440 ms
Delta−2.0%−2.0%−0.4%+32.0%

Dense models: 2–4% generation overhead from mixed group sizes, negligible prefill impact on Qwen3-30B. GLM-4.7-Flash shows higher TTFT overhead due to a larger fraction of tensors at non-default configurations. Throughput cost is modest relative to the quality gains.

9. MLX Kernel Benchmark: Group Size Performance

RAM assigns group size 32 to 85% of tensors. Does the MLX quantized matmul kernel handle this efficiently? We benchmarked across MLX versions.

Scenario MLX 0.29.3 MLX 0.31.1 Status
Generation g32/g1281.07–1.14x penalty1.01–1.10xMostly fixed
Generation g64/g128~1.05x1.00xFixed
Prefill g32/g1281.8–2.2x1.00xFully fixed
Prefill g64/g128~1.3x1.00xFixed

Upstream fixes in MLX PRs #1861 and #2031 resolved the group_size performance regression. RAM’s preference for g32 is now viable at full speed.

10. Key Findings

RAM consistently outperforms uniform quantization and calibration-based GPTQ across all tested models:


All evaluations conducted on Apple M2 Ultra (192 GB unified memory). Perplexity: WikiText-2 test split, seq_len=2048, 128 sequences, seed 42. Downstream benchmarks: lm-evaluation-harness via MLX backend. ARC-Challenge (25-shot), Winogrande (5-shot), HellaSwag (10-shot), MMLU (5-shot, 14,015 questions). MLX 0.31.1, mlx_lm 0.30.4, Python 3.12.0. Full paper: huggingface.co/spaces/baa-ai/RAM. Code: github.com/baa-ai/RAM.

Read the Full Paper

The complete RAM paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full optimal allocation framework, is available on our HuggingFace:

RAM: Compute-Optimal Proprietary Compression for LLMs — Full Paper

huggingface.co/spaces/baa-ai/RAM

Licensed under CC BY-NC-ND 4.0

Continue Reading

Related research from our team.

Eight Things Our Benchmarks Reveal That Nobody Expected
RAM Research

Eight Things Our Benchmarks Reveal That Nobody Expected

Surprising findings from our benchmark suite that challenge conventional quantization wisdom.

Beyond Perplexity: Downstream Benchmarks Confirm RAM Beats All Quantization Strategies
RAM Research

Beyond Perplexity: Downstream Benchmarks Confirm RAM Beats All Quantization Strategies

Real benchmarks, real results. RAM wins across downstream tasks, not just perplexity.

View All Research