Beyond Perplexity: Downstream Benchmarks Confirm RAM Beats All Quantization Strategies
RAM Research

Beyond Perplexity: Downstream Benchmarks Confirm RAM Beats All Quantization Strategies

March 2026 · Black Sheep AI Research

A perplexity improvement means nothing if it doesn’t translate to real tasks. We tested RAM on ARC-Challenge and Winogrande to find out.

The Gap Between Proxy and Reality

Perplexity is the universal metric for evaluating quantized models. It measures how well a model predicts the next token in a held-out corpus. But practitioners don’t deploy models to predict WikiText-2—they deploy them to answer questions, reason about problems, and follow instructions. The question that matters: do RAM’s perplexity gains translate to real downstream performance?

The Benchmarks

We evaluated RAM against uniform 4-bit quantization on two standard reasoning benchmarks using lm-evaluation-harness via the MLX backend:

Both benchmarks are standard in the LLM evaluation literature and test capabilities that matter for real deployments.

The Results

Model Method Size (GB) ARC-C (acc_norm) Winogrande (acc)
Qwen3-30B-A3B RAM 16.3 69.5 ± 1.3 69.5 ± 1.3
Qwen3-30B-A3B Uniform 4-bit 16.0 69.5 ± 1.3 70.3 ± 1.3
Mixtral-8x7B RAM 24.5 70.5 ± 1.3 81.4 ± 1.1
Mixtral-8x7B Uniform 4-bit 24.5 70.0 ± 1.3 80.1 ± 1.1

Mixtral: RAM Wins on Both Tasks

On Mixtral-8x7B, RAM outperforms uniform 4-bit on both benchmarks: +0.5 percentage points on ARC-Challenge and +1.3 percentage points on Winogrande, at exactly the same model size (24.5 GB). This is consistent with RAM’s large PPL advantage on this model (-4.6% median perplexity). The mixed-precision allocation that the proprietary allocator produces—giving sensitive tensors finer group sizes and higher bit-widths—translates directly to better reasoning performance.

Qwen3-30B: Statistically Indistinguishable

On Qwen3-30B-A3B, both methods are statistically indistinguishable on both tasks. This is expected: RAM’s PPL advantage over uniform 4-bit is much narrower on this model (+2.3% vs +10.3%—RAM recovers 78% of the BF16-to-4bit gap, but the remaining gap is small). When the perplexity difference is within a few percent, downstream task accuracy converges.

The key observation: RAM never sacrifices downstream accuracy for its perplexity gains. The mixed-precision allocation does not create pathological behaviour on reasoning tasks—it either matches or improves upon uniform quantization.

Why Perplexity Remains a Good Proxy

These results validate that perplexity is a reliable proxy for task performance in the quantization regime. When RAM achieves -4.6% lower PPL (Mixtral), it also achieves better task scores. When the PPL gap is narrow (Qwen3-30B), task scores are indistinguishable. The mapping from perplexity to downstream accuracy is monotonic and predictable.

This is important because perplexity evaluation is fast (minutes) while full benchmark evaluation is slow (hours). Practitioners can use perplexity as a reliable filter during development and run downstream benchmarks as final validation.

What This Means for Deployment

If you are deploying a quantized model and your selection criterion is “best task accuracy at a given memory budget,” RAM matches or beats uniform quantization on every test we ran. The mixed-precision allocation that optimizes perplexity also optimizes—or at minimum preserves—the capabilities that matter for real applications.


Data from the RAM paper: ‘RAM: Budget-Aware Proprietary Compression for Large Language Models via Rate-Distortion Optimization’ (baa.ai, 2026). Benchmarks evaluated using lm-evaluation-harness via MLX backend. ARC-Challenge: 25-shot, normalized accuracy. Winogrande: 5-shot, standard accuracy. Full paper at huggingface.co/spaces/baa-ai/RAM. Code at github.com/baa-ai/RAM.

Read the Full Paper

The complete RAM paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full optimal allocation framework, is available on our HuggingFace:

RAM: Compute-Optimal Proprietary Compression for LLMs — Full Paper

huggingface.co/spaces/baa-ai/RAM

Licensed under CC BY-NC-ND 4.0

Continue Reading

Related research from our team.

RAM Benchmark Results: 7 Models, 40,000+ Questions, One Winner
RAM Research

RAM Benchmark Results: 7 Models, 40,000+ Questions, One Winner

Comprehensive benchmark results across 7 model families and 40,000+ questions.

When Data-Free Beats the Gold Standard
RAM Research

When Data-Free Beats the Gold Standard

RAM outperforms GPTQ by up to 4.6% while using 72% less memory — without any calibration data.

Eight Things Our Benchmarks Reveal That Nobody Expected
RAM Research

Eight Things Our Benchmarks Reveal That Nobody Expected

Surprising findings from our benchmark suite that challenge conventional quantization wisdom.

View All Research