A perplexity improvement means nothing if it doesn’t translate to real tasks. We tested RAM on ARC-Challenge and Winogrande to find out.
The Gap Between Proxy and Reality
Perplexity is the universal metric for evaluating quantized models. It measures how well a model predicts the next token in a held-out corpus. But practitioners don’t deploy models to predict WikiText-2—they deploy them to answer questions, reason about problems, and follow instructions. The question that matters: do RAM’s perplexity gains translate to real downstream performance?
The Benchmarks
We evaluated RAM against uniform 4-bit quantization on two standard reasoning benchmarks using lm-evaluation-harness via the MLX backend:
- ARC-Challenge (25-shot): Science questions requiring multi-step reasoning. Normalized accuracy.
- Winogrande (5-shot): Commonsense reasoning via Winograd schema challenges. Standard accuracy.
Both benchmarks are standard in the LLM evaluation literature and test capabilities that matter for real deployments.
The Results
| Model | Method | Size (GB) | ARC-C (acc_norm) | Winogrande (acc) |
|---|---|---|---|---|
| Qwen3-30B-A3B | RAM | 16.3 | 69.5 ± 1.3 | 69.5 ± 1.3 |
| Qwen3-30B-A3B | Uniform 4-bit | 16.0 | 69.5 ± 1.3 | 70.3 ± 1.3 |
| Mixtral-8x7B | RAM | 24.5 | 70.5 ± 1.3 | 81.4 ± 1.1 |
| Mixtral-8x7B | Uniform 4-bit | 24.5 | 70.0 ± 1.3 | 80.1 ± 1.1 |
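One quick way to read the ± columns is to compare each accuracy difference to its combined standard error. The sketch below does this for two rows from the table above; it is a rough heuristic, not a substitute for a proper significance test:

```python
import math

def z_score(acc_a, se_a, acc_b, se_b):
    """Difference between two accuracies, in units of combined standard error."""
    return (acc_a - acc_b) / math.sqrt(se_a**2 + se_b**2)

# Mixtral-8x7B, Winogrande: RAM 81.4 +/- 1.1 vs uniform 4-bit 80.1 +/- 1.1
z_mixtral = z_score(81.4, 1.1, 80.1, 1.1)  # ~0.84 combined SEs
# Qwen3-30B-A3B, Winogrande: RAM 69.5 +/- 1.3 vs uniform 4-bit 70.3 +/- 1.3
z_qwen = z_score(69.5, 1.3, 70.3, 1.3)     # ~-0.44 combined SEs
```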
Mixtral: RAM Wins on Both Tasks
On Mixtral-8x7B, RAM outperforms uniform 4-bit on both benchmarks: +0.5 percentage points on ARC-Challenge and +1.3 percentage points on Winogrande, at exactly the same model size (24.5 GB). This is consistent with RAM’s large PPL advantage on this model (-4.6% median perplexity). The mixed-precision allocation that the proprietary allocator produces—giving sensitive tensors finer group sizes and higher bit-widths—translates directly to better reasoning performance.
Qwen3-30B: Statistically Indistinguishable
On Qwen3-30B-A3B, both methods are statistically indistinguishable on both tasks. This is expected: RAM’s PPL advantage over uniform 4-bit is much narrower on this model (a +2.3% PPL increase over BF16 versus +10.3%, so RAM recovers 78% of the BF16-to-4-bit gap, but the remaining gap being closed is small). When the perplexity difference is within a few percent, downstream task accuracy converges.
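The 78% figure follows directly from the two perplexity deltas quoted above (a quick arithmetic check, nothing more):

```python
uniform_ppl_increase = 10.3  # uniform 4-bit PPL, % above BF16
ram_ppl_increase = 2.3       # RAM PPL, % above BF16

# Fraction of the BF16-to-4-bit perplexity gap that RAM closes.
recovered = (uniform_ppl_increase - ram_ppl_increase) / uniform_ppl_increase
print(f"{recovered:.0%}")  # -> 78%
```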
The key observation: RAM never sacrifices downstream accuracy for its perplexity gains. The mixed-precision allocation does not create pathological behavior on reasoning tasks: it either matches or improves upon uniform quantization.
Why Perplexity Remains a Good Proxy
These results validate that perplexity is a reliable proxy for task performance in this quantization regime. When RAM achieves a 4.6% lower PPL (Mixtral), it also achieves better task scores. When the PPL gap is narrow (Qwen3-30B), task scores are indistinguishable. Across the models tested, the mapping from perplexity to downstream accuracy is monotonic and predictable.
This is important because perplexity evaluation is fast (minutes) while full benchmark evaluation is slow (hours). Practitioners can use perplexity as a reliable filter during development and run downstream benchmarks as final validation.
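That two-stage workflow can be sketched as follows. The helper names `eval_ppl` and `run_benchmarks` are hypothetical stand-ins for your own tooling, not part of RAM or lm-evaluation-harness:

```python
def select_configs(configs, eval_ppl, run_benchmarks, top_k=3):
    """Two-stage evaluation: cheap perplexity filter, then full benchmarks.

    eval_ppl(cfg) -> float        fast perplexity measurement (minutes)
    run_benchmarks(cfg) -> result slow downstream evaluation (hours)
    """
    # Stage 1: rank all candidate quantization configs by perplexity.
    ranked = sorted(configs, key=eval_ppl)
    # Stage 2: spend benchmark hours only on the top few candidates.
    return {cfg: run_benchmarks(cfg) for cfg in ranked[:top_k]}
```

Only the `top_k` lowest-perplexity candidates ever reach the expensive benchmark stage, which is exactly the filter-then-validate pattern described above.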
What This Means for Deployment
If you are deploying a quantized model and your selection criterion is “best task accuracy at a given memory budget,” RAM matches or beats uniform quantization on every test we ran. The mixed-precision allocation that optimizes perplexity also optimizes—or at minimum preserves—the capabilities that matter for real applications.
Data from the RAM paper: ‘RAM: Budget-Aware Proprietary Compression for Large Language Models via Rate-Distortion Optimization’ (baa.ai, 2026). Benchmarks evaluated using lm-evaluation-harness via MLX backend. ARC-Challenge: 25-shot, normalized accuracy. Winogrande: 5-shot, standard accuracy. Full paper at huggingface.co/spaces/baa-ai/RAM. Code at github.com/baa-ai/RAM.
Read the Full Paper
The complete RAM paper, including formal derivations, benchmark results across 7 model families and 40,000+ questions, and the full optimal allocation framework, is available on our HuggingFace:
RAM: Compute-Optimal Proprietary Compression for LLMs — Full Paper
huggingface.co/spaces/baa-ai/RAM
Licensed under CC BY-NC-ND 4.0