Research Report

Gemma 4 at Half the Size, But Full Performance.

April 2026 · Black Sheep AI Research

We compressed Gemma 4 31B to 31GB using RAM, our new quantization method, and ran the complete 12,032-question MMLU-Pro suite. It scored 85.2%, matching Google's published BF16 baseline. Half the size. Same intelligence.

85.2% MMLU-Pro overall · 31 GB (50% of BF16 size) · 85.2% BF16 baseline · 12,032 questions graded

How to read this result.

MMLU-Pro is currently the most demanding multiple-choice academic benchmark in widespread use. It spans 14 categories, from mathematics and physics through law and philosophy, and for every question it forces the model to choose between ten answer options instead of four. That change alone cuts the random-guessing baseline from 25% to 10% and pushes the benchmark into territory where only genuinely capable models score well.

Google's published score for the full BF16 version of Gemma 4 31B on MMLU-Pro is 85.2%. The RAM-compressed 31GB version we built scored 85.2% on the exact same 12,032-question suite. This is not a graceful degradation, a partial recovery, or a selective comparison on a favourable subset: it is the same published number on the full suite. Statistically, the compressed model is indistinguishable from the full-precision one.
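That claim is easy to sanity-check. With 12,032 graded questions, the 95% confidence interval on an 85.2% score is roughly ±0.6 percentage points. A minimal sketch of the arithmetic, using only the numbers reported in this article:

```python
import math

n = 12_032           # questions graded
p = 10_247 / n       # observed accuracy of the RAM 31GB model (~85.2%)

# Normal-approximation standard error for a binomial proportion.
se = math.sqrt(p * (1 - p) / n)

# 95% confidence interval: roughly 84.5% to 85.8%.
lo, hi = p - 1.96 * se, p + 1.96 * se
print(f"{p:.1%}  (95% CI {lo:.1%} to {hi:.1%})")
```

The published BF16 figure of 85.2% sits in the middle of that interval, so no difference between the two models is detectable at this sample size.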

And it does this at half the size. The BF16 version of Gemma 4 31B weighs in at roughly 62GB, which puts it out of reach of most consumer hardware and forces you into workstation territory: 96GB-class machines, multi-GPU rigs, or cloud. The RAM version is 31GB, which runs comfortably on a 48GB unified-memory Mac. That is the entire story in one sentence: the same model intelligence, on hardware people actually own.
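The size figures themselves are plain bytes-per-parameter arithmetic: BF16 stores every weight in 2 bytes, so 31 billion parameters need roughly 62GB, and halving that to 31GB works out to an average of one byte per weight. A back-of-the-envelope sketch (decimal gigabytes, weights only, ignoring KV cache and runtime overhead, which is why a 48GB machine is quoted rather than a 32GB one):

```python
params = 31e9  # Gemma 4 31B parameter count

bf16_gb = params * 2 / 1e9  # BF16: 2 bytes per weight -> ~62 GB
ram_gb  = params * 1 / 1e9  # RAM: ~1 byte per weight on average -> ~31 GB

print(f"BF16 ~{bf16_gb:.0f} GB, RAM-compressed ~{ram_gb:.0f} GB")
```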

Overall result vs baseline.

Configuration | Size | MMLU-Pro
Gemma 4 31B BF16 (Google, published) | ~62 GB | 85.2%
Gemma 4 31B RAM 31GB (Black Sheep AI) | 31 GB | 85.2%

Benchmarking was performed end-to-end with thinking mode enabled and a 2,048-token answer budget, against the complete MMLU-Pro test set. No subset selection, no early stopping, no scoring shortcuts.
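For readers who want to replicate the protocol, the grading loop itself is conceptually simple. The sketch below is not Black Sheep AI's harness; it assumes the standard MMLU-Pro fields (question, options, answer) and a hypothetical generate() callable wrapping whatever inference stack you use:

```python
import json
import re

def grade(dataset_path: str, generate) -> float:
    """Grade a model on MMLU-Pro: up to ten options, one letter per answer."""
    rows = [json.loads(line) for line in open(dataset_path)]
    correct = 0
    for row in rows:
        letters = "ABCDEFGHIJ"[: len(row["options"])]
        choices = "\n".join(f"{l}. {o}" for l, o in zip(letters, row["options"]))
        prompt = f"{row['question']}\n\n{choices}\n\nAnswer with a single letter."
        reply = generate(prompt, max_tokens=2048)  # 2,048-token answer budget
        found = re.findall(r"\b([A-J])\b", reply)
        predicted = found[-1] if found else None   # last letter the model commits to
        correct += predicted == row["answer"]
    return correct / len(rows)
```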

Complete per-category breakdown.

Here is every category on MMLU-Pro, sorted by score. These are the raw, unedited numbers from the full 12,032-question run against the RAM-compressed 31GB model. Google has not published per-category BF16 scores for Gemma 4 31B, so the overall 85.2% figure is the only direct comparison point, but the category-level distribution is published here in full for anyone who wants to scrutinise the result.

Category | Questions | Correct | Score
Math | 1,351 | 1,274 | 94.3%
Biology | 717 | 665 | 92.7%
Physics | 1,299 | 1,167 | 89.8%
Business | 789 | 706 | 89.5%
Chemistry | 1,132 | 1,012 | 89.4%
Economics | 844 | 752 | 89.1%
Computer Science | 410 | 362 | 88.3%
Psychology | 798 | 678 | 85.0%
Philosophy | 499 | 402 | 80.6%
Health | 818 | 655 | 80.1%
Engineering | 969 | 771 | 79.6%
Other | 924 | 721 | 78.0%
History | 381 | 290 | 76.1%
Law | 1,101 | 792 | 71.9%
Overall | 12,032 | 10,247 | 85.2%

Six categories score 89% or higher. Every category uses the full question set; no sampling.

Download the raw results

Every one of the 12,032 questions, the model's answer, and whether it got it right. JSON, 2.4 MB.
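If you want to rebuild the per-category table above from the download, it is a few lines of aggregation. The field names here (category, correct) are assumptions about the JSON layout; adjust them to match the actual file:

```python
import json
from collections import Counter

records = json.load(open("ram_gemma4_31b_mmlu_pro.json"))  # hypothetical filename

total, right = Counter(), Counter()
for r in records:
    total[r["category"]] += 1
    right[r["category"]] += bool(r["correct"])  # assumed per-question boolean

for cat in sorted(total, key=lambda c: right[c] / total[c], reverse=True):
    print(f"{cat:<18} {right[cat]:>5}/{total[cat]:<5} {right[cat] / total[cat]:.1%}")
```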


Where RAM shines.

The first thing that stands out is the STEM dominance: Math at 94.3%, Biology at 92.7%, Physics at 89.8%. These are the categories where MMLU-Pro is at its most demanding and where weaker models tend to collapse. The compressed model handles them as comfortably as the full-precision one would.

What this makes possible for your organisation.

Until now, running a frontier 31B model privately meant either paying for multi-GPU infrastructure or sending your data to an API provider. RAM compression changes that equation entirely: the same intelligence, at half the size, on hardware you can actually budget for.

Private AI is now feasible

A 31GB model fits on a single GPU instance in your AWS, Azure, or GCP VPC. Run it 10 hours a day for under $9k a year. Compare that to $50–100k+ in API fees, and every token stays in your infrastructure.
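The sub-$9k figure is utilisation arithmetic: 10 hours a day is about 3,650 GPU-hours a year, so any single-GPU instance under roughly $2.45/hour lands below it. A sketch, with the hourly rate as an illustrative assumption rather than a quoted price:

```python
hours_per_year = 10 * 365   # 10 hours/day, every day
rate_per_hour = 2.40        # assumed $/hr for a single-GPU cloud instance

annual_cost = hours_per_year * rate_per_hour
print(f"~${annual_cost:,.0f}/yr")  # ~$8,760, under the $9k figure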

Mac Studio as AI infrastructure

A Mac Studio with 96GB unified memory runs this model natively via MLX. One-time hardware cost of ~$8,000. No cloud dependency. On-prem, cloud-hosted, or at an employee’s desk.
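Serving a model like this on Apple Silicon takes a few lines with the open-source mlx-lm package. The checkpoint identifier below is a placeholder, not a published model name:

```python
# pip install mlx-lm   (requires Apple Silicon)
from mlx_lm import load, generate

# Placeholder identifier; substitute the actual RAM-compressed checkpoint.
model, tokenizer = load("blacksheep-ai/gemma-4-31b-ram")

reply = generate(model, tokenizer,
                 prompt="Summarise MMLU-Pro in one sentence.",
                 max_tokens=128)
print(reply)
```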

Complete data sovereignty

Every token, every document, every query stays on your hardware. No third-party API touches your data. Privacy, compliance, and regulatory requirements become the architectural default.

Governed by Watchman

Every RAM-compressed model is audited by Watchman before deployment. You get a verified capability report and an LLM security scan: not just a benchmark score, but a structural guarantee.

How RAM works.

RAM is Black Sheep AI's data-free post-training compression method, designed for both NVIDIA GPU and Apple Silicon deployment. It targets the specific failure modes that make most compression approaches collapse at the sizes you would actually want to deploy. The full methodology is currently under patent review and will be described in a future technical paper. This article reports benchmark results only.

Run this model in your infrastructure.

Shepherd takes RAM-compressed models from research to production in your VPC, on your Mac fleet, or in your air-gapped facility. Every deployment includes a Watchman capability audit and security scan.

AWS: g7e.2xlarge, from ~$2,700/yr on spot
Azure: NC24ads A100 v4, from ~$2,900/yr low-priority
GCP: a2-highgpu-1g, from ~$2,900/yr preemptible
Mac: Mac Studio M4 Ultra, ~$8,000 one-time