We compressed Gemma 4 31B to 31GB using RAM, our new quantization method, and ran the complete 12,032-question MMLU-Pro suite. It scored 85.2%, matching Google's published BF16 baseline. Half the size. Same intelligence.
How to read this result.
MMLU-Pro is currently the most demanding multiple-choice academic benchmark in widespread use. It spans 14 categories, from mathematics and physics through law and philosophy, and for every question it forces the model to choose between ten answer options instead of four. That change alone drops the random-guessing baseline from 25% to 10% and pushes the benchmark into territory where only genuinely capable models score well.
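To make the guessing penalty concrete, here is the chance baseline for each format:

```python
# Expected accuracy of uniform random guessing on a multiple-choice question.
def random_baseline(num_options: int) -> float:
    return 1.0 / num_options

classic = random_baseline(4)    # classic MMLU: 4 options -> 25% by chance
pro = random_baseline(10)       # MMLU-Pro: 10 options -> 10% by chance
print(f"4-option chance: {classic:.0%}, 10-option chance: {pro:.0%}")
```

At 10%, chance contributes almost nothing to a score in the mid-80s, which is why high MMLU-Pro numbers are hard to reach by luck.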
Google's published score for the full BF16 version of Gemma 4 31B on MMLU-Pro is 85.2%. The RAM-compressed 31GB version we built scored 85.2% on the exact same 12,032-question suite. This is not a graceful degradation, a partial recovery, or a selective comparison on a favourable subset — it is the same published number on the full suite. Statistically, the compressed model is indistinguishable from the full-precision one.
And it does this at half the size. The BF16 version of Gemma 4 31B weighs in at roughly 62GB, which puts it out of reach of most consumer hardware and forces you into workstation territory: 96GB-class machines, multi-GPU rigs, or cloud. The RAM version is 31GB, which runs comfortably on a 48GB unified-memory Mac. That is the entire story in one sentence: the same model intelligence, on hardware people actually own.
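The halving follows directly from bytes per parameter. BF16 stores two bytes per weight, so a 31B-parameter model (the parameter count here is assumed from the model name) occupies roughly 62GB, and a 31GB artifact implies about one byte per parameter on average:

```python
params = 31e9              # parameter count assumed from the "31B" in the name
bf16_bytes = params * 2    # BF16 = 2 bytes per weight
ram_bytes = 31 * 1e9       # published size of the compressed artifact

print(f"BF16: {bf16_bytes / 1e9:.0f} GB")                   # ~62 GB
print(f"Compressed: {ram_bytes / params:.1f} bytes/param")  # ~1.0
```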
Overall result vs baseline.
| Configuration | Size | MMLU-Pro |
|---|---|---|
| Gemma 4 31B BF16 (Google, published) | ~62 GB | 85.2% |
| Gemma 4 31B RAM 31GB (Black Sheep AI) | 31 GB | 85.2% |
Benchmarking was performed end-to-end with thinking mode enabled and a 2,048-token answer budget, against the complete MMLU-Pro test set. No subset selection, no early stopping, no scoring shortcuts.
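A minimal sketch of what such a scoring loop looks like. The actual harness is not published; `ask_model` is a hypothetical stand-in for the model call with thinking mode on and the 2,048-token answer budget, and the regex assumes the common "Answer: X" convention for MMLU-Pro's A–J options:

```python
import re

# Extract the chosen letter (A-J, optionally parenthesized) from a response.
ANSWER_RE = re.compile(r"[Aa]nswer[:\s]+\(?([A-J])\)?")

def score(questions, ask_model, max_tokens=2048):
    correct = 0
    for q in questions:
        response = ask_model(q["prompt"], max_tokens=max_tokens)
        match = ANSWER_RE.search(response)
        if match and match.group(1) == q["answer"]:
            correct += 1
    return correct / len(questions)

# Stub model for illustration: always answers "C".
questions = [{"prompt": "…", "answer": "C"}, {"prompt": "…", "answer": "D"}]
acc = score(questions, lambda p, max_tokens: "…reasoning… Answer: C")
print(f"{acc:.0%}")  # 50% on this two-question stub
```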
Complete per-category breakdown.
Here is every category on MMLU-Pro, sorted by score. These are the raw, unedited numbers from the full 12,032-question run against the RAM-compressed 31GB model. Google has not published per-category BF16 scores for Gemma 4 31B, so the overall 85.2% figure is the only direct comparison point, but the category-level distribution is published here in full for anyone who wants to scrutinize the result.
| Category | Questions | Correct | Score |
|---|---|---|---|
| Math | 1,351 | 1,274 | 94.3% |
| Biology | 717 | 665 | 92.7% |
| Physics | 1,299 | 1,167 | 89.8% |
| Business | 789 | 706 | 89.5% |
| Chemistry | 1,132 | 1,012 | 89.4% |
| Economics | 844 | 752 | 89.1% |
| Computer Science | 410 | 362 | 88.3% |
| Psychology | 798 | 678 | 85.0% |
| Philosophy | 499 | 402 | 80.6% |
| Health | 818 | 655 | 80.1% |
| Engineering | 969 | 771 | 79.6% |
| Other | 924 | 721 | 78.0% |
| History | 381 | 290 | 76.1% |
| Law | 1,101 | 792 | 71.9% |
| Overall | 12,032 | 10,247 | 85.2% |
Six categories score 89% or higher. Every category uses the full question set, no sampling.
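The table is easy to self-check: the per-category question and correct counts should reproduce both the 12,032 total and the 85.2% overall figure.

```python
# Per-category (questions, correct) counts from the table above.
counts = {
    "Math": (1351, 1274), "Biology": (717, 665), "Physics": (1299, 1167),
    "Business": (789, 706), "Chemistry": (1132, 1012), "Economics": (844, 752),
    "Computer Science": (410, 362), "Psychology": (798, 678),
    "Philosophy": (499, 402), "Health": (818, 655), "Engineering": (969, 771),
    "Other": (924, 721), "History": (381, 290), "Law": (1101, 792),
}
total_q = sum(q for q, _ in counts.values())
total_c = sum(c for _, c in counts.values())
print(total_q, total_c, f"{total_c / total_q:.1%}")  # 12032 10247 85.2%
```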
Download the raw results
Every one of the 12,032 questions, the model's answer, and whether it got it right. JSON, 2.4 MB.
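The exact field names in the published JSON are not specified here, so the keys below (`category`, `is_correct`) are assumptions to adapt; the shape of the analysis is the same either way:

```python
import json
from collections import defaultdict

def per_category_scores(records):
    """Accuracy per category from a list of per-question result dicts."""
    tallies = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for r in records:
        tallies[r["category"]][1] += 1
        tallies[r["category"]][0] += int(r["is_correct"])
    return {cat: c / n for cat, (c, n) in tallies.items()}

# With the real file: per_category_scores(json.load(open("results.json")))
sample = [{"category": "Math", "is_correct": True},
          {"category": "Math", "is_correct": False}]
print(per_category_scores(sample))  # {'Math': 0.5}
```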
Where RAM shines.
The first thing that stands out is the STEM dominance. Math at 94.3%, Biology at 92.7%, Physics at 89.8%. These are the categories where MMLU-Pro is at its most demanding and where weaker models tend to collapse. The compressed model handles them as comfortably as the full-precision one would.
- Six categories at or above 89%. Math, Biology, Physics, Business, Chemistry, and Economics all clear the 89% bar. These are the reasoning-heavy, multi-step categories that most quantization methods struggle with. RAM preserves them.
- Computer Science and Psychology close behind. At 88.3% and 85.0%, they round out an eight-category band where RAM delivers production-grade quality.
- Law at 71.9% is a Gemma-family ceiling, not a RAM artifact. Legal reasoning has been a known soft spot across the Gemma line at full precision as well. The RAM result here tracks the family, not the compression.
- No catastrophic category drops. The lowest score, Law, is still well above 70%. There are no 40% or 50% outliers, which is the usual tell of a broken quantization.
What this makes possible for your organisation.
Until now, running a frontier 31B model privately meant either paying for multi-GPU infrastructure or sending your data to an API provider. RAM compression changes that equation entirely: the same intelligence, at half the size, on hardware you can actually budget for.
Private AI is now feasible
A 31GB model fits on a single GPU instance in your AWS, Azure, or GCP VPC. Run it 10 hours a day for under $9k a year. Compare that to $50–100k+ in API fees, and every token stays in your infrastructure.
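The annual figure is simple arithmetic; the $2.40/hour rate below is a hypothetical single-GPU cloud price used for illustration, not a quote from any provider:

```python
hourly_rate = 2.40    # hypothetical single-GPU instance rate, USD/hour
hours_per_day = 10
annual = hourly_rate * hours_per_day * 365
print(f"${annual:,.0f}/year")  # $8,760/year
```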
Mac Studio as AI infrastructure
A Mac Studio with 96GB unified memory runs this model natively via MLX. One-time hardware cost of ~$8,000. No cloud dependency. On-prem, cloud-hosted, or at an employee’s desk.
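On Apple Silicon, MLX models are typically run through the `mlx-lm` package; the model identifier below is a hypothetical placeholder, not a published repository name:

```shell
# mlx-lm is Apple's MLX language-model package; the model path is illustrative.
pip install mlx-lm
mlx_lm.generate --model black-sheep-ai/gemma-4-31b-ram \
  --prompt "Summarize the key obligations in this clause." \
  --max-tokens 512
```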
Complete data sovereignty
Every token, every document, every query stays on your hardware. No third-party API touches your data. Privacy, compliance, and regulatory requirements become the architectural default.
Governed by Watchman
Every RAM-compressed model is audited by Watchman before deployment. You get a verified capability report and an LLM security scan: a structural guarantee, not just a benchmark score.
How RAM works.
RAM is Black Sheep AI's data-free post-training compression method, designed for both NVIDIA GPU and Apple Silicon deployment. It targets the specific failure modes that make most compression approaches collapse at the sizes you would actually want to deploy. The full methodology is currently under patent review and will be described in a future technical paper. This article reports benchmark results only.
Run this model in your infrastructure.
Shepherd takes RAM-compressed models from research to production in your VPC, on your Mac fleet, or in your air-gapped facility. Every deployment includes a Watchman capability audit and security scan.