We compressed Gemma 4 31B to 31GB using RAM, our new quantization method, and ran the complete 12,032-question MMLU-Pro suite. It scored 85.2%, matching Google's published BF16 baseline. Half the size. Same intelligence.
How to read this result.
MMLU-Pro is currently the most demanding multiple-choice academic benchmark in widespread use. It spans 14 categories, from mathematics and physics through law and philosophy, and for every question it forces the model to choose between ten answer options instead of four. That change alone drops the random-guessing baseline from 25% to 10% and pushes the benchmark into territory where only genuinely capable models score well.
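To make the guessing penalty concrete, here is the chance baseline for each format:

```python
# Expected accuracy of uniform random guessing on a multiple-choice question.
def random_baseline(num_options: int) -> float:
    return 1.0 / num_options

classic = random_baseline(4)    # classic MMLU: 4 options -> 25% by chance
pro = random_baseline(10)       # MMLU-Pro: 10 options -> 10% by chance
print(f"4-option chance: {classic:.0%}, 10-option chance: {pro:.0%}")
```

At 10%, chance contributes almost nothing to a score in the mid-80s, which is why high MMLU-Pro numbers are hard to reach by luck.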
Google's published score for the full BF16 version of Gemma 4 31B on MMLU-Pro is 85.2%. The RAM-compressed 31GB version we built scored 85.2% on the exact same 12,032-question suite. This is not a graceful degradation, a partial recovery, or a selective comparison on a favourable subset — it is the same published number on the full suite. Statistically, the compressed model is indistinguishable from the full-precision one.
And it does this at half the size. The BF16 version of Gemma 4 31B weighs in at roughly 62GB, which puts it out of reach of most consumer hardware and forces you into workstation territory: 96GB-class machines, multi-GPU rigs, or cloud. The RAM version is 31GB, which runs comfortably on a 48GB unified-memory Mac. That is the entire story in one sentence: the same model intelligence, on hardware people actually own.
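The halving follows directly from bytes per parameter. BF16 stores two bytes per weight, so a 31B-parameter model (the parameter count here is assumed from the model name) occupies roughly 62GB, and a 31GB artifact implies about one byte per parameter on average:

```python
params = 31e9              # parameter count assumed from the "31B" in the name
bf16_bytes = params * 2    # BF16 = 2 bytes per weight
ram_bytes = 31 * 1e9       # published size of the compressed artifact

print(f"BF16: {bf16_bytes / 1e9:.0f} GB")                   # ~62 GB
print(f"Compressed: {ram_bytes / params:.1f} bytes/param")  # ~1.0
```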
Overall result vs baseline.
| Configuration | Size | MMLU-Pro |
|---|---|---|
| Gemma 4 31B BF16 (Google, published) | ~62 GB | 85.2% |
| Gemma 4 31B RAM 31GB (Black Sheep AI) | 31 GB | 85.2% |
Benchmarking was performed end-to-end with thinking mode enabled and a 2,048-token answer budget, against the complete MMLU-Pro test set. No subset selection, no early stopping, no scoring shortcuts.
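A minimal sketch of what such a scoring loop looks like. The actual harness is not published; `ask_model` is a hypothetical stand-in for the model call with thinking mode on and the 2,048-token answer budget, and the regex assumes the common "Answer: X" convention for MMLU-Pro's A–J options:

```python
import re

# Extract the chosen letter (A-J, optionally parenthesized) from a response.
ANSWER_RE = re.compile(r"[Aa]nswer[:\s]+\(?([A-J])\)?")

def score(questions, ask_model, max_tokens=2048):
    correct = 0
    for q in questions:
        response = ask_model(q["prompt"], max_tokens=max_tokens)
        match = ANSWER_RE.search(response)
        if match and match.group(1) == q["answer"]:
            correct += 1
    return correct / len(questions)

# Stub model for illustration: always answers "C".
questions = [{"prompt": "…", "answer": "C"}, {"prompt": "…", "answer": "D"}]
acc = score(questions, lambda p, max_tokens: "…reasoning… Answer: C")
print(f"{acc:.0%}")  # 50% on this two-question stub
```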
Complete per-category breakdown.
Here is every category on MMLU-Pro, sorted by score. These are the raw, unedited numbers from the full 12,032-question run against the RAM-compressed 31GB model. Google has not published per-category BF16 scores for Gemma 4 31B, so the overall 85.2% figure is the only direct comparison point, but the category-level distribution is published here in full for anyone who wants to scrutinize the result.
| Category | Questions | Correct | Score |
|---|---|---|---|
| Math | 1,351 | 1,274 | 94.3% |
| Biology | 717 | 665 | 92.7% |
| Physics | 1,299 | 1,167 | 89.8% |
| Business | 789 | 706 | 89.5% |
| Chemistry | 1,132 | 1,012 | 89.4% |
| Economics | 844 | 752 | 89.1% |
| Computer Science | 410 | 362 | 88.3% |
| Psychology | 798 | 678 | 85.0% |
| Philosophy | 499 | 402 | 80.6% |
| Health | 818 | 655 | 80.1% |
| Engineering | 969 | 771 | 79.6% |
| Other | 924 | 721 | 78.0% |
| History | 381 | 290 | 76.1% |
| Law | 1,101 | 792 | 71.9% |
| Overall | 12,032 | 10,247 | 85.2% |
Six categories score 89% or higher. Every category uses the full question set, no sampling.
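The table is easy to self-check: the per-category question and correct counts should reproduce both the 12,032 total and the 85.2% overall figure.

```python
# Per-category (questions, correct) counts from the table above.
counts = {
    "Math": (1351, 1274), "Biology": (717, 665), "Physics": (1299, 1167),
    "Business": (789, 706), "Chemistry": (1132, 1012), "Economics": (844, 752),
    "Computer Science": (410, 362), "Psychology": (798, 678),
    "Philosophy": (499, 402), "Health": (818, 655), "Engineering": (969, 771),
    "Other": (924, 721), "History": (381, 290), "Law": (1101, 792),
}
total_q = sum(q for q, _ in counts.values())
total_c = sum(c for _, c in counts.values())
print(total_q, total_c, f"{total_c / total_q:.1%}")  # 12032 10247 85.2%
```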
Download the raw results
Every one of the 12,032 questions, the model's answer, and whether it got it right. JSON, 2.4 MB.
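The exact field names in the published JSON are not specified here, so the keys below (`category`, `is_correct`) are assumptions to adapt; the shape of the analysis is the same either way:

```python
import json
from collections import defaultdict

def per_category_scores(records):
    """Accuracy per category from a list of per-question result dicts."""
    tallies = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for r in records:
        tallies[r["category"]][1] += 1
        tallies[r["category"]][0] += int(r["is_correct"])
    return {cat: c / n for cat, (c, n) in tallies.items()}

# With the real file: per_category_scores(json.load(open("results.json")))
sample = [{"category": "Math", "is_correct": True},
          {"category": "Math", "is_correct": False}]
print(per_category_scores(sample))  # {'Math': 0.5}
```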
Where RAM shines.
The first thing that stands out is the STEM dominance. Math at 94.3%, Biology at 92.7%, Physics at 89.8%. These are the categories where MMLU-Pro is at its most demanding and where weaker models tend to collapse. The compressed model handles them as comfortably as the full-precision one would.
- Six categories at or above 89%. Math, Biology, Physics, Business, Chemistry, and Economics all clear the 89% bar. These are the reasoning-heavy, multi-step categories that most quantization methods struggle with. RAM preserves them.
- Computer Science and Psychology close behind. At 88.3% and 85.0%, they round out an eight-category band where RAM delivers production-grade quality.
- Law at 71.9% is a Gemma-family ceiling, not a RAM artifact. Legal reasoning has been a known soft spot across the Gemma line at full precision as well. The RAM result here tracks the family, not the compression.
- No catastrophic category drops. The lowest score, Law, is still well above 70%. There are no 40% or 50% outliers, which is the usual tell of a broken quantization.
What this makes possible for your organisation.
Until now, running a frontier 31B model privately meant either paying for multi-GPU infrastructure or sending your data to an API provider. RAM compression changes that equation entirely: the same intelligence, at half the size, on hardware you can actually budget for.
Private AI is now feasible
A 31GB model fits on a single GPU instance in your AWS, Azure, or GCP VPC. Run it 10 hours a day for under $9k a year. Compare that to $50–100k+ in API fees, and every token stays in your infrastructure.
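The annual figure is simple arithmetic; the $2.40/hour rate below is a hypothetical single-GPU cloud price used for illustration, not a quote from any provider:

```python
hourly_rate = 2.40    # hypothetical single-GPU instance rate, USD/hour
hours_per_day = 10
annual = hourly_rate * hours_per_day * 365
print(f"${annual:,.0f}/year")  # $8,760/year
```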
Mac Studio as AI infrastructure
A Mac Studio with 96GB unified memory runs this model natively via MLX. One-time hardware cost of ~$8,000. No cloud dependency. On-prem, cloud-hosted, or at an employee’s desk.
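On Apple Silicon, MLX models are typically run through the `mlx-lm` package; the model identifier below is a hypothetical placeholder, not a published repository name:

```shell
# mlx-lm is Apple's MLX language-model package; the model path is illustrative.
pip install mlx-lm
mlx_lm.generate --model black-sheep-ai/gemma-4-31b-ram \
  --prompt "Summarize the key obligations in this clause." \
  --max-tokens 512
```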
Complete data sovereignty
Every token, every document, every query stays on your hardware. No third-party API touches your data. Privacy, compliance, and regulatory requirements become the architectural default.
Governed by Watchman
Every RAM-compressed model is audited by Watchman before deployment. You get a verified capability report and an LLM security scan: a structural guarantee, not just a benchmark score.
How RAM works.
RAM is Black Sheep AI's data-free post-training compression method, designed for both NVIDIA GPU and Apple Silicon deployment. It targets the specific failure modes that make most compression approaches collapse at the sizes you would actually want to deploy. The full methodology is currently under patent review and will be described in a future technical paper. This article reports benchmark results only.
Run this model in your infrastructure.
Shepherd takes RAM-compressed models from research to production in your VPC, on your Mac fleet, or in your air-gapped facility. Every deployment includes a Watchman capability audit and security scan.