GPQA-Diamond is the strongest reasoning benchmark we have for graduate-level science Q&A. It's also small (198 questions), 4-way multiple choice, and carries a 95% confidence interval of about ±7 percentage points for a single run, or about ±4 pp under a sane multi-seed protocol. If your model-A vs model-B comparison shows a 2 pp difference, you have not measured anything.
The arithmetic
GPQA-Diamond has 198 graduate-level science multiple-choice questions, each with 4 candidate answers (A/B/C/D). Random guessing gives 25% accuracy. A typical strong model like Claude 3.7 or GPT-4o lands in the 50-65% range; a weak quantized variant might land closer to 35%.
Treat each question as an independent Bernoulli trial. With n=198 and p ≈ 0.5 (the worst-case for SE), the standard error of the proportion is:
SE = sqrt(p(1-p)/n) = sqrt(0.5 × 0.5 / 198) = 0.0355
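The 95% confidence interval is roughly ±2 × SE = ±7.1 pp for a single eval run. That's enormous. Even with 3 evaluation seeds (594 trials total), the interval narrows only to about ±4 pp, and that figure treats the three shuffles of each question as independent trials, which is optimistic because they reuse the same 198 questions.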
If you run model A and model B both at single-seed GPQA-Diamond and see A score 47% and B score 50%, you have not measured a real difference. You've measured the noise floor.
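The arithmetic above can be checked in a few lines. This is a minimal sketch under the same independent-Bernoulli assumption used in the text; the function names are illustrative and not part of any published harness.

```python
import math

def standard_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)

def ci_half_width(p: float, n: int, z: float = 2.0) -> float:
    """Half-width of the approximate 95% interval (z = 2, as in the text)."""
    return z * standard_error(p, n)

def two_arm_z(acc_a: float, acc_b: float, n: int) -> float:
    """z-score of the gap between two arms, each scored on n independent questions."""
    se_diff = math.sqrt(standard_error(acc_a, n) ** 2 + standard_error(acc_b, n) ** 2)
    return (acc_b - acc_a) / se_diff

if __name__ == "__main__":
    print(f"single run : +/-{100 * ci_half_width(0.5, 198):.1f} pp")
    print(f"three seeds: +/-{100 * ci_half_width(0.5, 594):.1f} pp (if independent)")
    print(f"47% vs 50% : z = {two_arm_z(0.47, 0.50, 198):.2f}")
```

The 47% vs 50% gap comes out well under one standard error of the difference, which is why a single-seed comparison of that size carries no signal.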
Why this is worse than it sounds
GPQA-Diamond's noise compounds badly with three common eval mistakes:
Mistake 1: temperature 0 / greedy decoding.
Greedy decoding makes the eval deterministic given a prompt. That sounds like it removes noise, but it only hides it: the run-to-run variance disappears while the question-level variance stays, and there is no longer any averaging across samples to smooth it out. A single prompt that the model misreads costs a full 0.5 pp (1 of 198 questions), every time.
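A deterministic run still carries the uncertainty that comes from having only 198 questions. One way to see this is a bootstrap over per-question outcomes. The sketch below assumes a hypothetical greedy run that got exactly half the questions right; `bootstrap_accuracy_interval` is an illustrative helper, not a function from any eval library.

```python
import random

def bootstrap_accuracy_interval(outcomes, n_resamples=2000, seed=0):
    """Percentile 95% interval for accuracy, resampling questions with replacement."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return lo, hi

if __name__ == "__main__":
    # Hypothetical greedy run: 99 correct, 99 wrong, fully repeatable.
    outcomes = [1] * 99 + [0] * 99
    lo, hi = bootstrap_accuracy_interval(outcomes)
    print(f"accuracy 50.0%, bootstrap 95% interval: {100 * lo:.1f}% to {100 * hi:.1f}%")
```

Rerunning the greedy eval would reproduce the same 50%, but the interval shows how far that number could move on a different draw of 198 comparable questions.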
Mistake 2: thinking mode for models that don't close the thinking tag.
Many recent models (Qwen3, Claude 3.7, DeepSeek-V4) have a "thinking" sampling mode that emits scratch-pad text before the final answer. If the thinking tag never closes, the model never emits a final answer (a \boxed{} or an answer letter, depending on the harness), and your evaluator scores it as wrong on a question it might have answered correctly. We've seen 35-second runs balloon to multi-minute runs where the model never decides on an answer. Use non-thinking mode for benchmarks unless you've explicitly tuned the stop conditions.
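If you do run thinking mode, it helps to separate "no final answer" from "wrong answer" in the scorer so that truncated generations are visible instead of silently counted as misses. The following is a minimal sketch: the `<think>` / `</think>` tag names and the answer patterns are assumptions that vary by model and prompt template, and `extract_choice` is a hypothetical helper.

```python
import re
from typing import Optional

THINK_OPEN = "<think>"     # assumed tag names; check your model's template
THINK_CLOSE = "</think>"

# "Answer: (C)", "answer is D", and similar; the lookahead rejects words like "Definitely".
ANSWER_RE = re.compile(r"[Aa]nswer(?:\s+is)?\s*:?\s*\(?([ABCD])\)?(?![A-Za-z])")
# A completion that is nothing but the letter, e.g. " (B)" after an "Answer:" prompt.
BARE_RE = re.compile(r"\s*\(?([ABCD])\)?\s*\.?\s*")

def extract_choice(completion: str) -> Optional[str]:
    """Return 'A'..'D', or None when the model never produced a final answer."""
    if THINK_CLOSE in completion:
        # Only score text after the last closing tag.
        completion = completion.rsplit(THINK_CLOSE, 1)[1]
    elif THINK_OPEN in completion:
        # Thinking started but never closed: truncated, not a wrong answer.
        return None
    bare = BARE_RE.fullmatch(completion)
    if bare:
        return bare.group(1)
    found = ANSWER_RE.findall(completion)
    return found[-1] if found else None
```

Counting the `None` results separately gives a quick check on whether the stop conditions are tuned: a high no-answer rate points to a harness problem, not a reasoning problem. Note that templates which place the opening tag in the prompt need a different truncation check, since the completion will then contain no opening tag at all.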
Mistake 3: not shuffling answer choices.
GPQA-Diamond ships with the correct answer in a fixed first position, by convention of the dataset format, so a harness that maps the fields straight onto A/B/C/D makes A the correct answer every time. Models that have any positional bias (every model has some) will get distorted scores if you don't shuffle, and inflated ones if the bias favors A. Use 3 property-shuffle seeds, randomize the answer permutation independently for each question within each seed, and report the mean of the three runs. This is the standard noise-floor characterization.
What "3 property-shuffle seeds" looks like
For each question, you have 4 candidate answers, with the correct one at index 0 by convention:
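import random

# Seed on (seed, question_index) so every question gets its own permutation;
# random.Random(seed) alone would reuse one permutation for all 198 questions.
# question_index is a hypothetical name for the question's position in the dataset.
rng = random.Random(f"{seed}:{question_index}")
choices = [correct, wrong_1, wrong_2, wrong_3]
order = list(range(4))
rng.shuffle(order)
shuffled = [choices[k] for k in order]
correct_position = order.index(0)  # where the correct answer ended up
options_text = "\n".join(f"({chr(65 + j)}) {a}" for j, a in enumerate(shuffled))
prompt = f"Question: {q['Question']}\n\n{options_text}\n\nAnswer:"
# Run model, extract chosen letter, compare to chr(65 + correct_position)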
Run this with seed in {0, 1, 2}. Three different shufflings produce three different "scores" on the same model. Report the mean of the three; report the range as the noise band.
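Put together, the three-seed protocol might look like the sketch below. It is a self-contained illustration, not the public harness: the question field names (`question`, `correct`, `wrong`), the `ask_model` callable, and the toy data are all assumptions made so the example runs without a model.

```python
import random
from statistics import mean

LETTERS = "ABCD"

def shuffled_question(q, seed, q_index):
    """Return (prompt, correct_letter) for one question under one shuffle seed."""
    choices = [q["correct"], *q["wrong"]]  # correct answer at index 0 by convention
    order = list(range(4))
    # Seed on (seed, q_index) so each question gets its own permutation.
    random.Random(f"{seed}:{q_index}").shuffle(order)
    options = "\n".join(f"({LETTERS[j]}) {choices[k]}" for j, k in enumerate(order))
    prompt = f"Question: {q['question']}\n\n{options}\n\nAnswer:"
    return prompt, LETTERS[order.index(0)]

def run_protocol(questions, ask_model, seeds=(0, 1, 2)):
    """Score ask_model (prompt -> letter) once per seed; return scores, mean, range."""
    scores = []
    for seed in seeds:
        hits = 0
        for i, q in enumerate(questions):
            prompt, gold = shuffled_question(q, seed, i)
            hits += ask_model(prompt) == gold
        scores.append(hits / len(questions))
    return {"scores": scores, "mean": mean(scores), "range": max(scores) - min(scores)}

if __name__ == "__main__":
    toy = [
        {"question": f"Q{i}", "correct": "right", "wrong": ["w1", "w2", "w3"]}
        for i in range(198)
    ]
    always_a = lambda prompt: "A"  # stand-in for a model with total positional bias
    print(run_protocol(toy, always_a))
```

The always-A stand-in would score 100% on an unshuffled harness; under the shuffle it falls to roughly chance, which is the point of the protocol.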
What you can claim and what you can't
With n=198 × 3 seeds = 594 trials per arm:
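- A 4 pp difference between two arms is at the edge of significance at best (Mann-Whitney U on per-question outcomes; p ≈ 0.05). That figure leans on both arms sharing the same questions: treated as fully independent arms, the SE of the difference is about √2 × 2.05 ≈ 2.9 pp, which puts 4 pp near 1.4σ.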
- A 2 pp difference is well within the noise floor. You cannot claim it.
- A 6+ pp difference is robust at the per-question level even allowing for cross-shuffle correlation.
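Because both arms answer the same questions, a paired test uses the data more efficiently than comparing two accuracies. The sketch below swaps in an exact McNemar test, a standard paired test for binary outcomes, in place of the Mann-Whitney U mentioned above; it is an illustration rather than the procedure behind the numbers in this post. The example counts are hypothetical, and pooling three shuffles as 594 separate pairs is itself optimistic.

```python
from math import comb

def mcnemar_exact(outcomes_a, outcomes_b):
    """Two-sided exact McNemar p-value for paired 0/1 outcomes on the same trials."""
    only_a = sum(1 for a, b in zip(outcomes_a, outcomes_b) if a and not b)
    only_b = sum(1 for a, b in zip(outcomes_a, outcomes_b) if b and not a)
    n = only_a + only_b  # discordant pairs are the only informative ones
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

if __name__ == "__main__":
    # Hypothetical 594 paired trials: A alone right on 60, B alone right on 84,
    # both right on 300, both wrong on 150. Net gap: 24 / 594, about 4 pp.
    a = [1] * 60 + [0] * 84 + [1] * 300 + [0] * 150
    b = [0] * 60 + [1] * 84 + [1] * 300 + [0] * 150
    print(f"gap = {100 * (sum(b) - sum(a)) / len(a):.1f} pp, p = {mcnemar_exact(a, b):.3f}")
```

The p-value depends on how many trials the two models disagree on, not just on the gap: the same 4 pp spread over more discordant pairs is weaker evidence.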
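For LLM compression / fine-tuning experiments where you expect small effect sizes (sub-3 pp on typical benchmarks), GPQA-Diamond is not the right benchmark. It's too small. Use MATH-500 (500 problems, exact-match scoring, no multiple-choice guessing noise) or MMLU-Pro 200Q stratified for tighter noise floors at similar reasoning depth. Note that a 200-question subset has the same binomial SE as GPQA-Diamond, so any gain there comes from the 10-option answer format and the stratification, not from sample size.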
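To see how far 198 questions are from resolving small effects, you can invert the standard-error formula and ask how many trials per arm a given gap needs. This is a rough sketch for two unpaired arms at the worst-case p = 0.5, reaching z = 1.96 only (no power calculation); `questions_needed` is an illustrative name.

```python
import math

def questions_needed(effect_pp: float, p: float = 0.5, z: float = 1.96) -> int:
    """Trials per arm for a gap of effect_pp to reach z standard errors (unpaired)."""
    effect = effect_pp / 100.0
    # SE of the difference between two arms is sqrt(2 * p * (1 - p) / n).
    return math.ceil(2.0 * p * (1.0 - p) * (z / effect) ** 2)

if __name__ == "__main__":
    for effect_pp in (1, 2, 3):
        print(f"{effect_pp} pp -> about {questions_needed(effect_pp)} trials per arm")
```

Under these assumptions a 2 pp effect needs several thousand trials per arm, far beyond what three shuffles of 198 questions provide. A paired analysis needs fewer, but not by an order of magnitude.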
When GPQA-Diamond is the right choice
Despite the noise, GPQA-Diamond has one thing nothing else has: graduate-level science questions outside the training distribution of most public datasets. For testing whether a model has real science reasoning vs. memorized exam prep, it's irreplaceable. The trick is to not over-claim from a single noisy run.
Three ways to make GPQA-Diamond claim-worthy:
- Always run the 3-seed shuffle protocol. If you see only single-seed numbers in a paper, treat them as having ±7 pp uncertainty.
- Pre-register your effect size. If you predict a 6 pp difference and observe 7 pp, that's a clean win. If you predict 1 pp and search until you find a 1 pp number that helps you, that's noise mining.
- Pair GPQA with a tighter benchmark. Most experiments that move GPQA-Diamond by >4 pp will also move MATH-500 by ≥1 pp (better resolved at n=500, though 1 pp there is only 5 problems, so treat it as corroboration and not as proof on its own). If the move shows up on both, it's real. If only on GPQA, it's probably a 2σ event.
The cost-of-noise calculation
A 3-seed GPQA-Diamond run on a 27B model takes ~50 minutes per seed on an 8×H100 instance, about 2.5 hours per arm. For a 4-arm comparison, that's 10 hours of instance time (80 H100 GPU-hours) per benchmark. People skip the multi-seed protocol because it triples the cost.
The cost of skipping it: every claim from the run is unfalsifiable. The eval was a single coin flip. Do the 3 seeds.
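For budgeting, the cost figures above reduce to a small calculation. The defaults below are the numbers quoted in this section; `eval_cost` is an illustrative helper, and your per-seed time will differ with model size and decoding settings.

```python
def eval_cost(minutes_per_seed=50, seeds=3, arms=4, gpus_per_instance=8):
    """Wall-clock and GPU-hour cost of a multi-seed, multi-arm eval on one instance."""
    hours_per_arm = minutes_per_seed * seeds / 60
    instance_hours = hours_per_arm * arms
    return {
        "hours_per_arm": hours_per_arm,
        "instance_hours": instance_hours,
        "gpu_hours": instance_hours * gpus_per_instance,
    }

if __name__ == "__main__":
    print(eval_cost())          # the 3-seed protocol
    print(eval_cost(seeds=1))   # the single-seed shortcut, for comparison
```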
Source: measured noise floor on Qwen3.6-27B and Gemma-4-31B at multiple budgets in 2026; pattern holds across all model families we've tested. Standard property-shuffle GPQA harness is public.
Read more: MATH-500 Boxed-Answer Extraction Edge Cases, Mean Perplexity Is Lying to You.