Why Your Single-Seed GPQA-Diamond Comparison Is Probably Noise
Eval Methodology

May 2026 · Black Sheep AI Research

GPQA-Diamond is the strongest reasoning benchmark we have for graduate-level science Q&A. It's also small (198 questions), 4-way multiple choice, and carries a 95% confidence interval of about ±7 percentage points for a single run, narrowing only to about ±4 pp with three seeds. If your model-A vs model-B comparison shows a 2 pp difference, you have not measured anything.

The arithmetic

GPQA-Diamond has 198 graduate-level science multiple-choice questions, each with 4 candidate answers (A/B/C/D). Random guessing gives 25% accuracy. A typical strong model like Claude 3.7 or GPT-4o lands in the 50-65% range; a weak quantized variant might land closer to 35%.

Treat each question as an independent Bernoulli trial. With n=198 and p ≈ 0.5 (the worst case for SE), the standard error of the proportion is:

SE = sqrt(p(1-p)/n) = sqrt(0.5 × 0.5 / 198) = 0.0355

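The 95% confidence interval is roughly ±2 × SE = ±7.1 pp for a single eval run. That's enormous. Even with 3 evaluation seeds (594 trials total, SE = sqrt(0.25 / 594) ≈ 0.0205), the interval narrows only to about ±4 pp. That figure treats the three seeds as independent samples, which is optimistic because they reuse the same 198 questions.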
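The arithmetic above can be reproduced in a few lines. This is a minimal sketch using the normal approximation with a multiplier of 2, as in the text; the function name `ci_halfwidth` is ours, not part of any harness.

```python
import math

def ci_halfwidth(p: float, n: int, z: float = 2.0) -> float:
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)

# Worst case p = 0.5, one run of 198 questions vs. three seeds pooled.
single_run = ci_halfwidth(0.5, 198)
three_seeds = ci_halfwidth(0.5, 198 * 3)

print(f"single run:  ±{100 * single_run:.1f} pp")
print(f"three seeds: ±{100 * three_seeds:.1f} pp")
```

Plugging in your model's actual accuracy instead of 0.5 shrinks the interval slightly, but not by enough to change the conclusion in the 35-65% range.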

If you run model A and model B on GPQA-Diamond with a single seed each and see A score 47% and B score 50%, you have not measured a real difference. You've measured the noise floor.
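To put a number on the 47% vs 50% example, here is a rough two-proportion z-score. It assumes the two runs are independent samples, which is a simplification: both models answer the same questions, so a paired test would be the more careful choice. The helper name `two_proportion_z` is ours.

```python
import math

def two_proportion_z(p_a: float, p_b: float, n_a: int, n_b: int) -> float:
    """z-score for the difference of two independent proportions (unpooled SE)."""
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return (p_b - p_a) / se_diff

# Single-seed comparison from the text: A = 47%, B = 50%, 198 questions each.
z = two_proportion_z(0.47, 0.50, 198, 198)
print(f"z = {z:.2f}")  # far below the ~2 needed for a 95%-level claim
```

A z-score this small means a gap of the same size would show up routinely between two runs of the same model.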

Why this is worse than it sounds

GPQA-Diamond's noise compounds badly with three common eval mistakes:

Mistake 1: temperature 0 / greedy decoding.
Greedy decoding makes the eval deterministic given a prompt. That sounds like it removes noise, but it actually makes things worse: the question-to-question variation is still there, and now nothing averages it out. A single question the model misreads costs a full 0.5 pp (1 of 198) on every run, with no averaging across samples.

Mistake 2: thinking mode for models that don't close the thinking tag.
Many recent models (Qwen3, Claude 3.7, DeepSeek-V4) have a "thinking" sampling mode that emits scratch-pad text before the final answer. If the thinking tag never closes, the model never emits a final \boxed{}, and your evaluator scores it as wrong on a question it would have answered correctly. We've seen 35-second runs balloon to multi-minute runs where the model never decides on an answer. Use non-thinking mode for benchmarks unless you've explicitly tuned the stop conditions.
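One way to keep this failure visible is to count unfinished responses separately instead of folding them into "wrong". The sketch below is illustrative only: the `<think>` and `</think>` tag names are assumptions (they vary by model, and some chat templates open the tag in the prompt rather than the output), and `classify_response` is a hypothetical helper, not part of any public harness.

```python
import re

THINK_OPEN = "<think>"    # assumption: tag names differ between model families
THINK_CLOSE = "</think>"

# Accepts \boxed{C} and \boxed{(C)} with optional whitespace.
BOXED = re.compile(r"\\boxed\{\s*\(?([A-D])\)?\s*\}")

def classify_response(text: str) -> str:
    """Return 'A'..'D', or 'unfinished' / 'unparsed' for responses with no usable answer."""
    if THINK_OPEN in text and THINK_CLOSE not in text:
        return "unfinished"  # thinking never closed, so there is no final answer
    final_part = text.split(THINK_CLOSE)[-1]
    match = BOXED.search(final_part)
    return match.group(1) if match else "unparsed"

print(classify_response("<think>compare energies</think> \\boxed{C}"))
print(classify_response("<think>let me reconsider the second option"))
```

If the "unfinished" count is more than a handful of questions, the score is measuring your stop conditions rather than the model.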

Mistake 3: not shuffling answer choices.
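GPQA-Diamond ships with the correct answer in a fixed position in each record (a dedicated correct-answer field, followed by the three incorrect answers), so a harness that lists the options in dataset order always puts the correct answer at (A). Models that have any positional bias (every model has some) will get skewed scores if you don't shuffle: inflated if they favor (A), deflated if they avoid it. Use 3 property-shuffle seeds, randomize the answer permutation independently per seed, and report the mean of the three runs. This is the standard noise-floor characterization.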

What "3 property-shuffle seeds" looks like

For each question, you have 4 candidate answers, with the correct one at index 0 by convention:

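import random

# q is one dataset row, i is its index in the dataset, seed is the run seed.
choices = [correct, wrong_1, wrong_2, wrong_3]  # correct answer at index 0
order = list(range(4))
# Seed per (run, question). Reusing the bare run seed for every question would
# give every question the same permutation, so the correct letter would never move.
random.Random(f"{seed}:{i}").shuffle(order)
shuffled = [choices[k] for k in order]
correct_position = order.index(0)  # where the correct answer ended up
options_text = "\n".join(f"({chr(65+j)}) {a}" for j, a in enumerate(shuffled))
prompt = f"Question: {q['Question']}\n\n{options_text}\n\nAnswer:"
# Run model, extract chosen letter, compare to chr(65 + correct_position)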

Run this with seed in {0, 1, 2}. Three different shufflings produce three different "scores" on the same model. Report the mean of the three; report the range as the noise band.
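Put together as a full run, the protocol might look like the sketch below. Everything here is an assumption made for illustration: `ask_model` is a hypothetical callable that takes the question text and the shuffled options and returns a letter, and the `choices` key (correct answer first) is an assumed row layout, not the dataset's actual schema.

```python
import random
from statistics import mean

def run_seed(questions, ask_model, seed):
    """Accuracy for one shuffle seed; the permutation differs per question."""
    correct = 0
    for i, q in enumerate(questions):
        order = list(range(4))
        random.Random(f"{seed}:{i}").shuffle(order)
        shuffled = [q["choices"][k] for k in order]
        gold_letter = chr(65 + order.index(0))  # where the correct answer landed
        if ask_model(q["Question"], shuffled) == gold_letter:
            correct += 1
    return correct / len(questions)

def summarize(questions, ask_model, seeds=(0, 1, 2)):
    """Mean across seeds, plus the min-to-max range as the noise band."""
    scores = [run_seed(questions, ask_model, s) for s in seeds]
    return {"scores": scores, "mean": mean(scores), "range": max(scores) - min(scores)}

# Toy data and a stub "model" that always answers (A), i.e. pure positional bias.
toy = [{"Question": f"q{i}", "choices": ["right", "w1", "w2", "w3"]} for i in range(40)]
print(summarize(toy, lambda question, options: "A"))
```

The always-(A) stub is a useful sanity check: with shuffling in place it should land near chance, not near 100%.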

What you can claim and what you can't

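With n=198 × 3 seeds = 594 trials per arm, each arm carries a 95% interval of about ±4 pp, so differences inside that band are not claims you can defend.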
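As a rough guide to what that sample size resolves when comparing two arms, the sketch below computes the smallest gap that clears two standard errors of the difference. It assumes independent arms and worst-case p = 0.5; the function name `min_detectable_diff` is ours.

```python
import math

def min_detectable_diff(n_per_arm: int, p: float = 0.5, z: float = 2.0) -> float:
    """Smallest gap between two independent arms that exceeds z standard errors."""
    se_diff = math.sqrt(2 * p * (1 - p) / n_per_arm)
    return z * se_diff

one_seed = min_detectable_diff(198)
three_seed = min_detectable_diff(594)
print(f"1 seed per arm:  {100 * one_seed:.1f} pp")
print(f"3 seeds per arm: {100 * three_seed:.1f} pp")
```

The threshold for a difference is wider than the per-arm interval by a factor of √2, because both arms contribute noise.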

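For LLM compression or fine-tuning experiments where you expect small effect sizes (under 3 pp on average benchmarks), GPQA-Diamond is not the right benchmark. It's too small. Use MATH-500 (500 problems, exact-match scoring, no multiple-choice guessing noise) or MMLU-Pro 200Q stratified for tighter noise floors at similar reasoning depth. Apply the same arithmetic to whatever you pick: at p ≈ 0.5, a single run at n=500 still carries about ±4.5 pp and a 200-question subset about ±7 pp, so multiple seeds matter there too.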

When GPQA-Diamond is the right choice

Despite the noise, GPQA-Diamond has one thing nothing else has: graduate-level science questions outside the training distribution of most public datasets. For testing whether a model has real science reasoning vs. memorized exam prep, it's irreplaceable. The trick is to not over-claim from a single noisy run.

Three ways to make GPQA-Diamond claim-worthy:

  1. Always run the 3-seed shuffle protocol. If you see only single-seed numbers in a paper, treat them as having ±7 pp uncertainty.
  2. Pre-register your effect size. If you predict a 6 pp difference and observe 7 pp, that's a clean win. If you predict 1 pp and search until you find a 1 pp number that helps you, that's noise mining.
  3. Pair GPQA with a tighter benchmark. Most experiments that move GPQA-Diamond by >4 pp will also move MATH-500 by ≥1 pp (better resolved at n=500, though a 1 pp move there still needs multiple seeds to confirm). If the move shows up on both, it's real. If only on GPQA, it's probably a 2σ event.

The cost-of-noise calculation

A 3-seed GPQA-Diamond run on a 27B model takes ~50 minutes per seed on an 8×H100 instance, about 2.5 hours per arm. For a 4-arm comparison, that's 10 instance-hours (80 GPU-hours) per benchmark. People skip the multi-seed protocol because it triples the cost.
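The budget is easy to recompute for your own setup. A small sketch, using the figures from the paragraph above as inputs; the function name `eval_cost_hours` is ours.

```python
def eval_cost_hours(minutes_per_seed: float, seeds: int, arms: int, gpus_per_instance: int):
    """Return (instance-hours, GPU-hours) for a multi-seed, multi-arm eval."""
    instance_hours = minutes_per_seed * seeds * arms / 60
    return instance_hours, instance_hours * gpus_per_instance

# 50 min per seed, 3 seeds, 4 arms, 8 GPUs per instance.
instance_hours, gpu_hours = eval_cost_hours(50, 3, 4, 8)
print(f"{instance_hours:.1f} instance-hours, {gpu_hours:.1f} GPU-hours")
```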

The cost of skipping it: every claim from the run is unfalsifiable. The eval was a single coin flip. Do the 3 seeds.


Source: measured noise floor on Qwen3.6-27B and Gemma-4-31B at multiple budgets in 2026; pattern holds across all model families we've tested. Standard property-shuffle GPQA harness is public.
