Pre-Registering Predictions in ML Research: A Worked Example

Locking your hypothesis, decision criteria, and analysis plan in a timestamped file before running an experiment is, in our experience, the cheapest large reduction in self-deception available to ML researchers. It costs about 30 minutes per study, it doesn't change what you measure, and it makes the goalposts for every result you report impossible to move quietly after the fact. Here's how we do it.

The problem nobody talks about

ML research has a quiet crisis. The machine-readable, GitHub-trackable parts of the field (code, model weights, training logs, eval scripts) can be versioned and, in principle, reproduced. The parts that are not machine-readable (what hypothesis you were testing, which thresholds you'd have called "success," whether you ran 5 variants or 50 before finding the one you reported) cannot.

The fully honest version of "we got X% on benchmark Y" is: "We tried N variants, X% is the best, and whether we discuss or omit the others is at our discretion." Without pre-registration, a reader cannot tell what N was. Was it N=1 (a clean result)? Or N=50, with the reader shown the best of 50 (not a result, a search)?
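
To make the best-of-N concern concrete, here is a minimal simulation sketch. It assumes every variant has zero true gain over the baseline and that run-to-run noise is Gaussian with a standard deviation of 1 pp. Both numbers, and the function names, are illustrative assumptions, not measurements.

```python
import random
import statistics
from typing import Optional


def best_of_n(n_variants: int, noise_sd: float = 1.0,
              rng: Optional[random.Random] = None) -> float:
    """Best observed gain (in pp) among n_variants whose true gain is zero."""
    rng = rng or random.Random()
    return max(rng.gauss(0.0, noise_sd) for _ in range(n_variants))


def mean_reported_gain(n_variants: int, trials: int = 2000, seed: int = 0) -> float:
    """Average 'headline' gain when only the best of n_variants is reported."""
    rng = random.Random(seed)
    gains = [best_of_n(n_variants, rng=rng) for _ in range(trials)]
    return statistics.fmean(gains)


if __name__ == "__main__":
    for n in (1, 5, 50):
        print(f"N={n:>2}: mean reported gain = {mean_reported_gain(n):+.2f} pp")
```

Under these assumptions the reported gain grows with N even though no variant is actually better than the baseline, which is why N belongs on the record.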

Pre-registration is the discipline that closes this gap: it puts the hypothesis, the success thresholds, and the number of variants on the record before any result exists.

What goes in a pre-registration

A pre-registration document, locked before any model trains, contains:

Hypothesis under test. One sentence. "Configuration A beats configuration B on metric M by ≥ ε."
Operationalization. What experiment you'll run to test the hypothesis. Includes the model(s), the dataset(s), the eval(s), the hyperparameters. Specifically: anything you could "tune" to produce a different result.
Pre-registered prediction. A specific quantitative claim. Not "we expect A to be better"; that's vacuous. Instead: "we predict A beats B by 2-4 pp on benchmark M; if the result is <2 pp the hypothesis is rejected, if >4 pp the hypothesis is supported in a stronger form than expected."
Decision rule. What outcome causes you to publish a positive result, what outcome causes you to publish a negative result, and what outcome causes you to declare the experiment inconclusive.
Deviations log (template). Every deviation from this plan you make during execution must be recorded as it happens, with a one-line rationale. Deviations made before running are fine. Deviations made after seeing partial results are p-hacking in slow motion.
Explicit "not in scope" list. Things that would be cool to test but aren't part of the hypothesis. This prevents scope creep when results come in: "oh and we also tried..."

That's it. About 200-500 words. Lock it as a markdown file under git, push, and only then start training.
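
One way to check a pre-reg before locking it is a small lint-and-fingerprint script. The sketch below assumes the file is markdown with one `## ` heading per item in the list above. The heading names and both function names are our own assumptions, so rename them to match your template.

```python
import hashlib

# Assumed heading names, one per item in the list above.
REQUIRED_SECTIONS = (
    "Hypothesis",
    "Operationalization",
    "Pre-registered prediction",
    "Decision rule",
    "Deviations log",
    "Not in scope",
)


def missing_sections(prereg_text: str) -> list[str]:
    """Return the required section names that have no '## <name>' heading."""
    headings = {
        line[3:].strip().lower()
        for line in prereg_text.splitlines()
        if line.startswith("## ")
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


def prereg_digest(prereg_text: str) -> str:
    """SHA-256 hex digest of the pre-reg text as it stands at lock time."""
    return hashlib.sha256(prereg_text.encode("utf-8")).hexdigest()
```

Writing the digest into the training log ties each run to the exact text that was locked. Appending to the deviations log later changes the digest, so compare against the version in the locking commit, not the current file.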

A worked example

Suppose your hypothesis is: "A specific quantization recipe X yields LM perplexity within 1% of BF16 on WikiText-2 at an average bit-width of 4."

The pre-registration:

# Pre-registered protocol: Quantization Recipe X, perplexity preservation
**Locked: 2026-05-05 09:00 UTC. Do not edit after this point.**

## Hypothesis
Quantization recipe X (parameters specified in `recipe_x.yaml`)
preserves LM perplexity on WikiText-2 within ±1.0% relative
to the BF16 source model at an average bit-width of 4.

## Operationalization
- Source model: Qwen3-14B-base (HF revision a8ad2c26)
- Eval: WikiText-2 test split, 32 sequences × 2048 tokens
- Aggregation: median per-sequence loss, exp(median) reported
- Hardware: M2 Ultra Mac Studio (CPU-mode, deterministic)

## Pre-registered prediction
ppl_recipe_X / ppl_BF16 ∈ [0.99, 1.01]

## Decision rule
- IF ratio ∈ [0.99, 1.01] → publish positive result
- IF ratio ∈ (1.01, 1.03] → publish negative result, "near miss"
- IF ratio > 1.03 → publish negative result, "rejected"
- IF ratio < 0.99 → unexpected; treat as bug until reproduced

## Deviations log
(empty at lock time; append entries chronologically)

## Not in scope
- MMLU/GPQA/MATH benchmarks (covered by separate pre-reg)
- Other model sizes (covered by separate pre-reg)
- Recipe X variants (this is recipe X-as-specified, not a sweep)

Push to git. Then, and only then, run the experiment.
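
The aggregation and decision rule in the protocol above are mechanical enough to write down as code before any number exists. The following is a minimal sketch, not the authors' tooling: the function names and outcome labels are hypothetical, and boundary values are assigned to the lower band (a ratio of exactly 1.01 counts as positive), which is an assumption of this sketch.

```python
import math
import statistics


def perplexity(per_sequence_losses: list[float]) -> float:
    """exp(median per-sequence loss), the aggregation named in the protocol."""
    return math.exp(statistics.median(per_sequence_losses))


def decide(ratio: float) -> str:
    """Map ppl_recipe_X / ppl_BF16 to the pre-registered outcome."""
    if not math.isfinite(ratio) or ratio <= 0:
        raise ValueError(f"ratio must be a positive finite number, got {ratio!r}")
    if ratio < 0.99:
        return "unexpected: treat as bug until reproduced"
    if ratio <= 1.01:
        return "positive"
    if ratio <= 1.03:
        return "negative: near miss"
    return "negative: rejected"


if __name__ == "__main__":
    # Made-up losses purely for illustration; not results from any model.
    bf16_losses = [2.10, 2.05, 2.20]
    quant_losses = [2.11, 2.06, 2.21]
    ratio = perplexity(quant_losses) / perplexity(bf16_losses)
    print(f"ratio={ratio:.4f} -> {decide(ratio)}")
```

Committing a function like this alongside the protocol means the verdict is computed, not argued, once the ratio is known.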

What happens during execution

You will be tempted to deviate. Some examples we've seen:

"The eval ran on too few sequences, let me bump it from 32 to 256." That's a deviation. Log it: "Increased to 256 sequences after observing per-sequence variance > 0.5 in the first 32. Effect: lowers the noise floor; the decision rule and its thresholds are unchanged."
"The recipe failed on a few tensors; let me protect them at higher precision." That's a recipe modification. Either declare a deviation ("recipe X' tested instead of recipe X, with rationale Y") or bail and re-pre-register.
"Actually the prediction was too narrow, the right band is ±2%." That's the worst kind of deviation. You're moving the goalposts after seeing how far the ball flew. Don't do this. If you really think the prediction was wrong, publish that the original prediction failed and start a new pre-registration with the wider band.

The deviation log is the audit trail. Reviewers reading the eventual paper see what you predicted, what actually happened, and what you decided to do about it. The result is interpretable.
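
A deviation entry is easiest to keep honest when its format forces you to state whether results had already been seen. Here is a minimal sketch of such a helper. The entry layout, the function names, and the `results_seen` flag are our own assumptions, not part of any standard.

```python
from datetime import datetime, timezone
from typing import Optional


def deviation_entry(what: str, rationale: str, results_seen: bool,
                    when: Optional[datetime] = None) -> str:
    """Format one deviations-log line with a UTC timestamp and a results flag."""
    if not what.strip() or not rationale.strip():
        raise ValueError("a deviation needs both a description and a rationale")
    when = when or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    flag = "AFTER partial results" if results_seen else "before any results"
    return f"- {stamp} [{flag}] {what.strip()} Rationale: {rationale.strip()}"


def append_deviation(path: str, entry: str) -> None:
    """Append the entry to the pre-reg file; earlier entries are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(entry + "\n")
```

Opening the file in append mode and committing after each entry keeps the log chronological, and the flag makes the risky deviations easy for a reviewer to find.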

What this gets you

Three things, in order of value:

1. Self-honesty. When you write "we predict 2-4 pp" and then observe 0.5 pp, you cannot rationalize that as "well, maybe the effect is more subtle than we thought." The pre-registration locks the goalposts so your future self can't move them.

2. Reviewer-readable provenance. A reader who's skeptical of your headline number can read the pre-reg and see exactly what you were testing for. "I expected 2-4 pp, got 5 pp" is a stronger result than "I got 5 pp"; the former came with an honest commitment to a smaller band.

3. Negative-result capture. Most experiments don't work. Pre-registration is what makes a negative result informative, because you committed in advance to a band the result didn't fall into. Without pre-reg, "we tried it and it didn't work" is hard to publish and harder to interpret. With pre-reg, "we predicted A would beat B by 2-4 pp and observed 0.3 pp, rejecting the hypothesis" is a clean negative finding.

Common objections

"This is too rigid for ML, we don't always know what we're going to find."
Then pre-register an exploratory study. The pre-reg says "this is exploratory; we're describing the data we'll see, not testing a hypothesis." That's still better than no commitment.

"What if I find something interesting that wasn't in scope?"
Add it to the deviation log as "exploratory finding; will pre-register a confirmatory study before publishing." Don't bury it in the original paper as if it were a primary result.

"30 minutes is too much overhead per experiment."
The cost is paid once per hypothesis, not per training run. If you're running 5 model variants to test "is recipe X within 1% of BF16", they share one pre-registration. The 30 minutes is amortized.

The minimum viable pre-reg

If 200 words feels long, the absolute minimum is three sentences, locked before training starts:

"We predict that [system X] achieves [metric Y] in the range [a, b] on [benchmark Z]."
"If the observed value is in [a, b] we conclude X works; if outside, X does not."
"Anything we change about this plan during execution will be logged in this file before we look at the results."

Three sentences, push to git, run. The discipline transfers; the cost doesn't.
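
If you want the three sentences generated from a few fields so they always have the same shape, a small dataclass is enough. This is a sketch with hypothetical names (`MinimalPrereg`, `render`, `works`); the sentence wording follows the template above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalPrereg:
    system: str
    metric: str
    benchmark: str
    low: float
    high: float

    def __post_init__(self) -> None:
        if self.low > self.high:
            raise ValueError("low must not exceed high")

    def render(self) -> str:
        """Return the three pre-registration sentences, one per line."""
        band = f"[{self.low}, {self.high}]"
        return "\n".join([
            f"We predict that {self.system} achieves {self.metric} "
            f"in the range {band} on {self.benchmark}.",
            f"If the observed value is in {band} we conclude {self.system} works; "
            f"if outside, it does not.",
            "Anything we change about this plan during execution will be logged "
            "in this file before we look at the results.",
        ])

    def works(self, observed: float) -> bool:
        """Apply the second sentence: inside the band (inclusive) means it works."""
        return self.low <= observed <= self.high
```

The class is frozen so the band cannot be reassigned in code after the fact; the git history of the rendered file remains the actual lock.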

Source: standard practice in our pre-registered experimental work; format adapted from clinical-trial pre-registration norms (clinicaltrials.gov, OSF preregistration).