Decision Gates as Forcing Functions: Four-Tier Pre-Registered Verdicts

Pre-registering a prediction is half the discipline. The other half is pre-registering the decision you'll take when results come in. A four-tier verdict system (strong, modest, null, failure), locked before launch, makes "should we publish?" a one-line lookup rather than a months-long argument with yourself.

The trap pre-registration alone doesn't fix

You pre-registered: "We predict variant A beats variant B by 2-4 pp." Good.

The experiment runs. You observe 0.7 pp.

Now what? Three readings of the same number:

Optimistic reading: "Within margin of measurement noise; effect is plausibly real but underpowered. We should run more seeds and probably publish if it holds." (Outcome: paper published, reviewers skeptical because n=1 result barely above noise.)
Pessimistic reading: "The prediction was 2-4 pp; we got 0.7 pp; the hypothesis is rejected." (Outcome: shelf the result, lose 6 weeks of work.)
Bargaining reading: "Well, the secondary benchmark moved 1.5 pp, and that's closer to the prediction band, so maybe the right framing is 'mixed evidence with one positive subgroup'..." (Outcome: scope-creep paper that says nothing clearly.)

Without pre-registered decision rules, all three readings are available to you. The reading you pick will reflect your career incentives more than the data.

What a decision gate looks like

A decision gate is a 4-row table, locked before you launch, that maps each possible outcome to one of four verdicts: STRONG, MODEST, NULL, FAILURE. Each verdict has a pre-committed action.

Worked example, for a hypothetical "compression recipe X preserves quality" study:

Verdict	Conditions	Action
STRONG	Recipe X within 1% of BF16 on WikiText-2 AND beats baseline by >2 pp on MMLU-Pro AND no eval shows >0.5% regression	Publish as headline result; recommend for production use; trigger follow-up scaling study
MODEST	Recipe X within 1.5% of BF16 on WikiText-2 AND no >2% regression on any eval	Publish as methods result; document tradeoffs; do not recommend for production without a per-deployment evaluation
NULL	All evals within ±1% of BF16, but no eval shows a meaningful gain	Publish as parity result ("recipe X matches BF16, no advantage observed"); useful negative finding
FAILURE	Recipe X >2% worse than BF16 on any eval	Publish as failure analysis ("recipe X breaks on Y type of model under Z conditions"); useful negative finding for the field

Read this table aloud to yourself before launching the experiment. If you can't agree with what each verdict implies for your career, your team's roadmap, or the paper's framing, you're not ready to run it. Adjust the gate or pull the experiment.
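As one way to make the lookup mechanical, the worked example can be encoded as a small function. This is a sketch under stated assumptions, not the study's actual tooling: the name `verdict`, the eval keys (`wikitext2`, `mmlu_pro`, `math500`), the encoding of every result as a signed delta against the BF16 reference (negative means worse, with % and pp treated as one scale), and the order in which the rows are checked are all choices made here for illustration. Outcomes that match no row return UNCLASSIFIED instead of being forced into a verdict, which surfaces any hole in the gate while there is still time to close it.

```python
# Hypothetical encoding of the worked-example gate.
# Assumption: each value is a signed delta vs. the BF16 reference,
# negative = worse, and % / pp are treated as the same scale.

ACTIONS = {
    "STRONG": "Publish as headline result",
    "MODEST": "Publish as methods result",
    "NULL": "Publish as parity result",
    "FAILURE": "Publish as failure analysis",
}


def verdict(delta: dict) -> str:
    """Map eval deltas to a verdict; the check order is an assumption."""
    worst = min(delta.values())
    wikitext = delta["wikitext2"]

    # FAILURE: more than 2 worse than BF16 on any eval.
    if worst < -2.0:
        return "FAILURE"
    # STRONG: WikiText-2 within 1, MMLU-Pro gain above 2, no regression above 0.5.
    if abs(wikitext) <= 1.0 and delta["mmlu_pro"] > 2.0 and worst >= -0.5:
        return "STRONG"
    # NULL: every eval within +/-1, so no meaningful gain either.
    if all(abs(v) <= 1.0 for v in delta.values()):
        return "NULL"
    # MODEST: WikiText-2 within 1.5; "no regression above 2" already holds here.
    if abs(wikitext) <= 1.5:
        return "MODEST"
    # No row matched: a hole in the gate, to be closed before launch.
    return "UNCLASSIFIED"


if __name__ == "__main__":
    observed = {"wikitext2": -1.2, "mmlu_pro": 0.4}
    v = verdict(observed)
    print(v, "->", ACTIONS.get(v, "fix the gate before running"))
```

Under this reading, a WikiText-2 delta of -1.8 with nothing else moving matches no row, so the sketch reports it as UNCLASSIFIED rather than guessing.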

Why four tiers, not two

A binary decision (publish / don't publish) has a perverse incentive: every observation gets pushed onto the "publish" side of the cutoff because nobody wants to admit a project failed. With four tiers:

STRONG has a high bar. You won't push a marginal result over it.
MODEST absorbs the "interesting but not definitive" results that would otherwise get inflated to look like STRONG.
NULL is its own publishable category. "We tested this and found no advantage" is a finding, not a failure.
FAILURE is for the actively broken case where your method is worse than baseline. It is also publishable, as a useful warning to others.

The structure makes every outcome publishable in some form, which removes the implicit "if it doesn't work I'll bury it" pressure that distorts results. Combined with pre-registered predictions, your eventual paper looks like: "We predicted X. We observed Y. By our pre-registered decision rule, that's verdict Z."

What goes wrong without it

Two failure modes we've seen, both common:

Failure 1: Goalpost drift.
A team predicts "A beats B by 2-4 pp" and observes 1.2 pp. Without a pre-committed verdict, they end up writing: "We observed an interesting trend in the predicted direction; future work should explore the conditions under which this effect strengthens." That sentence commits to nothing, and neither does the paper built on it. With a pre-committed gate whose NULL band covers 1.2 pp, the same observation reads: "1.2 pp is verdict NULL: A and B are functionally equivalent on this benchmark; the predicted effect did not materialize at this scale."

Failure 2: Conditional refinement.
A team predicts that A beats B and observes mixed results: A wins on 2 of 4 benchmarks, B wins on the other 2. Without a pre-registered gate, they redefine the question: "A is best on reasoning benchmarks, B on factual." That new question needs its own pre-registration (was reasoning vs. factual the original split, or a post-hoc one?), and almost always it was post-hoc.

A pre-registered gate forces them to commit, before the data, to "what fraction of evals must A win for STRONG vs. MODEST vs. NULL?"
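A win-fraction rule of that kind might look like the sketch below. The thresholds (3/4, 1/2, 1/4) and the function name `win_fraction_verdict` are hypothetical, chosen only to illustrate the shape of the commitment; the point is that they are written down before any benchmark runs.

```python
from fractions import Fraction

# Hypothetical thresholds, fixed before any benchmark is run.
STRONG_MIN = Fraction(3, 4)   # A wins at least 3/4 of the evals
MODEST_MIN = Fraction(1, 2)   # A wins more than half, but below STRONG
FAILURE_MAX = Fraction(1, 4)  # A wins at most 1/4 of the evals


def win_fraction_verdict(a_wins: int, total: int) -> str:
    """Return the verdict for 'A won a_wins of total pre-registered evals'."""
    if total <= 0 or not 0 <= a_wins <= total:
        raise ValueError("need 0 <= a_wins <= total and total > 0")
    frac = Fraction(a_wins, total)  # exact arithmetic, no float edge cases
    if frac >= STRONG_MIN:
        return "STRONG"
    if frac > MODEST_MIN:
        return "MODEST"
    if frac <= FAILURE_MAX:
        return "FAILURE"
    return "NULL"


if __name__ == "__main__":
    # The mixed result from the text: A wins 2 of 4 benchmarks.
    print(win_fraction_verdict(2, 4))
```

With these illustrative thresholds, the 2-of-4 split lands in NULL, and there is no room to recast it as a reasoning-vs-factual story after the fact.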

How to write a tight gate

Three rules:

Rule 1: gates apply to pre-registered metrics, not to "any reasonable interpretation."
If your gate says "STRONG = wins on MMLU-Pro by ≥2 pp," then a 4 pp win on MATH-500 with no MMLU-Pro movement is not STRONG. The gate is the gate. If you want MATH-500 in the gate, put it in the gate before launch.

Rule 2: each verdict has a pre-committed action.
Not "we'll see what makes sense." A specific commitment: publish, don't publish, follow up with study X, etc. If you can't decide the action ahead of time, you're not ready to set the gate.

Rule 3: the gates partition the outcome space.
There should be no observable result that doesn't fit somewhere, and none that fits two verdicts. If you observe 1.5 pp and your gate has STRONG ≥ 2 pp and NULL ≤ 1 pp, what's 1.5? That's a hole in the gate; close it before launch ("strictly between 1 and 2 pp = MODEST"). Otherwise you'll fill the hole post-hoc, defeating the purpose.
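A cheap pre-launch check for holes and overlaps is to probe the gate over a grid of plausible outcomes. The sketch below is one possible approach, with hypothetical names (`find_holes_and_overlaps`, `leaky`, `closed`); it uses the STRONG ≥ 2 pp and NULL ≤ 1 pp thresholds from the example above. A grid only tests the sampled points, so it complements reading the thresholds carefully and does not replace it.

```python
from typing import Callable, Dict, List, Tuple

Gate = Dict[str, Callable[[float], bool]]


def find_holes_and_overlaps(
    gate: Gate, probes: List[float]
) -> Tuple[List[float], List[Tuple[float, List[str]]]]:
    """Return probe values matching no verdict, and those matching several."""
    holes, overlaps = [], []
    for x in probes:
        matches = [name for name, cond in gate.items() if cond(x)]
        if not matches:
            holes.append(x)
        elif len(matches) > 1:
            overlaps.append((x, matches))
    return holes, overlaps


# The gate from the text, before the hole is closed.
leaky: Gate = {
    "STRONG": lambda x: x >= 2.0,
    "NULL": lambda x: x <= 1.0,
}
# The same gate with the missing band assigned to MODEST.
closed: Gate = {**leaky, "MODEST": lambda x: 1.0 < x < 2.0}

# Probe effects from -2.0 pp to 6.0 pp in 0.1 pp steps.
probes = [i / 10 for i in range(-20, 61)]

if __name__ == "__main__":
    holes, _ = find_holes_and_overlaps(leaky, probes)
    print("holes in leaky gate:", holes)
    print("closed gate:", find_holes_and_overlaps(closed, probes))
```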

Pair with a deviation log

A pre-registered gate is binding only if every deviation from the experimental plan is logged as it happens. Without a deviation log, a clever researcher will subtly change the experiment to land in the verdict they want and call it the original plan.

The deviation log has one row per deviation, each written before the next training run starts:

2026-05-05 14:32  Increased eval N from 200 to 500 to reduce noise. Doesn't change verdict thresholds; gate untouched.
2026-05-05 16:12  Excluded layer 0 from X due to numerical instability observed in initial step. Gate untouched.
2026-05-06 09:45  Hypothesis modified to "A beats B by 0.5-2 pp." DEVIATION: this is a goalpost shift; original hypothesis is rejected; this becomes a new exploratory study.

The third entry is the kind of thing that pre-registration alone can't prevent, but writing it down, before continuing, at least makes it visible.
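Keeping the log append-only and timestamped is easy to automate. The helper below is a minimal sketch, not a prescribed tool: the function names, the `gate_touched` flag, and the file path are assumptions, and the line format simply mirrors the entries shown above.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def format_entry(when: datetime, note: str, gate_touched: bool) -> str:
    """Render one log row: timestamp, two spaces, note, gate status."""
    status = "DEVIATION: gate affected." if gate_touched else "Gate untouched."
    return f"{when:%Y-%m-%d %H:%M}  {note} {status}"


def log_deviation(
    path: Path, note: str, gate_touched: bool, when: Optional[datetime] = None
) -> str:
    """Append one row to the log; existing rows are never rewritten."""
    line = format_entry(when or datetime.now(), note, gate_touched)
    with path.open("a", encoding="utf-8") as f:  # "a" = append-only
        f.write(line + "\n")
    return line


if __name__ == "__main__":
    # Hypothetical log location; call this before the next run starts.
    print(log_deviation(
        Path("deviations.log"),
        "Increased eval N from 200 to 500 to reduce noise.",
        gate_touched=False,
    ))
```

Forcing an explicit `gate_touched` answer on every entry is the design choice that matters: it makes the author state, at the moment of the change, whether the verdict thresholds are still the pre-registered ones.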

What it costs vs. what it buys

Setup cost: 30 minutes per study to write the gate, the predictions, and the not-in-scope list.

What it buys:
- Faster post-experiment decisions (gate is a lookup table)
- Higher-credibility published results (every claim is bound to a pre-committed criterion)
- Higher-quality negative results (NULL and FAILURE verdicts are publishable, not personal failures)
- Resistance to your own future motivated reasoning ("but maybe...")

The cost is small and front-loaded. The benefits compound across every experiment in your career.

Source: adapted from clinical-trial decision-rule conventions; specific four-tier structure refined through our 2026 quantization and adaptation experiments.