
The General-Knowledge Floor: Why Knowledge-Injection Evals Cluster at 70%

June 2026 · Black Sheep AI Research

We fine-tuned four wildly different models — a 7B dense, a 35B MoE, a 120B reasoning-channel model, and a 27B dense — to inject the same document knowledge. Every one landed within 10 points of the others, regardless of recipe. That suspiciously tight clustering was not architecture-invariance. It was a ~60% general-knowledge floor hiding the real signal. The eval that removes it shows the uncomfortable truth: parametric injection recovers only ~14–22% of facts the model didn’t already know.

The Suspicious Clustering

We ran a knowledge-injection sweep: take a model, fine-tune a LoRA on synthetic Q&A generated from a document corpus, then test closed-book on 80 held-out questions (graded by a separate strong judge). We varied the architecture, the augmentation level, the adapter rank, and the learning rate. The scores barely moved:

Model | Architecture | Base (no injection) | Best injected
OLMo-2-7B | dense | 37.5% | 66.2%
Qwen3.5-35B-A3B | MoE | 66.2% | 68.8%
gpt-oss-120B | reasoning-channel | 73.8% | 72.5%
Qwen3.6-27B | dense | 67.5% | 68.8%

Three of four base models, before any injection, already sat at 66–74%. A 120B model and a 27B model, injected or not, were indistinguishable. When results are that flat across architectures that differ by 17× in size and route tokens in completely different ways, the right reaction is not “we found architecture-invariance.” It is: the benchmark is measuring something other than injection.
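To make "indistinguishable" concrete, here is a minimal sketch of the sampling noise on an 80-question benchmark. It is an illustration rather than the analysis used for this post: the function names are hypothetical, and it treats the two scores as independent binomial samples, which is only an approximation because every model answered the same questions.

```python
import math


def binomial_se(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimated from n_questions graded items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def gap_in_se_units(acc_a: float, acc_b: float, n_questions: int) -> float:
    """Absolute accuracy gap divided by the standard error of the difference.

    Assumption: the two scores are treated as independent samples. A paired
    test on per-question outcomes would be the more careful choice.
    """
    se_diff = math.sqrt(
        binomial_se(acc_a, n_questions) ** 2 + binomial_se(acc_b, n_questions) ** 2
    )
    return abs(acc_a - acc_b) / se_diff


if __name__ == "__main__":
    # Base scores from the table above: gpt-oss-120B vs Qwen3.6-27B, 80 questions.
    print(f"SE at 70% accuracy: {binomial_se(0.70, 80):.3f}")
    print(f"gap in SE units:    {gap_in_se_units(0.738, 0.675, 80):.2f}")
```

Under these assumptions, one standard error at 70% accuracy is about 5 points, and the 6.3-point gap between the 120B and 27B base scores comes out below one standard error of the difference.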

Suspect One: A Lenient Judge

The obvious culprit is the LLM judge — if it passes any fluent, on-topic answer, every model caps near its leniency rate. We audited it directly. For all 80 questions we fed the judge three candidate types with the exact grading prompt used in the eval:

Candidate fed to judge | Pass rate | Expected
the reference answer itself | 100.0% | ~100% (no false negatives)
a generic hedge non-answer | 0.0% | ~0% (not lenient)
a different question’s reference | 0.0% | ~0% (discriminating)

The judge is clean. It passes correct answers 100% of the time and passes vague or off-topic ones 0% of the time. It is not the artifact. (Independently, a retrieval-augmented baseline scores 78.8% with the same judge, so there is no hard ceiling at 70% either.)
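The audit above can be sketched in a few lines. This is an illustrative outline, not the harness used for this post: the `judge(question, reference, candidate)` callable, the `HEDGE` string, and the choice of the next question's reference as the off-topic candidate are all assumptions.

```python
from typing import Callable, Dict, Sequence

# Hypothetical judge signature: returns True when the candidate answer passes.
Judge = Callable[[str, str, str], bool]

# Hypothetical generic non-answer used as the "hedge" candidate.
HEDGE = "It depends on several factors, and more context would be needed to say."


def audit_judge(
    judge: Judge,
    questions: Sequence[str],
    references: Sequence[str],
    hedge: str = HEDGE,
) -> Dict[str, float]:
    """Pass rate of the judge on three candidate types, per question."""
    n = len(questions)
    if n < 2 or n != len(references):
        raise ValueError("need at least two questions, each with one reference")

    passes = {"reference": 0, "hedge": 0, "off_topic": 0}
    for i, (question, reference) in enumerate(zip(questions, references)):
        # Assumption: the next question's reference is a wrong answer for this one.
        other_reference = references[(i + 1) % n]
        passes["reference"] += bool(judge(question, reference, reference))
        passes["hedge"] += bool(judge(question, reference, hedge))
        passes["off_topic"] += bool(judge(question, reference, other_reference))
    return {name: count / n for name, count in passes.items()}
```

A clean judge returns a rate near 1.0 for `reference` and near 0.0 for the other two. A high `hedge` rate would point to leniency, and a high `off_topic` rate would mean the judge rewards fluency over correctness.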

Suspect Two: The Eval Itself

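We built the question×model correctness matrix across all arms, with one row per question and one column per model or recipe. The structure is unmistakable: a large block of questions that broadly pre-trained models answer correctly with or without injection, and a contested minority where the arms actually differ.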

That is the whole story. Any model with broad pre-training starts at the ~60% general-knowledge floor. Architecture and recipe can only move the contested minority. The benchmark was scoring what the model already knew, not what we injected — and compressing every competent model into the same narrow band.
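A minimal sketch of how such a matrix can be summarized follows. The post does not spell out how the floor was computed, so the definition here (questions answered correctly by every selected arm) is an assumption, as are the function name and the `matrix[question_id][arm]` layout.

```python
from typing import Dict, Optional, Sequence


def floor_and_contested(
    matrix: Dict[str, Dict[str, bool]],
    arms: Optional[Sequence[str]] = None,
) -> Dict[str, float]:
    """Split questions by how the selected arms answered them.

    matrix[question_id][arm] is True when that arm was graded correct.
    arms restricts the columns considered; None means every arm in each row.
    """
    if not matrix:
        raise ValueError("matrix must contain at least one question")

    counts = {"always_correct": 0, "never_correct": 0, "contested": 0}
    for row in matrix.values():
        selected = [row[arm] for arm in arms] if arms is not None else list(row.values())
        if all(selected):
            counts["always_correct"] += 1  # part of the floor under this definition
        elif not any(selected):
            counts["never_correct"] += 1   # no selected arm gets it
        else:
            counts["contested"] += 1       # where recipes and architectures differ
    return {name: count / len(matrix) for name, count in counts.items()}
```

Restricting `arms` to the no-injection columns gives the share of questions already known before any fine-tuning. Only the `contested` share can separate one recipe from another.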

The Methodology: A Document-Specific Eval

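The fix is to drive the floor to zero by construction. We built a discriminating eval in three steps. In outline (per the methods note at the end of this post): we generated 231 document-specific Q&A pairs, had the base Qwen3.5-35B answer them, and kept the 100 it answered incorrectly.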

On this eval the base model scores ~0–3%. The floor is gone. Now the metric measures injection, and only injection.
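The filtering idea can be sketched as follows. This is a hedged outline, not the pipeline used here: `QAItem`, `answer_closed_book`, and `judge` are hypothetical names, and using the same judge for filtering as for the final eval is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class QAItem:
    question: str
    reference: str


def build_document_specific_eval(
    candidates: Sequence[QAItem],
    answer_closed_book: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
) -> List[QAItem]:
    """Keep only the questions the base model fails without the document.

    answer_closed_book(question) -> the base model's answer (hypothetical hook).
    judge(question, reference, candidate) -> True if the answer passes.
    """
    kept = []
    for item in candidates:
        answer = answer_closed_book(item.question)
        if not judge(item.question, item.reference, answer):
            kept.append(item)  # the base model did not already know this fact
    return kept
```

By construction the filtering model scores zero on the kept set under the same judge and deterministic decoding. A different base model is not guaranteed to, since it was not the one used for filtering.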

The Uncomfortable Result

Model / lever | Document-specific accuracy
OLMo-2-7B base | 3%
OLMo, augmentation 1× / 4× / 16× | 8% / 13% / 13%
OLMo, LoRA rank 8 / 32 / 64 | 9% / 14% / 14%
Qwen3.5-35B base | 0%
Qwen3.5, inject @ lr 1e-4 / 3e-5 | 19% / 22%

Every lever caps low. The best result — a 35B model with a tuned learning rate — recovers 22% of the facts it didn’t already know. The 7B model caps at ~14%. The “66%” from the original benchmark was almost entirely pre-existing knowledge; the genuine new-fact injection was buried 10–20 points above a 60% floor.

What Actually Injects Knowledge

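With a discriminating eval, the levers finally separate, and the ranking is sobering. In the table above, augmentation saturates by 4× (13% at both 4× and 16×), LoRA rank saturates by 32 (14% at both 32 and 64), and the best score, 22%, comes from the 35B model at the lower learning rate of 3e-5.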

Practical Takeaway

This is a measurement result first and a methods result second.

The clustering that looked like a finding was a warning. The eval that exposed it took an afternoon to build and changed the conclusion entirely. That is usually the better afternoon to spend.


Models: OLMo-2-7B (dense), Qwen3.5-35B-A3B (MoE), gpt-oss-120B (reasoning-channel), Qwen3.6-27B (dense), all on Apple Silicon / MLX. Injection: per-document synthetic Q&A → LoRA → merge. Eval: 80-question general benchmark vs a 100-question document-specific set (231 specific QA generated, filtered to the 100 Qwen3.5-35B answered incorrectly). Judge: Qwen3-30B-A3B, temperature 0, audited at 100% / 0% / 0% on reference / hedge / off-topic candidates.
