We fine-tuned four wildly different models — a 7B dense, a 35B MoE, a 120B reasoning-channel model, and a 27B dense — to inject the same document knowledge. Every one landed within 10 points of the others, regardless of recipe. That suspiciously tight clustering was not architecture-invariance. It was a ~60% general-knowledge floor hiding the real signal. The eval that removes it shows the uncomfortable truth: parametric injection recovers only ~14–22% of facts the model didn’t already know.
The Suspicious Clustering
We ran a knowledge-injection sweep: take a model, fine-tune a LoRA on synthetic Q&A generated from a document corpus, then test closed-book on 80 held-out questions (graded by a separate strong judge). We varied the architecture, the augmentation level, the adapter rank, and the learning rate. The scores barely moved:
| Model | Architecture | Base (no injection) | Best injected |
|---|---|---|---|
| OLMo-2-7B | dense | 37.5% | 66.2% |
| Qwen3.5-35B-A3B | MoE | 66.2% | 68.8% |
| gpt-oss-120B | reasoning-channel | 73.8% | 72.5% |
| Qwen3.6-27B | dense | 67.5% | 68.8% |
Three of four base models, before any injection, already sat at 66–74%. A 120B model and a 27B model, injected or not, were indistinguishable. When results are that flat across architectures that differ by 17× in size and route tokens in completely different ways, the right reaction is not “we found architecture-invariance.” It is: the benchmark is measuring something other than injection.
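To see how coarse these scores are, the percentages can be turned back into question counts. The sketch below assumes each figure in the table is a rounded fraction of the 80 questions (an assumption, since the raw counts are not listed); `to_count` is a helper name introduced here for illustration.

```python
# Scores copied from the table above, as (base %, best injected %).
N_QUESTIONS = 80

scores = {
    "OLMo-2-7B": (37.5, 66.2),
    "Qwen3.5-35B-A3B": (66.2, 68.8),
    "gpt-oss-120B": (73.8, 72.5),
    "Qwen3.6-27B": (67.5, 68.8),
}


def to_count(pct: float, n: int = N_QUESTIONS) -> int:
    """Recover the whole number of correct answers behind a rounded percentage.

    Assumes the percentage was computed as count / n and then rounded.
    """
    return round(pct / 100 * n)


# One question on an 80-question eval is worth this many percentage points.
points_per_question = 100 / N_QUESTIONS

# Change in the number of correct answers after injection, per model.
deltas = {
    model: to_count(injected) - to_count(base)
    for model, (base, injected) in scores.items()
}

for model, delta in deltas.items():
    print(f"{model}: {delta:+d} questions after injection")
```

Under that assumption, each question is worth 1.25 points, and each of the three strong models moves by two questions or fewer after injection.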
Suspect One: A Lenient Judge
The obvious culprit is the LLM judge — if it passes any fluent, on-topic answer, every model caps near its leniency rate. We audited it directly. For all 80 questions we fed the judge three candidate types with the exact grading prompt used in the eval:
| Candidate fed to judge | Pass rate | Expected |
|---|---|---|
| the reference answer itself | 100.0% | ~100 (no false negatives) |
| a generic hedge non-answer | 0.0% | ~0 (not lenient) |
| a different question’s reference | 0.0% | ~0 (discriminating) |
The judge is clean. It passes correct answers 100% of the time and passes vague or off-topic ones 0% of the time. It is not the artifact. (Independently, a retrieval-augmented baseline scores 78.8% with the same judge, so there is no hard ceiling at 70 either.)
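The audit can be sketched in a few lines. This is a minimal illustration, not the harness used for the numbers above: the `judge(question, reference, candidate)` signature, the `HEDGE` string, and the exact-match `toy_judge` are all assumptions standing in for the real LLM judge and grading prompt.

```python
from typing import Callable, Dict, List

# Hypothetical judge interface: returns True if the candidate passes.
Judge = Callable[[str, str, str], bool]

# Illustrative generic non-answer; the hedge text used in the real audit is not given.
HEDGE = "It depends on several factors and can vary by context."


def audit_judge(items: List[Dict[str, str]], judge: Judge) -> Dict[str, float]:
    """Return the judge's pass rate (%) for three candidate types per question."""
    n = len(items)
    passes = {"reference": 0, "hedge": 0, "off_topic": 0}
    for i, item in enumerate(items):
        question, reference = item["question"], item["reference"]
        # Use the next item's reference as a fluent but wrong answer.
        other_reference = items[(i + 1) % n]["reference"]
        passes["reference"] += judge(question, reference, reference)
        passes["hedge"] += judge(question, reference, HEDGE)
        passes["off_topic"] += judge(question, reference, other_reference)
    return {kind: 100.0 * count / n for kind, count in passes.items()}


def toy_judge(question: str, reference: str, candidate: str) -> bool:
    """Exact-match stand-in so the sketch runs without a model."""
    return candidate.strip().lower() == reference.strip().lower()


toy_items = [
    {"question": "q1", "reference": "alpha"},
    {"question": "q2", "reference": "beta"},
    {"question": "q3", "reference": "gamma"},
]

rates = audit_judge(toy_items, toy_judge)
print(rates)
```

A judge that is too lenient would show up here as a non-zero pass rate on the hedge or off-topic rows; one that is too strict would fall below 100% on the reference row.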
Suspect Two: The Eval Itself
We built the question×model correctness matrix across all arms. The structure is unmistakable:
- The three strong base models agree at Jaccard 0.82–0.87 on which questions they get right — they have nearly identical correct-sets.
- 60% of all 80 questions are answered correctly by all three strong bases with no document access at all — they are answerable from general ML knowledge already in the weights.
- 26% of questions are answered by every arm (a shared easy floor); 12.5% by none (a shared hard ceiling); only the middle is contested.
That is the whole story. Any model with broad pre-training starts at the ~60% general-knowledge floor. Architecture and recipe can only move the contested minority. The benchmark was scoring what the model already knew, not what we injected — and compressing every competent model into the same narrow band.
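The three statistics above fall out of simple set operations on the correctness matrix. The sketch below uses a made-up 10-question matrix with hypothetical arm names, purely to show the computation; none of the toy numbers come from the real sweep.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap of two correct-sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def summarize(correct: Dict[str, Set[int]], n_questions: int, strong: List[str]) -> dict:
    """Floor and ceiling statistics from per-arm sets of correctly answered question ids."""
    all_ids = set(range(n_questions))
    strong_all = set.intersection(*(correct[m] for m in strong))
    every_arm = set.intersection(*correct.values())
    no_arm = all_ids - set.union(*correct.values())
    return {
        "pairwise_jaccard": {
            (a, b): jaccard(correct[a], correct[b]) for a, b in combinations(strong, 2)
        },
        "pct_all_strong": 100.0 * len(strong_all) / n_questions,
        "pct_every_arm": 100.0 * len(every_arm) / n_questions,
        "pct_no_arm": 100.0 * len(no_arm) / n_questions,
    }


# Toy data: question ids each arm answered correctly (illustrative only).
toy_correct = {
    "strong_a": {0, 1, 2, 3, 4, 5, 6},
    "strong_b": {0, 1, 2, 3, 4, 5, 7},
    "strong_c": {0, 1, 2, 3, 4, 5, 6, 7},
    "weak_base": {0, 1, 2},
}

summary = summarize(toy_correct, n_questions=10, strong=["strong_a", "strong_b", "strong_c"])
print(summary)
```

A high `pct_all_strong` is the warning sign: it is the share of the benchmark that any broadly pre-trained model gets for free.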
The Methodology: A Document-Specific Eval
The fix is to drive the floor to zero by construction. We built a discriminating eval in three steps:
- Generate specific questions. Prompt the generator for questions answerable only from the passage — exact numbers, proper nouns, named methods, quantitative results — explicitly avoiding generic textbook questions. This produced 231 candidates.
- Filter to base-wrong. Run a strong model (Qwen3.5-35B) closed-book and keep only the questions it gets wrong. 100 of 231 survived. These are, by construction, facts this strong base model does not know from pre-training, which makes them a good proxy for genuinely document-specific knowledge.
- Re-evaluate the existing adapters on this set. No retraining — the same injected models, a harder eval.
On this eval the base models score 0–3% (0% for Qwen3.5-35B by construction, 3% for OLMo-2-7B). The floor is gone. Now the metric measures injection, and only injection.
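The filtering step can be sketched as follows. This is a minimal illustration under stated assumptions: `answer_closed_book` and `judge` are hypothetical callables standing in for the base model and the LLM judge, and the toy questions are invented so the example runs on its own.

```python
from typing import Callable, Dict, List

QA = Dict[str, str]  # {"question": ..., "reference": ...}


def filter_base_wrong(
    candidates: List[QA],
    answer_closed_book: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
) -> List[QA]:
    """Keep only the questions the base model answers incorrectly without the document."""
    kept = []
    for qa in candidates:
        prediction = answer_closed_book(qa["question"])
        if not judge(qa["question"], qa["reference"], prediction):
            kept.append(qa)
    return kept


# Toy stand-ins; a real pipeline would call a model and an LLM judge here.
GENERAL_KNOWLEDGE = {"What does LoRA stand for?": "Low-Rank Adaptation"}


def toy_base_model(question: str) -> str:
    return GENERAL_KNOWLEDGE.get(question, "I don't know.")


def toy_judge(question: str, reference: str, candidate: str) -> bool:
    return candidate.strip().lower() == reference.strip().lower()


candidates = [
    {"question": "What does LoRA stand for?", "reference": "Low-Rank Adaptation"},
    {"question": "Which learning rate did the made-up report use?", "reference": "3e-5"},
    {"question": "How many seeds did the made-up report average over?", "reference": "5"},
]

doc_specific = filter_base_wrong(candidates, toy_base_model, toy_judge)
print(f"kept {len(doc_specific)} of {len(candidates)} candidate questions")
```

Re-running the same base model on the filtered set should give roughly zero accuracy, which is a cheap check that the floor has actually been removed.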
The Uncomfortable Result
| Model / lever | Document-specific accuracy |
|---|---|
| OLMo-2-7B base | 3% |
| OLMo — augmentation 1× / 4× / 16× | 8 / 13 / 13% |
| OLMo — LoRA rank 8 / 32 / 64 | 9 / 14 / 14% |
| Qwen3.5-35B base | 0% |
| Qwen3.5 — inject @ lr 1e-4 / 3e-5 | 19 / 22% |
Every lever caps low. The best result, a 35B model with a tuned learning rate, recovers 22% of the facts it didn’t already know. The 7B model caps at ~14%. The “66%” from the original benchmark was almost entirely pre-existing knowledge; the genuine new-fact injection, worth roughly 10–20 points, was buried on top of a 60% floor.
What Actually Injects Knowledge
With a discriminating eval, the levers finally separate — and the ranking is sobering:
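- Model scale is the biggest parametric lever. The 35B model recovers 22% vs the 7B’s 14%, though the two also differ in family and architecture (MoE vs dense), so scale is not perfectly isolated. More capacity appears to absorb more facts.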
- Learning rate is real, not just anti-forgetting. On a strong base, lowering the rate (1e-4 → 3e-5) lifts specific-fact recall 19% → 22% — it genuinely writes more facts, not merely preserves old ones.
- Augmentation helps but saturates early (4× ≈ 16×) and is no better than adding rank. Its apparent dominance on the original benchmark was a floor effect.
- Retrieval is the real lever. RAG reaches 78.8% on the original 80-question benchmark, above every injected model, by retrieving the document-specific facts that parametric injection misses. Parametric injection is a lossy compression of the gist; retrieval gives verbatim access.
Practical Takeaway
This is a measurement result first and a methods result second.
- Never evaluate knowledge injection on questions answerable from general knowledge. Filter your eval to facts the base model gets wrong, so the floor is ~0 and the number measures injection — not pre-training. A 60% floor will make every method look like a winner and every architecture look identical.
- For factual knowledge, reach for retrieval. Parametric injection tops out near 20% on facts the model doesn’t know. If you must go parametric (offline, on-device), use the biggest model you can, a tuned-low learning rate, and modest augmentation — and expect ~one-in-five recall.
- Be suspicious of tight clusters. When radically different systems score within noise of each other, suspect the benchmark before you believe the equivalence.
The clustering that looked like a finding was a warning. The eval that exposed it took an afternoon to build and changed the conclusion entirely. That is usually the better afternoon to spend.
Models: OLMo-2-7B (dense), Qwen3.5-35B-A3B (MoE), gpt-oss-120B (reasoning-channel), Qwen3.6-27B (dense), all on Apple Silicon / MLX. Injection: per-document synthetic Q&A → LoRA → merge. Eval: 80-question general benchmark vs a 100-question document-specific set (231 specific QA generated, filtered to the 100 Qwen3.5-35B answered incorrectly). Judge: Qwen3-30B-A3B, temperature 0, audited at 100% / 0% / 0% on reference / hedge / off-topic candidates.