OpenThoughts-114k Was Removed From the Hub. Here's the Substitute.

May 2026 · Black Sheep AI Research

One of the strongest open reasoning-SFT datasets, OpenThoughts/OpenThoughts-114k, returned a DatasetNotFoundError on May 4, 2026. Here's a short filter on allenai/tulu-3-sft-mixture that recovers a comparable training signal, and a longer note on why relying on a single curated reasoning corpus made the field fragile.

What broke

On 2026-05-04 we tried to load:

from datasets import load_dataset
ds = load_dataset("OpenThoughts/OpenThoughts-114k", split="train")

and got:

DatasetNotFoundError: Dataset 'OpenThoughts/OpenThoughts-114k' doesn't exist on the Hub or cannot be accessed.

The repository has been removed, made private, or renamed. The dataset card is no longer in our Hugging Face cache, and the URL huggingface.co/datasets/OpenThoughts/OpenThoughts-114k returns a 404 at the time of writing.

If your reasoning fine-tuning pipeline depended on it, your code is broken until you swap in a substitute.

What OpenThoughts-114k was

For context: 114,000 curated reasoning traces (math problems, code completion, science Q&A, multi-step puzzles), each with a chain-of-thought solution distilled from DeepSeek-R1. It was the canonical "reasoning SFT mix" for the public-LLM compression and fine-tuning communities through late 2025 and early 2026, and served as the SFT corpus for many published distillation studies.

The substitute that works

allenai/tulu-3-sft-mixture (~939,000 examples) is the closest direct replacement available on the Hub today. It's much larger and more general (it includes chat instruction-following, persona content, and code), but it has a source field that lets you filter down to a reasoning-heavy subset:

from datasets import load_dataset

ds = load_dataset("allenai/tulu-3-sft-mixture", split="train")

reasoning_sources = {
    "flan_v2",
    "math_v6",
    "code_alpaca",
    "open_orca",
    "personahub_math_v5_regen_149960",
}

ds = ds.filter(
    lambda x: any(s in str(x.get("source", "")) for s in reasoning_sources),
    num_proc=4,
)

# The filter should leave ~150-200k reasoning-heavy examples.
# Subsample to 50k for parity with OpenThoughts; min() guards against
# a filter that returns fewer rows than expected.
ds = ds.shuffle(seed=42).select(range(min(50_000, len(ds))))
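Because the filter above matches names as substrings, a name in reasoning_sources that matches nothing is silently ignored. It may be worth checking which source values actually matched before trusting the row count. The following is a minimal sketch, assuming each row is dict-like with a string source field; the helper name source_match_report and the toy rows are hypothetical and only stand in for the real dataset.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, Mapping


def source_match_report(
    rows: Iterable[Mapping[str, object]],
    patterns: Iterable[str],
) -> tuple[Counter, set[str]]:
    """Count matching rows per `source` value (substring match) and
    return the patterns that matched no row at all."""
    patterns = list(patterns)
    matched_counts: Counter = Counter()
    used: set[str] = set()
    for row in rows:
        source = str(row.get("source", ""))
        hits = [p for p in patterns if p in source]
        if hits:
            matched_counts[source] += 1
            used.update(hits)
    return matched_counts, set(patterns) - used


# Toy rows standing in for the real dataset (hypothetical values).
toy_rows = [
    {"source": "org/flan_v2_converted"},
    {"source": "org/flan_v2_converted"},
    {"source": "org/chat_data"},
]
counts, unused = source_match_report(toy_rows, {"flan_v2", "math_v6"})
print(counts)  # rows kept, grouped by their full source value
print(unused)  # patterns that never matched and can be dropped or corrected
```

On the real mixture you could pass the loaded Dataset itself, since iterating it yields dict rows. This walks every row in Python, so expect it to be slow on the full ~939k examples.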

Why this works as a substitute

The reasoning traces in flan_v2, math_v6, and the personahub_math slice are the same kind of content OpenThoughts curated: multi-step problem-solution pairs with chain-of-thought reasoning. They were generated by different upstream models (R1 for OpenThoughts, a mix including GPT-4 and Claude for the tulu-3 sources), but the task structure is identical: give the model a problem, have it produce a worked solution.

What you lose:

  1. R1-specific reasoning patterns. OpenThoughts traces were written in R1's idiom: <think>-tagged scratch space and characteristic phrasing. If your downstream evaluation specifically rewards R1-style output, the tulu-3 substitution will underperform there. Most benchmarks (MATH, GPQA, MMLU) don't.
  2. Single-source consistency. OpenThoughts had one teacher; tulu-3-reasoning has ~5 different upstream sources, so the training signal is noisier per example. Compensate with more data: we recommend ~150k examples from the filtered set for the same effective signal as 50k OpenThoughts examples.
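The first point matters mainly when an evaluator or downstream parser expects the <think>...</think> layout. Below is a minimal sketch of separating the scratch space from the final answer, so that R1-style and plain traces can be compared on the answer alone. It assumes reasoning is wrapped in a single <think> block; the exact OpenThoughts trace format is not reproduced here, and the function name split_think is hypothetical.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text: str) -> tuple[str, str]:
    """Return (reasoning, answer). Reasoning is '' when no <think> block exists."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # The answer is everything outside the matched block.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


r1_style = "<think>2 + 2 is 4.</think>The answer is 4."
plain_style = "2 + 2 = 4, so the answer is 4."
print(split_think(r1_style))
print(split_think(plain_style))
```

If a scorer only inspects the answer half, the difference in idiom between the two corpora should not affect it.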

What you gain:

  1. Available. It's still on the Hub.
  2. Larger. ~5× more data after filtering, more if you broaden the source filter.
  3. Better licence clarity. allenai/tulu-3-sft-mixture has a clearly stated ODC-BY licence; OpenThoughts-114k's licence terms shifted multiple times during its lifetime.

Why the field had a single point of failure

OpenThoughts-114k was the single curated reasoning corpus that everyone reached for.

When it disappeared, the field had no fallback. There was no "well, we'll use OpenThoughts-95k instead" because the substitutes are either much smaller (NuminaMath, ~10k) or much larger and more general (tulu-3, requires filtering).

This is a recurring pattern in open ML: a critical dataset becomes load-bearing for hundreds of downstream studies, then changes status overnight because the maintainer changes the terms, the dataset is taken down for licensing reasons, or the upstream model's API changes.

What to do for resilience

Three rules we now apply to every pipeline that depends on a public dataset:

  1. Pin the revision. load_dataset("...", revision="<commit_hash>") is the only reproducible form. Loading without a pinned commit tracks whatever is current, so you may get a different dataset next month.
  2. Mirror to local first. Once you've decided on a corpus, call dataset.save_to_disk("./local/path") and check the result in alongside the experiment scripts. Disk is cheap; the dataset is small after tokenization.
  3. Document the fallback. If a dataset disappears, what's the substitute? Write it down in your prepare_data.py now, not when the load fails.
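Rules 1 and 3 can be combined into a small ordered list of pinned dataset specs that the loader walks until one succeeds. The following is a sketch under stated assumptions, not the contents of our prepare_data.py: DatasetSpec, load_first_available, and the example/... repo ids are hypothetical names, and the loader is injected so the logic can be exercised without network access.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass(frozen=True)
class DatasetSpec:
    repo_id: str
    revision: Optional[str] = None  # commit hash; None means "whatever is current"
    split: str = "train"


def load_first_available(
    specs: Sequence[DatasetSpec],
    loader: Callable[..., Any],
    missing_errors: tuple[type[BaseException], ...] = (FileNotFoundError,),
) -> tuple[DatasetSpec, Any]:
    """Try each spec in order and return the first one that loads."""
    failures = []
    for spec in specs:
        try:
            data = loader(spec.repo_id, split=spec.split, revision=spec.revision)
        except missing_errors as err:
            failures.append(f"{spec.repo_id}: {err}")
            continue
        return spec, data
    raise FileNotFoundError("no dataset could be loaded: " + "; ".join(failures))


# Stand-in loader so the example runs offline (hypothetical behaviour).
def fake_loader(repo_id: str, split: str, revision: Optional[str]) -> list[str]:
    if repo_id == "example/primary":
        raise FileNotFoundError(f"{repo_id} not found")
    return [f"{repo_id}@{revision}:{split}"]


specs = [
    DatasetSpec("example/primary", revision="<commit_hash>"),
    DatasetSpec("example/fallback", revision="<commit_hash>"),
]
used, data = load_first_available(specs, fake_loader)
print(used.repo_id)  # record which spec was used next to the experiment results
```

With the real library you would pass datasets.load_dataset as loader, since it accepts split and revision keyword arguments. The default missing_errors assumes the library's not-found error derives from FileNotFoundError; check that against your installed version, or pass the exception class explicitly. Logging which spec was used matters because a fallback changes the training distribution.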

A small irony

The reason OpenThoughts-114k was so widely used is that it was the most curated of the open reasoning corpora: a small team had hand-filtered, deduplicated, and quality-checked it. That curation work made it valuable, but it also meant a single team had operational control. Larger, less-curated corpora like tulu-3 or RedPajama don't have that single point of failure: they're maintained by larger institutions or are scrape-based, so their availability is more durable even if the per-example quality is lower.

For the next 12 months, plan around the assumption that any single curated dataset can disappear with 30 days' notice or less. Pin revisions, mirror locally, document substitutes.


Source: observed 2026-05-04 in our reasoning-SFT pipeline; substitution recipe is what we shipped in prepare_data.py.
