The Thesis

A 3B specialist can beat a 500B generalist in its own domain.

This page holds the peer-reviewed evidence behind Project Nucleus. Every claim has a citation. Every citation has the numbers. If you think the thesis is implausible, start here.

2,000x

Smaller model beat PaLM-540B on e-SNLI

Hsieh et al. 2023

+45

Points: 7B beat GPT-3.5+RAG on HuggingFace API

RAFT, Berkeley 2024

1.3B

Model learned CoT from 175B teacher

SCoTD, ACL 2023

∞ ≠ ∞

Self-play converges to fixed point (proved)

SPIN, ICML 2024

The Question

Can a 3-billion-parameter model actually compete with a 500-billion-parameter frontier model in a domain the frontier model was trained on?

The cost of being wrong about this is the entire Project Nucleus hypothesis. If physics says no, meaning that no amount of chain-of-thought distillation, RAFT training, or adversarial convergence can bridge the gap, then the hundreds of hours of compute behind each Super Skill are wasted. So before anything else, here is the evidence that the physics does not say no.

The Karpathy Thesis

Founding member of OpenAI, director of AI at Tesla, founder of Eureka Labs

“The future is not one giant model. It’s many small models, each a true expert in one thing.”

— Andrej Karpathy, paraphrasing his public commentary across 2023-2024

nanoGPT & micrograd

Karpathy's from-scratch reference implementations show how little code is needed. nanoGPT reproduces GPT-2 (124M) with a training loop of roughly 300 lines and a model definition of roughly 300 lines, on a single node. Small models are not a compromise; they are accessible by design.

github.com/karpathy/nanoGPT ↗

Zero to Hero

“The magic is not in the model — it's in the data.” Karpathy's entire teaching thesis is that the transformer is understandable, and the hard part is corpus quality. The Nucleus KICE 7-layer extraction is built directly on this principle.

karpathy.ai/zero-to-hero ↗

LLM101n — Eureka Labs

Karpathy's announced (2024) course: train an AI Storyteller from scratch. The stated goal is teaching people to build small, domain-specialized LLMs end-to-end. Nucleus is building the expert this course would produce — automated.

github.com/karpathy/LLM101n ↗

Software 3.0

Karpathy's framing (YC AI Startup School, 2025): LLMs are programs. Software 1.0 = code. 2.0 = weights. 3.0 = prompts. The future is many specialized LLMs deployed close to the problem, not one centralized giant.

The Software 3.0 talk ↗

These resources are among the clearest public articulations of the small-specialist thesis. Project Nucleus is an attempt to follow these studies and this vision — to begin the journey of growing domain experts the literature says are possible.

The Theoretical Backstop

The Information Bottleneck Principle

A generalist model has to preserve mutual information across every domain: I(X; Y) — all inputs to all outputs across all human knowledge.

A Super Skill only has to preserve I(X; Y | Domain) — inputs to outputs within a specific domain.

That is a dramatically smaller information surface. The 3B parameters do not have to compete with the 500B parameters. They only have to encode what is relevant to the domain — and most of the 500B model's capacity is spent on things irrelevant to any single domain. Based on Tishby & Zaslavsky's Information Bottleneck framework (2015), recently extended to LLMs in 2024-2026 literature.
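For reference, the Information Bottleneck objective behind this argument can be written as below. The first line is the standard formulation: find a compressed representation T of the input X that keeps as much information about the output Y as possible. The symbols T and β (the trade-off weight) are not used elsewhere on this page and are introduced here for illustration. The second line, which conditions everything on a single domain d, is this page's reading of the framework and an assumption, not a result from the cited paper.

```latex
% Standard Information Bottleneck objective
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y)

% Assumed domain-restricted variant for a single Super Skill (domain D = d)
\min_{p(t \mid x,\, d)} \; I(X; T \mid D = d) \;-\; \beta \, I(T; Y \mid D = d)
```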

The Evidence

Entries are grouped by research pillar. Each one lists its venue and the specific numbers it reports.

Distillation Efficiency

ACL Findings 2023

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Hsieh, Li, Yeh, Nakhost, Fujii, Ratner, Krishna, Lee, Pfister

arXiv ↗

770M T5 beat 540B PaLM in-domain — a 700x parameter reduction

220M model beat PaLM-540B on e-SNLI

2,000x smaller, wins in-domain

770M model beat PaLM-540B on ANLI

700x smaller, using only 80% of the available training data

11B model beat PaLM-540B on CommonsenseQA

About 49x smaller

Standard fine-tuning could not catch up to PaLM even with 100% of data

Chain-of-thought distillation is the differentiator
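The size ratios in this entry are plain division, and can be checked in a few lines. This is a minimal sketch that uses only the parameter counts quoted above; the dictionary keys and the `size_ratio` helper are illustrative names, not anything from the paper.

```python
# Parameter counts as quoted in the entry above.
PALM_PARAMS = 540e9

students = {
    "220M on e-SNLI": 220e6,
    "770M on ANLI": 770e6,
    "11B on CommonsenseQA": 11e9,
}


def size_ratio(teacher_params, student_params):
    """How many times larger the teacher is than the student."""
    return teacher_params / student_params


ratios = {name: size_ratio(PALM_PARAMS, p) for name, p in students.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.0f}x smaller than PaLM-540B")
```

The three ratios come out at roughly 2,455x, 701x, and 49x.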

Chain-of-Thought Transfer

ACL 2023

Symbolic Chain-of-Thought Distillation: Small Models Can Also Think Step-by-Step

Li, Hessel, Yu, Ren, Chang, Choi

arXiv ↗

Showed that chain-of-thought reasoning does not require 50B+ parameters when it is distilled from a larger teacher

OPT-1.3B learned CoT from GPT-3 code-davinci-002 (175B)

QuaRel: 71.2% → 84.9%

OpenBookQA benchmark

50.0% → 67.0%

CommonsenseQA

67.2% with N=30 rationales per instance

IMDB contrast set generalization

81.6% → 92.0% — a 10.4 point robustness gain

Optimal sampling strategy

~30 reasoning chains per instance from teacher
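The sampling recipe above (many reasoning chains per instance, all used as student training targets) can be sketched as follows. This is an illustration under stated assumptions: `toy_teacher` is a hypothetical stand-in for a real teacher model, and the target format ("... So the answer is X.") is made up for the example rather than taken from the paper.

```python
import random


def toy_teacher(question, rng):
    """Hypothetical stand-in for a large teacher; returns (rationale, answer)."""
    answer = rng.choice(["A", "B"])
    rationale = f"Thinking about '{question}', option {answer} fits best."
    return rationale, answer


def build_distillation_set(questions, n_samples=30, seed=0, teacher=toy_teacher):
    """Sample n_samples reasoning chains per question and keep every one."""
    rng = random.Random(seed)
    examples = []
    for question in questions:
        for _ in range(n_samples):
            rationale, answer = teacher(question, rng)
            # The student is trained to produce the rationale, then the answer.
            target = f"{rationale} So the answer is {answer}."
            examples.append({"input": question, "target": target})
    return examples


corpus = build_distillation_set(["Which surface has more friction?"], n_samples=30)
print(len(corpus), "training examples from 1 question")
```

The point of sampling many chains rather than one is that the student sees varied reasoning paths for the same question instead of a single canonical chain.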

Document Grounding (RAFT)

UC Berkeley + Microsoft, 2024

RAFT: Adapting Language Model to Domain Specific RAG

Zhang, Patil, Jain, Shen, Zaharia, Stoica, Gonzalez

arXiv ↗

7B domain model beat GPT-3.5+RAG by 45 points on HuggingFace API docs

RAFT-7B vs GPT-3.5+RAG on HuggingFace API

74.0 vs 29.08 — 45 point win

RAFT-7B vs GPT-3.5+RAG on Torch Hub

84.95 vs 60.21 — 25 point win

RAFT-7B vs GPT-3.5+RAG on TensorFlow

86.86 vs 65.59 — 21 point win

RAFT-7B vs GPT-3.5+RAG on PubMed (medical)

73.30 vs 71.60 — wins on medicine

Only loss: general multi-hop QA (HotpotQA)

Domain specialization is the key variable

Chain-of-thought ablation on HuggingFace

59.07 → 74.00 — CoT adds 14.93 points
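The core RAFT idea is that each training example pairs a question with a mix of documents: sometimes the relevant ("oracle") document plus distractors, sometimes distractors only, with a chain-of-thought answer as the target. The sketch below shows one way to build such examples. The function name, the prompt layout, the `num_docs` and `p_oracle` defaults, and the sample documents are all assumptions for illustration, not the authors' code.

```python
import random


def make_raft_example(question, oracle_doc, distractor_pool, cot_answer,
                      num_docs=4, p_oracle=0.8, rng=None):
    """Build one training example; names and defaults here are hypothetical."""
    if rng is None:
        rng = random.Random()
    include_oracle = rng.random() < p_oracle
    # Keep the context size constant whether or not the oracle is present.
    num_distractors = num_docs - 1 if include_oracle else num_docs
    docs = rng.sample(distractor_pool, num_distractors)
    if include_oracle:
        docs.append(oracle_doc)
    rng.shuffle(docs)  # the oracle should not sit in a predictable position
    context = "\n".join(f"[doc {i}] {doc}" for i, doc in enumerate(docs))
    return {
        "prompt": f"{context}\n\nQuestion: {question}\nAnswer:",
        "completion": cot_answer,
        "has_oracle": include_oracle,
    }


pool = [f"Unrelated note number {i}" for i in range(10)]
example = make_raft_example(
    question="Which call loads a saved widget?",  # made-up domain
    oracle_doc="ORACLE: widget.load(name) loads a saved widget.",
    distractor_pool=pool,
    cot_answer="The oracle document says widget.load(name) loads it. Answer: widget.load",
    rng=random.Random(0),
)
print(example["prompt"])
```

Leaving the oracle out of a fraction of examples is what pushes the model to learn the domain itself rather than only to copy from context.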

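SPIN, ICML 2024

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Chen, Deng, Yuan, Ji, Gu

Proved iterative self-improvement converges to a fixed point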

Zephyr-7B baseline → iteration 1

58.14% → 60.80% (+2.66)

Iteration 1 → iteration 2

+1.32% additional

Three iterations total

58.14% → 63.16% (+5.02)

MT-Bench

5.94 → 6.78 (+0.84)

Iteration-0 SPIN matched DPO + 62k GPT-4 preference pairs

Self-play replaces expensive preference data

Theoretical guarantee

Training reaches a fixed point once the model's outputs match the target data distribution; gains shrink at each iteration on the way there
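The self-play objective can be sketched as a logistic loss on a margin: the new model should raise the likelihood of the real response, relative to the previous iteration, more than it raises the likelihood of the previous iteration's own generated response. The code below is a minimal sketch of that idea for a single example. The argument names and the default `lam=0.1` are assumptions, and the log-probabilities would come from real models in practice.

```python
import math


def spin_loss(logp_new_real, logp_old_real, logp_new_synth, logp_old_synth, lam=0.1):
    """Logistic loss on the margin between real and self-generated responses.

    logp_*_real:  log-probability of the real (target) response
    logp_*_synth: log-probability of the response the old model generated
    new / old:    the model being trained vs. the previous iteration
    """
    margin = lam * ((logp_new_real - logp_old_real)
                    - (logp_new_synth - logp_old_synth))
    # log(1 + exp(-margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# No change from the previous iteration: margin is 0, loss is log(2).
print(spin_loss(-5.0, -5.0, -5.0, -5.0))

# New model prefers the real response more and its old output less: loss drops.
print(spin_loss(-4.0, -5.0, -6.0, -5.0))
```

When the model can no longer tell real responses from its own, the margin stays at zero and there is nothing left to learn, which is the fixed point.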

Adversarial training under pressure beats every distribution-matching baseline

Reframes knowledge distillation as imitation learning

Action-value moment matching

On-policy and off-policy adversarial training

Outperforms all KD baselines

Theoretical backing for adversarial swarm architectures

The weather is the mechanism

Domain Specialization

Microsoft Research Blog 2023

Phi-2: The Surprising Power of Small Language Models

Microsoft Research

arXiv ↗

2.7B model came within 3 points of LLaMA-2-70B on math and beat it on coding

GSM8K math (8-shot)

Phi-2: 61.1% vs LLaMA-2-70B: 64.1% (25x smaller)

HumanEval + MBPP coding

Phi-2: 53.7% vs LLaMA-2 (7B to 70B): 21.0-38.3%

Training key: textbook-quality synthetic data

Quality over quantity

Domain Specialization

EPFL, 2023

MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

Chen, Hernández Cano, Romanou, Bonnet, et al.

arXiv ↗

7B medical model gained 10% average over baselines on medical benchmarks

MEDITRON-7B vs PMC-LLaMA-7B on medical benchmarks

+10% average improvement

MEDITRON-70B competes with GPT-3.5, GPT-4, Med-PaLM (540B), Med-PaLM-2

7.7x smaller approaching frontier

Continued pretraining on PubMed + medical guidelines

Domain-specific corpus is the lever

Chain-of-Thought Transfer

ACL 2023

SCOTT: Self-Consistent Chain-of-Thought Distillation

Wang, Wang, Li, Gao, Yin, Ren

arXiv ↗

Contrastive decoding makes student CoTs faithful to the teacher's reasoning

Self-consistent rationale distillation

Prevents student from shortcut learning

Counterfactual reasoning training

Improves faithfulness of student chains
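Contrastive decoding, in general terms, scores each candidate token by how much more likely it becomes under one condition than under another. Applied to rationale generation, the idea is to prefer tokens the teacher finds more plausible when it can see the answer than when it cannot. The sketch below is one simplified reading of that idea on a toy two-token vocabulary. The additive scoring form, the `alpha` weight, and the log-probabilities are assumptions for illustration, not the paper's exact formulation.

```python
def contrastive_pick(logp_with_answer, logp_without_answer, alpha=1.0):
    """Pick the next rationale token that the answer makes more likely.

    Both arguments map candidate tokens to log-probabilities from the teacher,
    conditioned with and without the answer in context.
    """
    scores = {
        tok: logp_with_answer[tok]
        + alpha * (logp_with_answer[tok] - logp_without_answer[tok])
        for tok in logp_with_answer
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy numbers: "the" is likely either way; "because" only once the answer is known.
with_answer = {"because": -1.0, "the": -0.5}
without_answer = {"because": -3.0, "the": -0.5}

greedy = max(with_answer, key=with_answer.get)
token, scores = contrastive_pick(with_answer, without_answer)
print("greedy:", greedy, "| contrastive:", token)
```

Greedy decoding picks the generic token, while the contrastive score picks the token that actually depends on the answer. That dependence is what ties the rationale to the prediction it is supposed to support.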

Chain-of-Thought Transfer

NeurIPS 2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou

arXiv ↗

The foundational paper establishing that step-by-step prompting elicits reasoning

CoT prompting improved 540B model math accuracy

GSM8K: 17.9% → 56.9%

Originally thought to be an emergent ability at 100B+ scale

Later challenged by SCoTD (see above)
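Step-by-step prompting means each few-shot exemplar shows its reasoning before its answer, so the model continues in the same style. A minimal sketch of assembling such a prompt is below. The `build_cot_prompt` helper, the "Q:/A:" layout, and the exemplar are illustrative assumptions rather than the paper's exact prompt.

```python
def build_cot_prompt(exemplars, question):
    """Join worked exemplars (question, rationale, answer) and the new question."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['rationale']} The answer is {ex['answer']}."
        )
    # The final question is left open so the model writes its own reasoning.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


exemplars = [
    {
        "question": "A shelf holds 5 books. Two boxes of 3 books each are added. "
                    "How many books are on the shelf?",
        "rationale": "The shelf starts with 5 books. 2 boxes of 3 books is 6 books. "
                     "5 + 6 = 11.",
        "answer": "11",
    }
]

prompt = build_cot_prompt(exemplars, "A jar has 4 marbles and gains 3 bags of 2. How many now?")
print(prompt)
```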

Karpathy's from-scratch training reference: reproduces GPT-2 with a training loop of roughly 300 lines

nanoGPT reproduces GPT-2 (124M) on OpenWebText in ~4 days on 8x A100

Validates: small models are fully accessible to individual researchers

Karpathy's thesis: most of the complexity of LLMs is not the model itself

The complexity is the data and training procedure

The minGPT family (minGPT, nanoGPT, makemore) is the bootstrap curriculum

Used directly as Nucleus Track 1 corpus source
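The "124M" figure can be reproduced by counting parameters directly. The sketch below assumes the standard published GPT-2 small configuration (12 layers, 768-dimensional embeddings, 50,257-token vocabulary, 1,024-token context, output head tied to the token embeddings); none of those values are stated on this page, and the function name is illustrative.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024):
    """Count parameters of a GPT-2 style model with a tied output head."""
    wte = vocab_size * n_embd   # token embeddings (shared with the output head)
    wpe = block_size * n_embd   # learned position embeddings
    ln = 2 * n_embd             # LayerNorm weight + bias
    # Attention: fused QKV projection plus output projection, each with bias.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4x width and project back, each with bias.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = ln + attn + ln + mlp
    return wte + wpe + n_layer * block + ln  # trailing ln is the final LayerNorm


total = gpt2_param_count()
print(f"{total:,} parameters (about {total / 1e6:.0f}M)")
```

Roughly a third of the total sits in the embedding table, which is one reason vocabulary choice matters so much at this scale.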

Chain-of-Thought Transfer

YouTube lecture series, 2022-2023

Neural Networks: Zero to Hero — From Micrograd to GPT-2

Andrej Karpathy


Karpathy's teaching thesis: you can build GPT from scratch if you understand the fundamentals

The entire transformer can be built from first principles in 8 lectures

micrograd → makemore → nanoGPT → tokenizer

Karpathy: 'the magic is not in the model — it's in the data'

Directly validates the Nucleus corpus-first architecture

The Zero-to-Hero curriculum is the exact subdomain map for the AI Model Engineer Super Skill

13 Nucleus subdomains trace to Karpathy chapters

Distillation Efficiency

Announced 2024, in development

LLM101n: Let's Build a Storyteller — Eureka Labs

Andrej Karpathy (Eureka Labs)


Karpathy's announced course to train an AI Storyteller from scratch — exactly the Nucleus bootstrap thesis

Karpathy's stated goal: teach people to build small, domain-specialized LLMs end-to-end

Course is in development, material not yet released

The Nucleus Super Skill pipeline is building what LLM101n teaches, automated

We are building the expert the course would produce

Eureka Labs vision: 'AI-native schools' where each domain has its own small expert

Aligned with Nucleus: thousands of specialists, not one giant

Domain Specialization

Talk at YC AI Startup School, 2025

Software 3.0: LLMs as a New Kind of Computer

Andrej Karpathy


Karpathy's framing: LLMs are programs, and the future is many small task-specific ones

Software 1.0 = hand-written code. Software 2.0 = neural net weights. Software 3.0 = natural-language prompts to LLMs

LLMs as a general-purpose programmable substrate

Karpathy: the future is not one giant model — it's many specialized ones, each deployed close to the problem

The anti-centralization thesis, from the former director of AI at Tesla

Directly validates Nucleus: small, owned, domain-specific models are the endgame

Not speculation — stated publicly by a founding OpenAI member

Distillation Efficiency

Public commentary, 2023-2024

On the critical importance of data quality for small models

Andrej Karpathy (various Twitter/X threads)


Karpathy repeatedly: the bottleneck for small models is data quality, not parameter count

'The LLM scaling laws are really about data quality, not compute'

Nucleus KICE 7-layer extraction is designed around this insight

On Phi-2 (2.7B beating 70B): 'This is what textbook-quality data does'

Validates synthetic high-quality corpus generation via teacher probing

'The future is many small models, each a true expert in one thing'

The core Nucleus thesis, stated by the person who coined Software 2.0

What Distillation Cannot Do

Credibility requires naming the limits. These are the known failure modes.

1. The teacher is the ceiling

If the teacher model does not know it, the student cannot learn it. Choice of teacher is choice of ceiling.

2. Out-of-distribution degradation

Small specialists degrade more sharply than generalists on queries outside the domain. This is why the adversarial swarm systematically probes the boundary.

3. Open-ended reasoning vs structured tasks

The cleanest wins for small models are on structured or narrowly-scoped tasks. GPT-4 still dominates on some open-ended medical reasoning.

4. Static corpus brittleness

Training on a frozen corpus produces a brittle student. Nucleus addresses this with AutoResearch rubric evolution and continuous KICE re-extraction.

These are not counterarguments. They are the exact problems the adversarial swarm, AutoResearch, and corpus evolution are designed to solve. They strengthen the case.

The Bottom Line

The physics does not say a 3B specialist cannot beat a 500B generalist in-domain. The physics says it can — and the literature shows it repeatedly does.

Hsieh et al. demonstrated a 700x parameter reduction with better accuracy. RAFT demonstrated a 7B model beating GPT-3.5+RAG by 45 points on HuggingFace API docs. SCoTD showed that chain-of-thought reasoning can be distilled into models far below 50B parameters. SPIN proved self-improvement converges. None of this is speculation.

Nucleus composes every one of these techniques into a single plant-growing apparatus. The open question is not whether it works — the literature answers that. The open question is: how far can a community of growers take this?
