QuKaiZen — Your Intelligence. Forever owned. Always improving.

Filter by research pillar. Papers list their venue; blog posts, lectures, talks, and public commentary are labeled as such. Every result is quoted with its specific number.

Distillation Efficiency · ACL Findings 2023

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Hsieh, Li, Yeh, Nakhost, Fujii, Ratner, Krishna, Lee, Pfister

arXiv ↗

770M T5 beat 540B PaLM in-domain — a 700x parameter reduction

220M model beat PaLM-540B on e-SNLI

2,000x smaller, wins in-domain

770M model beat PaLM-540B on ANLI

700x smaller, using only 80% of the training data

11B model beat PaLM-540B on CommonsenseQA

About 49x smaller

Standard fine-tuning could not catch up to PaLM even with 100% of data

Chain-of-thought distillation is the differentiator
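A minimal sketch of the multi-task setup behind this result, as we understand it: each labeled instance is turned into two training examples, one asking the student for the label and one asking for the teacher's rationale, and the two losses are summed with a weight. The task prefixes, the `Example` class, the function names, and the sample e-SNLI-style instance are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    source: str  # student input, including a task prefix
    target: str  # text the student is trained to generate


def build_multitask_examples(question: str, label: str, rationale: str) -> List[Example]:
    """Turn one labeled instance into two training examples, one per task."""
    return [
        Example(source=f"[label] {question}", target=label),
        Example(source=f"[rationale] {question}", target=rationale),
    ]


def combined_loss(label_loss: float, rationale_loss: float, lam: float = 1.0) -> float:
    """Label-prediction loss plus a weighted rationale-generation loss."""
    return label_loss + lam * rationale_loss


# Hypothetical instance; the rationale would come from the large teacher model.
pairs = build_multitask_examples(
    question="premise: A man plays guitar. hypothesis: A man makes music.",
    label="entailment",
    rationale="Playing guitar is a way of making music.",
)
for ex in pairs:
    print(ex.source, "->", ex.target)
```

At inference time only the label task is needed, so the student pays no extra cost for having learned to produce rationales.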

Chain-of-Thought Transfer · ACL 2023

Symbolic Chain-of-Thought Distillation: Small Models Can Also Think Step-by-Step

Li, Hessel, Yu, Ren, Chang, Choi

arXiv ↗

Disproved the claim that chain-of-thought reasoning requires 50B+ parameters

OPT-1.3B learned CoT from GPT-3 code-davinci-002 (175B)

QuaRel: 71.2% → 84.9%

OpenBookQA benchmark

50.0% → 67.0%

CommonsenseQA

67.2% with N=30 rationales per instance

IMDB contrast set generalization

81.6% → 92.0% — a 10.4 point robustness gain

Optimal sampling strategy

~30 reasoning chains per instance from teacher
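The sampling step can be sketched as follows: for every training question, draw many reasoning chains from the teacher and keep each one as a separate student training example. `toy_teacher` is a hypothetical stand-in for a real teacher model sampled at non-zero temperature, and the questions and field names are assumptions for illustration only.

```python
import random
from typing import Callable, Dict, List


def toy_teacher(question: str, rng: random.Random) -> str:
    """Hypothetical teacher: returns one sampled reasoning chain ending in an answer."""
    chains = [
        "Less friction means less resistance, so the object moves farther. Answer: (A)",
        "A smoother surface slows the object less, so it travels longer. Answer: (A)",
        "Rough surfaces create more friction, which shortens the distance. Answer: (A)",
    ]
    return rng.choice(chains)


def build_student_corpus(
    questions: List[str],
    teacher: Callable[[str, random.Random], str],
    n_per_instance: int = 30,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Sample n_per_instance chains per question and flatten them into training pairs."""
    rng = random.Random(seed)
    corpus = []
    for q in questions:
        for _ in range(n_per_instance):
            corpus.append({"prompt": q, "completion": teacher(q, rng)})
    return corpus


questions = [
    "A ball rolls on ice and on carpet. On which surface does it roll farther? (A) ice (B) carpet",
    "A sled slides on snow and on gravel. Where does it stop sooner? (A) gravel (B) snow",
]
corpus = build_student_corpus(questions, toy_teacher)
print(len(corpus), "training examples from", len(questions), "questions")
```

The student is then fine-tuned on the flattened corpus with an ordinary language-modeling objective, so diversity comes from the number of sampled chains rather than from any change to the loss.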

Document Grounding (RAFT) · UC Berkeley + Microsoft, 2024

RAFT: Adapting Language Model to Domain Specific RAG

Zhang, Patil, Jain, Shen, Zaharia, Stoica, Gonzalez

arXiv ↗

7B domain model beat GPT-3.5+RAG by 45 points on HuggingFace API docs

RAFT-7B vs GPT-3.5+RAG on HuggingFace API

74.0 vs 29.08 — 45 point win

RAFT-7B vs GPT-3.5+RAG on Torch Hub

84.95 vs 60.21 — 25 point win

RAFT-7B vs GPT-3.5+RAG on TensorFlow

86.86 vs 65.59 — 21 point win

RAFT-7B vs GPT-3.5+RAG on PubMed (medical)

73.30 vs 71.60 — wins on medicine

Only loss: general multi-hop QA (HotpotQA)

Domain specialization is the key variable

Chain-of-thought ablation on HuggingFace

59.07 → 74.00 — CoT adds 14.93 points
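A rough sketch of how a RAFT-style training example might be assembled: the question is paired with retrieved documents, most of the time including the document that actually contains the answer (the oracle) alongside distractors, and sometimes with distractors only, while the target is a chain-of-thought answer. The function name, field names, default values, and the sample documents are assumptions for illustration, not the paper's code.

```python
import random
from typing import Dict, List, Union


def build_raft_example(
    question: str,
    oracle_doc: str,
    distractor_pool: List[str],
    cot_answer: str,
    rng: random.Random,
    num_distractors: int = 3,
    p_oracle: float = 0.8,
) -> Dict[str, Union[str, bool]]:
    """Build one prompt/completion pair with or without the oracle document."""
    include_oracle = rng.random() < p_oracle
    # Keep the context size constant: swap the oracle for one more distractor.
    k = num_distractors if include_oracle else num_distractors + 1
    docs = rng.sample(distractor_pool, k)
    if include_oracle:
        docs.append(oracle_doc)
    rng.shuffle(docs)
    context = "\n\n".join(f"[doc {i}] {d}" for i, d in enumerate(docs))
    return {
        "prompt": f"{context}\n\nQuestion: {question}",
        "completion": cot_answer,
        "has_oracle": include_oracle,
    }


rng = random.Random(0)
example = build_raft_example(
    question="Which call loads a pretrained sentiment model?",
    oracle_doc="pipeline('sentiment-analysis') loads a default sentiment model.",
    distractor_pool=[f"Unrelated API note {i}." for i in range(10)],
    cot_answer="The docs state that pipeline('sentiment-analysis') loads it. Answer: pipeline",
    rng=rng,
)
print(example["prompt"])
```

Training on a mix of oracle and distractor-only contexts is what pushes the model to both read the right document and ignore the wrong ones.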

Convergence Guarantees · ICML 2024

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Chen, Deng, Yuan, Ji, Gu

arXiv ↗

Proved iterative self-improvement converges to a fixed point

Zephyr-7B baseline → iteration 1

58.14% → 60.80% (+2.66)

Iteration 1 → iteration 2

+1.32% additional

Three iterations total

58.14% → 63.16% (+5.02)

MT-Bench

5.94 → 6.78 (+0.84)

Iteration-0 SPIN matched DPO + 62k GPT-4 preference pairs

Self-play replaces expensive human preference data

Theoretical guarantee

Each iteration improves less than the one before it, and training converges to a fixed point
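A sketch of the per-example self-play objective as we understand it: the new model is rewarded for raising the likelihood of the human-written response relative to the previous iteration, and for lowering the likelihood of the response the previous iteration generated itself. The argument names and the default weight `lam` are assumptions, and the log-probabilities here are plain numbers rather than outputs of a real model.

```python
import math


def spin_loss(
    logp_new_real: float,
    logp_old_real: float,
    logp_new_synth: float,
    logp_old_synth: float,
    lam: float = 0.1,
) -> float:
    """Logistic loss on the gap between real-data and self-generated log-ratios."""
    real_gain = logp_new_real - logp_old_real      # how much more the new model likes real data
    synth_gain = logp_new_synth - logp_old_synth   # how much more it likes its own old samples
    margin = lam * (real_gain - synth_gain)
    return math.log1p(math.exp(-margin))


# At the start of an iteration the new model equals the old one, so the margin is zero.
print(spin_loss(-5.0, -5.0, -7.0, -7.0))
# After an update that favors real data over self-generated data, the loss drops.
print(spin_loss(-4.0, -5.0, -8.0, -7.0))
```

Once the model's own samples are indistinguishable from the real data, the two log-ratios move together and the margin can no longer grow, which is the intuition behind the fixed point.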

Adversarial Training · NeurIPS 2024

Adversarial Moment-Matching Distillation of Large Language Models

Chen Jia

arXiv ↗

Adversarial training under pressure beats every distribution-matching baseline

Reframes knowledge distillation as imitation learning

Action-value moment matching

On-policy and off-policy adversarial training

Outperforms all KD baselines

Theoretical backing for adversarial swarm architectures

The weather is the mechanism

Domain Specialization · Microsoft Research Blog 2023

Phi-2: The Surprising Power of Small Language Models

Microsoft Research

Source ↗

2.7B model matched LLaMA-2-70B on math and coding

GSM8K math (8-shot)

Phi-2: 61.1% vs LLaMA-2-70B: 64.1% (25x smaller)

HumanEval + MBPP coding

Phi-2: 53.7% vs LLaMA-2: 21.0-38.3%

Training key: textbook-quality synthetic data

Quality over quantity

Domain Specialization · EPFL, 2023

MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

Chen, Cherkaoui, Köpf, Schärli, Oliva, Ibrahim, Hartley, Sallinen, Pagliardini, Hassani, Bosselut, Bommasani, Salathé, Jaggi

arXiv ↗

7B medical model gained 10% average over baselines on medical benchmarks

MEDITRON-7B vs PMC-LLaMA-7B on medical benchmarks

+10% average improvement

MEDITRON-70B competes with GPT-3.5, GPT-4, Med-PaLM (540B), Med-PaLM-2

7.7x smaller approaching frontier

Continued pretraining on PubMed + medical guidelines

Domain-specific corpus is the lever

Chain-of-Thought Transfer · ACL 2023

SCOTT: Self-Consistent Chain-of-Thought Distillation

Wang, Wang, Li, Gao, Yin, Ren

arXiv ↗

Contrastive decoding makes student CoTs faithful to the teacher's reasoning

Self-consistent rationale distillation

Prevents student from shortcut learning

Counterfactual reasoning training

Improves faithfulness of student chains
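One way to picture the contrastive decoding step: when the teacher writes a rationale, each candidate token is scored by how much more likely it becomes when the teacher is shown the correct answer than when it is shown a perturbed one, so generic tokens that would appear under any answer lose out. The scoring rule, function name, and toy log-probabilities below are assumptions for illustration, not the paper's exact formulation.

```python
from typing import Dict


def contrastive_pick(
    logp_with_answer: Dict[str, float],
    logp_with_perturbed: Dict[str, float],
) -> str:
    """Pick the token whose plausibility grows most when the true answer is given."""

    def score(token: str) -> float:
        growth = logp_with_answer[token] - logp_with_perturbed[token]
        return logp_with_answer[token] + growth

    return max(logp_with_answer, key=score)


# Toy next-token log-probabilities under the gold answer vs a perturbed answer.
with_gold = {"the": -0.5, "balloon": -1.0, "it": -2.0}
with_perturbed = {"the": -0.5, "balloon": -3.0, "it": -2.0}

greedy = max(with_gold, key=with_gold.get)
contrastive = contrastive_pick(with_gold, with_perturbed)
print("greedy:", greedy, "| contrastive:", contrastive)
```

Rationales built from answer-sensitive tokens give the student less room to ignore the reasoning and guess the label directly.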

Chain-of-Thought Transfer · NeurIPS 2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou

arXiv ↗

The foundational paper establishing that step-by-step prompting elicits reasoning

CoT prompting improved 540B model math accuracy

GSM8K: 17.9% → 56.9%

Originally thought to be an emergent ability at 100B+ scale

Later disproved by SCoTD (see above)
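For readers unfamiliar with the technique, a chain-of-thought prompt is an ordinary few-shot prompt whose worked examples include the intermediate reasoning before the final answer. The sketch below assembles one; the helper name is an assumption, the exemplar follows the style of the well-known tennis-ball example, and no model is called.

```python
from typing import List, Tuple

# Each exemplar pairs a question with a worked, step-by-step answer.
EXEMPLARS: List[Tuple[str, str]] = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
        "5 + 6 = 11. The answer is 11.",
    ),
]


def build_cot_prompt(question: str, exemplars: List[Tuple[str, str]]) -> str:
    """Concatenate worked exemplars, then the new question with an open answer slot."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


prompt = build_cot_prompt(
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?",
    EXEMPLARS,
)
print(prompt)
```

The only difference from standard few-shot prompting is the reasoning text inside each exemplar answer; the model is left to imitate that pattern for the new question.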

Domain Specialization · GitHub, ongoing

nanoGPT — The Simplest, Fastest Repository for Training Medium-Sized GPTs

Andrej Karpathy

Source ↗

Karpathy's from-scratch training reference: a ~300-line training loop and a ~300-line model definition that together reproduce GPT-2

nanoGPT reproduces GPT-2 (124M) on OpenWebText in ~4 days on 8x A100

Validates: small models are fully accessible to individual researchers

Karpathy's thesis: most of the complexity of LLMs is not the model itself

The complexity is the data and training procedure

The minGPT family (minGPT, nanoGPT, makemore) is the bootstrap curriculum

Used directly as Nucleus Track 1 corpus source

Chain-of-Thought Transfer · YouTube lecture series, 2022-2023

Neural Networks: Zero to Hero — From Micrograd to GPT-2

Andrej Karpathy

Source ↗

Karpathy's teaching thesis: you can build GPT from scratch if you understand the fundamentals

The entire transformer can be built from first principles in 8 lectures

micrograd → makemore → nanoGPT → tokenizer

Karpathy: 'the magic is not in the model — it's in the data'

Directly validates the Nucleus corpus-first architecture

The Zero-to-Hero curriculum is the exact subdomain map for the AI Model Engineer Super Skill

13 Nucleus subdomains trace to Karpathy chapters

Distillation Efficiency · Announced 2024, in development

LLM101n: Let's Build a Storyteller — Eureka Labs

Andrej Karpathy (Eureka Labs)

Source ↗

Karpathy's announced course to train an AI Storyteller from scratch — exactly the Nucleus bootstrap thesis

Karpathy's stated goal: teach people to build small, domain-specialized LLMs end-to-end

Course is in development, material not yet released

The Nucleus Super Skill pipeline is building what LLM101n teaches, automated

We are building the expert the course would produce

Eureka Labs vision: 'AI-native schools' where each domain has its own small expert

Aligned with Nucleus: thousands of specialists, not one giant

Domain Specialization · Talk at Sequoia AI Ascent, 2024

Software 3.0: LLMs as a New Kind of Computer

Andrej Karpathy

Source ↗

Karpathy's framing: LLMs are programs, and the future is many small task-specific ones

Software 1.0 = hand-written code. Software 2.0 = neural net weights. Software 3.0 = natural-language prompts to LLMs

LLMs as a general-purpose programmable substrate

Karpathy: the future is not one giant model — it's many specialized ones, each deployed close to the problem

The anti-centralization thesis, from the person who built Tesla Autopilot

Directly validates Nucleus: small, owned, domain-specific models are the endgame

Not speculation — stated publicly by a founding OpenAI member

Distillation Efficiency · Public commentary, 2023-2024

On the critical importance of data quality for small models

Andrej Karpathy (various Twitter/X threads)

Source ↗

Karpathy repeatedly: the bottleneck for small models is data quality, not parameter count

'The LLM scaling laws are really about data quality, not compute'

Nucleus KICE 7-layer extraction is designed around this insight

On Phi-2 (2.7B beating 70B): 'This is what textbook-quality data does'

Validates synthetic high-quality corpus generation via teacher probing

'The future is many small models, each a true expert in one thing'

The core Nucleus thesis, stated by the person who coined Software 2.0

A 3B specialist can beat a 500B generalist in its own domain.
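The size gaps quoted on this page are simple parameter-count ratios, and they can be checked with a few lines of arithmetic. The parameter counts below are the nominal figures named in the entries above; the helper name and labels are illustrative.

```python
def size_ratio(large_params: float, small_params: float) -> float:
    """How many times larger the big model is, by nominal parameter count."""
    return large_params / small_params


COMPARISONS = [
    ("PaLM-540B vs 220M model", 540e9, 220e6),
    ("PaLM-540B vs 770M model", 540e9, 770e6),
    ("PaLM-540B vs 11B model", 540e9, 11e9),
    ("LLaMA-2-70B vs Phi-2 (2.7B)", 70e9, 2.7e9),
    ("Med-PaLM (540B) vs MEDITRON-70B", 540e9, 70e9),
]

for name, large, small in COMPARISONS:
    print(f"{name}: about {size_ratio(large, small):,.1f}x")
```

Parameter ratios say nothing about accuracy on their own; they only put the benchmark comparisons above in proportion.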

The Question

The Karpathy Thesis

The Theoretical Backstop

The Information Bottleneck Principle

The Evidence

What Distillation Cannot Do

1. The teacher is the ceiling

2. Out-of-distribution degradation

3. Open-ended reasoning vs structured tasks

4. Static corpus brittleness

The Bottom Line