From seed to DNA
The full journey of a Super Skill: from the moment you plant the domain to the moment it graduates with a cryptographic DNA you can verify forever. This is what happens between “I want an expert in X” and “here is your permanently owned, domain-deep, air-gapped shard of knowledge.”
Day 0 — you plant the domain
A user picks a domain on /start. 'Linux Kernel Engineering.' 'AI Model Engineer.' 'Internal knowledge of my farm.' One sentence. The pipeline wakes up. The run has a unique run_id now — this becomes part of the DNA.
Hours 0-12 — autonomous agents build the corpus
The teacher probe fires first: it interrogates DeepSeek-R1 about every subdomain in the ontology. Paper scouts query arXiv. Repo scouts pull from known GitHub sources. Doc scouts scrape framework docs. KICE runs its 7-layer extraction on everything that arrives. The corpus grows from zero to thousands of structured reasoning examples before the human has had lunch.
Day 1+ — the user adds esoteric sources when ready
POST /corpus/ingest accepts text, URLs and internal docs: whatever makes this domain unique to the user. Each upload triggers KICE re-extraction, and the corpus version increments. The pipeline never blocks on human input. This user-supplied (Mode 2) data is what makes the resulting model truly owned: it knows things no public model has ever seen.
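A minimal sketch of the ingest bookkeeping described above, assuming a hypothetical in-memory `Corpus` class (the real service behind POST /corpus/ingest isn't specified here). Each call records the source, bumps the corpus version, and enqueues that version for KICE re-extraction instead of running it inline, which is one way to keep the pipeline from blocking on uploads.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Corpus:
    """Hypothetical in-memory stand-in for the store behind POST /corpus/ingest."""
    version: int = 0
    sources: list = field(default_factory=list)
    # Re-extraction jobs are queued rather than run inline, so ingest never blocks.
    extraction_queue: queue.Queue = field(default_factory=queue.Queue)

    def ingest(self, kind: str, payload: str) -> int:
        """Record one user source ("text", "url" or "doc") and return the new corpus version."""
        if kind not in {"text", "url", "doc"}:
            raise ValueError(f"unsupported source kind: {kind!r}")
        self.sources.append({"kind": kind, "payload": payload})
        self.version += 1
        # A separate (hypothetical) KICE worker would consume this queue.
        self.extraction_queue.put(self.version)
        return self.version


corpus = Corpus()
corpus.ingest("url", "https://example.com/farm-notes")
print(corpus.ingest("text", "North field irrigation schedule"))  # 2
```

Decoupling the upload from extraction through a queue means a slow re-extraction never holds up the HTTP request or the autonomous scouts.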
Days 1-7 — the student absorbs teacher reasoning
The synthesizer takes each KICE extraction example and sends its reasoning prompt to the teacher. The teacher responds with <think> blocks containing structured, step-by-step reasoning. These become training pairs: (question, oracle docs, distractors, CoT steps, answer). This is Symbolic Chain-of-Thought Distillation combined with Retrieval-Augmented Fine-Tuning (RAFT). The student absorbs how to reason, not just what the answers are.
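The training-pair format above can be sketched as a small dataclass plus a parser for the teacher's output. `TrainingPair` and `build_pair` are hypothetical names, and the parsing rules are assumptions for illustration only: each non-empty line inside the `<think>` block counts as one reasoning step, and everything after `</think>` is the final answer.

```python
import re
from dataclasses import dataclass

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


@dataclass
class TrainingPair:
    question: str
    oracle_docs: list[str]
    distractors: list[str]
    cot_steps: list[str]
    answer: str


def build_pair(question: str, oracle_docs: list[str], distractors: list[str],
               teacher_output: str) -> TrainingPair:
    match = THINK_RE.search(teacher_output)
    if match is None:
        raise ValueError("teacher output has no <think> block")
    # Assumption: one non-empty line inside <think> is one reasoning step.
    steps = [line.strip() for line in match.group(1).splitlines() if line.strip()]
    # Assumption: the text after </think> is the final answer.
    answer = teacher_output[match.end():].strip()
    return TrainingPair(question, list(oracle_docs), list(distractors), steps, answer)


pair = build_pair(
    "Why does the build fail?",
    oracle_docs=["<relevant doc>"],
    distractors=["<unrelated doc>"],
    teacher_output="<think>\nRead the error message.\nCompare it with the oracle doc.\n</think>\nA missing header.",
)
print(pair.cot_steps)  # ['Read the error message.', 'Compare it with the oracle doc.']
```

Keeping the distractors alongside the oracle docs is what makes this RAFT-style: the student learns to reason from the right document while ignoring the wrong ones.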
Days 7-30 — the adversarial swarm stress-tests the plant
The Interrogator sends rain (deep domain probes). The Adversary sends wind (twisted scenarios and traps). The Evaluator measures how the reasoning holds. The Corrector creates new training data targeting every failure. AutoResearch changes the seasons — evolving the rubric based on where the model keeps getting hit. This is the exercise that turns a fragile student into something that survives storms.
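One round of the swarm can be sketched with each role as a plain callable. Everything here is hypothetical: the role signatures, the 0.5 pass mark, and the AutoResearch stand-in, which simply reweights rubric categories in proportion to how often the model failed them (the document doesn't specify how rubrics actually evolve).

```python
from collections import Counter
from typing import Callable

Item = dict  # e.g. {"prompt": "explain RCU", "category": "locking"}


def swarm_round(
    interrogate: Callable[[], list[Item]],  # Interrogator: deep domain probes ("rain")
    attack: Callable[[], list[Item]],       # Adversary: twisted scenarios and traps ("wind")
    evaluate: Callable[[Item], float],      # Evaluator: how well the reasoning held, in [0, 1]
    correct: Callable[[Item], Item],        # Corrector: new training example for a failure
    rubric: dict[str, float],               # category -> weight, summing to 1
    pass_mark: float = 0.5,
) -> tuple[list[Item], dict[str, float]]:
    """Run one hypothetical swarm round; return new training data and the evolved rubric."""
    failures = [item for item in interrogate() + attack() if evaluate(item) < pass_mark]
    new_training_data = [correct(item) for item in failures]
    # AutoResearch stand-in: shift weight toward categories the model keeps failing.
    hits = Counter(item["category"] for item in failures)
    raw = {cat: weight * (1 + hits[cat]) for cat, weight in rubric.items()}
    total = sum(raw.values())
    return new_training_data, {cat: w / total for cat, w in raw.items()}


data, rubric = swarm_round(
    interrogate=lambda: [{"prompt": "p1", "category": "scheduling"}],
    attack=lambda: [{"prompt": "t1", "category": "locking"}, {"prompt": "t2", "category": "locking"}],
    evaluate=lambda item: 0.0 if item["category"] == "locking" else 1.0,
    correct=lambda item: {"prompt": item["prompt"], "category": item["category"], "fix": True},
    rubric={"scheduling": 0.5, "locking": 0.5},
)
print(len(data), rubric)  # 2 {'scheduling': 0.25, 'locking': 0.75}
```

The evolved rubric feeds the next round, so the swarm spends more of its probes where the model is weakest.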
Days 14-30+ — when the storm runs out of new weather
Graduation isn't a test. It's the adversarial swarm giving up trying to break the model: AutoResearch can't find a new angle of attack, and the model scores above 95% on the Adversary's best traps. SPIN (ICML 2024) showed that self-play fine-tuning reaches its optimum when the model's outputs become indistinguishable from the target data, after which further rounds stop producing improvement. The improvement curve asymptotes, and the pipeline knows it's done.
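The stopping rule can be sketched as a check over per-round history. The 95% trap score comes from the text; the three-round window, the 0.005 plateau tolerance, and the record fields (`best_trap_score`, `new_attack_angles`) are assumptions made for illustration.

```python
def has_graduated(history: list[dict], trap_threshold: float = 0.95,
                  window: int = 3, epsilon: float = 0.005) -> bool:
    """history: one record per swarm round, oldest first, each with
    'best_trap_score' (model's score on the Adversary's best traps, 0-1) and
    'new_attack_angles' (new failure categories AutoResearch found that round)."""
    if len(history) < window + 1:
        return False  # not enough rounds to judge a plateau
    recent = history[-window:]
    # 1. AutoResearch found no new angle of attack for `window` rounds running.
    no_new_angles = all(r["new_attack_angles"] == 0 for r in recent)
    # 2. The model beats the Adversary's best traps (threshold from the text).
    beats_traps = recent[-1]["best_trap_score"] > trap_threshold
    # 3. The improvement curve has flattened: total gain over the window < epsilon.
    gain = recent[-1]["best_trap_score"] - history[-window - 1]["best_trap_score"]
    return no_new_angles and beats_traps and gain < epsilon


rounds = [
    {"best_trap_score": s, "new_attack_angles": a}
    for s, a in [(0.80, 4), (0.90, 2), (0.956, 0), (0.958, 0), (0.959, 0), (0.960, 0)]
]
print(has_graduated(rounds))      # True
print(has_graduated(rounds[:4]))  # False: still finding new angles and still climbing
```

Requiring all three conditions at once avoids graduating a model that merely had one lucky round against the traps.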
Post-graduation — external validation before the DNA
Gate 1 (General Capability Regression): the graduated model must retain ≥85% of the base model's scores on MMLU, HellaSwag, ARC, GSM8K and IFEval. It can't have forgotten how to reason in general. Gate 2 (Domain Mastery): it must match or exceed the teacher on a held-out domain benchmark. Gate 3 (Hallucination Audit): a hallucination rate below 2% on domain probes and zero fabricated entities. Fail any gate and you don't get the DNA.
Immutable. Cryptographic. Owned forever.
When all three gates pass, the Nucleus Seal is minted. It's an Ed25519 signature over the complete DNA chain: teacher model hash, corpus manifest hash, pipeline config, AutoResearch final report, post-graduation QA results, and the final model weights. The DNA is verifiable by anyone with the public key. If anything in the chain is tampered with, the DNA breaks. This is how you prove a model's provenance in a world where 'trust me' doesn't scale.
The graduated Super Skill
The shard lives on your machine. It runs on commodity hardware: a MacBook, a workstation, an edge server. It knows your domain at a depth no generalist model can match, because a generalist spreads its parameter budget across every domain instead of concentrating it on yours. It doesn't phone home. It can't be deprecated by a vendor's pricing change. The pipeline can build another one tomorrow in a different domain. Grow a forest of experts.
The Fork Point · ADR-0013
…and the pipeline keeps learning. When it converges again, v1.1 is minted: same architecture, fresh base distillation from the evolved corpus, and your specialization carried forward through LoRA, with a regression test against v1.0's full benchmark blocking any release that forgets. New DNA card. New seal. You choose whether to adopt it. v1.0 is still yours, still sealed, still working; the pipeline never pushes.
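The no-forgetting gate for a new version can be sketched as a per-benchmark comparison against the sealed v1.0 scores. The function names and the optional `tolerance` are hypothetical (the default of 0.0 means any drop blocks minting), and the scores below are made-up illustration values. The check only decides whether v1.1 may be minted; adopting it stays your choice.

```python
def regressions(sealed: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """Benchmarks on which the candidate (e.g. v1.1) scores below the sealed version (v1.0)."""
    missing = sorted(set(sealed) - set(candidate))
    if missing:
        # The regression test must cover v1.0's full benchmark, so gaps are errors.
        raise ValueError(f"candidate was not scored on: {missing}")
    return sorted(name for name, old in sealed.items() if candidate[name] < old - tolerance)


def may_mint(sealed: dict[str, float], candidate: dict[str, float],
             tolerance: float = 0.0) -> bool:
    """Minting only makes v1.1 available; v1.0 stays sealed and adoption stays opt-in."""
    return not regressions(sealed, candidate, tolerance)


v1_0 = {"mmlu": 0.60, "gsm8k": 0.50, "domain": 0.80}
v1_1 = {"mmlu": 0.61, "gsm8k": 0.50, "domain": 0.79}
print(regressions(v1_0, v1_1))  # ['domain']
print(may_mint(v1_0, v1_1))     # False
```

Returning the list of regressed benchmarks, not just a boolean, tells you exactly which part of the specialization was lost.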
Ed25519 cryptographic provenance chain
Every graduated Super Skill gets a unique DNA card. The DNA hashes the first five components below together with the final model weights, and the Ed25519 signature over that hash is the sixth. Verify it once with the public key, trust it forever. Change any component, the DNA breaks. This is how provenance works in a world where model weights can be faked, swapped, or silently replaced.
Teacher SHA-256
Cryptographic hash of the exact teacher model weights used for distillation
Corpus Manifest Hash
SHA-256 of the complete versioned corpus — every example, every source, every layer
Pipeline Config
The superskill.yaml at run time — modes, weights, thresholds, reproducible settings
AutoResearch Report
Final rubric versions, convergence trajectory, failure patterns discovered and resolved
Three-Gate Results
Exact scores from Gate 1 (regression), Gate 2 (domain), Gate 3 (hallucination)
Ed25519 Signature
Cryptographic DNA binding all of the above to the final model weights
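A dependency-free sketch of how the hashed part of the DNA could be computed with Python's standard `hashlib` and `json`. The component key names and values are hypothetical placeholders, and hashing the sorted per-example hashes (so ingestion order doesn't change the manifest) is a design assumption. The resulting digest is what an Ed25519 key would then sign, for example with the `cryptography` package's `Ed25519PrivateKey.sign`; that step is left out here to avoid a third-party dependency.

```python
import hashlib
import json


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a (possibly multi-GB) weights file through SHA-256 in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def corpus_manifest_hash(examples: list[dict]) -> str:
    """Hash each example, then hash the sorted example hashes (order-independent by design)."""
    leaves = sorted(
        hashlib.sha256(json.dumps(ex, sort_keys=True).encode("utf-8")).hexdigest()
        for ex in examples
    )
    return hashlib.sha256("\n".join(leaves).encode("utf-8")).hexdigest()


def dna_digest(components: dict) -> str:
    """SHA-256 over canonical JSON: sorted keys and fixed separators make the bytes deterministic."""
    canonical = json.dumps(components, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


components = {  # illustrative placeholder values
    "teacher_sha256": "<sha256_file(teacher weights)>",
    "corpus_manifest": corpus_manifest_hash([{"q": "a"}, {"q": "b"}]),
    "pipeline_config": {"retention": 0.85},
    "autoresearch_report": {"rubric_version": 7},
    "gate_results": {"gate1": True, "gate2": True, "gate3": True},
    "model_weights_sha256": "<sha256_file(final weights)>",
}
print(dna_digest(components))
```

Canonical JSON matters here: without sorted keys and fixed separators, two byte-different encodings of the same components would produce different DNA, and honest verification would fail.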
Claims need receipts. Every graduated Super Skill publishes a benchmark report — the receipts that justify the DNA. Here is what gets measured.
Gate 1 · General Capability Regression
Runs EleutherAI's lm-evaluation-harness: MMLU (general knowledge), HellaSwag (commonsense), ARC (reasoning), GSM8K (math), IFEval (instruction following). The graduated Super Skill must retain at least 85% of the base Qwen 3B model's scores. This shows that specialization didn't destroy general intelligence.
Gate 2 · Domain Mastery
Runs Stanford's HELM benchmark with custom domain probes, plus LLM-as-Judge evaluation on a held-out test set. The Super Skill must match or exceed the teacher model on the domain it was trained for. This is the “did it actually get better at the thing” test.
Gate 3 · Hallucination Audit
Runs the HalluLens benchmark plus custom domain hallucination probes. The hallucination rate must be under 2%. Fabricated entities (made-up APIs, non-existent functions, invented people) are a hard fail: zero tolerance. Out-of-domain questions must be refused more than 90% of the time.
All three gates are run by independent harnesses, not the training pipeline itself. The adversarial swarm that trained the model never sees these benchmarks. If any gate fails, the run is marked not-graduated and the DNA is never minted.
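The pass/fail logic of the three gates described above can be sketched as follows. The thresholds come from the text; the function signature is hypothetical, the scores in the usage line are made-up illustration values, and applying the 85% retention rule to each benchmark individually (rather than to an average) is an assumption, since the text doesn't say which.

```python
from dataclasses import dataclass


@dataclass
class GateReport:
    gate1_general: bool
    gate2_domain: bool
    gate3_hallucination: bool

    @property
    def graduated(self) -> bool:
        # Fail any gate and the run is not-graduated; no DNA is minted.
        return self.gate1_general and self.gate2_domain and self.gate3_hallucination


def run_gates(base_scores: dict[str, float], student_scores: dict[str, float],
              student_domain: float, teacher_domain: float,
              hallucination_rate: float, fabricated_entities: int,
              ood_refusal_rate: float, retention: float = 0.85) -> GateReport:
    # Gate 1: keep >= 85% of the base model's score on every general benchmark
    # (per-benchmark is an assumption; a missing score raises KeyError).
    gate1 = all(student_scores[name] >= retention * base for name, base in base_scores.items())
    # Gate 2: match or exceed the teacher on the held-out domain benchmark.
    gate2 = student_domain >= teacher_domain
    # Gate 3: < 2% hallucinations, zero fabricated entities, > 90% out-of-domain refusals.
    gate3 = hallucination_rate < 0.02 and fabricated_entities == 0 and ood_refusal_rate > 0.90
    return GateReport(gate1, gate2, gate3)


report = run_gates(
    base_scores={"mmlu": 0.60, "gsm8k": 0.50},
    student_scores={"mmlu": 0.55, "gsm8k": 0.45},
    student_domain=0.72, teacher_domain=0.70,
    hallucination_rate=0.01, fabricated_entities=0, ood_refusal_rate=0.93,
)
print(report.graduated)  # True
```

Keeping each gate as its own field makes the published benchmark report say which gate failed, not just that the run didn't graduate.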