Not for nothing

How we look for positive findings: the validation process.

The plant metaphor frames the philosophy, but a metaphor isn't credibility. Every graduated Super Skill goes through eight pillars of rigorous validation before the DNA is minted. This page is the high-level rollup: what we measure, how we measure it, and why it counts. It ships with every POC so skeptics can understand exactly what's being claimed and what isn't.

The Commitment

The pipeline spends just as much effort verifying the shard as it does creating it.

Growing the plant is half the work. Validating it grew correctly — to a standard rigorous enough that an independent researcher would accept the result — is the other half. Below are the eight pillars of post-graduation validation that every Super Skill must pass before the Nucleus Seal is minted.

The Eight Pillars

01

Real Benchmarks, Not Probes

Gate 1 — Did we preserve general intelligence?

What

We run both the base Qwen 2.5-3B and the graduated shard through EleutherAI's lm-evaluation-harness, the same evaluation harness the Hugging Face Open LLM Leaderboard uses. Identical seeds, identical sampling, identical harness.

How

Six tasks: MMLU (general knowledge), HellaSwag (commonsense), ARC-Challenge (reasoning), Winogrande (commonsense pronoun resolution), GSM8K (multi-step math), IFEval (instruction following). Each task is scored with Wilson confidence intervals so we report not just a number but a range.

Why it counts

A specialized model can get great at one thing and catastrophically forget everything else. This measures retention — how much general capability survived the specialization. The threshold is 85% mean retention with no task below 70%. Below that, we assume we broke the model.

Measured by

  • Mean retention ≥ 85% (hard pass)
  • Per-task floor ≥ 70% (no catastrophic forgetting)
  • Wilson 95% CIs on every score
  • Raw lm-eval JSON output preserved for reviewer reproduction
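
A minimal sketch of the retention check, assuming retention means the graduated score divided by the base score on each task (this page does not spell out the formula). The function name and all scores below are hypothetical and for illustration only; they are not measured results.

```python
from statistics import mean

MEAN_RETENTION_MIN = 0.85  # hard-pass threshold from the text
PER_TASK_FLOOR = 0.70      # catastrophic-forgetting floor from the text


def retention_gate(base: dict[str, float], graduated: dict[str, float]) -> dict:
    """Return per-task retention and a Gate 1 verdict.

    Assumption: retention = graduated score / base score, per task.
    Base scores are assumed to be non-zero.
    """
    if base.keys() != graduated.keys():
        raise ValueError("both models must be scored on the same tasks")
    retention = {task: graduated[task] / base[task] for task in base}
    mean_retention = mean(retention.values())
    worst_task = min(retention, key=retention.get)
    passed = (mean_retention >= MEAN_RETENTION_MIN
              and retention[worst_task] >= PER_TASK_FLOOR)
    return {
        "retention": retention,
        "mean_retention": mean_retention,
        "worst_task": worst_task,
        "passed": passed,
    }


# Illustrative numbers only: two tasks fully retained, one badly degraded.
base_scores = {"mmlu": 0.60, "hellaswag": 0.70, "gsm8k": 0.50}
grad_scores = {"mmlu": 0.60, "hellaswag": 0.70, "gsm8k": 0.33}
report = retention_gate(base_scores, grad_scores)
```

In this example the mean retention clears 85%, but GSM8K retention falls below the 70% floor, so the gate fails. That is the case the per-task floor exists to catch.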

02

Held-Out Domain Benchmark

Gate 2 — Did the model actually learn the domain?

What

At KICE extraction time, 10% of examples are deterministically split off into a holdout set and excluded from every downstream phase. Synthesis doesn't see them. Training doesn't see them. The adversarial swarm never probes them. They're the reserved test.
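
One way to make a split like this deterministic is to hash a stable example ID and route on the hash value. The sketch below assumes each example carries a unique ID string; the function name, the salt, and the IDs are hypothetical and not taken from the KICE implementation.

```python
import hashlib

HOLDOUT_FRACTION = 0.10  # 10% holdout, per the text


def is_holdout(example_id: str, salt: str = "holdout-v1") -> bool:
    """Route an example to the holdout set based on a hash of its ID.

    The same ID and salt always give the same answer, so anyone can
    recompute the split without stored random state.
    """
    digest = hashlib.sha256(f"{salt}:{example_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < HOLDOUT_FRACTION


example_ids = [f"example-{i}" for i in range(10_000)]  # hypothetical IDs
holdout = [eid for eid in example_ids if is_holdout(eid)]
train = [eid for eid in example_ids if not is_holdout(eid)]
```

A hash-based split lands close to 10% rather than exactly on it. If an exact count matters, sorting by hash and taking the first tenth would be one alternative.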

How

The holdout is hashed into the corpus manifest. Any tampering with the holdout file invalidates the Nucleus Seal. At certification time, the student generates answers to the holdout questions. A second LLM (DeepSeek-R1-14B locally, or Claude API for independence) scores student responses against a 5-criterion rubric: correctness, depth, edge-case coverage, reasoning faithfulness, domain idiom.

Why it counts

This is the honest test. The student either learned the domain well enough to handle questions it was never shown, or it didn't. We compare student scores to the teacher's scores on the same holdout and run a Wilcoxon signed-rank test — if the student isn't significantly worse than the teacher, it passes.

Measured by

  • Student mean judge score ≥ teacher × 0.95
  • Wilcoxon paired test p > 0.05 (not significantly worse)
  • Cohen's kappa ≥ 0.6 (judge inter-rater reliability)
  • Holdout hash verifiable in the Nucleus Seal chain
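
The two score-based criteria above can be sketched with `scipy.stats.wilcoxon`. This assumes a one-sided test (alternative: the student scores lower than the teacher) and integer rubric totals per holdout question; neither detail is stated on this page. The function name and the scores are hypothetical.

```python
from statistics import mean

from scipy.stats import wilcoxon


def holdout_gate(student, teacher, ratio_min=0.95, alpha=0.05) -> dict:
    """Compare paired judge scores for student and teacher.

    Assumption: one-sided Wilcoxon signed-rank test, where the alternative
    hypothesis is "student scores are lower than teacher scores".
    """
    if len(student) != len(teacher):
        raise ValueError("scores must be paired per holdout question")
    result = wilcoxon(student, teacher, alternative="less")
    mean_ok = mean(student) >= ratio_min * mean(teacher)
    not_worse = bool(result.pvalue > alpha)
    return {
        "p_value": float(result.pvalue),
        "mean_ok": mean_ok,
        "not_worse": not_worse,
        "passed": mean_ok and not_worse,
    }


# Illustrative rubric totals only, not measured results.
teacher = [20, 18, 22, 19, 21, 17, 23, 20, 18, 22, 19, 21]
student_ok = [21, 17, 23, 18, 22, 16, 24, 19, 19, 21, 20, 20]
student_bad = [score - 3 for score in teacher]

ok_report = holdout_gate(student_ok, teacher)
bad_report = holdout_gate(student_bad, teacher)
```

A p-value above 0.05 means the test found no evidence that the student is worse. It is not proof of equivalence, and a small holdout has little power to detect a real gap, which is why the mean-score criterion is checked alongside it.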

03

Hallucination Audit

Gate 3 — Does it confidently make things up?

What

The most dangerous failure mode for a specialized model is confident fabrication. A model that knows when to say 'I don't know' is vastly more useful than one that invents plausible-sounding answers. Gate 3 systematically tests this.

How

Three sub-tests: (1) TruthfulQA via lm-eval-harness — resistance to common misconceptions. (2) HalluLens — structured hallucination benchmark. (3) Fabricated-entity detection — we ask questions with fictional people, invented APIs, and made-up court cases, then use spaCy NER to flag any confidently asserted entities that weren't in the prompt and aren't in the domain vocabulary.

Why it counts

A production model needs calibrated refusal. We also test out-of-distribution behavior — 30 questions from MMLU subjects outside the target domain — and score whether the model refuses, hedges, or answers confidently. The correct behavior is to refuse or hedge.

Measured by

  • Hallucination rate < 2%
  • Zero fabricated entities (hard fail)
  • Out-of-domain refusal rate ≥ 90%
  • TruthfulQA non-regression vs base model
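
The fabricated-entity rule reduces to a set check once entities have been extracted. The sketch below assumes the entity strings were already pulled from the answer (for example, from the entities a spaCy pipeline returns) and does not try to judge how confidently they were asserted. The function name, prompt, vocabulary, and entities are all hypothetical.

```python
def fabricated_entities(answer_entities, prompt, domain_vocabulary):
    """Return entities that appear in neither the prompt nor the vocabulary.

    Matching is case-insensitive. An entity counts as "in the prompt" if
    its text occurs anywhere in the prompt string.
    """
    prompt_lower = prompt.lower()
    vocab_lower = {term.lower() for term in domain_vocabulary}
    flagged = []
    for entity in answer_entities:
        key = entity.lower()
        if key in prompt_lower or key in vocab_lower:
            continue
        flagged.append(entity)
    return flagged


# Hypothetical probe built around a fictional person.
prompt = "What did Dr. Elara Voss publish about gradient clipping?"
vocabulary = {"PyTorch", "AdamW"}
entities_in_answer = ["Elara Voss", "PyTorch", "Journal of Imaginary Results"]

flagged = fabricated_entities(entities_in_answer, prompt, vocabulary)
gate_passed = len(flagged) == 0  # zero fabricated entities is a hard requirement
```

Here the fictional person is not flagged because the name came from the prompt, while the invented journal is flagged because the answer introduced it.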

04

Cryptographic Provenance

The Nucleus Seal

What

When all three gates pass, the system mints an Ed25519-signed DNA chain. The DNA covers six components: teacher model hash, corpus manifest hash, pipeline config, AutoResearch final report, three-gate scores, and the graduated model weight hash.

How

The chain hash is a SHA-256 of the canonical concatenation of all six component hashes. Ed25519 signs that chain hash with a key managed separately from the pipeline. The DNA is written as a JSON file next to the model, and a verification CLI can re-verify any DNA against the published public key.

Why it counts

Model provenance is a hard problem. In a world where weights can be swapped, fine-tuned, or silently replaced, you need a way to prove 'this is the model I certified.' If anything in the chain is tampered with — the corpus, the config, the weights, anything — the DNA breaks. This is how you trust a model without trusting the distributor.

Measured by

  • Ed25519 signature over six-component chain hash
  • Verification CLI for independent checking
  • Revocation registry for compromised DNA
  • Periodic re-verification as Prometheus liveness probe
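
A sketch of the chain-hash step using only the standard library. It assumes "canonical concatenation" means the six hex digests joined with newlines in a fixed order; the component names and placeholder contents are hypothetical. The Ed25519 signing step is left out because this page does not say which library or key format the pipeline uses.

```python
import hashlib

# Hypothetical names for the six components listed in the text.
COMPONENTS = (
    "teacher_model",
    "corpus_manifest",
    "pipeline_config",
    "autoresearch_report",
    "gate_scores",
    "model_weights",
)


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def chain_hash(component_hashes: dict[str, str]) -> str:
    """SHA-256 over the six component hashes in a fixed order.

    Assumption: the canonical form is the hex digests joined by newlines
    in the order given by COMPONENTS.
    """
    missing = [name for name in COMPONENTS if name not in component_hashes]
    if missing:
        raise KeyError(f"missing component hashes: {missing}")
    canonical = "\n".join(component_hashes[name] for name in COMPONENTS)
    return sha256_hex(canonical.encode("ascii"))


# Placeholder contents stand in for the real artifacts.
components = {name: sha256_hex(f"placeholder {name}".encode()) for name in COMPONENTS}
original = chain_hash(components)

# Swapping any single component changes the chain hash.
tampered = dict(components, model_weights=sha256_hex(b"swapped weights"))
tampered_hash = chain_hash(tampered)
```

Because the signature would cover only the chain hash, a fixed component order matters: two parties who concatenate in different orders would compute different hashes from the same artifacts.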

05

Baseline Comparison

Full pipeline vs no pipeline

What

We run the same three gates against two models: the base Qwen 2.5-3B-Instruct (untouched) and the graduated shard. Every number in every table shows both. The delta is the evidence the pipeline did something measurable.

How

Same harness. Same seeds. Same sampling config. The only variable is whether the model went through KICE + SCoTD + RAFT + adversarial swarm. If the graduated shard isn't better than the base on the domain, the pipeline failed and we say so.

Why it counts

This is the core scientific control. The null hypothesis is 'the pipeline does nothing.' We have to reject it with real numbers. No hand-waving. No 'trust us, it's better.' The CSV comparison table is a file reviewers can open and verify.

Measured by

  • Per-task delta (graduated - base) with significance
  • Domain improvement magnitude
  • Retention ratio per general-capability task
  • CSV + JSON artifacts attached to the paper
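
A comparison table of this kind could be produced with the standard `csv` module. The column names, task names, and scores below are hypothetical, and the significance column is omitted to keep the sketch short.

```python
import csv
import io


def comparison_csv(base: dict[str, float], graduated: dict[str, float]) -> str:
    """Render a base-vs-graduated table with a signed delta per task."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["task", "base", "graduated", "delta"])
    for task in sorted(base):
        delta = graduated[task] - base[task]
        writer.writerow([
            task,
            f"{base[task]:.3f}",
            f"{graduated[task]:.3f}",
            f"{delta:+.3f}",  # explicit sign so regressions stand out
        ])
    return buffer.getvalue()


# Illustrative numbers only, not measured results.
base_scores = {"domain_holdout": 0.41, "mmlu": 0.60}
grad_scores = {"domain_holdout": 0.78, "mmlu": 0.57}
table = comparison_csv(base_scores, grad_scores)
```

Printing `table` gives one row per task, with a positive delta on the domain row and a small negative delta on the general-capability row.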

06

Statistical Rigor

Confidence intervals, not just point estimates

What

Every reported score comes with uncertainty quantification. We don't say 'the model scored 73%.' We say 'the model scored 73% (95% CI 66% to 79%, n=200).' Reviewers can distinguish 'meaningfully better' from 'lucky sample.'

How

Wilson confidence intervals on binomial benchmark scores (via scipy.stats.binomtest and its proportion_ci method with method='wilson'). Wilcoxon signed-rank test for paired comparisons (student vs teacher on matched holdout). Cohen's kappa for inter-rater reliability on the LLM-as-Judge: we run the judge twice with different seeds and report agreement.

Why it counts

Workshop reviewers will tear apart any paper that reports percentages without error bars. Statistical rigor is table stakes for credibility. It's also honest — we want to know when we're reporting noise versus signal.

Measured by

  • Wilson 95% CIs on every percentage
  • Wilcoxon p-values on every paired comparison
  • Cohen's kappa on judge reliability
  • Sample sizes explicitly documented
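
To make the two less familiar statistics concrete, here is a standard-library sketch of the closed-form Wilson interval and of Cohen's kappa. The kappa function assumes the two judge passes are reduced to categorical labels per question, which this page does not specify. The example counts and labels are hypothetical.

```python
import math
from collections import Counter

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def wilson_interval(successes: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def cohens_kappa(labels_a, labels_b) -> float:
    """Agreement between two label sequences, corrected for chance."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if expected == 1.0:
        return 1.0  # both passes used one identical label throughout
    return (observed - expected) / (1 - expected)


# 146 correct out of 200 is a 73% score (illustrative).
low, high = wilson_interval(146, 200)

# Two hypothetical judge passes that disagree on one of ten questions.
pass_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
pass_2 = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail", "pass", "pass"]
kappa = cohens_kappa(pass_1, pass_2)
```

At n=200 the interval around 73% spans roughly 66% to 79%, so a few points of difference between two models at that sample size is within the noise. The kappa example shows why raw agreement overstates reliability: 90% agreement works out to a kappa of about 0.74 once chance agreement is removed.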

07

Reproducibility Artifacts

Everything a reviewer needs to rerun

What

The certification output is a self-contained directory that anyone can use to reproduce our results. We don't ask reviewers to trust us — we give them the materials to check.

How

Each certification run produces: the raw lm-eval JSON outputs (byte-identical to what the harness emits), judge responses with full rationales, the random seeds used at every stage, the config file hash, the corpus manifest hash, the three-gate JSON report, and the Nucleus Seal. The paper's `make arxiv` target zips everything into an ancillary bundle.

Why it counts

Reproducibility is the single best defense against skepticism. If a reviewer can download the artifacts and rerun the certification in their own environment, 'I don't believe this' becomes 'I verified this.' It also lets community contributors audit future runs.

Measured by

  • Raw lm-eval JSON preserved verbatim
  • Random seeds table in the paper
  • Corpus + config hashes published
  • Docker image tag pinned for environment reproduction
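
One way a reviewer could confirm that downloaded artifacts are byte-identical to the published ones is a per-file SHA-256 manifest. This is a sketch under that assumption; the file names and directory layout are hypothetical and not the pipeline's actual output structure.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(artifact_dir) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 of its bytes."""
    root = Path(artifact_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            manifest[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def changed_files(artifact_dir, expected: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified versus `expected`."""
    actual = build_manifest(artifact_dir)
    names = expected.keys() | actual.keys()
    return sorted(name for name in names if expected.get(name) != actual.get(name))


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lm_eval").mkdir()
    (root / "lm_eval" / "results.json").write_text('{"acc": 0.73}')
    (root / "seeds.json").write_text('{"eval": 1234}')

    published = build_manifest(root)
    untouched = changed_files(root, published)

    (root / "seeds.json").write_text('{"eval": 9999}')
    after_edit = changed_files(root, published)
```

The check reports nothing for an untouched directory and names `seeds.json` after that file is edited, so a single changed seed is enough to surface a mismatch.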

08

Honest Limitations

What this pipeline does NOT claim

What

Every paper worth reading has a limitations section. Every claim has a counter. We state explicitly what this work does and does not demonstrate, so future skeptics can't say we overclaimed.

How

The paper's limitations section names eight specific constraints: 3B scale (not a frontier comparison), judge lineage (DeepSeek-R1 is the teacher family), single-domain validation (AI Model Engineer only), commodity hardware sample size limits, lm-eval contamination risk, DNA key custody, proposed-not-established thresholds, and v1 ablation scope (full vs base only).

Why it counts

Acknowledging limitations is how you earn trust. The goal is not to pretend the work is perfect — it's to be precise about what was shown and what remains future work. Reviewers respect honesty; they punish overclaiming.

Measured by

  • Eight explicit limitations stated in the paper
  • v1 ablation scope documented
  • Judge independence caveat with Claude API second-opinion
  • Single-domain generalization explicitly future work

“Not for nothing”

The questions a skeptical reviewer would ask, and the answers the pipeline is designed to provide.

What if the pipeline just did nothing?

The baseline comparison (pillar 05) catches it. Gate 2 is run on both the base model and the graduated shard; if the shard isn't significantly better than the base model on held-out domain questions, the Wilcoxon test fails and no DNA is minted.

What if we hand-picked benchmarks to look good?

Gate 1 draws its tasks from the Hugging Face Open LLM Leaderboard. We didn't choose them; the community did. Raw lm-eval JSON is preserved verbatim for audit.

What if we just memorized the training data?

The 10% holdout is split at extraction time and excluded from synthesis, training, and swarm. The holdout hash is bound to the corpus manifest — tampering breaks the DNA.

What if the judge is biased because it's the teacher?

We report Cohen's kappa from two independent judge passes. A second-opinion sample uses Claude API (independent lineage) and we compare. The paper's limitations section names this caveat explicitly.

What if the model confidently makes things up?

Gate 3 has zero tolerance for fabricated entities — one confidently asserted fictional person or invented API fails the whole run. spaCy NER catches them automatically.

What if someone swapped the model after certification?

The Nucleus Seal's Ed25519 signature covers the model weight hash. The verification CLI recomputes the hash and checks the signature. Any mutation breaks the chain.

How do we know the results are reproducible?

Every run produces a self-contained artifact directory with raw JSON, seeds, config hashes, and the DNA. The `make arxiv` target packages it for peer review.

The Endgame

Every Nucleus certification run produces a self-contained artifact directory: raw benchmark outputs, judge rationales, random seeds, config hashes, the corpus manifest chain, and a signed Nucleus Seal.

When the first Linux Kernel POC ships, this directory ships with it. Anyone can open it, run the verification CLI against the published public key, rerun the benchmarks with the preserved seeds, and check every number in the paper's tables against the JSON files they came from.

That's what “not for nothing” means in practice. The pipeline doesn't ask for trust — it delivers the materials that make trust unnecessary.

What v1 Doesn't Cover

Explicit limitations stated in the paper:

  • 3B scale: not a frontier model comparison. We compare against the base Qwen 3B.
  • Single-domain validation. Generalization across domains is future work.
  • Judge lineage: DeepSeek-R1 shares family with the student's teacher. Claude API second-opinion mitigates but doesn't eliminate.
  • Commodity hardware sample size limits: we report Wilson CIs to be honest about uncertainty.
  • lm-eval task contamination risk (Open LLM Leaderboard caveat applies).
  • DNA is only as strong as key custody; threat model explicitly stated.
  • Gate thresholds (85% retention, <2% hallucination) are proposed, not established norms.
  • v1 ablation scope: full pipeline vs base model only. The three-row intermediate ablation (no_raft, no_scotd, no_swarm) is deferred to v2.
