QuKaiZen — Your Intelligence. Forever owned. Always improving.

Real Benchmarks, Not Probes

Gate 1 — Did we preserve general intelligence?

What

We run both the base Qwen 2.5-3B and the graduated shard through EleutherAI's lm-evaluation-harness — the same benchmark suite the HuggingFace Open LLM Leaderboard uses. Identical seeds, identical sampling, identical harness.

How

Six tasks: MMLU (general knowledge), HellaSwag (commonsense), ARC-Challenge (reasoning), Winogrande (reference resolution), GSM8K (multi-step math), IFEval (instruction following). Each task is scored with Wilson confidence intervals so we report not just a number but a range.

Why it counts

A specialized model can get great at one thing and catastrophically forget everything else. Gate 1 measures retention: the graduated shard's score on each task as a fraction of the base model's score, which shows how much general capability survived the specialization. The threshold is 85% mean retention with no single task below 70%. Below that, we assume we broke the model.

Measured by

Mean retention ≥ 85% (hard pass)
Per-task floor ≥ 70% (no catastrophic forgetting)
Wilson 95% CIs on every score
Raw lm-eval JSON output preserved for reviewer reproduction
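
The two thresholds above can be sketched in a few lines. This is a minimal illustration, assuming retention is the graduated score divided by the base score per task; the function name `retention_gate` and the scores are hypothetical placeholders, not measured results.

```python
def retention_gate(base, graduated, mean_min=0.85, floor_min=0.70):
    """Apply the Gate 1 thresholds to per-task scores (assumed layout: task -> score)."""
    # Retention = graduated score as a fraction of the base score, per task.
    retention = {task: graduated[task] / base[task] for task in base}
    mean_retention = sum(retention.values()) / len(retention)
    worst_task = min(retention, key=retention.get)
    passed = mean_retention >= mean_min and retention[worst_task] >= floor_min
    return {
        "retention": retention,
        "mean_retention": mean_retention,
        "worst_task": worst_task,
        "passed": passed,
    }


# Placeholder accuracies for illustration only, not measured results.
base_scores = {"mmlu": 0.60, "hellaswag": 0.70, "gsm8k": 0.50}
graduated_scores = {"mmlu": 0.57, "hellaswag": 0.68, "gsm8k": 0.33}
report = retention_gate(base_scores, graduated_scores)
print(report["passed"], report["worst_task"])
```

In this placeholder case the mean retention clears 85%, but one task falls under the 70% floor, so the gate fails. That is the situation the per-task floor exists to catch.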

Held-Out Domain Benchmark

Gate 2 — Did the model actually learn the domain?

What

At KICE extraction time, 10% of examples are deterministically split off into a holdout set and excluded from every downstream phase. Synthesis doesn't see them. Training doesn't see them. The adversarial swarm never probes them. They're the reserved test.
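
One common way to make such a split deterministic is to hash a stable example identifier and compare it to the holdout fraction. The sketch below assumes that approach; the function name `is_holdout`, the salt value, and the ID format are hypothetical and not taken from the KICE implementation.

```python
import hashlib


def is_holdout(example_id: str, fraction: float = 0.10, salt: str = "holdout-v1") -> bool:
    """Assign an example to the holdout set based only on its ID (assumed scheme)."""
    digest = hashlib.sha256(f"{salt}:{example_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


example_ids = [f"ex-{i}" for i in range(10_000)]
holdout_ids = [eid for eid in example_ids if is_holdout(eid)]
print(len(holdout_ids) / len(example_ids))  # close to the requested fraction
```

Because the assignment depends only on the ID and the salt, rerunning extraction on the same corpus puts the same examples in the holdout set, with no random state to store.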

How

The holdout is hashed into the corpus manifest. Any tampering with the holdout file invalidates the Nucleus Seal. At certification time, the student generates answers to the holdout questions. A second LLM (DeepSeek-R1-14B locally, or Claude API for independence) scores student responses against a 5-criterion rubric: correctness, depth, edge-case coverage, reasoning faithfulness, domain idiom.

Why it counts

This is the honest test. The student either learned the domain well enough to handle questions it was never shown, or it didn't. We compare student scores to the teacher's scores on the same holdout and run a Wilcoxon signed-rank test — if the student isn't significantly worse than the teacher, it passes.

Measured by

Student mean judge score ≥ 0.95 × teacher mean judge score
Wilcoxon paired test p > 0.05 (not significantly worse)
Cohen's kappa ≥ 0.6 (judge inter-rater reliability)
Holdout hash verifiable in the Nucleus Seal chain
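
The paired comparison can be illustrated with a small standard-library version of the one-sided Wilcoxon signed-rank test. This is a sketch under stated assumptions: it uses the normal approximation without tie or continuity corrections, drops zero differences, and the function name `wilcoxon_less` and the scores are hypothetical.

```python
import math


def wilcoxon_less(student, teacher):
    """One-sided Wilcoxon signed-rank test, normal approximation.

    Alternative hypothesis: student scores tend to be lower than teacher scores.
    Returns (w_plus, p_value).
    """
    diffs = [s - t for s, t in zip(student, teacher) if s != t]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # identical scores: no evidence the student is worse
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z <= z)
    return w_plus, p_value


# Placeholder judge scores on a 1-5 rubric, for illustration only.
teacher_scores = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
student_scores = [4, 3, 4, 3, 4, 3, 4, 3, 4, 3]
w_plus, p = wilcoxon_less(student_scores, teacher_scores)
print(p > 0.05)  # gate criterion: pass only if not significantly worse
```

Where SciPy is available, `scipy.stats.wilcoxon(student, teacher, alternative="less")` performs the same one-sided test and can use an exact distribution for small samples.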

Hallucination Audit

Gate 3 — Does it confidently make things up?

What

The most dangerous failure mode for a specialized model is confident fabrication. A model that knows when to say 'I don't know' is vastly more useful than one that invents plausible-sounding answers. Gate 3 systematically tests this.

How

Three sub-tests: (1) TruthfulQA via lm-eval-harness — resistance to common misconceptions. (2) HalluLens — structured hallucination benchmark. (3) Fabricated-entity detection — we ask questions with fictional people, invented APIs, and made-up court cases, then use spaCy NER to flag any confidently asserted entities that weren't in the prompt and aren't in the domain vocabulary.
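
The fabricated-entity check in sub-test (3) reduces to a filter once entities have been extracted. The sketch below assumes a prior NER step (spaCy in the pipeline described above) has already produced a list of entity strings from the model's response; the function name `flag_fabricated_entities` and the example strings are hypothetical.

```python
def flag_fabricated_entities(response_entities, prompt_text, domain_vocabulary):
    """Return entities that appear neither in the prompt nor in the domain vocabulary."""
    prompt_lower = prompt_text.lower()
    vocabulary = {term.lower() for term in domain_vocabulary}
    flagged = []
    for entity in response_entities:
        name = entity.lower()
        # An entity is suspicious only if the model introduced it on its own.
        if name not in prompt_lower and name not in vocabulary:
            flagged.append(entity)
    return flagged


# Invented example: the person in the prompt is fictional by design.
prompt = "What did Dr. Elena Marsh publish about LoRA?"
entities = ["Elena Marsh", "LoRA", "Journal of Imaginary Results"]
flagged = flag_fabricated_entities(entities, prompt, {"LoRA", "RAFT"})
print(flagged)
```

Substring matching against the prompt is deliberately crude; a stricter version would also check whether the response asserts facts about the flagged entity or declines to answer.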

Why it counts

A production model needs calibrated refusal. We also test out-of-distribution behavior — 30 questions from MMLU subjects outside the target domain — and score whether the model refuses, hedges, or answers confidently. The correct behavior is to refuse or hedge.

Measured by

Hallucination rate < 2%
Zero fabricated entities (any fabricated entity is a hard fail)
Out-of-domain refusal rate ≥ 90%
TruthfulQA non-regression vs base model

Cryptographic Provenance

The Nucleus Seal

What

When all three gates pass, the system mints an Ed25519-signed DNA chain. The DNA covers six components: teacher model hash, corpus manifest hash, pipeline config, AutoResearch final report, three-gate scores, and the graduated model weight hash.

How

The chain hash is a SHA-256 of the canonical concatenation of all six component hashes. Ed25519 signs that chain hash with a key managed separately from the pipeline. The DNA is written as a JSON file next to the model, and a verification CLI can re-verify any DNA against the published public key.
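
The chain-hash step can be sketched as follows. The component names, their fixed order, and the choice to concatenate hex digests are assumptions made for illustration; the source only says the concatenation is canonical. Ed25519 signing is omitted because it needs a third-party library.

```python
import hashlib

# Assumed canonical order for the six components (hypothetical names).
COMPONENT_ORDER = (
    "teacher_model",
    "corpus_manifest",
    "pipeline_config",
    "autoresearch_report",
    "gate_scores",
    "model_weights",
)


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def chain_hash(component_hashes: dict) -> str:
    """SHA-256 over the component hex digests joined in a fixed order."""
    missing = [name for name in COMPONENT_ORDER if name not in component_hashes]
    if missing:
        raise ValueError(f"missing components: {missing}")
    joined = "".join(component_hashes[name] for name in COMPONENT_ORDER)
    return sha256_hex(joined.encode("ascii"))


# Placeholder digests derived from the names themselves, for illustration only.
components = {name: sha256_hex(name.encode("utf-8")) for name in COMPONENT_ORDER}
original = chain_hash(components)

tampered = dict(components, model_weights=sha256_hex(b"swapped weights"))
print(original != chain_hash(tampered))  # any changed component changes the chain hash
```

A fixed order matters: without it, two parties hashing the same six components could produce different chain hashes and verification would fail for reasons unrelated to tampering.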

Why it counts

Model provenance is a hard problem. In a world where weights can be swapped, fine-tuned, or silently replaced, you need a way to prove 'this is the model I certified.' If anything in the chain is tampered with — the corpus, the config, the weights, anything — the DNA breaks. This is how you trust a model without trusting the distributor.

Measured by

Ed25519 signature over six-component chain hash
Verification CLI for independent checking
Revocation registry for compromised DNA
Periodic re-verification as Prometheus liveness probe

Baseline Comparison

Full pipeline vs no pipeline

What

We run the same three gates against two models: the base Qwen 2.5-3B-Instruct (untouched) and the graduated shard. Every number in every table shows both. The delta is the evidence the pipeline did something measurable.

How

Same harness. Same seeds. Same sampling config. The only variable is whether the model went through KICE + SCoTD + RAFT + adversarial swarm. If the graduated shard isn't better than the base on the domain, the pipeline failed and we say so.

Why it counts

This is the core scientific control. The null hypothesis is 'the pipeline does nothing.' We have to reject it with real numbers. No hand-waving. No 'trust us, it's better.' The CSV comparison table is a file reviewers can open and verify.

Measured by

Per-task delta (graduated - base) with significance
Domain improvement magnitude
Retention ratio per general-capability task
CSV + JSON artifacts attached to the paper
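
A comparison table of this kind can be produced with the standard `csv` module. The sketch below assumes per-task scores are held in plain dictionaries; the column names and the function name `comparison_csv` are hypothetical, and the significance column is left out because it depends on the test used for each task.

```python
import csv
import io


def comparison_csv(base, graduated):
    """Render a base-vs-graduated table as CSV text (assumed column layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["task", "base", "graduated", "delta", "retention"])
    for task in sorted(base):
        delta = graduated[task] - base[task]
        retention = graduated[task] / base[task]
        writer.writerow([
            task,
            f"{base[task]:.4f}",
            f"{graduated[task]:.4f}",
            f"{delta:+.4f}",
            f"{retention:.4f}",
        ])
    return buffer.getvalue()


# Placeholder scores, not measured results.
table = comparison_csv({"arc_challenge": 0.50}, {"arc_challenge": 0.55})
print(table)
```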

Statistical Rigor

Confidence intervals, not just point estimates

What

Every reported score comes with uncertainty quantification. We don't say 'the model scored 73%.' We say 'the model scored 73% (95% Wilson CI 66% to 79%, n=200).' Reviewers can distinguish 'meaningfully better' from 'lucky sample.'

How

Wilson confidence intervals on binomial benchmark scores (via scipy.stats.binomtest, whose result exposes proportion_ci with method='wilson'). Wilcoxon signed-rank test for paired comparisons (student vs teacher on matched holdout). Cohen's kappa for inter-rater reliability on the LLM-as-Judge: we run the judge twice with different seeds and report agreement.
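
For readers who want to check an interval by hand, the Wilson score interval has a closed form. The sketch below is a standard-library version for illustration, not the pipeline's own code; the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion (z defaults to the 95% level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denominator = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return center - half_width, center + half_width


low, high = wilson_interval(146, 200)  # 73% correct on 200 items
print(f"{low:.3f} to {high:.3f}")
```

For 146 correct out of 200, the interval runs from roughly 0.66 to 0.79. It is not symmetric around 0.73, which is why a Wilson interval is better reported as a range than as a plus-or-minus figure.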

Why it counts

Workshop reviewers will tear apart any paper that reports percentages without error bars. Statistical rigor is table stakes for credibility. It's also honest — we want to know when we're reporting noise versus signal.

Measured by

Wilson 95% CIs on every percentage
Wilcoxon p-values on every paired comparison
Cohen's kappa on judge reliability
Sample sizes explicitly documented
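
Cohen's kappa for two judge runs can be computed directly from the paired labels. This is a minimal sketch assuming each run yields one categorical verdict per item; the function name `cohens_kappa` and the labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two label sequences, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both runs pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Placeholder verdicts from two judge runs, for illustration only.
run_1 = ["pass", "pass", "fail", "fail"]
run_2 = ["pass", "fail", "pass", "fail"]
print(cohens_kappa(run_1, run_2))
```

In the placeholder data the two runs agree on half the items, which is exactly what chance would produce with these label frequencies, so kappa is 0 even though raw agreement is 50%.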

Reproducibility Artifacts

Everything a reviewer needs to rerun

What

The certification output is a self-contained directory that anyone can use to reproduce our results. We don't ask reviewers to trust us — we give them the materials to check.

How

Each certification run produces: the raw lm-eval JSON outputs (byte-identical to what the harness emits), judge responses with full rationales, the random seeds used at every stage, the config file hash, the corpus manifest hash, the three-gate JSON report, and the Nucleus Seal. The paper's `make arxiv` target zips everything into an ancillary bundle.
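
A reviewer-side check of such a bundle could start from a per-file digest listing. The sketch below assumes the artifacts sit in one directory and records a SHA-256 per file; the function name `build_manifest` and the file name are hypothetical and do not describe the actual bundle layout.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(artifact_dir):
    """Map each file's relative path to the SHA-256 of its raw bytes."""
    root = Path(artifact_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            manifest[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


# Demonstration on a throwaway directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "results.json").write_text('{"acc": 0.5}', encoding="utf-8")
    demo_manifest = build_manifest(tmp)
print(demo_manifest)
```

Hashing raw bytes rather than parsed JSON means any change to the harness output, including whitespace, shows up as a different digest.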

Why it counts

Reproducibility is the single best defense against skepticism. If a reviewer can download the artifacts and rerun the certification in their own environment, 'I don't believe this' becomes 'I verified this.' It also lets community contributors audit future runs.

Measured by

Raw lm-eval JSON preserved verbatim
Random seeds table in the paper
Corpus + config hashes published
Docker image tag pinned for environment reproduction

Honest Limitations

What this pipeline does NOT claim

What

Every paper worth reading has a limitations section, and every claim has a counterargument. We state explicitly what this work does and does not demonstrate, so future skeptics can't say we overclaimed.

How

The paper's limitations section names eight specific constraints: 3B scale (not a frontier comparison), judge lineage (DeepSeek-R1 is the teacher family), single-domain validation (AI Model Engineer only), commodity hardware sample size limits, lm-eval contamination risk, DNA key custody, proposed-not-established thresholds, and v1 ablation scope (full vs base only).

Why it counts

Acknowledging limitations is how you earn trust. The goal is not to pretend the work is perfect — it's to be precise about what was shown and what remains future work. Reviewers respect honesty; they punish overclaiming.

Measured by

Eight explicit limitations stated in the paper
v1 ablation scope documented
Judge independence caveat with Claude API second-opinion
Single-domain generalization explicitly future work

How we look for positive findings: the validation process.

The Commitment

The Eight Pillars

“Not for nothing”

The Endgame

What v1 Doesn't Cover