Real Benchmarks, Not Probes
Gate 1 — Did we preserve general intelligence?
What
We run both the base Qwen2.5-3B and the graduated shard through EleutherAI's lm-evaluation-harness, the evaluation framework behind the Hugging Face Open LLM Leaderboard. Both runs use identical seeds, identical sampling settings, and the identical harness.
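As a rough illustration of what "identical" means in practice, the sketch below builds the two `lm_eval` command lines from one template so that only the model reference and the output directory can differ. The Hub ID `Qwen/Qwen2.5-3B`, the shard path, the output directories, and the seed value are assumptions for illustration, not the project's actual configuration. The flags shown follow the harness's 0.4.x command-line interface.

```python
import shlex

# Task names as registered in lm-evaluation-harness (0.4.x naming).
TASKS = ["mmlu", "hellaswag", "arc_challenge", "winogrande", "gsm8k", "ifeval"]


def build_lm_eval_command(pretrained: str, output_path: str, seed: int = 0) -> list[str]:
    """Return the argv for one harness run.

    Everything except `pretrained` and `output_path` is fixed, so the base
    and shard runs cannot drift apart in tasks or seed.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(TASKS),
        "--seed", str(seed),
        "--output_path", output_path,
    ]


# Assumed model references: the Hub ID and the local shard path are placeholders.
base_cmd = build_lm_eval_command("Qwen/Qwen2.5-3B", "results/base")
shard_cmd = build_lm_eval_command("path/to/graduated-shard", "results/shard")

print(shlex.join(base_cmd))
print(shlex.join(shard_cmd))
```

Generating both commands from a single function makes it easy to check that the two runs differ only where they are supposed to.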
How
Six tasks: MMLU (general knowledge), HellaSwag (commonsense), ARC-Challenge (reasoning), Winogrande (pronoun resolution), GSM8K (multi-step math), IFEval (instruction following). Each score is reported with a Wilson 95% confidence interval, so every result is a range rather than a single number.
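For reference, the Wilson score interval for a task with n scored items and observed accuracy p can be computed as in the minimal sketch below. It assumes each item is scored as simply correct or incorrect, which is a simplification for tasks with aggregated or multi-part metrics. The function name and the default z = 1.96 (a 95% interval) are choices made here, not taken from the project's code.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    successes: number of items scored correct
    n:         total number of scored items (must be positive)
    z:         normal quantile; 1.96 corresponds to a 95% interval
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom

    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(successes=50, n=100)
print(f"accuracy 0.50, 95% CI [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly when accuracy is near 0 or 1, which matters for small or very easy tasks.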
Why it counts
A specialized model can get great at one thing and catastrophically forget everything else. Retention measures how much of the base model's general capability survived the specialization. The gate requires at least 85% mean retention across the six tasks, with no single task below 70%. If either threshold is missed, we assume we broke the model.
Measured by
- Mean retention ≥ 85% (hard requirement)
- Per-task floor ≥ 70% (no catastrophic forgetting)
- Wilson 95% CIs on every score
- Raw lm-eval JSON output preserved for reviewer reproduction
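The pass/fail logic above can be sketched as follows. This assumes retention is defined per task as the shard's score divided by the base model's score, which the text implies but does not state. The function name and return shape are hypothetical, and the scores in the usage example are made-up placeholders, not measured results.

```python
from statistics import mean

MEAN_THRESHOLD = 0.85  # mean retention required to pass
TASK_FLOOR = 0.70      # minimum retention allowed on any single task


def retention_gate(base: dict[str, float], shard: dict[str, float]) -> dict:
    """Compare per-task scores and apply the mean and per-task thresholds.

    Retention is assumed to be shard_score / base_score for each task.
    """
    if base.keys() != shard.keys():
        raise ValueError("base and shard must report the same tasks")
    if any(score <= 0 for score in base.values()):
        raise ValueError("base scores must be positive to compute a ratio")

    retention = {task: shard[task] / base[task] for task in base}
    mean_retention = mean(retention.values())
    below_floor = sorted(t for t, r in retention.items() if r < TASK_FLOOR)

    return {
        "retention": retention,
        "mean_retention": mean_retention,
        "below_floor": below_floor,
        "passed": mean_retention >= MEAN_THRESHOLD and not below_floor,
    }


# Placeholder scores for illustration only; these are NOT measured results.
example_base = {"task_a": 0.50, "task_b": 0.80}
example_shard = {"task_a": 0.45, "task_b": 0.40}
result = retention_gate(example_base, example_shard)
print(result["passed"], result["below_floor"])
```

Checking the per-task floor separately from the mean is what catches the case where strong retention on most tasks hides a collapse on one.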