Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Hsieh, Li, Yeh, Nakhost, Fujii, Ratner, Krishna, Lee, Pfister
770M T5 beat 540B PaLM in-domain, a roughly 700x parameter reduction
220M model beat PaLM-540B on e-SNLI
2,000x smaller, wins in-domain
770M model beat PaLM-540B on ANLI
700x smaller, using only 80% of the available data
11B model beat PaLM-540B on CommonsenseQA
45x smaller
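The size ratios above follow from the nominal parameter counts alone. A minimal Python check, assuming the round sizes named on these slides (540B, 220M, 770M, 11B); the dictionary keys and the `size_ratio` helper are hypothetical names used only for this sketch:

```python
# Nominal parameter counts taken from the model names on the slides.
# The keys below are labels for this sketch, not official identifiers.
PALM_PARAMS = 540e9
STUDENT_PARAMS = {
    "T5 220M": 220e6,
    "T5 770M": 770e6,
    "T5 11B": 11e9,
}


def size_ratio(teacher_params: float, student_params: float) -> float:
    """Return how many times larger the teacher is than the student."""
    return teacher_params / student_params


ratios = {name: size_ratio(PALM_PARAMS, p) for name, p in STUDENT_PARAMS.items()}

for name, ratio in ratios.items():
    print(f"{name}: PaLM-540B is about {ratio:.0f}x larger")
```

The computed ratios come out near 2,455, 701, and 49, so the slide figures of 2,000x, 700x, and 45x read as rounded-down lower bounds.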
Standard fine-tuning could not catch up to PaLM even with 100% of data
Chain-of-thought distillation is the differentiator
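One way to picture the distillation step is as multi-task training: the small model learns to predict both the label and the LLM-generated rationale, and the two losses are summed with a weight. The sketch below assumes that setup; the function names, the prefix strings, the weight `lam`, and the sample text are illustrative choices, not taken from these slides.

```python
def to_multitask_examples(question: str, label: str, rationale: str):
    """Turn one annotated example into two prefixed (input, target) pairs.

    The prefixes tell the student which output is wanted. The exact
    prefix strings here are an assumption made for this sketch.
    """
    return [
        ("[label] " + question, label),
        ("[rationale] " + question, rationale),
    ]


def combined_loss(label_loss: float, rationale_loss: float, lam: float = 1.0) -> float:
    """Weighted sum of the two task losses; lam is a hypothetical weight."""
    return label_loss + lam * rationale_loss


# Illustrative, made-up example (not from any dataset).
examples = to_multitask_examples(
    question="premise: A man plays guitar. hypothesis: A person makes music.",
    label="entailment",
    rationale="Playing guitar is a way of making music.",
)
total = combined_loss(label_loss=0.5, rationale_loss=0.25, lam=2.0)
```

Because the rationale is only a training target, the student can be asked for the label alone at inference time, so the extra supervision adds no inference cost under this setup.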