[Submitted on 3 May 2023]
Abstract: Deploying large language models (LLMs) is challenging because they are
memory-inefficient and compute-intensive for practical applications. In response,
researchers train smaller task-specific models by either finetuning with human
labels or distilling using LLM-generated labels. However, finetuning and
distillation require large amounts of training data to achieve comparable
performance to LLMs. We introduce Distilling step-by-step, a new mechanism that
(a) trains smaller models that outperform LLMs, and (b) does so while requiring
less training data than finetuning or distillation. Our method
extracts LLM rationales as additional supervision for small models within a
multi-task training framework. We present three findings across 4 NLP
benchmarks: First, compared to both finetuning and distillation, our mechanism
achieves better performance with far fewer labeled/unlabeled training
examples. Second, compared to LLMs, we achieve better performance using
substantially smaller model sizes. Third, we reduce both the model size and the
amount of data required to outperform LLMs; our 770M T5 model outperforms the
540B PaLM model using only 80% of the available data on a benchmark task.
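
The multi-task training mentioned above can be read as optimizing two seq2seq objectives per example: predicting the task label, and reproducing the LLM-generated rationale as an auxiliary target. The code below is a minimal sketch of that objective under stated assumptions, not the authors' implementation: the task prefixes, the t5-small checkpoint, and the rationale_weight hyperparameter are illustrative choices.

# Minimal sketch of a multi-task distillation objective, assuming a seq2seq
# model such as T5 via Hugging Face transformers. Prefixes and the
# rationale_weight value are illustrative, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def multitask_loss(question: str, label: str, rationale: str,
                   rationale_weight: float = 1.0) -> torch.Tensor:
    """Sum of (a) label-prediction loss and (b) rationale-generation loss."""
    def seq2seq_loss(source: str, target: str) -> torch.Tensor:
        inputs = tokenizer(source, return_tensors="pt", truncation=True)
        targets = tokenizer(target, return_tensors="pt", truncation=True)
        # T5ForConditionalGeneration returns the token-level cross-entropy
        # loss when `labels` is supplied.
        return model(**inputs, labels=targets.input_ids).loss

    # Task 1: predict the final answer from the input alone.
    label_loss = seq2seq_loss("[label] " + question, label)
    # Task 2: generate the LLM-provided rationale for the same input.
    rationale_loss = seq2seq_loss("[rationale] " + question, rationale)
    return label_loss + rationale_weight * rationale_loss

# Example training step on dummy data.
loss = multitask_loss(
    question="A robe takes 2 bolts of blue fiber and half that much white "
             "fiber. How many bolts in total does it take?",
    label="3",
    rationale="It takes 2 / 2 = 1 bolt of white fiber, so 2 + 1 = 3 in total.",
)
loss.backward()

Because the rationale branch is only an auxiliary training signal, the small model can be deployed with the label prefix alone, so no rationale generation is needed at inference time.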