Minimal Distillation Schedule for Extreme Language Model Compression

Chen Zhang, Yang Yang, Qifan Wang, Jiahao Liu, Jingang Wang, Wei Wu, Dawei Song*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Recent studies have revealed that language model distillation can become less effective when there is a significant capacity gap between the teacher and the student models. In order to bridge the gap, teacher assistant-based distillation has been introduced, in which the selection of the teacher assistant plays a crucial role in transferring knowledge from the teacher to the student. However, existing approaches for teacher assistant-based distillation require numerous trials to find the optimal teacher assistant. In this paper, we propose a novel approach called Minimal Distillation Schedule (MINIDISC), which enables the scheduling of an optimal teacher assistant in just one trial for extreme model compression (e.g., to 5% scale). In particular, we empirically show that the performance of the student is positively correlated with the scale-performance tradeoff of the teacher assistant. We then introduce a new λ-tradeoff metric that quantifies the optimality of the teacher assistant without the need for trial distillation to the student. By employing a sandwich framework, MINIDISC can select the optimal teacher assistant with the best λ-tradeoff. We extensively evaluate MINIDISC through a series of experiments on the GLUE benchmark. The results demonstrate that our approach achieves improved efficiency compared to various state-of-the-art baselines. Furthermore, we showcase the scalability of MINIDISC by applying it to a language model with billions of parameters.
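The following Python sketch illustrates the selection idea described in the abstract: scoring candidate teacher assistants by a scale-performance tradeoff and picking the best one without trial distillation to the student. The exact λ-tradeoff metric and the sandwich framework are defined in the paper; the candidate names, the convex-combination scoring function, and the helper functions below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of lambda-tradeoff-based teacher assistant selection.
# Assumption: the tradeoff is modeled as a simple convex combination of
# normalized task performance and model scale, purely for illustration.

from dataclasses import dataclass


@dataclass
class Candidate:
    name: str           # identifier of the candidate teacher assistant
    scale: float        # fraction of the teacher's parameters, in (0, 1]
    performance: float  # task score of the candidate (e.g., GLUE accuracy in [0, 1])


def lambda_tradeoff(c: Candidate, lam: float = 0.5) -> float:
    """Hypothetical scale-performance tradeoff score (higher is better).

    Rewards task performance and penalizes scale, weighted by lam. This
    stands in for the paper's lambda-tradeoff metric, whose exact form is
    given in the paper itself.
    """
    return lam * c.performance - (1.0 - lam) * c.scale


def select_teacher_assistant(candidates: list[Candidate], lam: float = 0.5) -> Candidate:
    """Pick the candidate with the best tradeoff score in a single pass,
    i.e., without running trial distillation down to the student."""
    return max(candidates, key=lambda c: lambda_tradeoff(c, lam))


if __name__ == "__main__":
    # Toy candidates, e.g., intermediate models from a sandwich-style sweep.
    pool = [
        Candidate("ta-50%", scale=0.50, performance=0.86),
        Candidate("ta-25%", scale=0.25, performance=0.84),
        Candidate("ta-10%", scale=0.10, performance=0.79),
    ]
    best = select_teacher_assistant(pool, lam=0.6)
    print(f"Selected teacher assistant: {best.name}")
```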

Original language: English
Title of host publication: EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2024
Editors: Yvette Graham, Matthew Purver
Publisher: Association for Computational Linguistics (ACL)
Pages: 1378-1394
Number of pages: 17
ISBN (Electronic): 9798891760936
Publication status: Published - 2024
Event: 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - Findings of EACL 2024 - St. Julian's, Malta
Duration: 17 Mar 2024 - 22 Mar 2024

Publication series

Name: EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2024

Conference

Conference: 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - Findings of EACL 2024
Country/Territory: Malta
City: St. Julian's
Period: 17/03/24 - 22/03/24

Cite this

Zhang, C., Yang, Y., Wang, Q., Liu, J., Wang, J., Wu, W., & Song, D. (2024). Minimal Distillation Schedule for Extreme Language Model Compression. In Y. Graham & M. Purver (Eds.), EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2024 (pp. 1378-1394). (EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2024). Association for Computational Linguistics (ACL).