Minimal Distillation Schedule for Extreme Language Model Compression

Chen Zhang, Yang Yang, Qifan Wang, Jiahao Liu, Jingang Wang, Wei Wu, Dawei Song*

*Corresponding author of this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-reviewed

1 citation (Scopus)

Abstract

Recent studies have revealed that language model distillation becomes less effective when there is a significant capacity gap between the teacher and the student model. To bridge this gap, teacher assistant-based distillation has been introduced, in which the selection of the teacher assistant plays a crucial role in transferring knowledge from the teacher to the student. However, existing approaches for teacher assistant-based distillation require numerous trials to find the optimal teacher assistant. In this paper, we propose a novel approach called Minimal Distillation Schedule (MINIDISC), which schedules an optimal teacher assistant in just one trial for extreme model compression (e.g., to 5% scale). In particular, we empirically show that the performance of the student is positively correlated with the scale-performance tradeoff of the teacher assistant. We then introduce a new λ-tradeoff metric that quantifies the optimality of the teacher assistant without the need for trial distillation to the student. By employing a sandwich framework, MINIDISC can select the optimal teacher assistant with the best λ-tradeoff. We extensively evaluate MINIDISC through a series of experiments on the GLUE benchmark. The results demonstrate that our approach achieves improved efficiency compared to various state-of-the-art baselines. Furthermore, we showcase the scalability of MINIDISC by applying it to a language model with billions of parameters.

Original language: English
Host publication: EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2024
Editors: Yvette Graham, Matthew Purver
Publisher: Association for Computational Linguistics (ACL)
Pages: 1378-1394
Number of pages: 17
ISBN (electronic): 9798891760936
Publication status: Published - 2024
Event: 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - Findings of EACL 2024 - St. Julian's, Malta
Duration: 17 Mar 2024 – 22 Mar 2024

