Sparse Teachers Can Be Dense with Knowledge

Yi Yang, Chen Zhang, Dawei Song^*

^*Corresponding author for this work

Beijing Institute of Technology

Research output: Contribution to conference › Paper › peer-review

2 Citations (Scopus)

Abstract

Recent advances in distilling pretrained language models have discovered that, besides the expressiveness of knowledge, the student-friendliness should be taken into consideration to realize a truly knowledgeable teacher. Based on a pilot study, we find that over-parameterized teachers can produce expressive yet student-unfriendly knowledge and are thus limited in overall knowledgeableness. To remove the parameters that result in student-unfriendliness, we propose a sparse teacher trick under the guidance of an overall knowledgeable score for each teacher parameter. The knowledgeable score is essentially an interpolation of the expressiveness and student-friendliness scores. The aim is to ensure that the expressive parameters are retained while the student-unfriendly ones are removed. Extensive experiments on the GLUE benchmark show that the proposed sparse teachers can be dense with knowledge and lead to students with compelling performance in comparison with a series of competitive baselines.
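The scoring idea in the abstract can be illustrated with a minimal sketch. The abstract only says that the knowledgeable score is "essentially an interpolation" of the expressiveness and student-friendliness scores; it does not give the formula, how the two scores are computed, or the granularity at which parameters are removed. The linear weighting with `alpha`, the function names, and the keep-the-top-scoring rule below are therefore all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def knowledgeable_scores(
    expressiveness: Sequence[float],
    friendliness: Sequence[float],
    alpha: float,
) -> List[float]:
    """Linearly interpolate two per-parameter scores.

    `alpha` is a hypothetical weight in [0, 1]: 0 uses expressiveness only,
    1 uses student-friendliness only. The linear form is an assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(expressiveness) != len(friendliness):
        raise ValueError("score lists must have the same length")
    return [(1.0 - alpha) * e + alpha * f
            for e, f in zip(expressiveness, friendliness)]


def sparsity_mask(scores: Sequence[float], sparsity: float) -> List[int]:
    """Return a 0/1 mask that removes the lowest-scoring parameters.

    int(sparsity * len(scores)) parameters are removed (mask value 0);
    the remaining, higher-scoring parameters are kept (mask value 1).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_remove = int(sparsity * len(scores))
    # Indices sorted from lowest to highest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    removed = set(order[:n_remove])
    return [0 if i in removed else 1 for i in range(len(scores))]


if __name__ == "__main__":
    # Toy scores for four hypothetical teacher parameters.
    expressiveness = [0.9, 0.8, 0.2, 0.6]
    friendliness = [0.1, 0.7, 0.9, 0.5]
    scores = knowledgeable_scores(expressiveness, friendliness, alpha=0.5)
    print(sparsity_mask(scores, sparsity=0.25))
```

In this toy setting the first parameter is highly expressive but scores low on student-friendliness, so an equal weighting removes it, whereas ranking by expressiveness alone (`alpha=0.0`) would remove the third parameter instead.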

Original language	English
Pages	3904-3915
Number of pages	12
Publication status	Published - 2022
Event	2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration	7 Dec 2022 → 11 Dec 2022

Conference

Conference	2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
Country/Territory	United Arab Emirates
City	Abu Dhabi
Period	7/12/22 → 11/12/22

Cite this

@conference{ef829b610a93470f959ef82b62face85,

title = "Sparse Teachers Can Be Dense with Knowledge",

abstract = "Recent advances in distilling pretrained language models have discovered that, besides the expressiveness of knowledge, the student-friendliness should be taken into consideration to realize a truly knowledgeable teacher. Based on a pilot study, we find that over-parameterized teachers can produce expressive yet student-unfriendly knowledge and are thus limited in overall knowledgeableness. To remove the parameters that result in student-unfriendliness, we propose a sparse teacher trick under the guidance of an overall knowledgeable score for each teacher parameter. The knowledgeable score is essentially an interpolation of the expressiveness and student-friendliness scores. The aim is to ensure that the expressive parameters are retained while the student-unfriendly ones are removed. Extensive experiments on the GLUE benchmark show that the proposed sparse teachers can be dense with knowledge and lead to students with compelling performance in comparison with a series of competitive baselines.",

author = "Yi Yang and Chen Zhang and Dawei Song",

note = "Publisher Copyright: {\textcopyright} 2022 Association for Computational Linguistics.; 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 ; Conference date: 07-12-2022 Through 11-12-2022",

year = "2022",

language = "English",

pages = "3904--3915",

}

TY - CONF

T1 - Sparse Teachers Can Be Dense with Knowledge

AU - Yang, Yi

AU - Zhang, Chen

AU - Song, Dawei

PY - 2022

Y1 - 2022

N2 - Recent advances in distilling pretrained language models have discovered that, besides the expressiveness of knowledge, the student-friendliness should be taken into consideration to realize a truly knowledgeable teacher. Based on a pilot study, we find that over-parameterized teachers can produce expressive yet student-unfriendly knowledge and are thus limited in overall knowledgeableness. To remove the parameters that result in student-unfriendliness, we propose a sparse teacher trick under the guidance of an overall knowledgeable score for each teacher parameter. The knowledgeable score is essentially an interpolation of the expressiveness and student-friendliness scores. The aim is to ensure that the expressive parameters are retained while the student-unfriendly ones are removed. Extensive experiments on the GLUE benchmark show that the proposed sparse teachers can be dense with knowledge and lead to students with compelling performance in comparison with a series of competitive baselines.

UR - http://www.scopus.com/inward/record.url?scp=85149439930&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85149439930

SP - 3904

EP - 3915

T2 - 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022

Y2 - 7 December 2022 through 11 December 2022

ER -
