TY - JOUR
T1 - Toward Effective Knowledge Distillation
T2 - Navigating Beyond Small-Data Pitfall
AU - Hao, Zhiwei
AU - Guo, Jianyuan
AU - Han, Kai
AU - Hu, Han
AU - Xu, Chang
AU - Wang, Yunhe
N1 - Publisher Copyright:
© 1979-2012 IEEE.
PY - 2026
Y1 - 2026
N2 - The spectacular success of training large models on extensive datasets highlights the potential of scaling up for exceptional performance. To deploy these models on edge devices, knowledge distillation (KD) is commonly used to create a compact model from a larger, pretrained teacher model. However, as models and datasets rapidly scale up in practical applications, it is crucial to consider the applicability of existing KD approaches originally designed for limited-capacity architectures and small-scale datasets. In this paper, we revisit current KD methods and identify the presence of a small-data pitfall, where most modifications to vanilla KD prove ineffective on large-scale datasets. To guide the design of consistently effective KD methods across different data scales, we conduct a meticulous evaluation of the knowledge transfer process. Our findings reveal that incorporating more useful information is crucial for achieving consistently effective KD methods, while modifications in loss functions show relatively less significance. In light of this, we present a paradigmatic example that combines vanilla KD with deep supervision, incorporating additional information into the student during distillation. This approach surpasses almost all recent KD methods. We believe our study will offer valuable insights to guide the community in navigating beyond the small-data pitfall and toward consistently effective KD.
AB - The spectacular success of training large models on extensive datasets highlights the potential of scaling up for exceptional performance. To deploy these models on edge devices, knowledge distillation (KD) is commonly used to create a compact model from a larger, pretrained teacher model. However, as models and datasets rapidly scale up in practical applications, it is crucial to consider the applicability of existing KD approaches originally designed for limited-capacity architectures and small-scale datasets. In this paper, we revisit current KD methods and identify the presence of a small-data pitfall, where most modifications to vanilla KD prove ineffective on large-scale datasets. To guide the design of consistently effective KD methods across different data scales, we conduct a meticulous evaluation of the knowledge transfer process. Our findings reveal that incorporating more useful information is crucial for achieving consistently effective KD methods, while modifications in loss functions show relatively less significance. In light of this, we present a paradigmatic example that combines vanilla KD with deep supervision, incorporating additional information into the student during distillation. This approach surpasses almost all recent KD methods. We believe our study will offer valuable insights to guide the community in navigating beyond the small-data pitfall and toward consistently effective KD.
KW - Computer vision
KW - deep supervision
KW - dense prediction
KW - knowledge distillation
UR - https://www.scopus.com/pages/publications/105015494565
U2 - 10.1109/TPAMI.2025.3607982
DO - 10.1109/TPAMI.2025.3607982
M3 - Article
AN - SCOPUS:105015494565
SN - 0162-8828
VL - 48
SP - 542
EP - 556
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 1
ER -