TY - GEN
T1 - PrimKD: Primary Modality Guided Multimodal Fusion for RGB-D Semantic Segmentation
T2 - 32nd ACM International Conference on Multimedia, MM 2024
AU - Hao, Zhiwei
AU - Xiao, Zhongyu
AU - Luo, Yong
AU - Guo, Jianyuan
AU - Wang, Jing
AU - Shen, Li
AU - Hu, Han
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/10/28
Y1 - 2024/10/28
N2 - The recent advancements in cross-modal transformers have demonstrated their superior performance in RGB-D segmentation tasks by effectively integrating information from both RGB and depth modalities. However, existing methods often overlook the varying levels of informative content present in each modality, treating them equally and using models of the same architecture. This oversight can potentially hinder segmentation performance, especially considering that RGB images typically contain significantly more information than depth images. To address this issue, we propose PrimKD, a knowledge distillation-based approach that focuses on guided multimodal fusion, with an emphasis on leveraging the primary RGB modality. In our approach, we utilize a model trained exclusively on the RGB modality as the teacher, guiding the learning process of a student model that fuses both RGB and depth modalities. To prioritize information from the primary RGB modality while leveraging the depth modality, we incorporate primary focused feature reconstruction and a selective alignment scheme. This integration enhances the overall feature fusion, resulting in improved segmentation results. We evaluate our proposed method on the NYU Depth V2 and SUN-RGBD datasets, and the experimental results demonstrate the effectiveness of PrimKD. Specifically, our approach achieves mIoU scores of 57.8 and 52.5 on these two datasets, respectively, surpassing existing counterparts by 1.5 and 0.4 mIoU. The code is available at https://github.com/xiaoshideta/PrimKD.
AB - The recent advancements in cross-modal transformers have demonstrated their superior performance in RGB-D segmentation tasks by effectively integrating information from both RGB and depth modalities. However, existing methods often overlook the varying levels of informative content present in each modality, treating them equally and using models of the same architecture. This oversight can potentially hinder segmentation performance, especially considering that RGB images typically contain significantly more information than depth images. To address this issue, we propose PrimKD, a knowledge distillation-based approach that focuses on guided multimodal fusion, with an emphasis on leveraging the primary RGB modality. In our approach, we utilize a model trained exclusively on the RGB modality as the teacher, guiding the learning process of a student model that fuses both RGB and depth modalities. To prioritize information from the primary RGB modality while leveraging the depth modality, we incorporate primary focused feature reconstruction and a selective alignment scheme. This integration enhances the overall feature fusion, resulting in improved segmentation results. We evaluate our proposed method on the NYU Depth V2 and SUN-RGBD datasets, and the experimental results demonstrate the effectiveness of PrimKD. Specifically, our approach achieves mIoU scores of 57.8 and 52.5 on these two datasets, respectively, surpassing existing counterparts by 1.5 and 0.4 mIoU. The code is available at https://github.com/xiaoshideta/PrimKD.
KW - knowledge distillation
KW - multimodal fusion
KW - rgb-d segmentation
UR - http://www.scopus.com/inward/record.url?scp=85209795473&partnerID=8YFLogxK
U2 - 10.1145/3664647.3681253
DO - 10.1145/3664647.3681253
M3 - Conference contribution
AN - SCOPUS:85209795473
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 1943
EP - 1951
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 28 October 2024 through 1 November 2024
ER -