Ada-SwinBERT: Adaptive Token Selection for Efficient Video Captioning with Online Self-Distillation

Qianwen Cao, Heyan Huang, Minpeng Liao*, Xianling Mao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Video captioning aims to produce textual descriptions of a given video. Benefiting from the self-attention mechanism, which captures long-distance dependencies between video patches and language sentences, fully Transformer-based models have recently achieved promising performance. However, because temporal information is continuous, videos contain a large amount of redundant and unimportant visual content. Using video patches indiscriminately results in expensive computation and inefficient use of resources. To tackle this issue, we propose Ada-SwinBERT, a novel approach that adaptively selects salient video tokens to balance efficiency and performance for video captioning. Moreover, we devise a training strategy with online self-distillation to compensate for the information loss caused by discarding video tokens. Video-text alignment knowledge distilled from the teacher leads to a robust training process. By hierarchically pruning 78.1% of the input tokens, our approach reduces FLOPs by 62.0% compared with the base model while achieving performance competitive with SOTA methods.
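
The two ideas named in the abstract, hierarchical selection of salient tokens and distillation from a teacher, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-stage keep ratios, the saliency scores, and the use of a plain KL-divergence objective are all assumptions made for illustration, since the abstract does not specify how tokens are scored or how the distillation loss is defined.

```python
import math


def select_salient_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    `scores` is a hypothetical per-token saliency; how the paper computes
    saliency is not stated in the abstract.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)  # restore temporal/spatial order of the survivors
    return [tokens[i] for i in keep], keep


def hierarchical_prune(tokens, score_fn, keep_ratios):
    """Apply token selection once per stage (assumed stage-wise pruning)."""
    kept = list(tokens)
    for ratio in keep_ratios:
        kept, _ = select_salient_tokens(kept, score_fn(kept), ratio)
    return kept


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student): a generic distillation objective,
    used here as a stand-in for the paper's unspecified loss."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    return sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_t, p_s))
```

With stage-wise pruning, the fraction of tokens retained overall is the product of the per-stage keep ratios, so several moderate pruning stages can add up to a large total reduction such as the 78.1% reported in the abstract.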

Original language: English
Title of host publication: Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Publisher: IEEE Computer Society
Pages: 7-12
Number of pages: 6
ISBN (Electronic): 9781665468916
DOI
Publication status: Published - 2023
Event: 2023 IEEE International Conference on Multimedia and Expo, ICME 2023 - Brisbane, Australia
Duration: 10 Jul 2023 to 14 Jul 2023

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
Volume: 2023-July
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Country/Territory: Australia
City: Brisbane
Period: 10/07/23 to 14/07/23
