Ada-SwinBERT: Adaptive Token Selection for Efficient Video Captioning with Online Self-Distillation

Qianwen Cao, Heyan Huang, Minpeng Liao*, Xianling Mao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Video captioning aims to produce a textual description of a given video. Thanks to the self-attention mechanism, which captures long-distance dependencies between video patches and language sentences, fully Transformer-based models have recently achieved promising performance. However, because video is temporally continuous, much of the visual content is redundant or unimportant. Using video patches indiscriminately results in expensive computation and inefficient use of resources. To tackle this issue, we propose Ada-SwinBERT, a novel approach that adaptively selects salient video tokens to balance efficiency and performance in video captioning. Moreover, we devise a training strategy with online self-distillation to compensate for the information lost by discarding video tokens. Video-text alignment knowledge distilled from the teacher leads to a robust training process. By hierarchically pruning 78.1% of the input tokens, our approach reduces FLOPs by 62.0% compared with the base model while achieving performance competitive with state-of-the-art (SOTA) methods.
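The two ideas named in the abstract, keeping only salient tokens and distilling from a teacher, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how saliency is scored or which distillation loss is used, so the per-token `scores`, the `keep_ratio` parameter, the two-stage schedule, and the KL-divergence loss below are all assumptions chosen for illustration.

```python
import math


def select_salient_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    `scores` is a hypothetical per-token saliency value; how the paper
    computes saliency is not stated in the abstract.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    keep = max(1, round(len(tokens) * keep_ratio))
    # Pick the indices of the `keep` largest scores ...
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # ... then restore the original (temporal) order of the survivors.
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student), an assumed stand-in for the paper's loss."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hierarchical pruning: apply selection in stages (illustrative schedule).
tokens = list("abcdefgh")
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]
stage1 = select_salient_tokens(tokens, scores, keep_ratio=0.5)
stage1_scores = [scores[tokens.index(t)] for t in stage1]
stage2 = select_salient_tokens(stage1, stage1_scores, keep_ratio=0.5)
```

In this sketch the student would see only the pruned tokens while the teacher sees all of them, and `distillation_loss` penalizes the student for diverging from the teacher's output distribution; the actual teacher-student setup in the paper may differ.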

Original language: English
Title of host publication: Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Publisher: IEEE Computer Society
Pages: 7-12
Number of pages: 6
ISBN (Electronic): 9781665468916
DOIs
Publication status: Published - 2023
Event: 2023 IEEE International Conference on Multimedia and Expo, ICME 2023 - Brisbane, Australia
Duration: 10 Jul 2023 → 14 Jul 2023

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
Volume: 2023-July
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Country/Territory: Australia
City: Brisbane
Period: 10/07/23 → 14/07/23

Keywords

  • efficient multimodal transformer
  • self-distillation
  • token pruning
  • video captioning
