Enhancing temporal action localization through cross-modal and cross-structural knowledge distillation

  • Yue Yu*
  • Cheng Wang
  • Yuxin Shi

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

This paper proposes Cross-Modal and Cross-Structure distillation for RGB-based temporal action detection (C2MS-Net), a novel fully supervised approach for enhancing temporal action localization by leveraging cross-modal and cross-structural distillation techniques. By integrating information from multiple modalities and structural representations, C2MS-Net significantly improves the discriminative power of action proposals. A distillation framework is introduced that transfers knowledge from a teacher model trained on rich multi-modal data to a more efficient student model. This approach not only enhances temporal localization accuracy but also improves the robustness of action detection against variations in visual content. Extensive experiments on benchmark datasets demonstrate that the proposed C2MS-Net performs competitively with or surpasses state-of-the-art methods, particularly at lower and mid-range tIoU thresholds, while offering substantial gains in computational efficiency. By eliminating the need for optical flow extraction, the proposed method substantially reduces computational complexity, achieving faster inference speeds and smaller model sizes without compromising accuracy. Code, datasets, and models are available at: https://github.com/wangcheng666/ActionFormer.
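The abstract does not specify the exact distillation objectives used in C2MS-Net, but the teacher-to-student transfer it describes is commonly implemented with a Hinton-style, temperature-softened KL-divergence loss between teacher and student predictions. The sketch below illustrates that generic pattern only; the function names, toy logits, and temperature value are illustrative assumptions, not details from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# Toy per-snippet action-class logits: the teacher sees rich multi-modal
# input (e.g. RGB + optical flow), the student sees RGB only.
teacher = [2.0, 0.5, -1.0]
student = [1.2, 0.9, -0.4]
loss = distillation_loss(teacher, student)
```

Minimizing such a loss pushes the RGB-only student toward the teacher's softened class distribution, which is how the student can retain accuracy while skipping optical flow extraction at inference time.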

Original language: English
Article number: 104734
Journal: Journal of Visual Communication and Image Representation
Volume: 116
Publication status: Published - Mar 2026

Keywords

  • Attention
  • Cross-modal distillation
  • Cross-structure distillation
  • Temporal action localization
