Abstract
This paper proposes Cross-Modal and Cross-Structure distillation for RGB-based temporal action detection (C2MS-Net), a novel fully supervised approach for enhancing temporal action localization by leveraging cross-modal and cross-structural distillation techniques. By integrating information from multiple modalities and structural representations, C2MS-Net significantly improves the discriminative power of action proposals. A distillation framework is introduced that transfers knowledge from a teacher model trained on rich multi-modal data to a more efficient student model. This approach not only enhances temporal localization accuracy but also improves the robustness of action detection against variations in visual content. Extensive experiments on benchmark datasets demonstrate that the proposed C2MS-Net performs competitively with or surpasses state-of-the-art methods, particularly at lower and mid-range tIoU thresholds, while offering substantial gains in computational efficiency: by eliminating the need for optical flow extraction, the method achieves faster inference speeds and smaller model sizes without compromising accuracy. Code, dataset and models are available at: https://github.com/wangcheng666/ActionFormer.
| Original language | English |
|---|---|
| Article number | 104734 |
| Journal | Journal of Visual Communication and Image Representation |
| Volume | 116 |
| DOIs | |
| Publication status | Published - Mar 2026 |
Keywords
- Attention
- Cross-modal distillation
- Cross-structure distillation
- Temporal action localization
Fingerprint
Dive into the research topics of 'Enhancing temporal action localization through cross-modal and cross-structural knowledge distillation'.