TY - GEN
T1 - Efficient Language-Driven Action Localization by Feature Aggregation and Prediction Adjustment
AU - Shang, Zirui
AU - Yang, Shuo
AU - Wu, Xinxiao
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
AB - Language-driven action localization is a challenging task that aims to identify action boundaries, namely the start and end timestamps, within untrimmed videos using natural language queries. Previous studies have made significant progress by extensively investigating cross-modal interactions between the linguistic and visual modalities. However, the computational demands imposed by untrimmed, lengthy videos remain substantial, necessitating more efficient algorithms. In this paper, we propose an efficient algorithm that addresses this computational challenge by aggregating the features of adjacent, semantically redundant frames. Specifically, we fuse neighboring frames according to their semantic similarity to the given language query, facilitating the identification of relevant video segments while substantially reducing computational cost. To enhance localization accuracy, we introduce a prediction adjustment module that expands the fused frames, enabling more precise determination of action boundaries. Moreover, our method is model-agnostic and can be easily integrated into existing methods as a plug-and-play solution. Extensive experiments on two widely used benchmark datasets (Charades-STA and TACoS) demonstrate the effectiveness and efficiency of our method.
KW - Feature aggregation
KW - Language-driven action localization
KW - Prediction adjustment
KW - Temporal sentence grounding
KW - Video moment retrieval
UR - http://www.scopus.com/inward/record.url?scp=85208170194&partnerID=8YFLogxK
DO - 10.1007/978-981-97-8620-6_38
M3 - Conference contribution
AN - SCOPUS:85208170194
SN - 9789819786190
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 555
EP - 568
BT - Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
A2 - Lin, Zhouchen
A2 - Zha, Hongbin
A2 - Cheng, Ming-Ming
A2 - He, Ran
A2 - Liu, Cheng-Lin
A2 - Ubul, Kurban
A2 - Silamu, Wushouer
A2 - Zhou, Jie
PB - Springer Science and Business Media Deutschland GmbH
T2 - 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Y2 - 18 October 2024 through 20 October 2024
ER -