Efficient Language-Driven Action Localization by Feature Aggregation and Prediction Adjustment

Zirui Shang, Shuo Yang*, Xinxiao Wu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Language-driven action localization is a challenging task that aims to identify action boundaries, namely the start and end timestamps, within untrimmed videos using natural language queries. Previous studies have made significant progress by extensively investigating cross-modal interactions between the linguistic and visual modalities. However, the computational cost of processing long, untrimmed videos remains substantial, which calls for more efficient algorithms. In this paper, we propose an efficient algorithm that addresses this computational challenge by aggregating adjacent frame features that are similar and therefore redundant. Specifically, we fuse neighboring frames based on their semantic similarity to the provided language query, which helps identify relevant video segments while keeping computational complexity manageable. To enhance localization accuracy, we introduce a prediction adjustment module that expands the fused frames, enabling a more precise determination of the action boundaries. Moreover, our method is model-agnostic and can be easily integrated with existing methods as a plug-and-play solution. Extensive experiments on two widely used benchmark datasets (Charades-STA and TACoS) demonstrate the effectiveness and efficiency of our method.
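The two ideas named in the abstract, fusing neighboring frames by their similarity to the query and expanding a prediction over fused frames back to the original timeline, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`aggregate_frames`, `expand_prediction`), the cosine-similarity measure, the mean-pooling fusion, and the `tol` threshold are all assumptions chosen for illustration, and the expansion step here is only a coarse index mapping, whereas the paper's prediction adjustment module presumably refines boundaries further.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def aggregate_frames(frames, query, tol=0.1):
    """Fuse runs of adjacent frames whose query similarity changes by less than tol.

    Hypothetical helper: returns (fused, spans), where spans[i] holds the
    (first, last) original frame indices covered by fused feature i.
    """
    if not frames:
        return [], []
    sims = [cosine(f, query) for f in frames]
    groups = [[0]]
    for i in range(1, len(frames)):
        # Compare each frame with its immediate predecessor (assumed criterion).
        if abs(sims[i] - sims[i - 1]) < tol:
            groups[-1].append(i)
        else:
            groups.append([i])
    dim = len(frames[0])
    # Mean-pool each group into a single fused feature (assumed fusion rule).
    fused = [[sum(frames[i][d] for i in g) / len(g) for d in range(dim)]
             for g in groups]
    spans = [(g[0], g[-1]) for g in groups]
    return fused, spans


def expand_prediction(pred_start, pred_end, spans):
    """Map a (start, end) prediction over fused indices to original frame indices."""
    return spans[pred_start][0], spans[pred_end][1]


# Toy example: five 2-D frame features and a 2-D query feature.
frames = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
query = [1.0, 0.0]
fused, spans = aggregate_frames(frames, query)
print(spans)                           # [(0, 1), (2, 3), (4, 4)]
print(expand_prediction(1, 2, spans))  # (2, 4)
```

In this toy case a localization model would see three fused features instead of five frames, and a prediction over fused indices 1 to 2 maps back to original frames 2 to 4. Because the aggregation happens before the model and the expansion after it, a scheme of this shape does not depend on the model in between, which is consistent with the model-agnostic claim in the abstract.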

Original language: English
Title of host publication: Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
Editors: Zhouchen Lin, Hongbin Zha, Ming-Ming Cheng, Ran He, Cheng-Lin Liu, Kurban Ubul, Wushouer Silamu, Jie Zhou
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 555-568
Number of pages: 14
ISBN (Print): 9789819786190
DOIs
Publication status: Published - 2025
Event: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024 - Urumqi, China
Duration: 18 Oct 2024 → 20 Oct 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15035 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Country/Territory: China
City: Urumqi
Period: 18/10/24 → 20/10/24

Keywords

  • Feature aggregation
  • Language-driven action localization
  • Prediction adjustment
  • Temporal sentence grounding
  • Video moment retrieval
