TY - JOUR
T1 - Dynamic Pathway for Query-Aware Feature Learning in Language-Driven Action Localization
AU - Yang, Shuo
AU - Wu, Xinxiao
AU - Shang, Zirui
AU - Luo, Jiebo
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2024
Y1 - 2024
AB - Language-driven action localization aims to locate the segment of an untrimmed video that is semantically relevant to an input language query. This task is challenging because language queries describe diverse actions with different motion characteristics and semantic granularities. Some actions, such as 'the person takes off their shoes, and goes to the door', are characterized by complex motion relationships, while others, such as 'a person is standing holding a mirror in one hand', are distinguished by salient body postures. In this paper, we propose a dynamic pathway between an exploitation module and an exploration module for query-aware feature learning to handle the diversity of actions. The exploitation module works in a coarse-to-fine manner: it first learns features of general motion relationships to search for the coarse segment of the target action, and then learns features of subtle motion changes to predict the refined action boundaries. The exploration module works in a point-to-area diffusion fashion: it first learns features of sub-action patterns to search for the salient postures of the target action, and then learns features of temporal dependency to expand the posture frames to the action segment. The two modules are selected dynamically and adaptively to learn comprehensive representations of diverse actions and thereby improve action localization accuracy. Extensive experiments on the Charades-STA and TACoS datasets demonstrate that our method outperforms existing methods.
KW - Dynamic pathway
KW - exploitation
KW - exploration
KW - language-driven action localization
KW - video grounding
KW - video moment retrieval
UR - http://www.scopus.com/inward/record.url?scp=85187019948&partnerID=8YFLogxK
U2 - 10.1109/TMM.2024.3368919
DO - 10.1109/TMM.2024.3368919
M3 - Article
AN - SCOPUS:85187019948
SN - 1520-9210
VL - 26
SP - 7451
EP - 7461
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -