TY - GEN
T1 - Hierarchical Matching and Reasoning for Action Localization via Language Query
AU - Li, Tianyu
AU - Wu, Xinxiao
N1 - Publisher Copyright: © 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
AB - This paper strives for temporal localization of actions in untrimmed videos via natural language queries. Prevailing methods represent both the query sentence and the video as a whole and perform sentence-video matching via global features, which neglects local correspondence between sentence and video. In this work, we aim to move beyond this limitation by delving into fine-grained local sentence-video matching, such as phrase-motion matching and word-object matching. We propose a hierarchical matching and reasoning method based on a deep conditional random field to integrate hierarchical matching between visual concepts and textual semantics for temporal action localization via a query sentence. Our method decomposes each sentence into textual semantics (i.e., phrases and words), obtains multi-level matching results between the textual semantics and the visual concepts in a video (i.e., results of phrase-motion matching and word-object matching), and then reasons about relations between the multi-level matching results via pairwise potentials of the conditional random field to achieve coherence in hierarchical matching. By minimizing the overall potential, the final matching score between a sentence and a video is computed as the conditional probability of the conditional random field. Our proposed method is evaluated on the public Charades-STA dataset, and the experimental results verify its superiority over state-of-the-art methods.
KW - Action localization via language query
KW - Conditional random field
KW - Hierarchical matching
UR - http://www.scopus.com/inward/record.url?scp=85094157049&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-60636-7_12
DO - 10.1007/978-3-030-60636-7_12
M3 - Conference contribution
AN - SCOPUS:85094157049
SN - 9783030606350
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 137
EP - 148
BT - Pattern Recognition and Computer Vision - 3rd Chinese Conference, PRCV 2020, Proceedings
A2 - Peng, Yuxin
A2 - Zha, Hongbin
A2 - Liu, Qingshan
A2 - Lu, Huchuan
A2 - Sun, Zhenan
A2 - Liu, Chenglin
A2 - Chen, Xilin
A2 - Yang, Jian
PB - Springer Science and Business Media Deutschland GmbH
T2 - 3rd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2020
Y2 - 16 October 2020 through 18 October 2020
ER -