Hierarchical Matching and Reasoning for Action Localization via Language Query

Tianyu Li; Xinxiao Wu

doi:10.1007/978-3-030-60636-7_12

Tianyu Li, Xinxiao Wu^*

^*Corresponding author for this work

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper addresses temporal localization of actions in untrimmed videos via natural language queries. Prevailing methods represent the query sentence and the video each as a whole and perform sentence-video matching via global features, which neglects local correspondences between sentence and video. In this work, we move beyond this limitation by exploring fine-grained local sentence-video matching, such as phrase-motion matching and word-object matching. We propose a hierarchical matching and reasoning method based on a deep conditional random field that integrates hierarchical matching between visual concepts and textual semantics for temporal action localization via a query sentence. Our method decomposes each sentence into textual semantics (i.e., phrases and words), obtains multi-level matching results between the textual semantics and the visual concepts in a video (i.e., results of phrase-motion matching and word-object matching), and then reasons about the relations between the multi-level matching results via pairwise potentials of the conditional random field to achieve coherence in hierarchical matching. The overall potential is minimized, and the final matching score between a sentence and a video is computed as the conditional probability of the conditional random field. The proposed method is evaluated on the public Charades-STA dataset, and the experimental results show that it outperforms state-of-the-art methods.
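To make the reasoning step concrete, the following is a minimal sketch of how multi-level matching scores could be fused by a small conditional random field. It is an illustration only and not the authors' implementation: this record does not give the paper's potentials, network, or inference procedure. The function names, the three-node graph (sentence-video, phrase-motion, word-object), the negative-log unary cost, the disagreement penalty `pairwise_weight`, and the exact enumeration are all assumptions.

```python
import itertools
import math


def energy(labels, unary, edges, pairwise_weight):
    """Total potential of one labeling: unary costs plus a penalty per disagreeing edge.

    labels: tuple of 0/1 match decisions, one per node (hypothetical layout:
            node 0 = sentence-video, node 1 = phrase-motion, node 2 = word-object).
    unary:  per-node matching scores strictly between 0 and 1 (assumed to come
            from separate matching modules).
    """
    total = 0.0
    for node, y in enumerate(labels):
        score = unary[node]
        # Low cost when the label agrees with the node's own matching score.
        total += -math.log(score if y == 1 else 1.0 - score)
    for i, j in edges:
        # Pairwise potential: penalize levels that disagree with each other.
        if labels[i] != labels[j]:
            total += pairwise_weight
    return total


def sentence_match_probability(unary, edges, pairwise_weight=1.0):
    """Marginal P(y_0 = 1) under P(y) proportional to exp(-E(y)), by exact enumeration."""
    z_all = 0.0
    z_match = 0.0
    for labels in itertools.product((0, 1), repeat=len(unary)):
        weight = math.exp(-energy(labels, unary, edges, pairwise_weight))
        z_all += weight
        if labels[0] == 1:
            z_match += weight
    return z_match / z_all


# Fully connected three-level graph (an assumption made for this sketch).
EDGES = [(0, 1), (0, 2), (1, 2)]

if __name__ == "__main__":
    # A moderate global score supported by strong phrase and word matches.
    print(sentence_match_probability([0.6, 0.9, 0.9], EDGES))
    # The same global score contradicted by weak phrase and word matches.
    print(sentence_match_probability([0.6, 0.1, 0.1], EDGES))
```

With `pairwise_weight` set to 0 the levels are independent and the result equals the sentence-level score itself. A positive weight lets the phrase-level and word-level evidence raise or lower the final score, which is the coherence effect the abstract attributes to the pairwise potentials.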

Original language	English
Title of host publication	Pattern Recognition and Computer Vision - 3rd Chinese Conference, PRCV 2020, Proceedings
Editors	Yuxin Peng, Hongbin Zha, Qingshan Liu, Huchuan Lu, Zhenan Sun, Chenglin Liu, Xilin Chen, Jian Yang
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	137-148
Number of pages	12
ISBN (Print)	9783030606350
DOIs	https://doi.org/10.1007/978-3-030-60636-7_12
Publication status	Published - 2020
Event	3rd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2020, Nanjing, China, 16 Oct 2020 → 18 Oct 2020

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	12307 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	3rd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2020
Country/Territory	China
City	Nanjing
Period	16/10/20 → 18/10/20

Keywords

Action localization via language query
Conditional random field
Hierarchical matching

Cite this

Li, T., & Wu, X. (2020). Hierarchical Matching and Reasoning for Action Localization via Language Query. In Y. Peng, H. Zha, Q. Liu, H. Lu, Z. Sun, C. Liu, X. Chen, & J. Yang (Eds.), Pattern Recognition and Computer Vision - 3rd Chinese Conference, PRCV 2020, Proceedings (pp. 137-148). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12307 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-60636-7_12

@inproceedings{72ff4f3d78684105903434e5e8bcea22,

title = "Hierarchical Matching and Reasoning for Action Localization via Language Query",

keywords = "Action localization via language query, Conditional random field, Hierarchical matching",

author = "Tianyu Li and Xinxiao Wu",

note = "Publisher Copyright: {\textcopyright} 2020, Springer Nature Switzerland AG.; 3rd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2020 ; Conference date: 16-10-2020 Through 18-10-2020",

year = "2020",

doi = "10.1007/978-3-030-60636-7_12",

language = "English",

isbn = "9783030606350",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "137--148",

editor = "Yuxin Peng and Hongbin Zha and Qingshan Liu and Huchuan Lu and Zhenan Sun and Chenglin Liu and Xilin Chen and Jian Yang",

booktitle = "Pattern Recognition and Computer Vision - 3rd Chinese Conference, PRCV 2020, Proceedings",

address = "Germany",

}

TY - GEN

T1 - Hierarchical Matching and Reasoning for Action Localization via Language Query

AU - Li, Tianyu

AU - Wu, Xinxiao

PY - 2020

Y1 - 2020

KW - Action localization via language query

KW - Conditional random field

KW - Hierarchical matching

UR - http://www.scopus.com/inward/record.url?scp=85094157049&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-60636-7_12

DO - 10.1007/978-3-030-60636-7_12

M3 - Conference contribution

AN - SCOPUS:85094157049

SN - 9783030606350

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 137

EP - 148

BT - Pattern Recognition and Computer Vision - 3rd Chinese Conference, PRCV 2020, Proceedings

A2 - Peng, Yuxin

A2 - Zha, Hongbin

A2 - Liu, Qingshan

A2 - Lu, Huchuan

A2 - Sun, Zhenan

A2 - Liu, Chenglin

A2 - Chen, Xilin

A2 - Yang, Jian

PB - Springer Science and Business Media Deutschland GmbH

T2 - 3rd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2020

Y2 - 16 October 2020 through 18 October 2020

ER -
