Hierarchical Matching and Reasoning for Action Localization via Language Query

Tianyu Li, Xinxiao Wu*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper addresses temporal localization of actions in untrimmed videos via natural language queries. Prevailing methods represent both the query sentence and the video as wholes and perform sentence-video matching via global features, which neglects local correspondence between sentence and video. In this work, we aim to move beyond this limitation by delving into fine-grained local sentence-video matching, such as phrase-motion matching and word-object matching. We propose a hierarchical matching and reasoning method based on a deep conditional random field that integrates hierarchical matching between visual concepts and textual semantics for temporal action localization via a query sentence. Our method decomposes each sentence into textual semantics (i.e., phrases and words), obtains multi-level matching results between the textual semantics and the visual concepts in a video (i.e., results of phrase-motion matching and word-object matching), and then reasons about relations between the multi-level matching results via pairwise potentials of the conditional random field to achieve coherence in hierarchical matching. By minimizing the overall potential, the final matching score between a sentence and a video is computed as the conditional probability of the conditional random field. Our proposed method is evaluated on the public Charades-STA dataset, and the experimental results verify its superiority over state-of-the-art methods.
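As a rough illustration of the idea in the abstract, and not the authors' model, the following toy sketch treats three matching levels (sentence-video, phrase-motion, word-object) as binary variables in a small chain-structured conditional random field. The function names, the chain structure, the negative-log unary potentials, and the single `coupling` weight are all assumptions made for this example; the paper's actual potentials are learned by a deep network and are not specified here.

```python
import itertools
import math


def energy(labels, scores, coupling):
    """Total potential of one joint labeling (hypothetical toy form).

    labels: tuple of 0/1 match decisions, one per level.
    scores: per-level match scores strictly between 0 and 1.
    coupling: penalty paid when two adjacent levels disagree.
    """
    # Unary potentials: low energy when a label agrees with its score.
    unary = sum(
        -math.log(s if y == 1 else 1.0 - s) for y, s in zip(labels, scores)
    )
    # Pairwise potentials: penalize adjacent levels that disagree,
    # which encourages coherence across the hierarchy.
    pairwise = sum(coupling for a, b in zip(labels, labels[1:]) if a != b)
    return unary + pairwise


def sentence_match_probability(scores, coupling=1.0):
    """Marginal probability that the top (sentence-video) level matches."""
    weights = {
        labels: math.exp(-energy(labels, scores, coupling))
        for labels in itertools.product((0, 1), repeat=len(scores))
    }
    partition = sum(weights.values())  # normalizing constant Z
    matched = sum(w for labels, w in weights.items() if labels[0] == 1)
    return matched / partition


# Levels: (sentence-video, phrase-motion, word-object), made-up scores.
print(sentence_match_probability((0.6, 0.9, 0.9), coupling=1.0))
```

In this sketch, a `coupling` of zero leaves the sentence-level score unchanged, while a positive `coupling` lets strong phrase-level and word-level matches raise it and weak ones lower it. Exhaustive enumeration is only feasible because the toy model has three binary variables.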

Original language: English
Title of host publication: Pattern Recognition and Computer Vision - 3rd Chinese Conference, PRCV 2020, Proceedings
Editors: Yuxin Peng, Hongbin Zha, Qingshan Liu, Huchuan Lu, Zhenan Sun, Chenglin Liu, Xilin Chen, Jian Yang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 137-148
Number of pages: 12
ISBN (Print): 9783030606350
Publication status: Published - 2020
Event: 3rd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2020 - Nanjing, China
Duration: 16 Oct 2020 to 18 Oct 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12307 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 3rd Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2020
Country/Territory: China
City: Nanjing
Period: 16/10/20 to 18/10/20

Keywords

  • Action localization via language query
  • Conditional random field
  • Hierarchical matching
