Temporal action localization in untrimmed videos using action pattern trees

Hao Song, Xinxiao Wu*, Bing Zhu, Yuwei Wu, Mei Chen, Yunde Jia

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

26 Citations (Scopus)

Abstract

In this paper, we present a novel framework for automatically localizing action instances in long untrimmed videos based on action pattern trees (AP-Trees). To localize action instances in videos of varied temporal lengths, we first split videos into sequential segments and then use AP-Trees to produce precise temporal boundaries of the action instances. AP-Trees exploit the temporal information between video segments based on the label vectors of the segments, by learning the occurrence frequency and order of segments. In an AP-Tree, nodes stand for the action class labels of segments and edges represent the temporal relationships between two consecutive segments, so the occurrence frequency of a segment sequence can be discovered by searching paths of the tree. To obtain accurate labels for video segments, we introduce deep neural networks that annotate segments by simultaneously leveraging their spatio-temporal information and high-level semantic features. In the networks, informative action maps are generated by a global average pooling layer to retain the spatio-temporal information of segments. An overlap loss function further improves the precision of segment label vectors by considering the temporal overlap between segments and the ground truth. Experiments on the THUMOS2014, MSR ActionII, and MPII Cooking datasets demonstrate the effectiveness of the method.
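The abstract's core data structure can be sketched as a prefix tree over per-segment action labels: nodes carry a segment's class label, edges link consecutive segments, and a path's node count gives the occurrence frequency of that label sequence. The class and method names below are illustrative assumptions, not taken from the paper:

```python
# Minimal sketch of an AP-Tree as described in the abstract: nodes hold
# segment action labels, edges connect consecutive segments, and counts
# record how often a label sequence (path) occurs across training videos.
# Names (APTree, insert, frequency) are hypothetical, for illustration only.

class APTreeNode:
    def __init__(self, label):
        self.label = label    # action class label of one segment
        self.count = 0        # number of inserted sequences passing through
        self.children = {}    # next segment label -> child node

class APTree:
    def __init__(self):
        self.root = APTreeNode(label=None)  # empty root, not a segment

    def insert(self, labels):
        """Insert one video's per-segment label sequence, updating counts."""
        node = self.root
        for label in labels:
            node = node.children.setdefault(label, APTreeNode(label))
            node.count += 1

    def frequency(self, labels):
        """Occurrence frequency of a label sequence, found by path search."""
        node = self.root
        for label in labels:
            node = node.children.get(label)
            if node is None:
                return 0
        return node.count

# Toy usage: three segment-label sequences from three videos.
tree = APTree()
tree.insert(["run", "jump", "land"])
tree.insert(["run", "jump", "roll"])
tree.insert(["walk", "run"])
print(tree.frequency(["run", "jump"]))  # → 2
```

The prefix-tree layout means a sequence query costs one dictionary lookup per segment, so frequent consecutive-label patterns can be read off directly from node counts.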

Original language: English
Article number: 8440749
Pages (from-to): 717-730
Number of pages: 14
Journal: IEEE Transactions on Multimedia
Volume: 21
Issue number: 3
Publication status: Published - Mar 2019

Keywords

  • Action pattern tree
  • Informative action maps
  • Overlap loss function
  • Temporal action localization
