TY - JOUR
T1 - Temporal action localization in untrimmed videos using action pattern trees
AU - Song, Hao
AU - Wu, Xinxiao
AU - Zhu, Bing
AU - Wu, Yuwei
AU - Chen, Mei
AU - Jia, Yunde
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2019/3
Y1 - 2019/3
N2 - In this paper, we present a novel framework for automatically localizing action instances in long untrimmed videos based on action pattern trees (AP-Trees). To localize action instances in videos of varied temporal lengths, we first split videos into sequential segments and then use the AP-Trees to produce precise temporal boundaries of action instances. The AP-Trees exploit the temporal information between video segments, based on the label vectors of the segments, by learning the occurrence frequency and order of segments. In an AP-Tree, nodes represent the action class labels of segments and edges represent the temporal relationship between two consecutive segments. Thus, the occurrence frequencies of segments can be discovered by searching the paths of the AP-Trees. To obtain accurate labels of video segments, we introduce deep neural networks that annotate the segments by simultaneously leveraging their spatio-temporal information and high-level semantic features. In the networks, informative action maps are generated by a global average pooling layer to retain the spatio-temporal information of segments. An overlap loss function is employed to further improve the precision of the segment label vectors by considering the temporal overlap between segments and the ground truth. Experiments on the THUMOS2014, MSR ActionII, and MPII Cooking datasets demonstrate the effectiveness of the method.
AB - In this paper, we present a novel framework for automatically localizing action instances in long untrimmed videos based on action pattern trees (AP-Trees). To localize action instances in videos of varied temporal lengths, we first split videos into sequential segments and then use the AP-Trees to produce precise temporal boundaries of action instances. The AP-Trees exploit the temporal information between video segments, based on the label vectors of the segments, by learning the occurrence frequency and order of segments. In an AP-Tree, nodes represent the action class labels of segments and edges represent the temporal relationship between two consecutive segments. Thus, the occurrence frequencies of segments can be discovered by searching the paths of the AP-Trees. To obtain accurate labels of video segments, we introduce deep neural networks that annotate the segments by simultaneously leveraging their spatio-temporal information and high-level semantic features. In the networks, informative action maps are generated by a global average pooling layer to retain the spatio-temporal information of segments. An overlap loss function is employed to further improve the precision of the segment label vectors by considering the temporal overlap between segments and the ground truth. Experiments on the THUMOS2014, MSR ActionII, and MPII Cooking datasets demonstrate the effectiveness of the method.
KW - Action pattern tree
KW - Informative action maps
KW - Overlap loss function
KW - Temporal action localization
UR - http://www.scopus.com/inward/record.url?scp=85051810917&partnerID=8YFLogxK
U2 - 10.1109/TMM.2018.2866370
DO - 10.1109/TMM.2018.2866370
M3 - Article
AN - SCOPUS:85051810917
SN - 1520-9210
VL - 21
SP - 717
EP - 730
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
IS - 3
M1 - 8440749
ER -