TY - JOUR
T1 - Temporal action localization in untrimmed videos using action pattern trees
AU - Song, Hao
AU - Wu, Xinxiao
AU - Zhu, Bing
AU - Wu, Yuwei
AU - Chen, Mei
AU - Jia, Yunde
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2019/3
Y1 - 2019/3
N2 - In this paper, we present a novel framework for automatically localizing action instances in long untrimmed videos based on action pattern trees (AP-Trees). To localize action instances in videos of varied temporal lengths, we first split videos into sequential segments and then use the AP-Trees to produce precise temporal boundaries of action instances. The AP-Trees exploit the temporal information between video segments, based on the label vectors of the segments, by learning the occurrence frequency and order of segments. In an AP-Tree, nodes represent the action class labels of segments and edges represent the temporal relationship between two consecutive segments. Thus, the occurrence frequencies of segments can be discovered by searching the paths of the AP-Trees. To obtain accurate labels of video segments, we introduce deep neural networks that annotate the segments by simultaneously leveraging their spatio-temporal information and high-level semantic features. In the networks, informative action maps are generated by a global average pooling layer to retain the spatio-temporal information of segments. An overlap loss function is employed to further improve the precision of the segment label vectors by considering the temporal overlap between segments and the ground truth. Experiments on the THUMOS2014, MSR ActionII, and MPII Cooking datasets demonstrate the effectiveness of the method.
AB - In this paper, we present a novel framework for automatically localizing action instances in long untrimmed videos based on action pattern trees (AP-Trees). To localize action instances in videos of varied temporal lengths, we first split videos into sequential segments and then use the AP-Trees to produce precise temporal boundaries of action instances. The AP-Trees exploit the temporal information between video segments, based on the label vectors of the segments, by learning the occurrence frequency and order of segments. In an AP-Tree, nodes represent the action class labels of segments and edges represent the temporal relationship between two consecutive segments. Thus, the occurrence frequencies of segments can be discovered by searching the paths of the AP-Trees. To obtain accurate labels of video segments, we introduce deep neural networks that annotate the segments by simultaneously leveraging their spatio-temporal information and high-level semantic features. In the networks, informative action maps are generated by a global average pooling layer to retain the spatio-temporal information of segments. An overlap loss function is employed to further improve the precision of the segment label vectors by considering the temporal overlap between segments and the ground truth. Experiments on the THUMOS2014, MSR ActionII, and MPII Cooking datasets demonstrate the effectiveness of the method.
KW - Action pattern tree
KW - Informative action maps
KW - Overlap loss function
KW - Temporal action localization
UR - http://www.scopus.com/inward/record.url?scp=85051810917&partnerID=8YFLogxK
U2 - 10.1109/TMM.2018.2866370
DO - 10.1109/TMM.2018.2866370
M3 - Article
AN - SCOPUS:85051810917
SN - 1520-9210
VL - 21
SP - 717
EP - 730
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
IS - 3
M1 - 8440749
ER -