TY - JOUR
T1 - Exploiting Informative Video Segments for Temporal Action Localization
AU - Sun, Che
AU - Song, Hao
AU - Wu, Xinxiao
AU - Jia, Yunde
AU - Luo, Jiebo
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2022
Y1 - 2022
N2 - We propose a novel method of exploiting informative video segments by learning segment weights for temporal action localization in untrimmed videos. Informative video segments capture the intrinsic motion and appearance of an action and are therefore crucial to action localization. The learned segment weights indicate how informative each segment is for recognizing actions and help infer the temporal boundaries of actions. We build a supervised temporal attention network (STAN) that comprises a supervised segment-level attention module, which dynamically learns the weights of video segments, and a feature-level attention module, which effectively fuses multiple features of segments. Through the cascade of these attention modules, STAN exploits informative video segments and generates descriptive and discriminative video representations. A proposal generator and a classifier then estimate the action boundaries and predict the action classes. Extensive experiments are conducted on two public benchmarks, THUMOS2014 and ActivityNet1.3. The results demonstrate that our method achieves competitive performance compared with existing state-of-the-art methods. Moreover, compared with a baseline that treats all video segments equally, STAN achieves significant improvements, raising the mean average precision from 30.4% to 39.8% on THUMOS2014 and from 31.4% to 35.9% on ActivityNet1.3, demonstrating the effectiveness of learning informative video segments for temporal action localization.
KW - Temporal action localization
KW - attention mechanism
KW - informative video segments
KW - supervised temporal attention network
UR - http://www.scopus.com/inward/record.url?scp=85099553870&partnerID=8YFLogxK
U2 - 10.1109/TMM.2021.3050067
DO - 10.1109/TMM.2021.3050067
M3 - Article
AN - SCOPUS:85099553870
SN - 1520-9210
VL - 24
SP - 274
EP - 287
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -