Exploiting Informative Video Segments for Temporal Action Localization

Che Sun; Hao Song; Xinxiao Wu; Yunde Jia; Jiebo Luo

doi:10.1109/TMM.2021.3050067

Che Sun, Hao Song, Xinxiao Wu^*, Yunde Jia, Jiebo Luo

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

24 Citations (Scopus)

Abstract

We propose a novel method of exploiting informative video segments by learning segment weights for temporal action localization in untrimmed videos. Informative video segments represent the intrinsic motion and appearance of an action, and thus contribute crucially to action localization. The learned segment weights represent the informativeness of video segments to recognize actions and help infer the boundaries required to temporally localize actions. We build a supervised temporal attention network (STAN) that includes a supervised segment-level attention module to dynamically learn the weights of video segments, and a feature-level attention module to effectively fuse multiple features of segments. Through the cascade of the attention modules, STAN exploits informative video segments and generates descriptive and discriminative video representations. We use a proposal generator and a classifier to estimate the boundaries of actions and classify the classes of actions. Extensive experiments are conducted on two public benchmarks, i.e., THUMOS2014 and ActivityNet1.3. The results demonstrate that our proposed method achieves competitive performance compared with existing state-of-the-art methods. Moreover, compared with the baseline method that treats video segments equally, STAN achieves significant improvements with an increase of the mean average precision from 30.4% to 39.8% on the THUMOS2014 dataset, and from 31.4% to 35.9% on the ActivityNet1.3 dataset, demonstrating the effectiveness of learning informative video segments for temporal action localization.
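The weighting idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the softmax normalization, the hand-picked scores, and the order in which the two attention steps are cascaded are all assumptions made for illustration. In STAN the weights are learned by the network, with supervision at the segment level. The sketch only contrasts attention-weighted pooling with the equal-weight baseline mentioned in the abstract.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1 (assumed normalization)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(vectors, weights):
    """Weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def fuse_features(feature_vectors, feature_scores):
    """Feature-level attention (hypothetical form): fuse several feature
    types of one segment, e.g. appearance and motion, into one vector."""
    return weighted_pool(feature_vectors, softmax(feature_scores))


def attended_representation(segment_vectors, segment_scores):
    """Segment-level attention (hypothetical form): informative segments
    receive larger weights in the pooled video representation."""
    return weighted_pool(segment_vectors, softmax(segment_scores))


def uniform_representation(segment_vectors):
    """Baseline that treats all segments equally."""
    n = len(segment_vectors)
    return weighted_pool(segment_vectors, [1.0 / n] * n)


# Toy input: three segments, each with an appearance and a motion feature.
appearance = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
motion = [[0.5, 0.5], [0.0, 1.0], [0.0, 0.0]]

# Fuse the two feature types per segment, then pool over segments.
fused = [fuse_features([a, m], [0.0, 0.0]) for a, m in zip(appearance, motion)]
print(attended_representation(fused, [2.0, 1.0, -1.0]))
print(uniform_representation(fused))
```

With equal scores the attended representation reduces to the uniform baseline, which makes the baseline a special case of the weighted scheme in this sketch.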

Original language	English
Pages (from-to)	274-287
Number of pages	14
Journal	IEEE Transactions on Multimedia
Volume	24
DOIs	https://doi.org/10.1109/TMM.2021.3050067
Publication status	Published - 2022

Keywords

Temporal action localization
attention mechanism
informative video segments
supervised temporal attention network

Cite this

Sun, C., Song, H., Wu, X., Jia, Y., & Luo, J. (2022). Exploiting Informative Video Segments for Temporal Action Localization. IEEE Transactions on Multimedia, 24, 274-287. https://doi.org/10.1109/TMM.2021.3050067

@article{58eabece210244b693cd06fc17cdefb3,

title = "Exploiting Informative Video Segments for Temporal Action Localization",

abstract = "We propose a novel method of exploiting informative video segments by learning segment weights for temporal action localization in untrimmed videos. Informative video segments represent the intrinsic motion and appearance of an action, and thus contribute crucially to action localization. The learned segment weights represent the informativeness of video segments to recognize actions and help infer the boundaries required to temporally localize actions. We build a supervised temporal attention network (STAN) that includes a supervised segment-level attention module to dynamically learn the weights of video segments, and a feature-level attention module to effectively fuse multiple features of segments. Through the cascade of the attention modules, STAN exploits informative video segments and generates descriptive and discriminative video representations. We use a proposal generator and a classifier to estimate the boundaries of actions and classify the classes of actions. Extensive experiments are conducted on two public benchmarks, i.e., THUMOS2014 and ActivityNet1.3. The results demonstrate that our proposed method achieves competitive performance compared with existing state-of-the-art methods. Moreover, compared with the baseline method that treats video segments equally, STAN achieves significant improvements with an increase of the mean average precision from 30.4\% to 39.8\% on the THUMOS2014 dataset, and from 31.4\% to 35.9\% on the ActivityNet1.3 dataset, demonstrating the effectiveness of learning informative video segments for temporal action localization.",

keywords = "Temporal action localization, attention mechanism, informative video segments, supervised temporal attention network",

author = "Che Sun and Hao Song and Xinxiao Wu and Yunde Jia and Jiebo Luo",

note = "Publisher Copyright: {\textcopyright} 2021 IEEE.",

year = "2022",

doi = "10.1109/TMM.2021.3050067",

language = "English",

volume = "24",

pages = "274--287",

journal = "IEEE Transactions on Multimedia",

issn = "1520-9210",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Exploiting Informative Video Segments for Temporal Action Localization

AU - Sun, Che

AU - Song, Hao

AU - Wu, Xinxiao

AU - Jia, Yunde

AU - Luo, Jiebo

PY - 2022

Y1 - 2022

N2 - We propose a novel method of exploiting informative video segments by learning segment weights for temporal action localization in untrimmed videos. Informative video segments represent the intrinsic motion and appearance of an action, and thus contribute crucially to action localization. The learned segment weights represent the informativeness of video segments to recognize actions and help infer the boundaries required to temporally localize actions. We build a supervised temporal attention network (STAN) that includes a supervised segment-level attention module to dynamically learn the weights of video segments, and a feature-level attention module to effectively fuse multiple features of segments. Through the cascade of the attention modules, STAN exploits informative video segments and generates descriptive and discriminative video representations. We use a proposal generator and a classifier to estimate the boundaries of actions and classify the classes of actions. Extensive experiments are conducted on two public benchmarks, i.e., THUMOS2014 and ActivityNet1.3. The results demonstrate that our proposed method achieves competitive performance compared with existing state-of-the-art methods. Moreover, compared with the baseline method that treats video segments equally, STAN achieves significant improvements with an increase of the mean average precision from 30.4% to 39.8% on the THUMOS2014 dataset, and from 31.4% to 35.9% on the ActivityNet1.3 dataset, demonstrating the effectiveness of learning informative video segments for temporal action localization.

AB - We propose a novel method of exploiting informative video segments by learning segment weights for temporal action localization in untrimmed videos. Informative video segments represent the intrinsic motion and appearance of an action, and thus contribute crucially to action localization. The learned segment weights represent the informativeness of video segments to recognize actions and help infer the boundaries required to temporally localize actions. We build a supervised temporal attention network (STAN) that includes a supervised segment-level attention module to dynamically learn the weights of video segments, and a feature-level attention module to effectively fuse multiple features of segments. Through the cascade of the attention modules, STAN exploits informative video segments and generates descriptive and discriminative video representations. We use a proposal generator and a classifier to estimate the boundaries of actions and classify the classes of actions. Extensive experiments are conducted on two public benchmarks, i.e., THUMOS2014 and ActivityNet1.3. The results demonstrate that our proposed method achieves competitive performance compared with existing state-of-the-art methods. Moreover, compared with the baseline method that treats video segments equally, STAN achieves significant improvements with an increase of the mean average precision from 30.4% to 39.8% on the THUMOS2014 dataset, and from 31.4% to 35.9% on the ActivityNet1.3 dataset, demonstrating the effectiveness of learning informative video segments for temporal action localization.

KW - Temporal action localization

KW - attention mechanism

KW - informative video segments

KW - supervised temporal attention network

UR - http://www.scopus.com/inward/record.url?scp=85099553870&partnerID=8YFLogxK

U2 - 10.1109/TMM.2021.3050067

DO - 10.1109/TMM.2021.3050067

M3 - Article

AN - SCOPUS:85099553870

SN - 1520-9210

VL - 24

SP - 274

EP - 287

JO - IEEE Transactions on Multimedia

JF - IEEE Transactions on Multimedia

ER -
