TY - JOUR
T1 - StochasticFormer
T2 - Stochastic Modeling for Weakly Supervised Temporal Action Localization
AU - Shi, Haichao
AU - Zhang, Xiao Yu
AU - Li, Changsheng
N1 - Publisher Copyright:
© 1992-2012 IEEE.
PY - 2023
Y1 - 2023
N2 - Weakly supervised temporal action localization (WS-TAL) aims to identify the time intervals corresponding to actions of interest in untrimmed videos with video-level weak supervision. For most existing WS-TAL methods, two commonly encountered challenges are under-localization and over-localization, which inevitably bring about severe performance deterioration. To address the issues, this paper proposes a transformer-structured stochastic process modeling framework, namely StochasticFormer, to fully investigate finer-grained interactions among the intermediate predictions to achieve further refined localization. StochasticFormer is built on a standard attention-based pipeline to derive preliminary frame/snippet-level predictions. Then, the pseudo localization module generates variable-length pseudo action instances with the corresponding pseudo labels. Using the pseudo 'action instance - action category' pairs as fine-grained pseudo supervision, the stochastic modeler aims to learn the underlying interaction among the intermediate predictions with an encoder-decoder network. The encoder consists of the deterministic and latent path to capture the local and global information, which are subsequently integrated by the decoder to obtain reliable predictions. The framework is optimized with three carefully designed losses, i.e. the video-level classification loss, the frame-level semantic coherence loss, and the ELBO loss. Extensive experiments on two benchmarks, i.e., THUMOS14 and ActivityNet1.2, have shown the efficacy of StochasticFormer compared with the state-of-the-art methods.
AB - Weakly supervised temporal action localization (WS-TAL) aims to identify the time intervals corresponding to actions of interest in untrimmed videos with video-level weak supervision. For most existing WS-TAL methods, two commonly encountered challenges are under-localization and over-localization, which inevitably bring about severe performance deterioration. To address the issues, this paper proposes a transformer-structured stochastic process modeling framework, namely StochasticFormer, to fully investigate finer-grained interactions among the intermediate predictions to achieve further refined localization. StochasticFormer is built on a standard attention-based pipeline to derive preliminary frame/snippet-level predictions. Then, the pseudo localization module generates variable-length pseudo action instances with the corresponding pseudo labels. Using the pseudo 'action instance - action category' pairs as fine-grained pseudo supervision, the stochastic modeler aims to learn the underlying interaction among the intermediate predictions with an encoder-decoder network. The encoder consists of the deterministic and latent path to capture the local and global information, which are subsequently integrated by the decoder to obtain reliable predictions. The framework is optimized with three carefully designed losses, i.e. the video-level classification loss, the frame-level semantic coherence loss, and the ELBO loss. Extensive experiments on two benchmarks, i.e., THUMOS14 and ActivityNet1.2, have shown the efficacy of StochasticFormer compared with the state-of-the-art methods.
KW - Temporal action localization
KW - action recognition
KW - stochastic process
UR - http://www.scopus.com/inward/record.url?scp=85149282069&partnerID=8YFLogxK
U2 - 10.1109/TIP.2023.3244411
DO - 10.1109/TIP.2023.3244411
M3 - Article
AN - SCOPUS:85149282069
SN - 1057-7149
VL - 32
SP - 1379
EP - 1389
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -