StochasticFormer: Stochastic Modeling for Weakly Supervised Temporal Action Localization

Haichao Shi; Xiao Yu Zhang; Changsheng Li

doi:10.1109/TIP.2023.3244411

Haichao Shi, Xiao Yu Zhang^*, Changsheng Li

^*Corresponding author for this work

School of Computer Science

CAS - Institute of Information Engineering

Research output: Contribution to journal › Article › Peer-reviewed

5 Citations (Scopus)

Abstract

Weakly supervised temporal action localization (WS-TAL) aims to identify the time intervals corresponding to actions of interest in untrimmed videos using only video-level weak supervision. Most existing WS-TAL methods face two common challenges, under-localization and over-localization, both of which severely degrade performance. To address these issues, this paper proposes a transformer-structured stochastic process modeling framework, StochasticFormer, which models finer-grained interactions among the intermediate predictions to further refine localization. StochasticFormer is built on a standard attention-based pipeline that derives preliminary frame/snippet-level predictions. A pseudo localization module then generates variable-length pseudo action instances with corresponding pseudo labels. Using the pseudo 'action instance - action category' pairs as fine-grained pseudo supervision, the stochastic modeler learns the underlying interactions among the intermediate predictions with an encoder-decoder network. The encoder consists of a deterministic path and a latent path that capture local and global information, which the decoder then integrates to obtain reliable predictions. The framework is optimized with three losses: the video-level classification loss, the frame-level semantic coherence loss, and the ELBO loss. Extensive experiments on two benchmarks, THUMOS14 and ActivityNet1.2, demonstrate the efficacy of StochasticFormer compared with state-of-the-art methods.
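
The abstract names three losses but gives neither their exact form nor how they are weighted. As a rough illustration only, the sketch below shows how an ELBO term is commonly built for a latent-variable model (expected log-likelihood minus a KL divergence) and how three loss terms might be summed. The diagonal-Gaussian assumption, the function names `gaussian_kl`, `negative_elbo`, and `total_loss`, and the weights `w_coh` and `w_elbo` are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def gaussian_kl(mu_q: Sequence[float], sigma_q: Sequence[float],
                mu_p: Sequence[float], sigma_p: Sequence[float]) -> float:
    """Closed-form KL(q || p) between diagonal Gaussians, summed over dims.

    Assumption: both the posterior q and the prior p are diagonal Gaussians.
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


def negative_elbo(log_likelihood: float, kl: float) -> float:
    """Negative ELBO, i.e. -(E_q[log p(y | z)] - KL(q || p)), to be minimized."""
    return -(log_likelihood - kl)


def total_loss(cls_loss: float, coherence_loss: float, neg_elbo: float,
               w_coh: float = 1.0, w_elbo: float = 1.0) -> float:
    """Weighted sum of the three terms; the weights are hypothetical."""
    return cls_loss + w_coh * coherence_loss + w_elbo * neg_elbo


if __name__ == "__main__":
    kl = gaussian_kl([1.0], [1.0], [0.0], [1.0])
    loss = total_loss(cls_loss=1.0, coherence_loss=2.0,
                      neg_elbo=negative_elbo(-2.0, kl))
    print(kl, loss)
```

In a real training setup these terms would be computed on tensors with automatic differentiation; plain floats are used here only to keep the arithmetic easy to follow.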

Original language	English
Pages (from-to)	1379-1389
Number of pages	11
Journal	IEEE Transactions on Image Processing
Volume	32
DOI	https://doi.org/10.1109/TIP.2023.3244411
Publication status	Published - 2023

Cite this

@article{e99c13ca813e4232aafcaa213dd267ad,

title = "StochasticFormer: Stochastic Modeling for Weakly Supervised Temporal Action Localization",

abstract = "Weakly supervised temporal action localization (WS-TAL) aims to identify the time intervals corresponding to actions of interest in untrimmed videos with video-level weak supervision. For most existing WS-TAL methods, two commonly encountered challenges are under-localization and over-localization, which inevitably bring about severe performance deterioration. To address the issues, this paper proposes a transformer-structured stochastic process modeling framework, namely StochasticFormer, to fully investigate finer-grained interactions among the intermediate predictions to achieve further refined localization. StochasticFormer is built on a standard attention-based pipeline to derive preliminary frame/snippet-level predictions. Then, the pseudo localization module generates variable-length pseudo action instances with the corresponding pseudo labels. Using the pseudo 'action instance - action category' pairs as fine-grained pseudo supervision, the stochastic modeler aims to learn the underlying interaction among the intermediate predictions with an encoder-decoder network. The encoder consists of the deterministic and latent path to capture the local and global information, which are subsequently integrated by the decoder to obtain reliable predictions. The framework is optimized with three carefully designed losses, i.e. the video-level classification loss, the frame-level semantic coherence loss, and the ELBO loss. Extensive experiments on two benchmarks, i.e., THUMOS14 and ActivityNet1.2, have shown the efficacy of StochasticFormer compared with the state-of-the-art methods.",

keywords = "Temporal action localization, action recognition, stochastic process",

author = "Shi, Haichao and Zhang, {Xiao Yu} and Li, Changsheng",

note = "Publisher Copyright: {\textcopyright} 1992-2012 IEEE.",

year = "2023",

doi = "10.1109/TIP.2023.3244411",

language = "English",

volume = "32",

pages = "1379--1389",

journal = "IEEE Transactions on Image Processing",

issn = "1057-7149",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - StochasticFormer: Stochastic Modeling for Weakly Supervised Temporal Action Localization

AU - Shi, Haichao

AU - Zhang, Xiao Yu

AU - Li, Changsheng

PY - 2023

Y1 - 2023

N2 - Weakly supervised temporal action localization (WS-TAL) aims to identify the time intervals corresponding to actions of interest in untrimmed videos with video-level weak supervision. For most existing WS-TAL methods, two commonly encountered challenges are under-localization and over-localization, which inevitably bring about severe performance deterioration. To address the issues, this paper proposes a transformer-structured stochastic process modeling framework, namely StochasticFormer, to fully investigate finer-grained interactions among the intermediate predictions to achieve further refined localization. StochasticFormer is built on a standard attention-based pipeline to derive preliminary frame/snippet-level predictions. Then, the pseudo localization module generates variable-length pseudo action instances with the corresponding pseudo labels. Using the pseudo 'action instance - action category' pairs as fine-grained pseudo supervision, the stochastic modeler aims to learn the underlying interaction among the intermediate predictions with an encoder-decoder network. The encoder consists of the deterministic and latent path to capture the local and global information, which are subsequently integrated by the decoder to obtain reliable predictions. The framework is optimized with three carefully designed losses, i.e. the video-level classification loss, the frame-level semantic coherence loss, and the ELBO loss. Extensive experiments on two benchmarks, i.e., THUMOS14 and ActivityNet1.2, have shown the efficacy of StochasticFormer compared with the state-of-the-art methods.

KW - Temporal action localization

KW - action recognition

KW - stochastic process

UR - http://www.scopus.com/inward/record.url?scp=85149282069&partnerID=8YFLogxK

U2 - 10.1109/TIP.2023.3244411

DO - 10.1109/TIP.2023.3244411

M3 - Article

AN - SCOPUS:85149282069

SN - 1057-7149

VL - 32

SP - 1379

EP - 1389

JO - IEEE Transactions on Image Processing

JF - IEEE Transactions on Image Processing

ER -
