Abstract
Weakly supervised action recognition and localization in untrimmed videos is a challenging problem with broad application prospects. Since video-level labels provide only limited information, a promising direction is to fully leverage the instructive knowledge learned from trimmed videos to facilitate the analysis of untrimmed videos, given that abundant trimmed videos are publicly available, well segmented, and annotated with semantic descriptions. To enable effective trimmed-to-untrimmed augmentation, this paper presents a novel embedding-modeling iterative optimization network, referred to as IONet. In the proposed method, action classification modeling and shared subspace embedding are learned jointly in an iterative manner, so that robust cross-domain knowledge transfer is achieved. With a carefully designed two-stage self-attentive representation learning workflow for untrimmed videos, irrelevant background is suppressed and fine-grained temporal relevance is robustly explored. Extensive experiments on two benchmark datasets, THUMOS14 and ActivityNet1.3, clearly corroborate the efficacy of the proposed method. Source code is available on GitHub.
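As a hedged illustration only (not the authors' released IONet code), the sketch below shows the general idea of self-attentive weighting over segment features in weakly supervised action localization: per-segment attention scores down-weight background segments before video-level classification, and the learned weights can serve as temporal localization cues. The module name, feature dimension, and class count are assumptions for the example.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Minimal attention-based temporal pooling over segment features (illustrative)."""
    def __init__(self, feat_dim=1024, num_classes=20):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.Tanh(), nn.Linear(256, 1)
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                       # x: (batch, T segments, feat_dim)
        scores = self.attention(x)              # (batch, T, 1) per-segment relevance
        weights = torch.softmax(scores, dim=1)  # normalize over the temporal axis
        video_feat = (weights * x).sum(dim=1)   # attention-weighted video embedding
        return self.classifier(video_feat), weights.squeeze(-1)

# Usage sketch: segment features from a pretrained backbone feed the module;
# training uses only video-level labels, and attention weights give localization cues.
feats = torch.randn(2, 100, 1024)
logits, attn = AttentionPooling()(feats)
```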
Original language | English
---|---
Article number | 107831
Journal | Pattern Recognition
Volume | 113
DOI |
Publication status | Published - May 2021