Weakly-supervised action localization via embedding-modeling iterative optimization

Xiao Yu Zhang*, Haichao Shi, Changsheng Li, Peng Li, Zekun Li, Peng Ren

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

11 Citations (Scopus)

Abstract

Action recognition and localization in untrimmed videos under weak supervision is a challenging problem with great application prospects. Limited by the information available in video-level labels, it is a promising attempt to fully leverage the instructive knowledge learned on trimmed videos to facilitate the analysis of untrimmed videos, since abundant trimmed videos are publicly available, well segmented, and accompanied by semantic descriptions. To enable effective trimmed-untrimmed augmentation, this paper presents a novel framework, the embedding-modeling iterative optimization network, referred to as IONet. In the proposed method, action classification modeling and shared subspace embedding are learned jointly in an iterative way, so that robust cross-domain knowledge transfer is achieved. With a carefully designed two-stage self-attentive representation learning workflow for untrimmed videos, irrelevant backgrounds are eliminated and fine-grained temporal relevance can be robustly explored. Extensive experiments are conducted on two benchmark datasets, THUMOS14 and ActivityNet1.3, and the results clearly corroborate the efficacy of our method. Source code is available on GitHub.
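The abstract mentions a self-attentive representation learning workflow that suppresses irrelevant backgrounds in untrimmed videos. As a rough illustration of that general idea (not the authors' exact IONet architecture), the sketch below shows a common weakly-supervised pattern: a temporal attention branch scores clip-level features, and the attention-pooled video feature is classified with video-level labels only. The feature dimension, class count, and module names are assumptions for illustration.

```python
# Minimal sketch of temporal self-attention for weakly-supervised action
# localization. Assumes clip-level features (T, D) from a pretrained backbone;
# this is an illustrative approximation, not the paper's exact IONet model.
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    def __init__(self, feat_dim=1024, num_classes=20):
        super().__init__()
        # Attention branch: scores how likely each clip contains an action.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )
        # Classification branch over the attention-pooled video feature.
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_feats):
        # clip_feats: (T, D) clip-level features of one untrimmed video.
        attn_logits = self.attention(clip_feats)      # (T, 1)
        attn = torch.softmax(attn_logits, dim=0)      # temporal attention weights
        video_feat = (attn * clip_feats).sum(dim=0)   # background-suppressed pooling, (D,)
        video_logits = self.classifier(video_feat)    # video-level class scores
        return video_logits, attn.squeeze(-1)

# Usage: train with video-level labels only; at test time, threshold the
# attention weights (and per-clip class activations) to localize actions.
feats = torch.randn(120, 1024)   # 120 clips, hypothetical features
model = TemporalSelfAttention()
logits, attn = model(feats)
```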

Original language: English
Article number: 107831
Journal: Pattern Recognition
Volume: 113
DOIs
Publication status: Published - May 2021

Keywords

  • Action recognition
  • Attention mechanism
  • Generative adversarial networks
  • Subspace embedding
  • Temporal action localization
