TwinNet: Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization

Xiao Yu Zhang; Hai Chao Shi; Chang Sheng Li; Li Xin Duan

doi:10.1007/s11633-022-1333-4

Xiao Yu Zhang, Hai Chao Shi^*, Chang Sheng Li, Li Xin Duan

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

Action recognition and localization in untrimmed videos are important for many applications and have attracted a lot of attention. Since full supervision with frame-level annotation places an overwhelming burden on manual labeling effort, learning with weak video-level supervision becomes a potential solution. In this paper, we propose a novel weakly supervised framework to recognize actions and locate the corresponding frames in untrimmed videos simultaneously. Considering that there are abundant trimmed videos publicly available and well-segmented with semantic descriptions, the instructive knowledge learned on trimmed videos can be fully leveraged to analyze untrimmed videos. We present an effective knowledge transfer strategy based on inter-class semantic relevance. We also take advantage of the self-attention mechanism to obtain a compact video representation, such that the influence of background frames can be effectively eliminated. A learning architecture is designed with twin networks for trimmed and untrimmed videos, to facilitate transferable self-attentive representation learning. Extensive experiments are conducted on three untrimmed benchmark datasets (i.e., THUMOS14, ActivityNet1.3, and MEXaction2), and the experimental results clearly corroborate the efficacy of our method. It is especially encouraging to see that the proposed weakly supervised method even achieves comparable results to some fully supervised methods.

Original language	English
Pages (from-to)	227-246
Number of pages	20
Journal	Machine Intelligence Research
Volume	19
Issue number	3
DOIs	https://doi.org/10.1007/s11633-022-1333-4
Publication status	Published - Jun 2022

Keywords

Knowledge transfer
action localization
representation learning
self-attention mechanism
weakly supervised learning

Cite this

@article{3475c170daaf41769a2f8794fff73ed3,

title = "TwinNet: Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization",

abstract = "Action recognition and localization in untrimmed videos are important for many applications and have attracted a lot of attention. Since full supervision with frame-level annotation places an overwhelming burden on manual labeling effort, learning with weak video-level supervision becomes a potential solution. In this paper, we propose a novel weakly supervised framework to recognize actions and locate the corresponding frames in untrimmed videos simultaneously. Considering that there are abundant trimmed videos publicly available and well-segmented with semantic descriptions, the instructive knowledge learned on trimmed videos can be fully leveraged to analyze untrimmed videos. We present an effective knowledge transfer strategy based on inter-class semantic relevance. We also take advantage of the self-attention mechanism to obtain a compact video representation, such that the influence of background frames can be effectively eliminated. A learning architecture is designed with twin networks for trimmed and untrimmed videos, to facilitate transferable self-attentive representation learning. Extensive experiments are conducted on three untrimmed benchmark datasets (i.e., THUMOS14, ActivityNet1.3, and MEXaction2), and the experimental results clearly corroborate the efficacy of our method. It is especially encouraging to see that the proposed weakly supervised method even achieves comparable results to some fully supervised methods.",

keywords = "Knowledge transfer, action localization, representation learning, self-attention mechanism, weakly supervised learning",

author = "Zhang, {Xiao Yu} and Shi, {Hai Chao} and Li, {Chang Sheng} and Duan, {Li Xin}",

note = "Publisher Copyright: {\textcopyright} 2022, Institute of Automation, Chinese Academy of Sciences and Springer-Verlag GmbH Germany, part of Springer Nature.",

year = "2022",

month = jun,

doi = "10.1007/s11633-022-1333-4",

language = "English",

volume = "19",

pages = "227--246",

journal = "Machine Intelligence Research",

issn = "2731-538X",

publisher = "Chinese Academy of Sciences",

number = "3",

}

TY - JOUR

T1 - TwinNet

T2 - Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization

AU - Zhang, Xiao Yu

AU - Shi, Hai Chao

AU - Li, Chang Sheng

AU - Duan, Li Xin

PY - 2022/6

Y1 - 2022/6

N2 - Action recognition and localization in untrimmed videos are important for many applications and have attracted a lot of attention. Since full supervision with frame-level annotation places an overwhelming burden on manual labeling effort, learning with weak video-level supervision becomes a potential solution. In this paper, we propose a novel weakly supervised framework to recognize actions and locate the corresponding frames in untrimmed videos simultaneously. Considering that there are abundant trimmed videos publicly available and well-segmented with semantic descriptions, the instructive knowledge learned on trimmed videos can be fully leveraged to analyze untrimmed videos. We present an effective knowledge transfer strategy based on inter-class semantic relevance. We also take advantage of the self-attention mechanism to obtain a compact video representation, such that the influence of background frames can be effectively eliminated. A learning architecture is designed with twin networks for trimmed and untrimmed videos, to facilitate transferable self-attentive representation learning. Extensive experiments are conducted on three untrimmed benchmark datasets (i.e., THUMOS14, ActivityNet1.3, and MEXaction2), and the experimental results clearly corroborate the efficacy of our method. It is especially encouraging to see that the proposed weakly supervised method even achieves comparable results to some fully supervised methods.

KW - Knowledge transfer

KW - action localization

KW - representation learning

KW - self-attention mechanism

KW - weakly supervised learning

UR - http://www.scopus.com/inward/record.url?scp=85130925581&partnerID=8YFLogxK

U2 - 10.1007/s11633-022-1333-4

DO - 10.1007/s11633-022-1333-4

M3 - Article

AN - SCOPUS:85130925581

SN - 2731-538X

VL - 19

SP - 227

EP - 246

JO - Machine Intelligence Research

JF - Machine Intelligence Research

IS - 3

ER -
