AdapNet: Adaptability Decomposing Encoder-Decoder Network for Weakly Supervised Action Recognition and Localization

Xiao Yu Zhang; Changsheng Li; Haichao Shi; Xiaobin Zhu; Peng Li; Jing Dong

doi:10.1109/TNNLS.2019.2962815

Xiao Yu Zhang, Changsheng Li, Haichao Shi*, Xiaobin Zhu, Peng Li, Jing Dong

*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

35 Citations (Scopus)

Abstract

The point process is a solid framework to model sequential data, such as videos, by exploring the underlying relevance. As a challenging problem for high-level video understanding, weakly supervised action recognition and localization in untrimmed videos have attracted intensive research attention. Knowledge transfer by leveraging the publicly available trimmed videos as external guidance is a promising attempt to make up for the coarse-grained video-level annotation and improve the generalization performance. However, unconstrained knowledge transfer may bring about irrelevant noise and jeopardize the learning model. This article proposes a novel adaptability decomposing encoder-decoder network to transfer reliable knowledge between the trimmed and untrimmed videos for action recognition and localization by bidirectional point process modeling, given only video-level annotations. By decomposing the original features into the domain-adaptable and domain-specific ones based on their adaptability, trimmed-untrimmed knowledge transfer can be safely confined within a more coherent subspace. An encoder-decoder-based structure is carefully designed and jointly optimized to facilitate effective action classification and temporal localization. Extensive experiments are conducted on two benchmark data sets (i.e., THUMOS14 and ActivityNet1.3), and the experimental results clearly corroborate the efficacy of our method.

Original language	English
Pages (from-to)	1852-1863
Number of pages	12
Journal	IEEE Transactions on Neural Networks and Learning Systems
Volume	34
Issue number	4
DOIs	https://doi.org/10.1109/TNNLS.2019.2962815
Publication status	Published - 1 Apr 2023

Keywords

Action recognition
encoder-decoder
knowledge transfer
point process
temporal action localization

Access to Document

10.1109/TNNLS.2019.2962815

Cite this

@article{fb73783e02da470d854419dfe384d51c,

title = "AdapNet: Adaptability Decomposing Encoder-Decoder Network for Weakly Supervised Action Recognition and Localization",

abstract = "The point process is a solid framework to model sequential data, such as videos, by exploring the underlying relevance. As a challenging problem for high-level video understanding, weakly supervised action recognition and localization in untrimmed videos have attracted intensive research attention. Knowledge transfer by leveraging the publicly available trimmed videos as external guidance is a promising attempt to make up for the coarse-grained video-level annotation and improve the generalization performance. However, unconstrained knowledge transfer may bring about irrelevant noise and jeopardize the learning model. This article proposes a novel adaptability decomposing encoder-decoder network to transfer reliable knowledge between the trimmed and untrimmed videos for action recognition and localization by bidirectional point process modeling, given only video-level annotations. By decomposing the original features into the domain-adaptable and domain-specific ones based on their adaptability, trimmed-untrimmed knowledge transfer can be safely confined within a more coherent subspace. An encoder-decoder-based structure is carefully designed and jointly optimized to facilitate effective action classification and temporal localization. Extensive experiments are conducted on two benchmark data sets (i.e., THUMOS14 and ActivityNet1.3), and the experimental results clearly corroborate the efficacy of our method.",

keywords = "Action recognition, encoder-decoder, knowledge transfer, point process, temporal action localization",

author = "Zhang, {Xiao Yu} and Changsheng Li and Haichao Shi and Xiaobin Zhu and Peng Li and Jing Dong",

note = "Publisher Copyright: {\textcopyright} 2012 IEEE.",

year = "2023",

month = apr,

day = "1",

doi = "10.1109/TNNLS.2019.2962815",

language = "English",

volume = "34",

pages = "1852--1863",

journal = "IEEE Transactions on Neural Networks and Learning Systems",

issn = "2162-237X",

publisher = "IEEE Computational Intelligence Society",

number = "4",

}

TY - JOUR

T1 - AdapNet

T2 - Adaptability Decomposing Encoder-Decoder Network for Weakly Supervised Action Recognition and Localization

AU - Zhang, Xiao Yu

AU - Li, Changsheng

AU - Shi, Haichao

AU - Zhu, Xiaobin

AU - Li, Peng

AU - Dong, Jing

PY - 2023/4/1

Y1 - 2023/4/1

AB - The point process is a solid framework to model sequential data, such as videos, by exploring the underlying relevance. As a challenging problem for high-level video understanding, weakly supervised action recognition and localization in untrimmed videos have attracted intensive research attention. Knowledge transfer by leveraging the publicly available trimmed videos as external guidance is a promising attempt to make up for the coarse-grained video-level annotation and improve the generalization performance. However, unconstrained knowledge transfer may bring about irrelevant noise and jeopardize the learning model. This article proposes a novel adaptability decomposing encoder-decoder network to transfer reliable knowledge between the trimmed and untrimmed videos for action recognition and localization by bidirectional point process modeling, given only video-level annotations. By decomposing the original features into the domain-adaptable and domain-specific ones based on their adaptability, trimmed-untrimmed knowledge transfer can be safely confined within a more coherent subspace. An encoder-decoder-based structure is carefully designed and jointly optimized to facilitate effective action classification and temporal localization. Extensive experiments are conducted on two benchmark data sets (i.e., THUMOS14 and ActivityNet1.3), and the experimental results clearly corroborate the efficacy of our method.

KW - Action recognition

KW - encoder-decoder

KW - knowledge transfer

KW - point process

KW - temporal action localization

UR - http://www.scopus.com/inward/record.url?scp=85079467157&partnerID=8YFLogxK

U2 - 10.1109/TNNLS.2019.2962815

DO - 10.1109/TNNLS.2019.2962815

M3 - Article

C2 - 31995502

AN - SCOPUS:85079467157

SN - 2162-237X

VL - 34

SP - 1852

EP - 1863

JO - IEEE Transactions on Neural Networks and Learning Systems

JF - IEEE Transactions on Neural Networks and Learning Systems

IS - 4

ER -
