Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos

Xiao Yu Zhang, Haichao Shi, Changsheng Li^*, Peng Li^*

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

35 Citations (Scopus)

Abstract

Weakly supervised action recognition and localization for untrimmed videos is a challenging problem with extensive applications. The overwhelming amount of irrelevant background content in untrimmed videos severely hampers effective identification of the actions of interest. In this paper, we propose a novel multi-instance multi-label modeling network based on spatio-temporal pre-trimming to recognize actions and locate the corresponding frames in untrimmed videos. Motivated by the fact that the person is the key factor in a human action, we spatially and temporally segment each untrimmed video into person-centric clips with pose estimation and tracking techniques. Given the bag-of-instances structure associated with video-level labels, action recognition is naturally formulated as a multi-instance multi-label learning problem. The network is optimized iteratively with selective coarse-to-fine pre-trimming based on instance-label activation. After convergence, temporal localization is further achieved with a local-global temporal class activation map. Extensive experiments are conducted on two benchmark datasets, i.e., THUMOS14 and ActivityNet1.3, and the experimental results clearly corroborate the efficacy of our method when compared with state-of-the-art methods.
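
The abstract's multi-instance multi-label formulation can be illustrated with a small sketch. This is not the authors' network: the top-k mean pooling rule, the fixed activation threshold, the function names (`bag_scores`, `multilabel_bce`, `localize`) and all numbers are assumptions chosen for illustration only. It shows how per-clip class scores from one video (the "bag") might be pooled into video-level scores, trained against video-level labels alone, and how a per-frame activation sequence could be turned into temporal segments.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bag_scores(instance_scores, k=2):
    """Pool per-clip class scores into one score per class.

    instance_scores has shape [num_clips][num_classes].
    Top-k mean pooling is an assumed aggregation rule, not taken from the paper.
    """
    num_classes = len(instance_scores[0])
    pooled = []
    for c in range(num_classes):
        column = sorted((clip[c] for clip in instance_scores), reverse=True)
        top = column[:k]
        pooled.append(sum(top) / len(top))
    return pooled


def multilabel_bce(bag_logits, labels):
    """Binary cross-entropy averaged over classes, using video-level labels only."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(bag_logits, labels):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


def localize(activations, threshold=0.5):
    """Return [start, end) index ranges where activation >= threshold."""
    segments, start = [], None
    for t, a in enumerate(activations):
        if a >= threshold and start is None:
            start = t
        elif a < threshold and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(activations)))
    return segments


# Toy bag: 4 person-centric clips, 3 action classes (made-up scores).
clips = [
    [2.0, -1.0, -3.0],
    [1.0, -2.0, -3.0],
    [-2.0, 3.0, -3.0],
    [-3.0, 1.0, -2.0],
]
video_labels = [1, 1, 0]  # two actions present in the same video

pooled = bag_scores(clips, k=2)
loss = multilabel_bce(pooled, video_labels)
segments = localize([0.1, 0.7, 0.9, 0.2, 0.6], threshold=0.5)
print(pooled, round(loss, 4), segments)
```

In this toy setup no clip carries its own label; only the pooled, video-level scores are compared with the labels, which is the general idea behind weak supervision in a bag-of-instances setting.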

Original language	English
Title of host publication	AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
Publisher	AAAI press
Pages	12886-12893
Number of pages	8
ISBN (Electronic)	9781577358350
Publication status	Published - 2020
Event	34th AAAI Conference on Artificial Intelligence, AAAI 2020 - New York, United States (7 Feb 2020 → 12 Feb 2020)

Publication series

Name	AAAI 2020 - 34th AAAI Conference on Artificial Intelligence

Conference

Conference	34th AAAI Conference on Artificial Intelligence, AAAI 2020
Country/Territory	United States
City	New York
Period	7/02/20 → 12/02/20

Cite this

@inproceedings{9dbb9b26781441ff89fc5dfc5cf02bd9,

title = "Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos",

abstract = "Weakly supervised action recognition and localization for untrimmed videos is a challenging problem with extensive applications. The overwhelming irrelevant background contents in untrimmed videos severely hamper effective identification of actions of interest. In this paper, we propose a novel multi-instance multi-label modeling network based on spatio-temporal pre-trimming to recognize actions and locate corresponding frames in untrimmed videos. Motivated by the fact that person is the key factor in a human action, we spatially and temporally segment each untrimmed video into person-centric clips with pose estimation and tracking techniques. Given the bag-of-instances structure associated with video-level labels, action recognition is naturally formulated as a multi-instance multi-label learning problem. The network is optimized iteratively with selective coarse-to-fine pre-trimming based on instance-label activation. After convergence, temporal localization is further achieved with local-global temporal class activation map. Extensive experiments are conducted on two benchmark datasets, i.e. THUMOS14 and ActivityNet1.3, and experimental results clearly corroborate the efficacy of our method when compared with the state-of-the-arts.",

author = "Zhang, {Xiao Yu} and Haichao Shi and Changsheng Li and Peng Li",

note = "Publisher Copyright: Copyright 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.; 34th AAAI Conference on Artificial Intelligence, AAAI 2020 ; Conference date: 07-02-2020 Through 12-02-2020",

year = "2020",

language = "English",

series = "AAAI 2020 - 34th AAAI Conference on Artificial Intelligence",

publisher = "AAAI press",

pages = "12886--12893",

booktitle = "AAAI 2020 - 34th AAAI Conference on Artificial Intelligence",

}

Zhang, XY, Shi, H, Li, C & Li, P 2020, Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos. in AAAI 2020 - 34th AAAI Conference on Artificial Intelligence. AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, AAAI press, pp. 12886-12893, 34th AAAI Conference on Artificial Intelligence, AAAI 2020, New York, United States, 7/02/20.

Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos. / Zhang, Xiao Yu; Shi, Haichao; Li, Changsheng et al.
AAAI 2020 - 34th AAAI Conference on Artificial Intelligence. AAAI press, 2020. p. 12886-12893 (AAAI 2020 - 34th AAAI Conference on Artificial Intelligence).

TY - GEN

T1 - Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos

AU - Zhang, Xiao Yu

AU - Shi, Haichao

AU - Li, Changsheng

AU - Li, Peng

PY - 2020

Y1 - 2020

N2 - Weakly supervised action recognition and localization for untrimmed videos is a challenging problem with extensive applications. The overwhelming irrelevant background contents in untrimmed videos severely hamper effective identification of actions of interest. In this paper, we propose a novel multi-instance multi-label modeling network based on spatio-temporal pre-trimming to recognize actions and locate corresponding frames in untrimmed videos. Motivated by the fact that person is the key factor in a human action, we spatially and temporally segment each untrimmed video into person-centric clips with pose estimation and tracking techniques. Given the bag-of-instances structure associated with video-level labels, action recognition is naturally formulated as a multi-instance multi-label learning problem. The network is optimized iteratively with selective coarse-to-fine pre-trimming based on instance-label activation. After convergence, temporal localization is further achieved with local-global temporal class activation map. Extensive experiments are conducted on two benchmark datasets, i.e. THUMOS14 and ActivityNet1.3, and experimental results clearly corroborate the efficacy of our method when compared with the state-of-the-arts.

AB - Weakly supervised action recognition and localization for untrimmed videos is a challenging problem with extensive applications. The overwhelming irrelevant background contents in untrimmed videos severely hamper effective identification of actions of interest. In this paper, we propose a novel multi-instance multi-label modeling network based on spatio-temporal pre-trimming to recognize actions and locate corresponding frames in untrimmed videos. Motivated by the fact that person is the key factor in a human action, we spatially and temporally segment each untrimmed video into person-centric clips with pose estimation and tracking techniques. Given the bag-of-instances structure associated with video-level labels, action recognition is naturally formulated as a multi-instance multi-label learning problem. The network is optimized iteratively with selective coarse-to-fine pre-trimming based on instance-label activation. After convergence, temporal localization is further achieved with local-global temporal class activation map. Extensive experiments are conducted on two benchmark datasets, i.e. THUMOS14 and ActivityNet1.3, and experimental results clearly corroborate the efficacy of our method when compared with the state-of-the-arts.

UR - http://www.scopus.com/inward/record.url?scp=85092673141&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85092673141

T3 - AAAI 2020 - 34th AAAI Conference on Artificial Intelligence

SP - 12886

EP - 12893

BT - AAAI 2020 - 34th AAAI Conference on Artificial Intelligence

PB - AAAI press

T2 - 34th AAAI Conference on Artificial Intelligence, AAAI 2020

Y2 - 7 February 2020 through 12 February 2020

ER -
