Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos

Xiao Yu Zhang; Haichao Shi; Changsheng Li; Peng Li

Xiao Yu Zhang, Haichao Shi, Changsheng Li^*, Peng Li^*

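^*Corresponding author for this work

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

35 Citations (Scopus)

Abstract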

Weakly supervised action recognition and localization for untrimmed videos is a challenging problem with extensive applications. The overwhelming amount of irrelevant background content in untrimmed videos severely hampers effective identification of actions of interest. In this paper, we propose a novel multi-instance multi-label modeling network based on spatio-temporal pre-trimming to recognize actions and locate the corresponding frames in untrimmed videos. Motivated by the fact that the person is the key factor in a human action, we spatially and temporally segment each untrimmed video into person-centric clips with pose estimation and tracking techniques. Given the bag-of-instances structure associated with video-level labels, action recognition is naturally formulated as a multi-instance multi-label learning problem. The network is optimized iteratively with selective coarse-to-fine pre-trimming based on instance-label activation. After convergence, temporal localization is further achieved with a local-global temporal class activation map. Extensive experiments are conducted on two benchmark datasets, i.e., THUMOS14 and ActivityNet1.3, and the experimental results corroborate the efficacy of our method when compared with state-of-the-art methods.
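
As a rough illustration of the multi-instance multi-label formulation mentioned in the abstract, the sketch below treats a video as a bag of person-centric clips (instances), pools instance-label scores into video-level predictions, and thresholds a per-frame activation sequence into temporal segments. This is not the authors' network: the max-pooling aggregation, the binary cross-entropy loss, the fixed threshold, and all function names are assumptions chosen for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bag_label_probs(instance_logits):
    """Max-pool instance-label logits over instances, then apply a sigmoid.

    instance_logits: list of rows, one per instance, each with one logit per label.
    Returns one probability per label for the whole bag (video).
    Max pooling is an assumed aggregation, not taken from the paper.
    """
    num_labels = len(instance_logits[0])
    return [
        sigmoid(max(row[k] for row in instance_logits))
        for k in range(num_labels)
    ]


def bag_bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between bag probabilities and video-level labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def localize(frame_scores, threshold):
    """Return inclusive (start, end) frame-index pairs where score >= threshold."""
    segments, start = [], None
    for t, score in enumerate(frame_scores):
        if score >= threshold and start is None:
            start = t
        elif score < threshold and start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:
        segments.append((start, len(frame_scores) - 1))
    return segments


if __name__ == "__main__":
    # Three hypothetical instances, two hypothetical action labels.
    logits = [[2.0, -3.0], [-1.0, -2.0], [0.5, 4.0]]
    probs = bag_label_probs(logits)
    print(probs, bag_bce_loss(probs, [1, 1]))
    print(localize([0.1, 0.8, 0.9, 0.2, 0.7], threshold=0.5))
```

Only video-level labels enter the loss here, which is what makes the setting weakly supervised; the per-instance scores are learned indirectly through the pooling step.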

Original language	English
Title of host publication	AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
Publisher	AAAI press
Pages	12886-12893
Number of pages	8
ISBN (Electronic)	9781577358350
Publication status	Published - 2020
Event	34th AAAI Conference on Artificial Intelligence, AAAI 2020 - New York, United States. Duration: 7 Feb 2020 → 12 Feb 2020

Publication series

Name	AAAI 2020 - 34th AAAI Conference on Artificial Intelligence

Conference

Conference	34th AAAI Conference on Artificial Intelligence, AAAI 2020
Country/Territory	United States
City	New York
Period	7/02/20 → 12/02/20

Cite this

@inproceedings{9dbb9b26781441ff89fc5dfc5cf02bd9,

title = "Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos",

author = "Zhang, {Xiao Yu} and Haichao Shi and Changsheng Li and Peng Li",

note = "Publisher Copyright: Copyright 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.; 34th AAAI Conference on Artificial Intelligence, AAAI 2020 ; Conference date: 07-02-2020 Through 12-02-2020",

year = "2020",

language = "English",

series = "AAAI 2020 - 34th AAAI Conference on Artificial Intelligence",

publisher = "AAAI press",

pages = "12886--12893",

booktitle = "AAAI 2020 - 34th AAAI Conference on Artificial Intelligence",

}

Zhang, XY, Shi, H, Li, C & Li, P 2020, Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos. in AAAI 2020 - 34th AAAI Conference on Artificial Intelligence. AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, AAAI press, pp. 12886-12893, 34th AAAI Conference on Artificial Intelligence, AAAI 2020, New York, United States, 7/02/20.

Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos. / Zhang, Xiao Yu; Shi, Haichao; Li, Changsheng et al.
AAAI 2020 - 34th AAAI Conference on Artificial Intelligence. AAAI press, 2020. p. 12886-12893 (AAAI 2020 - 34th AAAI Conference on Artificial Intelligence).

TY - GEN

T1 - Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos

AU - Zhang, Xiao Yu

AU - Shi, Haichao

AU - Li, Changsheng

AU - Li, Peng

PY - 2020

Y1 - 2020

UR - http://www.scopus.com/inward/record.url?scp=85092673141&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85092673141

T3 - AAAI 2020 - 34th AAAI Conference on Artificial Intelligence

SP - 12886

EP - 12893

BT - AAAI 2020 - 34th AAAI Conference on Artificial Intelligence

PB - AAAI press

T2 - 34th AAAI Conference on Artificial Intelligence, AAAI 2020

Y2 - 7 February 2020 through 12 February 2020

ER -
