Unsupervised deep learning of mid-level video representation for action recognition

Jingyi Hou; Xinxiao Wu; Jin Chen; Jiebo Luo; Yunde Jia

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Citations (Scopus)

Abstract

Current deep learning methods for action recognition rely heavily on large-scale labeled video datasets. Manually annotating video datasets is laborious and may introduce unexpected bias when training complex deep models to learn video representations. In this paper, we propose an unsupervised deep learning method that uses unlabeled local spatial-temporal volumes extracted from action videos to learn a mid-level video representation for action recognition. Specifically, our method simultaneously discovers mid-level semantic concepts through discriminative clustering and optimizes local spatial-temporal features with two relatively small and simple deep neural networks. The clustering generates semantic visual concepts that guide the training of the deep networks, and the networks in turn guarantee the robustness of the semantic concepts. Experiments on the HMDB51 and UCF101 datasets demonstrate the superiority of the proposed method, even over several supervised learning methods.
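The alternation described in the abstract (clustering produces pseudo-labels that guide network training, and the trained network produces the features that are clustered next) can be illustrated with a minimal sketch. This is not the authors' implementation: plain k-means stands in for the discriminative clustering, a single one-hidden-layer NumPy network stands in for the two deep networks, and random vectors stand in for local spatial-temporal volumes. All function names and hyperparameters below are hypothetical.

```python
import numpy as np


def kmeans(feats, k, iters=10, seed=0):
    """Plain k-means; an assumed stand-in for discriminative clustering."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every feature to every center: shape (n, k).
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = feats[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return labels


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_step(W, V, x, labels, lr=0.1):
    """One gradient step of cross-entropy on the cluster pseudo-labels."""
    h = np.tanh(x @ W)                    # hidden features, shape (n, d_hidden)
    p = softmax(h @ V)                    # cluster probabilities, shape (n, k)
    y = np.eye(V.shape[1])[labels]        # one-hot pseudo-labels
    g_logits = (p - y) / len(x)
    g_V = h.T @ g_logits
    g_W = x.T @ ((g_logits @ V.T) * (1.0 - h ** 2))
    return W - lr * g_W, V - lr * g_V


def alternate(x, k=3, d_hidden=8, rounds=5, steps=50, seed=0):
    """Alternate between clustering features and training on the clusters."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(x.shape[1], d_hidden))
    labels = None
    for _ in range(rounds):
        feats = np.tanh(x @ W)            # current learned features
        labels = kmeans(feats, k, seed=seed)
        # Cluster ids are arbitrary in each round, so the head is re-initialized.
        V = rng.normal(scale=0.1, size=(d_hidden, k))
        for _ in range(steps):
            W, V = train_step(W, V, x, labels)
    return W, labels


# Synthetic inputs standing in for features of local spatial-temporal volumes.
x = np.random.default_rng(0).normal(size=(90, 5))
W, labels = alternate(x)
```

The classifier head is re-initialized each round because cluster indices carry no meaning across rounds; only the feature weights `W` persist, which is what lets the clustering and the features refine each other.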

Original language	English
Title of host publication	32nd AAAI Conference on Artificial Intelligence, AAAI 2018
Publisher	AAAI press
Pages	6910-6917
Number of pages	8
ISBN (Electronic)	9781577358008
Publication status	Published - 2018
Event	32nd AAAI Conference on Artificial Intelligence, AAAI 2018 - New Orleans, United States
Duration	2 Feb 2018 → 7 Feb 2018

Publication series

Name	32nd AAAI Conference on Artificial Intelligence, AAAI 2018

Conference

Conference	32nd AAAI Conference on Artificial Intelligence, AAAI 2018
Country/Territory	United States
City	New Orleans
Period	2/02/18 → 7/02/18

Cite this

@inproceedings{a9291e7b5d3b4d348b412a7a10665380,

title = "Unsupervised deep learning of mid-level video representation for action recognition",

abstract = "Current deep learning methods for action recognition rely heavily on large scale labeled video datasets. Manually annotating video datasets is laborious and may introduce unexpected bias to train complex deep models for learning video representation. In this paper, we propose an unsupervised deep learning method which employs unlabeled local spatial-temporal volumes extracted from action videos to learn mid-level video representation for action recognition. Specifically, our method simultaneously discovers mid-level semantic concepts by discriminative clustering and optimizes local spatial-temporal features by two relatively small and simple deep neural networks. The clustering generates semantic visual concepts that guide the training of the deep networks, and the networks in turn guarantee the robustness of the semantic concepts. Experiments on the HMDB51 and the UCF101 datasets demonstrate the superiority of the proposed method, even over several supervised learning methods.",

author = "Jingyi Hou and Xinxiao Wu and Jin Chen and Jiebo Luo and Yunde Jia",

note = "Publisher Copyright: Copyright {\textcopyright} 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.; 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 ; Conference date: 02-02-2018 Through 07-02-2018",

year = "2018",

language = "English",

series = "32nd AAAI Conference on Artificial Intelligence, AAAI 2018",

publisher = "AAAI press",

pages = "6910--6917",

booktitle = "32nd AAAI Conference on Artificial Intelligence, AAAI 2018",

}

Hou, J, Wu, X, Chen, J, Luo, J & Jia, Y 2018, Unsupervised deep learning of mid-level video representation for action recognition. in 32nd AAAI Conference on Artificial Intelligence, AAAI 2018. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, AAAI press, pp. 6910-6917, 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, New Orleans, United States, 2/02/18.

Unsupervised deep learning of mid-level video representation for action recognition. / Hou, Jingyi; Wu, Xinxiao; Chen, Jin et al.
32nd AAAI Conference on Artificial Intelligence, AAAI 2018. AAAI press, 2018. p. 6910-6917 (32nd AAAI Conference on Artificial Intelligence, AAAI 2018).

TY - GEN

T1 - Unsupervised deep learning of mid-level video representation for action recognition

AU - Hou, Jingyi

AU - Wu, Xinxiao

AU - Chen, Jin

AU - Luo, Jiebo

AU - Jia, Yunde

PY - 2018

Y1 - 2018

AB - Current deep learning methods for action recognition rely heavily on large scale labeled video datasets. Manually annotating video datasets is laborious and may introduce unexpected bias to train complex deep models for learning video representation. In this paper, we propose an unsupervised deep learning method which employs unlabeled local spatial-temporal volumes extracted from action videos to learn mid-level video representation for action recognition. Specifically, our method simultaneously discovers mid-level semantic concepts by discriminative clustering and optimizes local spatial-temporal features by two relatively small and simple deep neural networks. The clustering generates semantic visual concepts that guide the training of the deep networks, and the networks in turn guarantee the robustness of the semantic concepts. Experiments on the HMDB51 and the UCF101 datasets demonstrate the superiority of the proposed method, even over several supervised learning methods.

UR - http://www.scopus.com/inward/record.url?scp=85060493991&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85060493991

T3 - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018

SP - 6910

EP - 6917

BT - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018

PB - AAAI press

T2 - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018

Y2 - 2 February 2018 through 7 February 2018

ER -
