Unsupervised deep learning of mid-level video representation for action recognition

Jingyi Hou, Xinxiao Wu, Jin Chen, Jiebo Luo, Yunde Jia

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

8 Citations (Scopus)

Abstract

Current deep learning methods for action recognition rely heavily on large-scale labeled video datasets. Manually annotating video datasets is laborious and may introduce unexpected bias when training complex deep models to learn video representations. In this paper, we propose an unsupervised deep learning method that uses unlabeled local spatial-temporal volumes extracted from action videos to learn a mid-level video representation for action recognition. Specifically, our method simultaneously discovers mid-level semantic concepts through discriminative clustering and optimizes local spatial-temporal features with two relatively small and simple deep neural networks. The clustering generates semantic visual concepts that guide the training of the deep networks, and the networks in turn guarantee the robustness of the semantic concepts. Experiments on the HMDB51 and UCF101 datasets demonstrate the superiority of the proposed method, even over several supervised learning methods.
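
The alternation described in the abstract (cluster the current features into concepts, then train the network on those cluster assignments) can be illustrated with a toy sketch. This is not the authors' implementation: plain k-means stands in for the discriminative clustering, a single tanh layer with a softmax head stands in for the two deep networks, and the names `kmeans_assign` and `alternate` and all hyperparameters are hypothetical.

```python
import numpy as np

def kmeans_assign(feats, k, rng, n_iter=10):
    """Plain k-means; a stand-in (assumption) for discriminative clustering."""
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every sample to every center: shape (n, k).
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old center
                centers[j] = feats[labels == j].mean(axis=0)
    return labels

def alternate(X, k=3, d_feat=8, rounds=3, steps=50, lr=0.5, seed=0):
    """Alternate between clustering features and fitting a tiny network."""
    rng = np.random.default_rng(seed)
    # W is the toy "feature network": one linear layer followed by tanh.
    W = rng.normal(scale=0.1, size=(X.shape[1], d_feat))
    for _ in range(rounds):
        # Step 1: cluster the current features to obtain pseudo-labels.
        labels = kmeans_assign(np.tanh(X @ W), k, rng)
        onehot = np.eye(k)[labels]
        # The softmax head is re-initialised each round, because cluster
        # indices are not guaranteed to be consistent across rounds.
        V = rng.normal(scale=0.1, size=(d_feat, k))
        # Step 2: gradient descent on cross-entropy against the pseudo-labels.
        for _ in range(steps):
            feats = np.tanh(X @ W)
            logits = feats @ V
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            p = np.exp(logits)
            p /= p.sum(axis=1, keepdims=True)
            g = (p - onehot) / len(X)                    # dLoss/dlogits
            grad_V = feats.T @ g
            grad_W = X.T @ ((g @ V.T) * (1.0 - feats ** 2))
            V -= lr * grad_V
            W -= lr * grad_W
    return W, labels
```

Calling `alternate(X, k=3)` on an `(n, d)` array returns the learned projection `W` and the final cluster assignment for each row. In the paper's setting the rows would correspond to local spatial-temporal volumes rather than generic vectors.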

Original language: English
Title of host publication: 32nd AAAI Conference on Artificial Intelligence, AAAI 2018
Publisher: AAAI press
Pages: 6910-6917
Number of pages: 8
ISBN (Electronic): 9781577358008
Publication status: Published - 2018
Event: 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 - New Orleans, United States
Duration: 2 Feb 2018 to 7 Feb 2018
